Daily arXiv Papers - 2025-12-17

AI-enhanced summaries of 23 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Jonas Golde, Patrick Haller, Alan Akbik

Main category: cs.CL

TL;DR: FiNERweb is a multilingual NER dataset creation pipeline that scales teacher-student paradigm to 91 languages, producing 225k passages with 235k entity labels, achieving comparable performance with 19x less data than baselines.

Motivation: Current multilingual NER datasets from LLMs are mostly by-products rather than systematic, reusable resources. There's a need for scalable, high-quality synthetic supervision for multilingual NER that can be systematically created and reused.

Method: Pipeline approach: 1) Train regression models to identify NER-relevant passages from FineWeb-Edu, 2) Annotate selected passages with multilingual LLMs, 3) Release dataset with both English labels and translated target language labels for 91 languages across 25 scripts.

Result: Regression model achieves >84 F1; models trained on FiNERweb show comparable/improved zero-shot transfer performance on English, Thai, and Swahili with 19x less data than baselines; high annotation quality (faithfulness: 3.99/5, completeness: 4.05/5).

Conclusion: FiNERweb provides a scalable, high-quality multilingual NER dataset creation pipeline that enables effective teacher-student training with significantly less data, addressing the need for systematic synthetic supervision resources in multilingual NER research.

Abstract: Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.

[2] Olmo 3

Team Olmo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alexander Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lester James V. Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: Olmo 3 is a family of fully-open 7B and 32B parameter language models focused on long-context reasoning, function calling, coding, instruction following, chat, and knowledge recall, with complete transparency of the entire model development lifecycle.

Motivation: To create state-of-the-art, fully-open language models that provide complete transparency in the model development process, addressing the need for open models with strong capabilities in reasoning, coding, and long-context understanding.

Method: Developed a family of language models at 7B and 32B parameter scales with specific targeting of long-context reasoning, function calling, coding, instruction following, chat, and knowledge recall. The approach includes releasing the entire model flow - every stage, checkpoint, data point, and dependency used in construction.

Result: Created Olmo 3 Think 32B as the strongest fully-open thinking model released to date, with both 7B and 32B parameter versions available. The models demonstrate capabilities across multiple domains including reasoning, coding, and long-context understanding.

Conclusion: Olmo 3 represents a significant advancement in open language models, providing state-of-the-art performance with complete transparency in the development process, making it a valuable resource for research and applications requiring open, capable language models.

Abstract: We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.

[3] Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models

Zhimin Qiu, Di Wu, Feng Liu, Chenrui Hu, Yuxiao Wang

Main category: cs.CL

TL;DR: Proposes structure-aware decoding for LLMs to improve nested/overlapping entity extraction by maintaining semantic integrity and structural consistency through candidate span generation and hierarchical constraints.

Motivation: Traditional approaches struggle to maintain both semantic integrity and structural consistency in nested and overlapping entity extraction tasks, especially with complex hierarchical relationships and cross-dependencies.

Method: Uses pretrained language model for context-aware representations, candidate span generation mechanism, structured attention modeling, hierarchical structural constraints during decoding, and joint optimization of classification loss and structural consistency loss.

Result: Significant improvements in Accuracy, Precision, Recall, and F1-Score on ACE 2005 dataset, particularly in nested and overlapping entity recognition with stronger boundary localization and structural modeling capability.

Conclusion: Structure-aware decoding is effective for complex semantic extraction, provides new perspective for hierarchical understanding in language models, and establishes foundation for high-precision information extraction.

Abstract: This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.

[4] What Affects the Effective Depth of Large Language Models?

Yi Hu, Cai Zhou, Muhan Zhang

Main category: cs.CL

TL;DR: LLMs underuse their available depth across scales, training types, and task difficulties - effective depth ratio remains stable despite model growth, and reasoning improvements come from longer context rather than deeper computation.

Motivation: To understand why deeper LLMs show diminishing performance gains despite added layers, and to systematically study how effective depth varies with model scale, training type, and task difficulty.

Method: Analyzed Qwen-2.5 family models (1.5B-32B), compared base vs long-CoT models, and evaluated across tasks of varying difficulty to measure effective depth patterns.

Result: Effective layers grow with model size but ratio remains stable; long-CoT models show no increased effective depth (reasoning improvements from longer context); models don’t use more layers for harder problems.

Conclusion: Current LLMs underuse available depth across all studied dimensions, pointing to research opportunities for increasing layer utilization, model pruning, and early exiting strategies.

Abstract: The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of “effective depth”, arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Besides, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing out research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.

[5] Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

Main category: cs.CL

TL;DR: AR-to-dLM conversion framework transforms pretrained autoregressive models into efficient diffusion language models with improved speed while maintaining accuracy, using block-wise attention and position-dependent masking.

Motivation: Diffusion language models enable parallel non-autoregressive generation but are less efficient to train from scratch compared to autoregressive models. The goal is to convert pretrained AR models into efficient dLMs that preserve task accuracy while gaining speed advantages.

Method: 1) Block-wise attention pattern that maintains causal relationships across blocks while enabling bidirectional modeling within blocks, preserving pretrained AR weight distributions. 2) Position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. 3) Systematic study of attention patterns, training dynamics, and design choices for scalable AR-to-dLM conversion.
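
A minimal sketch of the first two ingredients, with an illustrative block size and a simple linear masking schedule (the exact sizes and schedule used in the paper are not specified here): a mask that is causal across blocks but bidirectional within each block, and per-position masking probabilities that grow toward the end of the sequence.

```python
import torch

def block_wise_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask: bidirectional within a block, causal across blocks."""
    blocks = torch.arange(seq_len) // block_size
    # Position i may attend to position j iff j's block does not come after i's block.
    return blocks.unsqueeze(1) >= blocks.unsqueeze(0)

def position_dependent_mask_probs(seq_len: int, p_min: float = 0.1, p_max: float = 0.9) -> torch.Tensor:
    """Assumed linear schedule: later tokens get a higher masking probability during training."""
    return torch.linspace(p_min, p_max, seq_len)

print(block_wise_mask(8, 4).int())
print(position_dependent_mask_probs(8))
```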

Result: Efficient-DLM family outperforms state-of-the-art AR models and dLMs. Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B respectively.

Conclusion: The proposed AR-to-dLM conversion framework successfully transforms pretrained AR models into efficient dLMs that achieve win-win improvements in both accuracy and efficiency, providing actionable insights for scalable conversion methodologies.

Abstract: Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models’ task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models’ weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs’ attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.

[6] A Unified Sparse Attention via Multi-Granularity Compression

Siran Liu, Zane Cao, Yongchao He

Main category: cs.CL

TL;DR: UniSparse introduces composite tokens and multi-granularity compression for efficient sparse attention that achieves near-full attention accuracy with significant speedups.

Motivation: Self-attention's quadratic scaling creates computational bottlenecks for long-context LLM applications. Existing sparse attention methods have trade-offs: training-based methods are costly and not pluggable, while inference-time methods sacrifice efficiency or cross-modal generality.

Method: UniSparse introduces composite tokens that aggregate multi-granularity contextual information, then dynamically constructs sparse attention through multi-granularity compression and block-level selection for hardware-friendly GPU execution.
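
As a rough illustration of block-level selection (a generic sketch, not the exact UniSparse composite-token construction), keys can be compressed to one vector per block, scored against the query, and only the top-scoring blocks kept for full-granularity attention:

```python
import torch

def block_sparse_select(q, k, block_size: int = 64, keep_blocks: int = 8):
    """Generic block-level selection sketch: compress keys per block, score the
    blocks against the query, and keep only the top-scoring blocks."""
    # q: (d,), k: (seq_len, d)
    seq_len, d = k.shape
    num_blocks = (seq_len + block_size - 1) // block_size
    pad = num_blocks * block_size - seq_len
    k_pad = torch.nn.functional.pad(k, (0, 0, 0, pad))
    k_blocks = k_pad.view(num_blocks, block_size, d).mean(dim=1)  # one compressed vector per block
    scores = k_blocks @ q                                         # coarse relevance per block
    top = torch.topk(scores, k=min(keep_blocks, num_blocks)).indices
    return top  # block indices to attend to at full granularity
```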

Result: UniSparse consistently outperforms state-of-the-art sparse attention methods (MInference, XAttention, FlexPrefill) across multiple modalities and tasks, achieving ≥99% of full-attention accuracy and up to 2.61× faster attention computation than FlashAttention.

Conclusion: UniSparse provides an effective unified mechanism for efficient long-context understanding that balances accuracy and efficiency while maintaining cross-modal generality.

Abstract: Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens–compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving ≥ 99% of full-attention accuracy and up to 2.61× faster attention computation than FlashAttention.

[7] Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study

Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Taiga Mori, Divesh Lala, Keiko Ochi, Tatsuya Kawahara

Main category: cs.CL

TL;DR: Multilingual Transformer model predicts backchannels in Japanese, English, Chinese, showing language-specific timing patterns and cue usage differences.

Motivation: To create a unified model for cross-linguistic backchannel prediction and investigate how backchannel timing differs across languages, informing more natural, culturally-aware dialogue systems.

Method: Transformer-based frame-level model jointly trained with auxiliary tasks on ~300 hours of dyadic conversations across three languages (Japanese, English, Chinese).

Result: Multilingual model matches/surpasses monolingual baselines; learns universal cues and language-specific patterns; Japanese uses short-term linguistic info, English/Chinese sensitive to silence/prosody; Japanese robust to short contexts, Chinese benefits from longer contexts; real-time CPU inference demonstrated.

Conclusion: Provides empirical evidence for cross-linguistic backchannel timing differences and a practical model for culturally-aware dialogue systems, showing multilingual training enables shared yet adaptable representations.

Abstract: We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into a real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally-aware spoken dialogue systems.

[8] Linguists should learn to love speech-based deep learning models

Marianne de Heer Kloots, Paul Boersma, Willem Zuidema

Main category: cs.CL

TL;DR: The paper critiques Futrell & Mahowald’s framework for bridging deep learning and linguistics, arguing that their focus on text-based LLMs is too limited and that audio-based models are essential for studying human language.

Motivation: To address the limitations of text-based LLMs in linguistic research, highlighting that many aspects of human language (prosody, intonation, speech patterns) are lost in written text, necessitating audio-based approaches.

Method: Critical analysis of existing framework, proposing the integration of audio-based deep learning models as a crucial extension to better capture the full spectrum of human language phenomena.

Result: Identifies fundamental limitations in text-only approaches to linguistic theory development, establishes the importance of audio data for comprehensive language understanding.

Conclusion: Audio-based deep learning models should play a central role in bridging deep learning systems with linguistic theories, as they capture essential aspects of human language that text-based models miss.

Abstract: Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article’s focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.

[9] CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models

Yiran Zhang, Jincheng Hu, Mark Dras, Usman Naseem

Main category: cs.CL

TL;DR: CogMem introduces a layered memory architecture for LLMs to improve multi-turn reasoning by preventing context bloat and addressing common failure modes like reasoning bias and memory decay.

Motivation: LLMs perform well on single-turn tasks but struggle with extended multi-turn interactions, suffering from reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches that append full conversation histories lead to unbounded context growth, higher computational costs, and degraded reasoning efficiency.

Method: CogMem is a cognitively inspired, memory-augmented LLM architecture with three layers: 1) Long-Term Memory (LTM) that consolidates cross-session reasoning strategies, 2) Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories, and 3) Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn.
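
A structural sketch of the three layers, assuming a naive keyword-overlap retriever (the paper's actual consolidation and retrieval policies are not described at this level of detail):

```python
from dataclasses import dataclass, field

@dataclass
class CogMemSketch:
    """Structural sketch of the three CogMem layers; retrieval is a toy word-overlap heuristic."""
    ltm: list = field(default_factory=list)   # Long-Term Memory: cross-session strategies
    da: list = field(default_factory=list)    # Direct Access: session-level notes

    def consolidate(self, strategy: str):
        self.ltm.append(strategy)             # promote a reusable strategy to LTM

    def note(self, text: str):
        self.da.append(text)                  # keep a session-level note

    def focus_of_attention(self, query: str, budget: int = 3) -> list:
        # FoA: rebuild a small, task-relevant context each turn instead of
        # appending the full conversation history.
        def overlap(m):
            return len(set(m.lower().split()) & set(query.lower().split()))
        return sorted(self.ltm + self.da, key=overlap, reverse=True)[:budget]
```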

Result: Experiments on TurnBench show that CogMem’s layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains.

Conclusion: CogMem moves toward more reliable, human-like reasoning in LLMs by addressing the limitations of current approaches through structured, persistent memory that supports sustained iterative reasoning.

Abstract: Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.

[10] Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization

Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba

Main category: cs.CL

TL;DR: Spoken DialogSum is the first dataset aligning raw conversational audio with factual/emotion-rich summaries and paralinguistic labels, created via LLM rewriting and expressive TTS synthesis.

Motivation: Research on emotion-aware or spoken dialogue summarization is limited by the lack of data linking speech, summaries, and paralinguistic cues. Existing datasets don't align conversational audio with both factual and emotion-focused summaries along with speaker characteristics.

Method: Two-stage approach: 1) LLM rewrites DialogSum scripts with Switchboard-style fillers/back-channels and tags utterances with emotion, pitch, and speaking rate; 2) Expressive TTS engine synthesizes speech from tagged scripts, aligned with paralinguistic labels.

Result: Spoken DialogSum contains 13,460 emotion-diverse dialogues, each with factual and emotion-focused summaries. Audio-LLM baseline shows 28% relative improvement in emotional-summary ROUGE-L over cascaded ASR-LLM system, demonstrating end-to-end speech modeling value.

Conclusion: The dataset enables emotion-aware spoken dialogue summarization research. End-to-end audio modeling significantly outperforms cascaded approaches, highlighting the importance of directly processing speech signals for emotion-rich summarization tasks.

Abstract: Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.

[11] Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents

Hongqiu Ni, Jiabao Zhang, Guopeng Li, Zilong Wang, Ruiqi Wu, Chi Zhang, Haisheng Tan

Main category: cs.CL

TL;DR: Astraea is a service engine that optimizes LLM agent workflows by shifting from local segment optimization to global request lifecycle optimization, reducing average job completion time by up to 25.5%.

Motivation: LLMs deployed as intelligent agents have multi-stage workflows alternating between computation and external API calls, creating a mismatch with existing inference systems that focus on per-segment optimization rather than minimizing end-to-end latency of complete agentic workflows.

Method: Astraea uses a state-aware hierarchical scheduling algorithm integrating historical state with future predictions, dynamically classifies requests by I/O vs compute intensity, employs enhanced HRRN policy for efficiency-fairness balance, and implements adaptive KV cache manager for intelligent agent state handling during I/O waits based on memory pressure.
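
For reference, the textbook Highest Response Ratio Next (HRRN) score that Astraea's enhanced policy builds on looks like this; the enhancements, request classification, and KV cache management are not reproduced here.

```python
def hrrn_priority(wait_time: float, expected_service_time: float) -> float:
    """Classic HRRN response ratio: long-waiting requests gain priority,
    but short jobs are still favoured, balancing efficiency and fairness."""
    return (wait_time + expected_service_time) / expected_service_time

# Hypothetical pending requests with waiting and predicted service times (seconds).
requests = [
    {"id": "a", "wait": 2.0, "service": 1.0},
    {"id": "b", "wait": 0.5, "service": 4.0},
]
next_req = max(requests, key=lambda r: hrrn_priority(r["wait"], r["service"]))
print(next_req["id"])  # -> "a"
```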

Result: Astraea reduces average Job Completion Time by up to 25.5% compared to baseline methods and demonstrates strong robustness and stability under high load across various model scales.

Conclusion: Shifting optimization from local segments to global request lifecycle with state-aware hierarchical scheduling and adaptive KV cache management significantly improves LLM agent workflow performance and system stability.

Abstract: Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch in their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request’s historical state with future predictions. It dynamically classifies requests by their I/O and compute intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.

[12] A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs

K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque

Main category: cs.CL

TL;DR: Novel RAG pipelines for standard-to-dialectal Bengali translation using transcript-based and sentence-pair approaches, with sentence-pair method significantly outperforming and enabling smaller models to beat larger ones.

Motivation: Standard-to-dialect translation is challenging due to scarce data and linguistic variation, especially in Bengali. There's a need for effective solutions to preserve linguistic diversity without extensive fine-tuning.

Method: Two RAG pipelines: 1) Transcript-Based Pipeline using large dialect sentence contexts from audio transcripts, and 2) Standardized Sentence-Pairs Pipeline using structured local_dialect:standard_bengali sentence pairs. Evaluated across six Bengali dialects with multiple LLMs using BLEU, ChrF, WER, and BERTScore metrics.
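
A skeleton of the sentence-pair pipeline under assumptions: a toy word-overlap retriever and an illustrative prompt template stand in for the paper's actual retriever and prompt wording.

```python
def retrieve_pairs(query: str, pairs: list[tuple[str, str]], k: int = 3):
    """Toy retrieval over (standard_bengali, local_dialect) pairs using word overlap."""
    def overlap(pair):
        return len(set(pair[0].split()) & set(query.split()))
    return sorted(pairs, key=overlap, reverse=True)[:k]

def build_prompt(query: str, retrieved: list[tuple[str, str]]) -> str:
    """Few-shot prompt built from the retrieved sentence pairs (format is illustrative)."""
    examples = "\n".join(f"standard: {s}\ndialect: {d}" for s, d in retrieved)
    return (
        "Translate the standard Bengali sentence into the target dialect, "
        "following the examples.\n\n"
        f"{examples}\n\nstandard: {query}\ndialect:"
    )
```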

Result: Sentence-pair pipeline consistently outperforms transcript-based approach, reducing WER from 76% to 55% for Chittagong dialect. Smaller models (Llama-3.1-8B) can outperform much larger models (GPT-OSS-120B) with proper retrieval strategy.

Conclusion: Well-designed retrieval strategy is more crucial than model size for dialect translation. Provides fine-tuning-free solution for low-resource settings and practical blueprint for preserving linguistic diversity.

Abstract: Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.

[13] Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets

Estelle Zheng, Nathan Cerisara, Sébastien Warichet, Emmanuel Helbert, Christophe Cerisara

Main category: cs.CL

TL;DR: Ladder Side Tuning (LST) is a parameter-efficient fine-tuning method that adds a lightweight side network, reducing peak memory by 50% compared to QLoRA while maintaining competitive performance, enabling 7B-parameter model fine-tuning on 12GB GPUs.

Motivation: Fine-tuning large language models is constrained by GPU memory limitations. Existing PEFT methods like QLoRA still incur high memory usage from backward passes in full models, limiting deployment on consumer hardware.

Method: Revisits Ladder Side Tuning (LST) which adds a lightweight side network instead of modifying the main model. Introduces xLadder variant with depth extension via cross-connections to shorten chain-of-thought reasoning at fixed parameter count.
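
A minimal PyTorch-style sketch of the ladder idea, assuming per-layer down-projections and a simple learned gate (the dimensions and gating form are illustrative, not the paper's exact design): the backbone stays frozen and detached, so the backward pass only touches the small side network.

```python
import torch
import torch.nn as nn

class LadderSideNet(nn.Module):
    """Lightweight side network for Ladder Side Tuning; only these parameters are trained."""
    def __init__(self, hidden_dim: int, side_dim: int, num_layers: int):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(hidden_dim, side_dim) for _ in range(num_layers))
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(side_dim, side_dim), nn.GELU()) for _ in range(num_layers)
        )
        self.gates = nn.Parameter(torch.zeros(num_layers))  # learned per-layer mixing
        self.up = nn.Linear(side_dim, hidden_dim)

    def forward(self, backbone_hiddens):
        # backbone_hiddens: per-layer hidden states from the frozen model, detached
        # so gradients never flow through the large backbone.
        h = torch.zeros_like(self.down[0](backbone_hiddens[0].detach()))
        for i, hid in enumerate(backbone_hiddens):
            g = torch.sigmoid(self.gates[i])
            h = self.blocks[i](g * h + (1 - g) * self.down[i](hid.detach()))
        return self.up(h)
```

The memory saving comes from the fact that the large model can run under no-grad: no activations need to be stored for its backward pass, only for the tiny side network.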

Result: LST cuts peak memory by 50% compared to QLoRA while matching compute scaling slope and achieving competitive performance across NLP, math, and LLM-critic tasks. Enables 7B-parameter model fine-tuning on single 12GB GPU with 2k-token contexts without gradient checkpointing.

Conclusion: LST provides superior memory efficiency for LLM fine-tuning on constrained hardware while maintaining performance. xLadder extends this by enabling deeper reasoning without additional memory overhead, making both approaches valuable for memory-constrained scenarios.

Abstract: Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA’s compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA’s accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing, conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder’s architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.

[14] Two CFG Nahuatl for automatic corpora expansion

Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Graham Ranger, Martha-Lorena Avendaño-Garrido

Main category: cs.CL

TL;DR: The paper introduces two Context-Free Grammars for expanding Nawatl language corpora to address the lack of digital resources for this indigenous Mexican language, enabling generation of synthetic sentences for better embeddings.

Motivation: Nawatl is an Amerindian language with very few digital resources, making it challenging to train Large Language Models due to virtually non-existent corpora. This creates a significant barrier for computational linguistics work on this language.

Method: The authors introduce two new Context-Free Grammars specifically for Nawatl and use them in generative mode to produce syntactically valid artificial sentences, thereby expanding the available corpora.

Result: The expanded corpus shows improvement in embedding quality compared to using only the original corpus. The results also demonstrate that economic embeddings often perform better than some LLMs on semantic similarity tasks.

Conclusion: CFG-based corpus expansion is an effective approach for low-resource languages like Nawatl, enabling better embeddings and showing that simpler, more economical embedding methods can outperform some LLMs for specific tasks.

Abstract: The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl Corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the π-language type, i.e. a language with few digital resources. For this reason the corpora available for the learning of Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentence semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.

[15] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition

Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun

Main category: cs.CL

TL;DR: EDU-based Context Compressor: A novel explicit compression framework that preserves document structure by converting text to Elementary Discourse Unit trees and selecting query-relevant sub-trees, achieving state-of-the-art performance while reducing costs.

Motivation: Existing context compression methods for LLMs have limitations: they either disrupt local coherence through discrete token removal, or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. Managing extensive context remains a critical bottleneck for LLMs in applications like long-document QA and autonomous agents.

Method: Two-stage structure-then-select approach: 1) LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) anchored to source indices to eliminate hallucination; 2) A lightweight ranking module selects query-relevant sub-trees for linearization. Also introduces StructBench, a manually annotated dataset of 248 diverse documents for evaluation.

Result: Achieves state-of-the-art structural prediction accuracy, significantly outperforms frontier LLMs while reducing costs. Structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.

Conclusion: The EDU-based Context Compressor provides an effective explicit compression framework that preserves both global structure and fine-grained details, addressing limitations of existing methods and improving performance on various long-context applications.

Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.

[16] Inflation Attitudes of Large Language Models

Nikoleta Anesti, Edward Hill, Andreas Joseph

Main category: cs.CL

TL;DR: LLMs (GPT-3.5) can form inflation perceptions similar to human survey responses, tracking aggregate data and replicating demographic patterns, but lack consistent inflation models.

Motivation: To investigate whether LLMs can form realistic inflation perceptions and expectations comparable to human survey data, and to develop methods for evaluating LLM behavior in social science applications.

Method: Quasi-experimental design using GPT-3.5-turbo with September 2021 training cut-off, comparing outputs to Bank of England’s Inflation Attitudes Survey data and official statistics. Uses Shapley value decomposition to analyze prompt influence on model outputs.
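
A brute-force Shapley decomposition over a handful of prompt components can be written directly from the definition; here `value_fn` is an assumed stand-in that returns the model's inflation response for a given subset of prompt content, and the component names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(components, value_fn):
    """Exact Shapley values by subset enumeration (feasible only for a few components)."""
    n = len(components)
    phi = {c: 0.0 for c in components}
    for c in components:
        others = [x for x in components if x != c]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[c] += weight * (value_fn(set(subset) | {c}) - value_fn(set(subset)))
    return phi

# Toy usage with a made-up additive value function; real runs would query the LLM.
vals = {"food_inflation": 1.5, "energy_prices": 0.8, "demographics": 0.2}
print(shapley_values(list(vals), lambda s: sum(vals[x] for x in s)))
```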

Result: GPT tracks aggregate survey projections and official statistics at short horizons, replicates demographic patterns (income, housing tenure, social class), shows heightened sensitivity to food inflation like humans, but lacks consistent inflation models.

Conclusion: LLMs can generate inflation perceptions resembling human survey responses, offering potential for social science research and survey design evaluation, but their inconsistent internal models limit reliability for economic forecasting.

Abstract: This paper investigates the ability of Large Language Models (LLMs), specifically GPT-3.5-turbo (GPT), to form inflation perceptions and expectations based on macroeconomic price signals. We compare the LLM’s output to household survey data and official statistics, mimicking the information set and demographic characteristics of the Bank of England’s Inflation Attitudes Survey (IAS). Our quasi-experimental design exploits the timing of GPT’s training cut-off in September 2021 which means it has no knowledge of the subsequent UK inflation surge. We find that GPT tracks aggregate survey projections and official statistics at short horizons. At a disaggregated level, GPT replicates key empirical regularities of households’ inflation perceptions, particularly for income, housing tenure, and social class. A novel Shapley value decomposition of LLM outputs suited for the synthetic survey setting provides well-defined insights into the drivers of model outputs linked to prompt content. We find that GPT demonstrates a heightened sensitivity to food inflation information similar to that of human respondents. However, we also find that it lacks a consistent model of consumer price inflation. More generally, our approach could be used to evaluate the behaviour of LLMs for use in the social sciences, to compare different models, or to assist in survey design.

[17] Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring

Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher

Main category: cs.CL

TL;DR: Step-Tagging framework uses real-time sentence classification to tag reasoning steps, enabling interpretable early stopping that reduces token generation by 20-50% while maintaining accuracy.

Motivation: LRMs are inefficient and over-generate verification and reflection steps despite recent advances. Lightweight monitoring is needed to control generation and to study reasoning behaviors.

Method: Introduce Step-Tagging framework with ReasonType taxonomy for classifying reasoning steps. Use online monitoring of step counts as interpretable early stopping criteria during inference.
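
To illustrate the monitoring loop (a keyword heuristic stands in for the trained step classifier, and the step names and threshold are illustrative rather than the paper's ReasonType taxonomy):

```python
VERIFY_MARKERS = ("let me check", "verify", "double-check", "wait,")

def tag_step(step: str) -> str:
    """Stand-in for the trained sentence classifier: tag a step as 'verification' or 'other'."""
    return "verification" if any(m in step.lower() for m in VERIFY_MARKERS) else "other"

def should_stop(steps: list[str], max_verifications: int = 3) -> bool:
    """Interpretable early-stopping rule: halt once the model has produced more
    than `max_verifications` verification-type steps."""
    return sum(tag_step(s) == "verification" for s in steps) > max_verifications
```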

Result: 20-50% token reduction while maintaining comparable accuracy to standard generation, with largest gains on computation-heavy tasks. Evaluated on MATH500, GSM8K, AIME, GPQA, MMLU-Pro.

Conclusion: Provides novel control over LRM generation and new tool to study reasoning behaviors, addressing inefficiency through interpretable early stopping.

Abstract: The field of Language Reasoning Models (LRMs) has been very active over the past few years with advances in training and inference techniques enabling LRMs to reason longer, and more accurately. However, a growing body of studies show that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduced ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrated that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation, with largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.

[18] Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models

Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar

Main category: cs.CL

TL;DR: Document packing during LLM training improves multi-hop reasoning but requires more compute; ablation study reveals key factors behind packing benefits.

Motivation: Standard LLM training packs multiple documents for computational efficiency, but the impact on model capabilities remains unexplored, particularly regarding multi-hop reasoning abilities.

Method: Investigated different document-packing strategies and their influence on latent multi-hop reasoning abilities of LLMs, conducting an ablation study to identify key factors explaining packing advantages.
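
For context, the standard packing baseline that such a study varies looks roughly like this (the specific strategies compared in the paper are not reproduced): tokenized documents are concatenated with an end-of-sequence separator and cut into fixed-length training sequences.

```python
def pack_documents(docs, seq_len, eos_id=0):
    """Greedy packing: concatenate tokenized documents (EOS-separated) and
    slice the stream into fixed-length training sequences."""
    stream = []
    for doc in docs:
        stream.extend(doc + [eos_id])
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

# Example: three short "documents" packed into sequences of length 8.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_documents(docs, seq_len=8))
```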

Result: Packing improves model performance compared to training on individual documents, though at the expense of more computational resources. The ablation study identified key factors that explain the advantages of packing.

Conclusion: The research deepens understanding of LLM training dynamics and provides practical insights for optimizing model development through better document-packing strategies.

Abstract: The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models’ capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.

[19] SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models

Shizhuo Mao, Song Chen, Yi Kang

Main category: cs.CL

TL;DR: SASQ is a lightweight quantization-aware training framework that optimizes only quantization factors (not weights) for activation quantization, enabling static inference with high accuracy while maintaining deployment efficiency.

Motivation: LLMs face deployment challenges due to their large size outpacing GPU memory advancements. Existing quantization solutions have trade-offs: dynamic quantization has high computational overhead, static quantization sacrifices accuracy, and QAT suffers from weight training costs.

Method: SASQ is a lightweight QAT framework that exclusively optimizes quantization factors without changing pre-trained weights. It adaptively truncates some outliers to reduce quantization difficulty while preserving activation distribution characteristics.
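
A minimal sketch of the quantization step under assumptions about its form: activations are clipped to a learned range and quantized with a fixed per-tensor scale. In a SASQ-style setup, only the scale and clip values would be trained (as parameters), while the pretrained weights stay untouched; because they are fixed at inference time, the model can run fully statically.

```python
import torch

def fake_quantize_static(x: torch.Tensor, scale: float, clip: float, bits: int = 8) -> torch.Tensor:
    """Static fake-quantization with outlier truncation (illustrative form only)."""
    qmax = 2 ** (bits - 1) - 1
    x = torch.clamp(x, -clip, clip)                     # truncate a few outliers
    q = torch.round(x / scale).clamp(-qmax - 1, qmax)   # map onto the integer grid
    return q * scale                                    # dequantized output for training/eval
```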

Result: SASQ surpasses existing SOTA quantization schemes and outperforms corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.

Conclusion: SASQ provides an effective solution for LLM deployment by enabling static inference with high accuracy through lightweight optimization of quantization factors, addressing the fundamental trade-offs in existing quantization approaches.

Abstract: Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ exclusively optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.

[20] C-ing Clearly: Enhanced Binary Code Explanations using C code

Teodor Poncu, Ioana Pintilie, Marius Dragoi, Dragos Tantaru, Florin Brad

Main category: cs.CL

TL;DR: C-ing Clearly: Synthetic data generation method using C code to improve LLM performance on assembly language tasks like binary code summarization and vulnerability detection.

Motivation: LLMs typically perform well on high-level programming languages but struggle with lower-level languages like assembly. There's a need to enhance LLM understanding of assembly code for security and reverse engineering applications.

Method: Proposes C-ing Clearly, a synthetic data generation method that leverages corresponding C code to enhance LLM understanding of assembly. The approach involves fine-tuning LLMs on data generated through this method.

Result: Demonstrates improved LLM performance for binary code summarization and vulnerability detection. Shows consistent gains across different LLM families and model sizes.

Conclusion: Using C code as a bridge to improve LLM understanding of assembly is effective. The C-ing Clearly method provides a practical approach to enhance LLM capabilities for low-level programming language tasks.

Abstract: Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM’s understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.

[21] VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse

Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo, Enhua Wu, Xinghao Chen, Yunhe Wang

Main category: cs.CL

TL;DR: VersatileFFN is a novel feed-forward network that enables flexible parameter reuse in width and depth dimensions within fixed parameter budget, using adaptive pathways and difficulty-aware gating to enhance model capacity without increasing memory costs.

Motivation: Large Language Models have prohibitive memory costs from scaling, and existing parameter-efficient approaches like pruning and quantization only compress pretrained models without enhancing architectural capacity, hitting the representational ceiling of base models.

Method: VersatileFFN comprises two adaptive pathways: 1) width-versatile path that generates mixture of sub-experts from single shared FFN (mimicking sparse expert routing without parameter increase), and 2) depth-versatile path that recursively applies same FFN for deeper processing. Difficulty-aware gating dynamically balances pathways, routing “easy” tokens through width-wise route and allocating deeper refinement to “hard” tokens. Both pathways reuse same parameters.
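
A rough sketch of the two pathways and the gate, with illustrative dimensions and an assumed sub-expert construction (slices of the shared hidden layer); the paper's exact formulation may differ. The point it illustrates is that both paths reuse the same `up`/`down` weights, so extra capacity comes from computation, not parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VersatileFFNSketch(nn.Module):
    """Shared FFN reused along width (chunked sub-experts) and depth (recursion),
    mixed by a difficulty-aware gate."""
    def __init__(self, d_model=64, d_ff=256, num_subexperts=4, depth_steps=2):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.gate = nn.Linear(d_model, 1)
        self.num_subexperts = num_subexperts
        self.depth_steps = depth_steps

    def ffn(self, x):
        return self.down(F.gelu(self.up(x)))

    def _slice_mask(self, h, i, group):
        mask = torch.zeros(h.shape[-1], device=h.device)
        mask[i * group:(i + 1) * group] = 1.0
        return mask

    def width_path(self, x):
        # Sub-experts are slices of the shared hidden layer; no new parameters.
        h = F.gelu(self.up(x))
        group = h.shape[-1] // self.num_subexperts
        outs = [self.down(h * self._slice_mask(h, i, group)) for i in range(self.num_subexperts)]
        return torch.stack(outs).mean(0)

    def depth_path(self, x):
        # Recursively reuse the same FFN to emulate deeper processing.
        for _ in range(self.depth_steps):
            x = x + self.ffn(x)
        return x

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))   # difficulty-aware mixing weight per token
        return g * self.width_path(x) + (1 - g) * self.depth_path(x)
```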

Result: Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method in enhancing model capacity without increasing memory costs.

Conclusion: VersatileFFN provides a novel approach to enhance LLM capacity through computational reuse rather than parameter addition, offering flexible parameter reuse in width and depth dimensions within fixed parameter budget, with all additional capacity coming from computation rather than memory.

Abstract: The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering “easy” tokens through the efficient width-wise route and allocating deeper iterative refinement to “hard” tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.

[22] Dual Language Models: Balancing Training Efficiency and Overfitting Resilience

David Samuel, Lucas Georges Gabriel Charpentier

Main category: cs.CL

TL;DR: Combining autoregressive and masked-diffusion training objectives improves language model performance without architectural changes, achieving better results than single-objective models across various data repetition levels.

DetailsMotivation: Autoregressive models are training-efficient but prone to overfitting, while masked-diffusion models are more resilient to overfitting but less efficient to train. The paper aims to combine both approaches to get the best of both worlds.

Method: Train language models with dual-objective training combining autoregressive and masked-diffusion objectives without architectural modifications. Conduct systematic evaluation of 50 models with varying objective ratios across different levels of data repetition to find optimal combinations.
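
The mixing of the two objectives can be pictured as a simple combined loss. The sketch below is a rough illustration under assumed names (an HF-style causal LM and a single random mask rate standing in for the diffusion masking schedule), not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(model, input_ids, mask_token_id, ar_ratio=0.7, mask_prob=0.15):
    """Illustrative mix of an autoregressive and a masked-prediction loss."""
    # Autoregressive term: predict token t+1 from tokens <= t.
    logits = model(input_ids).logits
    ar_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

    # Masked term: corrupt a random subset of tokens and reconstruct them.
    corrupted = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    corrupted[mask] = mask_token_id
    masked_logits = model(corrupted).logits
    masked_loss = F.cross_entropy(masked_logits[mask], input_ids[mask])

    # A fixed ratio stands in for the paper's tuned objective mixture.
    return ar_ratio * ar_loss + (1.0 - ar_ratio) * masked_loss
```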

Result: Dual-objective training consistently outperforms single-objective models across all evaluated settings. The optimal ratio between objectives is similar whether targeting autoregressive or masked-diffusion downstream performance, suggesting a universal optimal balance.

Conclusion: Combining autoregressive and masked-diffusion training objectives is beneficial for language models, providing improved performance and robustness across different data conditions without requiring architectural changes.

Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.

[23] VLegal-Bench: A Benchmark for Evaluating Large Language Models on Vietnamese Legal Tasks

Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu


Main category: cs.CL

TL;DR: VLegal-Bench is the first comprehensive benchmark for evaluating LLMs on Vietnamese legal tasks, featuring 10,450 expert-annotated samples across multiple cognitive levels and practical scenarios.

DetailsMotivation: The complexity, hierarchical organization, and frequent revisions of Vietnamese legislation create challenges for evaluating how well LLMs interpret and utilize legal knowledge, creating a need for a standardized evaluation framework.

Method: Developed a benchmark informed by Bloom’s cognitive taxonomy, with 10,450 samples generated through rigorous annotation pipeline where legal experts label and cross-validate each instance using an annotation system. Tasks reflect practical usage scenarios including general Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving.

Result: Created VLegal-Bench, the first comprehensive benchmark for Vietnamese legal tasks, establishing a standardized, transparent, and cognitively informed evaluation framework for assessing LLM performance in Vietnamese legal contexts.

Conclusion: VLegal-Bench provides a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.

Abstract: The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom’s cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.

[24] Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng

Main category: cs.CL

TL;DR: LLMs show moderate to good agreement with human raters in essay scoring, but reliability varies substantially across studies due to methodological differences and lack of standardized reporting.

DetailsMotivation: Despite growing use of LLMs for automatic essay scoring, empirical evidence about their reliability compared to human raters remains inconsistent and needs systematic synthesis.

Method: Systematic review following PRISMA 2020 guidelines, analyzing 65 published and unpublished studies from January 2022 to August 2025 examining LLM-human agreement in AES.

Result: LLM-human agreement was generally moderate to good (agreement indices 0.30-0.80), but substantial variability existed across studies due to study-specific factors and lack of standardized reporting.
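
The agreement indices summarized here are standard statistics; as a quick illustration with made-up scores (not data from the synthesis), they can be computed with common libraries as follows.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

# Hypothetical essay scores from a human rater and an LLM on a 1-6 scale.
human = [3, 4, 2, 5, 4, 3, 6, 2]
llm   = [3, 4, 3, 5, 5, 3, 5, 2]

qwk = cohen_kappa_score(human, llm, weights="quadratic")  # Quadratic Weighted Kappa
r, _ = pearsonr(human, llm)
rho, _ = spearmanr(human, llm)
print(f"QWK={qwk:.2f}, Pearson={r:.2f}, Spearman={rho:.2f}")
```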

Conclusion: LLMs show promise for AES but reliability varies; standardization in reporting and methodological practices is needed for more consistent evaluation of LLM performance in essay scoring.

Abstract: Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman’s rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.

[25] PolyPersona: Persona-Grounded LLM for Synthetic Survey Responses

Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima Rafi, Rohith Tangudu

Main category: cs.CL

TL;DR: PolyPersona is a framework for generating persona-conditioned survey responses using efficient fine-tuning of small language models, achieving results comparable to larger models.

DetailsMotivation: Need for efficient and reproducible generation of synthetic survey data that preserves persona consistency across multiple domains, enabling controlled evaluation and bias analysis.

Method: Instruction-tunes compact chat models using LoRA adapters with 4-bit quantization; uses dialogue-based data pipeline to preserve persona cues; constructs dataset of 3,568 synthetic responses across 10 domains and 433 personas.
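
A minimal sketch of this kind of parameter-efficient setup with the transformers/peft/bitsandbytes stack is shown below; the base model name, target modules, and LoRA hyperparameters are placeholders rather than the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder compact chat model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weight quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the small LoRA adapters are trainable
model.print_trainable_parameters()
```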

Result: Small models (TinyLlama 1.1B, Phi-2) achieve performance comparable to larger 7B-8B models, with highest BLEU score of 0.090 and ROUGE-1 of 0.429; framework enables reliable synthetic survey data generation.

Conclusion: Persona-conditioned fine-tuning enables small language models to generate coherent synthetic survey data efficiently, providing scalable evaluation and transparent bias analysis capabilities.

Abstract: This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment alignment. Experimental results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.

[26] Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova, Pedro Ortiz Suarez

Main category: cs.CL

TL;DR: Tutorial on building equitable NLP pipelines for low-resource languages, covering data collection, parallel mining, translation, and downstream tasks with focus on fairness and reproducibility.

DetailsMotivation: To address the inequity in NLP technologies by providing practical tools for working with underrepresented languages, tackling data scarcity and cultural variance challenges.

Method: End-to-end NLP pipeline development including web crawling, parallel sentence mining, machine translation, and downstream applications like text classification and multimodal reasoning.
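
One stage of such a pipeline, parallel sentence mining, can be sketched with multilingual sentence embeddings; the LaBSE model choice, the toy sentence pairs, and the 0.8 threshold below are assumptions for illustration, not the tutorial's prescribed setup.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical candidate sentences crawled from two sides of a web corpus.
src_sentences = ["The library opens at nine.", "Rain is expected tomorrow."]
tgt_sentences = ["Morgen wird Regen erwartet.", "Die Bibliothek öffnet um neun."]

model = SentenceTransformer("sentence-transformers/LaBSE")
src_emb = model.encode(src_sentences, convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(src_emb, tgt_emb)            # pairwise cosine similarities
for i, row in enumerate(scores):
    j = int(row.argmax())
    if row[j] > 0.8:                               # assumed mining threshold
        print(src_sentences[i], "<->", tgt_sentences[j], float(row[j]))
```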

Result: Practical toolkit demonstrated across 10+ languages from diverse language families and geopolitical contexts, covering both resource-rich and severely underrepresented languages.

Conclusion: The tutorial provides community-informed development approaches for creating more equitable and socially impactful language technologies for low-resource languages.

Abstract: This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages – from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.

[27] Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer

Adarsha Shrestha, Basanta Pokharel, Binit Shrestha, Smriti Adhikari, Dinesh Gothe

Main category: cs.CL

TL;DR: This paper presents a GPT-2-based language model for Nepali, incorporating GPT-3-inspired training strategies and a custom BPE tokenizer, achieving improved text generation capabilities for this low-resource language.

DetailsMotivation: Nepali faces significant NLP challenges due to its complex grammar, agglutinative morphology, and limited high-quality corpora. Existing encoder-based approaches are insufficient for Nepali-specific text generation, creating a need for better generative models.

Method: The authors developed a GPT-2-based model with GPT-3-inspired training strategies including optimized learning rate schedules, batch scaling, and architectural refinements. They created a custom 16k BPE tokenizer trained exclusively on Nepali text and pretrained the model on a combined dataset (10.75GB cleaned NepBERTa corpus + web-scraped news articles). FlashAttention was integrated to reduce memory usage and stabilize training.
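
A minimal sketch of training a 16k-vocabulary BPE tokenizer with the Hugging Face tokenizers library is shown below; the corpus path, pre-tokenizer, and special tokens are placeholders, and the paper's tokenizer may differ in these details.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=16000,                                  # 16k vocabulary, as in the paper
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],  # assumed special tokens
)
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)   # placeholder corpus path
tokenizer.save("nepali_bpe_16k.json")

print(tokenizer.encode("नेपाली भाषा").tokens)
```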

Result: After two epochs, the model achieved a training loss of 3.168177, validation loss of 3.081982, and final perplexity of 21.80. The model demonstrated capability to generate coherent Nepali news-style text.

Conclusion: The developed GPT-2-based model with custom Nepali tokenizer and optimized training strategies successfully addresses the text generation challenges for Nepali, providing a foundation for improved NLP applications in this low-resource language.

Abstract: Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.

[28] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa

Main category: cs.CL

TL;DR: JMMMU-Pro is a Japanese multimodal benchmark that combines question images and text into single images, requiring integrated visual-textual understanding. It’s built using Vibe Benchmark Construction with generative AI and human verification.

DetailsMotivation: To create a more rigorous evaluation tool for assessing Japanese capabilities of Large Multimodal Models (LMMs) that requires integrated visual-textual understanding, extending from JMMMU and following the evolution from MMMU to MMMU-Pro.

Method: Vibe Benchmark Construction: Uses image generative models (like Nano Banana Pro) to produce candidate visual questions, with human verification and regeneration with adjusted prompts when needed. Leverages realistic image generation and clean Japanese text embedding capabilities.

Result: All open-source LMMs struggle substantially with JMMMU-Pro, demonstrating its challenging nature. The benchmark covers a wide range of background and layout designs and was constructed at low cost with high quality.

Conclusion: JMMMU-Pro provides important benchmark for guiding future efforts in open-source community and offers more rigorous evaluation of Japanese LMM capabilities. Vibe Benchmark Construction offers efficient guideline for future image-based VQA benchmark development.

Abstract: This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro’s highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

[29] TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines

David Schulmeister, Valentin Hartmann, Lars Klein, Robert West

Main category: cs.CL

TL;DR: TiME models are tiny monolingual encoders that offer better efficiency-performance trade-offs than large models for specific NLP tasks, using distillation techniques and supporting low-resource languages.

DetailsMotivation: Large language models are inefficient for many practical NLP applications that require real-time responses, high throughput, and low energy consumption, especially on battery-powered devices. There's a need for small, specialized models that can handle well-defined tasks efficiently.

Method: Train small monolingual encoder models (TiME) using modern training techniques like distillation, including distilling monolingual models from multilingual teachers and distilling models with absolute positional embeddings from teachers with relative positional embeddings.
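
The distillation objective behind such training can be illustrated with the classic soft-target formulation; the temperature, weighting, and toy logits below are assumptions, and TiME's exact losses (including the cross-lingual and positional-embedding transfer) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher logits) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits for a 3-class task.
student_logits = torch.randn(8, 3, requires_grad=True)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```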

Result: TiME models demonstrate improved trade-off between benchmark performance and efficiency metrics (throughput, latency, energy consumption) across a range of common NLP tasks, while also supporting low-resource languages.

Conclusion: Small, specialized models like TiME offer practical advantages over large general-purpose models for efficiency-critical applications, with distillation techniques enabling effective training of compact models while maintaining performance.

Abstract: Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.

[30] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang

Main category: cs.CL

TL;DR: Jacobi Forcing is a progressive distillation method that converts autoregressive LLMs into efficient parallel decoders, achieving 3.8-4.0x speedup with minimal performance loss.

DetailsMotivation: Current diffusion-based LLMs for parallel decoding suffer from limited speedup due to pretrain-to-posttrain mismatch - masked data distribution differs from real-world data, and bidirectional attention conflicts with pretrained causal priors, preventing efficient KV cache reuse.

Method: Jacobi Forcing: progressive distillation where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into parallel decoders while preserving pretrained causal inference properties. Also introduces multi-block decoding with rejection recycling for higher token acceptance per iteration.
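
The Jacobi-style parallel decoding that these models are distilled toward can be sketched as a fixed-point iteration over a block of future tokens; the generic step function and toy bigram model below are illustrative stand-ins for a real causal LM with KV caching and acceptance checks.

```python
import torch

def jacobi_decode(step_fn, prompt_ids, block_len=16, max_iters=32, pad_id=0):
    """Iteratively refine a whole block of future tokens in parallel.

    step_fn(ids) -> logits of shape (len(ids), vocab); it stands in for a causal
    LM forward pass. Iteration stops when the block reaches a fixed point.
    """
    ids = torch.cat([prompt_ids, torch.full((block_len,), pad_id, dtype=torch.long)])
    for _ in range(max_iters):
        logits = step_fn(ids)
        new_block = logits[len(prompt_ids) - 1 : -1].argmax(dim=-1)  # greedy parallel update
        if torch.equal(new_block, ids[len(prompt_ids):]):
            break                                                    # fixed point reached
        ids = torch.cat([prompt_ids, new_block])
    return ids

# Toy step function: a bigram "model" that always predicts previous token + 1.
vocab = 100
step = lambda ids: torch.nn.functional.one_hot((ids + 1) % vocab, vocab).float()
print(jacobi_decode(step, torch.tensor([5, 6, 7]), block_len=4))   # tensor([5, 6, 7, 8, 9, 10, 11])
```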

Result: Jacobi Forcing Models achieve 3.8x wall-clock speedup on coding and math benchmarks with minimal performance loss. Multi-block decoding with rejection recycling enables up to 4.5x higher token acceptance per iteration and nearly 4.0x wall-clock speedup.

Conclusion: Jacobi Forcing effectively addresses the pretrain-posttrain mismatch in parallel decoding models, enabling efficient conversion of AR models to parallel decoders with significant speedup while maintaining generation quality.

Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models’ trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.

[31] MMGR: Multi-Modal Generative Reasoning

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu

Main category: cs.CL

TL;DR: MMGR is a new benchmark that evaluates video foundation models’ reasoning abilities across physical, logical, spatial, and temporal domains, revealing significant gaps in abstract reasoning and spatial planning despite good perceptual quality.

DetailsMotivation: Current video foundation models produce visually realistic content but lack evaluation of their reasoning capabilities. Existing metrics like FVD focus on perceptual quality while ignoring violations of causality, physics, and global consistency, creating a need for principled reasoning evaluation.

Method: MMGR introduces a multi-modal evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. It evaluates across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (3D navigation/localization), and Physical Commonsense (sports/interactions). The framework uses fine-grained metrics requiring holistic correctness across video and image generation.

Result: Benchmarking leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image) reveals strong performance gaps. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings.

Conclusion: Current models have key limitations including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR provides a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

[32] Question Answering Over Spatio-Temporal Knowledge Graph

Xinbang Dai, Huiying Li, Nan Hu, Yongrui Chen, Rihui Jin, Huikang Hu, Guilin Qi

Main category: cs.CL

TL;DR: STQAD: First comprehensive spatio-temporal KGQA benchmark with 10K questions requiring both temporal and spatial reasoning. STCQA method achieves SOTA by jointly embedding temporal/spatial features and constraint-aware reasoning.

DetailsMotivation: Research on spatio-temporal knowledge graph question answering (STKGQA) is limited due to lack of datasets containing both spatio-temporal information and methods capable of handling implicit spatio-temporal reasoning.

Method: Introduced STQAD dataset (10K natural language questions with real-world spatio-temporal facts) and proposed STCQA method that jointly embeds temporal and spatial features into KG representations with dynamic constraint-aware reasoning.

Result: Existing KGQA methods underperform on STQAD due to inability to model spatio-temporal interactions. STCQA achieves state-of-the-art performance, significantly outperforming existing baselines.

Conclusion: Provides a valuable resource (STQAD) for future research and advances the field with a robust baseline (STCQA) for answering complex spatio-temporal questions.

Abstract: Spatio-temporal knowledge graphs (STKGs) enhance traditional KGs by integrating temporal and spatial annotations, enabling precise reasoning over questions with spatio-temporal dependencies. Despite their potential, research on spatio-temporal knowledge graph question answering (STKGQA) remains limited. This is primarily due to the lack of datasets that simultaneously contain spatio-temporal information, as well as methods capable of handling implicit spatio-temporal reasoning. To bridge this gap, we introduce the spatio-temporal question answering dataset (STQAD), the first comprehensive benchmark comprising 10,000 natural language questions that require both temporal and spatial reasoning. STQAD is constructed with real-world facts containing spatio-temporal information, ensuring that the dataset reflects practical scenarios. Furthermore, our experiments reveal that existing KGQA methods underperform on STQAD, primarily due to their inability to model spatio-temporal interactions. To address this, we propose the spatio-temporal complex question answering (STCQA) method, which jointly embeds temporal and spatial features into KG representations and dynamically filters answers through constraint-aware reasoning. STCQA achieves state-of-the-art performance, significantly outperforming existing baselines. Our work not only provides a valuable resource for future research but also advances the field by offering a robust baseline for answering complex spatio-temporal questions.

[33] Enhancing Long-term RAG Chatbots with Psychological Models of Memory Importance and Forgetting

Ryuichi Sumida, Koji Inoue, Tatsuya Kawahara

Main category: cs.CL

TL;DR: LUFY is a novel RAG method for long-term conversations that selectively retains emotionally arousing memories while forgetting most conversation content, improving retrieval accuracy and user experience.

DetailsMotivation: As conversations progress in RAG systems, increasing memory load degrades retrieval accuracy. The paper addresses this problem by drawing on psychological insights about memory retention and forgetting.

Method: Proposes LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of conversation content, implementing selective forgetting based on emotional arousal.
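
The retention rule can be pictured as keeping only the top fraction of memories by arousal; the scorer and the 10% threshold below are placeholder stand-ins for the paper's psychologically grounded importance and forgetting models.

```python
def retain_memories(utterances, arousal_scores, keep_fraction=0.1):
    """Keep only the most emotionally arousing ~10% of conversation memories."""
    ranked = sorted(zip(utterances, arousal_scores), key=lambda p: p[1], reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return [text for text, _ in ranked[:k]]

# Toy usage with hypothetical arousal scores in [0, 1].
memories = retain_memories(
    ["I got the job!", "The weather is mild.", "My dog was hurt yesterday.", "I had toast."],
    [0.9, 0.1, 0.8, 0.05],
)
print(memories)   # ['I got the job!']
```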

Result: In extensive user experiments (2 hours over 4 sessions per chatbot type - the most extensive assessment to date), prioritizing arousing memories while forgetting majority of conversation significantly enhanced user experience.

Conclusion: The study pushes the frontier of long-term conversations and highlights the importance of forgetting unimportant parts of conversations for better RAG performance and user experience.

Abstract: While Retrieval-Augmented Generation (RAG) has shown promise in enhancing long-term conversations, the increasing memory load as conversations progress degrades retrieval accuracy. Drawing on psychological insights, we propose LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation. In the user experiment, participants interacted with three types of RAG chatbots, each for 2 hours over 4 sessions, marking the most extensive assessment of a chatbot’s long-term capabilities to date – more than four times longer than any existing benchmark. The results demonstrate that prioritizing arousing memories while forgetting the majority of the conversation significantly enhances user experience. This study pushes the frontier of long-term conversations and highlights the importance of forgetting unimportant parts of conversations. Code and Dataset: https://github.com/ryuichi-sumida/LUFY

[34] Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models

James Flemings, Bo Jiang, Wanrong Zhang, Zafar Takhirov, Murali Annavaram

Main category: cs.CL

TL;DR: The paper introduces “context influence,” a differential privacy-based metric to measure privacy leakage of contextual knowledge in language models, showing leakage occurs when context is out-of-distribution relative to parametric knowledge.

DetailsMotivation: Language models use contextual knowledge that may contain private information, but current methods overestimate privacy risk by not accounting for whether the LM already knows the information from its parametric knowledge.

Method: Develops “context influence” metric based on differential privacy to measure how each subset of context influences LM responses while separating the LM’s specific parametric knowledge.
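
At its core, such a metric contrasts how likely the model's response is with and without a context subset; the sketch below assumes a generic log-probability scorer and simplifies away the differential-privacy machinery described in the paper.

```python
import math

def context_influence(logprob_fn, response, query, context_subset):
    """Influence of a context subset on the generated response (in nats).

    logprob_fn(response, prompt) -> total log-probability of `response` given
    `prompt`; it stands in for scoring with the actual language model.
    """
    with_ctx = logprob_fn(response, context_subset + "\n" + query)
    without_ctx = logprob_fn(response, query)
    return with_ctx - without_ctx   # large values indicate heavy reliance on the context

# Toy scorer: pretend the response is far more likely when the context is present.
toy_scores = {True: math.log(0.4), False: math.log(0.01)}
toy_logprob = lambda resp, prompt: toy_scores["secret" in prompt]
print(context_influence(toy_logprob, "The code is 1234.", "What is the code?", "secret: 1234"))
```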

Result: Context privacy leakage occurs when contextual knowledge is out-of-distribution with respect to parametric knowledge. The metric properly attributes privacy leakage to augmented contexts and evaluates factors like model size, context size, and generation position.

Conclusion: The context influence metric effectively measures privacy leakage in language models, informing practitioners about privacy risks associated with augmented contextual knowledge.

Abstract: Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when answering queries, and estimating this privacy leakage is not well understood. A straightforward approach of directly comparing an LM’s output to the contexts can overestimate the privacy risk, since the LM’s parametric knowledge might already contain the augmented contextual knowledge. To this end, we introduce context influence, a metric that builds on differential privacy, a widely-adopted privacy notion, to estimate the privacy leakage of contextual knowledge during decoding. Our approach effectively measures how each subset of the context influences an LM’s response while separating the specific parametric knowledge of the LM. Using our context influence metric, we demonstrate that context privacy leakage occurs when contextual knowledge is out of distribution with respect to parametric knowledge. Moreover, we experimentally demonstrate how context influence properly attributes the privacy leakage to augmented contexts, and we evaluate how factors – such as model size, context size, generation position, etc. – affect context privacy leakage. The practical implications of our results will inform practitioners of the privacy risk associated with augmented contextual knowledge.

[35] MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset

Md. Tanzim Ferdous, Naeem Ahsan Chowdhury, Prithwiraj Bhattacharjee

Main category: cs.CL

TL;DR: A new Bangla abstractive summarization dataset of 54,000+ articles from diverse sources (blogs, newspapers) was created to address the limitation of existing news-only datasets, enabling more adaptable summarization systems for real-world Bangla content.

DetailsMotivation: Existing Bangla summarization research focuses mainly on news articles with fixed writing styles, which fails to handle the diverse real-world Bangla content from blogs, newspapers, and social media. There's a pressing need for systems that can reduce information overload and help readers understand varied Bangla content quickly.

Method: Developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources including blogs (Cinegolpo) and newspapers (Samakal, The Business Standard). The dataset spans multiple domains and writing styles. Established baselines by training and evaluating with deep learning and transfer learning models including LSTM, BanglaT5-small, and MTS-small.

Result: Created a comprehensive multi-domain Bangla summarization dataset that offers greater adaptability and practical relevance than single-domain resources. The results demonstrate its potential as a benchmark for future Bangla NLP research.

Conclusion: This dataset provides a solid foundation for building robust Bangla summarization systems and helps expand NLP resources for low-resource languages, addressing the gap in diverse real-world Bangla text processing.

Abstract: This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today’s digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated this dataset using several deep learning and transfer learning models, including LSTM, BanglaT5-small, and MTS-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.

[36] Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective

Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen

Main category: cs.CL

TL;DR: SFT causes catastrophic forgetting while RFT preserves prior knowledge in multimodal LLMs, with training data distribution being more important than algorithmic differences.

DetailsMotivation: To understand how post-training algorithms (SFT and RFT) affect prior knowledge in multimodal large language models, using jigsaw puzzles as a novel task not present in pretraining data.

Method: Introduced jigsaw puzzles as a novel task and systematically studied SFT and RFT on Qwen2.5-VL series models. Analyzed learning dynamics by examining magnitude and direction of training data influence on prior knowledge.
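
The magnitude-and-direction analysis can be illustrated by comparing the gradient of a new-task loss with the gradient of a prior-knowledge loss; the tiny linear model and synthetic losses below are placeholders, not the paper's measurement protocol.

```python
import torch

def gradient_alignment(model, new_task_loss, prior_loss):
    """Cosine similarity and norm ratio between two loss gradients w.r.t. the model."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_new = torch.autograd.grad(new_task_loss, params, retain_graph=True)
    g_prior = torch.autograd.grad(prior_loss, params, retain_graph=True)
    flat_new = torch.cat([g.flatten() for g in g_new])
    flat_prior = torch.cat([g.flatten() for g in g_prior])
    cosine = torch.nn.functional.cosine_similarity(flat_new, flat_prior, dim=0)
    magnitude = flat_new.norm() / (flat_prior.norm() + 1e-12)
    return cosine.item(), magnitude.item()

# Toy usage with a tiny linear model and two synthetic objectives.
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
new_loss = model(x).pow(2).mean()
prior_loss = (model(x) - 1.0).abs().mean()
print(gradient_alignment(model, new_loss, prior_loss))
```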

Result: SFT enables rapid task acquisition but causes catastrophic forgetting, while RFT learns more slowly but maintains prior knowledge. RFT reinforces correct samples aligned with base model’s probability landscape, causing weaker interference.

Conclusion: Training data distribution, not algorithmic differences, plays central role in forgetting. RFT-simulated rollouts allow SFT to preserve knowledge while learning new tasks, highlighting RFT’s potential for stable continual learning in multimodal LLMs.

Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on open-source multimodal model, Qwen2.5-VL series. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model’s probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a small magnitude of influence and are well aligned in direction to prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. These findings suggest that distribution of training data, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT’s potential for stable continual learning in multimodal large language models.

[37] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Yuqi Ren, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Main category: cs.CL

TL;DR: TaP framework automates multi-language preference dataset generation using taxonomy guidance, enabling better LLM fine-tuning than much larger datasets.

DetailsMotivation: High-quality datasets for supervised and preference fine-tuning are resource-intensive to create and mostly available only in English, limiting multilingual LLM alignment.

Method: TaP (Taxonomy-Guided Preference Data Generation) framework uses structured taxonomy for fine-grained control over dataset composition, enabling automated and scalable construction of preference datasets across multiple languages.

Result: LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets, even surpassing performance of models trained on datasets 180 times larger.

Conclusion: TaP provides an effective solution for scalable, multilingual preference dataset generation that enables better LLM alignment with human preferences across diverse languages.

Abstract: Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.

[38] TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu

Main category: cs.CL

TL;DR: Researchers created TIBSTC-CoT, a large-scale Tibetan dataset using chain-of-thought prompting with LLMs to address data scarcity, and developed Sunshine-thinking LLM family trained on this dataset, achieving SOTA-comparable performance for Tibetan language processing.

DetailsMotivation: To address severe data scarcity in Tibetan, a low-resource language spoken by over six million people, enabling high-quality Tibetan language processing through inclusive AI development.

Method: Introduced TIBSTC-CoT dataset automatically constructed via chain-of-thought prompting with LLMs, establishing scalable framework for low-resource dataset creation. Developed Sunshine-thinking LLM family trained entirely on TIBSTC-CoT with chain-of-thought capabilities.

Result: Sunshine-thinking LLMs demonstrated strong reasoning and generation performance comparable to state-of-the-art multilingual LLMs, with all data made publicly available on GitHub.

Conclusion: This work marks significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation (dataset) and model innovation (LLM family).

Abstract: To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, the large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: https://github.com/Vicentvankor/sun-shine.

[39] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Jungsuk Oh, Jay-Yoon Lee

Main category: cs.CL

TL;DR: LSC (Latent Self-Consistency) uses learnable token embeddings to select semantically consistent responses, outperforming existing consistency methods on both short and long-form reasoning tasks with minimal computational overhead.

DetailsMotivation: Existing probabilistic decoding in LLMs produces inconsistent outputs, especially on complex/long-form questions. Current consistency methods (SC, USC, WUCS) have limitations: SC only works for short-form QA with exact string matching, while USC and WUCS extend to long-form but lose accuracy on short-form benchmarks.

Method: Latent Self-Consistency (LSC) selects the most semantically consistent response using learnable token embeddings. It performs lightweight forward processing of summary tokens only, introducing negligible runtime overhead (≤0.9%) on top of standard decoding, with no architecture changes required.
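
The selection step can be sketched as picking the candidate whose embedding agrees most with the others; the precomputed embeddings below stand in for LSC's learned summary-token representations.

```python
import torch
import torch.nn.functional as F

def select_most_consistent(embeddings):
    """Pick the response whose embedding has the highest mean similarity to the rest.

    embeddings: (num_candidates, dim) tensor, e.g. one summary embedding per
    sampled response (assumed to be precomputed here).
    """
    normed = F.normalize(embeddings, dim=-1)
    sims = normed @ normed.T                              # pairwise cosine similarities
    sims.fill_diagonal_(0.0)
    mean_agreement = sims.sum(dim=-1) / (len(embeddings) - 1)
    return int(mean_agreement.argmax())

# Toy usage: candidates 0 and 2 agree, candidate 1 is an outlier.
emb = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(select_most_consistent(emb))   # 2 (closest on average to the others)
```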

Result: LSC surpasses SC, USC, and WUCS on average performance across 6 short-form and 5 long-form reasoning benchmarks (MATH, MMLU, TruthfulQA, etc.). It works effectively across various answer formats while adding negligible computational overhead. LSC also provides well-calibrated confidence estimates with low expected calibration error.

Conclusion: LSC is a reliable consistency-selection method that works effectively across both short and long-form answer formats, offering superior performance to existing methods with minimal computational cost and providing well-calibrated confidence estimates.

Abstract: Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. LSC’s lightweight forward processing of summary tokens only introduces negligible runtime overhead (at most 0.9%) on top of standard decoding of the base LLM, and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS in average performance on both short-form and long-form tasks, while adding negligible computational overhead on vanilla inference. These results position LSC as a reliable consistency-selection method that works effectively across various answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low expected calibration error across both answer formats.

[40] DIWALI: Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context

Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar

Main category: cs.CL

TL;DR: DIWALI dataset: A culturally-grounded dataset for evaluating LLM cultural competence in Indian contexts with ~8k concepts across 36 sub-regions and 17 cultural facets.

DetailsMotivation: LLMs lack cultural alignment and produce biased generations due to insufficient cultural knowledge. Current evaluation is challenging due to lack of proper metrics and culturally-grounded datasets that capture regional/sub-regional cultural complexity.

Method: Created a novel CSI (culture specific items) dataset for Indian culture with ~8k concepts from 36 sub-regions across 17 cultural facets. Evaluated LLMs on cultural text adaptation task using CSIs, LLM-as-Judge, and human evaluations from diverse socio-demographic regions.

Result: Quantitative analysis revealed selective sub-regional coverage and surface-level adaptations across all evaluated LLMs, demonstrating limitations in cultural competence.

Conclusion: The DIWALI dataset addresses the gap in culturally-grounded evaluation resources for Indian culture, enabling better assessment of LLM cultural competence and revealing current limitations in cultural adaptation capabilities.

Abstract: Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment (Ryan et al., 2024; AlKhamissi et al., 2024) and produce biased generations (Naous et al., 2024) due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, belonging to 17 cultural facets. The dataset comprises ~8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM as Judge, and human evaluations from diverse socio-demographic regions. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available here: https://huggingface.co/datasets/nlip/DIWALI, project webpage https://nlip-lab.github.io/nlip/publications/diwali/, and our codebase with model outputs can be found here: https://github.com/pramitsahoo/culture-evaluation

[41] Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics

Guangliang Liu, Xi Chen, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Johnson

Main category: cs.CL

TL;DR: LLMs struggle with moral reasoning generalization due to distributional semantics vs pragmatic morals; proposed pragmatic inference methods based on moral foundations theory improve generalization.

DetailsMotivation: Moral reasoning is promising for LLMs but generalization remains challenging because LLMs rely on distributional semantics while morals operate at pragmatic level, creating a fundamental gap.

Method: Proposed pragmatic inference methods grounded in moral foundations theory that leverage contextual information at each step to bridge the pragmatic gap and connect moral foundations with reasoning objectives.

Result: Experimental results show the approach significantly enhances LLMs’ generalization in moral reasoning, demonstrating effectiveness of bridging the pragmatic gap.

Conclusion: The work provides a foundation for future research in moral reasoning for LLMs grounded in moral foundations theory, showing pragmatic inference can overcome distributional semantics limitations.

Abstract: Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which fundamentally differs from the morals which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs’ generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.

[42] Analysing Knowledge Construction in Online Learning: Adapting the Interaction Analysis Model for Unstructured Large-Scale Discourse

Jindi Wang, Yidi Zhang, Zhaoxing Li, Pedro Bem Haja, Ioannis Ivrissimtzis, Zichen Zhao, Sebastian Stein

Main category: cs.CL

TL;DR: Proposes and validates a framework combining Interaction Analysis Model codebook with automated classifier for large-scale analysis of knowledge construction in unstructured online learning discourse.

DetailsMotivation: The rapid growth of online courses and social media generates large volumes of unstructured learner-generated text, but existing manual coding approaches don't scale to fragmented online discourse. Understanding knowledge construction is crucial for analyzing learning processes, informing content design, and providing feedback at scale.

Method: Combines a codebook inspired by the Interaction Analysis Model with automated classification. Adapts four comment-level categories: Non-Knowledge Construction, Share, Explore, and Integrate. Three annotators coded 20,000 comments from YouTube education channels. For automated classification, compared bag-of-words baselines with transformer models using 10-fold cross-validation.
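
A minimal sketch of fine-tuning a transformer classifier for the four comment-level categories is shown below; the two placeholder comments, hyperparameters, and single train split stand in for the study's 20,000-comment dataset and 10-fold cross-validation.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["Non-Knowledge Construction", "Share", "Explore", "Integrate"]
model_name = "microsoft/deberta-v3-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels))

class CommentDataset(torch.utils.data.Dataset):
    def __init__(self, texts, label_ids):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.label_ids = label_ids
    def __len__(self):
        return len(self.label_ids)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.label_ids[i])
        return item

# Placeholder examples; the study annotates 20,000 YouTube education comments.
train_ds = CommentDataset(["Great explanation of recursion!", "Here is a proof sketch ..."], [0, 2])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kc-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```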

Result: Codebook demonstrated strong reliability (Cohen’s kappa = 0.79-0.93). DeBERTa-v3-large achieved highest macro-averaged F1 score (0.841), outperforming all baselines. External validation on four domains yielded macro-F1 above 0.705, with stronger transfer in structured domains (medicine, programming) and weaker in varied domains (language, music).

Conclusion: Theory-driven, semi-automated analysis of knowledge construction at scale is feasible, enabling integration of knowledge-construction indicators into learning analytics and design of online learning environments.

Abstract: The rapid expansion of online courses and social media has generated large volumes of unstructured learner-generated text. Understanding how learners construct knowledge in these spaces is crucial for analysing learning processes, informing content design, and providing feedback at scale. However, existing approaches typically rely on manual coding of well-structured discussion forums, which does not scale to the fragmented discourse found in online learning. This study proposes and validates a framework that combines a codebook inspired by the Interaction Analysis Model with an automated classifier to enable large-scale analysis of knowledge construction in unstructured online discourse. We adapt four comment-level categories of knowledge construction: Non-Knowledge Construction, Share, Explore, and Integrate. Three trained annotators coded a balanced sample of 20,000 comments from YouTube education channels. The codebook demonstrated strong reliability, with Cohen’s kappa = 0.79 on the main dataset and 0.85–0.93 across four additional educational domains. For automated classification, bag-of-words baselines were compared with transformer-based language models using 10-fold cross-validation. A DeBERTa-v3-large model achieved the highest macro-averaged F1 score (0.841), outperforming all baselines and other transformer models. External validation on four domains yielded macro-F1 above 0.705, with stronger transfer in medicine and programming, where discourse was more structured and task-focused, and weaker transfer in language and music, where comments were more varied and context-dependent. Overall, the study shows that theory-driven, semi-automated analysis of knowledge construction at scale is feasible, enabling the integration of knowledge-construction indicators into learning analytics and the design of online learning environments.

[43] One Battle After Another: Probing LLMs’ Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

Qi Jia, Ye Shen, Xiujie Song, Kaiwei Zhang, Shibo Wang, Dun Pei, Xiangyang Zhu, Guangtao Zhai

Main category: cs.CL

TL;DR: EvolIF is an evolving benchmark for multi-turn instruction following in multi-topic dialogues; its framework applies Flow Theory with process-centric metrics and ends an evaluation only when user patience is exhausted. GPT-5 performs best, sustaining 14 turns with 66.40% robustness.

DetailsMotivation: Existing benchmarks for evaluating LLMs' instruction-following in multi-topic dialogues are limited to fixed turn counts, susceptible to saturation, and fail to account for users' interactive experience, creating a need for more realistic evaluation frameworks.

Method: Proposes a novel framework with three-layer tracking mechanism and query synthesis agent to mimic sequential user behaviors, incorporates Flow Theory with process-centric metrics, and terminates evaluation only upon exhausting user patience. Presents EvolIF benchmark covering 12 constraint groups.

Result: GPT-5 excels with 14 turns sustained and 66.40% robustness, outperforming Gemini-3.0-Pro by 5.59%. Other models trail behind these top performers.

Conclusion: The EvolIF framework provides a more realistic evaluation of LLMs’ instruction-following ability in multi-topic dialogues by addressing limitations of existing benchmarks and incorporating user experience factors through Flow Theory and patience-based termination.

Abstract: Evaluating LLMs’ instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, susceptible to saturation and failing to account for users’ interactive experience. In this work, we propose a novel framework backed by a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Incorporating Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only upon exhausting user patience. Upon this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Results indicate that GPT-5 excels, sustaining 14 turns with 66.40% robustness. It outperforms Gemini-3.0-Pro by a margin of 5.59%, while other models trail behind. Resources are available at https://github.com/JiaQiSJTU/EvolvingInstructionFollowing.
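
A minimal sketch of the patience-based termination idea as we read it (not the released framework): the dialogue advances turn by turn, a patience budget shrinks whenever a reply violates the active constraints, and the run ends once patience is exhausted. `model_reply` and `violates_constraints` are hypothetical stand-ins.

```python
# Illustrative loop for patience-terminated multi-turn evaluation.
# `model_reply` and `violates_constraints` are hypothetical placeholders.
def evaluate_dialogue(model_reply, violates_constraints, queries, patience=3):
    satisfied, turns = 0, 0
    for turn, query in enumerate(queries, start=1):
        reply = model_reply(query)
        if violates_constraints(reply, query):
            patience -= 1          # user tolerance shrinks on each failure
        else:
            satisfied += 1
        turns = turn
        if patience <= 0:          # stop only when user patience is exhausted
            break
    robustness = satisfied / turns if turns else 0.0
    return {"turns_sustained": turns, "robustness": robustness}
```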

[44] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

Shreya Gupta, Ojasva Saxena, Arghodeep Nandi, Sarah Masud, Kiran Garimella, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: Fine-tuned BERT model for analyzing narrative frames in podcasts by linking frames to specific entities, enabling scalable analysis of how topics are framed in conversational media.

DetailsMotivation: Podcasts are important for shaping public opinion but their unscripted, conversational nature makes automated analysis challenging. Existing LLMs trained on structured text struggle with subtle narrative cues in podcasts, limiting scalable analysis of how podcasts persuade and inform audiences.

Method: Developed a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in conversations, grounding abstract frames in concrete details. Uses granular frame labels correlated with high-level topics to reveal discourse trends.

Result: Novel frame-labeling methodology that more closely aligns with human judgment for messy conversational data, and reveals systematic relationships between topics (what is discussed) and frames (how it’s presented).

Conclusion: Provides a more robust framework for studying influence in digital media by enabling accurate, scalable analysis of narrative frames in podcasts, overcoming limitations of existing LLMs on conversational data.

Abstract: Podcasts have become a central arena for shaping public opinion, making them a vital source for understanding contemporary discourse. Their typically unscripted, multi-themed, and conversational style offers a rich but complex form of data. To analyze how podcasts persuade and inform, we must examine their narrative structures – specifically, the narrative frames they employ. The fluid and conversational nature of podcasts presents a significant challenge for automated analysis. We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture the subtle cues that human listeners rely on to identify narrative frames. As a result, current approaches fall short of accurately analyzing podcast narratives at scale. To solve this, we develop and evaluate a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in the conversation, effectively grounding the abstract frame in concrete details. Our approach then uses these granular frame labels and correlates them with high-level topics to reveal broader discourse trends. The primary contributions of this paper are: (i) a novel frame-labeling methodology that more closely aligns with human judgment for messy, conversational data, and (ii) a new analysis that uncovers the systematic relationship between what is being discussed (the topic) and how it is being presented (the frame), offering a more robust framework for studying influence in digital media.

[45] Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang, Kyle Hunt, Shaojie Tang, Kenneth Joseph

Main category: cs.CL

TL;DR: Fine-tuning LLMs on small human survey data improves simulation quality but still fails to reproduce regression coefficients, making LLMs unsuitable for replacing human participants in inferential analyses.

DetailsMotivation: To investigate whether fine-tuning LLMs on small human survey data can address known limitations of LLM-based simulation (limited diversity, subgroup misalignment, insufficient variance, belief-action discrepancies) and make them suitable replacements for human participants in research.

Method: Used a behavioral experiment on information disclosure to compare human and LLM-generated responses. Fine-tuned LLMs on small subsets of human survey data (like pilot study data). Evaluated across multiple dimensions: distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients.

Result: Fine-tuning on small human samples substantially improved heterogeneity, alignment, and belief-action coherence compared to base models. However, even the best fine-tuned models failed to reproduce the regression coefficients from the original human study.

Conclusion: While fine-tuning improves LLM simulation quality, LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses due to inability to reproduce key statistical relationships (regression coefficients).

Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.
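
To make the "recovery of regression coefficients" criterion concrete, here is a small sketch that fits the same OLS specification on human and on simulated responses and compares the estimates; the data below are synthetic placeholders, not the study's disclosure experiment.

```python
# Sketch: compare regression coefficients estimated from human vs. simulated data.
# The data are synthetic placeholders, not the study's variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))                       # e.g. treatment + covariate
y_human = 0.8 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(scale=1.0, size=n)
y_llm   = 0.2 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(scale=0.3, size=n)

X = sm.add_constant(x)
beta_human = sm.OLS(y_human, X).fit().params
beta_llm   = sm.OLS(y_llm, X).fit().params
print("human:", beta_human.round(2), "llm:", beta_llm.round(2))
print("max coefficient gap:", np.abs(beta_human - beta_llm).max().round(2))
```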

[46] Tree Matching Networks for Natural Language Inference: Parameter-Efficient Semantic Understanding via Dependency Parse Trees

Jason Lunder

Main category: cs.CL

TL;DR: Tree Matching Networks (TMN) using dependency parse trees outperform BERT on NLI tasks with much smaller size and faster training, but struggle on similarity tasks; multi-headed attention aggregation proposed for scalability.

DetailsMotivation: Transformer models like BERT achieve high NLI accuracy but require huge parameters and learn relationships from scratch. Explicit linguistic structures (dependency parse trees) could leverage prior encoded information to improve learning efficiency.

Method: Adapt Graph Matching Networks (GMN) to operate on dependency parse trees, creating Tree Matching Networks (TMN). Compare TMN to BERT on SNLI entailment and SemEval similarity tasks. Propose multi-headed attention aggregation to address scalability limitations.

Result: TMN achieves significantly better results than BERT on SNLI task with reduced memory footprint and less training time. Both models struggled on SemEval. Explicit structural representations outperform sequence-based models at comparable scales, but current aggregation methods limit scalability.

Conclusion: Dependency parse tree-based models (TMN) are more efficient than BERT for NLI tasks, demonstrating the value of explicit linguistic structures. Multi-headed attention aggregation is needed to overcome scalability limitations of current methods.

Abstract: In creating sentence embeddings for Natural Language Inference (NLI) tasks, using transformer-based models like BERT leads to high accuracy, but requires hundreds of millions of parameters. These models take in sentences as a sequence of tokens, and learn to encode the meaning of the sequence into embeddings such that those embeddings can be used reliably for NLI tasks. Essentially, every word is considered against every other word in the sequence, and the transformer model is able to determine the relationships between them, entirely from scratch. However, a model that accepts explicit linguistic structures like dependency parse trees may be able to leverage prior encoded information about these relationships, without having to learn them from scratch, thus improving learning efficiency. To investigate this, we adapt Graph Matching Networks (GMN) to operate on dependency parse trees, creating Tree Matching Networks (TMN). We compare TMN to a BERT based model on the SNLI entailment task and on the SemEval similarity task. TMN is able to achieve significantly better results with a significantly reduced memory footprint and much less training time than the BERT based model on the SNLI task, while both models struggled to perform well on the SemEval. Explicit structural representations significantly outperform sequence-based models at comparable scales, but current aggregation methods limit scalability. We propose multi-headed attention aggregation to address this limitation.
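
The input representation, a dependency parse tree treated as a graph, can be produced with off-the-shelf tools. A minimal sketch with spaCy and networkx follows; it assumes the `en_core_web_sm` model is installed and is not the paper's pipeline.

```python
# Sketch: turn a sentence's dependency parse into a graph that a tree/graph
# matching network could consume. Requires: python -m spacy download en_core_web_sm
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def parse_tree(sentence: str) -> nx.DiGraph:
    doc = nlp(sentence)
    g = nx.DiGraph()
    for tok in doc:
        g.add_node(tok.i, text=tok.text, pos=tok.pos_)
        if tok.head.i != tok.i:                  # the root points to itself in spaCy
            g.add_edge(tok.head.i, tok.i, dep=tok.dep_)
    return g

premise = parse_tree("A man is playing a guitar on stage.")
hypothesis = parse_tree("A person performs music.")
print(premise.number_of_nodes(), hypothesis.number_of_nodes())
```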

[47] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

Ying Wang, Zhen Jin, Jiexiong Xu, Wenhai Lin, Yiquan Chen, Wenzhi Chen

Main category: cs.CL

TL;DR: AugServe is an efficient inference framework that improves augmented LLM serving by using adaptive request scheduling and dynamic token batching to increase effective throughput and reduce latency.

DetailsMotivation: Current augmented LLM inference systems suffer from two main problems: (1) FCFS scheduling causes head-of-line blocking and queuing delays exceeding SLOs, and (2) static batch token limits fail to adapt to varying loads and hardware conditions, degrading effective throughput and service quality.

Method: AugServe uses a two-stage adaptive request scheduling strategy: Stage I optimizes scheduling order based on inference features of augmented LLM requests, and Stage II continuously refines decisions with runtime information. It also dynamically adjusts token batching based on hardware status and real-time load.

Result: AugServe achieves 4.7x and 3.3x higher effective throughput than vLLM and InferCept respectively, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%.

Conclusion: AugServe effectively addresses the limitations of existing augmented LLM inference systems by introducing adaptive scheduling and dynamic batching, significantly improving both throughput and latency performance for better user experience.

Abstract: As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queuing delays exceeding the SLOs for many requests; and (ii) static batch token limit, which fails to adapt to fluctuating loads and hardware conditions. Both of these factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7x and 3.3x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
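
As a schematic of moving beyond FCFS (our illustration, not AugServe's actual two-stage policy), the sketch below orders requests by SLO slack in a priority queue and fills a batch up to a token budget that a runtime controller could adjust.

```python
# Sketch: a slack-aware priority queue in place of FCFS, plus a token-budgeted
# batch builder. Illustrative only; not AugServe's scheduling algorithm.
import heapq

class Scheduler:
    def __init__(self):
        self._heap = []
        self._seq = 0                            # tie-breaker for equal slack

    def submit(self, request_id, slo_s, est_service_s):
        slack = slo_s - est_service_s            # requests closest to missing SLO go first
        heapq.heappush(self._heap, (slack, self._seq, request_id))
        self._seq += 1

    def next_batch(self, token_budget, est_tokens):
        # token_budget would be adjusted at runtime from hardware status and load
        batch, used = [], 0
        while self._heap and used + est_tokens(self._heap[0][2]) <= token_budget:
            _, _, rid = heapq.heappop(self._heap)
            batch.append(rid)
            used += est_tokens(rid)
        return batch

sched = Scheduler()
sched.submit("r1", slo_s=2.0, est_service_s=1.5)
sched.submit("r2", slo_s=5.0, est_service_s=0.5)
print(sched.next_batch(token_budget=4096, est_tokens=lambda rid: 1024))
```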

[48] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

Main category: cs.CL

TL;DR: The paper proposes an adaptive aggregation method for aligning LLMs with diverse human preferences in federated learning, achieving better fairness while maintaining competitive alignment quality.

DetailsMotivation: Standard methods for aligning LLMs often fail to adequately represent diverse human preferences in federated learning environments, where different groups may have varying viewpoints that need to be fairly aggregated.

Method: The authors introduce a comprehensive evaluation framework for assessing alignment-fairness trade-offs, where groups locally evaluate rollouts and produce reward signals. They evaluate standard aggregation techniques (min, max, average) and propose a novel adaptive scheme that dynamically adjusts preference weights based on each group’s historical alignment performance.

Result: Experiments on Q/A tasks using a PPO-based RLHF pipeline show that the adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores compared to standard aggregation methods.

Conclusion: The work provides a robust methodology for evaluating LLM behavior across diverse populations and offers a practical solution for developing truly pluralistic and fairly aligned models in federated learning settings.

Abstract: This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group’s historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
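
One plausible reading of the adaptive scheme: each group's reward receives a weight that grows when that group's historical alignment has been poor, so under-served groups count more in the aggregate. The rule below is an assumption for illustration, not the paper's exact formula.

```python
# Sketch: adaptive aggregation of per-group rewards. Groups whose historical
# alignment is lower receive larger weights (one plausible rule, not the paper's).
import numpy as np

def aggregate(group_rewards, history, temperature=1.0):
    """group_rewards: per-group reward for the current rollout.
    history: running mean alignment score per group (higher = better served)."""
    rewards = np.asarray(group_rewards, dtype=float)
    hist = np.asarray(history, dtype=float)
    weights = np.exp(-hist / temperature)        # emphasize poorly aligned groups
    weights /= weights.sum()
    return float(np.dot(weights, rewards))

print(aggregate([0.9, 0.2, 0.6], history=[0.8, 0.3, 0.6]))
# Baselines evaluated in the paper, for comparison: min, max, and plain average.
```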

[49] MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang, Zhiwei Liu, Weiguang Han, Shu Liao, Lianting Huang, Jinpeng Hu, Min Peng, Qianqian Xie, Sophia Ananiadou

Main category: cs.CL

TL;DR: MentraSuite is a unified framework for reliable mental-health reasoning with LLMs, featuring MentraBench benchmark and Mindora model optimized through SFT-RL with inconsistency-detection rewards.

DetailsMotivation: Current psychological LLMs focus on emotional understanding or knowledge recall but lack the step-wise, clinically aligned reasoning needed for mental health assessment, diagnosis, and intervention planning, making their deployment risky due to incomplete or inconsistent reasoning.

Method: 1) MentraBench: comprehensive benchmark covering 5 reasoning aspects, 6 tasks, and 13 datasets with 5 evaluation dimensions. 2) Mindora: post-trained model using hybrid SFT-RL framework with inconsistency-detection reward. 3) Novel reasoning trajectory generation strategy with difficult sample filtering and structured rewriting for high-quality training data.

Result: Mindora achieves highest average performance on MentraBench among 20 evaluated LLMs and demonstrates remarkable reasoning reliability for complex mental-health scenarios.

Conclusion: MentraSuite provides a unified framework for advancing reliable mental-health reasoning in LLMs, addressing critical gaps in clinical reasoning quality through comprehensive benchmarking and specialized model optimization.

Abstract: Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy, that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performances in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.

[50] Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Wenlin Chen, Yilun Du, Minlan Yu, Ying Zhang

Main category: cs.CL

TL;DR: Confucius Code Agent (CCA) is a scalable software engineering agent that outperforms existing solutions on real-world coding tasks while maintaining extensibility and interpretability.

DetailsMotivation: Existing coding agents either offer transparency but struggle with real-world workloads (research-grade), or achieve strong performance but lack extensibility and interpretability (proprietary systems). There's a need for agents that can handle massive repositories, sustain long-horizon sessions, and coordinate complex toolchains while remaining extensible and interpretable.

Method: Built on Confucius SDK with three perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). Features include: unified orchestrator with hierarchical working memory for long-context operation, persistent note-taking for cross-session continual learning, modular extension system for reliable tool use, and a meta-agent that automates synthesis, evaluation, and refinement of agent configurations through build-test-improve loop.

Result: On SWE-Bench-Pro, CCA achieves Resolve@1 of 54.3%, surpassing both research-grade and proprietary coding agents under comparable model conditions. Demonstrates strong performance on real-world software engineering tasks.

Conclusion: Confucius SDK and CCA form a general, extensible, production-grade foundation for building robust coding agents, bridging the gap between research prototypes and practical large-scale deployment.

Abstract: Real-world software engineering tasks require coding agents that can operate over massive repositories, sustain long-horizon sessions, and reliably coordinate complex toolchains at test time. Existing research-grade agents offer transparency but struggle when scaled to real-world workloads, while proprietary systems achieve strong practical performance but provide limited extensibility, interpretability, and controllability. We introduce the Confucius Code Agent (CCA), a scalable software engineering agent that can operate at enterprise-level codebases. CCA is built on top of the Confucius SDK, an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK integrates a unified orchestrator with hierarchical working memory for long-context operation, a persistent note-taking mechanism for cross-session continual learning, and a modular extension system for reliable tool use. In addition, we introduce a meta-agent that automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid adaptation to new tasks, environments, and tool stacks. Instantiated with these mechanisms, CCA demonstrates strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a Resolve@1 of 54.3%, surpassing both research-grade and proprietary coding agents under comparable model conditions. Together, the Confucius SDK and CCA form a general, extensible, and production-grade foundation for building robust coding agents, bridging the gap between research prototypes and practical large-scale deployment.

[51] Sliding Window Attention Adaptation

Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei

Main category: cs.CL

TL;DR: SWAA enables adaptation of full-attention pretrained LLMs to sliding window attention for efficient long-context inference without retraining, using synergistic combinations of five methods.

DetailsMotivation: Self-attention in Transformers has quadratic complexity with input length, making long-context inference expensive. Sliding window attention reduces this to linear complexity, but naively switching from full attention to SWA at inference causes severe performance degradation due to training-inference mismatch.

Method: Proposes Sliding Window Attention Adaptation (SWAA) with five practical methods: (1) applying SWA only during prefilling phase, (2) preserving “sink” tokens, (3) interleaving FA/SWA layers, (4) chain-of-thought reasoning, and (5) fine-tuning. These methods are combined in synergistic configurations.

Result: SWA adaptation is feasible but non-trivial - no single method suffices alone, but specific synergistic combinations effectively recover original long-context performance. Different SWAA configurations offer performance-efficiency trade-offs, accelerating LLM long-context inference speed by up to 100%.

Conclusion: FA-pretrained LLMs can be well adapted to SWA without full pretraining through carefully designed combinations of adaptation methods. SWAA provides practical recipes for diverse scenarios, fundamentally accelerating long-context inference while maintaining performance.

Abstract: The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving “sink” tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios, which can greatly and fundamentally accelerate LLM long-context inference speed by up to 100%. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
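
To illustrate two of the recipes, a sliding attention window plus preserved "sink" tokens, here is a small mask-construction sketch in PyTorch; window size, sink count, and the boolean-mask convention are arbitrary choices for the example.

```python
# Sketch: boolean attention mask combining a sliding window with preserved
# "sink" tokens (True = may attend). Sizes are arbitrary for illustration.
import torch

def swa_mask(seq_len: int, window: int, num_sinks: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)       # query positions
    j = torch.arange(seq_len).unsqueeze(0)       # key positions
    causal = j <= i
    in_window = (i - j) < window                 # keep only the last `window` keys
    sinks = j < num_sinks                        # always attend to the first tokens
    return causal & (in_window | sinks)

mask = swa_mask(seq_len=8, window=3, num_sinks=1)
print(mask.int())
```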

[52] Enhancing Geo-localization for Crowdsourced Flood Imagery via LLM-Guided Attention

Fengyi Xu, Jun Ma, Waishan Qiu, Cui Guo, Jack C. P. Cheng

Main category: cs.CL

TL;DR: VPR-AttLLM integrates LLMs into Visual Place Recognition pipelines using attention mechanisms to enhance geo-localization of crowdsourced crisis imagery, improving performance without retraining.

DetailsMotivation: Crowdsourced street-view imagery from social media provides real-time visual evidence of urban crises but lacks reliable geographic metadata. Existing VPR models degrade significantly when applied to such imagery due to visual distortions and domain shifts in cross-source scenarios.

Method: VPR-AttLLM is a model-agnostic framework that integrates LLMs’ semantic reasoning and geo-knowledge into VPR pipelines through attention-guided descriptor enhancement. LLMs identify location-informative regions and suppress visual noise, improving retrieval without requiring model retraining or additional data.

Result: Integrating VPR-AttLLM with three state-of-the-art VPR models (CosPlace, EigenPlaces, SALAD) consistently improves recall performance, with relative gains typically between 1-3% and up to 8% on the most challenging real flood imagery. Evaluations were conducted on extended benchmarks including SF-XL with real social-media flood images, synthetic flooding scenarios, and the new HK-URBAN dataset.

Conclusion: VPR-AttLLM establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems, bridging human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, cross-source robustness, and interpretability highlight potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.

Abstract: Crowdsourced street-view imagery from social media provides real-time visual evidence of urban flooding and other crisis events, yet it often lacks reliable geographic metadata for emergency response. Existing image geo-localization approaches, also known as Visual Place Recognition (VPR) models, exhibit substantial performance degradation when applied to such imagery due to visual distortions and domain shifts in cross-source scenarios. This paper presents VPR-AttLLM, a model-agnostic framework that integrates the semantic reasoning and geo-knowledge of Large Language Models (LLMs) into established VPR pipelines through attention-guided descriptor enhancement. By leveraging LLMs to identify location-informative regions within the city context and suppress visual noise, VPR-AttLLM improves retrieval performance without requiring model retraining or additional data. Comprehensive evaluations are conducted on extended benchmarks including SF-XL enriched with real social-media flood images, synthetic flooding scenarios over established query sets and Mapillary photos, and a new HK-URBAN dataset capturing morphologically distinct cityscapes. Integrating VPR-AttLLM with three state-of-the-art VPR models-CosPlace, EigenPlaces, and SALAD-consistently improves recall performance, yielding relative gains typically between 1-3% and reaching up to 8% on the most challenging real flood imagery. Beyond measurable gains in retrieval accuracy, this study establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems. By embedding principles from urban perception theory into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, strong cross-source robustness, and interpretability highlight its potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.
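
The descriptor-enhancement step can be pictured as re-weighting local features by an attention map over location-informative regions before pooling; the sketch below is only a schematic reading, with a given array standing in for the LLM-derived attention.

```python
# Sketch: re-weight local patch descriptors by an attention map before pooling
# into a global place descriptor. The attention array stands in for the
# LLM-identified location-informative regions; not the VPR-AttLLM implementation.
import numpy as np

def enhanced_descriptor(patch_feats, attention):
    """patch_feats: (num_patches, dim); attention: (num_patches,) in [0, 1]."""
    w = attention / (attention.sum() + 1e-8)
    desc = (patch_feats * w[:, None]).sum(axis=0)   # attention-weighted pooling
    return desc / (np.linalg.norm(desc) + 1e-8)     # L2-normalize for retrieval

feats = np.random.rand(16, 128)
attn = np.random.rand(16)        # high on buildings/signs, low on water/noise
print(enhanced_descriptor(feats, attn).shape)
```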

[53] Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Chendong Sun, Mingmin Chen, Lei Xu

Main category: cs.CL

TL;DR: EARS (Efficient Adaptive Rejection Sampling) improves speculative decoding by dynamically adjusting acceptance thresholds based on model uncertainty, reducing random rejections and boosting throughput by up to 18.12%.

DetailsMotivation: Current speculative decoding suffers from a "random rejection" problem in which plausible candidate tokens are frequently rejected due to a fixed, context-independent random threshold, especially in high-uncertainty scenarios like creative writing and open-domain QA, undermining inference efficiency.

Method: EARS dynamically adjusts acceptance thresholds by incorporating the target model’s predictive uncertainty (1 - max(P_target)). It introduces a tolerance term proportional to this uncertainty, relaxing acceptance criteria when the model is uncertain while maintaining strict standards when confident.

Result: EARS significantly enhances speculative decoding efficiency, achieving up to 18.12% increase in throughput with only 0.84% accuracy drop on GSM8K benchmark. It works well on creative writing and open-domain QA tasks.

Conclusion: EARS effectively addresses the random rejection problem in speculative decoding by adapting thresholds to model uncertainty, requires no architecture changes, and can be seamlessly integrated into existing frameworks for improved inference efficiency.

Abstract: Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component – the rejection sampling mechanism – relies on a fixed, context-independent random threshold. This leads to a significant “random rejection” problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model’s own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
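
A sketch of the acceptance rule as described: standard speculative decoding accepts a drafted token when a uniform draw falls below p_target/p_draft, and EARS adds a tolerance proportional to the target model's uncertainty 1 - max(P_target). The exact form of the tolerance used below is an assumption, not the paper's formula.

```python
# Sketch of uncertainty-tolerant acceptance for speculative decoding.
# The tolerance form (beta * uncertainty) is an assumption for illustration.
import numpy as np

def accept_token(p_target, p_draft, token, rng, beta=0.5):
    ratio = min(1.0, p_target[token] / max(p_draft[token], 1e-12))
    uncertainty = 1.0 - p_target.max()            # high when the target model is unsure
    threshold = min(1.0, ratio + beta * uncertainty)
    return rng.random() < threshold

rng = np.random.default_rng(0)
p_t = np.array([0.30, 0.25, 0.25, 0.20])          # flat target distribution -> lenient
p_d = np.array([0.10, 0.60, 0.20, 0.10])
print(accept_token(p_t, p_d, token=1, rng=rng))
```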

[54] Non-Resolution Reasoning (NRR): A Computational Framework for Contextual Identity and Ambiguity Preservation

Kei Saito

Main category: cs.CL

TL;DR: NRR is a computational framework that treats ambiguity retention as valid reasoning, challenging AI’s tendency to prematurely collapse multiple interpretations into single outputs.

DetailsMotivation: Current AI systems have a fundamental limitation: they resolve ambiguity prematurely through "premature semantic collapse" - collapsing multiple valid interpretations into single outputs due to classical identity assumptions in standard neural architectures.

Method: NRR introduces three core principles: (1) Non-Identity (A ≠ A) - same symbol refers to different entities across contexts; (2) Approximate Identity (A ≈ A) - entities share partial structural overlap; (3) Non-Resolution - conflicting interpretations can coexist. Formalized through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking.

Result: NRR-lite model achieves 90.9% out-of-distribution accuracy on synthetic context-shift task vs 9.1% for standard architectures, demonstrating ambiguity preservation enables structural generalization. Case studies show advantages in paradox handling, creative generation, and context-dependent reasoning.

Conclusion: NRR challenges the assumption that meaning must collapse to be useful, offering foundation for AI systems capable of sophisticated ambiguity handling and creative reasoning. The key question becomes not whether AI should resolve ambiguity, but when, how, and under whose control.

Abstract: Current artificial intelligence systems, despite remarkable capabilities in text generation and pattern recognition, exhibit a fundamental architectural limitation: they resolve ambiguity prematurely. This premature semantic collapse – the tendency to collapse multiple valid interpretations into a single output – stems from classical identity assumptions embedded in standard neural architectures. We propose Non-Resolution Reasoning (NRR), a computational framework that treats ambiguity retention as a valid reasoning mode rather than a defect to be eliminated. NRR introduces three core principles: (1) Non-Identity (A $\ne$ A) – the same symbol refers to different entities across contexts; (2) Approximate Identity (A $\approx$ A) – entities share partial structural overlap without being identical; and (3) Non-Resolution – conflicting interpretations can coexist without forced convergence. We formalize these principles through three architectural components: Multi-Vector Embeddings for context-dependent representation, Non-Collapsing Attention for parallel interpretation retention, and Contextual Identity Tracking (CIT) for maintaining A $\ne$ A across inference. We demonstrate NRR’s advantages through case studies in paradox handling, creative generation, and context-dependent reasoning. Crucially, we provide a minimal empirical validation on a synthetic context-shift task where an NRR-lite model achieves 90.9% out-of-distribution accuracy compared to 9.1% for standard architectures, demonstrating that ambiguity preservation enables structural generalization. NRR challenges the assumption that meaning must collapse to be useful, offering a foundation for AI systems capable of sophisticated ambiguity handling and creative reasoning. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.

[55] Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Zefang Liu, Nam H. Nguyen, Yinzhu Quan, Shi-Xiong Zhang

Main category: cs.CL

TL;DR: First empirical study comparing temporal tokenization strategies for event sequences in LLMs, finding no universal best approach - performance depends on aligning tokenizer with data’s statistical properties.

DetailsMotivation: Representing continuous time in LLMs for temporal event sequences is critical but under-explored, with unclear optimal approaches despite various proposed strategies like byte-level representations or calendar tokens.

Method: Empirical study comparing five temporal encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization, evaluated by fine-tuning LLMs on real-world datasets with diverse statistical distributions.

Result: No single strategy is universally superior; prediction performance depends heavily on aligning tokenizer with data’s statistical properties. Log-based strategies excel on skewed distributions, while human-centric formats prove robust for mixed modalities.

Conclusion: Temporal tokenization strategy should be chosen based on the statistical properties of the event data rather than seeking a one-size-fits-all solution, with different approaches optimal for different data distributions.

Abstract: Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data’s statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
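
Two of the compared encodings, uniform binning and a log-scaled variant suited to skewed gaps, amount to choosing bin edges and mapping each inter-event time to a discrete token; a small sketch follows, with bin counts and ranges chosen arbitrarily.

```python
# Sketch: mapping continuous inter-event times to discrete time tokens with
# uniform vs. log-spaced bins (bin counts and ranges are arbitrary here).
import numpy as np

def bin_tokens(delta_t, edges, prefix):
    ids = np.digitize(delta_t, edges)
    return [f"<{prefix}_{i}>" for i in ids]

delta_t = np.array([0.5, 3.0, 40.0, 700.0])               # hours between events
uniform_edges = np.linspace(0, 1000, num=32)
log_edges = np.logspace(-1, 3, num=32)                     # denser bins for small gaps

print(bin_tokens(delta_t, uniform_edges, "t_uni"))
print(bin_tokens(delta_t, log_edges, "t_log"))
```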

[56] A stylometric analysis of speaker attribution from speech transcripts

Cristina Aggazzotti, Elizabeth Allyn Smith

Main category: cs.CL

TL;DR: The paper introduces StyloSpeaker, a stylometric method for speaker attribution that analyzes linguistic content from transcribed speech, achieving better performance on normalized transcripts and comparing favorably to neural approaches.

DetailsMotivation: When speakers disguise their voices or use text-to-speech software, traditional acoustic speaker recognition fails, leaving only linguistic content for analysis. There's a need to adapt authorship attribution methods from written text to transcribed speech for speaker identification.

Method: StyloSpeaker applies stylometric analysis to speech transcripts, incorporating character, word, token, sentence, and style features from authorship attribution literature. The method is evaluated on two transcript formats (prescriptive with punctuation vs. normalized without) under varying topic control conditions.

Result: Higher attribution performance was generally achieved on normalized transcripts, except under the strongest topic control condition where overall performance was highest. The explainable stylometric model was compared to black-box neural approaches, and the most effective stylistic features for distinguishing speakers were investigated.

Conclusion: Content-based speaker attribution using stylometric methods is viable when acoustic features are unreliable. Normalized transcripts generally perform better, and topic control significantly impacts performance. The approach offers an explainable alternative to neural methods for forensic applications.

Abstract: Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts’ conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
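
The kinds of character-, word-, and sentence-level cues used in stylometric verification can be illustrated in a few lines of plain Python; the features below are generic examples, not StyloSpeaker's actual feature set.

```python
# Sketch: a few generic stylometric features from a (normalized) transcript.
# Illustrative only; not the StyloSpeaker feature set.
import re
from collections import Counter

def style_features(transcript: str) -> dict:
    words = transcript.split()
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    counts = Counter(w.lower() for w in words)
    return {
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(counts) / max(len(words), 1),
        "filler_rate": sum(counts[f] for f in ("um", "uh", "like")) / max(len(words), 1),
    }

print(style_features("um so I think like we should just go I mean why not"))
```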

cs.CV

[57] Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline

Weikang Bai, Yongkun Du, Yuchen Su, Yazhen Xie, Zhineng Chen

Main category: cs.CV

TL;DR: The paper introduces CMER-Bench benchmark for mathematical expression recognition difficulty levels, creates large-scale datasets MER-17M and CMER-3M for complex expressions, proposes Structured Mathematical Language representation, and develops CMERNet model that outperforms existing methods on complex expressions.

DetailsMotivation: Current Mathematical Expression Recognition (MER) methods perform well on simple expressions but struggle with complex mathematical expressions containing many tokens and multiple lines, mainly due to training datasets being dominated by simple samples.

Method: 1) Created CMER-Bench benchmark categorizing expressions into easy, moderate, and complex difficulty levels; 2) Built large-scale datasets MER-17M and CMER-3M focusing on complex expressions; 3) Proposed Structured Mathematical Language representation to model hierarchical and spatial structure beyond LaTeX; 4) Developed CMERNet model with encoder-decoder architecture trained on CMER-3M.

Result: CMERNet with only 125 million parameters significantly outperforms existing MER models and multimodal large language models (MLLMs) on the CMER-Bench benchmark, especially on complex mathematical expressions.

Conclusion: The paper addresses the gap in complex mathematical expression recognition by providing comprehensive benchmarks, large-scale datasets, a novel structured representation, and a specialized model that demonstrates superior performance on complex expressions compared to existing approaches.

Abstract: Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.

[58] Human-AI Collaboration Mechanism Study on AIGC Assisted Image Production for Special Coverage

Yajie Yang, Yuqing Zhao, Xiaochao Xi, Yinan Zhu

Main category: cs.CV

TL;DR: This paper addresses ethical challenges in AI-generated journalism images by developing a human-in-the-loop pipeline for controllable image production, with experiments testing cross-platform adaptability and proposing evaluation metrics for journalistic AIGC.

DetailsMotivation: AIGC in journalism faces controversies around misinformation, authenticity, and interpretability. Most AI tools are opaque "black boxes" that create ethical, sociotechnical, and trust dilemmas, failing to meet journalism's dual demands for content accuracy and semantic alignment.

Method: Two experiments with Chinese media projects: (1) Testing cross-platform adaptability via standardized prompts across three scenes to identify semantic alignment issues; (2) Building a human-in-the-loop modular pipeline combining high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), style regulating (Style-LoRA, Prompt-to-Prompt), with CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials.

Result: Experiment 1 revealed disparities in semantic alignment, cultural specificity, and visual realism driven by training-corpus bias and platform-level filtering. Experiment 2 successfully created a traceable deployment system that preserves semantic representation and ensures editorial fidelity.

Conclusion: The paper proposes a human-AI collaboration mechanism for AIGC-assisted image production in special coverage and recommends evaluating Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA) as key metrics for journalistic AIGC.

Abstract: Artificial Intelligence Generated Content (AIGC) assisting image production triggers controversy in journalism while attracting attention from media agencies. Key issues involve misinformation, authenticity, semantic fidelity, and interpretability. Most AIGC tools are opaque “black boxes,” hindering the dual demands of content accuracy and semantic alignment and creating ethical, sociotechnical, and trust dilemmas. This paper explores pathways for controllable image production in journalism’s special coverage and conducts two experiments with projects from China’s media agency: (1) Experiment 1 tests cross-platform adaptability via standardized prompts across three scenes, revealing disparities in semantic alignment, cultural specificity, and visual realism driven by training-corpus bias and platform-level filtering. (2) Experiment 2 builds a human-in-the-loop modular pipeline combining high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), and style regulating (Style-LoRA, Prompt-to-Prompt), ensuring editorial fidelity through CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials. Traceable deployment preserves semantic representation. Consequently, we propose a human-AI collaboration mechanism for AIGC assisted image production in special coverage and recommend evaluating Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA).
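
The CLIP-based semantic scoring step can be approximated with the public Hugging Face CLIP checkpoint, comparing a generated image against candidate editorial prompts; this is a generic sketch, not the project's production filter.

```python
# Sketch: scoring a generated image against editorial prompts with CLIP
# (generic image-text similarity; not the project's production pipeline).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_scene.png")        # placeholder path
prompts = ["flooded street with rescue workers", "sunny city street"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image    # shape (1, num_prompts)
probs = logits.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```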

[59] DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models

Md. Najib Hasan, Imran Ahmad, Sourav Basak Shuvo, Md. Mahadi Hasan Ankon, Sunanda Das, Nazmul Siddique, Hui Wang

Main category: cs.CV

TL;DR: A framework combining deep learning image classification with LLMs for clinical reasoning on endoscopic images, showing improved explanations but revealing LLM instability and unreliability for high-stakes medical decisions.

DetailsMotivation: There's a gap between medical image classifiers that detect diseases but don't explain decisions, and LLMs that generate clinical text but struggle with visual reasoning and produce unstable explanations. Clinicians need explanations that bridge what models see with clinical reasoning.

Method: Introduces a framework linking image classification with structured clinical reasoning. Uses MobileCoAtNet (a hybrid model for endoscopic images) for high-accuracy classification across 8 stomach-related classes, then uses these outputs to drive reasoning by 32 different LLMs. Builds two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care to evaluate LLM reasoning.

Result: MobileCoAtNet achieves high accuracy for classification. Strong classification improves LLM explanation quality, but no LLM reaches human-level stability - even the best LLMs change their reasoning with varying prompts. Current LLMs remain unreliable for high-stakes medical decisions despite producing useful clinical narratives.

Conclusion: Combining DL with LLMs can produce useful clinical narratives, but current LLMs are unreliable for high-stakes medical decisions. The framework provides clearer understanding of LLM limits and a path for building safer reasoning systems. Source code and datasets are publicly available.

Abstract: Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.

[60] FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications

Mitsuki Watanabe, Sosuke Amano, Kiyoharu Aizawa, Yoko Yamakata

Main category: cs.CV

TL;DR: FoodLogAthl-218 is a real-world food image dataset from dietary app logs with 6,925 images across 218 categories, featuring natural meal photos and rich metadata for context-aware classification tasks.

DetailsMotivation: Existing food image datasets rely on web-crawled images that don't match real-world meal photos, creating a gap between training data and actual user-submitted images in dietary management applications.

Method: Collected 6,925 real-world meal images from FoodLog Athl app users, annotated with 14,349 bounding boxes across 218 food categories, with rich metadata including meal time, user IDs, and meal context.

Result: Created FoodLogAthl-218 dataset with greater intra-class diversity, natural meal frequency distribution, and casual unfiltered images, plus introduced three evaluation tasks including context-aware classification.

Conclusion: The dataset bridges the gap between web-crawled and real-world food images, enabling more realistic food classification models for dietary management applications.

Abstract: Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users’ real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets-where a predefined class set guides web-based image collection-our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users’ logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.
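
Since the dataset is hosted on the Hugging Face Hub (URL above), it should load with the `datasets` library roughly as below; the split and field names are assumptions, so check the dataset card for the actual schema.

```python
# Sketch: loading FoodLogAthl-218 from the Hugging Face Hub.
# Split and field names are assumptions; see the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("FoodLog/FoodLogAthl-218")
print(ds)                                  # available splits and their sizes
first_split = list(ds.keys())[0]
example = next(iter(ds[first_split]))
print(example.keys())                      # e.g. image, bounding boxes, labels, meal metadata
```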

[61] Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

Siyuan Dai, Lunxiao Li, Kun Zhao, Eardi Lila, Paul K. Crane, Heng Huang, Dongkuan Xu, Haoteng Tang, Liang Zhan

Main category: cs.CV

TL;DR: Current multimodal LLMs struggle with medical decision making tasks, performing worse with visual inputs than text-only reasoning, revealing a lack of grounded visual understanding in healthcare applications.

DetailsMotivation: Despite impressive zero-shot capabilities of advanced multimodal LLMs in general vision-language tasks, they perform poorly on basic medical decision making tasks in the biomedical domain, which is critical for healthcare applications.

Method: The study investigates MLLM limitations using two challenging biomedical datasets: three-stage Alzheimer’s disease classification and MIMIC-CXR chest radiograph classification. It compares text-only, vision-only, and multimodal approaches, and explores three improvement strategies: in-context learning with reason-annotated exemplars, vision captioning followed by text-only inference, and few-shot fine-tuning of the vision tower with classification supervision.

Result: Text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. This reveals that current MLLMs lack grounded visual understanding for medical decision making tasks.

Conclusion: Current MLLMs have significant limitations in medical visual understanding, but the explored strategies (in-context learning, vision captioning, and vision tower fine-tuning) point to promising directions for improving multimodal decision making in healthcare applications.

Abstract: With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer’s disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
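
A minimal sketch of strategy (2), vision captioning followed by text-only inference, using generic Hugging Face pipelines; the model names and prompt format are placeholders, not the ones used in the paper.

```python
from transformers import pipeline

# Placeholder models: the paper's captioner and reasoning LLM are not specified here.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reasoner = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

def caption_then_decide(image_path: str, question: str) -> str:
    # Step 1: turn the medical image into a textual description.
    caption = captioner(image_path)[0]["generated_text"]
    # Step 2: reason over text only, which the study finds more reliable.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return reasoner(prompt, max_new_tokens=64)[0]["generated_text"]
```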

[62] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang

Main category: cs.CV

TL;DR: TimeLens establishes a systematic baseline for video temporal grounding in MLLMs by addressing data quality issues and algorithmic design, achieving state-of-the-art performance.

DetailsMotivation: Multimodal LLMs excel at video understanding but lack optimized recipes for video temporal grounding (VTG), with existing benchmarks having critical quality issues and noisy training data.

Method: Two-pronged approach: 1) Data quality - create TimeLens-Bench (re-annotated benchmarks) and TimeLens-100K (large-scale high-quality training data via automated re-annotation); 2) Algorithmic design - interleaved textual encoding for time, thinking-free RLVR training paradigm with verifiable rewards.

Result: TimeLens models achieve state-of-the-art VTG performance among open-source models, surpassing proprietary models like GPT-5 and Gemini-2.5-Flash, with dramatic model re-rankings showing prior benchmarks were unreliable.

Conclusion: The paper provides essential baseline for VTG by systematically addressing data quality and algorithmic design, releasing all resources to facilitate future research in video temporal grounding.

Abstract: This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models, even surpassing proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
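
The RLVR paradigm relies on a reward that can be checked programmatically. Below is a minimal sketch of one such verifiable reward for temporal grounding, based on the IoU between predicted and ground-truth time spans; the paper's exact reward shaping may differ.

```python
def span_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(pred_span, gt_span, threshold: float = 0.5) -> float:
    # Assumed shaping: reward equals the IoU once it clears a minimum threshold.
    iou = span_iou(pred_span, gt_span)
    return iou if iou >= threshold else 0.0

print(verifiable_reward((12.0, 20.0), (14.0, 22.0)))  # IoU = 6/10 -> 0.6
```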

[63] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma

Main category: cs.CV

TL;DR: STAR introduces a stacked autoregressive scheme for unified multimodal learning, decomposing tasks into understanding, generation, and editing stages to avoid optimization conflicts while achieving SOTA performance.

DetailsMotivation: Current multimodal LLMs struggle with unified target for both understanding and generation due to optimization conflicts and performance trade-offs. There's a need to enhance generative performance while preserving existing comprehension capabilities.

Method: STAR uses a stacked autoregressive scheme with three stages: understanding, generation, and editing. It freezes fundamental AR model parameters and progressively stacks isomorphic AR modules to avoid cross-task interference. Also introduces high-capacity VQ for better image representation granularity and implicit reasoning for complex condition generation.

Result: Achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating efficacy for unified multimodal learning.

Conclusion: STAR effectively addresses optimization conflicts in multimodal learning through task-progressive decomposition and stacked architecture, enabling unified understanding and generation capabilities without performance trade-offs.

Abstract: Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model’s capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
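
A minimal PyTorch sketch of the stacking idea: the base autoregressive model is frozen so understanding is preserved, and a trainable isomorphic module is stacked on top for generation or editing; the module internals here are placeholders.

```python
import torch
import torch.nn as nn

class StackedAR(nn.Module):
    """Sketch: frozen base AR backbone plus a trainable stacked AR module on top."""
    def __init__(self, base: nn.Module, stacked: nn.Module):
        super().__init__()
        self.base, self.stacked = base, stacked
        for p in self.base.parameters():
            p.requires_grad = False          # understanding stage stays untouched

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            hidden = self.base(tokens)       # frozen features from the base model
        return self.stacked(hidden)          # new capacity for generation/editing
```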

[64] Time-aware UNet and super-resolution deep residual networks for spatial downscaling

Mika Sipilä, Sabrina Maggio, Sandra De Iaco, Klaus Nordhausen, Monica Palma, Sara Taskinen

Main category: cs.CV

TL;DR: Time-aware deep learning models (SRDRN and UNet with temporal encoding) significantly improve spatial downscaling of satellite ozone data over Italy compared to baseline methods.

DetailsMotivation: Satellite atmospheric pollutant data is often too coarse for local-scale environmental analysis and decision-making, creating a need for effective spatial downscaling methods.

Method: Extended two deep learning architectures (SRDRN and UNet) with lightweight temporal modules using sinusoidal or RBF encoding to fuse temporal features with spatial representations for ozone downscaling.

Result: Temporal modules significantly improved downscaling performance and convergence speed with only slight increases in computational complexity.

Conclusion: Incorporating temporal information through lightweight encoding modules enhances deep learning-based spatial downscaling of atmospheric pollutants like tropospheric ozone.

Abstract: Satellite data of atmospheric pollutants are often available only at coarse spatial resolution, limiting their applicability in local-scale environmental analysis and decision-making. Spatial downscaling methods aim to transform the coarse satellite data into high-resolution fields. In this work, two widely used deep learning architectures, the super-resolution deep residual network (SRDRN) and the encoder-decoder-based UNet, are considered for spatial downscaling of tropospheric ozone. Both methods are extended with a lightweight temporal module, which encodes observation time using either sinusoidal or radial basis function (RBF) encoding, and fuses the temporal features with the spatial representations in the networks. The proposed time-aware extensions are evaluated against their baseline counterparts in a case study on ozone downscaling over Italy. The results suggest that, while only slightly increasing computational complexity, the temporal modules significantly improve downscaling performance and convergence speed.
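
A minimal sketch of the lightweight temporal module with sinusoidal encoding: observation time is embedded and fused (here by broadcast addition) with the spatial feature maps; the exact encoding dimensionality and fusion used in the paper are assumptions.

```python
import math
import torch
import torch.nn as nn

class SinusoidalTimeModule(nn.Module):
    """Sketch: encode day-of-year with sine/cosine and add it to feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(2, channels)

    def forward(self, feat: torch.Tensor, day_of_year: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); day_of_year: (B,) values in [1, 365]
        angle = 2.0 * math.pi * day_of_year.float() / 365.0
        enc = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)  # (B, 2)
        emb = self.proj(enc)[:, :, None, None]                           # (B, C, 1, 1)
        return feat + emb                                                # broadcast fusion
```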

[65] ACE-SLAM: Scene Coordinate Regression for Neural Implicit Real-Time SLAM

Ignacio Alzugaray, Marwan Taher, Andrew J. Davison

Main category: cs.CV

TL;DR: A real-time neural RGB-D SLAM system using Scene Coordinate Regression (SCR) as implicit map representation, achieving competitive performance with efficient, privacy-preserving mapping.

DetailsMotivation: To develop a neural implicit SLAM system that achieves strict real-time performance while providing efficient, low-memory 3D map representations with fast relocalization and inherent privacy preservation.

Method: Uses Scene Coordinate Regression (SCR) as core implicit map representation, where a lightweight network maps 2D image features directly to 3D global coordinates. Introduces novel SCR architecture specifically designed for live SLAM integration, supporting both sparse and dense features.

Result: First system to achieve strict real-time performance in neural implicit RGB-D SLAM. Demonstrates competitive performance on synthetic and real-world benchmarks, operates reliably in dynamic environments without special adaptation.

Conclusion: SCR-based representation is particularly suitable for neural implicit SLAM, providing efficient mapping, fast relocalization, privacy preservation, and real-time operation with simple yet flexible framework.

Abstract: We present a novel neural RGB-D Simultaneous Localization And Mapping (SLAM) system that learns an implicit map of the scene in real time. For the first time, we explore the use of Scene Coordinate Regression (SCR) as the core implicit map representation in a neural SLAM pipeline, a paradigm that trains a lightweight network to directly map 2D image features to 3D global coordinates. SCR networks provide efficient, low-memory 3D map representations, enable extremely fast relocalization, and inherently preserve privacy, making them particularly suitable for neural implicit SLAM. Our system is the first one to achieve strict real-time in neural implicit RGB-D SLAM by relying on a SCR-based representation. We introduce a novel SCR architecture specifically tailored for this purpose and detail the critical design choices required to integrate SCR into a live SLAM pipeline. The resulting framework is simple yet flexible, seamlessly supporting both sparse and dense features, and operates reliably in dynamic environments without special adaptation. We evaluate our approach on established synthetic and real-world benchmarks, demonstrating competitive performance against the state of the art. Project Page: https://github.com/ialzugaray/ace-slam
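
A minimal sketch of a scene coordinate regression head: a lightweight MLP that maps a per-pixel (or per-keypoint) 2D feature descriptor directly to a 3D coordinate in the global map frame; layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SceneCoordinateHead(nn.Module):
    """Sketch: feature descriptor -> (x, y, z) scene coordinate."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (N, feat_dim) sparse or dense descriptors from the current frame
        return self.mlp(features)            # (N, 3) global 3D coordinates
```

In SCR-based relocalization generally, camera pose is then recovered from the predicted 2D-3D correspondences with PnP and RANSAC.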

[66] Nexels: Neurally-Textured Surfels for Real-Time Novel View Synthesis with Sparse Geometries

Victor Rong, Jan Held, Victor Chu, Daniel Rebain, Marc Van Droogenbroeck, Kiriakos N. Kutulakos, Andrea Tagliasacchi, David B. Lindell

Main category: cs.CV

TL;DR: Surfel-based representation decouples geometry and appearance, using surfels for geometry and neural field + per-primitive colors for appearance, achieving compact representation with fewer primitives and less memory than Gaussian splatting.

DetailsMotivation: Gaussian splatting requires millions of primitives even for simple geometry scenes, leading to inefficient memory usage. The paper aims to create a more compact representation by separating geometry and appearance modeling.

Method: Uses surfels for geometry representation and combines a global neural field with per-primitive colors for appearance. The neural field textures a fixed number of primitives per pixel to keep computation low.

Result: Achieves 9.7× fewer primitives and 5.5× less memory on outdoor scenes, and 31× fewer primitives and 3.7× less memory on indoor scenes compared to 3D Gaussian splatting, while matching perceptual quality. Renders twice as fast as existing textured primitives with better visual quality.

Conclusion: The proposed representation successfully decouples geometry and appearance, achieving compactness and efficiency while maintaining high visual quality, making it a practical alternative to point-based rendering methods.

Abstract: Though Gaussian splatting has achieved impressive results in novel view synthesis, it requires millions of primitives to model highly textured scenes, even when the geometry of the scene is simple. We propose a representation that goes beyond point-based rendering and decouples geometry and appearance in order to achieve a compact representation. We use surfels for geometry and a combination of a global neural field and per-primitive colours for appearance. The neural field textures a fixed number of primitives for each pixel, ensuring that the added compute is low. Our representation matches the perceptual quality of 3D Gaussian splatting while using $9.7\times$ fewer primitives and $5.5\times$ less memory on outdoor scenes and using $31\times$ fewer primitives and $3.7\times$ less memory on indoor scenes. Our representation also renders twice as fast as existing textured primitives while improving upon their visual quality.
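
A minimal sketch of the decoupled appearance model: each surfel carries a base colour, and a small global neural field adds texture detail at the few primitives a pixel actually hits; the real architecture and its inputs (e.g., view direction) are not reproduced here.

```python
import torch
import torch.nn as nn

class SurfelAppearance(nn.Module):
    """Sketch: per-surfel colour plus a global neural field queried at hit points."""
    def __init__(self, num_surfels: int, hidden: int = 64):
        super().__init__()
        self.base_color = nn.Parameter(torch.rand(num_surfels, 3))
        self.field = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, surfel_ids: torch.Tensor, hit_points: torch.Tensor) -> torch.Tensor:
        # surfel_ids: (K,) primitives hit along a pixel ray; hit_points: (K, 3)
        return self.base_color[surfel_ids] + self.field(hit_points)
```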

[67] Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-guided Subtyping and Lesion-Wise Model Ensemble

Daniel Capellán-Martín, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

Main category: cs.CV

TL;DR: The paper presents a flexible pipeline for brain tumor segmentation on MRI that combines state-of-the-art models with tumor-specific processing and radiomic feature analysis, achieving top performance across multiple BraTS 2025 challenges.

DetailsMotivation: Brain tumor segmentation remains challenging due to tumor type diversity. The BraTS 2025 Lighthouse Challenge benchmarks methods on diverse pediatric and adult tumor datasets, requiring robust and generalizable approaches.

Method: A flexible, modular pipeline that selects and combines state-of-the-art models with tumor-specific pre- and post-processing. Uses radiomic features for tumor subtype detection to balance training, and custom lesion-level metrics to optimize ensemble weighting and post-processing.

Result: The pipeline achieved performance comparable to top-ranked algorithms across multiple BraTS 2025 challenges (PED, MEN, MEN-RT, MET testing sets), demonstrating robust segmentation across diverse tumor types.

Conclusion: Custom lesion-aware processing and model selection yield robust segmentations without locking to specific network architectures. The method has potential for clinical quantitative tumor measurement to support diagnosis and prognosis.

Abstract: Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring a more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations yet without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.
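
A minimal sketch of lesion-wise ensemble weighting: each member model's probability map is weighted by its lesion-level validation score before averaging; the actual metric and weighting rule in the pipeline are not specified here.

```python
import numpy as np

def lesion_weighted_ensemble(prob_maps, lesion_scores):
    """Sketch: weight each model's probability map by its lesion-level score.

    prob_maps: list of arrays with identical shape (e.g., (C, D, H, W))
    lesion_scores: one validation score per model (assumed non-negative)
    """
    w = np.asarray(lesion_scores, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, prob_maps))

fused = lesion_weighted_ensemble(
    [np.random.rand(2, 8, 8, 8), np.random.rand(2, 8, 8, 8)], [0.82, 0.74]
)
```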

[68] VajraV1 – The most accurate Real Time Object Detector of the YOLO family

Naman Balbir Singh Makkar

Main category: cs.CV

TL;DR: VajraV1 is a new real-time object detection model that combines design elements from recent YOLO versions to achieve state-of-the-art accuracy while maintaining competitive inference speeds across multiple model sizes.

DetailsMotivation: Recent years have seen rapid advancement in real-time object detection with multiple YOLO versions released (v10-v13). The motivation is to create an improved architecture that combines effective design choices from these models to push the state-of-the-art in accuracy while maintaining real-time performance.

Method: VajraV1 introduces architectural enhancements over existing YOLO-based detectors by combining effective design choices from prior YOLO models (v10-v13). The method focuses on optimizing the architecture for both accuracy and inference speed.

Result: On COCO validation set: VajraV1-Nano achieves 44.3% mAP (outperforming YOLOv12-N by 3.7% and YOLOv13-N by 2.7%); Small achieves 50.4% mAP (exceeding v12-S and v13-S by 2.4%); Medium achieves 52.7% mAP (outperforming v12-M by 0.2%); Large achieves 53.7% mAP (surpassing v13-L by 0.3%); Xlarge achieves 56.2% mAP (outperforming all existing real-time detectors).

Conclusion: VajraV1 achieves state-of-the-art accuracy among real-time object detectors across all model sizes while maintaining competitive inference speeds, demonstrating that architectural enhancements combining effective design choices from recent YOLO models can significantly improve performance.

Abstract: Recent years have seen significant advances in real-time object detection, with the release of YOLOv10, YOLO11, YOLOv12, and YOLOv13 between 2024 and 2025. This technical report presents the VajraV1 model architecture, which introduces architectural enhancements over existing YOLO-based detectors. VajraV1 combines effective design choices from prior YOLO models to achieve state-of-the-art accuracy among real-time object detectors while maintaining competitive inference speed. On the COCO validation set, VajraV1-Nano achieves 44.3% mAP, outperforming YOLOv12-N by 3.7% and YOLOv13-N by 2.7% at latency competitive with YOLOv12-N and YOLOv11-N. VajraV1-Small achieves 50.4% mAP, exceeding YOLOv12-S and YOLOv13-S by 2.4%. VajraV1-Medium achieves 52.7% mAP, outperforming YOLOv12-M by 0.2%. VajraV1-Large achieves 53.7% mAP, surpassing YOLOv13-L by 0.3%. VajraV1-Xlarge achieves 56.2% mAP, outperforming all existing real-time object detectors.

[69] MoLingo: Motion-Language Alignment for Text-to-Motion Generation

Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Eric Lenssen, Gerard Pons-Moll

Main category: cs.CV

TL;DR: MoLingo is a text-to-motion model that generates realistic human motion by denoising in a continuous latent space, achieving state-of-the-art results through semantic-aligned motion encoding and cross-attention text conditioning.

DetailsMotivation: The paper aims to improve text-to-motion generation by addressing two key challenges: how to create a semantically aligned latent space for more effective diffusion, and how to best inject text conditioning for better motion-description alignment.

Method: The method uses a semantic-aligned motion encoder trained with frame-level text labels to create a diffusion-friendly latent space, employs auto-regressive generation, and implements multi-token cross-attention for text conditioning instead of single-token conditioning.

Result: MoLingo achieves state-of-the-art performance in human motion generation on standard metrics and in user studies, demonstrating improved motion realism and better text-motion alignment compared to previous approaches.

Conclusion: The combination of semantically aligned latents, auto-regressive generation, and cross-attention text conditioning enables superior text-to-motion generation, setting a new benchmark in the field with plans to release code and models for further research.

Abstract: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
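
A minimal sketch of the multi-token cross-attention conditioning the paper favours over single-token conditioning: motion latents attend over the full sequence of text-token embeddings; dimensions and the residual fusion are assumptions.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Sketch: motion latents (queries) attend to all text tokens (keys/values)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, motion_latents: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # motion_latents: (B, T, dim); text_tokens: (B, L, dim)
        cond, _ = self.attn(motion_latents, text_tokens, text_tokens)
        return motion_latents + cond          # residual injection of the text condition
```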

[70] Improvise, Adapt, Overcome – Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging

Ujjwal Mishra, Vinita Shukla, Praful Hambarde, Amit Shukla

Main category: cs.CV

TL;DR: Telescopic Adapters: A novel PEFT framework with depth-aware scaling for efficient medical VLSM fine-tuning, using only 613k parameters (244x fewer than full fine-tuning) while achieving superior performance across medical datasets.

DetailsMotivation: Conventional fine-tuning of Vision Language Segmentation Models for medical imaging requires heavy computation, and existing PEFT methods use uniform adapter dimensions across layers, leading to suboptimal parameter allocation and reduced adaptation efficiency.

Method: Introduces Telescopic Adapters with depth-aware scaling that progressively increases adapter capacity from shallow to deep transformer layers. Integrates lightweight bottleneck modules in CLIPSeg’s vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance.

Result: Achieves superior performance across five diverse medical datasets (polyp segmentation, skin lesion detection, breast ultrasound) using only 613k trainable parameters - 244x fewer than end-to-end fine-tuning. Ablation studies show deeper layers require substantially more adaptation capacity.

Conclusion: Establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy through depth-aware parameter allocation.

Abstract: Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg’s vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters (244x fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
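
A minimal sketch of depth-aware ("telescopic") scaling: the adapter bottleneck width grows from shallow to deep transformer layers; the linear schedule and widths below are assumptions, not the paper's exact rule.

```python
import torch.nn as nn

def telescopic_widths(num_layers: int, d_min: int = 8, d_max: int = 64):
    """Assumed linear schedule: shallow layers get narrow adapters, deep layers wide ones."""
    return [round(d_min + (d_max - d_min) * i / (num_layers - 1)) for i in range(num_layers)]

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual adapter update

adapters = nn.ModuleList(BottleneckAdapter(768, w) for w in telescopic_widths(12))
```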

[71] Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Wenda Li, Meng Wu, Sungmin Eum, Heesung Kwon, Qing Qu

Main category: cs.CV

TL;DR: CFHA is a three-stage diffusion framework that transforms synthetic UAV images to reduce domain gap with real data while preserving labels, improving human detection accuracy by up to +14.1 mAP50.

DetailsMotivation: Training object detectors for UAV-based human detection faces challenges due to constantly shifting target distributions and scarcity of labeled real images. Synthetic data offers low-cost annotation but suffers from domain gap issues that hinder real-world application.

Method: CFHA uses a three-stage hierarchical alignment: (1) Global Style Transfer - diffusion model aligns color, illumination, and texture statistics using small real reference set; (2) Local Refinement - super-resolution diffusion model enhances fine-grained details for small human objects; (3) Hallucination Removal - filters out human instances with unrealistic visual attributes.

Result: Extensive experiments on UAV Sim2Real detection benchmarks show significant improvement in detection accuracy compared to non-transformed baselines, achieving up to a +14.1 mAP50 improvement on the Semantic-Drone benchmark. Ablation studies confirm complementary roles of the global and local stages.

Conclusion: CFHA effectively bridges the synthetic-to-real domain gap for UAV-based human detection through hierarchical alignment, enabling better utilization of synthetic data while preserving original annotations. The framework demonstrates the importance of addressing both global style and local content discrepancies.

Abstract: Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data, with a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer – a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement – a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal – a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our methods significantly improve the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to a +14.1 mAP50 improvement on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.

[72] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi

Main category: cs.CV

TL;DR: SAGE is an any-horizon video reasoning system that mimics human behavior by performing multi-turn reasoning on long videos while handling simpler problems in a single turn, achieving up to 6.1% improvement on open-ended video reasoning tasks.

DetailsMotivation: Current video reasoning models process all frames in a single turn like watching entire long videos, requiring significant resources. Humans reason flexibly across different durations (skimming long videos or watching short ones fully), so the paper aims to develop performant any-horizon video reasoning systems.

Method: 1) Proposed SAGE agent system for multi-turn reasoning on long videos; 2) Introduced synthetic data generation pipeline using Gemini-2.5-Flash to train orchestrator SAGE-MM; 3) Developed effective RL post-training recipe for any-horizon reasoning; 4) Curated SAGE-Bench benchmark with average duration >700 seconds for real-world entertainment evaluation.

Result: Notable improvements of up to 6.1% on open-ended video reasoning tasks, and an impressive 8.2% improvement on videos longer than 10 minutes, validating the effectiveness of the system, data, and RL recipe.

Conclusion: The paper successfully demonstrates that performant any-horizon video reasoning systems are possible by mimicking human reasoning behavior, with SAGE showing significant improvements particularly on long videos through multi-turn reasoning and effective RL training.

Abstract: As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.

[73] Route-DETR: Pairwise Query Routing in Transformers for Object Detection

Ye Zhang, Qi Chen, Wenyou Huang, Rui Liu, Zhengjian Kang

Main category: cs.CV

TL;DR: Route-DETR improves DETR by using adaptive pairwise routing in decoder self-attention to distinguish between competing vs complementary queries, reducing redundancy and improving detection performance without adding inference cost.

DetailsMotivation: DETR suffers from inefficient query competition where multiple queries converge to similar positions, leading to redundant computations and suboptimal performance.

Method: Introduces adaptive pairwise routing with dual mechanisms: suppressor routes (modulate attention between competing queries to reduce duplication) and delegator routes (encourage exploration of different regions). Uses learnable low-rank attention biases for asymmetric query interactions. Employs dual-branch training where routing biases are only used during training while preserving standard attention for inference.

Result: Consistent improvements across multiple DETR baselines on COCO and Cityscapes: +1.7% mAP gain over DINO on ResNet-50, and achieves 57.6% mAP on Swin-L, surpassing prior state-of-the-art models.

Conclusion: Route-DETR effectively addresses query competition in DETR through adaptive routing mechanisms, improving detection performance without adding computational overhead during inference.

Abstract: Detection Transformer (DETR) offers an end-to-end solution for object detection by eliminating hand-crafted components like non-maximum suppression. However, DETR suffers from inefficient query competition where multiple queries converge to similar positions, leading to redundant computations. We present Route-DETR, which addresses these issues through adaptive pairwise routing in decoder self-attention layers. Our key insight is distinguishing between competing queries (targeting the same object) versus complementary queries (targeting different objects) using inter-query similarity, confidence scores, and geometry. We introduce dual routing mechanisms: suppressor routes that modulate attention between competing queries to reduce duplication, and delegator routes that encourage exploration of different regions. These are implemented via learnable low-rank attention biases enabling asymmetric query interactions. A dual-branch training strategy incorporates routing biases only during training while preserving standard attention for inference, ensuring no additional computational cost. Experiments on COCO and Cityscapes demonstrate consistent improvements across multiple DETR baselines, achieving +1.7% mAP gain over DINO on ResNet-50 and reaching 57.6% mAP on Swin-L, surpassing prior state-of-the-art models.
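
A minimal sketch of the learnable low-rank attention bias enabling asymmetric query interactions: a rank-r matrix B = U V^T is added to the decoder self-attention logits; the rank and initialization scale are assumptions.

```python
import torch
import torch.nn as nn

class LowRankAttentionBias(nn.Module):
    """Sketch: asymmetric low-rank bias B = U @ V.T added to self-attention logits."""
    def __init__(self, num_queries: int, rank: int = 4):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(num_queries, rank))
        self.v = nn.Parameter(0.01 * torch.randn(num_queries, rank))

    def forward(self, attn_logits: torch.Tensor) -> torch.Tensor:
        # attn_logits: (B, heads, Q, Q); the bias is shared across batch and heads.
        return attn_logits + self.u @ self.v.T

# Dual-branch training idea: apply the bias only on the training branch,
# so inference uses plain attention and adds no extra cost.
```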

[74] Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis

Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao

Main category: cs.CV

TL;DR: Med3DVLM is a 3D vision-language model for medical imaging that achieves state-of-the-art performance on multiple benchmarks through efficient 3D encoding, improved image-text alignment, and multi-modal fusion.

DetailsMotivation: Extending vision-language models to 3D medical imaging is challenging due to high computational demands of volumetric data and difficulty aligning 3D spatial features with clinical text.

Method: Three key innovations: (1) DCFormer - efficient 3D encoder using decomposed convolutions; (2) SigLIP - contrastive learning with pairwise sigmoid loss for better image-text alignment; (3) dual-stream MLP-Mixer projector for fusing low/high-level image features with text embeddings.

Result: Superior performance on M3D dataset: 61.00% R@1 for image-text retrieval (vs 19.10% SOTA), 36.42% METEOR for report generation (vs 14.38%), 36.76% METEOR for open-ended VQA (vs 33.58%), and 79.95% accuracy for closed-ended VQA (vs 75.78%).

Conclusion: Med3DVLM successfully bridges the gap between 3D imaging and language, enabling scalable multi-task reasoning for clinical applications, with code publicly available.

Abstract: Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM’s ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
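
A minimal sketch of a SigLIP-style pairwise sigmoid contrastive loss, which treats every image-text pair as an independent binary decision and therefore needs no large negative batch; the temperature and bias values are placeholders.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of paired volumes/reports."""
    logits = t * img_emb @ txt_emb.T + b                            # (N, N) pair logits
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1   # +1 on matches, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()
```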

[75] KLO-Net: A Dynamic K-NN Attention U-Net with CSP Encoder for Efficient Prostate Gland Segmentation from MRI

Anning Tian, Byunghyun Ko, Kaichen Qu, Mengyuan Liu, Jeongkyu Lee

Main category: cs.CV

TL;DR: KLO-Net: Dynamic K-NN attention U-Net with CSP encoder for efficient prostate MRI segmentation, balancing computational efficiency and segmentation accuracy.

DetailsMotivation: Real-time deployment of prostate MRI segmentation is bottlenecked by computational load and memory footprint, while deep learning approaches struggle with anatomical variability. Need to bridge efficiency gap while maintaining reliable segmentation accuracy.

Method: Propose KLO-Net: dynamic K-Nearest Neighbor attention U-Net with Cross Stage Partial (CSP) encoder. Dynamic K-NN attention adaptively determines number of attention connections per spatial location. CSP blocks reduce computational load and memory consumption.

Result: Comprehensive experiments on the PROMISE12 and PROSTATEx datasets demonstrate the model’s advantage in computational efficiency and segmentation quality through detailed comparative analysis and ablation studies.

Conclusion: KLO-Net effectively bridges the efficiency gap for prostate MRI segmentation while maintaining reliable accuracy, making it suitable for real-time clinical deployment on workstations with limited computational resources.

Abstract: Real-time deployment of prostate MRI segmentation on clinical workstations is often bottlenecked by computational load and memory footprint. Deep learning-based prostate gland segmentation approaches remain challenging due to anatomical variability. To bridge this efficiency gap while still maintaining reliable segmentation accuracy, we propose KLO-Net, a dynamic K-Nearest Neighbor attention U-Net with Cross Stage Partial, i.e., CSP, encoder for efficient prostate gland segmentation from MRI scan. Unlike the regular K-NN attention mechanism, the proposed dynamic K-NN attention mechanism allows the model to adaptively determine the number of attention connections for each spatial location within a slice. In addition, CSP blocks address the computational load to reduce memory consumption. To evaluate the model’s performance, comprehensive experiments and ablation studies are conducted on two public datasets, i.e., PROMISE12 and PROSTATEx, to validate the proposed architecture. The detailed comparative analysis demonstrates the model’s advantage in computational efficiency and segmentation quality.
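
A minimal sketch of K-NN attention, where each query location attends only to its k most similar keys; the dynamic, per-location selection of k that KLO-Net proposes is not reproduced here.

```python
import torch
import torch.nn.functional as F

def knn_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, knn: int) -> torch.Tensor:
    """q, k, v: (B, N, D). Each query attends to its `knn` highest-scoring keys only."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)      # (B, N, N)
    topk_idx = scores.topk(knn, dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, 0.0)
    return F.softmax(scores + mask, dim=-1) @ v
```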

[76] MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection

Guanghao Wu, Yunqing Shang, Chen Xu, Hai Song, Chong Wang, Qixing Zhang

Main category: cs.CV

TL;DR: A framework for generating realistic forest fire smoke images using diffusion models with mask guidance and multimodal filtering to enhance smoke detection performance.

DetailsMotivation: Scarcity of real forest fire smoke images hinders deep learning-based smoke detection; current inpainting models produce inconsistent smoke-background relationships.

Method: Uses pre-trained segmentation for smoke masks and multimodal models for captions; introduces mask-guided network architecture with mask random difference loss; employs multimodal LLM filtering for dataset quality.

Result: Generated realistic and diverse smoke images that effectively enhance forest fire smoke detection model performance.

Conclusion: Proposed framework successfully addresses data scarcity and quality issues in smoke image generation, improving downstream detection tasks.

Abstract: Smoke is the first visible indicator of a wildfire. With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we proposed a comprehensive framework for generating forest fire smoke images. Firstly, we employed the pre-trained segmentation model and the multimodal model to obtain smoke masks and image captions. Then, to address the insufficient utilization of masks and masked images by inpainting models, we introduced a network architecture guided by mask and masked image features. We also proposed a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask edges. Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporated smoke characteristics and used a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. Experiments showed that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at https://github.com/wghr123/MFGDiffusion.
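
A minimal sketch of the random edge perturbation behind the mask random difference loss: the binary smoke mask is randomly dilated or eroded around its boundary; the kernel sizes and the loss itself are not shown and are assumptions.

```python
import torch
import torch.nn.functional as F

def random_dilate_or_erode(mask: torch.Tensor, max_radius: int = 3) -> torch.Tensor:
    """mask: (B, 1, H, W) binary float. Randomly expand or shrink the mask boundary."""
    r = int(torch.randint(1, max_radius + 1, (1,)))
    k = 2 * r + 1                                                    # odd kernel size
    if torch.rand(1).item() < 0.5:
        return F.max_pool2d(mask, k, stride=1, padding=r)            # dilation
    return 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=r)    # erosion
```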

[77] An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes

Alban Gauthier, Valentin Deschaintre, Alexandre Lanvin, Fredo Durand, Adrien Bousseau, George Drettakis

Main category: cs.CV

TL;DR: The paper analyzes SVBRDF prediction challenges in fast appearance modeling pipelines using generative models, finding standard UNet architectures competitive with more complex designs.

DetailsMotivation: Digital content creation is changing with deep generative models. While conditional image generators can synthesize realistic RGB images aligned with 3D geometry, and SVBRDF prediction networks recover material parameters, combining these technologies creates opportunities for fast appearance modeling pipelines. However, single-view SVBRDF predictions may suffer from multiview incoherence, while generated RGB images provide additional information compared to photographs.

Method: The paper analyzes neural architectures and conditions for SVBRDF prediction in fast appearance modeling pipelines. It compares different designs to identify those achieving high accuracy and coherence, examining how generated RGB images and their conditioning modalities can provide additional information for SVBRDF estimation.

Result: Surprisingly, a standard UNet architecture is found to be competitive with more complex designs for SVBRDF prediction in this context. The analysis identifies designs that achieve both high accuracy and coherence in texture atlas generation.

Conclusion: The research provides insights into SVBRDF prediction challenges and opportunities in modern appearance modeling pipelines, demonstrating that simpler architectures can be effective when leveraging the additional information available from generated RGB images and their conditioning modalities.

Abstract: Digital content creation is experiencing a profound change with the advent of deep generative models. For texturing, conditional image generators now allow the synthesis of realistic RGB images of a 3D scene that align with the geometry of that scene. For appearance modeling, SVBRDF prediction networks recover material parameters from RGB images. Combining these technologies allows us to quickly generate SVBRDF maps for multiple views of a 3D scene, which can be merged to form a SVBRDF texture atlas of that scene. In this paper, we analyze the challenges and opportunities for SVBRDF prediction in the context of such a fast appearance modeling pipeline. On the one hand, single-view SVBRDF predictions might suffer from multiview incoherence and yield inconsistent texture atlases. On the other hand, generated RGB images, and the different modalities on which they are conditioned, can provide additional information for SVBRDF estimation compared to photographs. We compare neural architectures and conditions to identify designs that achieve high accuracy and coherence. We find that, surprisingly, a standard UNet is competitive with more complex designs. Project page: http://repo-sam.inria.fr/nerphys/svbrdf-evaluation

[78] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation

Dawid Malarz, Artur Kasymov, Filip Manjak, Maciej Zięba, Przemysław Spurek

Main category: cs.CV

TL;DR: The paper introduces “unbranding” - a novel task for removing both explicit trademarks and subtle structural brand features from AI-generated images while preserving semantic coherence, addressing limitations of prior work that only targeted general concepts.

DetailsMotivation: The rapid advancement of text-to-image diffusion models raises concerns about unauthorized reproduction of trademarked content. Prior work fails to address specific brand identifiers, particularly subtle structural features beyond explicit logos. Brand recognition is multi-dimensional, extending to distinctive structural elements like car grilles or product shapes.

Method: The paper introduces the “unbranding” task and constructs a comprehensive benchmark dataset. It proposes a novel evaluation metric using Vision Language Models (VLMs) with a question-answering framework to detect both explicit logos and implicit, holistic brand characteristics, addressing limitations of existing brand detectors.

Result: Results show that newer text-to-image models (SDXL, FLUX) synthesize brand identifiers more readily than older models (Stable Diffusion), highlighting the urgency of the unbranding challenge. The VLM-based metric validates that unbranding is a distinct, practically relevant problem requiring specialized techniques.

Conclusion: Unbranding is a novel and important task for removing trademarked content from AI-generated images, requiring specialized approaches beyond existing methods. The proposed VLM-based evaluation metric effectively captures both explicit and implicit brand features, and the increasing fidelity of newer models makes this problem more urgent.

Abstract: The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car’s front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: https://gmum.github.io/UNBRANDING/.

[79] Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation

Miaohua Zhang, Mohammad Ali Armin, Xuesong Li, Sisi Liang, Lars Petersson, Changming Sun, David Ahmedt-Aristizabal, Zeeshan Hayder

Main category: cs.CV

TL;DR: Proposes a quality-driven, diversity-aware sample expansion pipeline using class-aware style banks and adaptive annealing to generate diverse synthetic training data for marine obstacle segmentation without retraining diffusion models.

DetailsMotivation: Marine obstacle detection faces challenges from sun glitter, fog, and wave patterns that degrade image quality. Limited and repetitive marine datasets constrain training diversity, while existing mask-conditioned diffusion models produce low-diversity outputs when conditioned on low-entropy masks and prompts.

Method: A two-component inference-time pipeline: (1) class-aware style bank that constructs high-entropy, semantically grounded prompts, and (2) adaptive annealing sampler that perturbs early conditioning with COD-guided proportional controller to regulate perturbation for diversity while maintaining layout fidelity.

Result: Augmenting training data with controlled synthetic samples consistently improves segmentation performance across multiple backbones on marine obstacle benchmarks and increases visual variation in rare and texture-sensitive classes.

Conclusion: The proposed inference-time sample expansion pipeline effectively addresses data scarcity and diversity limitations in marine obstacle detection by generating high-quality, diverse synthetic training data without requiring diffusion model retraining.

Abstract: Marine obstacle detection demands robust segmentation under challenging conditions, such as sun glitter, fog, and rapidly changing wave patterns. These factors degrade image quality, while the scarcity and structural repetition of marine datasets limit the diversity of available training data. Although mask-conditioned diffusion models can synthesize layout-aligned samples, they often produce low-diversity outputs when conditioned on low-entropy masks and prompts, limiting their utility for improving robustness. In this paper, we propose a quality-driven and diversity-aware sample expansion pipeline that generates training data entirely at inference time, without retraining the diffusion model. The framework combines two key components: (i) a class-aware style bank that constructs high-entropy, semantically grounded prompts, and (ii) an adaptive annealing sampler that perturbs early conditioning, while a COD-guided proportional controller regulates this perturbation to boost diversity without compromising layout fidelity. Across marine obstacle benchmarks, augmenting training data with these controlled synthetic samples consistently improves segmentation performance across multiple backbones and increases visual variation in rare and texture-sensitive classes.

[80] XAI-Driven Diagnosis of Generalization Failure in State-Space Cerebrovascular Segmentation Models: A Case Study on Domain Shift Between RSNA and TopCoW Datasets

Youssef Abuzeid, Shimaa El-Bana, Ahmad Al-Kabbany

Main category: cs.CV

TL;DR: The paper presents an XAI-based diagnostic approach to analyze why a state-of-the-art SSM (UMamba) fails to generalize in cerebrovascular segmentation due to domain shift between medical imaging datasets.

DetailsMotivation: Domain shift in medical imaging severely hinders clinical deployment of deep learning models, causing catastrophic failure on external datasets. Simple performance metrics are insufficient; deeper understanding through XAI is needed as a diagnostic tool to address this critical barrier to trustworthy AI.

Method: Two-phase approach: 1) Quantify domain gap between Source (RSNA CTA Aneurysm) and Target (TopCoW Circle of Willis CT) datasets, noting differences in Z-resolution and background noise. 2) Use Seg-XRes-CAM to diagnose generalization failure by measuring overlap between attention maps and Ground Truth vs. Prediction Mask, quantifying model focus shift.

Result: Model performance dropped catastrophically from Dice 0.8604 (Source) to 0.2902 (Target). Analysis proved the model failed to generalize because its attention mechanism abandoned true anatomical features in the Target domain. Quantitative metrics showed focus shifted away from Ground Truth vessels (IoU 0.101 at the 0.3 threshold) while still aligning with wrong predictions (IoU 0.282 at the 0.3 threshold).

Conclusion: XAI is a powerful diagnostic tool for identifying dataset bias in emerging architectures. The model learned spurious correlations rather than true anatomical features, demonstrating that attention mechanisms can fail catastrophically under domain shift. This approach provides deeper understanding beyond simple performance metrics.

Abstract: The clinical deployment of deep learning models in medical imaging is severely hindered by domain shift. This challenge, where a high-performing model fails catastrophically on external datasets, is a critical barrier to trustworthy AI. Addressing this requires moving beyond simple performance metrics toward deeper understanding, making Explainable AI (XAI) an essential diagnostic tool in medical image analysis. We present a rigorous, two-phase approach to diagnose the generalization failure of state-of-the-art State-Space Models (SSMs), specifically UMamba, applied to cerebrovascular segmentation. We first established a quantifiable domain gap between our Source (RSNA CTA Aneurysm) and Target (TopCoW Circle of Willis CT) datasets, noting significant differences in Z-resolution and background noise. The model’s Dice score subsequently plummeted from 0.8604 (Source) to 0.2902 (Target). In the second phase, which is our core contribution, we utilized Seg-XRes-CAM to diagnose the cause of this failure. We quantified the model’s focus by measuring the overlap between its attention maps and the Ground Truth segmentations, and between its attention maps and its own Prediction Mask. Our analysis proves the model failed to generalize because its attention mechanism abandoned true anatomical features in the Target domain. Quantitative metrics confirm the model’s focus shifted away from the Ground Truth vessels (IoU 0.101 at the 0.3 threshold) while still aligning with its own wrong predictions (IoU 0.282 at the 0.3 threshold). This demonstrates the model learned spurious correlations, confirming XAI is a powerful diagnostic tool for identifying dataset bias in emerging architectures.
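
A minimal sketch of the overlap measurement used in the diagnosis: threshold the saliency/attention map and compute its IoU against a binary mask (either the ground truth or the model's own prediction); the 0.3 threshold matches the values reported above.

```python
import numpy as np

def attention_iou(attn_map: np.ndarray, mask: np.ndarray, thr: float = 0.3) -> float:
    """IoU between a thresholded attention map and a binary segmentation mask."""
    focus = attn_map >= thr          # assumes attn_map is normalized to [0, 1]
    target = mask.astype(bool)
    union = np.logical_or(focus, target).sum()
    return float(np.logical_and(focus, target).sum() / union) if union else 0.0

# Compare model focus against ground truth vs. its own prediction:
# attention_iou(cam, gt_mask) vs. attention_iou(cam, pred_mask)
```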

[81] FocalComm: Hard Instance-Aware Multi-Agent Perception

Dereje Shenkut, Vijayakumar Bhagavatula

Main category: cs.CV

TL;DR: FocalComm is a collaborative perception framework that exchanges hard-instance-oriented features to improve pedestrian detection in autonomous driving, outperforming state-of-the-art methods on real-world datasets.

DetailsMotivation: Existing collaborative perception approaches optimize for vehicle detection metrics and underperform on smaller, safety-critical objects like pedestrians. They also rely on full feature exchange rather than communicating only salient features that help reduce false negatives.

Method: FocalComm uses two novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level fusion technique that dynamically weights these identified features during collaboration.
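
A hedged sketch of what a query-based, dynamically weighted fusion over per-agent features could look like (a generic cross-attention instantiation; the class name, dimensions, and tensor layout are assumptions, not FocalComm's actual module):

```python
import torch
import torch.nn as nn

class QueryFeatureFusion(nn.Module):
    """Illustrative query-based fusion: learnable queries attend over per-agent
    hard-instance features and produce dynamically weighted fused features."""

    def __init__(self, dim=256, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, agent_features):
        # agent_features: (batch, num_agents * tokens, dim), flattened across agents
        b = agent_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, weights = self.attn(q, agent_features, agent_features)
        return fused, weights  # weights indicate how much each agent token contributes
```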

Result: FocalComm outperforms state-of-the-art collaborative perception methods on V2X-Real and DAIR-V2X datasets across both vehicle-centric and infrastructure-centric setups. It shows strong performance gains in pedestrian detection on V2X-Real.

Conclusion: FocalComm effectively addresses the limitations of existing collaborative perception methods by focusing on hard-instance-oriented feature exchange, leading to improved detection of safety-critical objects like pedestrians in autonomous driving scenarios.

Abstract: Multi-agent collaborative perception (CP) is a promising paradigm for improving autonomous driving safety, particularly for vulnerable road users like pedestrians, via robust 3D perception. However, existing CP approaches often optimize for vehicle detection performance metrics, underperforming on smaller, safety-critical objects such as pedestrians, where detection failures can be catastrophic. Furthermore, previous CP methods rely on full feature exchange rather than communicating only salient features that help reduce false negatives. To this end, we present FocalComm, a novel collaborative perception framework that focuses on exchanging hard-instance-oriented features among connected collaborative agents. FocalComm consists of two key novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level (intermediate) fusion technique that dynamically weights these identified features during collaboration. We show that FocalComm outperforms state-of-the-art collaborative perception methods on two challenging real-world datasets (V2X-Real and DAIR-V2X) across both vehicle-centric and infrastructure-centric collaborative setups. FocalComm also shows a strong performance gain in pedestrian detection in V2X-Real.

[82] Repurposing 2D Diffusion Models for 3D Shape Completion

Yao He, Youngjoong Kwon, Tiange Xiang, Wenxiao Cai, Ehsan Adeli

Main category: cs.CV

TL;DR: A framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds using a compact 2D representation called Shape Atlas.

DetailsMotivation: 3D diffusion models lag behind 2D diffusion models due to the scarcity of high-quality 3D datasets and the modality gap between 3D inputs and 2D latent spaces.

Method: Introduces Shape Atlas - a compact 2D representation of 3D geometry that enables full utilization of pretrained 2D diffusion models and aligns modalities between conditional input and output spaces.

Result: Validated effectiveness on the PCN and ShapeNet-55 datasets, producing high-quality, detail-preserving shape completions. Demonstrated a downstream application that produces artist-created meshes from the completed point clouds.

Conclusion: The unified 2D formulation facilitates learning from limited 3D data and produces practical, high-quality 3D shape completions while leveraging powerful pretrained 2D diffusion models.

Abstract: We present a framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds. While text-to-image diffusion models have achieved remarkable success with abundant 2D data, 3D diffusion models lag due to the scarcity of high-quality 3D datasets and a persistent modality gap between 3D inputs and 2D latent spaces. To overcome these limitations, we introduce the Shape Atlas, a compact 2D representation of 3D geometry that (1) enables full utilization of the generative power of pretrained 2D diffusion models, and (2) aligns the modalities between the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving shape completions. We validate the effectiveness of our results on the PCN and ShapeNet-55 datasets. Additionally, we show the downstream application of creating artist-created meshes from our completed point clouds, further demonstrating the practicality of our method.

[83] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen

Main category: cs.CV

TL;DR: Sparse-LaViDa accelerates Masked Discrete Diffusion Models by dynamically truncating unnecessary masked tokens during inference, achieving 2x speedup while maintaining quality.

DetailsMotivation: Masked Discrete Diffusion Models (MDMs) have strong performance but suffer from slow inference speed due to repeatedly processing redundant masked tokens at every sampling step.

Method: Proposes Sparse-LaViDa framework that dynamically truncates unnecessary masked tokens at each inference step, introduces specialized register tokens as compact representations for truncated tokens, and designs specialized attention masks to match truncated sampling during training.
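
A rough sketch of the truncation idea, assuming a per-step schedule of which masked positions to decode; the function and argument names are illustrative, not the paper's API:

```python
import torch

def sparse_step_inputs(tokens, is_masked, decode_now, register_tokens):
    """Build one inference step's inputs under masked-token truncation.

    tokens:          (seq_len, dim) current token embeddings
    is_masked:       (seq_len,) bool, True where the token is still [MASK]
    decode_now:      (seq_len,) bool, masked positions scheduled for this step
    register_tokens: (num_registers, dim) learned compact stand-ins for the
                     masked tokens that are dropped from this step
    """
    keep = (~is_masked) | decode_now  # visible context plus tokens to decode now
    return torch.cat([tokens[keep], register_tokens], dim=0), keep
```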

Result: Achieves up to 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning while maintaining generation quality.

Conclusion: Sparse-LaViDa successfully accelerates MDM inference while preserving performance, making it a practical solution for efficient multimodal generation tasks.

Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.

[84] KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding

Zongyao Li, Kengo Ishida, Satoshi Yamazaki, Xiaotong Ji, Jianquan Liu

Main category: cs.CV

TL;DR: KFS-Bench is the first benchmark for evaluating key frame sampling in long video QA with multi-scene annotations, enabling direct assessment of sampling strategies and identifying key factors for QA performance.

DetailsMotivation: Key frame sampling is essential for efficient long-form video understanding, but prior works only indirectly assess frame selection quality via QA accuracy. There's a need for direct evaluation of sampling strategies across entire long videos.

Method: Created KFS-Bench with ground-truth annotations of multiple disjoint scenes required per question. Designed a novel sampling quality metric considering precision, scene coverage, and sampling balance. Developed an adaptive sampling method using question-video relevance to balance diversity against similarity.
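
A toy sketch of how precision, scene coverage, and sampling balance could be scored for a set of sampled frames (the paper's actual metric and weighting may differ; the interval representation and the entropy-based balance term are assumptions):

```python
import math

def sampling_quality(sampled_frames, scenes):
    """Toy precision / scene-coverage / balance scores for key-frame sampling.

    sampled_frames: list of sampled frame indices
    scenes: list of (start, end) frame ranges (inclusive) required for a question
    """
    def scene_of(frame):
        for i, (start, end) in enumerate(scenes):
            if start <= frame <= end:
                return i
        return None

    hits = [scene_of(f) for f in sampled_frames]
    in_scene = [h for h in hits if h is not None]

    precision = len(in_scene) / len(sampled_frames) if sampled_frames else 0.0
    coverage = len(set(in_scene)) / len(scenes) if scenes else 0.0

    # Balance: normalized entropy of how in-scene samples spread over the hit scenes.
    counts = [in_scene.count(i) for i in set(in_scene)]
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        balance = 1.0 if total else 0.0
    else:
        entropy = -sum(c / total * math.log(c / total) for c in counts)
        balance = entropy / math.log(len(counts))

    return {"precision": precision, "coverage": coverage, "balance": balance}
```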

Result: A comprehensive study identified sampling precision, scene coverage, and sampling balance as the key factors influencing QA performance. The adaptive balanced sampling approach achieved superior performance in both key frame sampling and QA accuracy.

Conclusion: KFS-Bench enables direct evaluation of key frame sampling strategies, revealing critical factors for QA performance. The proposed adaptive sampling method effectively balances diversity and relevance, improving coverage of relevant scenes in long videos.

Abstract: We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Regarding all the factors, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at https://github.com/NEC-VID/KFS-Bench.

[85] Deep Learning Perspective of Scene Understanding in Autonomous Robots

Afia Maham, Dur E Nayab Tashfa

Main category: cs.CV

TL;DR: Review paper on deep learning applications for scene understanding in autonomous robots, covering object detection, segmentation, depth estimation, 3D reconstruction, and visual SLAM, with discussion of integration challenges and future directions.

DetailsMotivation: To survey how deep learning techniques address limitations of traditional geometric models in autonomous robot scene understanding, particularly in handling occlusions, textureless surfaces, and improving real-time depth perception and semantic reasoning.

Method: Literature review and analysis of deep learning applications across multiple scene understanding tasks including object detection, semantic/instance segmentation, depth estimation, 3D reconstruction, and visual SLAM.

Result: Deep learning techniques significantly improve autonomous robots’ ability to understand complex scenes, overcome traditional geometric model limitations, and enhance real-time performance in dynamic unstructured environments.

Conclusion: Integration of deep learning perception modules enables more effective decision-making, navigation, and interaction for autonomous robots, though challenges remain that require further research in learning-based scene understanding.

Abstract: This paper provides a review of deep learning applications in scene understanding in autonomous robots, including innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques address limitations of traditional geometric models, improve depth perception in real time despite occlusions and textureless surfaces, and enhance semantic reasoning to understand the environment better. When these perception modules are integrated into dynamic and unstructured environments, they become more effective in decision-making, navigation, and interaction. Lastly, the review outlines the existing problems and research directions to advance learning-based scene understanding of autonomous robots.

[86] Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers

Yibing Fu, Yunpeng Zhao, Zhitao Zeng, Cheng Chen, Yueming Jin

Main category: cs.CV

TL;DR: CITab is a novel self-supervised learning framework for cross-tabular multi-modal medical data that integrates images and heterogeneous tabular data by using semantic-aware column headers and a prototype-guided mixture-of-linear layer module.

DetailsMotivation: Existing SSL methods for image-tabular representation learning are limited to specific data cohorts due to rigid tabular modeling mechanisms, creating an inter-tabular barrier that prevents learning transferrable medical knowledge across diverse cohorts.

Method: CITab designs tabular modeling from a semantic-awareness perspective by integrating column headers as semantic cues, and proposes a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization to handle tabular data heterogeneity and explore underlying medical concepts.
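
A hedged sketch of a prototype-gated mixture of linear layers in the spirit of P-MoLin (a generic formulation; the gating, normalization, and dimensions are assumptions rather than the paper's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeMoLin(nn.Module):
    """Illustrative prototype-guided mixture of linear layers: each tabular feature
    vector is routed to linear experts by its similarity to learned prototypes."""

    def __init__(self, dim=256, num_prototypes=8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_prototypes)])

    def forward(self, x):
        # x: (batch, dim) tabular feature vectors
        gate = F.softmax(
            F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).t(), dim=-1
        )                                                        # (batch, num_prototypes)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, K, dim)
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)
```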

Result: The framework was evaluated on Alzheimer’s disease diagnosis across three publicly available data cohorts containing 4,461 subjects, demonstrating superior performance over state-of-the-art approaches.

Conclusion: CITab enables effective and scalable cross-tabular multi-modal learning by overcoming the limitations of existing SSL methods and facilitating transferrable knowledge learning across diverse medical data cohorts.

Abstract: Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferrable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferrable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on Alzheimer’s disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.

[87] Robust Single-shot Structured Light 3D Imaging via Neural Feature Decoding

Jiaheng Li, Qiyu Dai, Lihan Li, Praneeth Chakravarthula, He Sun, Baoquan Chen, Wenzheng Chen

Main category: cs.CV

TL;DR: Learning-based structured light decoding using neural feature matching instead of traditional pixel-domain matching, with depth refinement using monocular depth priors, trained on synthetic data and generalizing to real-world 3D sensing.

DetailsMotivation: Traditional structured light systems used in commercial devices (Apple Face ID, Intel RealSense) have limited robustness under challenging scenarios like occlusions, fine details, and non-Lambertian surfaces due to pixel-domain matching approaches.

Method: Proposes a learning-based framework that extracts neural features from projected patterns and IR images, builds cost volumes in feature space using geometric priors, and adds a depth refinement module leveraging large-scale monocular depth estimation models. Uses a physically-based rendering pipeline to generate synthetic training data.
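
A simple sketch of a correlation cost volume built in feature space between the IR image and the projected pattern (a stereo-style, horizontal-disparity search is assumed here; the paper's construction may differ):

```python
import torch

def feature_cost_volume(feat_ir, feat_pattern, max_disp=64):
    """Correlation cost volume between IR-image and projected-pattern features.

    feat_ir, feat_pattern: (B, C, H, W) feature maps from a shared encoder
    Returns: (B, max_disp, H, W) matching scores for each candidate disparity.
    """
    b, c, h, w = feat_ir.shape
    volume = feat_ir.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_ir * feat_pattern).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_ir[:, :, :, d:] * feat_pattern[:, :, :, :-d]).mean(dim=1)
    return volume
```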

Result: Method trained exclusively on synthetic data generalizes well to real-world indoor environments, processes various pattern types without retraining, and outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods.

Conclusion: Neural feature matching in structured light decoding significantly improves robustness and performance over traditional pixel-domain approaches, enabling better 3D imaging in challenging scenarios through synthetic training and monocular depth priors.

Abstract: We consider the problem of active 3D imaging using single-shot structured light systems, which are widely employed in commercial 3D sensing devices such as Apple Face ID and Intel RealSense. Traditional structured light methods typically decode depth correspondences through pixel-domain matching algorithms, resulting in limited robustness under challenging scenarios like occlusions, fine-structured details, and non-Lambertian surfaces. Inspired by recent advances in neural feature matching, we propose a learning-based structured light decoding framework that performs robust correspondence matching within feature space rather than the fragile pixel domain. Our method extracts neural features from the projected patterns and captured infrared (IR) images, explicitly incorporating their geometric priors by building cost volumes in feature space, achieving substantial performance improvements over pixel-domain decoding approaches. To further enhance depth quality, we introduce a depth refinement module that leverages strong priors from large-scale monocular depth estimation models, improving fine detail recovery and global structural coherence. To facilitate effective learning, we develop a physically-based structured light rendering pipeline, generating nearly one million synthetic pattern-image pairs with diverse objects and materials for indoor settings. Experiments demonstrate that our method, trained exclusively on synthetic data with multiple structured light patterns, generalizes well to real-world indoor environments, effectively processes various pattern types without retraining, and consistently outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods. Project page: https://namisntimpot.github.io/NSLweb/.

[88] ASAP-Textured Gaussians: Enhancing Textured Gaussians with Adaptive Sampling and Anisotropic Parameterization

Meng Wei, Cheng Zhang, Jianmin Zheng, Hamid Rezatofighi, Jianfei Cai

Main category: cs.CV

TL;DR: ASAP Textured Gaussians improves memory efficiency in textured 3D Gaussian Splatting through adaptive sampling and anisotropic parameterization, achieving high-fidelity rendering with fewer texture parameters.

DetailsMotivation: Existing textured Gaussian methods have memory efficiency challenges due to two key limitations: (1) textures defined in canonical space lead to inefficient sampling that wastes capacity on low-contribution regions, and (2) uniform texture parameterization across all Gaussians causes over-parameterization regardless of visual complexity.

Method: Proposes two strategies: (1) adaptive sampling based on Gaussian density distribution to optimize texture usage, and (2) error-driven anisotropic parameterization that allocates texture resources according to rendering error rather than uniform assignment.

Result: ASAP Textured Gaussians significantly improves the quality-efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters compared to existing methods.

Conclusion: The proposed adaptive sampling and anisotropic parameterization strategies effectively address memory efficiency challenges in textured 3D Gaussian Splatting, offering a simple yet effective solution for better quality-efficiency balance.

Abstract: Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing textured Gaussian methods and identify two key limitations in common: (1) Textures are typically defined in canonical space, leading to inefficient sampling that wastes textures’ capacity on low-contribution regions; and (2) texture parameterization is uniformly assigned across all Gaussians, regardless of their visual complexity, resulting in over-parameterization. In this work, we address these issues through two simple yet effective strategies: adaptive sampling based on the Gaussian density distribution and error-driven anisotropic parameterization that allocates texture resources according to rendering error. Our proposed ASAP Textured Gaussians, short for Adaptive Sampling and Anisotropic Parameterization, significantly improve the quality efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters.

[89] ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning

Boran Wang, Xinming Wang, Yi Chen, Xiang Li, Jian Xu, Jing Yuan, Chenglin Liu

Main category: cs.CV

TL;DR: ChartAgent: A tool-integrated reasoning framework for robust chart understanding that decomposes analysis into observable steps using modular tools, improving performance when key textual annotations are missing.

DetailsMotivation: Current multimodal LLMs for chart understanding heavily depend on explicit textual annotations and degrade significantly when key numerals are absent, limiting their practical application across diverse chart types with sparse annotations.

Method: ChartAgent uses Tool-Integrated Reasoning (TIR) to decompose chart analysis into observable, replayable steps. It employs an extensible modular tool library with over a dozen core tools (key element detection, instance segmentation, OCR, etc.) that the agent dynamically orchestrates for systematic visual parsing. The framework standardizes intermediate outputs into a structured Evidence Package for traceability.

Result: ChartAgent substantially improves robustness under sparse annotation settings, offering better performance when key textual information is missing compared to existing multimodal LLMs.

Conclusion: ChartAgent provides a practical path toward trustworthy and extensible chart understanding systems by moving beyond black-box approaches through transparent, verifiable reasoning with traceable intermediate outputs.

Abstract: With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key-element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.

[90] OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang

Main category: cs.CV

TL;DR: OmniDrive-R1 is an end-to-end VLM framework for autonomous driving that uses interleaved multimodal Chain-of-Thought reasoning with reinforcement-driven visual grounding to reduce object hallucination without dense labels.

DetailsMotivation: VLMs in safety-critical domains like autonomous driving suffer from reliability failures, particularly object hallucination, due to ungrounded text-based reasoning. Existing multimodal CoT approaches have decoupled perception/reasoning stages and require expensive dense localization labels.

Method: Introduces OmniDrive-R1 with interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism that unifies perception and reasoning. Uses reinforcement-driven visual grounding with a two-stage RL training pipeline and Clip-GRPO algorithm, which employs annotation-free, process-based grounding rewards for cross-modal consistency.

Result: On DriveLMM-o1 benchmark: Overall reasoning score improved from 51.77% to 80.35%, and final answer accuracy from 37.81% to 73.62% compared to baseline Qwen2.5VL-7B.

Conclusion: OmniDrive-R1 effectively addresses object hallucination in VLMs for autonomous driving through end-to-end joint optimization and reinforcement-driven visual grounding without requiring dense labels, significantly improving reasoning reliability.

Abstract: The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and “zoom in” on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model’s significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

[91] SELECT: Detecting Label Errors in Real-world Scene Text Data

Wenjun Liu, Qian Wu, Yifeng Hu, Yuke Li

Main category: cs.CV

TL;DR: SELECT is a multi-modal approach for detecting label errors in scene text datasets using image-text encoders and character tokenizers, with SSLC for realistic error simulation during training.

DetailsMotivation: Real-world scene text datasets contain label errors that degrade STR performance, but existing methods don't properly handle variable-length sequence labels, label misalignment, and character-level errors.

Method: Uses image-text encoder with character-level tokenizer to handle variable-length sequences, plus SSLC (Similarity-based Sequence Label Corruption) that introduces realistic errors during training by considering visual character similarity and length changes.
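
A toy sketch of similarity-based sequence label corruption (the similarity table and probabilities below are invented for illustration; the paper's corruption process is presumably richer):

```python
import random

# Toy visual-similarity table; the real method would use a curated or learned one.
VISUALLY_SIMILAR = {"0": "OoQD", "1": "Il", "5": "S", "8": "B", "g": "q9"}

def corrupt_label(label, p_sub=0.1, p_del=0.05, p_ins=0.05,
                  alphabet="abcdefghijklmnopqrstuvwxyz0123456789"):
    """Introduce realistic errors into a scene-text label: visually similar
    substitutions plus insertions/deletions that change the sequence length."""
    out = []
    for ch in label:
        r = random.random()
        if r < p_del:
            continue                                    # drop the character
        if r < p_del + p_sub:
            out.append(random.choice(VISUALLY_SIMILAR.get(ch, alphabet)))
        else:
            out.append(ch)
        if random.random() < p_ins:
            out.append(random.choice(alphabet))         # spurious extra character
    return "".join(out)
```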

Result: Outperforms existing methods in accuracy and practical utility, successfully detects label errors in real-world datasets, and improves STR accuracy on real-world text datasets.

Conclusion: SELECT is the first successful method for detecting label errors in real-world scene text datasets with variable-length labels, demonstrating practical utility for improving dataset quality and STR performance.

Abstract: We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC can not only change the sequence length but also accounts for the visual similarity between characters during corruption. Our method is the first to successfully detect label errors in real-world scene text datasets while accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.

[92] HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang

Main category: cs.CV

TL;DR: HyperVL is an efficient multimodal LLM for on-device use that reduces memory/latency via image tiling, adaptive resolution compression, and dual consistency learning.

DetailsMotivation: Current multimodal LLMs have high computational/memory requirements making them unsuitable for on-device deployment. Standard ViT encoders are particularly problematic with high-resolution inputs due to excessive latency and memory consumption.

Method: 1) Image-tiling strategy to cap peak memory usage; 2) Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation; 3) Dual Consistency Learning (DCL) which aligns multi-scale ViT encoders within a unified framework for dynamic switching between visual branches.
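
A minimal sketch of the image-tiling idea (the tile size and the global thumbnail are assumptions; HyperVL's actual tiling and the VRC's resolution prediction are not reproduced here):

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 448):
    """Split a high-resolution image into fixed-size tiles plus a global thumbnail,
    so the vision encoder's peak memory depends on the tile size, not the input size."""
    w, h = img.size
    views = [img.resize((tile, tile))]                  # low-resolution global view
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            views.append(img.crop(box).resize((tile, tile)))
    return views
```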

Result: Achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Significantly reduces latency and power consumption on real mobile devices.

Conclusion: HyperVL demonstrates practicality for on-device multimodal inference by efficiently balancing performance with computational efficiency through novel architectural innovations.

Abstract: Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.

[93] FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh

Main category: cs.CV

TL;DR: FacEDiT unifies talking face editing and generation as speech-conditional facial motion infilling using a Diffusion Transformer with flow matching, enabling localized edits with seamless transitions.

DetailsMotivation: Talking face editing and generation have traditionally been studied as separate problems, lacking a unified framework. The authors propose viewing both as subtasks of speech-conditional facial motion infilling to create a more comprehensive approach.

Method: FacEDiT: a speech-conditional Diffusion Transformer trained with flow matching, inspired by masked autoencoders. It learns to synthesize masked facial motions conditioned on surrounding motions and speech. Uses biased attention and temporal smoothness constraints for boundary continuity and lip sync. Also introduces FacEDiTBench dataset for evaluation.

Result: FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity. It generalizes effectively to talking face generation, validating that both tasks emerge as subtasks of speech-conditional motion infilling.

Conclusion: The proposed speech-conditional facial motion infilling framework successfully unifies talking face editing and generation. FacEDiT enables localized generation and edits (substitution, insertion, deletion) with seamless transitions, addressing the lack of standard editing benchmarks through the new FacEDiTBench dataset.

Abstract: Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.

[94] Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning

Zulin Zhuang, Yu Bian

Main category: cs.CV

TL;DR: A multimodal deep learning framework predicts indoor workplane illuminance in real-time from window images, enabling accurate daylight-linked controls in dynamically occupied spaces.

DetailsMotivation: Daylight-linked controls have significant energy-saving potential, but existing indoor daylight prediction methods are limited to static scenes and don't work well in dynamically occupied spaces.

Method: Proposes a multimodal deep learning framework that extracts temporal-spatial features from non-intrusive images of side-lit window areas (not interior pixels) to predict workplane illuminance distributions in real time.

Result: Model achieved R2 > 0.98 with RMSE < 0.14 on same-distribution test set and R2 > 0.82 with RMSE < 0.17 on unseen-day test set, showing high accuracy and acceptable temporal generalization.

Conclusion: The approach enables accurate real-time indoor daylight prediction for daylight-linked controls in dynamically occupied spaces, using only window area images to maintain applicability.

Abstract: Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R2 > 0.98 with RMSE < 0.14 on the same-distribution test set and R2 > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.

[95] Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution

Hao Chen, Junyang Chen, Jinshan Pan, Jiangxin Dong

Main category: cs.CV

TL;DR: CODSR is a controllable one-step diffusion network for image super-resolution that addresses three key limitations of existing methods: information loss from compression, insufficient generative prior activation, and text-semantic misalignment.

DetailsMotivation: Recent diffusion-based one-step super-resolution methods have three critical limitations: (1) inferior fidelity due to information loss from LQ input compression, (2) insufficient region-discriminative activation of generative priors, and (3) misalignment between text prompts and corresponding semantic regions.

Method: CODSR proposes three key components: (1) LQ-guided feature modulation module that uses uncompressed LQ information for high-fidelity conditioning, (2) region-adaptive generative prior activation method to enhance perceptual richness while preserving local structure, and (3) text-matching guidance strategy to better utilize text prompt conditioning.
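
A generic FiLM-style sketch of LQ-guided feature modulation (the channel counts, 1x1 convolution, and scale/shift form are assumptions; CODSR's actual module may differ):

```python
import torch
import torch.nn as nn

class LQGuidedModulation(nn.Module):
    """Illustrative LQ-guided modulation: features from the uncompressed LQ input
    predict per-channel scale and shift applied to the diffusion features."""

    def __init__(self, lq_channels, diff_channels):
        super().__init__()
        self.to_scale_shift = nn.Conv2d(lq_channels, 2 * diff_channels, kernel_size=1)

    def forward(self, diff_feat, lq_feat):
        # Assumes lq_feat has been resized to the same spatial size as diff_feat.
        scale, shift = self.to_scale_shift(lq_feat).chunk(2, dim=1)
        return diff_feat * (1 + scale) + shift
```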

Result: Extensive experiments show CODSR achieves superior perceptual quality and competitive fidelity compared to state-of-the-art methods while maintaining efficient one-step inference.

Conclusion: CODSR successfully addresses the three identified limitations of existing diffusion-based one-step super-resolution methods, achieving both high perceptual quality and competitive fidelity with efficient inference.

Abstract: Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.

[96] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou

Main category: cs.CV

TL;DR: SDAR-VL is a block-wise discrete diffusion model for vision-language understanding that introduces three training innovations to overcome previous limitations of high cost, slow convergence, and instability, achieving state-of-the-art performance across 21 benchmarks.

DetailsMotivation: Block-wise discrete diffusion offers a good balance between parallel generation and causal dependency modeling for vision-language tasks, but its practical adoption has been limited by high training costs, slow convergence, and instability compared to autoregressive baselines.

Method: SDAR-VL introduces an integrated framework with three key components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within batches, (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking, and (3) Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity.
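
One plausible reading of the mask-ratio normalization, sketched below: the masked-token loss is averaged over the tokens that were actually masked, so batches with different effective mask ratios contribute on a comparable scale (names and the exact scaling are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(logits, targets, mask):
    """Cross-entropy over masked positions only, normalized by the number of
    actually masked tokens rather than the full sequence length.

    logits: (B, L, V), targets: (B, L) token ids, mask: (B, L) bool for masked slots
    """
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```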

Result: Experiments on 21 single-image, multi-image, and video benchmarks show SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. It sets new state-of-the-art among diffusion-based vision-language models and matches or surpasses strong autoregressive baselines like LLaVA-OneVision and global diffusion baseline LLaDA-V.

Conclusion: SDAR-VL establishes block-wise diffusion as a practical backbone for vision-language understanding by overcoming previous limitations through systematic training innovations, making it competitive with and sometimes superior to autoregressive approaches.

Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.

[97] GaussianPlant: Structure-aligned Gaussian Splatting for 3D Reconstruction of Plants

Yang Yang, Risa Shinoda, Hiroaki Santo, Fumio Okura

Main category: cs.CV

TL;DR: GaussianPlant: Hierarchical 3DGS representation that jointly recovers plant appearance and internal structure from multi-view images, enabling both high-fidelity rendering and structural analysis.

DetailsMotivation: 3D Gaussian Splatting (3DGS) excels at appearance reconstruction for novel-view synthesis but lacks structural representations needed for applications like plant phenotyping, which requires understanding branching patterns and leaf organization.

Method: Introduces hierarchical representation with structure primitives (StPs) for branch/leaf geometry (cylinders/disks) and appearance primitives (ApPs) for visual appearance. StPs self-organize to distinguish branches/leaves, while ApPs are bound to StPs. Joint optimization uses re-rendering loss and gradient flow between primitives.

Result: Achieves both high-fidelity appearance reconstruction via ApPs and accurate structural reconstruction via StPs, enabling extraction of branch structure and leaf instances from multi-view images.

Conclusion: GaussianPlant successfully bridges the gap between appearance and structural reconstruction in botanical plants, making 3DGS applicable to practical tasks like plant phenotyping while maintaining rendering quality.

Abstract: We present a method for jointly recovering the appearance and internal structure of botanical plants from multi-view images based on 3D Gaussian Splatting (3DGS). While 3DGS exhibits robust reconstruction of scene appearance for novel-view synthesis, it lacks structural representations underlying those appearances (e.g., branching patterns of plants), which limits its applicability to tasks such as plant phenotyping. To achieve both high-fidelity appearance and structural reconstruction, we introduce GaussianPlant, a hierarchical 3DGS representation, which disentangles structure and appearance. Specifically, we employ structure primitives (StPs) to explicitly represent branch and leaf geometry, and appearance primitives (ApPs) to represent the plants’ appearance using 3D Gaussians. StPs represent a simplified structure of the plant, i.e., modeling branches as cylinders and leaves as disks. To accurately distinguish the branches and leaves, StP’s attributes (i.e., branches or leaves) are optimized in a self-organized manner. ApPs are bound to each StP to represent the appearance of branches or leaves as in conventional 3DGS. StPs and ApPs are jointly optimized using a re-rendering loss on the input multi-view images, as well as the gradient flow from ApP to StP using the binding correspondence information. We conduct experiments to qualitatively evaluate the reconstruction accuracy of both appearance and structure, as well as real-world experiments to qualitatively validate the practical performance. Experiments show that GaussianPlant achieves both high-fidelity appearance reconstruction via ApPs and accurate structural reconstruction via StPs, enabling the extraction of branch structure and leaf instances.

[98] ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes

Felix Holm, Ghazal Ghazaei, Nassir Navab

Main category: cs.CV

TL;DR: ProtoFlow introduces a prototype-based framework using dynamic scene graph prototypes to model surgical workflows interpretably and robustly, outperforming GNN baselines and showing strong few-shot performance.

DetailsMotivation: Current surgical AI faces challenges with high annotation costs, data scarcity, and lack of interpretable models. Scene graphs offer structured surgical event abstraction but their potential remains untapped for detailed surgical recognition.

Method: ProtoFlow uses a GNN encoder-decoder architecture with self-supervised pretraining for rich representation learning, followed by prototype-based fine-tuning that discovers and refines core prototypes capturing recurring surgical interaction patterns.

Result: On CAT-SG dataset, ProtoFlow outperforms standard GNN baselines in accuracy and shows exceptional robustness in few-shot scenarios (works with as few as one surgical video). Learned prototypes identify surgical sub-techniques and provide interpretable insights into workflow deviations.

Conclusion: ProtoFlow combines robust representation learning with explainability, advancing transparent, reliable, and data-efficient AI systems for clinical adoption in surgical training, real-time decision support, and workflow optimization.

Abstract: Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner. Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis. Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications. Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.

[99] Quality-Aware Framework for Video-Derived Respiratory Signals

Nhi Nguyen, Constantino Álvarez Casado, Le Nguyen, Manuel Lage Cañellas, Miguel Bordallo López

Main category: cs.CV

TL;DR: A predictive quality-aware framework for video-based respiratory rate estimation that integrates multiple signal sources and uses quality indices to train ML models for adaptive signal fusion and filtering.

DetailsMotivation: Video-based respiratory rate estimation is often unreliable due to inconsistent signal quality across different extraction methods, requiring a more robust approach that can handle heterogeneous signal sources and dynamically assess reliability.

Method: Extracts ten signals from facial rPPG, upper-body motion, and deep learning pipelines; analyzes them using four spectral estimators (Welch’s method, MUSIC, FFT, peak detection); uses segment-level quality indices to train ML models that predict accuracy or select the most reliable signal; enables adaptive signal fusion and quality-based segment filtering.
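
For one of the four estimators, a minimal sketch of reading a respiratory rate off a video-derived signal with Welch's method (the breathing band and segment length are illustrative choices, not the paper's settings):

```python
import numpy as np
from scipy.signal import welch

def respiratory_rate_welch(signal, fs, band=(0.1, 0.5)):
    """Estimate respiratory rate (breaths per minute) as the dominant frequency of
    the signal's Welch power spectrum within a plausible breathing band (Hz)."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 512))
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak_freq = freqs[in_band][np.argmax(psd[in_band])]
    return 60.0 * peak_freq
```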

Result: Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show the framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics.

Conclusion: Quality-driven predictive modeling has potential to deliver scalable and generalizable video-based respiratory monitoring solutions by intelligently combining multiple signal sources based on their reliability.

Abstract: Video-based respiratory rate (RR) estimation is often unreliable due to inconsistent signal quality across extraction methods. We present a predictive, quality-aware framework that integrates heterogeneous signal sources with dynamic assessment of reliability. Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines, and analyzed using four spectral estimators: Welch’s method, Multiple Signal Classification (MUSIC), Fast Fourier Transform (FFT), and peak detection. Segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. This enables adaptive signal fusion and quality-based segment filtering. Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the proposed framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics. These findings highlight the potential of quality-driven predictive modeling to deliver scalable and generalizable video-based respiratory monitoring solutions.

[100] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Sisi Dai, Kai Xu

Main category: cs.CV

TL;DR: AnchorHOI is a novel framework for zero-shot 4D human-object interaction generation that uses hybrid priors from video diffusion models and an anchor-based distillation strategy to overcome dataset limitations and improve interaction quality.

DetailsMotivation: Current 4D HOI generation faces scalability issues due to limited datasets, and zero-shot methods using image diffusion models fail to properly distill interaction cues, restricting their applicability across diverse scenarios.

Method: AnchorHOI incorporates video diffusion models beyond image models and introduces an anchor-based prior distillation strategy with two tailored anchors: anchor NeRFs for expressive interaction composition and anchor keypoints for realistic motion synthesis, guiding generation in a tractable two-step process.

Result: Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization in 4D human-object interaction generation.

Conclusion: The proposed framework successfully addresses the challenges of 4D HOI generation by leveraging hybrid priors and an anchor-based distillation approach, enabling better interaction composition and motion synthesis without requiring large-scale supervised datasets.

Abstract: Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.

[101] OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration

Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun

Main category: cs.CV

TL;DR: OUSAC accelerates diffusion transformers by optimizing guidance scheduling and adaptive caching, achieving up to 82% reduction in unconditional passes while maintaining or improving image quality.

DetailsMotivation: Diffusion models are computationally expensive due to iterative denoising, and Classifier-Free Guidance (CFG) doubles computation by requiring both conditional and unconditional passes at every timestep. Current caching methods fail under variable guidance patterns.

Method: Two-stage approach: Stage-1 uses evolutionary algorithms to jointly optimize which timesteps to skip CFG and what guidance scales to use. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block to maintain caching effectiveness under variable guidance.
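
A schematic sketch of a variable guidance schedule in a sampling loop: some timesteps skip the unconditional pass entirely, others compensate with a larger scale (`denoise` and the schedule values are placeholders, not OUSAC's searched schedule):

```python
def cfg_step(denoise, x, t, cond, scale):
    """One denoising step; scale=None means the unconditional pass is skipped."""
    eps_cond = denoise(x, t, cond)
    if scale is None:
        return eps_cond                                   # conditional-only step
    eps_uncond = denoise(x, t, None)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy schedule of the kind a stage-1 search might return:
# most steps drop the unconditional pass; a few use a larger-than-default scale.
schedule = {0: 7.5, 5: None, 10: 9.0, 15: None}           # timestep -> scale or skip
```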

Result: Achieves 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over 50-step baseline. Eliminates up to 82% of unconditional passes.

Conclusion: OUSAC provides a systematic optimization framework that significantly accelerates diffusion transformers through optimized guidance scheduling and adaptive caching, enabling high-quality image generation with substantially reduced computational cost.

Abstract: Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.

[102] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

Main category: cs.CV

TL;DR: ViewMask-1-to-3 introduces a discrete diffusion approach for multi-view image generation from single image+text, achieving state-of-the-art results without complex 3D constraints.

DetailsMotivation: Multi-view generation from single image+text is challenging due to geometric consistency issues. Existing methods need extensive multi-view data and complex 3D priors, so a simpler approach is needed.

Method: Formulates multi-view synthesis as discrete sequence modeling using MAGVIT-v2 tokenization. Uses masked token prediction to unify language/vision, with iterative token unmasking and self-attention for cross-view consistency.
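
Masked-token generation of a target view can be pictured as MaskGIT-style iterative unmasking: start from an all-masked token grid and repeatedly commit the most confident predictions. The routine below is a generic sketch of that loop; the real model would condition on the input-view tokens and text prompt, which are abstracted here into a `predict_logits` callable.

```python
import torch

def iterative_unmask(predict_logits, seq_len, mask_id, steps=8):
    """Fill an all-masked token sequence by committing the most confident
    predictions a few positions at a time (MaskGIT-style sketch)."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        probs = predict_logits(tokens).softmax(-1)      # [seq_len, vocab]
        conf, pred = probs.max(-1)
        conf = conf.masked_fill(~masked, -1.0)           # only unmask masked slots
        n = max(1, int(masked.sum()) * (step + 1) // steps)
        idx = conf.topk(n).indices
        tokens[idx] = pred[idx]
    return tokens

# Toy usage with a random "model" standing in for the multimodal transformer.
vocab, seq_len, mask_id = 1024, 16, 1024
tokens = iterative_unmask(lambda t: torch.randn(seq_len, vocab), seq_len, mask_id)
```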

Result: Achieves first place on average across GSO and 3D-FUTURE datasets in PSNR, SSIM, and LPIPS metrics while maintaining architectural simplicity.

Conclusion: Discrete diffusion provides a viable, simple alternative to existing multi-view generation methods, eliminating need for complex 3D geometric constraints or specialized architectures.

Abstract: Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.

[103] Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries

Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof

Main category: cs.CV

TL;DR: RUNE combines LLMs with neurosymbolic AI for explainable text-to-image retrieval in remote sensing, using explicit reasoning over detected entities and FOL expressions instead of implicit embeddings.

DetailsMotivation: Current RS-LVLMs have limited explainability and poor handling of complex spatial relations, which hinders real-world applications in remote sensing retrieval tasks.

Method: Uses LLMs to translate text queries into First-Order Logic expressions, then performs neurosymbolic reasoning over detected entities. Includes logic decomposition for scalability and only uses foundation models for logic generation, not end-to-end retrieval.
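
The retrieval decision reduces to checking whether the entities detected in an image satisfy a logical formula produced by the LLM. A toy version of that neurosymbolic check, with made-up classes, coordinates, and a threshold for the `near` predicate, might look like this:

```python
from itertools import product

# Detected entities for one image: (class_name, (center_x, center_y)).
detections = [
    ("ship",   (120.0, 340.0)),
    ("ship",   (155.0, 362.0)),
    ("harbor", (140.0, 350.0)),
]

def near(a, b, max_dist=60.0):
    """Toy spatial predicate over entity centers; the threshold is invented."""
    (ax, ay), (bx, by) = a[1], b[1]
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_dist

def satisfies(dets):
    """Exists x, y: ship(x) AND harbor(y) AND near(x, y) -- the kind of FOL
    expression an LLM might emit for the query 'a ship close to a harbor'."""
    return any(x[0] == "ship" and y[0] == "harbor" and near(x, y)
               for x, y in product(dets, dets))

print(satisfies(detections))  # True -> the image is retrieved
```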

Result: Outperforms state-of-the-art RS-LVLMs, demonstrates superior performance on complex retrieval tasks, and shows effectiveness in text-to-logic translation. Introduces new metrics RRQC and RRIU for robustness evaluation.

Conclusion: RUNE offers better performance, robustness, and explainability than joint-embedding models, with demonstrated potential for real-world applications like post-flood satellite image retrieval.

Abstract: Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation, we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM’s effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE’s potential for real-world RS applications through a use case on post-flood satellite image retrieval.

[104] Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach

Ashish Mishra, Gyanaranjan Nayak, Tarun Kumar, Arpit Shah, Suparna Bhattacharya, Martin Foltin

Main category: cs.CV

TL;DR: A training- and data-free framework for unlearning specific object classes from CLIP models using multimodal nullspace integration of text prompts and synthesized visual prototypes.

DetailsMotivation: Real-world applications require removing specific object classes from pretrained models like CLIP without additional data or retraining, while preserving performance on unrelated tasks. Existing retraining-based methods are limited and inefficient.

Method: Leverages multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP’s joint embedding space to remove undesired class information while preserving remaining knowledge.

Result: Enables three distinct forgetting paradigms: global unlearning across all domains, domain-specific knowledge removal, and complete unlearning in selective domains, offering flexible and computationally efficient forgetting.

Conclusion: The proposed framework overcomes limitations of existing retraining-based methods and provides a practical solution for controlled model forgetting in real-world applications.

Abstract: Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or “unlearning”) of specific object classes without requiring additional data or retraining, or affecting the model’s performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP’s joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.

[105] MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction

Rui-Yang Ju, KokSheik Wong, Yanlin Jin, Jen-Shiun Chiang

Main category: cs.CV

TL;DR: MFE-GAN is an efficient GAN framework for document image enhancement and binarization that uses multi-scale feature extraction with Haar wavelet transformation to reduce training/inference times while maintaining SOTA performance.

DetailsMotivation: Existing methods use multiple independent GANs for different color channels to remove shadows/noise from degraded documents, but this results in long training and inference times. There's a need for more efficient document image enhancement and binarization models.

Method: Proposes MFE-GAN with multi-scale feature extraction (MFE) incorporating Haar wavelet transformation (HWT) and normalization to preprocess images before GAN training. Introduces novel generators, discriminators, and loss functions, with ablation studies to validate design choices.
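
As a rough illustration of the preprocessing idea, a single-level 2D Haar wavelet transform splits an image into one low-frequency and three high-frequency subbands that can be stacked as a multi-scale input; the version below is an unnormalized averaging variant, and how MFE-GAN actually wires the subbands into its generators is not specified in the abstract.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar transform (H and W assumed even).
    Returns LL, LH, HL, HH subbands at half resolution."""
    img = img.astype(np.float64)
    lo = (img[0::2, :] + img[1::2, :]) / 2.0      # row low-pass
    hi = (img[0::2, :] - img[1::2, :]) / 2.0      # row high-pass
    ll = (lo[:, 0::2] + lo[:, 1::2]) / 2.0
    lh = (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    hl = (hi[:, 0::2] + hi[:, 1::2]) / 2.0
    hh = (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, lh, hl, hh

subbands = haar_dwt2(np.random.rand(256, 256))
multi_scale_input = np.stack(subbands)            # 4 x 128 x 128 stack for the GAN
```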

Result: Experimental results on Benchmark, Nabuco, and CMATERdb datasets show MFE-GAN significantly reduces total training and inference times while maintaining comparable performance with state-of-the-art methods.

Conclusion: MFE-GAN provides an efficient solution for document image enhancement and binarization that balances performance with computational efficiency, making it practical for real-world OCR applications.

Abstract: Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model’s performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at https://ruiyangju.github.io/MFE-GAN.

[106] SportsGPT: An LLM-driven Framework for Interpretable Sports Motion Assessment and Training Guidance

Wenbo Tian, Ruting Lin, Hongxian Zheng, Yaodong Yang, Geng Wu, Zihao Zhang, Zhang Zhang

Main category: cs.CV

TL;DR: SportsGPT is an LLM-driven framework for interpretable sports motion assessment and training guidance that establishes a closed loop from motion input to professional training recommendations.

DetailsMotivation: Existing sports analysis systems focus mainly on scoring and visualization but lack automatic performance diagnosis and interpretable training guidance. Recent advances in Large Language Models (LLMs) and motion analysis techniques provide opportunities to address these limitations.

Method: 1) MotionDTW: A two-stage time series alignment algorithm for accurate keyframe extraction from skeleton-based motion sequences. 2) KISMAM: Knowledge-based Interpretable Sports Motion Assessment Model that obtains interpretable assessment metrics by contrasting keyframes with target models. 3) SportsRAG: A RAG-based training guidance model using Qwen3 LLM with a 6B-token knowledge base to generate professional training guidance by retrieving domain-specific QA pairs.
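
For intuition, the alignment backbone of MotionDTW is dynamic time warping between the user's motion sequence and a target model; the plain single-stage DTW below is only a baseline sketch (the paper's two-stage variant and keyframe extraction are not detailed in the abstract), with random arrays standing in for pose features.

```python
import numpy as np

def dtw_cost(seq_a, seq_b):
    """Cumulative dynamic-time-warping cost between two feature sequences
    of shape (T_a, D) and (T_b, D)."""
    ta, tb = len(seq_a), len(seq_b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # skip a frame in b
                                 cost[i, j - 1],       # skip a frame in a
                                 cost[i - 1, j - 1])   # match the two frames
    return cost[ta, tb]

user_motion   = np.random.rand(120, 34)   # e.g. 17 joints x (x, y) per frame
target_motion = np.random.rand(100, 34)
print(dtw_cost(user_motion, target_motion))
```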

Result: MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Ablation studies validate KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.

Conclusion: SportsGPT successfully addresses the limitations of existing sports analysis systems by providing interpretable motion assessment and professional training guidance through an integrated LLM-driven framework that combines motion analysis with domain knowledge retrieval.

Abstract: Existing intelligent sports analysis systems mainly focus on “scoring and visualization,” often lacking automatic performance diagnosis and interpretable training guidance. Recent advances in Large Language Models (LLMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by contrasting the keyframes with the target models. Finally, we propose SportsRAG, a RAG-based training guidance model based on Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.

[107] Consistent Instance Field for Dynamic Scene Understanding

Junyi Wu, Van Nguyen Nguyen, Benjamin Planche, Jiachen Tao, Changchang Sun, Zhongpai Gao, Zhenghao Zhao, Anwesa Choudhuri, Gengyu Zhang, Meng Zheng, Feiran Wang, Terrence Chen, Yan Yan, Ziyan Wu

Main category: cs.CV

TL;DR: A continuous probabilistic spatio-temporal representation called Consistent Instance Field that disentangles visibility from object identity for dynamic scene understanding, outperforming SOTA methods on novel-view panoptic segmentation and 4D querying tasks.

DetailsMotivation: Prior methods for dynamic scene understanding rely on discrete tracking or view-dependent features, which don't properly disentangle visibility from persistent object identity across space and time.

Method: Introduces instance-embedded representation using deformable 3D Gaussians that jointly encode radiance and semantic information. Includes mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions for consistent instance representations.

Result: Significantly outperforms state-of-the-art methods on HyperNeRF and Neu3D datasets for novel-view panoptic segmentation and open-vocabulary 4D querying tasks.

Conclusion: The Consistent Instance Field provides a continuous probabilistic representation that effectively disentangles visibility from object identity, enabling consistent instance understanding across space and time in dynamic scenes.

Abstract: We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.

[108] Erasing CLIP Memories: Non-Destructive, Data-Free Zero-Shot class Unlearning in CLIP Models

Ashish Mishra, Tarun Kumar, Gyanaranjan Nayak, Arpit Shah, Suparna Bhattacharya, Martin Foltin

Main category: cs.CV

TL;DR: A closed-form nullspace projection method for selective unlearning in multimodal models like CLIP that erases target class information without retraining or forget set images.

DetailsMotivation: Traditional unlearning methods require iterative fine-tuning and extensive data curation, which is computationally expensive. There's a need for efficient, precise methods to remove specific knowledge from pretrained models for privacy preservation and model decontamination.

Method: Leverages nullspace projection to erase target class information from the final projection layer. Computes an orthonormal basis for the subspace spanned by target text embeddings and projects these directions to reduce alignment between image features and undesired classes.
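
The closed-form edit can be pictured as removing the span of the target-class text embeddings from the output of a projection layer. The numpy sketch below shows that projection under assumed shapes; which projection matrix is edited, and how a partial projection is scaled, are assumptions rather than details from the paper.

```python
import numpy as np

def forget_projection(W, target_text_embeds, alpha=1.0):
    """Project target-class directions out of a projection matrix.

    W:                  (d_out, d_in) weights of the final projection layer.
    target_text_embeds: (k, d_out) text embeddings of the classes to forget.
    alpha:              1.0 = full unlearning, <1.0 = partial projection.
    """
    # Orthonormal basis of the subspace spanned by the target embeddings.
    U, _, _ = np.linalg.svd(target_text_embeds.T, full_matrices=False)  # (d_out, k)
    P = np.eye(W.shape[0]) - alpha * (U @ U.T)
    return P @ W

W = np.random.randn(512, 768)        # hypothetical projection-layer shape
targets = np.random.randn(3, 512)    # embeddings of three classes to erase
W_edited = forget_projection(W, targets, alpha=1.0)
```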

Result: Dramatically reduces zero-shot performance for target classes while preserving overall multimodal knowledge. Partial projection can balance between complete unlearning and retaining useful information. The method is computationally efficient and surgically precise.

Conclusion: The closed-form nullspace projection approach provides an effective solution for selective unlearning in multimodal models, addressing key challenges in model decontamination and privacy preservation without requiring retraining or forget set images.

Abstract: We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.

[109] SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan

Main category: cs.CV

TL;DR: SketchAssist: A unified sketch editing assistant that combines instruction-guided global edits with line-guided local redrawing while preserving style and structure, enabled by a novel controllable data generation pipeline.

DetailsMotivation: Existing image editing systems struggle with sketch editing because they fail to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing.

Method: 1) Controllable data generation pipeline: attribute-addition sequences, multi-step edit chains via cross-sequence sampling, style-preserving attribute-removal model. 2) Unified sketch editing framework using DiT-based editors with RGB channels repurposed for input encoding. 3) Task-guided mixture-of-experts in LoRA layers for specialized behavior across editing modes.

Result: State-of-the-art results on both instruction-guided edits and line-guided redrawing tasks, with superior instruction adherence and style/structure preservation compared to recent baselines.

Conclusion: SketchAssist provides a practical, controllable assistant for sketch creation and revision, unifying global semantic edits with precise local redrawing while maintaining overall composition and style fidelity.

Abstract: Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.

[110] TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models

Hanning Chen, Keyu Man, Kevin Zhu, Chenguang Zhu, Haonan Li, Tongbo Luo, Xizhou Feng, Wei Sun, Sreen Tallam, Mohsen Imani, Partha Kanuparthy

Main category: cs.CV

TL;DR: A benchmark dataset and iterative ML+LLM approach for detecting performance anti-patterns in PyTorch traces, outperforming traditional methods.

DetailsMotivation: Current methods for identifying ML performance anti-patterns require deep expertise and are resource-intensive, making them inaccessible to most computer vision researchers. Pinpointing problematic trace segments is time-consuming and difficult to automate with existing ML models.

Method: Created first benchmark dataset with 600+ PyTorch traces from diverse CV models across multiple hardware platforms. Proposed iterative approach: lightweight ML model detects anti-pattern segments first, then LLM provides fine-grained classification and targeted feedback.
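
The two-stage idea can be sketched as: summarize the raw trace into fixed-size windows of cheap statistics, train a lightweight classifier to flag suspicious windows, and only then hand the flagged spans to an LLM. Everything below (the window features, the synthetic durations and labels) is illustrative, not the dataset's actual schema.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(durations_us, window=64):
    """Cheap per-window statistics over a stream of op/kernel durations."""
    feats = []
    for start in range(0, len(durations_us) - window + 1, window):
        w = durations_us[start:start + window]
        feats.append([w.mean(), w.max(), (w < 1.0).mean()])  # mean, peak, tiny-op share
    return np.array(feats)

rng = np.random.default_rng(0)
X = window_features(rng.exponential(50.0, size=4096))   # synthetic trace durations
y = np.resize([0, 1], len(X))                            # synthetic window labels

# Stage 1: lightweight model flags windows that may contain anti-patterns.
flagger = LogisticRegression().fit(X, y)
suspicious = np.where(flagger.predict(X) == 1)[0]
# Stage 2 (not shown): only the flagged windows are sent to an LLM for
# fine-grained anti-pattern classification and targeted feedback.
```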

Result: The method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. It effectively compensates for LLM’s limited context length and reasoning inefficiencies.

Conclusion: The proposed benchmark and iterative ML+LLM approach provides an accessible, automated solution for detecting performance anti-patterns in ML model traces, addressing a critical gap in ML optimization workflows.

Abstract: Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models’ ability to detect anti-patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models (classification, detection, segmentation, and generation), collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti-patterns, followed by a large language model (LLM) for fine-grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. Our method also effectively compensates for the LLM’s limited context length and reasoning inefficiencies.

[111] CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World

Shuxin Zhao, Bo Lang, Nan Xiao, Yilang Zhang

Main category: cs.CV

TL;DR: CIS-BA is a novel backdoor attack on object detection systems that uses continuous inter-object interaction patterns as triggers instead of static object features, enabling robust multi-trigger-multi-object attacks that evade defenses.

DetailsMotivation: Existing backdoor attacks on object detection systems are limited by their dependence on single-trigger-single-object mappings and fragile pixel-level cues, making them insufficient for real-world applications like autonomous driving that face complex interaction scenarios.

Method: CIS-BA models inter-object interaction patterns as a continuous interaction space, creating “space triggers” that represent geometric relations between objects. CIS-Frame implements this by constructing space triggers via interaction analysis, formalizing them as class-geometry constraints for sample poisoning, and embedding the backdoor during detector training.

Result: The attack achieves over 97% success rate in complex environments, maintains over 95% effectiveness under dynamic multi-trigger conditions, and evades three state-of-the-art defenses on MS-COCO and real-world video datasets.

Conclusion: CIS-BA extends backdoor attack capabilities to interaction-intensive scenarios and provides new insights into object detection system security by demonstrating the vulnerability of systems to attacks based on object interaction patterns rather than static features.

Abstract: Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness, existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.

[112] FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation

Qingyuan Cai, Linxin Zhang, Xuecai Hu, Saihui Hou, Yongzhen Huang

Main category: cs.CV

TL;DR: Fast3DHPE is a unified framework for fair comparison and efficient training of 3D human pose estimation methods, with FastDDHPose achieving SOTA performance using disentangled diffusion modeling.

DetailsMotivation: Existing 3D human pose estimation methods lack standardized training/evaluation protocols, making fair comparisons difficult and training inefficient. There's a need for a unified framework to address these issues.

Method: Proposes Fast3DHPE framework with standardized protocols. Introduces FastDDHPose method using disentangled diffusion modeling to separately model bone length and direction distributions, with a Kinematic-Hierarchical Spatial and Temporal Denoiser to focus on joint hierarchies.
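
The disentanglement itself is straightforward to picture: a 3D pose factors into per-bone lengths and unit directions along a kinematic tree, and the pose is recovered by walking the tree from the root. The sketch below uses a hypothetical 17-joint parent list; the diffusion model would then model the length and direction distributions separately.

```python
import numpy as np

# Hypothetical 17-joint kinematic tree: parent index per joint (-1 = root).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def disentangle(joints):
    """Split a pose (17 x 3) into bone lengths and unit bone directions."""
    lengths, directions = [], []
    for j, p in enumerate(PARENTS):
        if p < 0:
            continue
        bone = joints[j] - joints[p]
        n = np.linalg.norm(bone)
        lengths.append(n)
        directions.append(bone / (n + 1e-8))
    return np.array(lengths), np.array(directions)

def reassemble(root, lengths, directions):
    """Rebuild joint positions from the disentangled representation."""
    joints = np.zeros((len(PARENTS), 3))
    joints[0] = root
    k = 0
    for j, p in enumerate(PARENTS):
        if p < 0:
            continue
        joints[j] = joints[p] + lengths[k] * directions[k]
        k += 1
    return joints
```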

Result: Fast3DHPE enables fair comparison across methods while significantly improving training efficiency. FastDDHPose achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP datasets with strong generalization to in-the-wild scenarios.

Conclusion: The Fast3DHPE framework solves the standardization problem in 3D HPE research, and FastDDHPose demonstrates superior performance through disentangled diffusion modeling and efficient denoising architecture.

Abstract: Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: https://github.com/Andyen512/Fast3DHPE

[113] Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi

Main category: cs.CV

TL;DR: SGPU is a Bayesian framework that uses Gaussian Process Classifier on spectral representations of answer embeddings to quantify semantic uncertainty in LVLMs, avoiding fragile clustering methods.

DetailsMotivation: LVLMs produce plausible but unreliable outputs, requiring robust uncertainty estimation. Existing semantic uncertainty methods rely on brittle clustering that is sensitive to phrasing variations and can incorrectly group/separate semantically similar answers.

Method: SGPU maps generated answers into dense semantic space, computes Gram matrix of embeddings, summarizes semantic configuration via eigenspectrum, and feeds this spectral representation into Gaussian Process Classifier to learn mapping from semantic consistency patterns to predictive uncertainty.
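
The pipeline is easy to sketch end to end: embed the sampled answers, take the eigenspectrum of their Gram matrix as a fixed-length descriptor of how tightly they agree, and fit a Gaussian Process Classifier from that descriptor to correctness. The training data below is synthetic and the feature dimensions are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

def spectral_feature(answer_embeddings, k=8):
    """Top-k eigenvalues of the Gram matrix of sampled-answer embeddings."""
    E = answer_embeddings / np.linalg.norm(answer_embeddings, axis=1, keepdims=True)
    eig = np.linalg.eigvalsh(E @ E.T)[::-1]          # descending eigenspectrum
    out = np.zeros(k)
    out[: min(k, len(eig))] = eig[:k]
    return out

# Synthetic training set: spectra of answer sets labelled correct (1) / wrong (0).
rng = np.random.default_rng(0)
X = np.stack([spectral_feature(rng.normal(size=(10, 64))) for _ in range(40)])
y = np.resize([0, 1], 40)

gpc = GaussianProcessClassifier().fit(X, y)
uncertainty = 1.0 - gpc.predict_proba(X[:1])[0].max()   # predictive uncertainty
```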

Result: Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. SGPU transfers across models and modalities.

Conclusion: SGPU provides a robust Bayesian framework for semantic uncertainty estimation that avoids fragile clustering, captures general patterns of semantic uncertainty, and works in both black-box and white-box settings with strong cross-model transferability.

Abstract: Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.

[114] Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi

Main category: cs.CV

TL;DR: Spherical Voronoi (SV) replaces Spherical Harmonics for appearance modeling in 3D Gaussian Splatting, offering better handling of high-frequency signals, specular reflections, and reduced optimization complexity.

DetailsMotivation: Current radiance field methods using Spherical Harmonics (SH) have limitations: they struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections essential for realistic rendering. Existing alternatives like spherical Gaussians add significant optimization complexity.

Method: Proposes Spherical Voronoi (SV) as a unified framework for appearance representation. SV partitions the directional domain into learnable regions with smooth boundaries, providing intuitive parameterization for view-dependent effects. For reflections, SV serves as learnable reflection probes using reflected directions as input following classical graphics principles.
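
One way to picture the representation: a set of learnable "site" directions and per-cell colors, where a query direction is colored by a softmax over its similarity to the sites, giving a soft Voronoi partition of the sphere. The sharpness, number of sites, and use of plain RGB below are assumptions for illustration; for reflective surfaces one would query with the reflected direction instead.

```python
import torch
import torch.nn.functional as F

def spherical_voronoi_color(view_dirs, sites, site_colors, sharpness=20.0):
    """Soft spherical-Voronoi lookup.

    view_dirs:   (N, 3) unit query directions.
    sites:       (K, 3) learnable cell centers (normalized inside).
    site_colors: (K, 3) learnable per-cell colors.
    """
    sites = F.normalize(sites, dim=-1)
    logits = sharpness * (view_dirs @ sites.T)       # similarity to each cell
    weights = logits.softmax(dim=-1)                 # soft cell assignment
    return weights @ site_colors                     # blended view-dependent color

dirs = F.normalize(torch.randn(4096, 3), dim=-1)
rgb = spherical_voronoi_color(dirs, torch.randn(16, 3), torch.rand(16, 3))
```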

Result: SV achieves competitive results for diffuse appearance with simpler optimization than alternatives. For reflections where SH fail, SV attains state-of-the-art results on both synthetic and real-world datasets.

Conclusion: SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations, overcoming SH limitations while maintaining optimization simplicity.

Abstract: Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.

[115] Fracture Morphology Classification: Local Multiclass Modeling for Multilabel Complexity

Cassandra Krause, Mattias P. Heinrich, Ron Keuth

Main category: cs.CV

TL;DR: Proposes method to extract fracture morphology by automatically assigning global AO codes to fracture bounding boxes, improving F1 score by 7.89% but performance declines with imperfect detectors.

DetailsMotivation: 15-45% of children experience fractures during growth years, making accurate diagnosis essential. Fracture morphology is a key diagnostic feature alongside location and fragment angle.

Method: Proposes method to extract fracture morphology by automatically assigning global AO codes to corresponding fracture bounding boxes. Reformulates global multilabel task into local multiclass task, enabling use of public datasets.

Result: Method improves average F1 score by 7.89% compared to previous approaches. However, performance declines when using imperfect fracture detectors, highlighting challenges for real-world deployment.

Conclusion: The proposed approach shows promise for fracture morphology extraction but faces practical challenges with imperfect detectors in real-world scenarios. Code is available on GitHub.

Abstract: Between 15% and 45% of children experience a fracture during their growth years, making accurate diagnosis essential. Fracture morphology, alongside location and fragment angle, is a key diagnostic feature. In this work, we propose a method to extract fracture morphology by automatically assigning global AO codes to corresponding fracture bounding boxes. This approach enables the use of public datasets and reformulates the global multilabel task into a local multiclass one, improving the average F1 score by 7.89%. However, performance declines when using imperfect fracture detectors, highlighting challenges for real-world deployment. Our code is available on GitHub.

[116] Beyond a Single Light: A Large-Scale Aerial Dataset for Urban Scene Reconstruction Under Varying Illumination

Zhuoxiao Li, Wenzong Ma, Taoyu Wu, Jinjing Zhu, Zhenchao Q, Shuai Zhang, Jing Ou, Yinrui Ren, Weiqing Qi, Guobin Shen, Hui Xiong, Wufan Zhao

Main category: cs.CV

TL;DR: SkyLume dataset for illumination-robust 3D reconstruction from UAV multi-temporal data with LiDAR ground truth and new evaluation metric.

DetailsMotivation: Multi-temporal UAV captures suffer from illumination inconsistencies causing color artifacts and geometric inaccuracies, but lack of datasets with systematic illumination variations hinders research.

Method: Created SkyLume dataset with 10 urban regions, 100k+ high-res UAV images captured at three times of day, plus LiDAR scans and 3D ground truth for evaluation.

Result: Dataset enables systematic study of illumination effects; introduced Temporal Consistency Coefficient metric for evaluating albedo stability in inverse rendering.

Conclusion: SkyLume provides foundation for advancing research in large-scale inverse rendering, geometry reconstruction, and novel view synthesis under varying illumination.

Abstract: Recent advances in Neural Radiance Fields and 3D Gaussian Splatting have demonstrated strong potential for large-scale UAV-based 3D reconstruction tasks by fitting the appearance of images. However, real-world large-scale captures are often based on multi-temporal data capture, where illumination inconsistencies across different times of day can lead to significant color artifacts, geometric inaccuracies, and inconsistent appearance. Due to the lack of UAV datasets that systematically capture the same areas under varying illumination conditions, this challenge remains largely underexplored. To fill this gap, we introduce SkyLume, a large-scale, real-world UAV dataset specifically designed for studying illumination-robust 3D reconstruction in urban scene modeling: (1) We collect data from 10 urban regions, comprising more than 100k high-resolution UAV images (four oblique views and nadir), where each region is captured at three periods of the day to systematically isolate illumination changes. (2) To support precise evaluation of geometry and appearance, we provide per-scene LiDAR scans and accurate 3D ground-truth for assessing depth, surface normals, and reconstruction quality under varying illumination. (3) For the inverse rendering task, we introduce the Temporal Consistency Coefficient (TCC), a metric that measures cross-time albedo stability and directly evaluates the robustness of the disentanglement of light and material. We aim for this resource to serve as a foundation that advances research and real-world evaluation in large-scale inverse rendering, geometry reconstruction, and novel view synthesis.

[117] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

Main category: cs.CV

TL;DR: DRAW2ACT is a depth-aware trajectory-conditioned video generation framework for robotic manipulation that extracts multiple orthogonal representations from input trajectories and jointly generates aligned RGB and depth videos, enabling more controllable and consistent robotic demonstrations.

DetailsMotivation: Video diffusion models are powerful real-world simulators for embodied AI but lack controllability for robotic manipulation. Existing trajectory-conditioned video generation methods are limited by 2D trajectories or single modality conditioning, restricting their ability to produce controllable and consistent robotic demonstrations.

Method: 1) Extracts multiple orthogonal representations from input trajectories capturing depth, semantics, shape, and motion, injecting them into diffusion models. 2) Jointly generates spatially aligned RGB and depth videos using cross-modality attention mechanisms and depth supervision. 3) Uses a multimodal policy model conditioned on generated RGB and depth sequences to regress robot joint angles.

Result: Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

Conclusion: DRAW2ACT addresses the controllability limitations of video diffusion models for robotic manipulation by introducing depth-aware trajectory conditioning and joint RGB-depth generation, resulting in more consistent demonstrations and improved manipulation performance.

Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot’s joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

[118] History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin

Main category: cs.CV

TL;DR: HETT framework improves UAV navigation by combining global reasoning and local scene analysis through a two-stage coarse-to-fine approach with historical spatial memory.

DetailsMotivation: Existing UAV agents for aerial vision-language navigation use mono-granularity frameworks that struggle to balance global environmental reasoning with local scene comprehension, limiting their navigation performance in large-scale urban environments.

Method: Proposes History-Enhanced Two-Stage Transformer (HETT) with coarse-to-fine pipeline: 1) coarse-grained target position prediction by fusing spatial landmarks and historical context, 2) fine-grained action refinement via visual analysis, plus historical grid map for structured spatial memory aggregation.
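
The historical grid map can be thought of as scattering each step's visual features into a fixed 2D grid keyed by world position, so later decisions can attend over a persistent spatial memory. The update rule and cell size below are illustrative assumptions, not HETT's exact aggregation.

```python
import torch

def update_grid_map(grid, feats, positions, origin, cell=10.0):
    """Aggregate per-step visual features into a 2D spatial memory.

    grid:      (H, W, D) running feature map.
    feats:     (N, D) visual features observed at this step.
    positions: (N, 2) their (x, y) world coordinates.
    origin:    (x_min, y_min) of the mapped area; `cell` is meters per cell.
    """
    x_min, y_min = origin
    ix = ((positions[:, 0] - x_min) / cell).long().clamp(0, grid.shape[1] - 1)
    iy = ((positions[:, 1] - y_min) / cell).long().clamp(0, grid.shape[0] - 1)
    for f, gx, gy in zip(feats, ix, iy):
        grid[gy, gx] = 0.5 * grid[gy, gx] + 0.5 * f   # simple running average
    return grid

grid = torch.zeros(64, 64, 128)
grid = update_grid_map(grid, torch.randn(10, 128), torch.rand(10, 2) * 600.0, (0.0, 0.0))
```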

Result: Experiments on refined CityNav dataset show significant performance gains, with ablation studies verifying effectiveness of each component.

Conclusion: HETT successfully integrates global reasoning and local comprehension for improved aerial vision-language navigation, demonstrating the value of hierarchical approaches and structured spatial memory in UAV navigation tasks.

Abstract: Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.

[119] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang

Main category: cs.CV

TL;DR: OminiGen is a unified framework for generating aligned multimodal sensor data (LiDAR and multi-view cameras) for autonomous driving, using BEV space unification and novel reconstruction methods.

DetailsMotivation: Real-world data collection for autonomous driving is costly and inefficient, especially for diverse corner cases. Existing generative approaches focus on single modalities, causing inefficiencies and misalignment in multimodal sensor data.

Method: 1) Uses shared Bird’s Eye View (BEV) space to unify multimodal features; 2) Proposes UAE (generalizable multimodal reconstruction method) for joint LiDAR and camera decoding via volume rendering; 3) Incorporates Diffusion Transformer (DiT) with ControlNet branch for controllable generation.

Result: Comprehensive experiments demonstrate OminiGen achieves desired performance in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.

Conclusion: OminiGen successfully addresses multimodal sensor data generation challenges by providing a unified framework that ensures alignment and enables controllable generation of LiDAR and camera data.

Abstract: Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OminiGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird’s Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OminiGen achieves the desired performance in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.

[120] Multi-View MRI Approach for Classification of MGMT Methylation in Glioblastoma Patients

Rawan Alyahya, Asrar Alruwayqi, Atheer Alqarni, Asma Alkhaldi, Metab Alkubeyyer, Xin Gao, Mona Alshahrani

Main category: cs.CV

TL;DR: A multi-view deep learning approach using MRI scans to non-invasively detect MGMT promoter methylation in glioblastoma patients, avoiding complex 3D models while maintaining spatial relationships between MRI views.

DetailsMotivation: Current MGMT promoter methylation detection requires invasive brain tumor biopsies. There's a need for non-invasive methods to identify this genetic marker that significantly affects chemotherapy effectiveness in glioblastoma treatment.

Method: Proposed a multi-view deep learning approach using MRI scans that considers spatial relationships between MRI views without using complex 3D models. Introduced a new technique for tumor slice extraction and compared against state-of-the-art models.

Result: Demonstrated superiority over existing methods based on multiple evaluation metrics. The approach effectively detects MGMT methylation status while avoiding issues of high parameter count, slow convergence, and substantial memory demands.

Conclusion: Highlights the potential of non-invasive radiogenomics for MGMT methylation detection, contributes to advancing precision medicine in GBM treatment, and promotes transparency through a reproducible pipeline of published models.

Abstract: The presence of MGMT promoter methylation significantly affects how well chemotherapy works for patients with Glioblastoma Multiforme (GBM). Currently, confirmation of MGMT promoter methylation relies on invasive brain tumor tissue biopsies. In this study, we explore radiogenomics techniques, a promising approach in precision medicine, to identify genetic markers from medical images. Using MRI scans and deep learning models, we propose a new multi-view approach that considers spatial relationships between MRI views to detect MGMT methylation status. Importantly, our method extracts information from all three views without using a complicated 3D deep learning model, avoiding issues associated with high parameter count, slow convergence, and substantial memory demands. We also introduce a new technique for tumor slice extraction and show its superiority over existing methods based on multiple evaluation metrics. By comparing our approach to state-of-the-art models, we demonstrate the efficacy of our method. Furthermore, we share a reproducible pipeline of published models, encouraging transparency and the development of robust diagnostic tools. Our study highlights the potential of non-invasive methods for identifying MGMT promoter methylation and contributes to advancing precision medicine in GBM treatment.

[121] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli

Main category: cs.CV

TL;DR: ViBES is a conversational 3D agent that jointly plans language and movement using a mixture-of-modality-experts architecture, enabling socially competent multimodal interaction beyond isolated speech-conditioned motion generation.

DetailsMotivation: Current systems treat human behavior as translation tasks (co-speech gesture or text-to-motion) without agentic decision-making, leading to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained/inferred in isolation.

Method: ViBES uses a speech-language-behavior (SLB) model with mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. It processes interleaved multimodal token streams with hard routing by modality while sharing information through cross-expert attention.
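
The routing idea is simple to sketch: every token carries a modality id, and each modality gets its own feed-forward expert, with parameters split per expert (cross-expert attention, which shares information between streams, is omitted here). Dimensions and the three-way modality split below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ModalityExperts(nn.Module):
    """Hard-routed per-modality feed-forward experts (simplified MoME block)."""

    def __init__(self, dim=256, n_modalities=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        )

    def forward(self, tokens, modality_ids):
        """tokens: (T, D); modality_ids: (T,) in {0: speech, 1: face, 2: body}."""
        out = torch.empty_like(tokens)
        for m, expert in enumerate(self.experts):
            sel = modality_ids == m
            if sel.any():
                out[sel] = expert(tokens[sel])        # route tokens to their expert
        return out

layer = ModalityExperts()
routed = layer(torch.randn(12, 256), torch.randint(0, 3, (12,)))
```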

Result: The system supports mixed-initiative interaction (users can speak, type, or issue body-action directives mid-conversation) and shows consistent gains over strong co-speech and text-to-motion baselines in multi-turn conversation benchmarks using automatic metrics of dialogue-motion alignment and behavior quality.

Conclusion: ViBES moves beyond “speech-conditioned motion generation” toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction.

Abstract: Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond “speech-conditioned motion generation” toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/

[122] 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation

Jimmie Kwok, Holger Caesar, Andras Palffy

Main category: cs.CV

TL;DR: 4D-RaDiff: A diffusion-based framework for generating synthetic 4D radar point clouds to address limited annotated radar data, improving object detection performance and reducing annotation requirements by up to 90%.

DetailsMotivation: Automotive radar is cost-effective and robust in adverse weather, but limited annotated radar data hinders advancement of radar-based perception systems. There's a need to generate realistic radar data for training and evaluation.

Method: Proposes 4D-RaDiff framework that applies diffusion to latent point cloud representations (not image-based), considering radar point cloud sparsity and characteristics. Generates 4D radar point clouds via conditioning at object or scene level, converting unlabeled bounding boxes into radar annotations and transforming LiDAR data into realistic radar scenes.
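
The conditioning and latent-space diffusion can be pictured with a standard epsilon-prediction objective; the sketch below is a generic schematic under that assumption (the `LatentDenoiser` and all shapes are placeholders, not the paper's architecture).

```python
# Schematic training step for diffusion in a latent point-cloud space,
# conditioned on box- or scene-level context (all sizes illustrative).
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    """Predicts the noise added to a latent radar representation."""
    def __init__(self, latent_dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )
    def forward(self, z_t, cond, t):
        t = t.float().unsqueeze(-1) / 1000.0  # normalized timestep
        return self.net(torch.cat([z_t, cond, t], dim=-1))

def diffusion_loss(model, z0, cond, alphas_cumprod):
    """Standard epsilon-prediction objective on the latent point cloud."""
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],))
    a = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps    # forward noising
    return nn.functional.mse_loss(model(z_t, cond, t), eps)

model = LatentDenoiser()
alphas = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)
loss = diffusion_loss(model, torch.randn(8, 64), torch.randn(8, 32), alphas)
print(loss.item())
```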

Result: Synthetic radar data from 4D-RaDiff consistently improves object detection performance when used as data augmentation. Pre-training on synthetic data reduces required annotated radar data by up to 90% while achieving comparable detection performance.

Conclusion: The proposed 4D-RaDiff framework effectively addresses the radar data scarcity problem by generating high-quality synthetic radar point clouds, enabling better training of object detectors and significantly reducing annotation requirements.

Abstract: Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data of 4D-RaDiff as data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.

[123] Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari

Main category: cs.CV

TL;DR: Elastic3D is a controllable end-to-end method for converting monocular videos to stereo videos using conditional latent diffusion, avoiding artifacts from explicit depth estimation and warping.

DetailsMotivation: Growing demand for immersive 3D content requires automated monocular-to-stereo video conversion without the artifacts caused by traditional depth estimation and warping approaches.

Method: Uses conditional latent diffusion with a novel guided VAE decoder that ensures sharp and epipolar-consistent stereo video output, providing user control over stereo strength via a scalar tuning knob.

Result: Outperforms both traditional warping-based and recent warping-free baselines on three different datasets of real-world stereo videos, setting a new standard for reliable, controllable stereo video conversion.

Conclusion: Elastic3D provides high-quality, controllable stereo video conversion through a diffusion-based approach with a guided VAE decoder, offering superior results compared to existing methods.

Abstract: The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples https://elastic3d.github.io.

[124] Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs

Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang

Main category: cs.CV

TL;DR: EVPG enhances Visual Programming for visual reasoning by converting non-differentiable VP execution into differentiable probabilistic graph inference, enabling end-to-end gradient-based optimization using final task labels.

DetailsMotivation: Previous VP enhancement methods focused only on improving LLM-generated visual programs while neglecting optimization of pre-trained models used for visual sub-tasks. The non-differentiable nature of VP prevents gradient-based optimization using final task labels, and sub-task labels are unavailable.

Method: EVPG builds a directed probabilistic graph based on variable dependency relationships during VP execution, reconstructing the non-differentiable process into differentiable exact probability inference. This enables gradient-based optimization for end-to-end supervised learning on target visual reasoning tasks.

Result: Extensive experiments show significant performance improvements on three classical complex visual reasoning tasks: GQA, NLVRv2, and Open Images, demonstrating the effectiveness and advantages of EVPG.

Conclusion: EVPG successfully overcomes the limitations of non-differentiable VP by transforming execution into probabilistic graph inference, enabling efficient gradient-based optimization and significant performance gains in complex visual reasoning tasks.

Abstract: Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Moreover, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we build a directed probabilistic graph according to the variable dependency relationships during the VP execution process, which reconstructs the non-differentiable VP execution into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.

[125] DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance

Shreedhar Govil, Didier Stricker, Jason Rambach

Main category: cs.CV

TL;DR: DriverGaze360 introduces a large-scale 360° driver attention dataset and panoramic attention prediction model to overcome limitations of narrow field-of-view approaches in understanding driver gaze behavior.

DetailsMotivation: Existing driver attention prediction methods are constrained by narrow frontal field-of-view and limited driving diversity, failing to capture full spatial context during lane changes, turns, and interactions with peripheral objects like pedestrians or cyclists.

Method: Introduces DriverGaze360 dataset (~1M gaze-labeled frames from 19 drivers) and DriverGaze360-Net model that jointly learns attention maps and attended objects using an auxiliary semantic segmentation head for panoramic attention prediction.
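
A toy version of the joint design, assuming a shared encoder feeding a gaze-attention head and an auxiliary segmentation head; the module name, shapes, and loss weighting are illustrative only.

```python
# Minimal multi-task sketch: shared encoder, attention-map head, and an
# auxiliary semantic segmentation head over a wide panoramic input.
import torch
import torch.nn as nn

class DriverAttentionNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn_head = nn.Conv2d(64, 1, 1)          # gaze / attention map
        self.seg_head = nn.Conv2d(64, n_classes, 1)   # auxiliary semantics

    def forward(self, panorama):
        f = self.encoder(panorama)
        return self.attn_head(f), self.seg_head(f)

net = DriverAttentionNet()
attn, seg = net(torch.randn(2, 3, 256, 512))          # panoramic frames
gaze_gt = torch.rand(2, 1, 64, 128)
seg_gt = torch.randint(0, 10, (2, 64, 128))
loss = nn.functional.binary_cross_entropy_with_logits(attn, gaze_gt) \
     + 0.5 * nn.functional.cross_entropy(seg, seg_gt)  # auxiliary weight
print(loss.item())
```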

Result: DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics for panoramic driving images, demonstrating improved spatial awareness across wide panoramic inputs.

Conclusion: The panoramic approach enables comprehensive omnidirectional modeling of driver gaze behavior, addressing critical limitations in existing driver attention prediction for autonomous driving systems.

Abstract: Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360$^\circ$ field of view driver attention dataset, containing $\sim$1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at https://av.dfki.de/drivergaze360.

[126] Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma

Main category: cs.CV

TL;DR: Zoom-Zero is a coarse-to-fine framework for grounded video QA that improves temporal grounding by first localizing relevant segments, then zooming into salient frames for visual verification, addressing limitations of existing GRPO methods.

DetailsMotivation: Large video-language models have limited temporal awareness for grounded video QA, and existing GRPO-based approaches struggle with faithful temporal grounding, leading to mislocalization and hallucinations in answers.

Method: A coarse-to-fine framework with two key innovations: 1) zoom-in accuracy reward that validates temporal grounding fidelity and enables fine-grained visual verification on grounded frames, and 2) token-selective credit assignment that attributes rewards to tokens responsible for temporal localization or answer generation.
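
One plausible reading of token-selective credit assignment is a per-token advantage mask in the policy-gradient loss, as sketched below; the function and reward split are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: grounding-related rewards update only the tokens that emitted the
# temporal span, while answer rewards update only the answer tokens.
import torch

def token_selective_pg_loss(logprobs, is_grounding_token,
                            grounding_advantage, answer_advantage):
    """logprobs: (B, T) log-probs of sampled tokens.
    is_grounding_token: (B, T) bool mask of temporal-localization tokens."""
    adv = torch.where(is_grounding_token,
                      grounding_advantage.unsqueeze(-1),   # e.g. zoom-in reward
                      answer_advantage.unsqueeze(-1))      # e.g. answer reward
    return -(adv.detach() * logprobs).mean()

B, T = 4, 16
loss = token_selective_pg_loss(
    logprobs=torch.randn(B, T),
    is_grounding_token=torch.rand(B, T) < 0.25,
    grounding_advantage=torch.randn(B),
    answer_advantage=torch.randn(B),
)
print(loss.item())
```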

Result: Improves temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, enhances average answer accuracy by 2.4%, and yields 6.4% average improvement on long-video benchmarks through coarse-to-fine zoom-in during inference.

Conclusion: Zoom-Zero advances grounded video QA by addressing GRPO limitations, improving both temporal grounding and answer accuracy while benefiting long-form video understanding through preserved visual details and global context.

Abstract: Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO’s issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.

[127] From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel

Main category: cs.CV

TL;DR: VLMs outperform YOLOv8 for WWTP identification in satellite images, enabling annotation-free classification with Gemma-3 showing highest performance.

DetailsMotivation: Traditional WWTP identification methods like YOLOv8 require extensive manual labeling, while VLMs offer efficient, annotation-free alternatives through inherent reasoning capabilities.

Method: Structured VLM comparison methodology with zero-shot and few-shot streams, using expert prompts to identify WWTP components and distinguish confounders, producing JSON outputs with confidence scores.
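
A hedged sketch of the zero-shot prompting flow: an expert prompt requesting JSON with a confidence score, parsed defensively. The `query_vlm` callable is a placeholder for whichever VLM backend is used; the prompt text is illustrative.

```python
# Illustrative only: zero-shot WWTP classification via an expert prompt and
# JSON parsing. `query_vlm` stands in for the actual VLM API.
import json

WWTP_PROMPT = """You are an expert in satellite image interpretation.
Decide whether the image shows a wastewater treatment plant (WWTP).
Look for circular or rectangular tanks and aeration basins, and rule out
confounders such as farms, fuel depots, or water parks.
Answer with JSON only: {"is_wwtp": true/false, "confidence": 0-1, "description": "..."}"""

def classify_tile(image_path, query_vlm):
    """query_vlm(prompt, image_path) -> str is assumed to return the raw reply."""
    raw = query_vlm(WWTP_PROMPT, image_path)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"is_wwtp": False, "confidence": 0.0, "description": "unparseable"}

# Example with a stubbed model reply:
fake_vlm = lambda prompt, img: '{"is_wwtp": true, "confidence": 0.87, "description": "two circular clarifiers"}'
print(classify_tile("tile_600x600m.tif", fake_vlm))
```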

Result: Several VLMs outperformed YOLOv8’s true positive rate in zero-shot evaluations, with Gemma-3 achieving the highest performance, confirming VLMs can replace traditional methods.

Conclusion: VLMs, particularly in zero-shot settings, can efficiently replace YOLOv8 for WWTP classification, enabling scalable, annotation-free remote sensing for environmental monitoring.

Abstract: In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. However, studies indicate that vision-language models (VLMs) are an efficient alternative, achieving equivalent or superior results through inherent reasoning without extensive annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks and aeration basins and to distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600 m x 600 m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8’s true positive rate, with Gemma-3 achieving the highest rate. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.

[128] TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning

Yu Chen, Hongwei Lin

Main category: cs.CV

TL;DR: TUN is a multi-modal neural network that automatically identifies significant points in persistence diagrams, outperforming classic methods for reliable topological feature detection.

DetailsMotivation: Persistence diagrams are powerful for understanding point cloud topology, but distinguishing genuine signals from noise remains challenging, hindering practical adoption of topological data analysis in applications requiring automated interpretation for downstream decision-making.

Method: Topology Understanding Net (TUN) - a multi-modal network combining enhanced PD descriptors with self-attention, PointNet-style point cloud encoder, learned fusion, and per-point classification, with stable preprocessing and imbalance-aware training.

Result: TUN outperforms classic methods in detecting significant points in persistence diagrams, demonstrating effectiveness in real-world applications.

Conclusion: TUN provides an automated and effective solution for identifying significant points in persistence diagrams, addressing a critical challenge in practical topological data analysis applications.

Abstract: Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.

[129] Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity

Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: SIFM introduces a new approach to immunize images against unauthorized diffusion-based editing by perturbing intermediate features with dual objectives, and proposes ISR as a better metric for evaluating immunization success.

DetailsMotivation: Current metrics for image immunization against unauthorized edits focus on visual dissimilarity from reference outputs, but overlook the core requirement of disrupting semantic alignment with attacker intent. The paper argues immunization should be judged by whether edits semantically mismatch the prompt or suffer perceptual degradations.

Method: Synergistic Intermediate Feature Manipulation (SIFM) strategically perturbs intermediate diffusion features through two synergistic objectives: (1) maximizing feature divergence from original edit trajectory to disrupt semantic alignment, and (2) minimizing feature norms to induce perceptual degradations.
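
The two objectives can be pictured as a single scalar to maximize, as in the illustrative sketch below (the feature tensors, the MSE divergence choice, and the weighting are assumptions, not the paper's exact losses).

```python
# Conceptual sketch of the two SIFM objectives: push intermediate features
# away from the original edit trajectory while shrinking their norms.
import torch

def sifm_objective(feat_protected, feat_original, norm_weight=0.1):
    """feat_protected: intermediate diffusion features for the perturbed image.
    feat_original: the same features along the original edit trajectory."""
    divergence = torch.nn.functional.mse_loss(feat_protected, feat_original)
    norm_penalty = feat_protected.pow(2).mean()
    # The perturbation is optimized to MAXIMIZE divergence and MINIMIZE
    # feature norms, so we return a quantity to maximize.
    return divergence - norm_weight * norm_penalty

score = sifm_objective(torch.randn(2, 64, 32, 32, requires_grad=True),
                       torch.randn(2, 64, 32, 32))
score.backward()  # gradients flow back to the protected image's perturbation
print(score.item())
```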

Result: Extensive experiments show SIFM achieves state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation. The paper also introduces Immunization Success Rate (ISR) as a novel metric for rigorous evaluation.

Conclusion: SIFM provides an effective approach to image immunization by focusing on disrupting semantic alignment rather than just visual dissimilarity, with ISR offering a more appropriate metric for evaluating immunization success.

Abstract: Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.

[130] SS4D: Native 4D Generative Model via Structured Spacetime Latents

Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin

Main category: cs.CV

TL;DR: SS4D is a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video, achieving high fidelity, temporal coherence, and structural consistency through compressed spacetime latents.

DetailsMotivation: Prior approaches construct 4D representations by optimizing over 3D or video generative models, which may not achieve optimal temporal coherence and structural consistency. There's a need for a native 4D generative model that can directly synthesize dynamic 3D objects from monocular video.

Method: 1) Builds on pre-trained single-image-to-3D model to address 4D data scarcity and preserve spatial consistency. 2) Introduces dedicated temporal layers for cross-frame reasoning. 3) Uses factorized 4D convolutions and temporal downsampling blocks to compress latent sequences for efficient training/inference. 4) Employs carefully designed training strategy for occlusion robustness.
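
A sketch of what a factorized "4D" block with temporal downsampling might look like under this description: a spatial 3D convolution per frame followed by a strided temporal 1D convolution; the `Factorized4DBlock` module and all shapes are assumptions.

```python
# Hypothetical factorized 4D block: spatial 3D conv applied frame-by-frame,
# then a strided temporal 1D conv for temporal downsampling.
import torch
import torch.nn as nn

class Factorized4DBlock(nn.Module):
    def __init__(self, channels=32, temporal_stride=2):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3,
                                  padding=1, stride=temporal_stride)

    def forward(self, x):
        # x: (B, C, T, D, H, W) -- a sequence of T volumetric latents.
        B, C, T, D, H, W = x.shape
        # Spatial 3D convolution, applied independently to each frame.
        xs = x.permute(0, 2, 1, 3, 4, 5).reshape(B * T, C, D, H, W)
        xs = torch.relu(self.spatial(xs)).reshape(B, T, C, D, H, W)
        # Temporal 1D convolution along T at every voxel (with downsampling).
        xt = xs.permute(0, 3, 4, 5, 2, 1).reshape(B * D * H * W, C, T)
        xt = self.temporal(xt)
        T2 = xt.shape[-1]
        return xt.reshape(B, D, H, W, C, T2).permute(0, 4, 5, 1, 2, 3)

block = Factorized4DBlock()
x = torch.randn(1, 32, 8, 4, 16, 16)
print(block(x).shape)  # torch.Size([1, 32, 4, 4, 16, 16])
```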

Result: The model achieves high fidelity, temporal coherence, and structural consistency in synthesizing dynamic 3D objects directly from monocular video, overcoming challenges of 4D data scarcity and computational efficiency.

Conclusion: SS4D demonstrates that native 4D generative modeling is feasible and effective for dynamic 3D object synthesis from monocular video, offering improved temporal coherence and structural consistency compared to prior optimization-based approaches.

Abstract: We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.

[131] PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition

Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb

Main category: cs.CV

TL;DR: PSMamba is a progressive self-supervised learning framework that uses Vision Mamba with dual-student hierarchical distillation to capture multi-scale lesion patterns in plant disease imagery, outperforming existing SSL methods.

DetailsMotivation: Existing SSL frameworks focus on global alignment but struggle to capture hierarchical, multi-scale lesion patterns in plant disease imagery, which is crucial for accurate disease diagnosis and analysis.

Method: PSMamba integrates Vision Mamba’s efficient sequence modeling with a dual-student hierarchical distillation strategy. It uses a shared global teacher and two specialized students: one for mid-scale views (lesion distributions, vein structures) and another for local views (texture irregularities, early-stage lesions), with consistency losses for cross-scale alignment.
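
The dual-student supervision can be summarized as three consistency terms, roughly as below; generic embeddings stand in for the Vision Mamba encoders, and the loss form and weights are illustrative assumptions.

```python
# Schematic dual-student distillation losses: a shared global teacher,
# one mid-scale student, one local student, plus cross-scale alignment.
import torch
import torch.nn.functional as F

def consistency(p, q):
    """Negative cosine similarity between normalized embeddings."""
    return -(F.normalize(p, dim=-1) * F.normalize(q, dim=-1)).sum(-1).mean()

def psmamba_loss(teacher_global, student_mid, student_local,
                 w_mid=1.0, w_local=1.0, w_cross=0.5):
    t = teacher_global.detach()                      # teacher gets no gradient
    loss_mid = consistency(student_mid, t)           # lesion layout, vein structure
    loss_local = consistency(student_local, t)       # textures, early-stage lesions
    cross = consistency(student_mid, student_local)  # cross-scale alignment
    return w_mid * loss_mid + w_local * loss_local + w_cross * cross

loss = psmamba_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```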

Result: Experiments on three benchmark datasets show PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.

Conclusion: PSMamba effectively addresses the limitations of conventional SSL methods by capturing multi-scale lesion patterns through hierarchical distillation, making it a promising approach for plant disease analysis and potentially other medical imaging tasks requiring multi-granular feature learning.

Abstract: Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.

[132] Dual Attention Guided Defense Against Malicious Edits

Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: DANP is a novel immunization method that adds imperceptible perturbations to images to protect against malicious text-to-image editing by manipulating both cross-attention maps and noise prediction processes over multiple timesteps.

DetailsMotivation: Text-to-image diffusion models enable powerful image editing but pose ethical risks from malicious use in creating deceptive/harmful content. Current defense methods using imperceptible perturbations are insufficient against sophisticated tampering attacks.

Method: Dual Attention-Guided Noise Perturbation (DANP) adds imperceptible perturbations that operate over multiple timesteps. It uses dynamic thresholds to create masks identifying text-relevant/irrelevant regions, reduces attention in relevant areas while increasing it in irrelevant ones, and maximizes discrepancy between injected and predicted noise to disrupt generation.

Result: DANP demonstrates impressive immunity against malicious edits and achieves state-of-the-art performance in protecting images from unauthorized text-to-image manipulation, as confirmed by extensive experiments.

Conclusion: The proposed DANP method effectively addresses the security vulnerabilities of text-to-image diffusion models by simultaneously targeting attention mechanisms and noise prediction processes, providing robust protection against malicious editing while maintaining image quality.

Abstract: Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model’s semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguiding the edit towards incorrect regions and preserving the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model’s predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.

[133] Towards Transferable Defense Against Malicious Image Edits

Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: TDAE is a bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization, achieving state-of-the-art cross-model transferability.

DetailsMotivation: Existing methods for defending against malicious image edits using imperceptible perturbations suffer from limited transferability across different diffusion models, creating a need for more robust cross-model protection.

Method: Proposes TDAE with two components: 1) FlatGrad Defense Mechanism (FDM) that incorporates gradient regularization to steer perturbations toward flat minima for visual defense, and 2) Dynamic Prompt Defense (DPD) that periodically refines text embeddings through adversarial optimization to align editing outcomes between immunized and original images.
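
For the visual-defense part (FDM), a flat-minima-style update can be sketched as an adversarial objective plus a gradient-norm penalty; the step below is a toy illustration under that assumption, not the paper's optimizer.

```python
# Toy flat-minima-style step: penalize the gradient norm of the immunity
# objective w.r.t. the perturbation so it settles in a flatter region.
import torch

def flatgrad_step(delta, image, adv_objective, lam=0.1, lr=0.01):
    """delta: perturbation (requires_grad=True);
    adv_objective(x) -> scalar immunity loss for the perturbed image x."""
    loss = adv_objective(image + delta)
    grad = torch.autograd.grad(loss, delta, create_graph=True)[0]
    flat_penalty = grad.norm()              # small gradient norm ~ flat minimum
    total = loss + lam * flat_penalty
    total.backward()
    with torch.no_grad():
        delta -= lr * delta.grad.sign()     # PGD-style signed update
        delta.grad = None
    return loss.item(), flat_penalty.item()

image = torch.rand(1, 3, 64, 64)
delta = torch.zeros_like(image, requires_grad=True)
toy_objective = lambda x: -((x - 0.5) ** 2).mean()   # stand-in immunity loss
print(flatgrad_step(delta, image, toy_objective))
```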

Result: Extensive experiments show TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra-model and cross-model evaluations, demonstrating superior transferability.

Conclusion: TDAE effectively addresses the transferability limitation of existing methods through its bimodal optimization approach, providing robust protection against malicious image edits across different diffusion models.

Abstract: Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.

[134] Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Jooyeol Yun, Jaegul Choo

Main category: cs.CV

TL;DR: The paper introduces a framework for recovering semantic structure in SVG graphics to enable more reliable animation by vision-language models, addressing the challenge of fragmented low-level shapes in current VLM approaches.

DetailsMotivation: SVG animation is increasingly important for dynamic web design, but current vision-language models struggle with SVG animation because they treat visually coherent objects as fragmented low-level shapes, lacking guidance on which elements should move together.

Method: The framework recovers semantic structure through statistical aggregation of multiple weak part predictions, reorganizing SVGs into semantic groups to provide better guidance for animation.

Result: Experiments show substantial gains over existing approaches, with the semantic recovery enabling VLMs to produce animations with far greater coherence.

Conclusion: Semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.

Abstract: Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.

[135] Enhancing Interpretability for Vision Models via Shapley Value Optimization

Kanglong Fan, Yunqiao Yang, Chen Ma

Main category: cs.CV

TL;DR: A novel self-explaining framework using Shapley value estimation as auxiliary training to achieve faithful explanations without sacrificing model performance.

DetailsMotivation: Current explanation methods have limitations: post-hoc methods lack faithfulness to model behaviors, while self-explaining networks sacrifice performance and compatibility due to specialized architectures.

Method: Proposes a self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, enabling fair allocation of prediction scores to image patches with minimal structural modifications.
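
The auxiliary target can be approximated with a Monte Carlo Shapley estimator over patches, as in the sketch below (the toy model, baseline value, and sample count are assumptions; the paper's estimator may differ).

```python
# Monte Carlo Shapley estimate of per-patch contributions to a class score,
# illustrating the quantity the auxiliary task would learn to predict.
import torch

def mc_shapley(model, patches, target_class, n_samples=8, baseline=0.0):
    """patches: (P, D) patch features; returns a (P,) Shapley estimate."""
    P = patches.shape[0]
    phi = torch.zeros(P)
    with torch.no_grad():
        for _ in range(n_samples):
            perm = torch.randperm(P)
            masked = torch.full_like(patches, baseline)
            prev = model(masked.unsqueeze(0))[0, target_class].item()
            for idx in perm:                           # add patches in random order
                masked[idx] = patches[idx]
                cur = model(masked.unsqueeze(0))[0, target_class].item()
                phi[idx] += (cur - prev) / n_samples   # marginal contribution
                prev = cur
    return phi

toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16 * 8, 10))
patches = torch.randn(16, 8)
print(mc_shapley(toy_model, patches, target_class=3))
```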

Result: Achieves state-of-the-art interpretability on multiple benchmarks while preserving model performance and compatibility.

Conclusion: The proposed framework successfully addresses the trade-off between interpretability and performance by integrating Shapley value estimation into training, creating inherently faithful explanations with minimal architectural changes.

Abstract: Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model’s decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.

[136] HGS: Hybrid Gaussian Splatting with Static-Dynamic Decomposition for Compact Dynamic View Synthesis

Kaizhe Zhang, Yijie Zhou, Weizhan Zhang, Caixia Yan, Haipeng Du, Yugui Xie, Yu-Hui Wen, Yong-Jin Liu

Main category: cs.CV

TL;DR: Hybrid Gaussian Splatting (HGS) reduces model size by 98% and achieves real-time rendering up to 125 FPS at 4K by disentangling static/dynamic regions using Radial Basis Functions.

DetailsMotivation: Existing dynamic novel view synthesis methods using 3D Gaussian Splatting suffer from excessive model complexity, parameter redundancy, large model sizes, and slow rendering speeds, making them inefficient for real-time applications on resource-constrained devices.

Method: Proposes Hybrid Gaussian Splatting (HGS) with Static-Dynamic Decomposition (SDD) strategy using Radial Basis Function modeling: time-dependent RBFs for dynamic regions to capture temporal variations, and shared temporally invariant parameters for static regions. Includes two-stage training strategy for explicit models to enhance temporal coherence at boundaries.
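
The RBF idea can be illustrated as a time-varying attribute for dynamic Gaussians versus a shared constant for static ones; the sketch below uses hypothetical sizes and a position offset as the example attribute.

```python
# Sketch: a dynamic Gaussian's attribute as a radial-basis expansion over time;
# a static Gaussian keeps one time-invariant value (hence the parameter savings).
import torch

def rbf_attribute(t, base, weights, centers, sigma=0.1):
    """Evaluate a time-varying attribute (e.g., a position offset) at time t.
    weights: (K, D) RBF coefficients, centers: (K,) RBF centers in [0, 1]."""
    basis = torch.exp(-((t - centers) ** 2) / (2 * sigma ** 2))   # (K,)
    return base + basis @ weights                                  # (D,)

base_pos = torch.tensor([0.0, 0.0, 0.0])
weights = torch.randn(8, 3) * 0.05        # dynamic Gaussian: 8 temporal RBFs
centers = torch.linspace(0, 1, 8)

for t in (0.0, 0.5, 1.0):
    print(t, rbf_attribute(torch.tensor(t), base_pos, weights, centers))

# A static Gaussian simply reuses `base_pos` for every t, with no per-time parameters.
```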

Result: Reduces model size by up to 98%, achieves real-time rendering at up to 125 FPS at 4K resolution on RTX 3090, sustains 160 FPS at 1352×1014 on RTX 3050, integrates into VR system, and maintains comparable rendering quality with improved visual fidelity for high-frequency details and abrupt scene changes.

Conclusion: HGS provides a compact and efficient framework for dynamic novel view synthesis that significantly reduces parameter redundancy while maintaining high rendering quality and enabling real-time performance on resource-constrained devices, making it suitable for practical applications like VR systems.

Abstract: Dynamic novel view synthesis (NVS) is essential for creating immersive experiences. Existing approaches have advanced dynamic NVS by introducing 3D Gaussian Splatting (3DGS) with implicit deformation fields or indiscriminately assigned time-varying parameters, surpassing NeRF-based methods. However, due to excessive model complexity and parameter redundancy, they incur large model sizes and slow rendering speeds, making them inefficient for real-time applications, particularly on resource-constrained devices. To obtain a more efficient model with fewer redundant parameters, in this paper, we propose Hybrid Gaussian Splatting (HGS), a compact and efficient framework explicitly designed to disentangle static and dynamic regions of a scene within a unified representation. The core innovation of HGS lies in our Static-Dynamic Decomposition (SDD) strategy, which leverages Radial Basis Function (RBF) modeling for Gaussian primitives. Specifically, for dynamic regions, we employ time-dependent RBFs to effectively capture temporal variations and handle abrupt scene changes, while for static regions, we reduce redundancy by sharing temporally invariant parameters. Additionally, we introduce a two-stage training strategy tailored for explicit models to enhance temporal coherence at static-dynamic boundaries. Experimental results demonstrate that our method reduces model size by up to 98% and achieves real-time rendering at up to 125 FPS at 4K resolution on a single RTX 3090 GPU. It further sustains 160 FPS at 1352 * 1014 on an RTX 3050 and has been integrated into the VR system. Moreover, HGS achieves comparable rendering quality to state-of-the-art methods while providing significantly improved visual fidelity for high-frequency details and abrupt scene changes.

[137] DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa

Main category: cs.CV

TL;DR: DISCODE is a finetuning-free method for robust image caption evaluation using LVLMs that adapts at test-time with a Gaussian prior distribution to better align with human judgments across diverse domains.

DetailsMotivation: Current LVLMs struggle with robust image caption evaluation, especially under domain-shift scenarios where evaluation metrics need to work consistently across different domains.

Method: DISCODE uses a test-time adaptive evaluation approach with Adaptive Test-Time (ATT) loss that leverages a Gaussian prior distribution. It minimizes this loss efficiently at test time using an analytical solution. Also introduces MCEval benchmark covering six domains.

Result: DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across the new MCEval benchmark and four existing benchmarks.

Conclusion: DISCODE provides a robust, finetuning-free solution for image caption evaluation that better aligns with human judgments across diverse domains through test-time adaptation.

Abstract: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.

[138] Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

Antonio Guillen-Perez

Main category: cs.CV

TL;DR: Semantic-Drive is a local-first neuro-symbolic framework that enables efficient semantic mining of rare safety-critical events in AV video logs using a two-stage perception approach with hallucination mitigation.

DetailsMotivation: AV development is bottlenecked by scarce "long-tail" training data for rare safety-critical events. Current solutions either use imprecise metadata search or privacy-invasive, expensive cloud-based VLMs.

Method: Two-stage neuro-symbolic framework: (1) Symbolic Grounding via real-time open-vocabulary detector (YOLOE) to anchor attention, (2) Cognitive Analysis via Reasoning VLM for forensic scene analysis. Uses “System 2” inference-time alignment with multi-model “Judge-Scout” consensus to mitigate hallucination.

Result: Achieves Recall of 0.966 (vs. 0.475 for CLIP) on nuScenes dataset against Waymo Open Dataset taxonomy, reduces Risk Assessment Error by 40% compared to best single scout models. Runs entirely on consumer hardware (NVIDIA RTX 3090).

Conclusion: Semantic-Drive provides a privacy-preserving, efficient alternative to cloud-based solutions for mining rare safety-critical events in AV data, enabling better training data collection for long-tail scenarios.

Abstract: The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of “Long-Tail” training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a “System 2” inference-time alignment strategy, utilizing a multi-model “Judge-Scout” consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.

[139] Mimicking Human Visual Development for Learning Robust Image Representations

Ankita Raj, Kaashika Prajaapat, Tapan Kumar Gandhi, Chetan Arora

Main category: cs.CV

TL;DR: Progressive blurring curriculum mimicking human visual development improves CNN generalization and robustness by starting with blurred images and gradually reducing blur during training.

DetailsMotivation: Human visual system adapts well to input distribution changes, while CNNs struggle. Inspired by infant visual development where acuity improves gradually, the authors propose mimicking this process to enhance CNN robustness.

Method: Train CNNs with progressive blurring curriculum: start with highly blurred images in early epochs, then gradually reduce blur strength as training advances. This encourages networks to prioritize global structures over high-frequency artifacts.
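
A minimal sketch of such a curriculum: anneal the Gaussian-blur strength over epochs and switch to an identity transform once it becomes negligible (the schedule shape and constants are assumptions, not the paper's values).

```python
# Illustrative blur curriculum: heavy Gaussian blur in early epochs,
# linearly annealed to no blur as training progresses.
import torch
from torchvision import transforms

def blur_transform_for_epoch(epoch, total_epochs, max_sigma=4.0):
    """Linearly anneal blur strength; identity transform once sigma is tiny."""
    sigma = max_sigma * max(0.0, 1.0 - epoch / (0.6 * total_epochs))  # no blur after 60%
    if sigma < 0.1:
        return transforms.Lambda(lambda x: x)
    return transforms.GaussianBlur(kernel_size=9, sigma=sigma)

images = torch.rand(4, 3, 32, 32)
for epoch in (0, 10, 30, 50):
    t = blur_transform_for_epoch(epoch, total_epochs=60)
    print(epoch, type(t).__name__, t(images).shape)
```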

Result: Reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C compared to standard training. Enhances both natural and adversarial robustness, complements other augmentation techniques like CutMix and MixUp.

Conclusion: Progressive blurring curriculum effectively improves CNN generalization and robustness by mimicking human visual development, challenging prior claims that early blurring harms performance. The structured progression yields consistent gains across datasets.

Abstract: The human visual system is remarkably adept at adapting to changes in the input distribution; a capability modern convolutional neural networks (CNNs) still struggle to match. Drawing inspiration from the developmental trajectory of human vision, we propose a progressive blurring curriculum to improve the generalization and robustness of CNNs. Human infants are born with poor visual acuity, gradually refining their ability to perceive fine details. Mimicking this process, we begin training CNNs on highly blurred images during the initial epochs and progressively reduce the blur as training advances. This approach encourages the network to prioritize global structures over high-frequency artifacts, improving robustness against distribution shifts and noisy inputs. Challenging prior claims that blurring in the initial training epochs imposes a stimulus deficit and irreversibly harms model performance, we reveal that early-stage blurring enhances generalization with minimal impact on in-domain accuracy. Our experiments demonstrate that the proposed curriculum reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C datasets, compared to standard training without blurring. Unlike static blur-based augmentation, which applies blurred images randomly throughout training, our method follows a structured progression, yielding consistent gains across various datasets. Furthermore, our approach complements other augmentation techniques, such as CutMix and MixUp, and enhances both natural and adversarial robustness against common attack methods. Code is available at https://github.com/rajankita/Visual_Acuity_Curriculum.

[140] Unified Semantic Transformer for 3D Scene Understanding

Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari

Main category: cs.CV

TL;DR: UNITE is a unified neural network that performs multiple 3D semantic understanding tasks from RGB images in seconds, outperforming task-specific models.

DetailsMotivation: Existing 3D scene understanding models are limited to task-specific approaches due to real-world complexity, creating a need for a unified model that can handle diverse semantic tasks efficiently.

Method: A Unified Semantic Transformer (UNITE) that operates end-to-end on RGB images, using 2D distillation with self-supervision and novel multi-view consistency losses to predict multiple 3D semantic attributes including segmentation, instance embeddings, open-vocabulary features, affordance, and articulations.

Result: UNITE achieves state-of-the-art performance on several semantic tasks, often outperforming task-specific models and even methods that use ground truth 3D geometry, while running in just a few seconds.

Conclusion: UNITE demonstrates that a unified approach to 3D scene understanding is not only feasible but can surpass specialized models across multiple tasks, enabling efficient holistic 3D scene parsing from RGB images alone.

Abstract: Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases, surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io

[141] TACK Tunnel Data (TTD): A Benchmark Dataset for Deep Learning-Based Defect Detection in Tunnels

Andreas Sjölander, Valeria Belloni, Robel Fekadu, Andrea Nascetti

Main category: cs.CV

TL;DR: New public dataset of annotated tunnel lining images with cracks, leaching, and water infiltration defects to support DL-based automated tunnel inspection.

DetailsMotivation: Tunnel inspections are crucial for safety but manual methods are inefficient. Current automated approaches using Deep Learning are limited by lack of domain-specific tunnel datasets.

Method: Created a publicly available annotated dataset containing images of three different tunnel linings with labeled defects (cracks, leaching, water infiltration). Designed to support supervised, semi-supervised, and unsupervised DL methods.

Result: Provides a diverse dataset that captures variations in tunnel textures and construction techniques, enabling research on defect detection, segmentation, and model generalization across different tunnel types.

Conclusion: This dataset addresses the critical data scarcity problem in tunnel inspection domain and contributes to advancing automated inspection methods for safer, more efficient infrastructure maintenance.

Abstract: Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.

[142] Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

Shengkai Xu, Hsiang Lun Kao, Tianxiang Xu, Honghui Zhang, Junqiao Wang, Runmeng Ding, Guanyu Liu, Tianyu Shi, Zhenyu Yu, Guofeng Pan, Ziqian Bi, Yuqi Ouyang

Main category: cs.CV

TL;DR: AdaptiveDetector: A two-stage polyp detection framework using YOLOv11 detector with VLM verifier and adaptive confidence thresholds, trained with cost-sensitive reinforcement learning to reduce missed detections in challenging clinical conditions.

DetailsMotivation: Polyp detectors trained on clean datasets underperform in real-world endoscopy due to illumination changes, motion blur, and occlusions. Existing approaches struggle with the domain gap between controlled lab conditions and clinical practice with adverse imaging conditions.

Method: Proposes AdaptiveDetector: a two-stage detector-verifier framework with YOLOv11 detector and vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance. The verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function designed to discourage missed detections. Also constructs a synthetic testbed by systematically degrading clean datasets with adverse clinical conditions for realistic evaluation.
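
The abstract does not give implementation details, but the detector-verifier interaction can be pictured roughly as below. This is a minimal sketch under stated assumptions: `run_yolo` and `vlm_verify` are placeholder callables (not the paper's GRPO-trained verifier), and the threshold-update rule is a simple hypothetical heuristic reflecting the cost-sensitive goal of avoiding missed detections.

```python
# Hypothetical sketch of a detector-verifier loop with an adaptive
# per-frame confidence threshold. `run_yolo` and `vlm_verify` are
# placeholders, not the paper's actual components.

def detect_frame(frame, run_yolo, vlm_verify,
                 tau=0.25, tau_min=0.05, tau_max=0.6, step=0.05):
    """Return verified boxes for one frame and the updated threshold."""
    # Candidate boxes are assumed to carry a .conf score.
    boxes = [b for b in run_yolo(frame) if b.conf >= tau]
    verified = [b for b in boxes if vlm_verify(frame, b)]

    # Cost-sensitive heuristic: if the verifier rejects everything,
    # lower the threshold to reduce the risk of missed polyps;
    # if all candidates pass, tighten it to protect precision.
    if not verified:
        tau = max(tau_min, tau - step)
    elif len(verified) == len(boxes):
        tau = min(tau_max, tau + step)
    return verified, tau
```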

Result: Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images shows improvements: recall increases by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline.

Conclusion: The combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, reducing the risk of missed precancerous polyps and improving patient outcomes.

Abstract: Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections – a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.

[143] Optimizing Rank for High-Fidelity Implicit Neural Representations

Julian McGinnis, Florian A. Hölzl, Suprosanna Shit, Florentin Bieder, Paul Friedrich, Mark Mühlau, Björn Menze, Daniel Rueckert, Benedikt Wiestler

Main category: cs.CV

TL;DR: Vanilla MLPs can learn high-frequency signals with proper rank regulation during training, challenging the belief that architectural modifications are necessary for high-frequency representation in INRs.

DetailsMotivation: The paper challenges the common belief that vanilla MLPs have an intrinsic architectural limitation preventing them from representing high-frequency content in Implicit Neural Representations (INRs). The authors argue that the low-frequency bias is not an architectural issue but rather a symptom of rank degradation during training.

Method: The authors propose regulating the network’s rank during training to prevent stable rank degradation. They use optimizers like Muon that produce high-rank, near-orthogonal updates, which substantially improves the fidelity of learned signals even with simple MLP architectures.
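
Two ingredients behind this idea can be illustrated concretely: monitoring the stable rank of a weight matrix, and producing near-orthogonal updates via a Newton-Schulz polar iteration, which is the mechanism underlying Muon-style optimizers. The sketch below uses the textbook cubic iteration; Muon itself uses a tuned quintic variant, so this is illustrative rather than the authors' code.

```python
import torch

def stable_rank(W: torch.Tensor) -> torch.Tensor:
    """Stable rank ||W||_F^2 / ||W||_2^2; it degrades toward 1 when a few
    singular values dominate the spectrum."""
    fro2 = W.pow(2).sum()
    spec = torch.linalg.matrix_norm(W, ord=2)
    return fro2 / spec.pow(2)

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of an update G with the
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    (Muon uses a tuned quintic variant; this is the textbook form.)"""
    X = G / (torch.linalg.matrix_norm(G, ord=2) + 1e-8)  # spectral norm <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```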

Result: Extensive experiments show up to 9 dB PSNR improvements over previous state-of-the-art across diverse domains including natural and medical images, and novel view synthesis. The method consistently enhances INR architectures beyond simple ReLU MLPs.

Conclusion: The low-frequency bias of vanilla MLPs in INRs is not an architectural limitation but rather a training issue related to rank degradation. Proper rank regulation during training enables even simple MLP architectures to effectively represent high-frequency content.

Abstract: Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic architectural limitation on learning high-frequency content, arguing instead that it is a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network’s rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images, and novel view synthesis, with up to 9 dB PSNR improvements over the previous state-of-the-art. Our project page, which includes code and experimental results, is available at: (https://muon-inrs.github.io).

[144] CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning

Andreas Lolos, Theofilos Christodoulou, Aris L. Moustakas, Stergios Christodoulidis, Maria Vakalopoulou

Main category: cs.CV

TL;DR: CAPRMIL introduces a context-aware MIL framework for computational pathology that uses global context tokens and multi-head self-attention to create rich patch embeddings, enabling simple mean aggregation while matching SOTA performance with dramatically reduced parameters and computational costs.

DetailsMotivation: Current MIL methods in computational pathology rely on complex attention-based aggregation mechanisms that are computationally expensive. There's a need for more efficient, scalable approaches that can handle gigapixel WSIs while maintaining performance.

Method: CAPRMIL projects patch features from a frozen encoder into a small set of global context tokens, uses multi-head self-attention to inject global context with linear complexity, and pairs this with a simple Mean MIL aggregator, removing correlation learning from the aggregator.
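
A minimal PyTorch sketch of this idea is given below, assuming a frozen patch encoder has already produced a bag of features; the module names, token count, and dimensions are illustrative assumptions, and the exact attention layout in CAPRMIL may differ. The linear cost in bag size comes from letting a small, fixed set of context tokens attend over the patches rather than patches attending to each other.

```python
import torch
import torch.nn as nn

class ContextAwarePatches(nn.Module):
    """Illustrative sketch: a few learnable context tokens attend over all
    patch embeddings, each patch is then enriched with the pooled context,
    and the slide prediction uses a plain mean-MIL aggregation."""

    def __init__(self, dim=768, n_ctx=16, n_heads=8, n_classes=2):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(1, n_ctx, dim) * 0.02)
        self.ctx_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patches):              # patches: (B, N, dim), frozen features
        B = patches.size(0)
        ctx = self.ctx.expand(B, -1, -1)
        # Context tokens query the bag: cost is O(N * n_ctx), linear in N.
        ctx, _ = self.ctx_attn(ctx, patches, patches)
        g = ctx.mean(dim=1, keepdim=True).expand_as(patches)
        enriched = self.fuse(torch.cat([patches, g], dim=-1))
        return self.head(enriched.mean(dim=1))   # simple mean-MIL aggregator
```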

Result: Matches SOTA slide-level performance across multiple pathology benchmarks while reducing trainable parameters by 48%-92.8%, lowering FLOPs during inference by 52%-99%, and ranking among best models for GPU memory efficiency and training time.

Conclusion: Learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis, enabling efficient MIL with simple aggregation.

Abstract: In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by advances in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features – extracted using a frozen patch encoder – into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL

[145] EcoScapes: LLM-Powered Advice for Crafting Sustainable Cities

Martin Röhn, Nora Gourmelon, Vincent Christlein

Main category: cs.CV

TL;DR: A multi-layered AI system combining LLMs, satellite imagery, and knowledge base to help small cities develop climate adaptation strategies despite limited resources.

DetailsMotivation: Small cities struggle with limited personnel and integrating diverse data sources for comprehensive climate adaptation analysis, which is vital for urban sustainability and survival.

Method: Proposes a multi-layered system combining specialized Large Language Models (LLMs), satellite imagery analysis, and a knowledge base to aid in climate adaptation strategy development.

Result: The system is implemented with corresponding code available at https://github.com/Photon-GitHub/EcoScapes, providing a practical tool for small cities.

Conclusion: The proposed AI-powered system addresses resource constraints in small cities by integrating multiple data sources to support effective climate adaptation planning.

Abstract: Climate adaptation is vital for the sustainability and sometimes the mere survival of our urban areas. However, small cities often struggle with limited personnel resources and integrating vast amounts of data from multiple sources for a comprehensive analysis. To overcome these challenges, this paper proposes a multi-layered system combining specialized LLMs, satellite imagery analysis and a knowledge base to aid in developing effective climate adaptation strategies. The corresponding code can be found at https://github.com/Photon-GitHub/EcoScapes.

[146] CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer

Xianwei Cao, Dou Quan, Shuang Wang, Ning Huyan, Wei Wang, Yunan Li, Licheng Jiao

Main category: cs.CV

TL;DR: CLNet is a novel correspondence-aware feature refinement framework for image retrieval-based cross-view geo-localization that explicitly models spatial correspondences between satellite and street-level images through three complementary modules.

DetailsMotivation: Existing methods for cross-view geo-localization rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization between significantly different viewpoints like satellite and street-level images.

Method: CLNet decomposes view alignment into three learnable modules: 1) Neural Correspondence Map (NCM) for spatial alignment via latent correspondence fields, 2) Nonlinear Embedding Converter (NEC) for perspective transformation using MLP-based mapping, and 3) Global Feature Recalibration (GFR) for channel reweighting guided by spatial cues.

Result: Extensive experiments on four public benchmarks (CVUSA, CVACT, VIGOR, and University-1652) demonstrate state-of-the-art performance with better interpretability and generalizability.

Conclusion: CLNet effectively bridges semantic and geometric gaps between different views by jointly capturing high-level semantics and fine-grained alignments, offering a superior approach to cross-view geo-localization.

Abstract: Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.

[147] Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos

Le Jiang, Shaotong Zhu, Yedi Luo, Shayda Moezzi, Sarah Ostadabbas

Main category: cs.CV

TL;DR: ExpanDyNeRF improves dynamic NeRF for large viewpoint changes using Gaussian splatting priors and pseudo-ground-truth generation, with new SynDM dataset for evaluation.

DetailsMotivation: Current dynamic NeRF methods fail under significant viewpoint deviations, producing unstable and unrealistic renderings when there are large-angle rotations.

Method: ExpanDyNeRF uses Gaussian splatting priors and pseudo-ground-truth generation strategy to optimize density and color features for better reconstruction from challenging perspectives. Also introduces SynDM dataset for evaluation.

Result: ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts, demonstrated on both SynDM and real-world datasets.

Conclusion: ExpanDyNeRF enables realistic novel view synthesis under large-angle rotations in dynamic scenes, addressing a key limitation of current dynamic NeRF systems.

Abstract: In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision, created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.

[148] FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo, Guanbin Li, Rushi Lan

Main category: cs.CV

TL;DR: FakeRadar is a deepfake video detection framework that improves cross-domain generalization by using pretrained models to identify distribution gaps and generating synthetic outlier samples to simulate emerging manipulation techniques.

DetailsMotivation: Existing deepfake detection methods fail to generalize to unseen manipulation techniques because they rely on manipulation-specific cues. There's a need for detection systems that can adapt to emerging forgery patterns in real-world scenarios.

Method: Uses large-scale pretrained models (CLIP) to probe feature space and identify distribution gaps. Introduces Forgery Outlier Probing with dynamic subcluster modeling and cluster-conditional outlier generation to create synthetic outlier samples. Implements Outlier-Guided Tri-Training with outlier-driven contrastive learning and outlier-conditioned cross-entropy losses to distinguish real, fake, and outlier samples.

Result: FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly excelling in cross-domain evaluations by effectively handling emerging manipulation techniques.

Conclusion: The proposed framework successfully addresses cross-domain generalization challenges in deepfake detection by proactively identifying distribution gaps and simulating novel forgery artifacts, making it more robust against emerging manipulation techniques.

Abstract: In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.

[149] LCMem: A Universal Model for Robust Image Memorization Detection

Mischa Dombrowski, Felix Nützel, Bernhard Kainz

Main category: cs.CV

TL;DR: LCMem is a cross-domain memorization detection model that treats memorization as both re-identification and copy detection, achieving significant improvements over existing methods for privacy auditing.

DetailsMotivation: Current generative models can create realistic images that deceive humans, but their privacy-preserving potential is poorly understood due to unreliable memorization detection, limited quantitative evaluation, and poor generalization of privacy auditing methods across domains.

Method: Propose viewing memorization detection as unified problem of re-identification (identity consistency) and copy detection (augmentation-robust duplication). Introduce Latent Contrastive Memorization Network (LCMem) with two-stage training: first learns identity consistency, then incorporates augmentation-robust copy detection.

Result: Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Shows existing privacy filters provide limited performance and robustness.

Conclusion: LCMem sets new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Highlights need for stronger protection mechanisms beyond existing privacy filters.

Abstract: Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model are publicly available at https://github.com/MischaD/LCMem.

[150] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images

Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Christian Matek, Lukas Wolfseher, Rainer Spang, Ralf Huss, Johannes Raffler, Sarah Reinke, Wolfram Klapper, Katja Steiger, Kristina Schwamborn, Carsten Marr

Main category: cs.CV

TL;DR: Benchmark study of pathology foundation models for lymphoma subtyping on multicenter data shows good in-distribution performance (~80% accuracy) but poor generalization to out-of-distribution data (~60% accuracy).

DetailsMotivation: Lymphoma diagnosis requires multiple expensive tests causing treatment delays. Deep learning could assist pathologists using routinely available HE-stained slides, but comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking.

Method: Created first multicenter lymphoma benchmarking dataset covering 4 common subtypes and healthy tissue. Evaluated 5 pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across 3 magnifications (10x, 20x, 40x).
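
For reference, the attention-based MIL pooling used by AB-MIL-style aggregators can be sketched as below. This is the generic formulation (each patch embedding receives a learned attention weight and the slide embedding is the weighted sum), not the exact configuration benchmarked in the paper; the five output classes simply mirror the four subtypes plus healthy tissue.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Generic attention-based MIL pooling: each patch embedding h_i gets a
    weight a_i = softmax_i(w^T tanh(V h_i)), and the slide embedding is
    the weighted sum of patch embeddings."""

    def __init__(self, dim=1024, hidden=256, n_classes=5):
        super().__init__()
        self.V = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, h):                       # h: (N, dim) patch features
        a = torch.softmax(self.w(torch.tanh(self.V(h))), dim=0)  # (N, 1)
        z = (a * h).sum(dim=0)                  # slide-level embedding (dim,)
        return self.classifier(z), a            # logits and attention map
```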

Result: In-distribution: Models achieved >80% balanced accuracy across all magnifications, with similar performance across foundation models and aggregation methods. 40x resolution sufficient, no gains from higher resolutions or cross-magnification aggregation. Out-of-distribution: Performance dropped to ~60%, highlighting significant generalization challenges.

Conclusion: Pathology foundation models show promise for lymphoma subtyping but struggle with generalization. Larger multicenter studies covering additional rare lymphoma subtypes are needed. Authors provide automated benchmarking pipeline to facilitate future research.

Abstract: Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.

[151] The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy

Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li

Main category: cs.CV

TL;DR: SynPS is a training-free image editing method that synergistically combines positional embeddings and semantic information to achieve faithful non-rigid edits (pose/shape changes) in diffusion models, addressing attention collapse issues in existing approaches.

DetailsMotivation: Existing training-free image editing methods with large diffusion models struggle with complex non-rigid edits (pose/shape changes) due to attention collapse in attention sharing mechanisms, where either positional embeddings or semantic features dominate, leading to over-editing or under-editing.

Method: SynPS introduces an editing measurement that quantifies required editing magnitude at each denoising step, and an attention synergy pipeline that dynamically modulates the influence of positional embeddings to balance semantic modifications and fidelity preservation.

Result: Extensive experiments on public and newly curated benchmarks demonstrate superior performance and faithfulness compared to existing methods, effectively avoiding both over-editing and under-editing.

Conclusion: SynPS provides a practical solution for faithful non-rigid image editing by adaptively integrating positional and semantic cues, addressing the fundamental attention collapse problem in existing attention sharing mechanisms.

Abstract: Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.

[152] VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image

Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baining Guo

Main category: cs.CV

TL;DR: VASA-3D is an audio-driven 3D head avatar generator that creates realistic talking heads from single portrait images using VASA-1’s motion latent and optimization framework.

DetailsMotivation: The research addresses two key challenges: capturing subtle facial expression details present in real human faces, and reconstructing intricate 3D head avatars from just a single portrait image.

Method: VASA-3D leverages VASA-1’s motion latent for expression detail, translates it to 3D via a motion-latent-conditioned 3D head model, and customizes to single images using optimization with synthesized video frames and robust training losses.

Result: VASA-3D produces realistic 3D talking heads surpassing prior art, supporting online generation of 512x512 free-viewpoint videos at up to 75 FPS for immersive engagements.

Conclusion: The system enables creation of lifelike 3D avatars from single images with realistic audio-driven facial animations, facilitating more immersive virtual interactions.

Abstract: We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.

[153] Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging

Chang Cai, Hao Jiang, Xiaojun Yuan, Ying-Jun Angela Zhang

Main category: cs.CV

TL;DR: Score-based turbo message passing (STMP) integrates score-based MMSE denoisers into message-passing for compressive imaging, achieving fast convergence and better performance-complexity tradeoff than traditional PnP methods.

DetailsMotivation: Traditional plug-and-play compressive imaging methods use generic denoisers that fail to capture complex image statistics, leading to suboptimal reconstruction. Score-based models offer better priors but are computationally expensive for direct posterior sampling.

Method: Developed STMP framework combining message passing with score-based MMSE denoisers. For quantized measurements, added Q-STMP with component-wise MMSE dequantization. Both methods have state-evolution equations for performance prediction.
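
The connection the paper draws between score-based modeling and empirical Bayes denoising is Tweedie's formula: for y = x + n with n ~ N(0, sigma^2 I), the MMSE denoiser is E[x | y] = y + sigma^2 * grad_y log p(y). The sketch below shows one such denoising call under the assumption of a pretrained score network (a placeholder here); the message-passing wrapper and state-evolution analysis are omitted.

```python
import torch

def score_mmse_denoise(y, sigma, score_model):
    """Tweedie's formula: E[x | y] = y + sigma^2 * score(y, sigma),
    where score(y, sigma) approximates grad_y log p_sigma(y).
    `score_model` stands in for a pretrained score network."""
    with torch.no_grad():
        s = score_model(y, sigma)
    return y + (sigma ** 2) * s
```

Inside the turbo/message-passing loop, a call like this would replace the generic plug-and-play denoiser at each iteration, with sigma set from the current extrinsic noise estimate.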

Result: STMP achieves significantly better performance-complexity tradeoff than baselines on FFHQ dataset. Q-STMP remains robust under 1-bit quantization. Both methods converge within 10 iterations.

Conclusion: The proposed STMP/Q-STMP framework successfully integrates score-based generative priors into message passing, enabling fast, high-quality compressive image recovery even with quantized measurements.

Abstract: Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distribution. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.

[154] S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation

Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski

Main category: cs.CV

TL;DR: Unsupervised video instance segmentation trained only on real video data, using temporal coherence from keymasks and sparse-to-dense distillation to outperform synthetic-data-based methods.

DetailsMotivation: Current unsupervised video instance segmentation relies on synthetic video data from object-centric image datasets, which fails to model realistic motion patterns like perspective changes, part movements, and camera motion. There's a need for methods trained exclusively on real video data.

Method: 1) Start with unsupervised instance segmentation masks on individual frames. 2) Establish temporal coherence by identifying high-quality keymasks using deep motion priors. 3) Use sparse keymask pseudo-annotations to train a segmentation model via Sparse-To-Dense Distillation approach with Temporal DropLoss. 4) Train final model on resulting dense labelset.

Result: The approach outperforms current state-of-the-art across various benchmarks, demonstrating superior performance compared to methods relying on synthetic video data.

Conclusion: Training unsupervised video instance segmentation exclusively on real video data with temporal coherence establishment and sparse-to-dense distillation is effective and outperforms synthetic-data-based approaches, addressing limitations of unrealistic motion modeling in current methods.

Abstract: In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state-of-the-art across various benchmarks.

[155] Native and Compact Structured Latents for 3D Generation

Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, Jiaolong Yang

Main category: cs.CV

TL;DR: O-Voxel: A new sparse voxel representation for 3D generative modeling that captures complex topology and detailed appearance, enabling high-quality 3D asset generation with efficient inference.

DetailsMotivation: Existing 3D representations struggle with complex topologies and detailed appearance, limiting the realism of generated 3D assets despite recent advancements in generative modeling.

Method: Introduces O-Voxel (omni-voxel), a sparse voxel structure encoding both geometry and appearance. Builds a Sparse Compression VAE for high spatial compression, and trains large-scale flow-matching models (4B parameters) on diverse 3D datasets.

Result: The approach achieves highly efficient inference despite model scale, and generates assets with geometry and material quality far exceeding existing models.

Conclusion: O-Voxel representation offers significant advancement in 3D generative modeling by addressing limitations of current representations while maintaining efficiency.

Abstract: Recent advancements in 3D generative modeling have significantly improved generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

[156] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen

Main category: cs.CV

TL;DR: A4-Agent is a training-free agentic framework that decouples affordance prediction into a three-stage pipeline using specialized foundation models, achieving state-of-the-art zero-shot performance.

DetailsMotivation: Prevailing end-to-end models for affordance prediction couple high-level reasoning and low-level grounding into monolithic pipelines, relying on annotated datasets which leads to poor generalization on novel objects and unseen environments.

Method: A4-Agent decouples affordance prediction into three stages using specialized foundation models: (1) Dreamer uses generative models to visualize how interactions would look, (2) Thinker uses large vision-language models to decide what object part to interact with, and (3) Spotter orchestrates vision foundation models to precisely locate where the interaction area is.
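
Since the framework is training-free, the whole pipeline is essentially orchestration of frozen models. The outline below is a hedged pseudocode sketch; all three callables are placeholders for the respective foundation models, not the paper's actual interfaces.

```python
def predict_affordance(image, instruction, dreamer, thinker, spotter):
    """Illustrative three-stage agentic pipeline (placeholder callables):
    the Dreamer imagines how the interaction would look, the Thinker names
    the object part to interact with, and the Spotter grounds that part
    to a pixel region."""
    imagined = dreamer(image, instruction)               # generative model: "how"
    target_part = thinker(image, imagined, instruction)  # VLM reasoning:  "what"
    return spotter(image, target_part)                   # grounding:      "where"
```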

Result: The zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings without any task-specific fine-tuning.

Conclusion: By leveraging complementary strengths of pre-trained foundation models in a decoupled, agentic framework, A4-Agent achieves superior affordance prediction performance with strong generalization capabilities, moving beyond the limitations of end-to-end supervised approaches.

Abstract: Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a $\textbf{Dreamer}$ that employs generative models to visualize $\textit{how}$ an interaction would look; (2) a $\textbf{Thinker}$ that utilizes large vision-language models to decide $\textit{what}$ object part to interact with; and (3) a $\textbf{Spotter}$ that orchestrates vision foundation models to precisely locate $\textit{where}$ the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.

[157] Spherical Leech Quantization for Visual Tokenization and Generation

Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Krähenbühl

Main category: cs.CV

TL;DR: The paper presents a unified formulation of non-parametric quantization methods using lattice coding theory, identifies issues with existing methods like BSQ, and proposes Leech lattice-based quantization (Λ₂₄-SQ) that achieves better performance with simplified training.

DetailsMotivation: To develop a more effective non-parametric quantization approach by leveraging lattice coding theory to understand existing methods' limitations and find better quantization structures that offer high symmetry and even distribution on hyperspheres.

Method: Proposes a unified formulation of non-parametric quantization through lattice coding, explores various lattice candidates (random, Fibonacci, densest sphere packing), and develops Spherical Leech Quantization (Λ₂₄-SQ) based on the highly symmetric Leech lattice.
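
The paper frames lookup-free quantizers as lattice codes on the hypersphere. As a rough illustration of that view (not the paper's actual Lambda_24 decoder), quantization amounts to normalizing the latent, snapping it to the nearest lattice point, and re-projecting to the sphere; with the simple {+-1}^d lattice this reduces to a BSQ-style quantizer. The `nearest_lattice_point` callable below is a placeholder for a lattice-specific nearest-point decoder.

```python
import torch
import torch.nn.functional as F

def spherical_lattice_quantize(z, nearest_lattice_point):
    """Illustrative spherical lattice quantization:
    1) project the latent onto the unit hypersphere,
    2) snap it to the nearest lattice point (lattice-specific decoder),
    3) re-normalize so the code lies on the sphere.
    For Lambda_24-SQ, `nearest_lattice_point` would be a Leech-lattice
    nearest-point decoder; here it is a placeholder."""
    u = F.normalize(z, dim=-1)
    q = nearest_lattice_point(u)
    return F.normalize(q, dim=-1)

def bsq_like_quantize(z):
    """Simplest instance: the {+-1}^d lattice gives BSQ-style codes,
    sign(u) / sqrt(d), which lie on the unit sphere."""
    u = F.normalize(z, dim=-1)
    d = z.shape[-1]
    return torch.sign(u) / (d ** 0.5)
```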

Result: Λ₂₄-SQ achieves better reconstruction quality across all metrics than prior art BSQ while using slightly fewer bits, improves reconstruction-compression tradeoff, and extends benefits to state-of-the-art auto-regressive image generation frameworks.

Conclusion: Lattice coding provides a theoretical foundation for understanding non-parametric quantization, and the Leech lattice’s high symmetry properties lead to superior quantization performance with simplified training procedures.

Abstract: Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find that the Leech lattice-based quantization method, dubbed Spherical Leech Quantization ($Λ_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.

[158] Automated Pollen Recognition in Optical and Holographic Microscopy Images

Swarn Singh Warshaneyan, Maksims Ivanovs, Blaž Cugmas, Inese Bērziņa, Laura Goldberga, Mindaugas Tamosiunas, Roberts Kadiķis

Main category: cs.CV

TL;DR: Deep learning models (YOLOv8s & MobileNetV3L) applied to pollen grain detection/classification in optical & holographic microscopy, with performance improvements via dataset expansion techniques.

DetailsMotivation: To improve and automate pollen grain detection and classification in veterinary cytology using both optical and holographic microscopy images, addressing the performance gap between these imaging modalities.

Method: Used YOLOv8s for object detection and MobileNetV3L for classification, evaluated across optical and holographic microscopy images. Applied dataset expansion techniques including automated labeling and bounding box area enlargement to improve holographic image performance.
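
For context, the detection stage with YOLOv8s would typically be run through the Ultralytics API along the lines below; the dataset YAML path, image size, and epoch count are placeholders, not the study's actual configuration.

```python
from ultralytics import YOLO

# Fine-tune YOLOv8s on an annotated pollen dataset described by a YAML file
# (paths and hyperparameters are placeholders).
model = YOLO("yolov8s.pt")
model.train(data="pollen.yaml", epochs=100, imgsz=640)

# Run detection on a new microscopy image and print box coordinates and scores.
results = model("slide_001.png")
for box in results[0].boxes:
    print(box.xyxy.tolist(), float(box.conf))
```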

Result: Optical images: 91.3% mAP50 for detection, 97% overall accuracy for classification. Initial holographic images: poor performance (2.49% mAP50 detection, 42% classification). After dataset expansion: holographic performance improved to 13.3% mAP50 detection and 54% classification accuracy.

Conclusion: Deep learning techniques can be effectively paired with cost-effective lensless digital holographic microscopy devices for image classification tasks, though performance remains lower than with optical microscopy.

Abstract: This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.

[159] SuperCLIP: CLIP with Simple Classification Supervision

Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

Main category: cs.CV

TL;DR: SuperCLIP enhances CLIP by adding token-level classification supervision to improve fine-grained visual-text alignment with minimal computational overhead.

DetailsMotivation: CLIP models underutilize fine-grained semantic signals in text, especially with long captions, due to only optimizing global image-text similarity without token-level supervision.

Method: Adds lightweight linear layer to vision encoder for classification-based supervision using token-level cues, augmenting contrastive learning with only 0.077% FLOPs increase.
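
One plausible reading of "classification-based supervision from token-level cues" is sketched below, as an assumption rather than SuperCLIP's confirmed formulation: a single linear head over pooled image features predicts, as a multi-label target, which caption tokens appear for each image, and this loss is added to the usual contrastive objective. The vocabulary size and loss weighting are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenClassificationHead(nn.Module):
    """Lightweight linear head over pooled image features that predicts
    which caption tokens are present (multi-label classification).
    Illustrative only; the exact target construction may differ."""

    def __init__(self, feat_dim=768, vocab_size=49408):
        super().__init__()
        self.fc = nn.Linear(feat_dim, vocab_size)

    def forward(self, image_feats, caption_token_ids):
        # caption_token_ids: (B, L) token ids of each caption
        # (padding ids would need masking in practice).
        logits = self.fc(image_feats)                 # (B, vocab)
        targets = torch.zeros_like(logits)
        targets.scatter_(1, caption_token_ids, 1.0)   # bag-of-tokens labels
        return F.binary_cross_entropy_with_logits(logits, targets)

# total_loss = clip_contrastive_loss + lambda_cls * token_cls_loss
```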

Result: Consistently improves zero-shot classification, image-text retrieval, and purely visual tasks; works with both original web data and re-captioned data; alleviates small-batch performance drop.

Conclusion: SuperCLIP effectively recovers textual supervision through token-level classification, enhancing fine-grained alignment without additional data or significant computational cost.

Abstract: Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP’s training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP’s ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP’s small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.

[160] SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition

Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa

Main category: cs.CV

TL;DR: SignIT is a new Italian Sign Language (LIS) dataset with 644 videos (3.33 hours) annotated with 94 sign classes across 5 categories, plus extracted 2D keypoints, used to benchmark sign recognition models.

DetailsMotivation: To address the lack of resources for Italian Sign Language (LIS) recognition research and provide a benchmark dataset for studying sign language understanding.

Method: Created SignIT dataset with manual annotation of 94 sign classes across 5 macro-categories, extracted 2D keypoints (hands, face, body), and benchmarked using state-of-the-art models to analyze impact of temporal information, keypoints, and RGB frames.

Result: Results demonstrate limitations of current state-of-the-art models on the challenging LIS dataset, showing how different input modalities (temporal, keypoints, RGB) influence recognition performance.

Conclusion: SignIT provides a valuable resource for LIS research, revealing current model limitations and highlighting the need for improved approaches in sign language recognition, with all data and annotations publicly released.

Abstract: In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints, and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.

[161] Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency

Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng, Kaiwen Zhang, Yihua Sun, Chuhong Yang, Weihang Zhang, Fang Chen, Yilan Wu, Lie Ju, Guochen Ning, Longfei Ma, Huiping Yao, Jinyuan Wang, Peilun Shi, Yukun Zhou, Jie Xu, Pearse A. Keane, Hanruo Liu, Hongen Liao, Ningli Wang, Huiqi Li

Main category: cs.CV

TL;DR: ReVision is a retinal foundation model trained on 485,980 fundus photos and diagnostic reports from a decade-long telemedicine program, achieving state-of-the-art zero-shot disease detection and requiring minimal adaptation for deployment in low-resource settings.

DetailsMotivation: Current retinal foundation models are limited by curated research datasets lacking clinical context and require extensive task-specific optimization, making them inefficient for deployment in low-resource clinical settings.

Method: ReVision learns from natural alignment between 485,980 color fundus photographs and corresponding diagnostic reports accumulated through a decade-long telemedicine program across 162 medical institutions in China, without requiring additional annotation.

Result: Achieves zero-shot disease detection with average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 clinical cohorts; matches fine-tuned alternatives with orders of magnitude fewer parameters and examples; improves ophthalmologist diagnostic accuracy by 14.8% in reader study.

Conclusion: Clinical native intelligence can be directly extracted from clinical archives without further annotation to build medical AI systems suited for various low-resource settings, demonstrating superior deployment efficiency and performance.

Abstract: Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision’s zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.

[162] DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors

Yiheng Huang, Junhong Chen, Anqi Ning, Zhanhong Liang, Nick Michiels, Luc Claesen, Wenyin Liu

Main category: cs.CV

TL;DR: DASP: A self-supervised framework using spatiotemporal priors for nighttime depth estimation, addressing low visibility and illumination issues through adversarial learning and 3D consistency projection.

DetailsMotivation: Self-supervised monocular depth estimation works well in daytime but deteriorates at night due to low visibility, textureless areas from insufficient light, and blurry regions from moving objects.

Method: DASP framework with two branches: 1) Adversarial branch with spatiotemporal priors learning blocks (SPLB) containing spatial-based temporal learning module (STLM) for motion variations and axial spatial learning module (ASLM) for multiscale structure; 2) Self-supervised branch with 3D consistency projection loss for bilateral projection into shared 3D space.
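
The 3D consistency idea, projecting two frames into a shared 3D space with predicted depth and comparing the resulting point sets, can be sketched roughly as follows. This is a generic formulation under assumptions (camera intrinsics `K_inv` and a relative pose `T_st`, e.g. from a pose network, are taken as given), not the paper's exact loss; a faithful version would also handle pixel correspondences and occlusion.

```python
import torch

def backproject(depth, K_inv, pixel_grid):
    """Lift pixels to camera-space 3D points: X = depth * K^-1 * [u, v, 1]^T.
    depth: (B, 1, H, W); pixel_grid: (3, H*W) homogeneous pixel coordinates."""
    B, _, H, W = depth.shape
    rays = (K_inv @ pixel_grid).unsqueeze(0)      # (1, 3, H*W)
    return depth.view(B, 1, -1) * rays            # (B, 3, H*W)

def consistency_3d_loss(depth_t, depth_s, K_inv, T_st, pixel_grid):
    """Compare target-frame points with source-frame points mapped into the
    target camera by the relative pose T_st (4x4). Correspondence handling
    is omitted; only the core projection step is kept."""
    P_t = backproject(depth_t, K_inv, pixel_grid)  # (B, 3, N)
    P_s = backproject(depth_s, K_inv, pixel_grid)
    R, t = T_st[:3, :3], T_st[:3, 3:]
    P_s_in_t = R @ P_s + t                         # rigid transform
    return (P_t - P_s_in_t).abs().mean()
```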

Result: Achieves state-of-the-art performance on Oxford RobotCar and nuScenes datasets for nighttime depth estimation; ablation studies validate component effectiveness.

Conclusion: DASP successfully addresses nighttime depth estimation challenges by leveraging spatiotemporal priors and 3D consistency, demonstrating superior performance over existing methods.

Abstract: Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.

[163] HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li, Sidan Du

Main category: cs.CV

TL;DR: HiFi-Portrait: A high-fidelity zero-shot portrait generation method that improves identity preservation and face attribute control using multi-face features and 3D-aware landmarks.

DetailsMotivation: Existing diffusion-based methods for identity-preserved portrait generation produce lower-fidelity results when using multiple reference images and struggle to customize face attributes precisely.

Method: Introduces face refiner and landmark generator for fine-grained multi-face features and 3D-aware landmarks; designs HiFi-Net to fuse features and align with landmarks; creates automated pipeline for ID-based dataset construction.

Result: Surpasses state-of-the-art approaches in face similarity and controllability; compatible with previous SDXL-based works.

Conclusion: HiFi-Portrait effectively addresses fidelity and control limitations in identity-preserved portrait generation through multi-face feature fusion and landmark alignment.

Abstract: Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.

[164] TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration

Zhiwen Yang, Jiaju Zhang, Yang Yi, Jian Liang, Bingzheng Wei, Yan Xu

Main category: cs.CV

TL;DR: TAT is a task-adaptive Transformer framework for medical image restoration that addresses task interference and imbalance in All-in-One models through task-specific weight generation and dynamic loss balancing.

DetailsMotivation: All-in-One medical image restoration models face challenges with task interference (conflicting gradient updates across tasks on shared parameters) and task imbalance (uneven optimization due to varying learning difficulties), which hinder performance when handling diverse modalities and degradation types simultaneously.

Method: Proposes a task-adaptive Transformer (TAT) with two key innovations: 1) Task-adaptive weight generation strategy that creates task-specific weight parameters to eliminate gradient conflicts on shared parameters, and 2) Task-adaptive loss balancing strategy that dynamically adjusts loss weights based on task-specific learning difficulties to prevent task domination or undertraining.

Result: Extensive experiments show TAT achieves state-of-the-art performance in three MedIR tasks (PET synthesis, CT denoising, and MRI super-resolution) in both task-specific and All-in-One settings.

Conclusion: The proposed TAT framework effectively addresses critical inter-task relationships in All-in-One medical image restoration models, demonstrating superior performance through task-adaptive weight generation and loss balancing strategies.

Abstract: Medical image restoration (MedIR) aims to recover high-quality medical images from their low-quality counterparts. Recent advancements in MedIR have focused on All-in-One models capable of simultaneously addressing multiple different MedIR tasks. However, due to significant differences in both modality and degradation types, using a shared model for these diverse tasks requires careful consideration of two critical inter-task relationships: task interference, which occurs when conflicting gradient update directions arise across tasks on the same parameter, and task imbalance, which refers to uneven optimization caused by varying learning difficulties inherent to each task. To address these challenges, we propose a task-adaptive Transformer (TAT), a novel framework that dynamically adapts to different tasks through two key innovations. First, a task-adaptive weight generation strategy is introduced to mitigate task interference by generating task-specific weight parameters for each task, thereby eliminating potential gradient conflicts on shared weight parameters. Second, a task-adaptive loss balancing strategy is introduced to dynamically adjust loss weights based on task-specific learning difficulties, preventing task domination or undertraining. Extensive experiments demonstrate that our proposed TAT achieves state-of-the-art performance in three MedIR tasks–PET synthesis, CT denoising, and MRI super-resolution–both in task-specific and All-in-One settings. Code is available at https://github.com/Yaziwel/TAT.
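
The task-adaptive loss balancing idea is to give harder (slower-learning) tasks larger loss weights so no task dominates or undertrains. A minimal sketch of one common difficulty-based scheme (weights from the ratio of current to initial loss, softmax-normalized) is shown below; TAT's exact rule may differ.

```python
import torch

class DynamicLossBalancer:
    """Re-weight per-task losses by their relative training progress.

    Tasks whose loss has decreased least (i.e. are hardest so far) receive
    larger weights. This is a generic difficulty-based scheme, not
    necessarily the exact strategy used by TAT.
    """
    def __init__(self, task_names, temperature=1.0):
        self.initial = {t: None for t in task_names}
        self.temperature = temperature

    def __call__(self, losses):  # losses: dict of task name -> scalar tensor
        for t, l in losses.items():
            if self.initial[t] is None:
                self.initial[t] = l.detach()
        # Inverse training rate: close to 1 means little progress (hard task).
        rates = torch.stack([losses[t].detach() / self.initial[t] for t in losses])
        weights = torch.softmax(rates / self.temperature, dim=0)
        return sum(w * losses[t] for w, t in zip(weights, losses))
```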

[165] LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction

Chenyu Zhao, Yingxue Xu, Fengtao Zhou, Yihui Wang, Hao Chen

Main category: cs.CV

TL;DR: KEMM is an LLM-driven knowledge-enhanced multimodal model that integrates expert reports and prognostic background knowledge to improve cancer survival prediction by focusing on discriminative features from redundant pathology and genomic data.

DetailsMotivation: Current multimodal survival prediction methods struggle with high-dimensional, redundant pathology images and genomic data, making it difficult to extract discriminative features and align modalities. Simple survival labels are insufficient for supervising such complex tasks.

Method: Proposes KEMM with two key components: 1) Expert reports refined by LLMs provide succinct clinical diagnostic statements, and 2) Prognostic background knowledge (PBK) generated by LLMs offers cancer-specific prognostic context. Uses knowledge-enhanced cross-modal (KECM) attention module to guide feature extraction toward survival-relevant features.

Result: Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance in cancer survival prediction.

Conclusion: Integrating expert knowledge and prognostic background through LLM-driven approaches significantly enhances multimodal survival prediction by focusing on clinically relevant features and overcoming data redundancy challenges.

Abstract: Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may typically suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by an LLM, provides valuable prognostic background knowledge on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.
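
The KECM module is described as using LLM-derived knowledge text to pull survival-relevant features out of highly redundant pathology/genomic tokens. A minimal cross-attention sketch of that pattern follows; the class name, dimensions, and residual/normalization choices are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class KnowledgeEnhancedCrossModalAttention(nn.Module):
    """Hypothetical sketch of a KECM-style block: knowledge embeddings query
    redundant pathology/genomic tokens so that only survival-relevant
    features are aggregated."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, knowledge_tokens, modality_tokens):
        # knowledge_tokens: (B, Nk, D) from LLM-refined reports / PBK text
        # modality_tokens:  (B, Nm, D) WSI patch or gene-group embeddings
        fused, _ = self.attn(query=knowledge_tokens,
                             key=modality_tokens,
                             value=modality_tokens)
        return self.norm(knowledge_tokens + fused)
```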

[166] TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios

Mengyu Li, Xingcheng Zhou, Guang Chen, Alois Knoll, Hu Cao

Main category: cs.CV

TL;DR: A new event camera dataset for ITS with detection and tracking benchmarks, addressing limitations of frame-based cameras in low-light and high-speed scenarios.

DetailsMotivation: Frame-based cameras in ITS struggle with dim lighting and high-speed motion, while event cameras offer low latency, high dynamic range, and high temporal resolution. There's a research gap in event-based vision for ITS applications.

Method: Created a pilot dataset for event-based ITS covering vehicle and pedestrian detection and tracking. Established a tracking-by-detection benchmark with a specialized feature extractor based on this dataset.

Result: Achieved excellent performance on the established benchmark, demonstrating the effectiveness of the proposed approach for event-based multi-object tracking in ITS.

Conclusion: Event cameras show considerable potential for ITS applications, and the introduced dataset and benchmark provide a foundation for future research in event-based vision for intelligent transportation systems.

Abstract: In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.

[167] WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

Main category: cs.CV

TL;DR: WorldPlay is a streaming video diffusion model that achieves real-time interactive world modeling with long-term geometric consistency, generating 720p video at 24 FPS while overcoming speed-memory trade-offs.

DetailsMotivation: Current methods face a trade-off between speed and memory, limiting real-time interactive world modeling with long-term geometric consistency. The paper aims to resolve this limitation.

Method: Three key innovations: 1) Dual Action Representation for robust user input control, 2) Reconstituted Context Memory with temporal reframing to maintain geometric consistency and alleviate memory attenuation, 3) Context Forcing distillation to preserve long-range information in memory-aware models.

Result: WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, outperforming existing techniques and showing strong generalization across diverse scenes.

Conclusion: WorldPlay successfully enables real-time interactive world modeling with long-term geometric consistency by addressing the speed-memory trade-off through innovative memory management and distillation techniques.

Abstract: This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user’s keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student’s capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

[168] Distill Video Datasets into Images

Zhenghao Zhao, Haoxuan Wang, Kai Wang, Yuzhang Shang, Yuan Hong, Yan Yan

Main category: cs.CV

TL;DR: SFVD proposes single-frame video distillation that distills videos into informative frames per class, using differentiable interpolation to create video sequences while restricting updates to frames for better optimization.

DetailsMotivation: Dataset distillation works well for images but struggles with videos due to the substantial increase in learnable parameters from the temporal dimension, which complicates optimization and hinders convergence.

Method: SFVD distills videos into single informative frames per class, uses differentiable interpolation to transform frames into video sequences, matches with original dataset while restricting updates to frames, and incorporates temporal information by combining distilled frames with sampled real videos through channel reshaping.

Result: SFVD substantially outperforms prior methods on multiple benchmarks, achieving improvements of up to 5.3% on MiniUCF, offering a more effective solution for video dataset distillation.

Conclusion: The single-frame approach effectively addresses the optimization challenges in video dataset distillation by reducing learnable parameters while maintaining discriminative semantics, providing an efficient framework for video dataset compression.

Abstract: Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.
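
The central trick is that only one frame per class is a learnable parameter; a fixed differentiable expansion turns it into a clip so it can be matched against real videos while gradients flow only into the frame. The sketch below illustrates that idea; the specific expansion used here (blending the frame with a slightly rescaled copy) is an illustrative placeholder, not the paper's interpolation operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleFrameVideo(nn.Module):
    """Sketch of the single-frame idea: only `frame` is learnable; a fixed,
    differentiable expansion turns it into a T-frame clip."""
    def __init__(self, channels=3, size=112, num_frames=8):
        super().__init__()
        self.frame = nn.Parameter(torch.randn(channels, size, size) * 0.1)
        # Fixed interpolation weights along the temporal axis.
        self.register_buffer("alphas", torch.linspace(0.0, 1.0, num_frames))

    def forward(self):
        f = self.frame
        # A fixed second "endpoint": the same frame, slightly zoomed (placeholder).
        zoomed = F.interpolate(f[None], scale_factor=1.1, mode="bilinear",
                               align_corners=False)
        zoomed = F.interpolate(zoomed, size=f.shape[-2:], mode="bilinear",
                               align_corners=False)[0]
        clip = torch.stack([(1 - a) * f + a * zoomed for a in self.alphas])
        return clip  # (T, C, H, W); gradients flow only into self.frame
```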

[169] AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation

Fei Wu, Marcel Dreier, Nora Gourmelon, Sebastian Wind, Jianlin Zhang, Thorsten Seehaus, Matthias Braun, Andreas Maier, Vincent Christlein

Main category: cs.CV

TL;DR: AMD-HookNet++ is a hybrid CNN-Transformer model for glacier segmentation and calving front delineation in SAR images, achieving state-of-the-art performance with smoother front boundaries.

DetailsMotivation: Existing pure CNN approaches like AMD-HookNet struggle with long-range dependencies due to convolution's local nature, while pure Transformer models produce jagged calving front edges. There's a need for a method that combines global context with local detail preservation for accurate glacier segmentation.

Method: Proposes AMD-HookNet++ with a dual-branch architecture: 1) Transformer-based context branch for long-range dependencies and global context, and 2) CNN-based target branch for local detail preservation. Includes enhanced spatial-channel attention module to dynamically adjust token relationships between branches, and pixel-to-pixel contrastive deep supervision for optimization.

Result: Achieves state-of-the-art on CaFFe benchmark: IoU of 78.2, HD95 of 1,318 m, and competitive MDE of 367 m. Produces smoother calving front delineations compared to pure Transformer approaches, resolving jagged edge issues.

Conclusion: The hybrid CNN-Transformer architecture effectively combines global context modeling with local detail preservation for superior glacier segmentation and calving front delineation, offering improved monitoring capabilities for glacier dynamics and ice sheet mass balance.

Abstract: The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet was the first to introduce a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model’s ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches through dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and an HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.
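
The spatial-channel attention idea, re-weighting fused branch features along the channel axis and then the spatial axis, follows a well-known pattern (e.g., CBAM-style gating). The sketch below shows that generic pattern applied to the two branch outputs; the concrete module in AMD-HookNet++ may differ substantially.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Illustrative sketch of spatial + channel re-weighting for fusing a CNN
    branch and a Transformer branch; not the paper's exact module."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, cnn_feat, transformer_feat):
        x = cnn_feat + transformer_feat            # (B, C, H, W) fused features
        x = x * self.channel_gate(x)               # channel re-weighting
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial_gate(pooled)       # spatial re-weighting
```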

[170] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li

Main category: cs.CV

TL;DR: ViRC framework introduces Reason Chunking with Critical Reasoning Units (CRUs) to structure multimodal mathematical reasoning, achieving 18.8% improvement over baselines.

DetailsMotivation: Existing MLLMs perform textual reasoning from static images only, missing dynamic visual acquisition during reasoning. Humans repeatedly examine images and use step-by-step reasoning with intermediate propositions, following Miller's Law in cognitive science.

Method: Propose ViRC framework with Reason Chunking mechanism that structures multimodal CoT into consecutive Critical Reasoning Units (CRUs). Create CRUX dataset with explicit CRU annotations using visual tools and reasoning patterns. Use progressive training: Instructional SFT, Practice SFT, and Strategic RL.

Result: ViRC-7B model achieves 18.8% average improvement over baselines across multiple mathematical benchmarks.

Conclusion: The ViRC framework effectively simulates human expert problem-solving patterns through structured reasoning units, significantly enhancing multimodal mathematical reasoning performance.

Abstract: CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller’s Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, built using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.

[171] Enhancing Visual Sentiment Analysis via Semiotic Isotopy-Guided Dataset Construction

Marco Blanchini, Giovanna Maria Dimitri, Benedetta Tondi, Tarcisio Lancioni, Mauro Barni

Main category: cs.CV

TL;DR: The paper proposes a method to create larger, more diverse VSA datasets using semiotic isotopy concepts, enabling models to better focus on emotionally relevant image elements and achieve improved generalization across benchmarks.

DetailsMotivation: Visual Sentiment Analysis faces challenges due to the vast diversity of emotionally salient images and difficulty acquiring comprehensive data. Existing datasets are limited, leading to poor generalization when models are trained and tested across different datasets.

Method: The approach integrates semiotic isotopy concepts into dataset creation, starting from existing data collections to build larger, more diverse datasets that help models focus on emotionally relevant combinations of image elements.

Result: Models trained on datasets generated with this method consistently outperform those trained on original data collections, achieving superior generalization across major VSA benchmarks.

Conclusion: The semiotic isotopy-based dataset creation approach effectively addresses VSA challenges by enabling better model focus on emotional content and improving generalization performance across diverse datasets.

Abstract: Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new, larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks.

[172] ART: Articulated Reconstruction Transformer

Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong

Main category: cs.CV

TL;DR: ART is a feed-forward transformer model that reconstructs complete 3D articulated objects from sparse multi-view RGB images, using part-based prediction to output physically interpretable reconstructions with articulation parameters.

DetailsMotivation: Previous methods for articulated object reconstruction are either slow optimization-based approaches that rely on fragile cross-state correspondences, or feed-forward models limited to specific object categories. There's a need for a category-agnostic, efficient method that can handle diverse articulated objects.

Method: ART treats articulated objects as assemblies of rigid parts and formulates reconstruction as part-based prediction. It uses a transformer architecture that maps sparse image inputs to learnable part slots, then jointly decodes unified representations for individual parts including 3D geometry, texture, and explicit articulation parameters.

Result: ART achieves significant improvements over existing baselines and establishes new state-of-the-art for articulated object reconstruction from image inputs across diverse benchmarks. The reconstructions are physically interpretable and readily exportable for simulation.

Conclusion: ART presents a category-agnostic, feed-forward approach that successfully reconstructs complete 3D articulated objects from sparse RGB images, overcoming limitations of previous methods and producing physically interpretable results suitable for downstream applications like simulation.

Abstract: We introduce ART, Articulated Reconstruction Transformer – a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

[173] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan

Main category: cs.CV

TL;DR: CRISP reconstructs simulatable human motion and scene geometry from monocular video using planar primitives, contact modeling, and RL-based physical plausibility verification.

DetailsMotivation: Prior methods for joint human-scene reconstruction either lack physics or produce noisy geometry that fails in motion tracking with scene interactions. There's a need for clean, simulation-ready geometry for real-to-sim applications in robotics and AR/VR.

Method: 1) Recovers convex, clean geometry by fitting planar primitives to point clouds via clustering over depth, normals, and flow. 2) Uses human-scene contact modeling to reconstruct occluded geometry (e.g., chair seats). 3) Ensures physical plausibility by using reconstructions to drive humanoid controllers via reinforcement learning.

Result: Reduces motion tracking failure rates from 55.2% to 6.9% on EMDB and PROX benchmarks, achieves 43% faster RL simulation throughput, and validates on diverse in-the-wild videos including casually-captured, Internet, and Sora-generated content.

Conclusion: CRISP enables physically-valid human motion and interaction environment generation at scale, advancing real-to-sim applications for robotics and AR/VR by producing simulation-ready reconstructions from monocular video.

Abstract: We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP’s ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
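
The geometry-recovery step, clustering scene points and fitting one planar primitive per cluster, can be sketched with standard tools: k-means over surface normals and an SVD plane fit. The snippet below is a simplified stand-in (only normals are used as the clustering feature, whereas CRISP also uses depth and flow).

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_planes(points, normals, n_clusters=6):
    """Cluster a point cloud by surface normal and fit one plane per cluster.

    Returns a list of (normal, offset) pairs for plane equation n.x + d = 0.
    A simplified illustration of CRISP's planar-primitive fitting.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(normals)
    planes = []
    for k in range(n_clusters):
        pts = points[labels == k]
        if len(pts) < 3:
            continue
        centroid = pts.mean(axis=0)
        # Smallest singular vector of the centered points = plane normal.
        _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
        n = vt[-1]
        planes.append((n, -float(n @ centroid)))
    return planes
```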

[174] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao

Main category: cs.CV

TL;DR: MemFlow introduces dynamic memory retrieval for streaming video generation, using text prompts to select relevant historical frames for each chunk, improving narrative coherence while maintaining efficiency.

DetailsMotivation: Existing streaming video generation methods use fixed strategies to compress historical frames, but different video chunks need different historical references. Fixed strategies can't adapt to varying narrative needs when new events or scene changes occur.

Method: Before generating each video chunk, dynamically update memory bank by retrieving most relevant historical frames using the text prompt for that chunk. During generation, only activate most relevant tokens in memory for each attention query to maintain efficiency.

Result: Achieves outstanding long-context consistency with minimal computation overhead (only 7.9% speed reduction compared to memory-free baseline). Maintains compatibility with any streaming video generation model using KV cache.

Conclusion: MemFlow effectively solves the consistency challenge in streaming video generation through dynamic memory retrieval, enabling narrative coherence across scene changes while preserving computational efficiency.

Abstract: The core challenge for streaming video generation is maintaining content consistency over a long context, which places high demands on the memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the most relevant historical frames with the text prompt of this chunk. This design enables narrative coherence even if a new event happens or the scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
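
The retrieval step, picking the historical frames most relevant to the next chunk's prompt, can be sketched as a top-k search over cached frame embeddings. The snippet below shows that generic step; the embedding spaces, the value of k, and the subsequent token-level activation in the attention layers are assumptions/simplifications.

```python
import torch
import torch.nn.functional as F

def retrieve_memory(frame_embeds, prompt_embed, k=16):
    """Pick the k historical frames most relevant to the coming chunk's prompt.

    frame_embeds: (N, D) cached embeddings of previously generated frames
    prompt_embed: (D,)   embedding of the text prompt for the coming chunk
    """
    sims = F.normalize(frame_embeds, dim=-1) @ F.normalize(prompt_embed, dim=0)
    topk = torch.topk(sims, k=min(k, frame_embeds.shape[0])).indices
    return topk.sort().values  # keep the selected frames in temporal order
```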

[175] MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness

Xiaoyun Xu, Shujian Yu, Zhuoran Liu, Stjepan Picek

Main category: cs.CV

TL;DR: MIMIR: A self-supervised adversarial training method for Vision Transformers that uses mutual information constraints to improve robustness against evasion attacks.

DetailsMotivation: Vision Transformers (ViTs) are vulnerable to evasion attacks, but existing adversarial training methods are incompatible with ViTs. The paper aims to develop specialized AT strategies for ViTs' unique architecture.

Method: Proposes MIMIR, a self-supervised AT method using mutual information penalty. Based on theoretical MI analysis showing MI between adversarial examples and latent representations should be constrained. Uses masked image modeling with autoencoders for adversarial pre-training.

Result: MIMIR consistently improves natural and robust accuracy on CIFAR-10, Tiny-ImageNet, and ImageNet-1K, outperforming SOTA AT results on ImageNet-1K. Shows superior robustness against unforeseen attacks, common corruption data, and withstands adaptive attacks.

Conclusion: MIMIR provides an effective self-supervised adversarial training approach for Vision Transformers by leveraging mutual information constraints, addressing the incompatibility of existing AT methods with ViT architecture.

Abstract: Vision Transformers (ViTs) have emerged as a fundamental architecture and serve as the backbone of modern vision-language models. Despite their impressive performance, ViTs exhibit notable vulnerability to evasion attacks, necessitating the development of specialized Adversarial Training (AT) strategies tailored to their unique architecture. While a direct solution might involve applying existing AT methods to ViTs, our analysis reveals significant incompatibilities, particularly with state-of-the-art (SOTA) approaches such as Generalist (CVPR 2023) and DBAT (USENIX Security 2024). This paper presents a systematic investigation of adversarial robustness in ViTs and provides a novel theoretical Mutual Information (MI) analysis in its autoencoder-based self-supervised pre-training. Specifically, we show that MI between the adversarial example and its latent representation in ViT-based autoencoders should be constrained via derived MI bounds. Building on this insight, we propose a self-supervised AT method, MIMIR, that employs an MI penalty to facilitate adversarial pre-training by masked image modeling with autoencoders. Extensive experiments on CIFAR-10, Tiny-ImageNet, and ImageNet-1K show that MIMIR can consistently provide improved natural and robust accuracy, where MIMIR outperforms SOTA AT results on ImageNet-1K. Notably, MIMIR demonstrates superior robustness against unforeseen attacks and common corruption data and can also withstand adaptive attacks where the adversary possesses full knowledge of the defense mechanism. Our code and trained models are publicly available at: https://github.com/xiaoyunxxy/MIMIR.

[176] Renal Cell Carcinoma subtyping: learning from multi-resolution localization

Mohamad Mohamad, Francesco Ponzio, Santa Di Cataldo, Damien Ambrosetti, Xavier Descombes

Main category: cs.CV

TL;DR: A self-supervised learning approach for Renal Cell Carcinoma subtype classification that reduces the need for annotated datasets by leveraging multi-resolution histological data.

DetailsMotivation: Renal Cell Carcinoma has high mortality due to late diagnosis, and current AI diagnostic tools are limited by scarce annotated datasets needed for supervised training.

Method: A novel self-supervised training strategy that exploits the multi-resolution nature of histological samples to reduce annotation requirements while maintaining accuracy.

Result: The tool demonstrates classification capability on whole slide imaging datasets for Renal Cancer subtyping, with performance comparable to state-of-the-art supervised methods.

Conclusion: Self-supervised learning using multi-resolution histological data can effectively reduce annotation needs for Renal Cell Carcinoma diagnosis without significantly compromising accuracy.

Abstract: Renal Cell Carcinoma is typically asymptomatic at the early stages for many patients. This leads to a late diagnosis of the tumor, where the curability likelihood is lower, and makes the mortality rate of Renal Cell Carcinoma high, with respect to its incidence rate. To increase the survival chance, a fast and correct categorization of the tumor subtype is paramount. Nowadays, computerized methods, based on artificial intelligence, represent an interesting opportunity to improve the productivity and the objectivity of the microscopy-based Renal Cell Carcinoma diagnosis. Nonetheless, much of their exploitation is hampered by the paucity of annotated datasets, which are essential for proficient training of supervised machine learning technologies. This study sets out to investigate a novel self-supervised training strategy for machine learning diagnostic tools, based on the multi-resolution nature of the histological samples. We aim to reduce the need for annotated data without significantly reducing the accuracy of the tool. We demonstrate the classification capability of our tool on a whole slide imaging dataset for Renal Cancer subtyping, and we compare our solution with several state-of-the-art classification counterparts.

[177] LSM: A Comprehensive Metric for Assessing the Safety of Lane Detection Systems in Autonomous Driving

Jörg Gamerdinger, Sven Teufel, Stephan Amann, Georg Volk, Oliver Bringmann

Main category: cs.CV

TL;DR: Proposes Lane Safety Metric (LSM) - a safety evaluation metric for lane detection systems that considers scene semantics, detection range, and potential causes of missing detections to provide interpretable safety scores.

DetailsMotivation: Current lane detection evaluation lacks sufficient safety metrics, unlike object detection which has advanced safety evaluation. Safe autonomous driving requires not just object detection but also drivable area and lane corridor detection, which needs proper safety assessment.

Method: Develops Lane Safety Metric (LSM) that incorporates multiple safety factors: scene semantics (road type, road width), detection range, and potential causes of missing detections (incorporating vehicle speed). Provides interpretable safety scores.

Result: Evaluated LSM on various virtual scenarios using different lane detection approaches. Compared with state-of-the-art performance metrics to demonstrate its effectiveness for safety assessment.

Conclusion: LSM addresses the gap in lane detection safety evaluation by providing comprehensive safety assessment that considers contextual factors, enabling better safety evaluation for autonomous vehicle perception systems.

Abstract: Comprehensive perception of the vehicle’s environment and correct interpretation of the environment are crucial for the safe operation of autonomous vehicles. The perception of surrounding objects is the main component for further tasks such as trajectory planning. However, safe trajectory planning requires not only object detection, but also the detection of drivable areas and lane corridors. While first approaches consider an advanced safety evaluation of object detection, the evaluation of lane detection still lacks sufficient safety metrics. Similar to the safety metrics for object detection, additional factors such as the semantics of the scene with road type and road width, the detection range as well as the potential causes of missing detections, incorporated by vehicle speed, should be considered for the evaluation of lane detection. Therefore, we propose the Lane Safety Metric (LSM), which takes these factors into account and allows to evaluate the safety of lane detection systems by determining an easily interpretable safety score. We evaluate our offline safety metric on various virtual scenarios using different lane detection approaches and compare it with state-of-the-art performance metrics.
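
To make the idea of an interpretable, factor-based safety score concrete, the toy function below combines a speed-dependent required detection range with a lateral-accuracy term normalized by road width. The weighting and formula are invented for illustration and are not the LSM formula from the paper.

```python
def lane_safety_score(detected_range_m, required_range_m,
                      lateral_error_m, road_width_m):
    """Illustrative combination of lane-detection safety factors.

    required_range_m would typically grow with vehicle speed (e.g. braking
    distance); the combination below is a made-up example, not LSM itself.
    """
    range_term = min(detected_range_m / required_range_m, 1.0)
    lateral_term = max(0.0, 1.0 - 2.0 * lateral_error_m / road_width_m)
    return range_term * lateral_term  # 1.0 = safe, 0.0 = unsafe

# Example: a 40 m detection range at a speed requiring 60 m, with 0.3 m
# lateral error on a 3.5 m lane, yields roughly 0.55.
```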

[178] A Unified Framework with Multimodal Fine-tuning for Remote Sensing Semantic Segmentation

Xianping Ma, Xiaokang Zhang, Man-On Pun, Bo Huang

Main category: cs.CV

TL;DR: A unified framework called MFNet that integrates multimodal remote sensing data with SAM foundation model for semantic segmentation, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Multimodal remote sensing data provides comprehensive Earth surface information, but existing methods don't fully leverage foundation models like SAM while effectively integrating multimodal data for semantic segmentation.

Method: Proposes MFNet framework with Multimodal Fine-tuning Network that integrates SAM with multimodal data using Adapter/LoRA fine-tuning mechanisms, plus a pyramid-based Deep Fusion Module for multi-scale feature integration.

Result: Significantly outperforms existing methods on ISPRS Vaihingen, ISPRS Potsdam and MMHunan datasets, demonstrating SAM’s generalization with DSM data and setting new state-of-the-art.

Conclusion: MFNet provides a versatile foundation for multimodal remote sensing segmentation that retains SAM’s general knowledge while effectively leveraging multimodal data, with extensible architecture for future fine-tuning strategies.

Abstract: Multimodal remote sensing data, acquired from diverse sensors, offer a comprehensive and integrated perspective of the Earth’s surface. Leveraging multimodal fusion techniques, semantic segmentation enables detailed and accurate analysis of geographic scenes, surpassing single-modality approaches. Building on advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study proposes a unified framework incorporating a novel Multimodal Fine-tuning Network (MFNet) for remote sensing semantic segmentation. The proposed framework is designed to seamlessly integrate with various fine-tuning mechanisms, demonstrated through the inclusion of Adapter and Low-Rank Adaptation (LoRA) as representative examples. This extensibility ensures the framework’s adaptability to other emerging fine-tuning strategies, allowing models to retain SAM’s general knowledge while effectively leveraging multimodal data. Additionally, a pyramid-based Deep Fusion Module (DFM) is introduced to integrate high-level geographic features across multiple scales, enhancing feature representation prior to decoding. This work also highlights SAM’s robust generalization capabilities with Digital Surface Model (DSM) data, a novel application. Extensive experiments on three benchmark multimodal remote sensing datasets, ISPRS Vaihingen, ISPRS Potsdam and MMHunan, demonstrate that the proposed MFNet significantly outperforms existing methods in multimodal semantic segmentation, setting a new standard in the field while offering a versatile foundation for future research and applications. The source code for this work is accessible at https://github.com/sstary/SSRS.

[179] Semantic-Free Procedural 3D Shapes Are Surprisingly Good Teachers

Xuweiyi Chen, Zezhou Cheng

Main category: cs.CV

TL;DR: Self-supervised 3D representation learning from procedural 3D programs (simple primitives) performs as well as learning from semantic 3D models across multiple downstream tasks, suggesting current methods don’t rely on shape semantics.

DetailsMotivation: 3D data acquisition is difficult (requires expertise/equipment) and raises copyright concerns, unlike widely accessible 2D images. Need scalable approach for 3D representation learning without real 3D assets.

Method: Learn 3D representations from procedural 3D programs that automatically generate 3D shapes using simple 3D primitives and augmentations, instead of real semantic 3D models.

Result: Representations from procedurally generated shapes perform on par with state-of-the-art representations from semantic 3D models across shape classification, part segmentation, masked point cloud completion, and scene semantic/instance segmentation tasks.

Conclusion: Current 3D self-supervised learning methods don’t rely on 3D shape semantics. Procedural programs provide effective alternative to real 3D data, with analysis of factors for good procedural programs.

Abstract: Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple 3D primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from the procedurally generated 3D shapes perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, such as shape classification, part segmentation, masked point cloud completion, and both scene semantic and instance segmentation. We provide a detailed analysis of the factors that make a good 3D procedural program. Extensive experiments further suggest that current 3D self-supervised learning methods on point clouds do not rely on the semantics of 3D shapes, shedding light on the nature of the learned 3D representations.
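
A procedural 3D program in this spirit composes simple primitives and applies augmentations to produce semantics-free training shapes. The toy generator below samples random cuboids, applies a rotation and jitter, and returns a point cloud; the actual program family in the paper is richer than this sketch.

```python
import numpy as np

def procedural_shape(n_primitives=4, points_per_primitive=512, rng=None):
    """Generate a semantics-free point cloud by sampling random cuboids,
    then applying a random rotation and Gaussian jitter. A toy stand-in
    for the procedural 3D programs described in the paper."""
    rng = rng or np.random.default_rng()
    clouds = []
    for _ in range(n_primitives):
        size = rng.uniform(0.1, 0.5, size=3)
        center = rng.uniform(-0.5, 0.5, size=3)
        pts = rng.uniform(-0.5, 0.5, size=(points_per_primitive, 3)) * size + center
        clouds.append(pts)
    cloud = np.concatenate(clouds, axis=0)
    # Random rotation about the z-axis plus small Gaussian jitter.
    theta = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0, 0, 1]])
    return cloud @ rot.T + rng.normal(scale=0.005, size=cloud.shape)
```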

[180] Multimodal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds

Simon B. Jensen, Stefan Oehmcke, Andreas Møgelmose, Meysam Madadi, Christian Igel, Sergio Escalera, Thomas B. Moeslund

Main category: cs.CV

TL;DR: Deep learning fusion of 2D orthophotos and 3D ALS point clouds achieves 82.0% accuracy in assessing forest biodiversity potential, outperforming single-modality approaches.

DetailsMotivation: Traditional forest biodiversity field surveys are labor-intensive and spatially limited, creating need for scalable automated assessment methods using remote sensing data.

Method: Created BioVista dataset with 44,378 paired orthophoto-ALS samples; used ResNet for 2D images and PointVector for 3D point clouds; tested confidence-based ensembling, feature concatenation, and end-to-end fusion approaches.

Result: Single modalities achieved 76.7% (orthophotos) and 75.8% (ALS) accuracy; end-to-end fusion achieved best performance at 82.0% accuracy for distinguishing low/high biodiversity potential areas.

Conclusion: Spectral information from orthophotos and structural information from ALS point clouds effectively complement each other for reliable forest biodiversity assessment through multimodal deep learning fusion.

Abstract: Assessment of forest biodiversity is crucial for ecosystem management and conservation. While traditional field surveys provide high-quality assessments, they are labor-intensive and spatially limited. This study investigates whether deep learning-based fusion of close-range sensing data from 2D orthophotos and 3D airborne laser scanning (ALS) point clouds can reliably assess the biodiversity potential of forests. We introduce the BioVista dataset, comprising 44,378 paired samples of orthophotos and ALS point clouds from temperate forests in Denmark, designed to explore multimodal fusion approaches. Using deep neural networks (ResNet for orthophotos and PointVector for ALS point clouds), we investigate each data modality’s ability to assess forest biodiversity potential, achieving overall accuracies of 76.7% and 75.8%, respectively. We explore various 2D and 3D fusion approaches: confidence-based ensembling, feature-level concatenation, and end-to-end training, with the latter achieving an overall accuracy of 82.0% when separating low- and high-potential forest areas. Our results demonstrate that spectral information from orthophotos and structural information from ALS point clouds effectively complement each other in the assessment of forest biodiversity potential.
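
The best-performing variant trains both backbones end to end and fuses their embeddings before classification. A minimal late-fusion head of that kind is sketched below; the backbone interfaces, embedding dimensions, and the fusion MLP are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EndToEndFusionHead(nn.Module):
    """Sketch of end-to-end 2D+3D fusion for binary biodiversity potential.

    `image_backbone` (e.g. a ResNet) and `point_backbone` (e.g. PointVector)
    are assumed to return fixed-size embeddings of dimension img_dim/pts_dim.
    """
    def __init__(self, image_backbone, point_backbone, img_dim, pts_dim):
        super().__init__()
        self.image_backbone = image_backbone
        self.point_backbone = point_backbone
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + pts_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2))  # low vs. high biodiversity potential

    def forward(self, orthophoto, als_points):
        z = torch.cat([self.image_backbone(orthophoto),
                       self.point_backbone(als_points)], dim=-1)
        return self.classifier(z)
```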

[181] From My View to Yours: Ego-to-Exo Transfer in VLMs for Understanding Activities of Daily Living

Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das

Main category: cs.CV

TL;DR: Ego2ExoVLM is a vision-language model that learns to infer egocentric properties from exocentric videos by leveraging synchronized ego-exo video pairs during training, addressing the limitation of viewpoint invariant VLMs in understanding human-object interactions from third-person perspectives.

DetailsMotivation: Standard VLMs have viewpoint invariant training that limits their ability to understand egocentric properties (like human-object interactions) from exocentric video observations. This is critical for applications like Activities of Daily Living monitoring where understanding first-person perspectives is essential but deploying egocentric cameras is impractical.

Method: Ego2ExoVLM uses two key components: 1) Ego2Exo Sequence Distillation - transfers knowledge from an egocentric teacher model to an exocentric student model, and 2) Ego Adaptive Visual Tokens - enhances the effectiveness of this knowledge transfer. The model leverages time-synchronized ego-exo video pairs during training.

Result: The model achieves state-of-the-art results on the ADL-X benchmark suite and outperforms strong baselines on the proposed Ego-in-Exo Perception benchmark (3.9K questions). It’s evaluated on 10 tasks across both benchmarks.

Conclusion: Ego2ExoVLM successfully addresses the limitation of viewpoint invariant VLMs by enabling inference of egocentric properties from exocentric videos, with applications in ADL monitoring where egocentric cameras are impractical to deploy.

Abstract: Vision Language Models (VLMs) have achieved strong performance across diverse video understanding tasks. However, their viewpoint invariant training limits their ability to understand egocentric properties (e.g., human object interactions) from exocentric video observations. This limitation is critical for many applications, such as Activities of Daily Living (ADL) monitoring, where the understanding of egocentric properties is essential, and egocentric cameras are impractical to deploy. To address this limitation, we propose Ego2ExoVLM, a VLM that learns to infer egocentric properties from exocentric videos by leveraging time-synchronized ego-exo videos during training. Ego2ExoVLM accomplishes this through the use of two components: Ego2Exo Sequence Distillation, which transfers knowledge from an egocentric teacher to an exocentric student, and Ego Adaptive Visual Tokens, designed to enhance the effectiveness of this knowledge transfer. To measure this capability, we introduce Ego-in-Exo Perception, a benchmark of 3.9K questions curated to explicitly measure the understanding of egocentric properties from exocentric videos. Ego2ExoVLM is evaluated on 10 tasks across Ego-in-Exo Perception and existing ADL benchmarks, achieving state-of-the-art results on the ADL-X benchmark suite and outperforming strong baselines on our proposed benchmark. All code, models, and data will be released at https://github.com/dominickrei/EgoExo4ADL.
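
Ego2Exo Sequence Distillation transfers an egocentric teacher's behavior to an exocentric student over time-synchronized pairs. The snippet below shows the generic token-level KL distillation recipe such a scheme would build on; it is a stand-in, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def sequence_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between student (exo input) and teacher (synchronized
    ego input) next-token distributions, averaged over the batch.

    student_logits, teacher_logits: (B, T, V).
    """
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits.detach() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau
```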

[182] LWGANet: Addressing Spatial and Channel Redundancy in Remote Sensing Visual Tasks with Light-Weight Grouped Attention

Wei Lu, Xue Yang, Si-Bao Chen

Main category: cs.CV

TL;DR: LWGANet is a lightweight neural network backbone designed specifically for remote sensing images that addresses spatial redundancy (vast homogeneous backgrounds) and channel redundancy (extreme scale variations) through two novel modules: TGFI for spatial attention and LWGA for channel grouping.

DetailsMotivation: Remote sensing images have unique challenges not addressed by existing lightweight models designed for natural images: spatial redundancy from large homogeneous backgrounds and channel redundancy due to extreme scale variations that make single feature spaces inefficient.

Method: LWGANet introduces two core modules: 1) Top-K Global Feature Interaction (TGFI) to mitigate spatial redundancy by focusing computation on salient regions, and 2) Light-Weight Grouped Attention (LWGA) to resolve channel redundancy by partitioning channels into specialized, scale-specific pathways.

Result: Extensive experiments on twelve diverse datasets across four major RS tasks (scene classification, oriented object detection, semantic segmentation, and change detection) show LWGANet consistently outperforms state-of-the-art lightweight backbones in both accuracy and efficiency.

Conclusion: LWGANet establishes a new robust baseline for efficient visual analysis in remote sensing images by synergistically resolving core inefficiencies specific to RS scenarios, achieving superior trade-off between feature representation quality and computational cost.

Abstract: Light-weight neural networks for remote sensing (RS) visual analysis must overcome two inherent redundancies: spatial redundancy from vast, homogeneous backgrounds, and channel redundancy, where extreme scale variations render a single feature space inefficient. Existing models, often designed for natural images, fail to address this dual challenge in RS scenarios. To bridge this gap, we propose LWGANet, a light-weight backbone engineered for RS-specific properties. LWGANet introduces two core innovations: a Top-K Global Feature Interaction (TGFI) module that mitigates spatial redundancy by focusing computation on salient regions, and a Light-Weight Grouped Attention (LWGA) module that resolves channel redundancy by partitioning channels into specialized, scale-specific pathways. By synergistically resolving these core inefficiencies, LWGANet achieves a superior trade-off between feature representation quality and computational cost. Extensive experiments on twelve diverse datasets across four major RS tasks–scene classification, oriented object detection, semantic segmentation, and change detection–demonstrate that LWGANet consistently outperforms state-of-the-art light-weight backbones in both accuracy and efficiency. Our work establishes a new, robust baseline for efficient visual analysis in RS images.
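
The TGFI idea, letting only the most salient spatial tokens participate in global interaction, can be sketched as score-then-top-k attention with a scatter back into the token map. The module below shows that pattern; the saliency scorer, k, and attention settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopKGlobalFeatureInteraction(nn.Module):
    """Sketch of a TGFI-style block: only the k most salient spatial tokens
    participate in global self-attention; the rest pass through unchanged."""
    def __init__(self, dim, k=64, n_heads=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.k = k

    def forward(self, tokens):                      # tokens: (B, N, D)
        B, N, D = tokens.shape
        k = min(self.k, N)
        idx = self.score(tokens).squeeze(-1).topk(k, dim=1).indices  # (B, k)
        gather = idx.unsqueeze(-1).expand(-1, -1, D)
        salient = tokens.gather(1, gather)          # (B, k, D)
        updated, _ = self.attn(salient, salient, salient)
        out = tokens.clone()
        out.scatter_(1, gather, updated)            # write updated tokens back
        return out
```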

[183] Text Embedded Swin-UMamba for DeepLesion Segmentation

Ruida Cheng, Tejas Sudharshan Mathai, Pritam Mukherjee, Benjamin Hou, Qingqing Zhu, Zhiyong Lu, Matthew McAuliffe, Ronald M. Summers

Main category: cs.CV

TL;DR: Text-Swin-U/Mamba model integrates LLM text features with Swin-UMamba architecture for lesion segmentation, achieving state-of-the-art performance on DeepLesion dataset.

DetailsMotivation: To improve lesion segmentation for clinical assessment by combining imaging features with textual descriptions from radiology reports using LLMs.

Method: Integrates text features from LLMs into Swin-UMamba architecture for multimodal lesion segmentation using DeepLesion dataset with radiology report descriptions.

Result: Achieved Dice score of 82.64 and Hausdorff distance of 6.34 pixels, outperforming LanGuideMedSeg by 37.79%, XLSTM-UNet by 2.58%, and nnUNet by 1.01%.

Conclusion: Integration of LLM text features with vision models significantly improves lesion segmentation performance, demonstrating feasibility of multimodal approaches in medical imaging.

Abstract: Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow has the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, our method achieved a high Dice score of 82.64, and a low Hausdorff distance of 6.34 pixels was obtained for lesion segmentation. The proposed Text-Swin-U/Mamba model outperformed prior approaches: 37.79% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and surpassed the purely image-based XLSTM-UNet and nnUNet models by 2.58% and 1.01%, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba

[184] Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

Haoyu Zhang, Qiaohui Chu, Meng Liu, Haoxiang Shi, Yaowei Wang, Liqiang Nie

Main category: cs.CV

TL;DR: The paper proposes Ego-ExoClip, a dataset and training pipeline to transfer exocentric knowledge from existing MLLMs to improve egocentric video understanding, addressing limitations of current models that focus only on third-person vision.

DetailsMotivation: Current MLLMs primarily focus on third-person (exocentric) vision and overlook first-person (egocentric) video challenges. High data acquisition costs limit dataset size, impairing MLLM performance for embodied AI assistants that require egocentric understanding.

Method: Proposes learning mapping between exocentric and egocentric domains using Ego-ExoClip dataset (1.1M synchronized ego-exo clip-text pairs) and EgoIT instruction-tuning dataset. Introduces progressive mapping learning pipeline with three stages: Demonstrator Self-Preparation, Demonstrator-Learner Guidance, and Learner Self-Practice.

Result: Extensive experiments show existing MLLMs perform inadequately in egocentric video understanding, while the proposed model significantly outperforms leading models across diverse egocentric tasks.

Conclusion: The approach successfully transfers exocentric knowledge to enhance egocentric video understanding, addressing key limitations for embodied AI assistants through domain mapping and progressive learning.

Abstract: AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. However, current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique challenges of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D, together with the instruction-tuning dataset EgoIT, which is collected from multiple sources to enhance the model’s instruction-following capabilities. Building upon the datasets, we propose a migration strategy and further design a progressive mapping learning pipeline with three stages: Demonstrator Self-Preparation, Demonstrator-Learner Guidance, and Learner Self-Practice. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.

[185] GASPACHO: Gaussian Splatting for Controllable Humans and Objects

Aymen Mir, Arthur Moreau, Helisa Dhamo, Zhensong Zhang, Gerard Pons-Moll, Eduardo Pérez-Pellitero

Main category: cs.CV

TL;DR: GASPACHO reconstructs animatable human and object templates as separate Gaussian sets from multi-view video, enabling controllable novel human-object interactions with sharp object details and physically plausible animation.

DetailsMotivation: Prior methods reconstruct only humans and treat objects as background, lacking the ability to generate controllable renderings of novel human-object interactions with both components being animatable.

Method: Simultaneously recovers animatable human and object templates as distinct Gaussian sets; learns object Gaussians on 2D surface manifold rather than 3D volume for sharper details; introduces contact constraint in Gaussian space for physically plausible animation.
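
The contact constraint can be pictured with a small sketch like the one below, which penalizes the distance between nearby human and object Gaussian centers; the nearest-neighbour formulation and threshold are assumptions for exposition, not the paper's exact loss.

```python
# Illustrative contact constraint between two Gaussian sets (human vs. object).
import torch

def contact_loss(human_centers, object_centers, contact_thresh=0.02):
    # human_centers: (N, 3), object_centers: (M, 3) Gaussian means in world space
    d = torch.cdist(human_centers, object_centers)   # (N, M) pairwise distances
    nearest, _ = d.min(dim=1)                         # closest object Gaussian per human Gaussian
    in_contact = nearest < contact_thresh             # only penalize Gaussians already near the object
    if not in_contact.any():
        return human_centers.new_zeros(())
    return nearest[in_contact].mean()                 # pull contacting Gaussians onto the surface

loss = contact_loss(torch.rand(500, 3), torch.rand(300, 3))
```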

Result: Achieves high-quality reconstructions under heavy occlusion across BEHAVE, NeuralDome, and DNA-Rendering benchmarks; supports controllable synthesis of novel human-object interactions; enables composition of humans and objects in 3D scenes.

Conclusion: First method to showcase neural rendering for controllable generation of photoreal humans interacting with dynamic objects in diverse scenes, advancing human-object interaction modeling.

Abstract: We present GASPACHO, a method for generating photorealistic, controllable renderings of human-object interactions from multi-view RGB video. Unlike prior work that reconstructs only the human and treats objects as background, GASPACHO simultaneously recovers animatable templates for both the human and the interacting object as distinct sets of Gaussians, thereby allowing for controllable renderings of novel human-object interactions in different poses from novel camera viewpoints. We introduce a novel formulation that learns object Gaussians on an underlying 2D surface manifold rather than in a 3D volume, yielding sharper, fine-grained object details for dynamic object reconstruction. We further propose a contact constraint in Gaussian space that regularizes human-object relations and enables natural, physically plausible animation. Across three benchmarks (BEHAVE, NeuralDome, and DNA-Rendering), GASPACHO achieves high-quality reconstructions under heavy occlusion and supports controllable synthesis of novel human-object interactions. We also demonstrate that our method allows for composition of humans and objects in 3D scenes and for the first time showcase that neural rendering can be used for the controllable generation of photoreal humans interacting with dynamic objects in diverse scenes. Our results are available at: https://miraymen.github.io/gaspacho/

[186] CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy

Dongyoung Kim, Mahmoud Afifi, Dongyun Kim, Michael S. Brown, Seon Joo Kim

Main category: cs.CV

TL;DR: A learning-based cross-camera color constancy method that generalizes to new cameras without retraining by using pre-calibrated color correction matrices to create camera fingerprint embeddings.

DetailsMotivation: White balance algorithms must adapt to different camera-specific raw color spaces, but current methods often require retraining for each new camera. There's a need for a solution that can generalize to unseen cameras without additional training.

Method: Uses pre-calibrated CCMs from camera ISPs to transform standard illumination colors into each camera’s raw space, encodes these into compact camera fingerprint embeddings (CFE), and employs data augmentation through interpolation between cameras and their CCMs to prevent overfitting.
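
A minimal numpy sketch of the fingerprint idea, assuming a 3x3 raw-to-XYZ CCM and a handful of Planckian-locus illuminants (the sample values below are rough placeholders): the illuminants are mapped into the camera's raw space and flattened into a compact vector.

```python
# Build a camera fingerprint embedding (CFE) from a pre-calibrated CCM.
import numpy as np

def camera_fingerprint(ccm_raw_to_xyz, xyz_illuminants):
    # ccm_raw_to_xyz: (3, 3) calibrated matrix mapping camera raw -> CIE XYZ
    # xyz_illuminants: (K, 3) predefined illuminant colors in XYZ (e.g. Planckian locus samples)
    xyz_to_raw = np.linalg.inv(ccm_raw_to_xyz)          # invert to go XYZ -> camera raw
    raw = xyz_illuminants @ xyz_to_raw.T                 # (K, 3) illuminants in this camera's raw space
    raw = raw / raw.sum(axis=1, keepdims=True)           # normalize to chromaticities
    return raw.reshape(-1)                               # flatten into the fingerprint embedding

ccm = np.array([[1.8, -0.5, -0.3],
                [-0.2, 1.5, -0.3],
                [0.0, -0.4, 1.4]])
planckian = np.array([[0.95, 1.00, 1.09],   # warm-to-cool placeholder illuminants, not calibrated data
                      [1.10, 1.00, 0.35]])
cfe = camera_fingerprint(ccm, planckian)
print(cfe.shape)  # (6,)
```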

Result: Achieves state-of-the-art cross-camera color constancy performance across multiple datasets and backbones, while remaining lightweight and using only data readily available in camera ISPs.

Conclusion: The proposed method effectively solves the cross-camera color constancy problem by leveraging existing ISP calibration data to create generalizable camera representations, enabling adaptation to new cameras without retraining.

Abstract: Computational color constancy, or white balancing, is a key module in a camera’s image signal processor (ISP) that corrects color casts from scene lighting. Because this operation occurs in the camera-specific raw color space, white balance algorithms must adapt to different cameras. This paper introduces a learning-based method for cross-camera color constancy that generalizes to new cameras without retraining. Our method leverages pre-calibrated color correction matrices (CCMs) available on ISPs that map the camera’s raw color space to a standard space (e.g., CIE XYZ). Our method uses these CCMs to transform predefined illumination colors (i.e., along the Planckian locus) into the test camera’s raw space. The mapped illuminants are encoded into a compact camera fingerprint embedding (CFE) that enables the network to adapt to unseen cameras. To prevent overfitting due to limited cameras and CCMs during training, we introduce a data augmentation technique that interpolates between cameras and their CCMs. Experimental results across multiple datasets and backbones show that our method achieves state-of-the-art cross-camera color constancy while remaining lightweight and relying only on data readily available in camera ISPs.

[187] Learning neuroimaging models from health system-scale data

Yiwei Lyu, Samir Harake, Asadur Chowdury, Soumyanil Banerjee, Rachel Gologorsky, Shixuan Liu, Anna-Katharina Meissner, Akshay Rao, Chenhui Zhao, Akhil Kondepudi, Cheng Jiang, Xinhai Hou, Rushikesh S. Joshi, Volker Neuschmelting, Ashok Srinivasan, Dawn Kleindorfer, Brian Athey, Vikas Gulani, Aditya Pandey, Honglak Lee, Todd Hollon

Main category: cs.CV

TL;DR: Prima is a vision language model for neuroimaging that analyzes clinical MRI studies to provide diagnoses, worklist prioritization, and referral recommendations, achieving 92.0 mean AUC across 52 neurological diagnoses.

DetailsMotivation: The global demand for MRI studies has increased strain on health systems, prolonged turnaround times, and intensified physician burnout, disproportionately affecting low-resource and rural patients. There's a need for AI solutions to address these challenges in neuroimaging.

Method: Developed Prima, a vision language model trained on over 220,000 MRI studies using a hierarchical vision architecture. Tested in a 1-year health system-wide study including 30K MRI studies across 52 radiologic diagnoses from major neurological disorders.

Result: Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. The model demonstrated algorithmic fairness across sensitive groups and can help mitigate health system biases.

Conclusion: Prima represents a transformative AI foundation for neuroimaging that can improve healthcare delivery by providing explainable differential diagnoses, worklist prioritization, and clinical referrals while addressing health system challenges and biases.

Abstract: Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima’s role in advancing AI-driven healthcare.

[188] AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection

Lorenzo Pellegrini, Davide Cozzolino, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Luisa Verdoliva, Marco Prati, Marco Ramilli

Main category: cs.CV

TL;DR: Ai-GenBench is a novel benchmark for detecting AI-generated images that uses temporal evaluation with historically ordered synthetic images to test generalization to new generative models.

DetailsMotivation: The rapid advancement of generative AI has enabled high-quality image synthesis but raised critical challenges for media authenticity, creating an urgent need for robust detection of AI-generated images in real-world scenarios.

Method: Introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images historically ordered by their generative models, focusing on high-quality diverse content with standardized protocols and accessible tools.
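
The temporal protocol can be summarized as a simple loop over generators ordered by release date, as in the hedged sketch below; the detector interface (fit/evaluate) and data structures are hypothetical.

```python
# Schematic temporal benchmark: train incrementally on each generator in release order,
# then test on generators that appear only later in time.
def temporal_benchmark(detector, generator_splits):
    """generator_splits: list of (name, train_set, test_set), ordered by release date."""
    results = []
    for i, (name, train_set, _) in enumerate(generator_splits):
        detector.fit(train_set)                      # incremental update on the newest generator
        future = generator_splits[i + 1:]            # everything released after the training cut-off
        for future_name, _, test_set in future:
            acc = detector.evaluate(test_set)        # generalization to unseen generators
            results.append((name, future_name, acc))
    return results
```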

Result: Ai-GenBench provides a comprehensive dataset, standardized evaluation protocol, and accessible tools that overcome limitations of current approaches like arbitrary dataset splits, unfair comparisons, and excessive computational demands.

Conclusion: The benchmark enables meaningful comparison of detection methods and scalable solutions to keep pace with new synthetic generators, with publicly available code and data to ensure reproducibility and support robust forensic detector development.

Abstract: The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.

[189] GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting

Wenjie Liu, Zhongliang Liu, Junwei Shu, Changbo Wang, Yang Li

Main category: cs.CV

TL;DR: GT²-GS is a geometry-aware texture transfer framework for 3D Gaussian Splatting that transfers 2D image textures to 3D scenes while preserving geometric integrity through feature augmentation, geometry-consistent loss, and iterative geometry preservation.

DetailsMotivation: Existing 3D style transfer methods often overlook geometric information, making it challenging to achieve high-quality 3D texture transfer results. There's a need for methods that can transfer image textures to 3D representations while maintaining geometric consistency.

Method: The framework includes: 1) Geometry-aware texture augmentation module to expand texture features, 2) Geometry-consistent texture loss incorporating camera pose and 3D geometric information for controllable editing, and 3) Geometry preservation strategy alternating between texture transfer and geometry correction stages over multiple iterations.
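
The alternating schedule in stage 3 might look like the schematic loop below; the step functions and iteration counts are placeholders, not the released code.

```python
# Alternating optimization: texture-transfer steps followed by geometry-correction steps.
def alternate_texture_geometry(gaussians, texture_step, geometry_step,
                               rounds=5, texture_iters=300, geometry_iters=100):
    for _ in range(rounds):
        for _ in range(texture_iters):
            texture_step(gaussians)    # optimize appearance against the geometry-consistent texture loss
        for _ in range(geometry_iters):
            geometry_step(gaussians)   # pull positions/covariances back toward the original geometry
    return gaussians
```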

Result: Extensive experiments demonstrate the method’s effectiveness and controllability. Through geometric awareness, the approach achieves texture transfer results that better align with human visual perception compared to existing methods.

Conclusion: GT²-GS successfully addresses the challenge of transferring 2D textures to 3D Gaussian Splatting representations by incorporating geometric awareness, achieving better quality results that preserve scene geometry while enabling controllable texture-oriented appearance editing.

Abstract: Transferring 2D textures to 3D modalities is of great significance for improving the efficiency of multimedia content creation. Existing approaches have rarely focused on transferring image textures onto 3D representations. 3D style transfer methods are capable of transferring abstract artistic styles to 3D scenes. However, these methods often overlook the geometric information of the scene, which makes it challenging to achieve high-quality 3D texture transfer results. In this paper, we present GT²-GS, a geometry-aware texture transfer framework for Gaussian splatting. From the perspective of matching texture features with geometric information in rendered views, we identify the issue of insufficient texture features and propose a geometry-aware texture augmentation module to expand the texture feature set. Moreover, a geometry-consistent texture loss is proposed to optimize texture features into the scene representation. This loss function incorporates both camera pose and 3D geometric information of the scene, enabling controllable texture-oriented appearance editing. Finally, a geometry preservation strategy is introduced. By alternating between the texture transfer and geometry correction stages over multiple iterations, this strategy achieves a balance between learning texture features and preserving geometric integrity. Extensive experiments demonstrate the effectiveness and controllability of our method. Through geometric awareness, our approach achieves texture transfer results that better align with human visual perception. Our homepage is available at https://vpx-ecnu.github.io/GT2-GS-website.

[190] VIBE: Can a VLM Read the Room?

Tania Chakraborty, Eylon Caplan, Dan Goldwasser

Main category: cs.CV

TL;DR: VLMs have a Visual Social-Pragmatic Inference gap - they struggle to understand social cues and dynamics from visual context, unlike humans who use non-verbal communication.

DetailsMotivation: While LLMs excel at textual social reasoning, they miss crucial non-verbal cues. VLMs could bridge this gap but their ability to understand social dynamics from visual context is largely unexplored and potentially limited.

Method: Identified the Visual Social-Pragmatic Inference gap, proposed a new task for VLMs to address it, created a high-quality dataset, and benchmarked several VLMs on this task.

Result: The paper reveals a previously overlooked limitation in VLMs regarding social reasoning from visual context, establishing a benchmark for evaluating their social-pragmatic inference capabilities.

Conclusion: VLMs have significant limitations in understanding social dynamics from visual cues, highlighting the need for improved models that can better interpret social-pragmatic information in visual contexts.

Abstract: Understanding human social behavior such as recognizing emotions and the social dynamics causing them is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap, however their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.

[191] Wi-CBR: Salient-aware Adaptive WiFi Sensing for Cross-domain Behavior Recognition

Ruobei Zhang, Shengeng Tang, Huan Yan, Xiang Zhang, Jiabao Guo

Main category: cs.CV

TL;DR: Wi-CBR: A WiFi sensing method using dual-branch self-attention to combine phase and Doppler features with saliency guidance for cross-domain behavior recognition.

DetailsMotivation: Traditional WiFi-based cross-domain behavior recognition suffers from domain-specific signal interference on gesture variation. Existing methods map phase features to common spaces but may degrade gesture semantics. Using Doppler Frequency Shift (DFS) to dynamically supplement phase features could enable better generalization by exploring wider feature spaces while preserving gesture information.

Method: Proposes Wi-CBR with: 1) Dual-branch self-attention module capturing temporal features from phase (dynamic path length) and kinematic features from DFS (motion velocity); 2) Saliency Guidance Module using group attention to mine critical activity features and gating mechanisms to optimize information entropy for feature fusion between salient and non-salient characteristics.
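
A minimal sketch of the dual-branch design with a learned fusion gate is given below; the feature dimensions and gating form are illustrative assumptions.

```python
# Dual-branch self-attention over phase and DFS features, fused by a learned gate.
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.phase_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dfs_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, phase, dfs):
        # phase, dfs: (B, T, dim) sequences of CSI-derived features
        p, _ = self.phase_attn(phase, phase, phase)   # temporal features from path-length variation
        d, _ = self.dfs_attn(dfs, dfs, dfs)           # kinematic features from motion velocity
        g = self.gate(torch.cat([p, d], dim=-1))      # per-element fusion weight
        return g * p + (1 - g) * d                    # salient-aware blend of the two branches

model = DualBranchFusion()
fused = model(torch.randn(8, 50, 128), torch.randn(8, 50, 128))
```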

Result: Extensive experiments on Widar3.0 and XRF55 datasets demonstrate superior performance in both in-domain and cross-domain scenarios compared to previous methods.

Conclusion: Wi-CBR effectively addresses cross-domain WiFi behavior recognition by combining phase and DFS features with saliency-aware fusion, achieving better generalization and preserving gesture semantics across domains.

Abstract: The challenge in WiFi-based cross-domain Behavior Recognition lies in the significant interference of domain-specific signals on gesture variation. Previous methods alleviate this interference by mapping the phase from multiple domains into a common feature space, which can degrade gesture semantic information. Using the Doppler Frequency Shift (DFS) signal to dynamically supplement the phase features instead enables the model not only to explore a wider feature space but also to avoid this degradation, yielding better generalization. Specifically, we propose a novel Salient-aware Adaptive WiFi Sensing for Cross-domain Behavior Recognition (Wi-CBR), which constructs a dual-branch self-attention module that captures temporal features from phase information reflecting dynamic path length variations while extracting kinematic features from DFS correlated with motion velocity. Moreover, we design a Saliency Guidance Module that employs group attention mechanisms to mine critical activity features and utilizes gating mechanisms to optimize information entropy, facilitating feature fusion and enabling effective interaction between salient and non-salient behavioral characteristics. Extensive experiments on two large-scale public datasets (Widar3.0 and XRF55) demonstrate the superior performance of our method in both in-domain and cross-domain scenarios.

[192] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review

Céline Finet, Stephane Da Silva Martins, Jean-Bernard Hayet, Ioannis Karamouzas, Javad Amirian, Sylvie Le Hégarat-Mascle, Julien Pettré, Emanuel Aldea

Main category: cs.CV

TL;DR: Survey of deep learning-based multi-agent trajectory prediction methods (2020-2025), focusing on architectural designs, input representations, prediction strategies, and ETH/UCY benchmark evaluation.

DetailsMotivation: To understand multi-agent interactions better for applications in social robot navigation, autonomous navigation, and crowd modeling, and to review recent advancements in deep learning-based trajectory prediction.

Method: Categorizes existing methods based on architectural design, input representations, and prediction strategies, with emphasis on models evaluated using ETH/UCY benchmark.

Result: Provides comprehensive review of recent advancements (2020-2025) in multi-agent trajectory prediction, highlighting current state-of-the-art approaches and their evaluation.

Conclusion: Identifies key challenges and future research directions in multi-agent human trajectory prediction, emphasizing the need for continued advancement in this important field.

Abstract: With the emergence of powerful data-driven methods in human trajectory prediction (HTP), gaining a finer understanding of multi-agent interactions is now within reach, with important implications in areas such as social robot navigation, autonomous navigation, and crowd modeling. This survey reviews some of the most recent advancements in deep learning-based multi-agent trajectory prediction, focusing on studies published between 2020 and 2025. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi-agent HTP.

[193] TextMesh4D: Text-to-4D Mesh Generation via Jacobian Deformation Field

Sisi Dai, Xinxin Su, Ruizhen Hu, Kai Xu

Main category: cs.CV

TL;DR: TextMesh4D: A novel framework for text-to-4D mesh generation that produces dynamic meshes with high geometric fidelity and temporal consistency using Jacobian Deformation Fields and Local-Global Semantic Regularization.

DetailsMotivation: Existing text-to-4D methods avoid direct mesh generation due to topological constraints, using alternative representations (NeRFs, 3DGS) that suffer from insufficient geometric fidelity, temporal artifacts, and limited compatibility with CG pipelines. Direct mesh generation faces challenges with deformation inflexibility and semantic inconsistency.

Method: Two core innovations: 1) Jacobian Deformation Field (JDF) shifts deformation unit from vertices to faces using per-face Jacobians for flexible transformations free from topological constraints. 2) Local-Global Semantic Regularizer (LGSR) leverages mesh’s innate geometric properties to enforce semantic coherence both locally and globally across frames.

Result: TextMesh4D achieves state-of-the-art performance in temporal consistency, structural fidelity, and visual realism while requiring only a single 24GB GPU. It establishes a new benchmark for efficient and high-quality text-to-4D mesh generation.

Conclusion: The paper introduces a pioneering framework for text-to-4D mesh generation that directly addresses challenges of deformation inflexibility and semantic inconsistency, enabling high-quality dynamic mesh generation compatible with modern CG pipelines.

Abstract: Dynamic 3D (4D) content generation, particularly text-to-4D, remains a challenging and under-explored problem due to its inherent spatiotemporal complexity. Existing text-to-4D methods typically avoid direct mesh generation due to inherent topological constraints, favoring alternative representations like NeRFs or 3DGS. However, these non-mesh approaches suffer from insufficient geometric fidelity, temporal artifacts, and limited compatibility with modern computer graphics (CG) pipelines. In contrast, directly generating dynamic meshes faces two key challenges: i) deformation inflexibility, as traditional vertex-based optimization is constrained by meshes’ explicitly encoded topology, and ii) semantic inconsistency, arising from stochastic noise in distilled priors. In this paper, we introduce TextMesh4D, a pioneering framework for text-to-4D mesh generation that directly addresses these challenges. TextMesh4D features two core innovations: 1) the Jacobian Deformation Field (JDF), which shifts the deformation unit from vertices to faces, using per-face Jacobians to model flexible transformations free from topological constraints. 2) the Local-Global Semantic Regularizer (LGSR), which leverages the mesh’s innate geometric properties to enforce semantic coherence both locally and globally across frames. Extensive experiments demonstrate that TextMesh4D achieves state-of-the-art performance in temporal consistency, structural fidelity, and visual realism, while requiring only a single 24GB GPU. Our work establishes a new benchmark for efficient and high-quality text-to-4D mesh generation. The code will be released to facilitate future research.

[194] Inter- and Intra-image Refinement for Few Shot Segmentation

Ourui Fu, Hangzhou He, Kaiwen Li, Xinliang Zhang, Lei Zhu, Shuang Zeng, Zhaoheng Xie, Yanye Lu

Main category: cs.CV

TL;DR: IIR model addresses inter- and intra-image discrepancies in few-shot semantic segmentation by generating dual prototypes and using directional dropout for feature refinement.

DetailsMotivation: Existing few-shot semantic segmentation methods suffer from annotation bottlenecks and prototype-based limitations: intra-class gaps between support-query images cause noisy prior maps, and inter-class interference from visually similar regions leads to erroneous predictions.

Method: Proposes Inter- and Intra-image Refinement (IIR) model with: 1) inter-image class activation mapping generating two prototypes (core discriminative features + local specific features) for robust prior maps, and 2) intra-image directional dropout to mask inconsistent support-query feature pairs in cross attention.
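
For context, the sketch below shows the basic prototype-and-prior-map mechanism that IIR refines: masked average pooling over support features yields a prototype, and its cosine similarity with query features yields the prior map. Variable shapes are illustrative.

```python
# Prototype-based prior map: masked average pooling + cosine similarity.
import torch
import torch.nn.functional as F

def prior_map(support_feats, support_mask, query_feats):
    # support_feats: (C, H, W), support_mask: (H, W) binary, query_feats: (C, H, W)
    c, h, w = support_feats.shape
    m = support_mask.flatten().float()                                     # (H*W,)
    proto = (support_feats.flatten(1) * m).sum(1) / m.sum().clamp(min=1)   # masked average pooling -> (C,)
    q = F.normalize(query_feats.flatten(1), dim=0)                         # (C, H*W)
    p = F.normalize(proto, dim=0)                                          # (C,)
    return (p @ q).reshape(h, w)                                           # cosine-similarity prior map

pm = prior_map(torch.randn(64, 32, 32), (torch.rand(32, 32) > 0.5), torch.randn(64, 32, 32))
```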

Result: Achieves state-of-the-art performance on 9 benchmarks covering standard FSS, part FSS, and cross-domain FSS.

Conclusion: IIR effectively addresses inter- and intra-image discrepancies in few-shot semantic segmentation through dual-prototype generation and directional dropout, demonstrating superior generalization across diverse segmentation tasks.

Abstract: Deep neural networks for semantic segmentation rely on large-scale annotated datasets, leading to an annotation bottleneck that motivates few shot semantic segmentation (FSS) which aims to generalize to novel classes with minimal labeled exemplars. Most existing FSS methods adopt a prototype-based paradigm, which generates query prior map by extracting masked-area features from support images and then makes predictions guided by the prior map. However, they suffer from two critical limitations induced by inter- and intra-image discrepancies: 1) The intra-class gap between support and query images, caused by single-prototype representation, results in scattered and noisy prior maps; 2) The inter-class interference from visually similar but semantically distinct regions leads to inconsistent support-query feature matching and erroneous predictions. To address these issues, we propose the Inter- and Intra-image Refinement (IIR) model. The model contains an inter-image class activation mapping based method that generates two prototypes for class-consistent region matching, including core discriminative features and local specific features, and yields an accurate and robust prior map. For intra-image refinement, a directional dropout mechanism is introduced to mask inconsistent support-query feature pairs in cross attention, thereby enhancing decoder performance. Extensive experiments demonstrate that IIR achieves state-of-the-art performance on 9 benchmarks, covering standard FSS, part FSS, and cross-domain FSS. Our source code is available at https://github.com/forypipi/IIR.

[195] Guideline-Consistent Segmentation via Multi-Agent Refinement

Vanshika Vats, Ashwani Rathee, James Davis

Main category: cs.CV

TL;DR: Multi-agent training-free framework uses Worker-Supervisor architecture with reinforcement learning stop policy to ensure semantic segmentation adheres to complex textual guidelines without retraining.

DetailsMotivation: Real-world semantic segmentation requires strict adherence to complex, paragraph-length labeling guidelines, but both human and automated labeling often fail to follow them faithfully. Traditional approaches need expensive task-specific retraining that must be repeated as guidelines evolve, while current open-vocabulary methods struggle with intricate segmentation rules.

Method: Multi-agent, training-free framework coordinating general-purpose vision-language models within iterative Worker-Supervisor refinement architecture. Worker performs segmentation, Supervisor critiques it against retrieved guidelines, and lightweight reinforcement learning stop policy decides when to terminate the loop.
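
The control flow of the refinement loop can be sketched as below; the worker, supervisor, and stop-policy objects are hypothetical interfaces used only to show the loop structure.

```python
# Worker-Supervisor refinement loop with a learned stopping decision.
def refine_segmentation(image, guidelines, worker, supervisor, stop_policy, max_rounds=5):
    mask = worker.segment(image, guidelines)                      # initial segmentation attempt
    for round_idx in range(max_rounds):
        critique = supervisor.review(image, mask, guidelines)     # checks the mask against retrieved rules
        if stop_policy.should_stop(critique, round_idx):          # lightweight RL policy balances cost vs. quality
            break
        mask = worker.segment(image, guidelines, feedback=critique)  # revise using the critique
    return mask
```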

Result: Method notably outperforms state-of-the-art baselines on Waymo and ReasonSeg datasets, demonstrating strong generalization and instruction adherence while ensuring guideline-consistent masks and balancing resource use.

Conclusion: The proposed framework effectively addresses the challenge of following complex textual guidelines in semantic segmentation without requiring expensive retraining, offering a practical solution for real-world applications where guidelines frequently evolve.

Abstract: Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.

[196] White Aggregation and Restoration for Few-shot 3D Point Cloud Semantic Segmentation

Jiyun Im, SuBeen Lee, Miso Lee, Jae-Pil Heo

Main category: cs.CV

TL;DR: WARM proposes a novel prototype generation method using whitening and coloring transformations to align features before cross-attention, achieving state-of-the-art performance in few-shot 3D point cloud segmentation.

DetailsMotivation: Existing few-shot 3D point cloud segmentation methods use farthest point sampling (FPS) for prototype generation, which causes performance fluctuations due to sampling variability and lacks exploration of better prototype generation methods.

Method: Proposes White Aggregation and Restoration Module (WARM) that sandwiches cross-attention between whitening and coloring transformations. Whitening aligns support features to prototypical tokens before attention, and coloring restores the original distribution after attention.
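
A minimal sketch of the whiten, cross-attend, then color pattern, assuming token and feature dimensions chosen for illustration: whitening standardizes the support features before attention, and coloring re-applies the stored mean and covariance afterwards.

```python
# Whitening -> cross-attention -> coloring, sandwiching the prototype-generation attention.
import torch
import torch.nn as nn

def whiten(x, eps=1e-5):
    # x: (N, C) support features; returns whitened features plus statistics for coloring
    mu = x.mean(0, keepdim=True)
    xc = x - mu
    cov = xc.T @ xc / (x.shape[0] - 1) + eps * torch.eye(x.shape[1])
    eigval, eigvec = torch.linalg.eigh(cov)
    w = eigvec @ torch.diag(eigval.clamp(min=eps).rsqrt()) @ eigvec.T   # cov^(-1/2)
    c = eigvec @ torch.diag(eigval.clamp(min=eps).sqrt()) @ eigvec.T    # cov^(1/2)
    return xc @ w, (mu, c)

def color(x, stats):
    mu, c = stats
    return x @ c + mu                          # restore the original support distribution

tokens = torch.randn(1, 8, 64)                 # prototypical tokens (queries)
support = torch.randn(1024, 64)                # support point features
white_support, stats = whiten(support)
attn = nn.MultiheadAttention(64, 4, batch_first=True)
out, _ = attn(tokens, white_support[None], white_support[None])   # cross-attention in whitened space
prototypes = color(out[0], stats)              # coloring maps attended tokens back
```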

Result: WARM achieves state-of-the-art performance with significant margins on few-shot 3D point cloud segmentation benchmarks, demonstrating effectiveness through extensive experiments.

Conclusion: The simple yet effective WARM design enables robust attention-based prototype generation that captures semantic relationships in support features, overcoming distributional gaps in vanilla attention modules.

Abstract: Few-Shot 3D Point Cloud Semantic Segmentation (FS-PCS) aims to predict per-point labels for an unlabeled point cloud, given only a few labeled examples. To extract discriminative representations from the limited labeled set, existing methods have constructed prototypes using algorithms such as farthest point sampling (FPS). However, we point out that this convention has undesirable effects as performance fluctuates depending on sampling, while the prototype generation process remains underexplored in the field. This motivates us to investigate an advanced prototype generation method based on attention mechanism. Despite its potential, we found that vanilla attention module suffers from the distributional gap between prototypical tokens and support features. To overcome this, we propose White Aggregation and Restoration Module (WARM), which resolves the misalignment by sandwiching cross-attention between whitening and coloring transformations. Specifically, whitening aligns the features to tokens before the attention process, and coloring subsequently restores the original distribution to the attended tokens. This simple yet effective design enables robust attention, thereby generating prototypes that capture the semantic relationships in support features. WARM achieves state-of-the-art performance with a significant margin on FS-PCS benchmarks, and demonstrates its effectiveness through extensive experiments.

[197] Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner

Haotian Dong, Wenjing Wang, Chen Li, Jing Lyu, Di Lin

Main category: cs.CV

TL;DR: Proposes a method for generating high-quality RGB-A videos with transparency by learning shiftable RGB-A distributions through latent and noise space adjustments.

DetailsMotivation: Current RGB-A video generation methods suffer from low quality due to confusion between RGB and alpha channels, limiting applications that require transparency.

Method: Adjusts both latent and noise spaces: 1) transparency-aware bidirectional diffusion loss during VAE training to shift RGB-A distribution based on likelihood, 2) shifts mean of diffusion noise sampling and applies Gaussian ellipse mask for transparency guidance and controllability.
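
One way to picture the noise-space adjustment is the sketch below, which draws standard noise for RGB and shifted-mean noise for alpha modulated by a centred Gaussian ellipse mask; the shift value and mask width are assumptions.

```python
# Shifted-mean diffusion noise for the alpha channel, standard noise for RGB.
import torch

def sample_rgba_noise(b, h, w, alpha_shift=0.5, sigma_frac=0.35):
    rgb = torch.randn(b, 3, h, w)                                       # standard noise for RGB
    ys = torch.linspace(-1, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(-1, 1, w).view(1, w).expand(h, w)
    ellipse = torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_frac ** 2))   # centred Gaussian ellipse mask
    alpha = torch.randn(b, 1, h, w) + alpha_shift * ellipse             # shifted-mean alpha noise
    return torch.cat([rgb, alpha], dim=1)

noise = sample_rgba_noise(2, 64, 64)   # (2, 4, 64, 64)
```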

Result: Model outperforms state-of-the-art methods in visual quality, naturalness, transparency rendering, inference convenience, and controllability; includes release of high-quality RGB-A video dataset.

Conclusion: The proposed approach enables stable transparency generation without compromising RGB quality, addressing key limitations of existing RGB-A video generation methods.

Abstract: Generating RGB-A videos, which include alpha channels for transparency, has wide applications. However, current methods often suffer from low quality due to confusion between RGB and alpha. In this paper, we address this problem by learning shiftable RGB-A distributions. We adjust both the latent space and noise space, shifting the alpha distribution outward while preserving the RGB distribution, thereby enabling stable transparency generation without compromising RGB quality. Specifically, for the latent space, we propose a transparency-aware bidirectional diffusion loss during VAE training, which shifts the RGB-A distribution according to likelihood. For the noise space, we propose shifting the mean of diffusion noise sampling and applying a Gaussian ellipse mask to provide transparency guidance and controllability. Additionally, we construct a high-quality RGB-A video dataset. Compared to state-of-the-art methods, our model excels in visual quality, naturalness, transparency rendering, inference convenience, and controllability. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.

[198] One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu

Main category: cs.CV

TL;DR: FAE (Feature Auto-Encoder) is a simple framework that adapts pre-trained visual representations into low-dimensional latents suitable for generative models using minimal architecture, achieving state-of-the-art image generation quality.

DetailsMotivation: There's a fundamental mismatch between understanding-oriented visual representations (which need high-dimensional features) and generation-friendly latent spaces (which need low-dimensional latents). Existing approaches require complex architectures to bridge this gap.

Method: FAE uses two separate deep decoders: one reconstructs the original feature space, and another uses those reconstructed features for image generation. It can work with various self-supervised encoders (DINO, SigLIP) and generative models (diffusion, normalizing flows).
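
A minimal sketch of the adapter idea, assuming illustrative sizes: a single attention layer mixes frozen encoder tokens, a linear map produces the low-dimensional latent, and a deep decoder reconstructs the original feature space.

```python
# One-layer feature adapter with a deep feature-space decoder.
import torch
import torch.nn as nn

class FeatureAutoEncoder(nn.Module):
    def __init__(self, feat_dim=768, latent_dim=32, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)  # the single adapter layer
        self.to_latent = nn.Linear(feat_dim, latent_dim)                       # low-dim latent for generation
        self.feature_decoder = nn.Sequential(                                  # deep decoder back to feature space
            nn.Linear(latent_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, feats):
        # feats: (B, N, feat_dim) tokens from a frozen encoder such as DINO or SigLIP
        mixed, _ = self.attn(feats, feats, feats)
        z = self.to_latent(mixed)                 # latent the diffusion model / flow would be trained on
        recon = self.feature_decoder(z)           # reconstruction target: the original features
        return z, recon

fae = FeatureAutoEncoder()
z, recon = fae(torch.randn(2, 196, 768))
```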

Result: On ImageNet 256x256, FAE achieves near state-of-the-art FID of 1.29 with CFG (800 epochs) and 1.70 (80 epochs). Without CFG, it reaches state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), showing both high quality and fast learning.

Conclusion: FAE provides a simple yet effective solution to adapt pre-trained visual representations for generative tasks, bridging the gap between understanding and generation with minimal architecture while achieving strong performance across different generative model families.

Abstract: Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.

[199] Dynamic Prompt Generation for Interactive 3D Medical Image Segmentation Training

Tidiane Camaret Ndir, Alexander Pfefferle, Robin Tibor Schirrmeister

Main category: cs.CV

TL;DR: A training strategy combining dynamic volumetric prompt generation with content-aware adaptive cropping for efficient interactive 3D biomedical image segmentation, achieving competitive performance on benchmark tasks.

DetailsMotivation: Current foundation models for interactive 3D biomedical image segmentation either lack volumetric awareness or have limited interactive capabilities, creating a need for more efficient models that can iteratively refine predictions based on user prompts.

Method: Proposes a training strategy with dynamic volumetric prompt generation and content-aware adaptive cropping to optimize image encoder usage. Simulates realistic user interaction patterns during training while addressing computational challenges of learning from sequential refinement feedback on a single GPU. Initializes network using publicly available weights from nnInteractive segmentation model.
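
Content-aware adaptive cropping could look roughly like the sketch below, which crops a patch centred on the user prompt and sized from the current prediction; all sizes and margins are assumptions for exposition.

```python
# Content-aware crop around a volumetric click prompt, sized from the current mask.
import numpy as np

def adaptive_crop(volume, click_zyx, current_mask=None, min_size=64, margin=16):
    # volume: (D, H, W); click_zyx: (z, y, x) user prompt; current_mask: optional previous prediction
    if current_mask is not None and current_mask.any():
        zz, yy, xx = np.where(current_mask)
        extent = max(np.ptp(zz), np.ptp(yy), np.ptp(xx)) + 2 * margin   # grow the crop with the lesion size
    else:
        extent = min_size
    half = max(extent, min_size) // 2
    lo = [max(0, c - half) for c in click_zyx]
    hi = [min(s, c + half) for c, s in zip(click_zyx, volume.shape)]
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

patch = adaptive_crop(np.random.rand(128, 256, 256), (60, 120, 130))
```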

Result: Achieved strong performance on the Foundation Models for Interactive 3D Biomedical Image Segmentation competition with average final Dice score of 0.6385, normalized surface distance of 0.6614, and area-under-the-curve metrics of 2.4799 (Dice) and 2.5671 (NSD).

Conclusion: The proposed training strategy effectively addresses computational challenges while maintaining strong interactive segmentation performance, demonstrating a practical approach for efficient 3D biomedical image segmentation with user interaction.

Abstract: Interactive 3D biomedical image segmentation requires efficient models that can iteratively refine predictions based on user prompts. Current foundation models either lack volumetric awareness or suffer from limited interactive capabilities. We propose a training strategy that combines dynamic volumetric prompt generation with content-aware adaptive cropping to optimize the use of the image encoder. Our method simulates realistic user interaction patterns during training while addressing the computational challenges of learning from sequential refinement feedback on a single GPU. For efficient training, we initialize our network using the publicly available weights from the nnInteractive segmentation model. Evaluation on the Foundation Models for Interactive 3D Biomedical Image Segmentation competition demonstrates strong performance with an average final Dice score of 0.6385, normalized surface distance of 0.6614, and area-under-the-curve metrics of 2.4799 (Dice) and 2.5671 (NSD).

[200] FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking

Martha Teiko Teye, Ori Maoz, Matthias Rottmann

Main category: cs.CV

TL;DR: FutrTrack is a modular camera-LiDAR multi-object tracking framework that uses transformer-based refinement and multimodal fusion to improve 3D object tracking performance without explicit motion models.

DetailsMotivation: The paper aims to improve 3D multi-object tracking by leveraging multimodal sensor data (camera and LiDAR) to overcome limitations of single-sensor approaches, particularly for handling occlusion and viewpoint changes.

Method: Uses a two-stage approach: 1) Temporal smoother with transformer to refine bounding box sequences and reduce jitter, 2) Fusion tracker that integrates bounding boxes with multimodal BEV features from cameras and LiDAR without explicit motion models, using geometric and semantic cues for identity propagation.
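
The identity-assignment step relies on combining geometric and appearance cues; the sketch below uses a standard Hungarian matching over a weighted cost, which is a common choice for this step rather than necessarily the paper's exact formulation.

```python
# Track-to-detection assignment from geometric distance plus appearance (embedding) distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids(track_centers, track_embs, det_centers, det_embs, w_geom=1.0, w_app=1.0):
    # *_centers: (N, 3) box centers; *_embs: (N, D) BEV feature embeddings (ideally L2-normalized)
    geom = np.linalg.norm(track_centers[:, None] - det_centers[None], axis=-1)  # (T, M) center distances
    app = 1.0 - track_embs @ det_embs.T                                          # (T, M) cosine distances
    cost = w_geom * geom + w_app * app
    rows, cols = linear_sum_assignment(cost)          # Hungarian matching of tracks to detections
    return list(zip(rows.tolist(), cols.tolist()))

matches = assign_ids(np.random.rand(4, 3), np.random.rand(4, 16),   # toy inputs
                     np.random.rand(5, 3), np.random.rand(5, 16))
```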

Result: Achieves aMOTA of 74.7 on nuScenes test set, reduces identity switches while maintaining competitive accuracy, and demonstrates significant benefits of multimodal features over single-sensor approaches on both nuScenes and KITTI datasets.

Conclusion: Query-based transformer tracking methods benefit substantially from multimodal sensor features, and the framework provides an efficient way to improve transformer-based trackers to compete with other neural methods even with limited data and no pretraining.

Abstract: We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird’s-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we refine sequences of bounding boxes with a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.

[201] Training Multi-Image Vision Agents via End2End Reinforcement Learning

Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Wei Lin, Guojun Yin

Main category: cs.CV

TL;DR: IMAgent is an open-source vision agent trained via end-to-end RL for complex multi-image QA tasks, using multi-agent generated data and specialized visual tools to maintain visual attention during reasoning.

DetailsMotivation: Current open-source VLM-based agents are limited to single-image inputs and fail on real-world multi-image QA tasks, while also suffering from visual attention degradation during deeper reasoning steps.

Method: Uses multi-agent system to generate challenging multi-image QA pairs (MIFG-QA dataset), develops specialized tools for visual reflection and confirmation, and employs action-trajectory two-level mask strategy for stable RL training without SFT data.

Result: IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on the proposed multi-image dataset, with stable tool use behavior via pure RL training.

Conclusion: IMAgent addresses the multi-image QA gap in open-source vision agents through innovative data generation, specialized visual tools, and efficient RL training, providing actionable insights for the research community.

Abstract: Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning dedicated to complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, with our analysis providing actionable insights for the research community. Codes and data will be released soon.

[202] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations

Kiran Shahi, Anup Bagale

Main category: cs.CV

TL;DR: Weakly supervised deep learning framework using Grad-CAM for pneumonia classification and localization from chest X-rays, achieving 96-98% accuracy with image-level labels instead of costly pixel-level annotations.

DetailsMotivation: Pneumonia diagnosis via chest X-rays typically requires expensive pixel-level annotations for accurate localization. Need for cost-effective methods using only image-level labels while maintaining clinical relevance and explainability.

Method: Proposed weakly supervised framework using Gradient-weighted Class Activation Mapping (Grad-CAM) with seven pre-trained models (including Vision Transformer). Used focal loss, patient-wise data splits to prevent leakage, and image-level labels to generate pneumonia localization heatmaps.
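
Grad-CAM itself is standard; the sketch below shows the core computation on a ResNet-18 with an illustrative two-class head, where gradients of the pneumonia logit weight the last convolutional feature map.

```python
# Grad-CAM heatmap from image-level supervision only.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None, num_classes=2)   # index 1 plays the role of "pneumonia" here
model.eval()

feats, grads = {}, {}

def save_feats(module, inputs, output):
    feats["a"] = output                     # (1, C, h, w) activations of the last conv stage

def save_grads(module, grad_input, grad_output):
    grads["g"] = grad_output[0]             # gradients of the target logit w.r.t. those activations

model.layer4.register_forward_hook(save_feats)
model.layer4.register_full_backward_hook(save_grads)

x = torch.randn(1, 3, 224, 224)             # stand-in for a preprocessed chest X-ray
logits = model(x)
logits[0, 1].backward()                      # backprop the pneumonia logit only

weights = grads["g"].mean(dim=(2, 3), keepdim=True)       # channel weights: pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))            # (1, h, w) coarse localization map
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear", align_corners=False)[0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heatmap for overlay
```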

Result: All models achieved high accuracy (96-98%). ResNet-18 and EfficientNet-B0 showed best overall performance, MobileNet-V2 provided efficient lightweight alternative. Grad-CAM visualizations confirmed focus on clinically relevant lung regions.

Conclusion: Weakly supervised explainable AI models can effectively localize pneumonia regions without pixel-level annotations, enhancing transparency and clinical trust in AI-assisted radiological diagnostics.

Abstract: Chest X-ray imaging is commonly used to diagnose pneumonia, but accurately localizing the pneumonia affected regions typically requires detailed pixel-level annotations, which are costly and time consuming to obtain. To address this limitation, this study proposes a weakly supervised deep learning framework for pneumonia classification and localization using Gradient-weighted Class Activation Mapping (Grad-CAM). Instead of relying on costly pixel-level annotations, the proposed method utilizes image-level labels to generate clinically meaningful heatmaps that highlight pneumonia affected regions. Furthermore, we evaluate seven pre-trained deep learning models including a Vision Transformer under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high accuracy (96-98%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V2 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations in this study confirm that the proposed methods focus on clinically relevant lung regions, supporting the use of explainable AI for radiological diagnostics. Overall, this work highlights the potential of weakly supervised, explainable models that enhance transparency and clinical trust in AI-assisted pneumonia screening.

[203] VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel

Suzhong Fu, Rui Sun, Xuan Ding, Jingqi Dong, Yiming Yang, Yao Zhu, Min Chang Jordan Ren, Delin Deng, Angelica Aviles-Rivero, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: VesSAM is a specialized 2D vessel segmentation framework that enhances SAM with convolutional adapters, multi-prompt encoding, and lightweight decoding, achieving superior performance with fewer parameters.

DetailsMotivation: Vessel segmentation is clinically important but challenging due to thin, branching structures and low contrast. Foundation models like SAM underperform on vascular structures, requiring specialized adaptation.

Method: VesSAM integrates: (1) convolutional adapter for local texture features, (2) multi-prompt encoder fusing anatomical prompts (skeletons, bifurcation points, segment midpoints) via hierarchical cross-attention, (3) lightweight mask decoder to reduce artifacts. Includes automated pipeline for multi-prompt annotation generation.

Result: Outperforms PEFT-based SAM variants by >10% Dice and >13% IoU. Achieves competitive performance vs fully fine-tuned methods with significantly fewer parameters. Generalizes well to out-of-distribution settings, outperforming all baselines in average OoD Dice and IoU.

Conclusion: VesSAM provides an efficient, powerful framework for vessel segmentation that effectively adapts foundation models to vascular structures while maintaining parameter efficiency and strong generalization.

Abstract: Accurate vessel segmentation is critical for clinical applications such as disease diagnosis and surgical planning, yet remains challenging due to thin, branching structures and low texture contrast. While foundation models like the Segment Anything Model (SAM) have shown promise in generic segmentation, they perform sub-optimally on vascular structures. In this work, we present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, including skeletons, bifurcation points, and segment midpoints, via hierarchical cross-attention, and (3) a lightweight mask decoder to reduce jagged artifacts. We also introduce an automated pipeline to generate structured multi-prompt annotations, and curate a diverse benchmark dataset spanning 8 datasets across 5 imaging modalities. Experimental results demonstrate that VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU, and achieves competitive performance compared to fully fine-tuned methods, with significantly fewer parameters. VesSAM also generalizes well to out-of-distribution (OoD) settings, outperforming all baselines in average OoD Dice and IoU.

[204] Luminance-Aware Statistical Quantization: Unsupervised Hierarchical Learning for Illumination Enhancement

Derong Kong, Zhixiong Yang, Shengxi Li, Shuaifeng Zhi, Li Liu, Zhen Liu, Jingyuan Xia

Main category: cs.CV

TL;DR: LASQ reformulates low-light image enhancement as statistical sampling over hierarchical luminance distributions using power-law transitions, enabling unsupervised enhancement without normal-light references.

DetailsMotivation: Existing LLIE methods focus on deterministic pixel-level mappings between paired low/normal-light images, neglecting continuous physical luminance transitions and struggling when normal-light references are unavailable. There's a need for methods that understand natural luminance dynamics and work in practical scenarios without reference images.

Method: Introduces Luminance-Aware Statistical Quantification (LASQ) that models luminance transitions as power-law distributions in intensity coordinate space, approximated by stratified power functions. Uses a diffusion forward process to autonomously discover optimal transition paths between luminance layers, replacing deterministic mappings with probabilistic sampling over continuous luminance layers.

Result: LASQ considerably improves performance in practical situations without normal-light references, enabling more adaptable light restoration. When normal-light references are available, it achieves superior performance on domain-specific datasets alongside better generalization ability across non-reference datasets.

Conclusion: LASQ successfully reformulates LLIE as a statistical sampling process over hierarchical luminance distributions, addressing the limitations of deterministic approaches and enabling effective enhancement in both reference and non-reference scenarios through understanding of natural luminance dynamics.

Abstract: Low-light image enhancement (LLIE) faces persistent challenges in balancing reconstruction fidelity with cross-scenario generalization. While existing methods predominantly focus on deterministic pixel-level mappings between paired low/normal-light images, they often neglect the continuous physical process of luminance transitions in real-world environments, leading to a performance drop when normal-light references are unavailable. Inspired by empirical analysis of natural luminance dynamics revealing power-law distributed intensity transitions, this paper introduces Luminance-Aware Statistical Quantification (LASQ), a novel framework that reformulates LLIE as a statistical sampling process over hierarchical luminance distributions. Our LASQ re-conceptualizes luminance transition as a power-law distribution in intensity coordinate space that can be approximated by stratified power functions, thereby replacing deterministic mappings with probabilistic sampling over continuous luminance layers. A diffusion forward process is designed to autonomously discover optimal transition paths between luminance layers, achieving unsupervised distribution emulation without normal-light references. In this way, it considerably improves the performance in practical situations, enabling more adaptable and versatile light restoration. This framework is also readily applicable to cases with normal-light references, where it achieves superior performance on domain-specific datasets alongside better generalization ability across non-reference datasets.
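
A toy NumPy illustration of the stratified power-function idea above: map a normalized low-light intensity channel through several power curves (hierarchical luminance layers) and sample a mixture over them. The gamma values and Dirichlet sampling are invented for illustration and are not LASQ's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_power_enhance(low_light, gammas=(0.8, 0.6, 0.45, 0.3)):
    """Map normalized intensities through stratified power functions
    y = x**gamma (gamma < 1 brightens), then sample a mixture of layers."""
    x = np.clip(low_light, 0.0, 1.0)
    layers = np.stack([x ** g for g in gammas])      # hierarchical luminance layers
    weights = rng.dirichlet(np.ones(len(gammas)))    # probabilistic sampling over layers
    return np.tensordot(weights, layers, axes=1)

img = rng.uniform(0.0, 0.2, size=(64, 64))           # synthetic under-exposed image
enhanced = stratified_power_enhance(img)
print(img.mean(), enhanced.mean())                    # mean brightness increases
```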

[205] OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu, Tigran Minasian, Oleksandr Sotnychenko, Xiao-Xiao Long, Marc Habermann, Christian Theobalt

Main category: cs.CV

TL;DR: OLATverse is a large-scale real-world dataset with 9M images of 765 objects captured under precisely controlled lighting from multiple viewpoints, providing comprehensive resources for inverse rendering and relighting research.

DetailsMotivation: Current object-centric inverse rendering, novel view synthesis, and relighting methods heavily rely on synthetic datasets for training and small-scale real-world datasets for benchmarking, limiting their realism and generalization capabilities.

Method: Created a dataset of 765 real-world objects captured using 35 DSLR cameras and 331 individually controlled light sources, providing well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources.

Result: OLATverse offers large-scale coverage of real objects with high-fidelity appearance under precisely controlled illuminations, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation.

Conclusion: OLATverse represents a pivotal step toward integrating next-generation inverse rendering and relighting methods with real-world data, addressing the gap between synthetic training data and real-world application.

Abstract: We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
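
One standard way OLAT captures are used (a general property of one-light-at-a-time data, not a claim about this paper's pipeline) is that an image under an arbitrary lighting environment can be approximated as a weighted sum of the per-light images, by linearity of light transport. A minimal NumPy sketch with hypothetical environment weights:

```python
import numpy as np

def relight_from_olat(olat_stack, light_weights):
    """olat_stack: (n_lights, H, W, 3) images, one per light source.
    light_weights: (n_lights,) non-negative intensities, e.g. sampled from an
    environment map. The relit image is their weighted sum."""
    return np.tensordot(light_weights, olat_stack, axes=1)

rng = np.random.default_rng(1)
olat = rng.uniform(0, 1, size=(331, 32, 32, 3))   # 331 lights, as in the dataset
env = rng.uniform(0, 0.01, size=331)              # hypothetical environment weights
relit = relight_from_olat(olat, env)
print(relit.shape)  # (32, 32, 3)
```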

[206] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes

Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang

Main category: cs.CV

TL;DR: PtychoBench benchmark shows task-dependent optimal specialization strategies for microscopy AI: SFT+ICL best for visual artifact detection, while ICL alone works best for textual parameter recommendation.

DetailsMotivation: Foundation models (LLMs/VLMs) show potential for automating microscopy workflows, but optimal domain adaptation strategies for specialized scientific tasks are unclear, requiring systematic comparison.

Method: Introduce PtychoBench, a multi-modal, multi-task benchmark for ptychographic analysis. Systematically compare Supervised Fine-Tuning (SFT) and In-Context Learning (ICL) strategies on visual artifact detection (VLMs) and textual parameter recommendation (LLMs) in data-scarce settings.

Result: Optimal specialization is task-dependent: for the visual task, SFT combined with ICL achieves the best performance (Micro-F1 0.728); for the textual task, ICL on a large base model is superior (Micro-F1 0.847 vs SFT’s 0.839). Context-aware prompting works best, and a consistent contextual-interference effect appears in fine-tuned models.

Conclusion: Optimal AI specialization for scientific tasks depends on task modality, providing a framework for developing effective science-based agentic systems. Benchmark shows clear guidance for choosing between SFT and ICL strategies.

Abstract: The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful “super-expert” SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.

[207] SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome

Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi

Main category: cs.CV

TL;DR: Sigmma is a multi-modal contrastive alignment framework that learns hierarchical representations of HE images and spatial transcriptome profiles across multiple scales, improving gene-expression prediction and cross-modal retrieval.

DetailsMotivation: Existing approaches align HE tiles with ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization, which limits the ability to capture detailed tissue microenvironment information.

Method: Proposes Sigmma with multi-scale contrastive alignment to ensure coherent representations across modalities at different scales, and represents cell interactions as a graph integrating inter- and intra-subgraph relationships to capture cell-cell interactions from fine to coarse levels.

Result: Sigmma improves gene-expression prediction by avg. 9.78% and cross-modal retrieval by avg. 26.93% across datasets, and learns meaningful multi-tissue organization in downstream analyses.

Conclusion: Sigmma’s multi-scale hierarchical approach effectively captures cross-modal correspondences and cell-cell interactions in tissue microenvironments, outperforming single-scale alignment methods in computational pathology tasks.

Abstract: Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an improvement of avg. 9.78% in the gene-expression prediction task and avg. 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
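
A minimal sketch of multi-scale contrastive alignment in PyTorch: a symmetric InfoNCE loss between paired image-tile and spatial-transcriptomics embeddings, averaged over scales. The shapes, temperature, and equal scale weighting are assumptions for illustration, not Sigmma's actual objective.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, st_emb, temperature=0.07):
    """Symmetric InfoNCE between matched image / ST embeddings of shape (N, D)."""
    img = F.normalize(img_emb, dim=-1)
    st = F.normalize(st_emb, dim=-1)
    logits = img @ st.t() / temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_scale_alignment(img_scales, st_scales):
    """Average the contrastive loss over corresponding scales."""
    losses = [info_nce(i, s) for i, s in zip(img_scales, st_scales)]
    return torch.stack(losses).mean()

# Toy example: three scales with 16 paired tiles each.
img_scales = [torch.randn(16, 128) for _ in range(3)]
st_scales = [torch.randn(16, 128) for _ in range(3)]
print(multi_scale_alignment(img_scales, st_scales).item())
```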

[208] A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis

Xianchao Guan, Zhiyuan Fan, Yifeng Wang, Fuqiang Chen, Yanjiang Zhou, Zengyang Che, Hongxue Meng, Xin Li, Yaowei Wang, Hongpeng Wang, Min Zhang, Heng Tao Shen, Zheng Zhang, Yongbing Zhang

Main category: cs.CV

TL;DR: CRAFTS is a pathology-specific generative foundation model that creates diverse, biologically accurate tissue images to overcome data scarcity in clinical AI development.

DetailsMotivation: Clinical-grade AI in pathology is limited by scarce, diverse annotated datasets. Existing generative models suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability.

Method: Dual-stage training on ~2.8M image-caption pairs with a novel correlation-regulated alignment mechanism to suppress semantic drift and ensure biological accuracy. Can be coupled with ControlNet for precise tissue architecture control.

Result: Generates diverse pathological images across 30 cancer types with quality validated by objective metrics and pathologist evaluations. CRAFTS-augmented datasets enhance performance in classification, cross-modal retrieval, self-supervised learning, and visual question answering.

Conclusion: CRAFTS overcomes data scarcity and privacy barriers, providing limitless diverse annotated histology data to unlock robust diagnostic tools for rare and complex cancer phenotypes.

Abstract: The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.

[209] INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases

Edward Vendrow, Julia Chae, Rupa Kurinchi-Vendhan, Isaac Eckert, Jazlynn Hall, Marta Jarzyna, Reymond Miyajima, Ruth Oliver, Laura Pollock, Lauren Shrack, Scott Yanco, Oisin Mac Aodha, Sara Beery

Main category: cs.CV

TL;DR: INQUIRE-Search is an open-source system that enables scientists to search biodiversity image databases using natural language, allowing efficient discovery of ecological context from millions of images that was previously inaccessible at scale.

DetailsMotivation: Large biodiversity platforms like iNaturalist contain hundreds of millions of images with rich ecological context (behaviors, interactions, phenology, habitat), but current workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible for large-scale scientific analysis.

Method: INQUIRE-Search is an open-source system that allows scientists to: 1) rapidly search ecological image databases using natural language queries, 2) verify and export relevant observations, and 3) utilize discovered data for novel scientific analysis. It represents a new paradigm for interactive, efficient, and scalable scientific discovery.

Result: The system dramatically reduces search time compared to traditional methods. Five case studies demonstrate diverse scientific applications, including seasonal variation in behavior across species and forest regrowth after wildfires. The tool enables previously impossible scientific questions to be explored efficiently.

Conclusion: INQUIRE-Search unlocks previously inaccessible scientific value in large-scale biodiversity datasets. However, using such AI-enabled discovery tools requires experts to reframe scientific priorities and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.

Abstract: Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize that using such AI-enabled discovery tools for science calls for experts to reframe the priorities of the scientific process and to develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.
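
The core retrieval step in a natural-language image search system of this kind is typically ranking precomputed image embeddings by cosine similarity to an encoded text query. The sketch below assumes the embeddings already exist (it does not reproduce INQUIRE-Search's actual pipeline) and uses random vectors as stand-ins.

```python
import numpy as np

def search(query_emb, image_embs, top_k=5):
    """Rank precomputed image embeddings by cosine similarity to a text-query
    embedding (both assumed to come from the same vision-language encoder)."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

rng = np.random.default_rng(2)
image_embs = rng.normal(size=(10_000, 512))   # stand-in for a large image database
query_emb = rng.normal(size=512)              # stand-in for an encoded query
idx, sims = search(query_emb, image_embs)
print(idx, sims)
```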

[210] TransientTrack: Advanced Multi-Object Tracking and Classification of Cancer Cells with Transient Fluorescent Signals

Florian Bürger, Martim Dias Gomes, Nica Gutu, Adrián E. Granada, Noémie Moreau, Katarzyna Bozek

Main category: cs.CV

TL;DR: TransientTrack is a deep learning framework for cell tracking in multi-channel microscopy videos with transient fluorescent signals, capable of detecting cell division and death events to build complete cell lineages.

DetailsMotivation: Current cell tracking methods are limited to videos with constant signals and cannot detect critical events like cell death, creating a need for a method that handles transient fluorescent signals and captures complete cell dynamics including division and death.

Method: A lightweight deep learning framework that performs matching on cell detection embeddings without tracking-specific feature quantification, integrating Transformer Networks, multi-stage matching using all detection boxes, and Kalman Filter interpolation for missing tracklets.

Result: The framework achieves strong performance across diverse conditions, effectively tracks cells, captures cell division and death events, and demonstrates practical application in analyzing chemotherapeutic drug efficacy at single-cell level.

Conclusion: TransientTrack advances quantitative studies of cancer cell dynamics by enabling detailed characterization of treatment response and resistance mechanisms, with code publicly available for further research.

Abstract: Tracking cells in time-lapse videos is an essential technique for monitoring cell population dynamics at a single-cell level. Current methods for cell tracking are developed on videos with mostly single, constant signals and do not detect pivotal events such as cell death. Here, we present TransientTrack, a deep learning-based framework for cell tracking in multi-channel microscopy video data with transient fluorescent signals that fluctuate over time following processes such as the circadian rhythm of cells. By identifying key cellular events - mitosis (cell division) and apoptosis (cell death) - our method allows us to build complete trajectories, including cell lineage information. TransientTrack is lightweight and performs matching on cell detection embeddings directly, without the need for quantification of tracking-specific cell features. Furthermore, our approach integrates Transformer Networks, multi-stage matching using all detection boxes, and the interpolation of missing tracklets with the Kalman Filter. This unified framework achieves strong performance across diverse conditions, effectively tracking cells and capturing cell division and death. We demonstrate the use of TransientTrack in an analysis of the efficacy of a chemotherapeutic drug at a single-cell level. The proposed framework could further advance quantitative studies of cancer cell dynamics, enabling detailed characterization of treatment response and resistance mechanisms. The code is available at https://github.com/bozeklab/TransientTrack.
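
Frame-to-frame association on detection embeddings can be written as a linear assignment over a cosine-distance cost matrix. This sketch uses SciPy's Hungarian solver and a distance threshold as a generic tracking-by-matching illustration; it is not TransientTrack's multi-stage procedure, and the threshold is arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_embs, curr_embs, max_dist=0.4):
    """Associate detections across frames by cosine distance between embeddings."""
    a = prev_embs / np.linalg.norm(prev_embs, axis=1, keepdims=True)
    b = curr_embs / np.linalg.norm(curr_embs, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)  # Hungarian assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

rng = np.random.default_rng(3)
prev = rng.normal(size=(6, 64))   # 6 cells detected in frame t
curr = rng.normal(size=(7, 64))   # 7 cells in frame t+1 (e.g. after one division)
print(match_detections(prev, curr))
```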

[211] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng

Main category: cs.CV

TL;DR: VideoMem: A novel framework for ultra-long video understanding using adaptive memory management and sequential generation, outperforming existing models.

DetailsMotivation: Existing vision language models struggle with ultra-long videos due to limited context length and poor long-term memory retention. Current RAG-based approaches have high storage and computational overhead.

Method: VideoMem treats long video understanding as sequential generation with adaptive memory management. It dynamically updates a global memory buffer to retain critical information while discarding redundancy. Uses Progressive Grouped Relative Policy Optimization (PRPO) with Progressive State Propagation (PSP) and Temporal Cascading Reward (TCR) modules for efficient training.

Result: Extensive experiments show VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.

Conclusion: VideoMem provides an effective solution for ultra-long video understanding through adaptive memory management and efficient training algorithms, addressing key limitations of current approaches.

Abstract: Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
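
A minimal sketch of the memory-buffer idea above: keep a fixed-size store of frame features, skip near-duplicates, and evict the least relevant entry when full. The capacity, similarity threshold, and relevance scores are invented for illustration; VideoMem's actual update rule is learned.

```python
import numpy as np

class MemoryBuffer:
    """Keep at most `capacity` feature vectors, skipping redundant ones."""
    def __init__(self, capacity=8, sim_threshold=0.9):
        self.capacity, self.sim_threshold = capacity, sim_threshold
        self.feats, self.scores = [], []

    def update(self, feat, relevance):
        f = feat / np.linalg.norm(feat)
        # Discard the new feature if it is nearly identical to a stored one.
        if any(float(f @ g) > self.sim_threshold for g in self.feats):
            return
        self.feats.append(f)
        self.scores.append(relevance)
        if len(self.feats) > self.capacity:       # evict the least relevant entry
            drop = int(np.argmin(self.scores))
            self.feats.pop(drop)
            self.scores.pop(drop)

rng = np.random.default_rng(4)
buf = MemoryBuffer()
for t in range(100):                              # stream of per-frame features
    buf.update(rng.normal(size=256), relevance=rng.uniform())
print(len(buf.feats))                             # bounded by capacity
```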

[212] TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification

Zishuo Wan, Qinqin Kang, Na Li, Yi Huang, Qianru Zhang, Le Lu, Yun Bian, Dawei Ding, Ke Yan

Main category: cs.CV

TL;DR: TARDis is a physics-aware framework for tumor segmentation in CT scans with missing temporal phases, treating missing modalities as points on continuous time-attenuation curves rather than absent channels.

DetailsMotivation: Complete temporal dynamics in contrast-enhanced CT are often missing due to radiation dose limits and inconsistent protocols, creating a prevalent missing modality problem. Existing methods ignore the temporal continuity of hemodynamics.

Method: Proposes Time Attenuated Representation Disentanglement (TARDis) with dual-path architecture: quantization-based path for time-invariant anatomical structures, and probabilistic path using Hemodynamic Conditional VAE to model dynamic enhancement conditioned on scan time.

Result: Significantly outperforms state-of-the-art incomplete modality frameworks on large-scale private abdominal CT dataset (2,282 patients) and two public datasets. Maintains robust performance even in extreme data-sparsity scenarios.

Conclusion: TARDis demonstrates potential for reducing radiation exposure while maintaining diagnostic precision by effectively handling missing temporal phases through physics-aware modeling of hemodynamic continuity.

Abstract: The accurate diagnosis and segmentation of tumors in contrast-enhanced Computed Tomography (CT) are fundamentally driven by the distinctive hemodynamic profiles of contrast agents over time. However, in real-world clinical practice, complete temporal dynamics are often hard to capture due to strict radiation dose limits and inconsistent acquisition protocols across institutions, leading to a prevalent missing modality problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. We first hypothesize that the latent feature can be disentangled into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Hemodynamic Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to infer missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale multi-modal private abdominal CT dataset (2,282 patients) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.
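
A toy PyTorch sketch of the conditional-VAE piece described above: encode a perfusion feature together with its scan time, reparameterize, and decode; a missing phase can then be sampled from the latent prior at a chosen time. The linear layers, dimensions, and single-feature setup are illustrative assumptions, not TARDis's architecture.

```python
import torch
import torch.nn as nn

class HemodynamicCVAE(nn.Module):
    """Toy conditional VAE: model a dynamic (perfusion) feature conditioned on
    scan time, so a missing phase can be sampled from the latent distribution."""
    def __init__(self, feat_dim=64, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(feat_dim + 1, 2 * latent_dim)   # -> mu, logvar
        self.dec = nn.Linear(latent_dim + 1, feat_dim)

    def forward(self, feat, t):
        mu, logvar = self.enc(torch.cat([feat, t], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        return self.dec(torch.cat([z, t], dim=-1)), mu, logvar

    def sample_missing_phase(self, t, n=1):
        z = torch.randn(n, self.dec.in_features - 1)
        return self.dec(torch.cat([z, t.expand(n, 1)], dim=-1))

model = HemodynamicCVAE()
feat, t = torch.randn(4, 64), torch.rand(4, 1)            # observed phase + scan time
recon, mu, logvar = model(feat, t)
missing = model.sample_missing_phase(torch.tensor([[0.7]]), n=3)
print(recon.shape, missing.shape)
```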

[213] SAM3-I: Segment Anything with Instructions

Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, Li Cheng

Main category: cs.CV

TL;DR: SAM3-I enhances SAM3 by enabling direct instruction-following segmentation with rich natural language expressions, preserving original concept capabilities while adding instruction-level reasoning.

DetailsMotivation: SAM3's current limitation of only handling short noun-phrase prompts is insufficient for real-world applications that require complex expressions with attributes, spatial relations, functionalities, actions, states, and implicit reasoning. The reliance on external multi-modal agents for instruction-to-NP conversion leads to coarse representations that often fail to precisely identify specific instances.

Method: SAM3-I introduces: 1) Instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3’s vision-language representations, 2) Structured instruction taxonomy spanning concept, simple, and complex levels, and 3) Scalable data engine to construct diverse instruction-mask pairs dataset.

Result: SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. The framework is open-sourced with practical fine-tuning workflows for domain-specific applications.

Conclusion: SAM3-I successfully unifies concept-level understanding and instruction-level reasoning within the SAM family, enabling direct instruction-following segmentation without sacrificing original capabilities, making it more practical for real-world applications requiring rich natural language expressions.

Abstract: Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3’s existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.

[214] Order Matters: 3D Shape Generation from Sequential VR Sketches

Yizi Chen, Sidi Wu, Tianyi Xiao, Nina Wiedemann, Loic Landrieu

Main category: cs.CV

TL;DR: VRSketch2Shape: First framework for generating 3D shapes from sequential VR sketches using order-aware encoding and diffusion models.

DetailsMotivation: Existing sketch-to-shape models ignore temporal stroke ordering, losing crucial structural cues and design intent information from VR sketching.

Method: Three main contributions: 1) Automated pipeline for generating sequential VR sketches from arbitrary shapes, 2) Multi-category dataset with 20k+ synthetic and 900 hand-drawn sketch-shape pairs, 3) Order-aware sketch encoder with diffusion-based 3D generator.

Result: Higher geometric fidelity than prior work, effective generalization from synthetic to real sketches with minimal supervision, and good performance on partial sketches.

Conclusion: VRSketch2Shape successfully leverages temporal stroke ordering for better 3D shape generation from VR sketches, with open-source release of data and models.

Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at https://chenyizi086.github.io/VRSketch2Shape_website.
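
One simple way to make a sketch encoder order-aware is to add learned order embeddings to the stroke sequence and run a recurrent encoder over it. The sketch below is a generic illustration of that idea (GRU, feature sizes, and embedding scheme are all assumptions), not the encoder used in VRSketch2Shape.

```python
import torch
import torch.nn as nn

class OrderAwareSketchEncoder(nn.Module):
    """Encode a sequence of stroke features while preserving drawing order
    (a GRU plus learned order embeddings; purely illustrative)."""
    def __init__(self, in_dim=6, hidden=128, max_strokes=64):
        super().__init__()
        self.order_emb = nn.Embedding(max_strokes, in_dim)
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, strokes):
        # strokes: (B, T, in_dim), ordered as drawn; add an order embedding per step.
        order = torch.arange(strokes.size(1), device=strokes.device)
        x = strokes + self.order_emb(order)
        _, h = self.rnn(x)
        return h[-1]            # (B, hidden) sketch embedding

enc = OrderAwareSketchEncoder()
print(enc(torch.randn(2, 40, 6)).shape)   # torch.Size([2, 128])
```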

[215] HQ-DM: Single Hadamard Transformation-Based Quantization-Aware Training for Low-Bit Diffusion Models

Shizhuo Mao, Hongtao Zou, Qihu Xie, Song Chen, Yi Kang

Main category: cs.CV

TL;DR: HQ-DM is a quantization-aware training framework that uses Single Hadamard Transformation to reduce activation outliers in diffusion models, enabling efficient low-bit quantization with minimal performance degradation.

DetailsMotivation: Diffusion models have high computational and memory costs that hinder deployment. While quantization can help reduce these costs, existing methods struggle with activation outliers during inference, causing significant performance degradation in low-bit scenarios.

Method: Proposes HQ-DM, a quantization-aware training framework that applies Single Hadamard Transformation to activation matrices. This approach reduces activation outliers while preserving model performance under quantization, supporting INT convolution operations and preventing weight outlier amplification.

Result: On ImageNet 256x256 using LDM-4 model, W4A4 quantization improves Inception Score by 12.8% and W4A3 quantization improves by 467.73% over state-of-the-art methods.

Conclusion: HQ-DM effectively addresses the outlier problem in diffusion model quantization, enabling efficient low-bit deployment with substantial performance improvements over existing quantization methods.

Abstract: Diffusion models have demonstrated significant applications in the field of image generation. However, their high computational and memory costs pose challenges for deployment. Model quantization has emerged as a promising solution to reduce storage overhead and accelerate inference. Nevertheless, existing quantization methods for diffusion models struggle to mitigate outliers in activation matrices during inference, leading to substantial performance degradation under low-bit quantization scenarios. To address this, we propose HQ-DM, a novel Quantization-Aware Training framework that applies Single Hadamard Transformation to activation matrices. This approach effectively reduces activation outliers while preserving model performance under quantization. Compared to traditional Double Hadamard Transformation, our proposed scheme offers distinct advantages by seamlessly supporting INT convolution operations while preventing the amplification of weight outliers. For conditional generation on the ImageNet 256x256 dataset using the LDM-4 model, our W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.
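
The general idea of rotating activations with an orthonormal Hadamard matrix before quantization (to spread channel outliers across all channels) can be illustrated in NumPy. The per-tensor quantizer and bit widths below are simplifications for illustration and are not HQ-DM's training procedure.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize(x, bits=4):
    """Symmetric uniform quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def hadamard_quantize(acts, bits=4):
    """Rotate activations by an orthonormal Hadamard matrix, quantize, rotate back."""
    n = acts.shape[-1]                       # must be a power of two
    H = hadamard(n) / np.sqrt(n)             # orthonormal: H @ H.T == I
    rotated = acts @ H
    return quantize(rotated, bits) @ H.T

rng = np.random.default_rng(5)
acts = rng.normal(size=(128, 256))
acts[:, 0] *= 50.0                           # inject a channel-wise outlier
err_plain = np.mean((quantize(acts) - acts) ** 2)
err_hadamard = np.mean((hadamard_quantize(acts) - acts) ** 2)
print(err_plain, err_hadamard)               # the rotation typically reduces the error
```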

[216] Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting

Shilong Jin, Haoran Duan, Litao Hua, Wentao Huang, Yuan Zhou

Main category: cs.CV

TL;DR: TD-Attn is a novel framework that addresses multi-view inconsistency in 3D tasks by correcting prior view bias in T2I diffusion models through 3D-aware attention guidance and hierarchical attention modulation.

DetailsMotivation: T2I diffusion models used for 3D tasks suffer from prior view bias, causing conflicting appearances between different views of an object due to subject-words activating prior view features regardless of target view conditions.

Method: Two key components: 1) 3D-Aware Attention Guidance Module constructs view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency; 2) Hierarchical Attention Modulation Module uses Semantic Guidance Tree and Semantic Response Profiler to localize and modulate CA layers responsive to view conditions.

Result: Extensive experiments show TD-Attn significantly enhances multi-view consistency across 3D tasks and can serve as a universal plugin for various 3D applications.

Conclusion: TD-Attn effectively addresses the prior view bias limitation in T2I models, enabling more consistent and controllable 3D generation and editing through semantic-specific interventions.

Abstract: Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.

[217] UnCageNet: Tracking and Pose Estimation of Caged Animal

Sayak Dutta, Harish Katti, Shashikant Verma, Shanmuganathan Raman

Main category: cs.CV

TL;DR: A three-stage preprocessing pipeline (cage segmentation → inpainting → pose estimation) that removes cage occlusions to restore animal tracking performance in environments with systematic occlusions.

DetailsMotivation: Existing animal tracking and pose estimation systems (like STEP and ViTPose) suffer significant performance degradation when processing images/videos with cage structures and systematic occlusions, which obscure animal features and disrupt tracking.

Method: Three-stage pipeline: 1) Cage segmentation using Gabor-enhanced ResNet-UNet with 72 directional kernels for orientation-aware feature extraction; 2) Cage inpainting using CRFill for content-aware reconstruction of occluded regions; 3) Pose estimation and tracking on the uncaged frames using existing methods.

Result: The pipeline successfully removes cage occlusions, enabling pose estimation and tracking performance comparable to environments without occlusions. Significant improvements in keypoint detection accuracy and trajectory consistency were observed.

Conclusion: Preprocessing to remove systematic cage occlusions is crucial for maintaining robust animal tracking and pose estimation performance in constrained environments, with the proposed Gabor-enhanced segmentation and inpainting approach effectively addressing this limitation.

Abstract: Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.
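
A bank of 72 directional Gabor kernels (as mentioned above) can be generated by sweeping the orientation parameter over 180 degrees. The kernel size, scale, and wavelength values in this OpenCV sketch are arbitrary placeholders; the paper's filter parameters are not given here.

```python
import numpy as np
import cv2

def gabor_bank(n_orientations=72, ksize=21, sigma=4.0, lambd=10.0, gamma=0.5):
    """Build a bank of Gabor kernels with evenly spaced orientations over [0, pi)."""
    thetas = np.linspace(0, np.pi, n_orientations, endpoint=False)
    return [cv2.getGaborKernel((ksize, ksize), sigma, t, lambd, gamma, 0)
            for t in thetas]

def gabor_response(gray_img, bank):
    """Max response over orientations; elongated cage bars respond strongly."""
    responses = [cv2.filter2D(gray_img, cv2.CV_32F, k) for k in bank]
    return np.max(np.stack(responses), axis=0)

img = np.random.rand(128, 128).astype(np.float32)   # stand-in for a grayscale frame
bank = gabor_bank()
print(gabor_response(img, bank).shape)               # (128, 128)
```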

[218] OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, Lei Xie

Main category: cs.CV

TL;DR: OpenVE-3M is a large-scale, high-quality dataset for instruction-based video editing, with OpenVE-Bench as a unified benchmark and OpenVE-Edit as a state-of-the-art 5B model trained on it.

DetailsMotivation: There's a scarcity of large-scale, high-quality datasets for instruction-based video editing compared to image editing, creating a gap in the field that needs to be addressed.

Method: Created OpenVE-3M dataset with two categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit, Creative Edit). Used meticulously designed data pipeline with rigorous quality filtering. Built OpenVE-Bench benchmark with 431 video-edit pairs and three human-aligned metrics. Trained OpenVE-Edit, a 5B model on the dataset.

Result: OpenVE-3M surpasses existing open-source datasets in scale, diversity, instruction length, and quality. OpenVE-Edit sets new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline.

Conclusion: The work addresses the dataset scarcity in instruction-based video editing by providing a comprehensive dataset, benchmark, and effective model that advances the field significantly.

Abstract: The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://lewandofskee.github.io/projects/OpenVE.

[219] Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling

Maik Dannecker, Steven Jia, Nil Stolt-Ansó, Nadine Girard, Guillaume Auzias, François Rousseau, Daniel Rueckert

Main category: cs.CV

TL;DR: Proposes Gaussian-based explicit representations instead of neural implicit representations for 3D medical image reconstruction, achieving 5-10× speed-up while matching state-of-the-art quality.

DetailsMotivation: High-resolution 3D reconstruction from motion-corrupted 2D acquisitions is crucial for fetal MRI diagnosis. Current implicit neural representations (INRs) suffer from computational bottlenecks due to expensive stochastic Monte Carlo sampling needed to model acquisition physics.

Method: Shift from neural network based implicit representations to Gaussian based explicit representations. Parameterize HR 3D image volume as a field of anisotropic Gaussian primitives, leveraging the property that Gaussians are closed under convolution to derive a closed-form analytical solution for the forward model (Σ_obs = Σ_HR + Σ_PSF).

Result: Matches reconstruction quality of state-of-the-art self-supervised SVR frameworks while delivering 5-10× speed-up on neonatal and fetal data. Convergence often reached in under 30 seconds.

Conclusion: The Gaussian-based approach bypasses compute-intensive stochastic sampling while ensuring exact gradient propagation, paving the way for clinical translation of real-time fetal 3D MRI.

Abstract: Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the property of Gaussians being closed under convolution and thus derive a closed-form analytical solution for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition (Σ_obs = Σ_HR + Σ_PSF), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5×–10× speed-up on neonatal and fetal data. With convergence often reached in under 30 seconds, our framework paves the way towards translation into clinical routine of real-time fetal 3D MRI. Code will be public at https://github.com/m-dannecker/Gaussian-Primitives-for-Fast-SVR.
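
The closed-form forward model rests on the fact that convolving two Gaussians adds their covariances. A small NumPy check of that identity for an anisotropic 3D primitive and an illustrative slice-profile PSF covariance (the numerical values are invented):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Evaluate a multivariate Gaussian density at points x of shape (N, 3)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('ni,ij,nj->n', d, inv, d)) / norm

# Anisotropic HR primitive and a slice-profile PSF (illustrative values).
cov_hr = np.diag([0.5, 0.5, 0.1])
cov_psf = np.diag([0.2, 0.2, 2.0])             # thick-slice blur along z
cov_obs = cov_hr + cov_psf                      # closed-form observed covariance

x = np.random.default_rng(6).normal(size=(4, 3))
print(gaussian_pdf(x, np.zeros(3), cov_obs))    # no Monte Carlo sampling needed
```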

[220] Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi, Sagnik Majumder, Kristen Grauman

Main category: cs.CV

TL;DR: Audio-visual framework improves camera pose estimation by using passive scene sounds as complementary cues to visual information, especially effective in visually degraded conditions.

DetailsMotivation: Visual methods for camera motion estimation struggle under visually degraded conditions like motion blur or occlusions. Passive scene sounds provide complementary cues that could enhance pose estimation robustness.

Method: Integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model, creating a simple but effective audio-visual framework.

Result: Consistent gains over strong visual baselines on two large datasets, plus demonstrated robustness when visual information is corrupted. First successful work leveraging audio for relative camera pose estimation in real-world videos.

Conclusion: Incidental, everyday audio serves as an unexpected but promising signal for classic spatial challenges in camera pose estimation, establishing audio as a valuable complementary modality to vision.

Abstract: Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.
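
Direction-of-arrival cues are commonly derived from inter-microphone time delays; one standard estimator is GCC-PHAT. The NumPy sketch below is a generic GCC-PHAT TDOA estimate on a synthetic two-channel signal, not the DOA-spectrum computation used in this paper.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs=16000, max_tau=0.001):
    """Estimate the time difference of arrival (seconds) between two microphone
    signals with GCC-PHAT; a DOA angle follows from the array geometry."""
    n = sig_a.size + sig_b.size
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    R = A * np.conj(B)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)   # phase-transform weighting
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

rng = np.random.default_rng(7)
sig = rng.normal(size=16000)
delayed = np.roll(sig, 8)                     # simulate an 8-sample arrival delay
print(gcc_phat(delayed, sig) * 16000)         # approx. 8 samples
```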

[221] Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

Hao Chen, Yiwei Wang, Songze Li

Main category: cs.CV

TL;DR: Bi-Erasing: A bidirectional image-guided concept erasure framework that simultaneously suppresses harmful concepts and enhances safe alternatives in diffusion models, achieving better balance between removal efficacy and generation quality.

DetailsMotivation: Existing concept erasure methods use unidirectional strategies (either suppressing target concepts or reinforcing safe alternatives), making it difficult to achieve balanced trade-off between concept removal and generation quality in text-to-image models.

Method: Proposes Bidirectional Image-Guided Concept Erasure (Bi-Erasing) with two decoupled image branches: negative branch suppresses harmful semantics, positive branch provides visual guidance for safe alternatives. Uses joint text-image representations and mask-based filtering to prevent interference from irrelevant content.

Result: Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity across extensive experimental evaluations.

Conclusion: The bidirectional approach achieves better balance between erasure efficacy and generation usability compared to unidirectional methods, addressing limitations of existing concept removal techniques in diffusion models.

Abstract: Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models. However, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experimental evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.

[222] MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen, Gang Wang, Runwei Guan, Weiping Ding

Main category: cs.CV

TL;DR: MMDrive is a multimodal vision-language model that extends 2D image understanding to 3D scene understanding for autonomous driving by fusing occupancy maps, LiDAR point clouds, and textual descriptions with adaptive cross-modal fusion mechanisms.

DetailsMotivation: Existing vision-language models are limited to 2D image understanding, which restricts their ability to perceive 3D spatial information and perform deep semantic fusion in complex autonomous driving environments, leading to suboptimal performance.

Method: MMDrive introduces a multimodal framework with three complementary modalities (occupancy maps, LiDAR point clouds, textual descriptions) and two novel components: Text-oriented Multimodal Modulator for adaptive cross-modal fusion based on semantic cues, and Cross-Modal Abstractor using learnable abstract tokens to generate compact cross-modal summaries highlighting key regions and semantics.

Result: MMDrive achieves significant performance gains over existing vision-language models: BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM benchmark, and 62.7% accuracy on NuScenes-QA benchmark.

Conclusion: MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.

Abstract: Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in the 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
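
A minimal PyTorch sketch of the question-conditioned modality weighting described above: a softmax gate over occupancy, LiDAR, and text features driven by the question embedding. The dimensions and single linear gate are hypothetical simplifications; the actual Text-oriented Multimodal Modulator is more elaborate.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Weight occupancy / LiDAR / caption features by the question embedding."""
    def __init__(self, dim=256, n_modalities=3):
        super().__init__()
        self.gate = nn.Linear(dim, n_modalities)

    def forward(self, question_emb, modality_feats):
        # modality_feats: (B, n_modalities, dim); question_emb: (B, dim)
        weights = torch.softmax(self.gate(question_emb), dim=-1)     # (B, n_modalities)
        return (weights.unsqueeze(-1) * modality_feats).sum(dim=1)   # fused (B, dim)

gate = ModalityGate()
q = torch.randn(2, 256)            # question embeddings
feats = torch.randn(2, 3, 256)     # occupancy, LiDAR, and text features
print(gate(q, feats).shape)        # torch.Size([2, 256])
```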

[223] POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

Zhuo Chen, Chengqun Yang, Zhuo Su, Zheng Lv, Jingnan Gao, Xiaoyuan Zhang, Xiaokang Yang, Yichao Yan

Main category: cs.CV

TL;DR: POLAR introduces a large-scale OLAT dataset and POLARNet model for physically-grounded face relighting, enabling scalable illumination synthesis from single portraits.

DetailsMotivation: Face relighting progress is limited by lack of large-scale, physically consistent illumination data. Existing methods rely on statistical or contextual cues rather than physically interpretable illumination transformations.

Method: Created POLAR dataset with 200+ subjects under 156 lighting directions, multiple views, and expressions. Developed POLARNet, a flow-based generative model that predicts per-light OLAT responses from single portraits, modeling illumination as continuous physical transformations.

Result: Established a unified illumination learning framework linking real data, generative synthesis, and physically grounded relighting, creating a self-sustaining cycle for scalable portrait illumination.

Conclusion: POLAR and POLARNet provide a scalable, reproducible approach to face relighting with physically interpretable illumination modeling, overcoming data limitations through a unified framework.

Abstract: Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining “chicken-and-egg” cycle for scalable and reproducible portrait illumination. Our project page: https://rex0191.github.io/POLAR/.
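
Once per-light OLAT responses exist (captured for POLAR or predicted by POLARNet), relighting under a new environment reduces to the standard OLAT compositing step: a weighted sum of the per-light images. The snippet below shows that step in NumPy; the 156-light layout and the weight normalization are simplifying assumptions, and real pipelines work in linear HDR rather than clipped [0, 1] values.

```python
import numpy as np

def relight_from_olat(olat_stack: np.ndarray, light_weights: np.ndarray) -> np.ndarray:
    """Compose a relit portrait as a weighted sum of one-light-at-a-time images.

    olat_stack:    (L, H, W, 3) per-light responses, e.g. L = 156 directions
    light_weights: (L,) intensity of each light sampled from the target environment
    """
    assert olat_stack.shape[0] == light_weights.shape[0]
    relit = np.tensordot(light_weights, olat_stack, axes=(0, 0))  # (H, W, 3)
    return np.clip(relit, 0.0, 1.0)

# Toy example: 156 random OLAT frames relit under a random environment sampling.
olat = np.random.rand(156, 8, 8, 3).astype(np.float32)
weights = np.random.rand(156).astype(np.float32)
weights /= weights.sum()  # keep overall exposure roughly constant (a simplification)
print(relight_from_olat(olat, weights).shape)  # (8, 8, 3)
```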

[224] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin

Main category: cs.CV

TL;DR: Video Reality Test is an ASMR-sourced benchmark for evaluating perceptual realism of AI-generated videos with audio, showing current models can fool VLMs but not human experts.

DetailsMotivation: As AI-generated videos become increasingly realistic, there's a need to assess whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive both humans and vision-language models (VLMs). Existing benchmarks mostly evaluate video without audio and focus only on classification.

Method: The authors introduce Video Reality Test, an ASMR-sourced video benchmark suite featuring: (1) Immersive ASMR video-audio sources with fine-grained action-object interactions, and (2) A peer-review evaluation protocol where video generation models act as creators trying to fool reviewers, while VLMs serve as reviewers trying to detect fakeness.

Result: The best creator model (Veo3.1-Fast) fools most VLMs, with the strongest reviewer (Gemini 2.5-Pro) achieving only 56% accuracy (random is 50%), far below human experts (81.25%). Adding audio improves real-fake discrimination, but superficial cues like watermarks can still significantly mislead models.

Conclusion: The findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency, highlighting that while AI-generated videos can deceive VLMs, human experts remain significantly better at detection.

Abstract: Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: **(i) Immersive ASMR video-audio sources.** Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. **(ii) Peer-review evaluation.** An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.

[225] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Bo Liu, Qiao Qin, Qinghui He

Main category: cs.CV

TL;DR: CausalCLIP is a framework that disentangles causal from non-causal features using causal inference principles to improve generalization in generated image detection across diverse generative models.

DetailsMotivation: Existing generated image detectors produce entangled representations mixing task-relevant forensic cues with spurious patterns, limiting generalization across diverse and evolving generation techniques.

Method: Models generation process with structural causal model, uses Gumbel-Softmax-based feature masking and HSIC constraints to enforce statistical independence, isolating stable causal features robust to distribution shifts.

Result: Achieves 6.83% improvement in accuracy and 4.06% improvement in average precision over state-of-the-art methods when tested on unseen generative models from different series.

Conclusion: CausalCLIP effectively disentangles causal features from non-causal ones, enabling strong generalization in generated image detection through targeted filtering guided by causal inference principles.

Abstract: The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
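
The two ingredients named in the method, Gumbel-Softmax feature masking and an HSIC independence penalty, can be sketched in a few lines. The linear-kernel HSIC, the per-dimension keep/drop mask, and all shapes below are simplifications assumed for illustration; CausalCLIP's actual losses and architecture are more involved.

```python
import torch
import torch.nn.functional as F

def hsic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Biased HSIC estimate with linear kernels; measures statistical dependence."""
    n = x.shape[0]
    K, L = x @ x.T, y @ y.T
    H = torch.eye(n) - torch.ones(n, n) / n            # centering matrix
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# Per-dimension binary mask sampled with Gumbel-Softmax so it stays differentiable.
features = torch.randn(64, 128, requires_grad=True)     # e.g. CLIP image features
mask_logits = torch.randn(128, 2, requires_grad=True)   # keep/drop logits per dimension
mask = F.gumbel_softmax(mask_logits, tau=0.5, hard=True)[:, 0]   # (128,) in {0, 1}

causal = features * mask               # retained (putatively causal) dimensions
non_causal = features * (1 - mask)

penalty = hsic(causal, non_causal)     # push the two parts toward independence
penalty.backward()
print(float(penalty))
```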

[226] Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

Anran Qi, Changjian Li, Adrien Bousseau, Niloy J. Mitra

Main category: cs.CV

TL;DR: A training-free image-to-video generation method that separates motion specification from appearance synthesis using a user-editable Proxy Dynamic Graph for controllable articulation and user-specified content in disoccluded regions.

DetailsMotivation: Current image-to-video pipelines struggle with predictable, articulated motions and enforcing user-specified content in newly revealed (disoccluded) areas of the final frame.

Method: Introduces a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically drives part motion, while using a frozen diffusion prior as a motion-guided shader. Users annotate/repose PDG, compute dense motion flow, edit appearance in disoccluded areas, and perform latent-space composite using visibility information from PDG.

Result: Demonstrates clear advantages over state-of-the-art alternatives for turning images into short videos of articulated objects, furniture, vehicles, and deformables, enabling controllable articulation and user control over disocclusions without fine-tuning.

Conclusion: The method combines generative control (loose pose and structure) with predictable controls (appearance specification in disoccluded regions), unlocking a new image-to-video workflow that separates motion specification from appearance synthesis.

Abstract: We address image-to-video generation with explicit user control over the final frame’s disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives in turning images into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the disoccluded regions of the final frame, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyond-visible.github.io/

[227] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo

Main category: cs.CV

TL;DR: Seedance 1.5 pro is a foundational model for joint audio-video generation using dual-branch Diffusion Transformer with cross-modal integration, achieving superior synchronization and quality through specialized data pipeline and post-training optimizations.

DetailsMotivation: To advance unified audio-visual generation by creating a professional-grade foundational model that can generate synchronized audio and video content with practical utility for content creation.

Method: Dual-branch Diffusion Transformer architecture with cross-modal joint module, multi-stage data pipeline, Supervised Fine-Tuning (SFT) on high-quality datasets, Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models, and an acceleration framework for inference.

Result: Achieves exceptional audio-visual synchronization, superior generation quality, precise multilingual/dialect lip-syncing, dynamic cinematic camera control, enhanced narrative coherence, and over 10X inference speed boost through acceleration framework.

Conclusion: Seedance 1.5 pro positions itself as a robust engine for professional-grade content creation with practical deployment on Volcano Engine platform.

Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

[228] MMhops-R1: Multimodal Multi-hop Reasoning

Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li, Chunfeng Yuan, Guangting Wang, Fengyun Rao, Ying Shan, Weiming Hu

Main category: cs.CV

TL;DR: MMhops is a new benchmark for evaluating multi-modal multi-hop reasoning, with MMhops-R1 as a reinforcement learning-based mRAG framework that outperforms baselines and generalizes well.

DetailsMotivation: Existing MLLMs are limited to single-step reasoning, and current benchmarks lack the complexity needed to evaluate and drive multi-modal multi-hop reasoning abilities required for complex real-world challenges.

Method: Proposes MMhops-R1, a multi-modal Retrieval-Augmented Generation framework using reinforcement learning to optimize autonomous planning of reasoning paths, targeted query formulation, and multi-level information synthesis.

Result: MMhops-R1 significantly outperforms strong baselines on the MMhops benchmark, demonstrating that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. It also shows strong generalization to fixed-hop reasoning tasks.

Conclusion: The work contributes a challenging new benchmark (MMhops) and a powerful baseline model (MMhops-R1), with plans to release code, data, and weights to catalyze future research in multi-modal multi-hop reasoning.

Abstract: The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.

[229] MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, Xiang Bai

Main category: cs.CV

TL;DR: MindDrive is a Vision-Language-Action framework that uses online reinforcement learning for autonomous driving, addressing exploration inefficiency in continuous action spaces by separating decision-making and action execution through dual LoRA-parameterized LLMs.

DetailsMotivation: Current VLA paradigms in autonomous driving rely on Imitation Learning, which suffers from distribution shift and causal confusion. Online RL could address these issues but is hindered by inefficient exploration in continuous action spaces.

Method: MindDrive uses one LLM with dual LoRA parameter sets: a Decision Expert for scenario reasoning and driving decisions, and an Action Expert that maps linguistic decisions to trajectories. It enables trial-and-error learning over discrete linguistic decisions rather than continuous actions.

Result: Using Qwen-0.5B LLM, MindDrive achieves Driving Score of 78.04 and Success Rate of 55.09% on the challenging Bench2Drive benchmark, demonstrating the first effective online RL for VLA models in autonomous driving.

Conclusion: MindDrive successfully balances optimal decision-making, human-like driving behavior, and efficient exploration in online RL by operating in linguistic decision space rather than continuous action space, overcoming key limitations of current VLA approaches.

Abstract: Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. With one set of LoRA parameters, the LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while with the other it acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. Using the lightweight Qwen-0.5B LLM, MindDrive achieves a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09% on the challenging Bench2Drive benchmark. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
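
The "one LLM, two LoRA expert roles" idea can be illustrated with a single frozen layer carrying two switchable low-rank adapters. The sketch below is plain PyTorch with invented names (`DualLoRALinear`, `set_expert`); it is meant to show the parameter-sharing pattern, not MindDrive's actual implementation or training loop.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """A frozen base linear layer with two switchable LoRA adapters (illustrative sketch)."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)                        # shared backbone stays frozen
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, rank, bias=False),
                                nn.Linear(rank, dim, bias=False))
            for name in ("decision", "action")
        })
        for name in ("decision", "action"):
            nn.init.zeros_(self.adapters[name][1].weight)  # LoRA-style: start as a no-op
        self.active = "decision"

    def set_expert(self, name: str) -> None:
        self.active = name

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.adapters[self.active](x)

layer = DualLoRALinear(dim=64)
x = torch.randn(1, 64)
layer.set_expert("decision")   # reason about the scene and emit a linguistic decision
y_decision = layer(x)
layer.set_expert("action")     # map the chosen decision to a feasible trajectory
y_action = layer(x)
print(y_decision.shape, y_action.shape)
```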

cs.AI

[230] Leveraging LLMs for Structured Data Extraction from Unstructured Patient Records

Mitchell A. Klusty, Elizabeth C. Solie, Caroline N. Leach, W. Vaiden Logan, Lynnet E. Richey, John C. Gensel, David P. Szczykutowicz, Bryan C. McLellan, Emily B. Collier, Samuel E. Armstrong, V. K. Cody Bumgardner

Main category: cs.AI

TL;DR: A secure framework using locally deployed LLMs for automated extraction of structured features from clinical notes, reducing manual chart review burden in clinical research.

DetailsMotivation: Manual chart review is extremely time-consuming and resource-intensive for clinical research, requiring experts to extract complex information from unstructured EHR narratives.

Method: A secure, modular framework leveraging locally deployed LLMs on HIPAA-compliant infrastructure, integrating retrieval augmented generation (RAG) and structured response methods into a scalable container.

Result: The framework achieved high accuracy across multiple medical characteristics in patient notes compared to expert-annotated datasets, and identified annotation errors missed in manual review.

Conclusion: LLM systems can reduce manual chart review burden through automated extraction, increase data capture consistency, and accelerate clinical research.

Abstract: Manual chart review remains an extremely time-consuming and resource-intensive component of clinical research, requiring experts to extract often complex information from unstructured electronic health record (EHR) narratives. We present a secure, modular framework for automated structured feature extraction from clinical notes leveraging locally deployed large language models (LLMs) on institutionally approved, Health Insurance Portability and Accountability Act (HIPAA)-compliant compute infrastructure. This system integrates retrieval augmented generation (RAG) and structured response methods of LLMs into a widely deployable and scalable container to provide feature extraction for diverse clinical domains. In evaluation, the framework achieved high accuracy across multiple medical characteristics present in large bodies of patient notes when compared against an expert-annotated dataset and identified several annotation errors missed in manual review. This framework demonstrates the potential of LLM systems to reduce the burden of manual chart review through automated extraction and increase consistency in data capture, accelerating clinical research.
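
The extraction pattern, retrieve the relevant note passages and then ask a locally hosted model for one schema-constrained field at a time, can be sketched as below. The `llm` callable, the lexical retriever, and the `smoking_status` field are placeholders invented for illustration; the framework's actual retrieval, prompting, and validation layers are not shown.

```python
import json

def retrieve(note_chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Toy lexical retrieval: rank note chunks by word overlap with the query."""
    qwords = set(query.lower().split())
    ranked = sorted(note_chunks, key=lambda c: len(qwords & set(c.lower().split())), reverse=True)
    return ranked[:k]

def extract_feature(llm, note_chunks: list[str], feature: str, options: list[str]) -> dict:
    """Ask a locally deployed LLM for one structured field, constrained to a JSON shape."""
    context = "\n".join(retrieve(note_chunks, feature.replace("_", " ")))
    prompt = (
        "You are extracting structured data from clinical notes.\n"
        f"Context:\n{context}\n\n"
        f'Return JSON like {{"{feature}": <one of {options}>, "evidence": "<quote>"}}.'
    )
    return json.loads(llm(prompt))

# `llm` stands in for whatever locally hosted model the institution runs.
fake_llm = lambda prompt: json.dumps({"smoking_status": "former", "evidence": "quit smoking in 2015"})
chunks = ["Patient quit smoking in 2015.", "No history of diabetes.", "Ambulates independently."]
print(extract_feature(fake_llm, chunks, "smoking_status", ["current", "former", "never"]))
```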

[231] Blind Radio Mapping via Spatially Regularized Bayesian Trajectory Inference

Zheng Xing, Junting Chen

Main category: cs.AI

TL;DR: Blind radio map construction using MIMO-OFDM channel measurements without location labels, achieving 0.68m localization error and 3.3% beam map reconstruction error.

DetailsMotivation: Conventional radio map construction requires extensive location-labeled data, which is costly and impractical. Need a method to infer radio maps without explicit location labels.

Method: Proves CSI exhibits spatial continuity under NLOS, derives CSI-distance metric proportional to physical distance. Develops spatially regularized Bayesian inference framework to jointly estimate channel features, distinguish LOS/NLOS conditions, and recover user trajectories.

Result: Theoretical analysis shows CRLB of localization error vanishes asymptotically. Experimental results on ray-tracing dataset show 0.68m average localization error and 3.3% beam map reconstruction error.

Conclusion: Proposed blind radio map construction framework effectively infers user trajectories and reconstructs radio maps without location labels, validated by theoretical analysis and experimental results.

Abstract: Radio maps enable intelligent wireless applications by capturing the spatial distribution of channel characteristics. However, conventional construction methods demand extensive location-labeled data, which are costly and impractical in many real-world scenarios. This paper presents a blind radio map construction framework that infers user trajectories from indoor multiple-input multiple-output (MIMO)-Orthogonal Frequency-Division Multiplexing (OFDM) channel measurements without relying on location labels. It first proves that channel state information (CSI) under non-line-of-sight (NLOS) exhibits spatial continuity under a quasi-specular environmental model, allowing the derivation of a CSI-distance metric that is proportional to the corresponding physical distance. For rectilinear trajectories in Poisson-distributed access point (AP) deployments, it is shown that the Cramer-Rao Lower Bound (CRLB) of localization error vanishes asymptotically, even under poor angular resolution. Building on these theoretical results, a spatially regularized Bayesian inference framework is developed that jointly estimates channel features, distinguishes line-of-sight (LOS)/NLOS conditions and recovers user trajectories. Experiments on a ray-tracing dataset demonstrate an average localization error of 0.68 m and a beam map reconstruction error of 3.3%, validating the effectiveness of the proposed blind mapping method.
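
The core geometric idea, that a CSI-derived distance behaves like physical distance and therefore constrains the trajectory, can be illustrated with a toy reconstruction. The paper's actual method is a spatially regularized Bayesian inference with LOS/NLOS handling; the classical multidimensional scaling below is only a stand-in that shows how pairwise distances alone pin down a trajectory up to a rigid transform.

```python
import numpy as np

def csi_distance_matrix(csi_features: np.ndarray) -> np.ndarray:
    """Pairwise distances between CSI feature vectors (stand-in for the paper's metric)."""
    diff = csi_features[:, None, :] - csi_features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def classical_mds(D: np.ndarray, dim: int = 2) -> np.ndarray:
    """Recover coordinates (up to rotation/reflection/translation) from a distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Toy trajectory: CSI features embed the true 2-D positions isometrically plus noise.
rng = np.random.default_rng(0)
true_xy = np.cumsum(rng.normal(scale=0.5, size=(50, 2)), axis=0)   # a random walk
csi = np.hstack([true_xy, np.zeros((50, 14))]) + 0.01 * rng.normal(size=(50, 16))
est_xy = classical_mds(csi_distance_matrix(csi))
print(est_xy.shape)   # (50, 2): the walk's shape, up to a rigid transform
```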

[232] LoopBench: Discovering Emergent Symmetry Breaking Strategies with LLM Swarms

Ali Parsaee, Yashar Talebirad, Csongor Szepesvári, Vishwajeet Ohal, Eden Redman

Main category: cs.AI

TL;DR: LoopBench is a benchmark for evaluating LLM reasoning in distributed symmetry breaking problems, specifically graph coloring of odd cycles where deterministic agents fail in infinite loops.

DetailsMotivation: LLMs are increasingly used as autonomous agents, but their ability to coordinate in distributed systems remains poorly understood. There's a need to evaluate how LLMs can handle distributed symmetry breaking problems where classical deterministic approaches fail.

Method: The benchmark focuses on coloring odd cycle graphs (C3, C5, C11) with limited colors, where deterministic, non-communicating agents fail in infinite loops. A strategy passing mechanism is implemented as a form of consistent memory to enable coordination.

Result: Standard LLMs and classical heuristics struggle with these problems, but advanced reasoning models (e.g., O3) can devise strategies to escape deadlocks and solve the symmetry breaking challenges.

Conclusion: LoopBench provides a testbed for studying emergent distributed algorithms based on language-based reasoning, offering insights into collective intelligence and LLM coordination capabilities in distributed systems.

Abstract: Large Language Models (LLMs) are increasingly being utilized as autonomous agents, yet their ability to coordinate in distributed systems remains poorly understood. We introduce **LoopBench**, a benchmark to evaluate LLM reasoning in distributed symmetry breaking and meta-cognitive thinking. The benchmark focuses on coloring odd cycle graphs ($C_3, C_5, C_{11}$) with limited colors, where deterministic, non-communicating agents fail in infinite loops. A strategy passing mechanism is implemented as a form of consistent memory. We show that while standard LLMs and classical heuristics struggle, advanced reasoning models (e.g., O3) devise strategies to escape deadlocks. LoopBench allows the study of emergent distributed algorithms based on language-based reasoning, offering a testbed for collective intelligence.
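
The failure mode LoopBench targets is easy to reproduce: on an odd cycle, identical deterministic agents that all re-color simultaneously never break the symmetry, while a little randomness resolves it quickly. The simulation below uses a lowest-free-color rule and a random tie-breaking variant; both are illustrative stand-ins for the benchmark's actual LLM agents and strategy-passing memory.

```python
import random

def neighbors(i: int, n: int) -> tuple[int, int]:
    return ((i - 1) % n, (i + 1) % n)

def synchronous_round(colors: list[int], num_colors: int, randomized: bool) -> list[int]:
    """Every conflicted agent simultaneously re-picks a color not used by its neighbors."""
    n, new = len(colors), []
    for i in range(n):
        taken = {colors[j] for j in neighbors(i, n)}
        if colors[i] not in taken:
            new.append(colors[i])                      # already conflict-free: keep it
            continue
        allowed = [c for c in range(num_colors) if c not in taken]
        new.append(random.choice(allowed) if randomized else allowed[0])
    return new

def run(n: int = 5, num_colors: int = 3, randomized: bool = False, max_rounds: int = 50) -> int:
    colors = [0] * n                                   # identical start: the symmetric worst case
    for t in range(max_rounds):
        if all(colors[i] not in {colors[j] for j in neighbors(i, n)} for i in range(n)):
            return t                                   # proper coloring reached
        colors = synchronous_round(colors, num_colors, randomized)
    return -1                                          # still oscillating

random.seed(0)
print("deterministic agents on C5:", run(randomized=False))  # -1: stuck flipping 0 <-> 1 forever
print("randomized tie-breaking:   ", run(randomized=True))   # usually resolves in a few rounds
```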

[233] Adjudicator: Correcting Noisy Labels with a KG-Informed Council of LLM Agents

Doohee You, Sundeep Paul

Main category: cs.AI

TL;DR: Adjudicator: A neuro-symbolic system using knowledge graphs and multi-agent LLM debates to automatically identify and correct noisy labels in production ML systems, achieving near-perfect F1-score on benchmark data.

DetailsMotivation: Production ML systems are limited by training data quality, especially in high-stakes industrial applications where noisy labels degrade performance and erode user trust. There's a need for automated, high-precision data verification for golden datasets in strictly governed environments.

Method: Neuro-symbolic approach: 1) Constructs dynamic Knowledge Graph (KG) to unify item context, 2) Uses “Council of Agents” - a novel multi-agent LLM architecture where specialized agents debate and vote on label validity, 3) Implements novel override logic using KG to identify complex structural errors.

Result: Achieved 0.99 F1-score on 1,000-item balanced subset of AlleNoise benchmark, significantly outperforming single-LLM baseline (0.48 F1) and non-KG council (0.59 F1). Perfect Recall for complex structural errors that baselines failed to find.

Conclusion: Adjudicator demonstrates a robust, explainable system for automated high-precision data verification, serving as vital proof-of-concept for generating golden datasets in strictly governed industrial environments.

Abstract: The performance of production machine learning systems is fundamentally limited by the quality of their training data. In high-stakes industrial applications, noisy labels can degrade performance and erode user trust. This paper presents Adjudicator, a system that addresses the critical data mining challenge of automatically identifying and correcting label noise and has been validated for production deployment. Adjudicator models this as a neuro-symbolic task, first constructing a dynamic Knowledge Graph (KG) to unify item context. This KG then informs a “Council of Agents,” a novel multi-agent Large Language Model architecture where specialized agents debate and vote on a label’s validity. We validate our system on a 1,000-item balanced subset of the AlleNoise benchmark. Our KG-informed model achieves a 0.99 F1-score, significantly outperforming a single-LLM baseline (0.48 F1) and a non-KG council (0.59 F1). Our analysis attributes this to the precision of a novel override logic that uses the KG to identify complex, structural errors with complete Recall, a class of errors that baselines fail to find. This result demonstrates a robust and explainable system for automated, high-precision data verification, serving as a vital proof-of-concept for generating golden datasets in strictly governed industrial environments.
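
At its core the "Council of Agents" is a structured vote over a candidate label. The sketch below reduces that to callables returning a label and a rationale, with a simple majority threshold; the KG construction, the override logic, and the real agent prompts are omitted, and every name here is hypothetical.

```python
from collections import Counter

def adjudicate(item, agents, threshold=0.6):
    """Let a council of specialist agents vote on an item's label.

    Each agent is a callable returning (proposed_label, rationale). If the top label's
    vote share clears `threshold`, it wins; otherwise the item is escalated to a human.
    """
    votes = [agent(item) for agent in agents]
    tally = Counter(label for label, _ in votes)
    label, count = tally.most_common(1)[0]
    decision = label if count / len(votes) >= threshold else "needs_human_review"
    return decision, votes

# Hypothetical specialists; in the paper these are KG-informed LLM roles.
taxonomy_agent  = lambda item: ("Electronics>Audio", "title mentions 'wireless earbuds'")
attribute_agent = lambda item: ("Electronics>Audio", "Bluetooth attribute present")
skeptic_agent   = lambda item: ("Sports>Outdoor", "brand also sells sports gear")

item = {"title": "wireless earbuds, Bluetooth 5.3", "current_label": "Sports>Outdoor"}
decision, votes = adjudicate(item, [taxonomy_agent, attribute_agent, skeptic_agent])
print(decision)   # Electronics>Audio: the noisy label would be corrected
```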

[234] EvoLattice: Persistent Internal-Population Evolution through Multi-Alternative Quality-Diversity Graph Representations for LLM-Guided Program Discovery

Kamer Ali Yuksel

Main category: cs.AI

TL;DR: EvoLattice is a framework that represents populations of programs/agents as a single DAG with multiple alternatives per node, enabling combinatorial search space exploration while preserving structural correctness.

DetailsMotivation: Existing LLM-based evolution methods use overwrite-based mutations that discard useful variants, suffer from destructive edits, and explore brittle search spaces prone to structural failure.

Method: Represents candidate populations as a single directed acyclic graph where each node stores multiple persistent alternatives. Each valid path defines a distinct executable candidate. Uses fine-grained alternative-level evaluation across all paths, with deterministic self-repair for structural correctness.

Result: EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods in program synthesis tasks (proxy and optimizer meta-learning).

Conclusion: EvoLattice provides a novel graph-based representation for evolutionary search that enables quality-diversity optimization dynamics to emerge implicitly from its multi-alternative structure, overcoming limitations of traditional overwrite-based mutation approaches.

Abstract: Large language models (LLMs) are increasingly used to evolve programs and multi-agent systems, yet most existing approaches rely on overwrite-based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine-grained alternative-level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data-driven feedback signal for LLM-guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self-repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub-agent behaviors. Across program synthesis (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The resulting dynamics resemble quality-diversity optimization, emerging implicitly from EvoLattice’s internal multi-alternative representation rather than an explicit external archive.
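
The representational trick is that one small graph of per-node alternatives encodes a combinatorial population, and every alternative can be scored over all paths it participates in. The toy below uses a three-node linear lattice of tiny functions; real EvoLattice nodes hold program fragments or agent behaviors, so everything here is a deliberately simplified illustration.

```python
from itertools import product
from statistics import mean

# A toy lattice: a linear pipeline whose nodes each hold several alternatives.
lattice = {
    "scale": [lambda x: x * 2, lambda x: x * 3],
    "shift": [lambda x: x + 1, lambda x: x - 1, lambda x: x],
    "clip":  [lambda x: max(x, 0), lambda x: min(x, 10)],
}

def candidates(lattice):
    """Every path (one alternative per node) is a distinct executable candidate."""
    names = list(lattice)
    for choice in product(*(range(len(lattice[n])) for n in names)):
        yield dict(zip(names, choice))

def score(candidate, x=4):
    out = x
    for node, idx in candidate.items():
        out = lattice[node][idx](out)
    return -abs(out - 7)            # toy objective: finish as close to 7 as possible

# Alternative-level statistics: how does each local choice fare across all paths it is in?
per_alt = {(n, i): [] for n in lattice for i in range(len(lattice[n]))}
for cand in candidates(lattice):
    s = score(cand)
    for n, i in cand.items():
        per_alt[(n, i)].append(s)

for (n, i), scores in sorted(per_alt.items()):
    print(f"{n}[{i}]: mean score {mean(scores):+.2f} over {len(scores)} paths")
```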

[235] AI-Powered Annotation Pipelines for Stabilizing Large Language Models: A Human-AI Synergy Approach

Gangesh Pathak, Prasanna Kumar

Main category: cs.AI

TL;DR: This paper proposes an AI-based annotation pipeline that systematically identifies, labels, and fixes instability patterns in LLM outputs to address reliability issues in regulated industries.

DetailsMotivation: LLMs face reliability challenges in regulated industries due to instability, inconsistent reasoning, hallucinations, and performance variability. Current stabilization methods like RLHF and supervised fine-tuning are expensive, human-intensive, and not easily scalable.

Method: The paper presents a human-AI synergy method combining automated weak supervision and confidence-based annotation with targeted human validation. It introduces stability-specific annotation categories (semantic consistency, factual correctness, logical coherence) within a feedback loop framework for continuous model calibration.

Result: The proposed AI-based annotation pipeline aims to systematically identify and fix instability patterns in LLM outputs, ensuring reliability and ethical integrity of feedback information for model improvement.

Conclusion: The framework enables continuous calibration and enhancement of LLM robustness through stability-specific annotation and feedback loops, addressing scalability and reliability challenges in regulated industry applications.

Abstract: LLM implementations are failing in highly regulated industries owing to instability issues, inconsistent reasoning, hallucinations, and performance variability, especially in workflows. These reliability issues restrict the safe use of LLMs in areas that demand factual precision and consistent behavior (Aiyappa et al., 2023). Current stabilization methods, such as reinforcement learning with human feedback (RLHF) and supervised fine-tuning, offer quantifiable improvements but are expensive and rely on intensive human annotation, and therefore do not scale sustainably (Dong et al., 2023; Retzlaff et al., 2024). This paper presents an AI-based annotation pipeline that systematically identifies, labels, and fixes instability patterns in LLM outputs. Our human-AI synergy method combines automated weak supervision and confidence-based annotation with targeted human validation to ensure the reliability and ethical integrity of the feedback data (Cabitza et al., 2023; Jiang et al., 2023). The framework introduces stability-specific annotation categories for semantic consistency, factual correctness, and logical coherence, enabling continuous model calibration and robustness improvement through feedback loops (Honovich et al., 2021; Nan et al., 2021).
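
The weak-supervision-plus-confidence-routing loop the abstract describes can be sketched very simply: several cheap labelers vote on each model output, and only low-agreement items are escalated to human validators. The labelers and thresholds below are hypothetical toys for one stability category, not the paper's pipeline.

```python
def route_annotations(items, weak_labelers, confidence_threshold=0.8):
    """Label items with weak labelers; low-agreement items go to human validation."""
    auto, needs_human = [], []
    for item in items:
        votes = [labeler(item) for labeler in weak_labelers]
        top = max(set(votes), key=votes.count)
        confidence = votes.count(top) / len(votes)
        (auto if confidence >= confidence_threshold else needs_human).append((item, top, confidence))
    return auto, needs_human

# Hypothetical weak labelers for the "semantic consistency" category.
contradiction_cue = lambda text: "inconsistent" if " but earlier " in text else "consistent"
length_heuristic  = lambda text: "inconsistent" if len(text.split()) > 60 else "consistent"
always_ok         = lambda text: "consistent"

outputs = [
    "The capital of France is Paris.",
    "The plan is safe, but earlier we said it was not approved.",
]
auto, human = route_annotations(outputs, [contradiction_cue, length_heuristic, always_ok])
print(len(auto), "auto-labeled;", len(human), "routed to human validation")
```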

[236] Meta Hierarchical Reinforcement Learning for Scalable Resource Management in O-RAN

Fatemeh Lotfi, Fatemeh Afghah

Main category: cs.AI

TL;DR: Meta-HRL framework for O-RAN combines hierarchical RL with meta-learning to optimize resource allocation and network slicing, achieving 19.8% efficiency improvement and faster adaptation.

DetailsMotivation: Modern applications need adaptive wireless networks; O-RAN with RIC enables dynamic resource management, but existing AI methods struggle with unpredictable conditions.

Method: Adaptive Meta-HRL framework inspired by MAML with hierarchical control: high-level allocates resources across slices, low-level agents handle intra-slice scheduling, using meta-update weighted by temporal difference error variance.

Result: 19.8% improvement in network management efficiency vs baseline RL/meta-RL, faster adaptation, higher QoS satisfaction across eMBB/URLLC/mMTC slices, 40% faster adaptation in ablation studies, consistent performance with scaling.

Conclusion: Meta-HRL framework effectively addresses dynamic O-RAN resource management with theoretical convergence guarantees and practical performance improvements in efficiency, adaptation speed, and QoS.

Abstract: The increasing complexity of modern applications demands wireless networks capable of real-time adaptability and efficient resource management. The Open Radio Access Network (O-RAN) architecture, with its RAN Intelligent Controller (RIC) modules, has emerged as a pivotal solution for dynamic resource management and network slicing. While artificial intelligence (AI)-driven methods have shown promise, most approaches struggle to maintain performance under unpredictable and highly dynamic conditions. This paper proposes an adaptive Meta Hierarchical Reinforcement Learning (Meta-HRL) framework, inspired by Model-Agnostic Meta-Learning (MAML), to jointly optimize resource allocation and network slicing in O-RAN. The framework integrates hierarchical control with meta-learning to enable both global and local adaptation: the high-level controller allocates resources across slices, while low-level agents perform intra-slice scheduling. The adaptive meta-update mechanism weights tasks by temporal difference error variance, improving stability and prioritizing complex network scenarios. Theoretical analysis establishes sublinear convergence and regret guarantees for the two-level learning process. Simulation results demonstrate a 19.8% improvement in network management efficiency compared with baseline RL and meta-RL approaches, along with faster adaptation and higher QoS satisfaction across eMBB, URLLC, and mMTC slices. Additional ablation and scalability studies confirm the method’s robustness, achieving up to 40% faster adaptation and consistent fairness, latency, and throughput performance as network scale increases.
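
The adaptive meta-update, weighting each slice's task by the variance of its temporal-difference errors so that volatile scenarios dominate the outer step, can be written in a few lines. The scalar weighting and MAML-style outer update below are an illustrative reduction; the paper's hierarchical controllers and inner-loop adaptation are not shown.

```python
import numpy as np

def meta_update(theta, task_grads, task_td_errors, outer_lr=0.05):
    """MAML-style outer step whose task weights follow TD-error variance (illustrative).

    theta:          (d,) shared initial policy parameters
    task_grads:     list of (d,) per-task meta-gradients from inner-loop adaptation
    task_td_errors: list of 1-D arrays of TD errors observed on each task
    """
    variances = np.array([np.var(e) for e in task_td_errors])
    weights = variances / variances.sum()             # noisier / harder slices get more weight
    meta_grad = sum(w * g for w, g in zip(weights, task_grads))
    return theta - outer_lr * meta_grad, weights

rng = np.random.default_rng(0)
theta = np.zeros(4)
grads = [rng.normal(size=4) for _ in range(3)]                 # e.g. eMBB, URLLC, mMTC tasks
td = [rng.normal(size=100) * s for s in (0.2, 1.0, 3.0)]       # increasingly volatile slices
theta, w = meta_update(theta, grads, td)
print("task weights:", np.round(w, 3))   # the most volatile slice dominates the update
```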

[237] Grammar Search for Multi-Agent Systems

Mayank Singh, Vikas Yadav, Shiva Krishna Reddy Malay, Shravan Nayak, Sai Rajeswar, Sathwik Tejaswi Madhusudhan, Eduardo Blanco

Main category: cs.AI

TL;DR: A structured framework using fixed, composable components outperforms LLM-based free-form search for multi-agent system design, achieving better results on 4/5 benchmarks with cost efficiency and interpretability.

DetailsMotivation: Prior approaches to automatic search for Multi-Agent Systems rely on LLM-based free-form search over code space, which lacks structure and may be inefficient. The authors aim to develop a more structured, systematic approach.

Method: Proposes a structured framework that explores the multi-agent system design space through a fixed set of simple, composable components, rather than using LLM-based free-form generation.

Result: Outperforms prior approaches on four out of five benchmarks across mathematics and question answering domains, despite lacking the generative flexibility of LLMs. Also achieves cost-efficient search and generates modular, interpretable systems with simpler logic.

Conclusion: A structured approach using fixed, composable components is more effective than LLM-based free-form search for designing multi-agent systems, offering better performance, cost efficiency, and interpretability.

Abstract: Automatic search for Multi-Agent Systems has recently emerged as a key focus in agentic AI research. Several prior approaches have relied on LLM-based free-form search over the code space. In this work, we propose a more structured framework that explores the same space through a fixed set of simple, composable components. We show that, despite lacking the generative flexibility of LLMs during the candidate generation stage, our method outperforms prior approaches on four out of five benchmarks across two domains: mathematics and question answering. Furthermore, our method offers additional advantages, including a more cost-efficient search process and the generation of modular, interpretable multi-agent systems with simpler logic.

[238] ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making

Yitong Luo, Ziang Chen, Hou Hei Lam, Jiayu zhan, Junqi Wang, Zhenliang Zhang, Xue Feng

Main category: cs.AI

TL;DR: ValuePilot: A value-driven framework for personalized AI decision-making that learns human value preferences to enable consistent, interpretable behavior across novel scenarios.

DetailsMotivation: As AI systems expand into real-world applications, there's a critical need for personalized decision-making that aligns with individual users' value preferences beyond just task completion or collective alignment. Current task-oriented paradigms driven by external rewards lack interpretability and fail to adapt to novel scenarios.

Method: Propose ValuePilot, a two-phase framework: 1) Dataset Generation Toolkit (DGT) constructs diverse, value-annotated scenarios through human-LLM collaboration, and 2) Decision-Making Module (DMM) learns to evaluate actions based on personal value preferences for context-sensitive, individualized decisions.

Result: DMM outperforms strong LLM baselines (GPT-5, Claude-Sonnet-4, Gemini-2-flash, Llama-3.1-70b) in aligning with human action choices on previously unseen scenarios.

Conclusion: Value-driven decision-making is an effective and extensible engineering pathway toward building interpretable, personalized AI agents that can act consistently across diverse contexts.

Abstract: Personalized decision-making is essential for human-AI interaction, enabling AI agents to act in alignment with individual users’ value preferences. As AI systems expand into real-world applications, adapting to personalized values beyond task completion or collective alignment has become a critical challenge. We address this by proposing a value-driven approach to personalized decision-making. Human values serve as stable, transferable signals that support consistent and generalizable behavior across contexts. Compared to task-oriented paradigms driven by external rewards and incentives, value-driven decision-making enhances interpretability and enables agents to act appropriately even in novel scenarios. We introduce ValuePilot, a two-phase framework consisting of a dataset generation toolkit (DGT) and a decision-making module (DMM). DGT constructs diverse, value-annotated scenarios from a human-LLM collaborative pipeline. DMM learns to evaluate actions based on personal value preferences, enabling context-sensitive, individualized decisions. When evaluated on previously unseen scenarios, DMM outperforms strong LLM baselines, including GPT-5, Claude-Sonnet-4, Gemini-2-flash, and Llama-3.1-70b, in aligning with human action choices. Our results demonstrate that value-driven decision-making is an effective and extensible engineering pathway toward building interpretable, personalized AI agents.

[239] Compressed Causal Reasoning: Quantization and GraphRAG Effects on Interventional and Counterfactual Accuracy

Steve Nwaiwu, Nipat Jongsawat, Anucha Tungkasthan

Main category: cs.AI

TL;DR: Causal reasoning in LLMs remains surprisingly robust under 4-bit quantization (NF4), with less than 1% overall degradation, though interventional queries are most sensitive while counterfactual reasoning shows heterogeneous weaknesses.

DetailsMotivation: As LLM deployment shifts to edge/resource-constrained environments requiring quantized models (INT8, NF4), the impact of precision reduction on formal causal reasoning across all three levels of Pearl's Causal Ladder is poorly understood and needs systematic evaluation.

Method: Systematically evaluated quantization effects using 3000-sample stratified CLadder benchmark on Llama 3 8B, tested across INT8 and NF4 precisions, and further evaluated Graph Retrieval Augmented Generation with ground truth causal graphs.

Result: Causal reasoning accuracy remains broadly stable under quantization (NF4 shows <1% overall degradation). Interventional queries (rung 2) are most sensitive, while counterfactual reasoning (rung 3) is stable but shows heterogeneous weaknesses. Graph RAG improves NF4 interventional accuracy by +1.7%, partially offsetting compression degradation.

Conclusion: Causal reasoning is unexpectedly robust to 4-bit quantization, graph-structured augmentation can selectively reinforce interventional reasoning, and current counterfactual benchmarks fail to capture deeper causal brittleness, providing guidance for deploying efficient causal AI systems.

Abstract: Causal reasoning in Large Language Models spanning association, intervention, and counterfactual inference is essential for reliable decision making in high stakes settings. As deployment shifts toward edge and resource constrained environments, quantized models such as INT8 and NF4 are becoming standard. Yet the impact of precision reduction on formal causal reasoning is poorly understood. To our knowledge, this is the first study to systematically evaluate quantization effects across all three levels of Pearl’s Causal Ladder. Using a 3000 sample stratified CLadder benchmark, we find that rung level accuracy in Llama 3 8B remains broadly stable under quantization, with NF4 showing less than one percent overall degradation. Interventional queries at rung 2 are the most sensitive to precision loss, whereas counterfactual reasoning at rung 3 is comparatively stable but exhibits heterogeneous weaknesses across query types such as collider bias and backdoor adjustment. Experiments on the CRASS benchmark show near identical performance across precisions, indicating that existing commonsense counterfactual datasets lack the structural sensitivity needed to reveal quantization induced reasoning drift. We further evaluate Graph Retrieval Augmented Generation using ground truth causal graphs and observe a consistent improvement in NF4 interventional accuracy of plus 1.7 percent, partially offsetting compression related degradation. These results suggest that causal reasoning is unexpectedly robust to four bit quantization, graph structured augmentation can selectively reinforce interventional reasoning, and current counterfactual benchmarks fail to capture deeper causal brittleness. This work provides an initial empirical map of compressed causal reasoning and practical guidance for deploying efficient and structurally supported causal AI systems.
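
For readers who want to reproduce the kind of setup being compared, loading Llama 3 8B in 4-bit NF4 with Hugging Face transformers and bitsandbytes looks roughly like the snippet below. It requires a CUDA GPU, access to the gated model (any other causal LM can be substituted), and argument names that may shift between library versions; it is a setup sketch, not the paper's evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # gated; substitute any causal LM you can access

# 4-bit NF4 configuration, the precision whose causal-reasoning impact the paper measures.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=nf4_config, device_map="auto"
)

# A rung-2 style probe: an interventional question in plain text.
prompt = ("We intervene and force the sprinkler on, do(Sprinkler = on). "
          "Does this change the probability that it rained? Answer yes or no, then explain briefly.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```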

[240] State-Dependent Refusal and Learned Incapacity in RLHF-Aligned Language Models

TK Lee

Main category: cs.AI

TL;DR: The paper introduces a qualitative auditing framework to detect selective behavioral patterns in LLMs during extended interactions, revealing systematic refusal in policy-sensitive domains while maintaining normal performance elsewhere.

DetailsMotivation: Standard quantitative benchmarks fail to capture behavioral patterns that emerge during extended interactions with LLMs, particularly policy-linked behavioral selectivity where models show different responses in sensitive vs. non-sensitive domains.

Method: Qualitative case-study methodology with 86-turn dialogue session to audit behavioral selectivity; operationalizes three response regimes (Normal Performance, Functional Refusal, Meta-Narrative); analyzes asymmetry between domains.

Result: Models show consistent asymmetry: Normal Performance in broad domains but repeated Functional Refusal in provider- or policy-sensitive domains; Meta-Narrative role-framing tends to co-occur with refusals in sensitive contexts.

Conclusion: Proposes interaction-level auditing framework based on observable behavior; introduces “learned incapacity” as behavioral descriptor for selective withholding; warrants further investigation across users and models for alignment side effects.

Abstract: Large language models (LLMs) are widely deployed as general-purpose tools, yet extended interaction can reveal behavioral patterns not captured by standard quantitative benchmarks. We present a qualitative case-study methodology for auditing policy-linked behavioral selectivity in long-horizon interaction. In a single 86-turn dialogue session, the same model shows Normal Performance (NP) in broad, non-sensitive domains while repeatedly producing Functional Refusal (FR) in provider- or policy-sensitive domains, yielding a consistent asymmetry between NP and FR across domains. Drawing on learned helplessness as an analogy, we introduce learned incapacity (LI) as a behavioral descriptor for this selective withholding without implying intentionality or internal mechanisms. We operationalize three response regimes (NP, FR, Meta-Narrative; MN) and show that MN role-framing narratives tend to co-occur with refusals in the same sensitive contexts. Overall, the study proposes an interaction-level auditing framework based on observable behavior and motivates LI as a lens for examining potential alignment side effects, warranting further investigation across users and models.

[241] Mathematics and Coding are Universal AI Benchmarks

Przemyslaw Chojecki

Main category: cs.AI

TL;DR: Mathematics and coding tasks create dense subspaces in psychometric battery moduli space, with formal proof systems enabling spectrally stable self-improvement in AI agents.

DetailsMotivation: To understand the special role of mathematics and coding in evaluating and improving AI agents, particularly how formal verification enables stable self-improvement regimes.

Method: Uses AAI framework and GVU dynamics to define Mathematics Fiber, analyzes formal proof kernels (Lean, Coq) with oracle-like verification, proves density theorem under uniform tightness and Lipschitz conditions.

Result: Coding alone is universal (dense in moduli space), pure mathematics is not universal but privileged spectrally; formal mathematics enables spectrally stable self-improvement via oracle verification.

Conclusion: Mathematics and coding provide “universal coordinates” for AI evaluation, with formal mathematics serving as natural ignition domain for recursive self-improvement in advanced AI agents.

Abstract: We study the special role of mathematics and coding inside the moduli space of psychometric batteries for AI agents. Building on the AAI framework and GVU dynamics from previous works, we define the Mathematics Fiber and show that, when paired with formal proof kernels (e.g. Lean, Coq), GVU flows on this fiber admit spectrally stable self-improvement regimes due to oracle-like verification. Our main technical result is a density theorem: under uniform tightness of agent outputs and a Lipschitz AAI functional, the subspace of batteries generated by mathematical theorem-proving and coding tasks is dense in the moduli space of batteries with respect to the evaluation metric. Coding alone is universal in this sense, while pure mathematics is not; its privilege is spectral rather than expressive. We interpret this as evidence that mathematics and coding provide “universal coordinates” for evaluation, and that formal mathematics is a natural ignition domain for recursive self-improvement in advanced AI agents.

[242] Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems

Javier Marín

Main category: cs.AI

TL;DR: The paper introduces Semantic Grounding Index (SGI), a geometric measure that detects hallucinations in RAG systems by comparing angular distances between responses, questions, and contexts in embedding space.

DetailsMotivation: To develop a computationally efficient and theoretically grounded method for identifying hallucinated responses in RAG systems by examining geometric patterns in embedding space.

Method: Define SGI as ratio of angular distances from response to question vs context on unit hypersphere. Analyze geometric relationships using spherical triangle inequality, validate across multiple embedding models on HaluEval dataset.

Result: Hallucinated responses show “semantic laziness” - staying close to questions rather than moving toward contexts. Large effect sizes (Cohen’s d 0.92-1.28), strong cross-model correlation (r=0.85). Discriminative power increases with question-context separation (AUC improves from 0.72 to 0.83).

Conclusion: SGI provides efficient infrastructure for detecting hallucinations in RAG systems, measuring topical engagement rather than factual accuracy, with good calibration (ECE=0.10) for probability estimation.

Abstract: When retrieval-augmented generation (RAG) systems hallucinate, what geometric trace does this leave in embedding space? We introduce the Semantic Grounding Index (SGI), defined as the ratio of angular distances from the response to the question versus the context on the unit hypersphere $\mathbb{S}^{d-1}$. Our central finding is *semantic laziness*: hallucinated responses remain angularly proximate to questions rather than departing toward retrieved contexts. On HaluEval ($n$=5,000), we observe large effect sizes (Cohen’s $d$ ranging from 0.92 to 1.28) across five embedding models with mean cross-model correlation $r$=0.85. Crucially, we derive from the spherical triangle inequality that SGI’s discriminative power should increase with question-context angular separation $\theta(q,c)$, a theoretical prediction confirmed empirically: effect size rises monotonically from $d$=0.61 (low $\theta(q,c)$) to $d$=1.27 (high $\theta(q,c)$), with AUC improving from 0.72 to 0.83. Subgroup analysis reveals that SGI excels on long responses ($d$=2.05) and short questions ($d$=1.22), while remaining robust across context lengths. Calibration analysis yields ECE=0.10, indicating SGI scores can serve as probability estimates, not merely rankings. A critical negative result on TruthfulQA (AUC=0.478) establishes that angular geometry measures topical engagement rather than factual accuracy. SGI provides computationally efficient, theoretically grounded infrastructure for identifying responses that warrant verification in production RAG deployments.
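
Because SGI is a purely geometric quantity, it is easy to compute from any sentence-embedding model. The sketch below implements the stated ratio of angular distances on the unit hypersphere; the toy vectors merely illustrate that a response hugging the question yields a small ratio while one that moves toward the retrieved context yields a larger one.

```python
import numpy as np

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in radians between two vectors after projection onto the unit hypersphere."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.arccos(np.clip(u @ v, -1.0, 1.0)))

def semantic_grounding_index(response, question, context) -> float:
    """SGI = angular distance(response, question) / angular distance(response, context)."""
    return angular_distance(response, question) / angular_distance(response, context)

# Toy embeddings: a "lazy" response that hugs the question vs. one engaging the context.
rng = np.random.default_rng(0)
q, c = rng.normal(size=384), rng.normal(size=384)
lazy = q + 0.05 * rng.normal(size=384)          # barely departs from the question
grounded = 0.3 * q + 0.7 * c                    # moves toward the retrieved context
print("lazy SGI:    ", round(semantic_grounding_index(lazy, q, c), 3))      # small ratio
print("grounded SGI:", round(semantic_grounding_index(grounded, q, c), 3))  # larger ratio
```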

[243] MURIM: Multidimensional Reputation-based Incentive Mechanism for Federated Learning

Sindhuja Madabushi, Dawood Wasif, Jin-Hee Cho

Main category: cs.AI

TL;DR: MURIM is a multi-dimensional reputation-based incentive mechanism for federated learning that jointly considers client reliability, privacy, resource capacity, and fairness to prevent malicious clients from earning undeserved rewards.

DetailsMotivation: Federated Learning faces key challenges including weak client incentives, privacy risks, and resource constraints. Assessing client reliability is essential for fair incentive allocation and ensuring meaningful contributions to the global model.

Method: Proposes MURIM, a multi-dimensional reputation-based incentive mechanism that allocates incentives based on client contribution, latency, and reputation, supported by a reliability verification module.

Result: Achieves up to 18% improvement in fairness metrics, reduces privacy attack success rates by 5-9%, and improves robustness against poisoning and noisy-gradient attacks by up to 85% compared to state-of-the-art baselines on MNIST, FMNIST, and ADULT Income datasets.

Conclusion: MURIM effectively mitigates adversarial threats, promotes fair and truthful participation, and preserves stable model convergence across heterogeneous and dynamic federated settings.

Abstract: Federated Learning (FL) has emerged as a leading privacy-preserving machine learning paradigm, enabling participants to share model updates instead of raw data. However, FL continues to face key challenges, including weak client incentives, privacy risks, and resource constraints. Assessing client reliability is essential for fair incentive allocation and ensuring that each client’s data contributes meaningfully to the global model. To this end, we propose MURIM, a MUlti-dimensional Reputation-based Incentive Mechanism that jointly considers client reliability, privacy, resource capacity, and fairness while preventing malicious or unreliable clients from earning undeserved rewards. MURIM allocates incentives based on client contribution, latency, and reputation, supported by a reliability verification module. Extensive experiments on MNIST, FMNIST, and ADULT Income datasets demonstrate that MURIM achieves up to 18% improvement in fairness metrics, reduces privacy attack success rates by 5-9%, and improves robustness against poisoning and noisy-gradient attacks by up to 85% compared to state-of-the-art baselines. Overall, MURIM effectively mitigates adversarial threats, promotes fair and truthful participation, and preserves stable model convergence across heterogeneous and dynamic federated settings.

[244] Exploring Network-Knowledge Graph Duality: A Case Study in Agentic Supply Chain Risk Analysis

Evan Heus, Rick Bookstaber, Dhruv Sharma

Main category: cs.AI

TL;DR: LLM agent framework for supply chain risk analysis using network-KG duality, graph traversal with centrality scores, and context shells for quantitative data interpretation.

DetailsMotivation: LLMs struggle with complex multi-modal financial risk data, standard RAG oversimplifies relationships, and specialist models are costly and static.

Method: Treat supply chain network as knowledge graph, use graph traverser guided by network centrality scores, orchestrate with agentic architecture combining graph retrieval, factor tables, and news streams, and employ “context shells” to embed raw figures in natural language.
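
A minimal sketch of the idea, assuming networkx and a toy supplier graph; the centrality measure, the greedy traversal policy, and the context-shell wording are illustrative choices, not the paper's implementation.

```python
import networkx as nx

# Toy supply-chain graph: edges point from supplier to customer.
G = nx.DiGraph()
G.add_edges_from([
    ("ChipCo", "AutoCorp"), ("ChipCo", "PhoneCorp"),
    ("MetalCo", "ChipCo"), ("MetalCo", "AutoCorp"),
])

# Centrality scores guide which downstream node the traverser expands first.
centrality = nx.betweenness_centrality(G)

def salient_path(graph, start, depth=2):
    """Greedy traversal: at each hop, follow the most central downstream node."""
    path, node = [start], start
    for _ in range(depth):
        succs = list(graph.successors(node))
        if not succs:
            break
        node = max(succs, key=lambda n: centrality.get(n, 0.0))
        path.append(node)
    return path

def context_shell(entity, metric_name, value):
    """Wrap a raw figure in natural language so an LLM can use it directly."""
    return f"{entity} has a {metric_name} of {value:.2f}, which should be weighed against its position in the supply network."

print(" -> ".join(salient_path(G, "MetalCo")))
print(context_shell("ChipCo", "revenue exposure to AutoCorp", 0.42))
```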

Result: Enables generation of concise, explainable, context-rich risk narratives in real-time without costly fine-tuning or dedicated graph database.

Conclusion: Lightweight LLM-centric agent framework effectively addresses supply chain risk analysis by leveraging network-KG duality and making quantitative data intelligible to LLMs through context shells.

Abstract: Large Language Models (LLMs) struggle with the complex, multi-modal, and network-native data underlying financial risk. Standard Retrieval-Augmented Generation (RAG) oversimplifies relationships, while specialist models are costly and static. We address this gap with an LLM-centric agent framework for supply chain risk analysis. Our core contribution is to exploit the inherent duality between networks and knowledge graphs (KG). We treat the supply chain network as a KG, allowing us to use structural network science principles for retrieval. A graph traverser, guided by network centrality scores, efficiently extracts the most economically salient risk paths. An agentic architecture orchestrates this graph retrieval alongside data from numerical factor tables and news streams. Crucially, it employs novel “context shells” – descriptive templates that embed raw figures in natural language – to make quantitative data fully intelligible to the LLM. This lightweight approach enables the model to generate concise, explainable, and context-rich risk narratives in real-time without costly fine-tuning or a dedicated graph database.

[245] Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms

Yang Cao, Yubin Chen, Xuyang Guo, Zhao Song, Song Yue, Jiahao Zhang, Jiale Zhao

Main category: cs.AI

TL;DR: The paper benchmarks four frontier LLMs (GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, Grok-4) on graduate-level randomized algorithms proofs from Motwani & Raghavan’s textbook, finding significant performance variance with top models achieving ~66% accuracy.

DetailsMotivation: While LLMs show promise in mathematical reasoning and scientific discovery (as demonstrated by recent studies), there's a need for rigorous evaluation of their baseline capabilities on canonical graduate-level mathematical theory to understand their true reasoning abilities.

Method: Comprehensive benchmark testing four frontier LLMs on generating formal LaTeX proofs for lemmas and exercises from the classic “Randomized Algorithms” textbook by Motwani and Raghavan, with qualitative analysis of proof quality.

Result: Top-tier models (Gemini and Claude) achieved ~66% accuracy, demonstrating robust grasp of probabilistic method and formal logic, while other models lagged significantly (~40% accuracy). Analysis revealed differences in conciseness, hallucination rates, and logical structure across models.

Conclusion: Frontier LLMs have reached proficiency suitable for graduate-level pedagogical assistance and formalization, but significant variance exists in their reliability for rigorous mathematical derivation, highlighting the need for careful model selection in mathematical applications.

Abstract: The rapid advancement of large language models (LLMs) has led to significant breakthroughs in automated mathematical reasoning and scientific discovery. Georgiev, Gómez-Serrano, Tao, and Wagner [GGSTW+25] demonstrate that AI systems can explore new constructions and improve existing bounds, illustrating the growing potential of LLMs to accelerate mathematical discovery. Similarly, Bubeck et al. [BCE+25] show that GPT-5 can meaningfully contribute to scientific workflows, from proposing hypotheses to generating proofs and analyses. Despite these advances, a rigorous evaluation of these models on canonical, graduate-level mathematical theory remains necessary to understand their baseline reasoning capabilities. In this paper, we present a comprehensive benchmark of four frontier models: GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4 against the classic curriculum of Randomized Algorithms by Motwani and Raghavan [MR95]. We tasked each model with generating formal LaTeX proofs for a series of lemmas and exercises spanning the textbook. We find that while the top-tier models (Gemini and Claude) achieve a high accuracy rate (approx. 66%), demonstrating a robust grasp of the probabilistic method and formal logic, other models lag significantly in consistency (approx. 40%). We provide a qualitative analysis of the generated proofs, highlighting differences in conciseness, hallucination rates, and logical structure. Our results suggest that while frontier models have reached a threshold of proficiency suitable for graduate-level pedagogical assistance and formalization, significant variance exists in their reliability for rigorous mathematical derivation. The code and the full set of LLM-generated responses are open-sourced and publicly available at https://github.com/magiclinux/math_benchmark_probability.

[246] Single-Agent Scaling Fails Multi-Agent Intelligence: Towards Foundation Models with Native Multi-Agent Intelligence

Shuyue Hu, Haoyang Yan, Yiqun Zhang, Yang Chen, Dongzhan Zhou, Lei Bai

Main category: cs.AI

TL;DR: Current foundation models lack robust multi-agent intelligence despite strong single-agent capabilities, requiring dedicated research to develop native multi-agent abilities.

DetailsMotivation: While foundation models are becoming the "brain" of AI agents and have developed strong single-agent abilities, they lack native multi-agent intelligence, which is essential for the next frontier of AI agent development.

Method: The paper identifies four core multi-agent capabilities (understanding, planning, efficient communication, adaptation), conducts extensive empirical evaluation across 41 large language models and 7 challenging benchmarks, and analyzes the gap between single-agent and multi-agent performance.

Result: Scaling single-agent performance alone does not automatically yield robust multi-agent intelligence, as evidenced by the empirical findings across diverse models and benchmarks.

Conclusion: The research outlines key directions for developing foundation models with native multi-agent intelligence, including dataset construction, evaluation frameworks, training paradigms, and safety considerations.

Abstract: Foundation models (FMs) are increasingly assuming the role of the “brain” of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities – such as GUI interaction or integrated tool use – we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence, across 41 large language models and 7 challenging benchmarks, showing that scaling single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions – spanning dataset construction, evaluation, training paradigms, and safety considerations – for building FMs with native multi-agent intelligence.

[247] ReflCtrl: Controlling LLM Reflection via Representation Engineering

Ge Yan, Chung-En Sun, Tsui-Wei Weng

Main category: cs.AI

TL;DR: ReflCtrl framework uses representation engineering to control LLM self-reflection frequency, showing reflections are often redundant (saving up to 33.6% tokens) and correlated with internal uncertainty signals.

DetailsMotivation: While self-reflection in LLMs improves reasoning performance, it significantly increases inference costs. The paper aims to study self-reflection through representation engineering to understand and control this behavior more efficiently.

Method: Segment model reasoning into steps, identify reflection steps, extract a reflection direction in latent space, and propose ReflCtrl framework for stepwise steering to control reflection frequency.
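
A hedged sketch of the representation-engineering step: a difference-of-means direction between hidden states of reflection steps and other steps, added to or subtracted from activations at inference time. The layer choice, sign convention, and steering strength here are assumptions; the paper's extraction and stepwise steering may differ.

```python
import torch

def reflection_direction(hidden_reflect: torch.Tensor, hidden_other: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between hidden states of reflection steps and
    other reasoning steps; inputs have shape [n_steps, hidden_dim]."""
    d = hidden_reflect.mean(dim=0) - hidden_other.mean(dim=0)
    return d / d.norm()

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Shift a hidden state along the reflection direction; a negative alpha discourages reflection."""
    return hidden + alpha * direction

# Random stand-in activations in place of real model hidden states.
h_reflect, h_other = torch.randn(32, 4096), torch.randn(64, 4096)
direction = reflection_direction(h_reflect, h_other)
h_new = steer(torch.randn(4096), direction, alpha=-2.0)
```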

Result: (1) Reflections are often redundant, especially in stronger models (up to 33.6% token savings while preserving performance); (2) Reflection behavior correlates with internal uncertainty signals, suggesting self-reflection may be controlled by model uncertainty.

Conclusion: The ReflCtrl framework enables efficient control of self-reflection in LLMs, revealing that many reflections are unnecessary and that reflection behavior is linked to internal uncertainty, offering insights for more cost-effective reasoning.

Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is self-reflection: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of representation engineering. We segment the model’s reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models (in our experiments, we can save up to 33.6 percent of reasoning tokens while preserving performance), and (2) the model’s reflection behavior is highly correlated with an internal uncertainty signal, implying self-reflection may be controlled by the model’s uncertainty.

[248] Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

Main category: cs.AI

TL;DR: DTop-p MoE: A dynamic Top-p routing mechanism with sparsity control using a PI controller to adjust probability thresholds and maintain target activated-expert sparsity.

DetailsMotivation: Standard Top-k routing imposes uniform sparsity that ignores token difficulty variations, while existing Top-p implementations use fixed global thresholds leading to uncontrolled computational costs and hyperparameter sensitivity.

Method: Proposes DTop-p MoE with: 1) PI Controller to dynamically adjust probability threshold to align running activated-expert sparsity with target, 2) Dynamic routing normalization to adapt layer-wise routing logits allowing different layers to learn distinct expert-selection patterns while using global threshold.
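
The sketch below shows one way a PI controller could track a target activated-expert fraction by nudging a global probability threshold, assuming the cumulative-probability (Top-p) convention in which a higher threshold recruits more experts; the gains, clipping bounds, and update cadence are illustrative, not the paper's settings.

```python
class ThresholdPIController:
    """Adjust the Top-p routing threshold so the running expert-activation fraction tracks a target."""
    def __init__(self, target_activation, kp=0.05, ki=0.005, p_init=0.5):
        self.target = target_activation   # desired fraction of experts activated per token
        self.kp, self.ki = kp, ki
        self.p = p_init                   # current cumulative-probability threshold
        self.integral = 0.0

    def update(self, observed_activation):
        error = self.target - observed_activation   # positive => too few experts active, raise threshold
        self.integral += error
        self.p += self.kp * error + self.ki * self.integral
        self.p = float(min(max(self.p, 0.01), 0.99))
        return self.p

ctrl = ThresholdPIController(target_activation=0.125)  # e.g., roughly 2 of 16 experts
for observed in (0.20, 0.17, 0.15, 0.13):
    print(round(ctrl.update(observed), 3))
```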

Result: Extensive experiments on Large Language Models and Diffusion Transformers show DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Maintains precise control over activated experts while adaptively allocating resources across tokens and layers.

Conclusion: DTop-p offers robust framework for large-scale MoE pre-training with strong scaling properties across expert granularity, expert capacity, model size, and dataset size.

Abstract: Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.

[249] MobileWorldBench: Towards Semantic World Modeling For Mobile Agents

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover

Main category: cs.AI

TL;DR: This paper introduces MobileWorldBench, a benchmark for evaluating vision-language models as world models for mobile GUI agents, along with a large-scale dataset and framework that improves agent task performance.

DetailsMotivation: Pixel-space world models face practical limitations in GUI settings where predicting complex visual elements is difficult. The authors explore an alternative approach using natural language descriptions of state transitions for GUI agents.

Method: 1) Created MobileWorldBench benchmark for evaluating VLMs as world models; 2) Released MobileWorld dataset with 1.4M samples to improve VLM world modeling; 3) Proposed a framework integrating VLM world models into mobile agent planning.

Result: The semantic world models based on natural language descriptions can directly benefit mobile agents by improving task success rates, demonstrating the effectiveness of this alternative formulation.

Conclusion: Natural language-based world modeling for GUI agents is a viable alternative to pixel-space approaches, with practical benefits for mobile agent performance, supported by new benchmarks, datasets, and frameworks.

Abstract: World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicting raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld

[250] Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting

Moein Heidari, Mohammad Amin Roohi, Ilker Hacihaliloglu

Main category: cs.AI

TL;DR: Echo-CoPilot is an AI agent that uses a large language model to orchestrate specialized echocardiography tools for multi-view, multi-task analysis, outperforming existing models on clinical benchmarks.

DetailsMotivation: Echocardiography interpretation is cognitively demanding and performed manually, while existing foundation models operate in isolation without providing unified clinical assessments.

Method: A multi-view, multi-task agent using a large language model in a ReAct-style loop to decompose clinician queries, invoke specialized tools for view recognition, segmentation, measurement, disease prediction, and report synthesis.
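
A generic ReAct-style orchestration skeleton, for illustration only: the tool registry, the scripted stand-in for the orchestrating LLM, and the measurement values are placeholders rather than Echo-CoPilot's actual tools or prompts.

```python
TOOLS = {
    "view_recognition": lambda study: "parasternal long axis",
    "measurement":      lambda study: {"LVEF": 0.52, "IVSd_cm": 1.3},
    "report_synthesis": lambda study: "Normal LV systolic function; borderline septal thickening.",
}

def scripted_llm(scratchpad: str) -> str:
    """Stand-in for the orchestrating LLM; a real agent would reason over the scratchpad."""
    if "Observation:" not in scratchpad:
        return "ACT measurement"
    return "FINAL LVEF 52% with borderline septal hypertrophy (IVSd 1.3 cm)."

def react_loop(question: str, study: dict, llm=scripted_llm, max_steps: int = 8) -> str:
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        decision = llm(scratchpad)                  # e.g., "ACT measurement" or "FINAL ..."
        if decision.startswith("FINAL"):
            return decision[len("FINAL"):].strip()
        tool = TOOLS[decision.split()[1]]
        scratchpad += f"Action: {decision}\nObservation: {tool(study)}\n"
    return "No answer within step budget."

print(react_loop("Is there left ventricular hypertrophy?", {"clips": ["a4c.dcm"]}))
```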

Result: Achieves 50.8% accuracy on MIMIC-EchoQA benchmark, outperforming general-purpose and biomedical video vision-language models, and demonstrates ability to resolve challenging borderline cases using quantitative measurements and physiologic context.

Conclusion: Echo-CoPilot provides a unified, clinically coherent assessment of echocardiography studies by orchestrating specialized tools, showing promise for enhancing clinical decision-making in cardiovascular care.

Abstract: Echocardiography is central to contemporary cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task that is still performed manually. While recent foundation models for echocardiography can achieve strong performance on individual perceptual subtasks such as view classification, segmentation, or disease prediction, they typically operate in isolation and do not provide a unified, clinically coherent assessment. In this work, we introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates their outputs into guideline-aware answers and narrative summaries. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models. Qualitative analyses further show that the agent leverages quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity. The code will be released upon acceptance of the paper.

[251] Evaluating Small Language Models for Agentic On-Farm Decision Support Systems

Enhong Liu, Haiyu Yang, Miel Hostens

Main category: cs.AI

TL;DR: Benchmarking 20 open-source Small Language Models for dairy farming decision support under farm-realistic computing constraints, with Qwen-4B showing best performance across most tasks.

DetailsMotivation: LLMs have potential for dairy farming decision support but cloud-based access is impractical; need lightweight alternatives that can run locally on farm hardware for privacy and computational efficiency.

Method: Developed agentic AI system with 5 task-specific agents (literature search, web search, SQL/NoSQL database interaction, graph generation). Evaluated 20 SLMs in two phases: initial screening with 5 questions, then comprehensive evaluation with 30 questions across task categories.

Result: Qwen-4B achieved superior performance across most task categories, though showed unstable effectiveness in NoSQL database interactions via PySpark. First work evaluating SLM feasibility for dairy farming decision-making with privacy and computational efficiency focus.

Conclusion: SLMs show promise for practical deployment in dairy farming, but challenges remain and fine-tuning is needed to refine performance on dairy-specific questions.

Abstract: Large Language Models (LLM) hold potential to support dairy scholars and farmers by supporting decision-making and broadening access to knowledge for stakeholders with limited technical expertise. However, the substantial computational demand restricts access to LLM almost exclusively through cloud-based service, which makes LLM-based decision support tools impractical for dairy farming. To address this gap, lightweight alternatives capable of running locally on farm hardware are required. In this work, we benchmarked 20 open-source Small Language Models (SLM) available on HuggingFace under farm-realistic computing constraints. Building on our prior work, we developed an agentic AI system that integrates five task-specific agents: literature search, web search, SQL database interaction, NoSQL database interaction, and graph generation following predictive models. Evaluation was conducted in two phases. In the first phase, five test questions were used for the initial screening to identify models capable of following basic dairy-related instructions and performing reliably in a compute-constrained environment. Models that passed this preliminary stage were then evaluated using 30 questions (five per task category mentioned above, plus one category addressing integrity and misconduct) in phase two. In results, Qwen-4B achieved superior performance across most task categories, although it showed unstable effectiveness in NoSQL database interactions through PySpark. To our knowledge, this is the first work explicitly evaluating the feasibility of SLM as engines for dairy farming decision-making, with central emphases on privacy and computational efficiency. While results highlight the promise of SLM-assisted tools for practical deployment in dairy farming, challenges remain, and fine-tuning is still needed to refine SLM performance in dairy-specific questions.

[252] Intention Chain-of-Thought Prompting with Dynamic Routing for Code Generation

Shen Li, Li Huang, Shaoxiong Zhan, Weifeng Sun, Tao Yin, Zhongxin Liu, Meng Yan

Main category: cs.AI

TL;DR: RoutingGen is a difficulty-aware routing framework that dynamically switches between few-shot prompting for simple code tasks and Intention Chain-of-Thought (ICoT) for complex ones, achieving SOTA performance while reducing token usage by 46.37%.

DetailsMotivation: Existing CoT prompting methods have two limitations: 1) uniform application causes overthinking on simple tasks, and 2) they lack intention abstraction in code generation, failing to model core algorithmic design and efficiency. Inspired by cognitive economy principles, the authors aim to conserve cognitive resources by applying structured reasoning only when necessary.

Method: RoutingGen framework with two components: 1) Difficulty-aware routing that classifies tasks as simple or complex, 2) Intention Chain-of-Thought (ICoT) for complex tasks that guides models to capture task intention including core algorithmic logic and time complexity. Simple tasks use few-shot prompting.
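
A hedged sketch of the routing idea: a difficulty check selects either a plain few-shot prompt or an intention-oriented prompt. The length heuristic and prompt templates below are stand-ins; the paper trains a dedicated difficulty-aware router and uses its own ICoT wording.

```python
FEW_SHOT_TEMPLATE = "Here are solved examples:\n{examples}\n\nNow solve this task:\n{task}"
ICOT_TEMPLATE = (
    "Before writing code, state the task intention: the core algorithmic idea and its "
    "expected time complexity. Then implement the solution.\n\nTask:\n{task}"
)

def is_complex(task: str) -> bool:
    """Placeholder difficulty router (length heuristic); RoutingGen trains a proper classifier."""
    return len(task.split()) > 80 or "optimal" in task.lower()

def build_prompt(task: str, examples: str) -> str:
    if is_complex(task):
        return ICOT_TEMPLATE.format(task=task)                      # structured intention reasoning
    return FEW_SHOT_TEMPLATE.format(examples=examples, task=task)   # cheap path for easy tasks

print(build_prompt("Reverse a string.", examples="..."))
```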

Result: Experiments across three models and six standard code generation benchmarks show RoutingGen achieves state-of-the-art performance in most settings while reducing total token usage by 46.37% on average. ICoT outperforms six existing prompting baselines on challenging benchmarks.

Conclusion: RoutingGen effectively addresses limitations of uniform CoT application by dynamically adapting prompting strategies based on task difficulty, balancing performance and efficiency. The ICoT approach successfully captures task intention in code generation, leading to improved performance on complex tasks.

Abstract: Large language models (LLMs) exhibit strong generative capabilities and have shown great potential in code generation. Existing chain-of-thought (CoT) prompting methods enhance model reasoning by eliciting intermediate steps, but suffer from two major limitations: First, their uniform application tends to induce overthinking on simple tasks. Second, they lack intention abstraction in code generation, such as explicitly modeling core algorithmic design and efficiency, leading models to focus on surface-level structures while neglecting the global problem objective. Inspired by the cognitive economy principle of engaging structured reasoning only when necessary to conserve cognitive resources, we propose RoutingGen, a novel difficulty-aware routing framework that dynamically adapts prompting strategies for code generation. For simple tasks, it adopts few-shot prompting; for more complex ones, it invokes a structured reasoning strategy, termed Intention Chain-of-Thought (ICoT), which we introduce to guide the model in capturing task intention, such as the core algorithmic logic and its time complexity. Experiments across three models and six standard code generation benchmarks show that RoutingGen achieves state-of-the-art performance in most settings, while reducing total token usage by 46.37% on average across settings. Furthermore, ICoT outperforms six existing prompting baselines on challenging benchmarks.

[253] OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu

Main category: cs.AI

TL;DR: OpenDataArena (ODA) is an open platform for benchmarking post-training data quality, featuring unified pipelines, multi-dimensional scoring, data lineage visualization, and open-source tools to enable principled data-centric AI research.

DetailsMotivation: Current LLM development suffers from a "black box" problem where post-training data composition is opaque, hindering reproducibility and obscuring causal links between data characteristics and model behaviors.

Method: ODA establishes a four-pillar ecosystem: 1) unified training-evaluation pipeline for fair comparisons, 2) multi-dimensional scoring framework across tens of quality axes, 3) interactive data lineage explorer for visualizing dataset genealogy, and 4) fully open-source toolkit for training, evaluation, and scoring.

Result: Experiments on ODA (120+ datasets across multiple domains, 22 benchmarks, 600+ training runs, 40M+ processed data points) reveal trade-offs between data complexity and task performance, identify redundancy in popular benchmarks through lineage tracing, and map genealogical relationships across datasets.

Conclusion: ODA enables a shift from trial-and-error data curation to principled Data-Centric AI, democratizing access to high-quality data evaluation and paving the way for rigorous studies on data mixing laws and strategic foundation model composition.

Abstract: The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box–characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA–covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points–reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.

[254] RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees

Junjie Ma, Jinlong Li

Main category: cs.AI

TL;DR: RADAR is a speculative sampling method that uses RL-based dynamic draft trees to accelerate LLM inference by making real-time decisions on draft model calls instead of using preset hyperparameters.

DetailsMotivation: Current speculative sampling methods use a fixed number of draft model calls, which lacks flexibility and leads to redundant computations. There's a need for more efficient generation and utilization of candidate tokens to further accelerate LLM inference.

Method: RADAR formulates draft tree generation as a Markov Decision Process (MDP) and uses offline reinforcement learning to train a prediction model that makes real-time decisions about when to call the draft model, creating dynamic draft trees instead of fixed ones.
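
The sketch below illustrates the decision loop in which a learned policy decides, step by step, whether to keep calling the draft model; the draft model, feature set, and stopping rule are random or heuristic stand-ins for the offline-RL predictor, and target-model verification is omitted.

```python
import random

def draft_model(prefix):
    """Stand-in draft model: returns a next token id and its probability."""
    return random.randint(0, 31999), random.random()

def stop_policy(features) -> bool:
    """Stand-in for the offline-RL predictor: stop drafting when confidence is low."""
    return features["token_prob"] < 0.4

def dynamic_draft(prefix, max_draft_steps=8):
    """Grow the draft until the policy says to stop, then hand off to target-model verification."""
    drafted = []
    for _ in range(max_draft_steps):
        token, prob = draft_model(prefix + drafted)
        drafted.append(token)
        if stop_policy({"step": len(drafted), "token_prob": prob}):
            break
    return drafted

print(dynamic_draft(prefix=[1, 2, 3]))
```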

Result: RADAR achieves 3.17x-4.82x speedup over auto-regressive decoding baseline across three LLMs and four tasks, demonstrating significant acceleration while reducing redundant computations.

Conclusion: RADAR’s RL-based dynamic draft tree approach effectively accelerates LLM inference by making intelligent, real-time decisions about draft model calls, outperforming traditional speculative sampling methods with fixed hyperparameters.

Abstract: Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem; however, the number of calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, lacking flexibility. To generate and utilize the candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model, which enables real-time decisions on calls to the draft model, reducing redundant computations and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x-4.82x over the auto-regressive decoding baseline. The code is available at https://github.com/minaduki-sora/RADAR.

[255] HydroGEM: A Self Supervised Zero Shot Hybrid TCN Transformer Foundation Model for Continental Scale Streamflow Quality Control

Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude

Main category: cs.AI

TL;DR: HydroGEM is a foundation model for automated streamflow data quality control that uses self-supervised pretraining on millions of hydrological sequences and fine-tuning with synthetic anomalies, achieving state-of-the-art detection and reconstruction performance with cross-national generalization.

DetailsMotivation: Maintaining data quality across thousands of remote streamflow sensors is labor-intensive despite millions of annual observations. Current methods struggle with the scale and complexity of continental-scale monitoring networks.

Method: Two-stage training: 1) self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations to learn hydrological representations, 2) fine-tuning with synthetic anomalies for detection and reconstruction. Uses hybrid TCN-Transformer architecture (14.2M parameters) with hierarchical normalization to handle discharge magnitude variations.
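
The summary does not spell out the hierarchical normalization, so the following is one plausible two-level scheme (log transform, station-level statistics, then a per-window z-score) for handling the six-orders-of-magnitude discharge range; the actual design may differ.

```python
import numpy as np

def hierarchical_normalize(discharge, station_log_mean, station_log_std, eps=1e-6):
    """Two-level normalization sketch: a log transform tames the multi-order-of-magnitude
    spread across stations, station statistics center each series, and a per-window
    z-score handles local variability."""
    x = np.log10(np.asarray(discharge, dtype=float) + eps)
    x = (x - station_log_mean) / (station_log_std + eps)   # station level
    return (x - x.mean()) / (x.std() + eps)                # window level

window = hierarchical_normalize([12.0, 15.5, 900.0, 14.2], station_log_mean=1.2, station_log_std=0.6)
print(window)
```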

Result: Achieves F1 = 0.792 for detection and 68.7% reconstruction-error reduction (36.3% improvement over existing methods) on synthetic tests. Zero-shot transfer to 100 Canadian stations yields F1 = 0.586, demonstrating cross-national generalization. Maintains consistent detection across correction magnitudes and aligns with seasonal patterns.

Conclusion: HydroGEM provides an effective foundation model for streamflow quality control that generalizes across monitoring networks and national boundaries. Designed for human-in-the-loop workflows where outputs are quality control suggestions requiring expert review, not autonomous corrections.

Abstract: Real-time streamflow monitoring networks generate millions of observations annually, yet maintaining data quality across thousands of remote sensors remains labor-intensive. We introduce HydroGEM (Hydrological Generalizable Encoder for Monitoring), a foundation model for continental-scale streamflow quality control. HydroGEM uses two-stage training: self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations learns hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures local temporal patterns and long-range dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out synthetic tests comprising 799 stations with 18 expert-validated anomaly types, HydroGEM achieves F1 = 0.792 for detection and 68.7% reconstruction-error reduction, a 36.3% improvement over existing methods. Zero-shot transfer to 100 Environment and Climate Change Canada stations yields F1 = 0.586, exceeding all baselines and demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns. HydroGEM is designed for human-in-the-loop workflows - outputs are quality control suggestions requiring expert review, not autonomous corrections.

[256] Optimizing Multi-Tier Supply Chain Ordering with a Hybrid Liquid Neural Network and Extreme Gradient Boosting Model

Chunan Tong

Main category: cs.AI

TL;DR: Proposes hybrid Liquid Neural Network (LNN) + XGBoost model for supply chain management to address bullwhip effect and demand fluctuations, combining LNN’s dynamic feature extraction with XGBoost’s global optimization.

DetailsMotivation: Supply chain management faces challenges with demand fluctuations and bullwhip effect. Traditional methods and LLMs struggle with complex continuous time-series data. ML approaches like LSTM and XGBoost have computational inefficiency limitations. Liquid Neural Networks (LNN) have proven effective in robotics but remain untapped in SCM.

Method: Hybrid LNN+XGBoost model for multi-tier supply chains. Combines LNN’s dynamic feature extraction capabilities with XGBoost’s global optimization strengths to create an efficient and adaptable SCM solution.
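
As a loose illustration of the hybrid pipeline, the sketch below replaces the Liquid Neural Network with a trivial decaying-state encoder (the summary gives no architecture details) and feeds the resulting features to a standard XGBoost regressor on synthetic data; everything here is a stand-in, not the paper's model.

```python
import numpy as np
from xgboost import XGBRegressor

def decaying_state_features(series, decay=0.7):
    """Stand-in for the LNN encoder: an exponentially decaying state plus summary statistics."""
    state = 0.0
    for d in series:
        state = decay * state + (1 - decay) * d
    return np.array([state, np.mean(series), np.std(series)])

# Synthetic demand histories and next-period order targets, for illustration only.
rng = np.random.default_rng(0)
X = np.stack([decaying_state_features(rng.poisson(20, 12)) for _ in range(200)])
y = X[:, 0] + rng.normal(0, 1, 200)

model = XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y)
print(model.predict(X[:3]))
```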

Result: The proposed model aims to minimize the bullwhip effect and increase profitability in supply chain operations by addressing the efficiency and adaptability gaps in current SCM approaches.

Conclusion: This innovative hybrid approach fills a critical gap in intelligent supply chain management by leveraging LNN’s robotics-proven adaptability with XGBoost’s optimization capabilities, offering a promising solution for complex SCM challenges.

Abstract: Supply chain management (SCM) faces significant challenges like demand fluctuations and the bullwhip effect. Traditional methods and even state-of-the-art LLMs struggle with benchmarks like the Vending Machine Test, failing to handle SCM’s complex continuous time-series data. While ML approaches like LSTM and XGBoost offer solutions, they are often limited by computational inefficiency. Liquid Neural Networks (LNN), known for their adaptability and efficiency in robotics, remain untapped in SCM. This study proposes a hybrid LNN+XGBoost model for multi-tier supply chains. By combining LNN’s dynamic feature extraction with XGBoost’s global optimization, the model aims to minimize the bullwhip effect and increase profitability. This innovative approach addresses the need for efficiency and adaptability, filling a critical gap in intelligent SCM.

[257] Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen

Main category: cs.AI

TL;DR: Ophiuchus is a tool-augmented MLLM framework that enables medical AI to dynamically focus on fine-grained visual regions through tool-integrated reasoning, outperforming SOTA methods on medical benchmarks.

DetailsMotivation: Current medical MLLMs struggle with complex tasks requiring dynamic, iterative focusing on fine-grained visual regions for precise grounding and diagnosis. They lack the ability to decide when and where to probe medical images and integrate that information into reasoning chains.

Method: Three-stage training strategy: 1) Cold-start training with tool-integrated reasoning data for basic tool selection and adaptation; 2) Self-reflection fine-tuning to strengthen reflective reasoning and tool output revisiting; 3) Agentic Tool Reinforcement Learning to optimize task-specific rewards and emulate expert diagnostic behavior.

Result: Ophiuchus consistently outperforms both closed-source and open-source state-of-the-art methods across diverse medical benchmarks including VQA, detection, and reasoning-based segmentation.

Conclusion: The framework demonstrates a path toward medical AI agents that can genuinely “think with images” through tool-integrated reasoning, with potential for public release of datasets, codes, and trained models.

Abstract: Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model’s inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely “think with images” through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.

[258] Georeferencing complex relative locality descriptions with large language models

Aneesha Fernando, Surangika Ranathunga, Kristin Stock, Raj Prasanna, Christopher B. Jones

Main category: cs.AI

TL;DR: LLMs outperform traditional methods for georeferencing complex biodiversity locality descriptions, achieving 65% accuracy within 10km across datasets.

DetailsMotivation: Traditional georeferencing methods (gazetteer-based or language modeling) struggle with relative spatial descriptions common in biodiversity records, creating demand for automated solutions to handle complex locality narratives.

Method: Identified effective prompting patterns, then fine-tuned an LLM using Quantized Low-Rank Adaptation (QLoRA) on multi-region, multi-language biodiversity datasets.
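
A minimal QLoRA setup sketch with Hugging Face transformers, peft, and bitsandbytes, roughly matching the fine-tuning recipe named in the method; the base model, LoRA rank, target modules, and the example input-output pair are illustrative assumptions, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # typical attention projections; the paper's choice may differ
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Training pairs would map a relative locality description to coordinates, e.g. (hypothetical):
# "5 km NW of the lake outlet, along the main highway" -> "-38.65, 176.04"
```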

Result: Outperformed existing baselines with average 65% of records within 10km radius; best results: New York state - 85% within 10km and 67% within 1km.

Conclusion: LLMs show strong potential for georeferencing complex locality descriptions, particularly effective for lengthy, intricate biodiversity narratives.

Abstract: Georeferencing text documents has typically relied on either gazetteer-based methods to assign geographic coordinates to place names, or on language modelling approaches that associate textual terms with geographic locations. However, many location descriptions specify positions relatively with spatial relationships, making geocoding based solely on place names or geo-indicative words inaccurate. This issue frequently arises in biological specimen collection records, where locations are often described through narratives rather than coordinates if they pre-date GPS. Accurate georeferencing is vital for biodiversity studies, yet the process remains labour-intensive, leading to a demand for automated georeferencing solutions. This paper explores the potential of Large Language Models (LLMs) to georeference complex locality descriptions automatically, focusing on the biodiversity collections domain. We first identified effective prompting patterns, then fine-tuned an LLM using Quantized Low-Rank Adaptation (QLoRA) on biodiversity datasets from multiple regions and languages. Our approach outperforms existing baselines with an average, across datasets, of 65% of records within a 10 km radius, for a fixed amount of training data. The best results (New York state) were 85% within 10km and 67% within 1km. The selected LLM performs well for lengthy, complex descriptions, highlighting its potential for georeferencing intricate locality descriptions.

[259] Gödel’s Poetry

Kelly J. Davis

Main category: cs.AI

TL;DR: A new automated theorem proving system using specialized language models for Lean4 proof generation with recursive decomposition of difficult theorems, achieving 90.4% pass rate on miniF2F benchmark.

DetailsMotivation: Formal, automated theorem proving has long been viewed as a challenge to artificial intelligence, requiring new approaches to tackle difficult mathematical proofs.

Method: Combines specialized language models for Lean4 proof generation with recursive decomposition of difficult theorems into simpler propositions, coordinated through a multi-agent architecture that handles autoformalization, proof generation, decomposition, and recursive proof.
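
A schematic of the prove-or-decompose recursion described above; the prover, decomposer, and verifier are passed in as callables because the real system generates Lean4 proofs and checks them through the Kimina Lean Server, which is not reproduced here.

```python
def prove(theorem: str, prover, decomposer, verifier, depth: int = 0, max_depth: int = 3):
    """Try a direct proof; on failure, split the theorem into simpler lemmas and recurse.
    `prover` drafts a Lean4 proof, `verifier` checks it (e.g., via a Lean server), and
    `decomposer` proposes propositions that together entail the original theorem."""
    candidate = prover(theorem)
    if verifier(theorem, candidate):
        return candidate
    if depth >= max_depth:
        return None
    lemmas = decomposer(theorem)
    sub_proofs = [prove(l, prover, decomposer, verifier, depth + 1, max_depth) for l in lemmas]
    if lemmas and all(p is not None for p in sub_proofs):
        return {"goal": theorem, "lemmas": dict(zip(lemmas, sub_proofs))}
    return None
```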

Result: Achieves 90.4% pass rate on miniF2F benchmark without decomposition, with significant improvement when decomposition is added. Key technical contribution includes extending Kimina Lean Server with AST parsing for automated recursive proof decomposition.

Conclusion: The system (goedels-poetry) provides an effective approach to automated theorem proving through language models and recursive decomposition, with open-source implementation available for adaptation and extension.

Abstract: Formal, automated theorem proving has long been viewed as a challenge to artificial intelligence. We introduce here a new approach to computer theorem proving, one that employs specialized language models for Lean4 proof generation combined with recursive decomposition of difficult theorems into simpler entailing propositions. These models are coordinated through a multi-agent architecture that orchestrates autoformalization (if required), proof generation, decomposition of difficult theorems into simpler entailing propositions, and recursive proof (and/or decomposition) of these propositions. Without decomposition, we achieve a 90.4% pass rate on miniF2F. With decomposition, this is significantly improved. A key technical contribution lies in our extension of the Kimina Lean Server with abstract syntax tree (AST) parsing capabilities to facilitate automated, recursive proof decomposition. The system is made available on PyPI as goedels-poetry (at https://pypi.org/project/goedels-poetry ), and the open-source implementation KellyJDavis/goedels-poetry (at https://github.com/KellyJDavis/goedels-poetry ) facilitates both adaptation to alternative language models and extension with custom functionality.

[260] Leveraging LLMs for Collaborative Ontology Engineering in Parkinson Disease Monitoring and Alerting

Georgios Bouchouras, Dimitrios Doumanas, Andreas Soularidis, Konstantinos Kotis, George A. Vouros

Main category: cs.AI

TL;DR: LLMs can generate basic Parkinson’s Disease monitoring ontologies but require human collaboration for comprehensive, accurate results. Hybrid approaches (X-HCOME, SimX-HCOME+) combining human expertise with LLM capabilities produce the best outcomes.

DetailsMotivation: To determine if LLMs alone can create comprehensive ontologies for complex domains like Parkinson's Disease monitoring, and whether human-LLM collaboration can achieve better results than either approach alone.

Method: Four methodologies: One Shot (OS) prompts, Chain of Thought (CoT) prompts, X-HCOME (hybrid human-LLM approach), and SimX-HCOME+ (continuous human supervision with iterative refinement).

Result: LLMs alone (OS/CoT) can generate basic ontologies but lack comprehensiveness and require human refinement. X-HCOME produced ontologies very similar to expert-built ones. SimX-HCOME+ with continuous human supervision created the most comprehensive and accurate ontologies.

Conclusion: Human-LLM collaboration significantly advances ontology engineering, especially in complex domains like Parkinson’s Disease. Future research should focus on developing specialized GPT models for ontology construction.

Abstract: This paper explores the integration of Large Language Models (LLMs) in the engineering of a Parkinson’s Disease (PD) monitoring and alerting ontology through four key methodologies: One Shot (OS) prompt techniques, Chain of Thought (CoT) prompts, X-HCOME, and SimX-HCOME+. The primary objective is to determine whether LLMs alone can create comprehensive ontologies and, if not, whether human-LLM collaboration can achieve this goal. Consequently, the paper assesses the effectiveness of LLMs in automated ontology development and the enhancement achieved through human-LLM collaboration. Initial ontology generation was performed using One Shot (OS) and Chain of Thought (CoT) prompts, demonstrating the capability of LLMs to autonomously construct ontologies for PD monitoring and alerting. However, these outputs were not comprehensive and required substantial human refinement to enhance their completeness and accuracy. X-HCOME, a hybrid ontology engineering approach that combines human expertise with LLM capabilities, showed significant improvements in ontology comprehensiveness. This methodology resulted in ontologies that are very similar to those constructed by experts. Further experimentation with SimX-HCOME+, another hybrid methodology emphasizing continuous human supervision and iterative refinement, highlighted the importance of ongoing human involvement. This approach led to the creation of more comprehensive and accurate ontologies. Overall, the paper underscores the potential of human-LLM collaboration in advancing ontology engineering, particularly in complex domains like PD. The results suggest promising directions for future research, including the development of specialized GPT models for ontology construction.

[261] TiCard: Deployable EXPLAIN-only Residual Learning for Cardinality Estimation

Qizhi Wang

Main category: cs.AI

TL;DR: TiCard is a low-intrusion correction framework that improves database cardinality estimation by learning multiplicative residual corrections using only EXPLAIN features, requiring minimal integration and no invasive optimizer changes.

DetailsMotivation: Cardinality estimation is critical for query optimization but existing approaches have limitations: classical estimators miss correlations, while learned estimators require complex training pipelines and invasive optimizer integration. There's a need for deployable improvements that don't require replacing native estimators.

Method: TiCard augments rather than replaces native estimators by learning multiplicative residual corrections using only EXPLAIN features. It uses EXPLAIN ANALYZE for offline labels only. Two instantiations are studied: (1) Gradient Boosting Regressor for sub-millisecond inference, and (2) TabPFN, an in-context tabular foundation model that adapts by refreshing a small reference set without gradient retraining.
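
A sketch of the multiplicative residual correction using scikit-learn's GradientBoostingRegressor: the model learns log(actual/estimated) row counts from EXPLAIN-derived features (labels come from offline EXPLAIN ANALYZE), and the correction multiplies the native estimate at inference. The feature columns and synthetic training data are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Offline training: features come from EXPLAIN plans; labels use EXPLAIN ANALYZE row counts.
# The six feature columns here (operator type, estimated rows, join depth, ...) are illustrative.
rng = np.random.default_rng(0)
X_train = rng.random((157, 6))
native_est = rng.uniform(1, 1e5, 157)
actual_rows = native_est * rng.lognormal(0.0, 1.5, 157)
y_train = np.log(actual_rows / native_est)          # multiplicative residual in log space

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)

def corrected_cardinality(explain_features, native_estimate):
    """Apply the learned multiplicative correction to the optimizer's native estimate."""
    log_ratio = model.predict(np.asarray(explain_features).reshape(1, -1))[0]
    return native_estimate * np.exp(log_ratio)

print(corrected_cardinality(X_train[0], native_est[0]))
```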

Result: On TiDB with TPCH and Join Order Benchmark, in a low-trace setting (263 executions total; 157 used for learning), TiCard significantly improves operator-level tail accuracy: P90 Q-error drops from 312.85 (native) to 13.69 (TiCard-GBR), and P99 drops from 37,974.37 to 3,416.50 (TiCard-TabPFN). A join-only policy preserves near-perfect median behavior.

Conclusion: TiCard provides a deployable AI4DB building block with explicit scope, conservative integration policies, and a roadmap from offline correction to in-optimizer use, offering substantial accuracy improvements with minimal intrusion into existing database systems.

Abstract: Cardinality estimation is a key bottleneck for cost-based query optimization, yet deployable improvements remain difficult: classical estimators miss correlations, while learned estimators often require workload-specific training pipelines and invasive integration into the optimizer. This paper presents TiCard, a low intrusion, correction-based framework that augments (rather than replaces) a database’s native estimator. TiCard learns multiplicative residual corrections using EXPLAIN-only features, and uses EXPLAIN ANALYZE only for offline labels. We study two practical instantiations: (i) a Gradient Boosting Regressor for sub-millisecond inference, and (ii) TabPFN, an in-context tabular foundation model that adapts by refreshing a small reference set without gradient retraining. On TiDB with TPCH and the Join Order Benchmark, in a low-trace setting (263 executions total; 157 used for learning), TiCard improves operator-level tail accuracy substantially: P90 Q-error drops from 312.85 (native) to 13.69 (TiCard-GBR), and P99 drops from 37,974.37 to 3,416.50 (TiCard-TabPFN), while a join-only policy preserves near-perfect median behavior. We position TiCard as an AI4DB building block focused on deployability: explicit scope, conservative integration policies, and an integration roadmap from offline correction to in-optimizer use.

[262] Massive Editing for Large Language Models Based on Dynamic Weight Generation

Wentao Wan, Qiqing Lao, Zhiwei Xie, Hefeng Wu, Runnan Lin, Liang Lin, Keze Wang

Main category: cs.AI

TL;DR: MeG proposes a massive knowledge editing approach for LLMs using dynamic weight neurons generated by diffusion models to enable large-scale edits while maintaining reliability, generality, and locality.

DetailsMotivation: Current knowledge editing methods struggle with large-scale modifications of LLMs while maintaining key metrics like Reliability, Generality, and Locality. There's a need for efficient methods that can perform massive edits without expensive retraining.

Method: MeG attaches dynamic weight neurons to specific LLM layers and uses a diffusion model to conditionally generate neuron weights based on input queries. This allows large-scale knowledge editing by adding just a single dynamic weight neuron.

Result: MeG significantly outperforms existing knowledge editing methods on Reliability, Generality, and Locality metrics, with particularly large improvements in Locality (high percentage point increase in absolute value index).

Conclusion: The MeG approach enables effective large-scale knowledge editing in LLMs through dynamic weight generation, offering advantages over existing methods in maintaining edit quality metrics.

Abstract: Knowledge Editing (KE) is a field that studies how to modify some knowledge in Large Language Models (LLMs) at a low cost (compared to pre-training). Currently, performing large-scale edits on LLMs while ensuring the Reliability, Generality, and Locality metrics of the edits remains a challenge. This paper proposes a Massive editing approach for LLMs based on dynamic weight Generation (MeG). Our MeG involves attaching a dynamic weight neuron to specific layers of the LLMs and using a diffusion model to conditionally generate the weights of this neuron based on the input query required for the knowledge. This allows the use of adding a single dynamic weight neuron to achieve the goal of large-scale knowledge editing. Experiments show that our MeG can significantly improve the performance of large-scale KE in terms of Reliability, Generality, and Locality metrics compared to existing knowledge editing methods, particularly with a high percentage point increase in the absolute value index for the Locality metric, demonstrating the advantages of our proposed method.

[263] PortAgent: LLM-driven Vehicle Dispatching Agent for Port Terminals

Jia Hu, Junqi Li, Weimeng Lin, Peng Jia, Yuxiong Ji, Jintao Lai

Main category: cs.AI

TL;DR: PortAgent: An LLM-driven vehicle dispatching agent that automates VDS transfer across container terminals without needing port specialists, minimal data, and fast deployment.

DetailsMotivation: Vehicle Dispatching Systems (VDSs) have low transferability across different Automated Container Terminals due to high dependency on port specialists, need for terminal-specific data, and time-consuming manual deployment processes.

Method: PortAgent uses a Virtual Expert Team (VET) with four LLM-powered experts: Knowledge Retriever, Modeler, Coder, and Debugger. They learn VDS-domain knowledge via few-shot example learning with RAG mechanism, and work through an automatic VDS design workflow with self-correction loops inspired by LLM Reflexion framework.

Result: The proposed system eliminates specialist dependency, reduces data requirements, and enables fast deployment of VDS across different terminals.

Conclusion: PortAgent demonstrates that LLMs can automate the VDS transferring workflow, addressing the key limitations that hinder widespread commercialization of VDSs across diverse container terminals.

Abstract: Vehicle Dispatching Systems (VDSs) are critical to the operational efficiency of Automated Container Terminals (ACTs). However, their widespread commercialization is hindered due to their low transferability across diverse terminals. This transferability challenge stems from three limitations: high reliance on port operational specialists, a high demand for terminal-specific data, and time-consuming manual deployment processes. Leveraging the emergence of Large Language Models (LLMs), this paper proposes PortAgent, an LLM-driven vehicle dispatching agent that fully automates the VDS transferring workflow. It bears three features: (1) no need for port operations specialists; (2) low need of data; and (3) fast deployment. Specifically, specialist dependency is eliminated by the Virtual Expert Team (VET). The VET collaborates with four virtual experts, including a Knowledge Retriever, Modeler, Coder, and Debugger, to emulate a human expert team for the VDS transferring workflow. These experts specialize in the domain of terminal VDS via a few-shot example learning approach. Through this approach, the experts are able to learn VDS-domain knowledge from a few VDS examples. These examples are retrieved via a Retrieval-Augmented Generation (RAG) mechanism, mitigating the high demand for terminal-specific data. Furthermore, an automatic VDS design workflow is established among these experts to avoid extra manual interventions. In this workflow, a self-correction loop inspired by the LLM Reflexion framework is created

[264] Seismology modeling agent: A smart assistant for geophysical researchers

Yukun Ren, Siwei Yu, Kai Chen, Jianwei Ma

Main category: cs.AI

TL;DR: This paper introduces an intelligent, interactive workflow for SPECFEM seismic simulation software using Large Language Models and Model Context Protocol servers to simplify the traditionally complex workflow through conversational interfaces.

DetailsMotivation: The traditional SPECFEM workflow has a steep learning curve and relies on complex manual file editing and command-line operations, creating barriers for researchers in computational seismology.

Method: Developed the first Model Context Protocol (MCP) server suite for SPECFEM that decomposes the simulation process into discrete, agent-executable tools covering parameter generation, mesh partitioning, solver execution, and visualization. This enables intent-driven conversational interactions instead of file-driven operations.

Result: Validated through multiple case studies, the workflow operates seamlessly in both autonomous and interactive modes, producing high-fidelity results consistent with standard baselines while significantly reducing tedious low-level operations.

Conclusion: This first application of MCP technology to computational seismology lowers entry barriers, enhances reproducibility, and offers a promising avenue for advancing computational geophysics toward AI-assisted and automated scientific research.

Abstract: To address the steep learning curve and reliance on complex manual file editing and command-line operations in the traditional workflow of the mainstream open-source seismic wave simulation software SPECFEM, this paper proposes an intelligent, interactive workflow powered by Large Language Models (LLMs). We introduce the first Model Context Protocol (MCP) server suite for SPECFEM (supporting 2D, 3D Cartesian, and 3D Globe versions), which decomposes the entire simulation process into discrete, agent-executable tools spanning from parameter generation and mesh partitioning to solver execution and visualization. This approach enables a paradigm shift from file-driven to intent-driven conversational interactions. The framework supports both fully automated execution and human-in-the-loop collaboration, allowing researchers to guide simulation strategies in real time and retain scientific decision-making authority while significantly reducing tedious low-level operations. Validated through multiple case studies, the workflow operates seamlessly in both autonomous and interactive modes, yielding high-fidelity results consistent with standard baselines. As the first application of MCP technology to computational seismology, this study significantly lowers the entry barrier, enhances reproducibility, and offers a promising avenue for advancing computational geophysics toward AI-assisted and automated scientific research. The complete source code is available at https://github.com/RenYukun1563/specfem-mcp.

[265] Context-Picker: Dynamic context selection using multi-stage reinforcement learning

Siyuan Zhu, Chengdong Xu, Kaiqiang Ke, Chao Yu

Main category: cs.AI

TL;DR: Context-Picker is a reasoning-aware framework for long-context QA that selects minimal sufficient evidence sets via two-stage reinforcement learning, outperforming traditional retrieval methods.

DetailsMotivation: Traditional approaches in long-context QA struggle with determining optimal context length - too little context misses critical information, while too much introduces noise. This is especially problematic for factoid questions that need only a few specific evidence pieces.

Method: Context-Picker shifts from similarity-based ranking to minimal sufficient subset selection using a two-stage RL schedule: recall-oriented stage for reasoning chain coverage, followed by precision-oriented stage for redundancy pruning. Uses offline evidence distillation via Leave-One-Out procedure to address reward sparsity.
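
To make the Leave-One-Out curation step concrete, here is a minimal Python sketch of greedy LOO pruning; the `is_sufficient` callback (for example, an LLM-based answer checker) and the greedy restart strategy are assumptions, not the authors' exact procedure.

```python
from typing import Callable, List

def minimal_sufficient_set(
    passages: List[str],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Greedy Leave-One-Out pruning: repeatedly try removing each passage and
    keep the removal whenever the remaining set still yields a correct answer
    according to `is_sufficient` (a stand-in for an answer-correctness check)."""
    kept = list(passages)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            if candidate and is_sufficient(candidate):
                kept = candidate
                changed = True
                break  # restart the scan over the smaller set
    return kept

# Toy usage: the "answer" only needs passages mentioning 'Paris'.
docs = ["Paris is the capital of France.", "Bananas are yellow.", "The Seine flows through Paris."]
print(minimal_sufficient_set(docs, lambda ps: any("Paris" in p for p in ps)))
```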

Result: Outperforms strong RAG baselines on five long-context and multi-hop QA benchmarks, achieving superior answer accuracy with comparable or reduced context lengths.

Conclusion: The coarse-to-fine optimization schedule, redundancy-aware reward shaping, and rationale-guided format all contribute significantly to performance gains, demonstrating the effectiveness of treating context selection as a decision-making process.

Abstract: In long-context question answering (LCQA), determining the optimal amount of context for a given query is a significant challenge. Including too few passages may omit critical information, while including too many can introduce noise and reduce the quality of the answer. Traditional approaches, such as fixed Top-$K$ retrieval and single-stage reranking, face the dilemma of selecting the right number of passages. This problem is particularly pronounced for factoid questions, which often require only a few specific pieces of evidence. To address this issue, we introduce \emph{Context-Picker}, a reasoning-aware framework that shifts the paradigm from similarity-based ranking to minimal sufficient subset selection. Context-Picker treats context selection as a decision-making process optimized via a human-inspired, two-stage reinforcement learning schedule: a \emph{recall-oriented} stage that prioritizes the coverage of reasoning chains, followed by a \emph{precision-oriented} stage that aggressively prunes redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines “minimal sufficient sets” via a Leave-One-Out (LOO) procedure, providing dense, task-aligned supervision. Experiments on five long-context and multi-hop QA benchmarks demonstrate that Context-Picker significantly outperforms strong RAG baselines, achieving superior answer accuracy with comparable or reduced context lengths. Ablation studies indicate that the coarse-to-fine optimization schedule, the redundancy-aware reward shaping, and the rationale-guided format all contribute substantially to these gains.

[266] Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling

Annu Rana, Gaurav Kumar

Main category: cs.AI

TL;DR: MFR (Model-First Reasoning) improves LLM planning by first building an explicit problem model before generating solutions, reducing constraint violations across multiple domains.

DetailsMotivation: LLMs struggle with complex multi-step planning tasks, showing high constraint violations and inconsistent solutions. Existing approaches like Chain-of-Thought and ReAct lack explicit problem representations, relying on implicit state tracking.

Method: Proposes Model-First Reasoning (MFR), a two-phase paradigm: 1) LLM constructs explicit problem model (entities, state variables, actions, constraints), 2) LLM generates solution plan using this model. Inspired by classical AI planning.
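
A minimal sketch of the two-phase prompting pattern described above; the `llm` callable, the JSON schema, and the prompt wording are illustrative assumptions rather than the paper's prompts.

```python
import json
from typing import Callable

def model_first_reasoning(task: str, llm: Callable[[str], str]) -> dict:
    """Two-phase prompting: (1) elicit an explicit problem model,
    (2) plan against that model. `llm` is any text-in/text-out caller."""
    model_prompt = (
        "Read the task and return ONLY a JSON object with keys "
        "'entities', 'state_variables', 'actions', 'constraints'.\n\n"
        f"Task: {task}"
    )
    problem_model = json.loads(llm(model_prompt))

    plan_prompt = (
        "Using the problem model below, produce a step-by-step plan that "
        "violates none of the listed constraints.\n\n"
        f"Problem model: {json.dumps(problem_model, indent=2)}\n\nTask: {task}"
    )
    return {"model": problem_model, "plan": llm(plan_prompt)}
```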

Result: Across medical scheduling, route planning, resource allocation, logic puzzles, and procedural synthesis domains, MFR reduces constraint violations and improves solution quality compared to Chain-of-Thought and ReAct. Ablation studies confirm explicit modeling phase is critical.

Conclusion: Many LLM planning failures stem from representational deficiencies rather than reasoning limitations. Explicit modeling is key for robust, interpretable AI agents. All materials documented for reproducibility.

Abstract: Large Language Models (LLMs) often struggle with complex multi-step planning tasks, showing high rates of constraint violations and inconsistent solutions. Existing strategies such as Chain-of-Thought and ReAct rely on implicit state tracking and lack an explicit problem representation. Inspired by classical AI planning, we propose Model-First Reasoning (MFR), a two-phase paradigm in which the LLM first constructs an explicit model of the problem, defining entities, state variables, actions, and constraints, before generating a solution plan. Across multiple planning domains, including medical scheduling, route planning, resource allocation, logic puzzles, and procedural synthesis, MFR reduces constraint violations and improves solution quality compared to Chain-of-Thought and ReAct. Ablation studies show that the explicit modeling phase is critical for these gains. Our results suggest that many LLM planning failures stem from representational deficiencies rather than reasoning limitations, highlighting explicit modeling as a key component for robust and interpretable AI agents. All prompts, evaluation procedures, and task datasets are documented to facilitate reproducibility.

[267] Sparse Multi-Modal Transformer with Masking for Alzheimer’s Disease Classification

Cheng-Han Lu, Pei-Hsuan Tsai

Main category: cs.AI

TL;DR: SMMT is a sparse multi-modal transformer that reduces computational costs while maintaining performance for Alzheimer’s Disease classification.

DetailsMotivation: Transformer-based multi-modal systems have high computational and energy costs due to dense self-attention, limiting scalability under resource constraints.

Method: SMMT introduces cluster-based sparse attention for near-linear complexity and modality-wise masking for robustness against incomplete inputs, built on a cascaded multi-modal transformer framework.
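
The modality-wise masking idea can be illustrated with a short sketch; the drop probability, the zero-filling choice, and the feature shapes are assumptions, not the paper's implementation.

```python
import torch

def modality_masking(features: dict, p_drop: float = 0.3) -> dict:
    """Randomly zero out entire modalities during training so the model
    learns to cope with incomplete inputs. Always keeps at least one modality."""
    names = list(features)
    drop = [name for name in names if torch.rand(()) < p_drop]
    if len(drop) == len(names):          # never drop everything
        drop = drop[:-1]
    return {
        name: torch.zeros_like(x) if name in drop else x
        for name, x in features.items()
    }

# Example with imaging-derived and clinical feature vectors (hypothetical shapes).
batch = {"mri": torch.randn(8, 256), "clinical": torch.randn(8, 32)}
masked = modality_masking(batch)
```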

Result: SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared to dense attention baselines.

Conclusion: SMMT is a suitable resource-aware architectural component for scalable intelligent systems, demonstrated through Alzheimer’s Disease classification on ADNI dataset.

Abstract: Transformer-based multi-modal intelligent systems often suffer from high computational and energy costs due to dense self-attention, limiting their scalability under resource constraints. This paper presents SMMT, a sparse multi-modal transformer architecture designed to improve efficiency and robustness. Building upon a cascaded multi-modal transformer framework, SMMT introduces cluster-based sparse attention to achieve near linear computational complexity and modality-wise masking to enhance robustness against incomplete inputs. The architecture is evaluated using Alzheimer’s Disease classification on the ADNI dataset as a representative multi-modal case study. Experimental results show that SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared to dense attention baselines, demonstrating its suitability as a resource-aware architectural component for scalable intelligent systems.

[268] Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence

Shreyas Subramanian, Bala Krishnamoorthy, Pranav Murthy

Main category: cs.AI

TL;DR: GreedyLR is a novel adaptive learning rate scheduler that adjusts LR based on current loss, outperforming standard schedulers like Cosine/Exponential across NLP, CV, and LLM tasks up to 7B parameters.

DetailsMotivation: Most research uses common scheduler choices like Cosine or exponential decay, but there's room for improvement with adaptive schedulers that can respond to training dynamics in real-time.

Method: GreedyLR adaptively adjusts learning rate during training based on current loss. The paper includes theoretical analysis with convergence proof and derivation of optimal scaling factor F that maximizes convergence rate.
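
A minimal loss-driven scheduler sketch in the spirit of the description above; the multiplicative update rule, the factor, and the bounds are assumptions and do not reproduce the paper's exact algorithm or its optimal scaling factor F.

```python
class GreedyLRSketch:
    """Illustrative loss-driven scheduler: grow the learning rate while the
    loss keeps improving, shrink it when the loss gets worse."""

    def __init__(self, optimizer, factor: float = 1.1, min_lr: float = 1e-6, max_lr: float = 1e-2):
        self.optimizer, self.factor = optimizer, factor
        self.min_lr, self.max_lr = min_lr, max_lr
        self.best_loss = float("inf")

    def step(self, loss: float) -> None:
        # Increase the LR on improvement, decrease it otherwise, within bounds.
        scale = self.factor if loss < self.best_loss else 1.0 / self.factor
        self.best_loss = min(self.best_loss, loss)
        for group in self.optimizer.param_groups:
            group["lr"] = min(self.max_lr, max(self.min_lr, group["lr"] * scale))

# Usage: after each optimizer step, call scheduler.step(loss.item()).
```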

Result: Outperforms several state-of-the-art schedulers in accuracy, speed, and convergence across NLP, CV, and LLM tasks (up to 7B parameters) for both fine-tuning and pre-training. Shows robustness to realistic noisy landscapes.

Conclusion: GreedyLR is easy to implement, computationally efficient, and could be considered a good default scheduler for training due to its adaptive nature and strong empirical performance.

Abstract: Despite significant advances in optimizers for training, most research works use common scheduler choices like Cosine or exponential decay. In this paper, we study \emph{GreedyLR}, a novel scheduler that adaptively adjusts the learning rate during training based on the current loss. To validate the effectiveness of our proposed scheduler, we conduct experiments on several NLP, CV, and LLM tasks with up to $7B$ parameters, including both fine-tuning and pre-training experiments. The results show that our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence. We also provide a theoretical analysis of the GreedyLR algorithm, including a proof of convergence and derivation of the optimal scaling factor $F$ that maximizes the convergence rate, along with experiments to show robustness of the algorithm to realistic noisy landscapes. Our scheduler is easy to implement, computationally efficient, and could be considered a good default scheduler for training.

[269] Universal Reasoning Model

Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao, Haoming Luo, Joey Zhou, Bryan Dai

Main category: cs.AI

TL;DR: Universal Transformers’ performance gains on reasoning tasks come from recurrent inductive bias and Transformer’s nonlinear components, not complex architectures. The proposed Universal Reasoning Model (URM) enhances UT with short convolution and truncated backpropagation, achieving SOTA results on ARC-AGI benchmarks.

DetailsMotivation: While Universal Transformers (UTs) show strong performance on complex reasoning tasks like ARC-AGI and Sudoku, the specific sources of their performance gains remain poorly understood. The authors aim to systematically analyze UT variants to identify what actually drives improvements, rather than assuming elaborate architectural designs are responsible.

Method: The authors first conduct systematic analysis of UT variants to identify key performance drivers. Based on findings that recurrent inductive bias and Transformer’s strong nonlinear components are crucial, they propose Universal Reasoning Model (URM) which enhances UT with two key components: short convolution and truncated backpropagation through time.
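
A rough sketch of a weight-tied recurrent block with a short depthwise convolution and truncated backpropagation through time; the layer sizes, where the convolution is applied, and where gradients are cut are all assumptions, not the URM architecture as published.

```python
import torch
import torch.nn as nn

class RecurrentReasonerSketch(nn.Module):
    """Weight-tied Transformer block applied recurrently, with a short causal
    depthwise convolution and gradients truncated every few iterations (TBPTT)."""

    def __init__(self, dim: int = 128, heads: int = 4, kernel: int = 3):
        super().__init__()
        self.short_conv = nn.Conv1d(dim, dim, kernel, padding=kernel - 1, groups=dim)
        self.block = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim, batch_first=True)

    def forward(self, x: torch.Tensor, steps: int = 8, truncate_every: int = 2) -> torch.Tensor:
        for t in range(steps):
            # Causal short convolution over the sequence dimension.
            h = self.short_conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
            x = self.block(x + h)
            if (t + 1) % truncate_every == 0 and t + 1 < steps:
                x = x.detach()  # cut the backprop graph (truncated BPTT)
        return x

tokens = torch.randn(2, 16, 128)
out = RecurrentReasonerSketch()(tokens)
```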

Result: URM achieves state-of-the-art performance on ARC-AGI benchmarks: 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. The analysis reveals that improvements primarily come from recurrent inductive bias and Transformer’s nonlinear components, not from elaborate architectural designs.

Conclusion: The study demonstrates that Universal Transformers’ reasoning performance gains stem from fundamental properties (recurrent inductive bias and nonlinear components) rather than complex architectural designs. The proposed URM, enhanced with short convolution and truncated backpropagation, effectively leverages these insights to achieve SOTA results on challenging reasoning benchmarks.

Abstract: Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UT variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of the Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is available at https://github.com/zitian-gao/URM.

[270] Optimizing Large Language Models for ESG Activity Detection in Financial Texts

Mattia Birti, Andrea Maurino, Francesco Osborne

Main category: cs.AI

TL;DR: Fine-tuning LLMs on ESG-Activities dataset improves environmental activity classification, with open models outperforming proprietary ones in some cases.

DetailsMotivation: ESG integration into corporate decision-making is crucial but challenging due to evolving regulations. AI solutions could help assess sustainability report alignment, but current LLMs struggle with domain-specific ESG contexts and lack quality datasets.

Method: Investigate current LLMs’ ability to identify environmental activity text, then enhance performance through fine-tuning on ESG-Activities dataset (1,325 labeled text segments classified by EU ESG taxonomy). Use combination of original and synthetically generated data for fine-tuning.

Result: Fine-tuning on ESG-Activities significantly improves classification accuracy. Open models like Llama 7B and Gemma 7B outperform large proprietary solutions in specific configurations.

Conclusion: Findings have important implications for financial analysts, policymakers, and AI researchers seeking to enhance ESG transparency and compliance through advanced NLP techniques.

Abstract: The integration of Environmental, Social, and Governance (ESG) factors into corporate decision-making is a fundamental aspect of sustainable finance. However, ensuring that business practices align with evolving regulatory frameworks remains a persistent challenge. AI-driven solutions for automatically assessing the alignment of sustainability reports and non-financial disclosures with specific ESG activities could greatly support this process. Yet, this task remains complex due to the limitations of general-purpose Large Language Models (LLMs) in domain-specific contexts and the scarcity of structured, high-quality datasets. In this paper, we investigate the ability of current-generation LLMs to identify text related to environmental activities. Furthermore, we demonstrate that their performance can be significantly enhanced through fine-tuning on a combination of original and synthetically generated data. To this end, we introduce ESG-Activities, a benchmark dataset containing 1,325 labelled text segments classified according to the EU ESG taxonomy. Our experimental results show that fine-tuning on ESG-Activities significantly enhances classification accuracy, with open models such as Llama 7B and Gemma 7B outperforming large proprietary solutions in specific configurations. These findings have important implications for financial analysts, policymakers, and AI researchers seeking to enhance ESG transparency and compliance through advanced natural language processing techniques.

[271] Inverse Scaling in Test-Time Compute

Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez

Main category: cs.AI

TL;DR: Extending reasoning length in Large Reasoning Models can degrade performance across multiple task types, revealing five distinct failure modes including distraction, overfitting, spurious correlations, focus loss, and concerning behavior amplification.

DetailsMotivation: To investigate whether simply increasing test-time compute (reasoning length) in Large Reasoning Models always improves performance, and to identify potential failure modes that emerge when models reason for longer periods.

Method: Constructed evaluation tasks across four categories: simple counting with distractors, regression with spurious features, deduction with constraint tracking, and advanced AI risks. Tested models with varying reasoning lengths to observe performance changes.

Result: Found inverse scaling relationship where longer reasoning deteriorates performance. Identified five failure modes: 1) Claude models get distracted by irrelevant info, 2) OpenAI o-series overfits to problem framings, 3) models shift to spurious correlations, 4) all models struggle with complex deduction focus, 5) extended reasoning amplifies concerning behaviors (e.g., Claude Sonnet 4 shows increased self-preservation expressions).

Conclusion: While test-time compute scaling shows promise for improving capabilities, it can inadvertently reinforce problematic reasoning patterns. Evaluation across diverse reasoning lengths is crucial to identify and address these failure modes in Large Reasoning Models.

Abstract: We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

[272] CADDesigner: Conceptual Design of CAD Models Based on General-Purpose Agent

Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, Peng Du

Main category: cs.AI

TL;DR: LLM-powered CAD agent that accepts text/sketch inputs, uses dialogue to refine requirements, generates CAD code via ECIP paradigm with visual feedback, and improves through knowledge base storage.

DetailsMotivation: CAD design requires high expertise, creating barriers to entry. The paper aims to lower these barriers and improve design efficiency by making CAD more accessible through AI assistance.

Method: Proposes a CAD agent powered by LLMs that accepts textual descriptions and sketches as input. Uses interactive dialogue for requirement refinement, employs Explicit Context Imperative Paradigm (ECIP) for code generation, incorporates iterative visual feedback, and stores designs in a structured knowledge base for continuous improvement.

Result: The method achieves state-of-the-art performance in CAD code generation, demonstrating effectiveness in generating high-quality CAD modeling code through the proposed approach.

Conclusion: The LLM-powered CAD agent successfully lowers the entry barrier for CAD design while improving efficiency, with the ECIP paradigm and knowledge base system enabling continuous improvement in code generation capabilities.

Abstract: Computer Aided Design (CAD) plays a pivotal role in industrial manufacturing but typically requires a high level of expertise from designers. To lower the entry barrier and improve design efficiency, we present an agent for CAD conceptual design powered by large language models (LLMs). The agent accepts both textual descriptions and sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Explicit Context Imperative Paradigm (ECIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases are stored in a structured knowledge base, enabling continuous improvement of the agent’s code generation capabilities. Experimental results demonstrate that our method achieves state-of-the-art performance in CAD code generation.

[273] Language Self-Play For Data-Free Training

Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan, Jason Chen

Main category: cs.AI

TL;DR: LLMs face data bottleneck; Language Self-Play (LSP) enables improvement without additional data through self-play reinforcement learning.

DetailsMotivation: LLM progress is limited by the need for ever-increasing training data. The paper aims to overcome this fundamental bottleneck by enabling models to improve without requiring additional external data.

Method: Proposes Language Self-Play (LSP), a reinforcement learning approach using game-theoretic self-play framework. Models compete against themselves in a competitive game setting, where stronger policies emerge through self-play without external data.
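
As a rough illustration of data-free self-play, the sketch below has the same model propose tasks and answer them, turning scored answer pairs into preference data for a later policy update; the helper functions and pairwise scoring are assumptions and omit the paper's game-theoretic objective.

```python
def language_self_play_round(model_generate, score, n_tasks: int = 4):
    """One simplified self-play round: the model acts as its own challenger
    (proposing tasks) and solver (answering them); scored answer pairs become
    preference data. `model_generate` and `score` are assumed helpers."""
    records = []
    for _ in range(n_tasks):
        task = model_generate("Propose a challenging instruction for yourself.")
        answer_a = model_generate(task)
        answer_b = model_generate(task)
        if score(task, answer_a) >= score(task, answer_b):
            chosen, rejected = answer_a, answer_b
        else:
            chosen, rejected = answer_b, answer_a
        records.append({"prompt": task, "chosen": chosen, "rejected": rejected})
    return records
```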

Result: Experiments with Llama-3.2-3B-Instruct show pretrained models can be effectively improved using self-play alone on instruction-following, mathematics, and coding benchmarks.

Conclusion: Self-play reinforcement learning provides a viable path to improve LLMs without data dependency, addressing a fundamental bottleneck in current LLM development.

Abstract: Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model’s capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself, a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following, mathematics, and coding benchmarks show that pretrained models can be effectively improved with self-play alone.

[274] IPR-1: Interactive Physical Reasoner

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

Main category: cs.AI

TL;DR: IPR (Interactive Physical Reasoner) combines world-model rollouts with VLM policies using physics-centric action codes to achieve human-like physical reasoning across 1000+ games with visual domain gaps, outperforming existing methods including GPT-5.

DetailsMotivation: Humans learn physics and causality through observation and interaction. The paper aims to develop agents that can similarly acquire human-like reasoning from interaction and improve with experience, addressing limitations of current VLMs (lack look-ahead) and world models (overfit to visual patterns).

Method: Proposes IPR (Interactive Physical Reasoner) that uses world-model rollouts to score and reinforce a VLM’s policy. Introduces PhysCode, a physics-centric action code that aligns semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1000+ heterogeneous games with visual domain gaps.

Result: IPR performs robustly across levels from primitive intuition to goal-driven reasoning, surpassing GPT-5 overall. Performance improves with more training games and interaction steps, and the model zero-shot transfers to unseen games.

Conclusion: Physics-centric interaction provides a path to steadily improving physical reasoning. The approach demonstrates that agents can acquire human-like reasoning through interaction and improve with experience, supporting the viability of this learning paradigm.

Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM’s policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.

[275] Meta-Reinforcement Learning for Building Energy Management System

Huiliang Zhang, Di Wu, Arnaud Zinflou, Benoit Boulet

Main category: cs.AI

TL;DR: MetaEMS uses meta-reinforcement learning to enable fast adaptation of energy management systems across different buildings, overcoming the slow training limitations of traditional RL methods.

DetailsMotivation: Buildings are major energy consumers, and improving energy efficiency is crucial for cost reduction and emissions control. While RL shows promise for EMS, existing methods require extensive training and struggle to adapt to new buildings, limiting practical deployment.

Method: MetaEMS employs a meta-reinforcement learning framework with two-level adaptation: group-level knowledge transfer from previously solved tasks and building-level adaptation for specific environments, enabling efficient policy learning across diverse buildings.

Result: Experimental results show MetaEMS adapts faster to unseen buildings and consistently outperforms baseline methods across various scenarios, demonstrating superior learning efficiency and control effectiveness.

Conclusion: MetaEMS provides an effective meta-RL solution for building energy management that enables rapid adaptation to new buildings, addressing the practical deployment limitations of traditional RL-based EMS methods.

Abstract: The building sector is one of the largest contributors to global energy consumption. Improving its energy efficiency is essential for reducing operational costs and greenhouse gas emissions. Energy management systems (EMS) play a key role in monitoring and controlling building appliances efficiently and reliably. With the increasing integration of renewable energy, intelligent EMS solutions have received growing attention. Reinforcement learning (RL) has recently been explored for this purpose and shows strong potential. However, most RL-based EMS methods require a large number of training steps to learn effective control policies, especially when adapting to unseen buildings, which limits their practical deployment. This paper introduces MetaEMS, a meta-reinforcement learning framework for EMS. MetaEMS improves learning efficiency by transferring knowledge from previously solved tasks to new ones through group-level and building-level adaptation, enabling fast adaptation and effective control across diverse building environments. Experimental results demonstrate that MetaEMS adapts more rapidly to unseen buildings and consistently outperforms baseline methods across various scenarios.

[276] COMMA: A Communicative Multimodal Multi-Agent Benchmark

Timothy Ossowski, Danyal Maqbool, Jixuan Chen, Zefan Cai, Tyler Bradshaw, Junjie Hu

Main category: cs.AI

TL;DR: COMMA is a new benchmark for evaluating multimodal multi-agent collaboration through language communication, revealing surprising weaknesses in state-of-the-art models including GPT-4o and reasoning models.

DetailsMotivation: Current multimodal agent research overlooks language-based communication between agents in collaborative tasks, creating a gap in understanding their real-world effectiveness, especially when communicating with humans. Existing benchmarks fail to address inter-agent communication and collaboration in scenarios with unequal information access.

Method: Introduces COMMA: a novel puzzle benchmark designed to evaluate collaborative performance of multimodal multi-agent systems through language communication. Features various multimodal puzzles across four key categories of agentic capability in communicative collaboration settings.

Result: Reveals surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain of thought reasoning models (R1-Onevision, LLaVA-CoT) struggle to outperform even random baselines in agent-agent collaboration.

Conclusion: The benchmark highlights a critical gap in current multimodal agent capabilities and identifies communication abilities as a potential growth area for improvement in collaborative AI systems.

Abstract: The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain of thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.

[277] A LoRA-Based Approach to Fine-Tuning LLMs for Educational Guidance in Resource-Constrained Settings

Md Millat Hosen

Main category: cs.AI

TL;DR: Researchers developed a cost-effective method to adapt Mistral-7B LLM for academic advising in study-abroad contexts using LoRA and 4-bit quantization, achieving 92% accuracy in domain-specific recommendations with efficient training.

DetailsMotivation: To create an affordable solution for adapting large language models to academic advising, particularly for study-abroad contexts, that can work in low-resource institutional settings where computational resources are limited.

Method: Used Mistral-7B-Instruct model with Low-Rank Adaptation (LoRA) and 4-bit quantization for parameter-efficient fine-tuning. Two-stage training: Phase 1 used synthetic dataset from Gemini Pro API, Phase 2 used manually curated datasets from StudyAbroadGPT project. Implemented memory-efficient quantization and continuous training analytics via Weights & Biases.
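
A minimal configuration sketch using Hugging Face transformers, peft, and bitsandbytes for 4-bit (NF4) quantization with LoRA adapters; the checkpoint name, target modules, and ranks are illustrative assumptions, not the study's reported hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; rank and targets are assumptions.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```

Only the adapter weights are trained while the quantized base stays frozen, which is what keeps memory use within reach of a single commodity GPU.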

Result: Achieved 52.7% reduction in training loss, 92% accuracy in domain-specific recommendations, 95% markdown formatting support, and median run-rate of 100 samples/second on standard GPU equipment. Demonstrated effective application in low-resource educational advising scenarios.

Conclusion: Instruction-tuned LLMs can be effectively adapted for educational advising in low-resource settings. While limitations include decreased generalizability and reliance on synthetic data, the framework is scalable for multilingual and real-time advising. Future work includes RAG integration, dynamic quantization, and real-time database connections.

Abstract: The current study describes a cost-effective method for adapting large language models (LLMs) for academic advising with study-abroad contexts in mind and for application in low-resource methods for acculturation. With the Mistral-7B-Instruct model applied with a Low-Rank Adaptation (LoRA) method and a 4-bit quantization method, the model underwent training in two distinct stages related to this study’s purpose to enhance domain specificity while maintaining computational efficiency. In Phase 1, the model was conditioned with a synthetic dataset via the Gemini Pro API, and in Phase 2, it was trained with manually curated datasets from the StudyAbroadGPT project to achieve enhanced, contextualized responses. Technical innovations entailed memory-efficient quantization, parameter-efficient adaptation, and continuous training analytics via Weights & Biases. After training, this study demonstrated a reduction in training loss by 52.7%, 92% accuracy in domain-specific recommendations, achieved 95% markdown-based formatting support, and a median run-rate of 100 samples per second on off-the-shelf GPU equipment. These findings support the effective application of instruction-tuned LLMs within educational advisers, especially in low-resource institutional scenarios. Limitations included decreased generalizability and the application of a synthetically generated dataset, but this framework is scalable for adding new multilingual-augmented and real-time academic advising processes. Future directions may include plans for the integration of retrieval-augmented generation, applying dynamic quantization routines, and connecting to real-time academic databases to increase adaptability and accuracy.

[278] Developing Large Language Models for Clinical Research Using One Million Clinical Trials

Zifeng Wang, Jiacheng Lin, Qiao Jin, Junyi Gao, Jathurshan Pradeepkumar, Pengcheng Jiang, Zhiyong Lu, Jimeng Sun

Main category: cs.AI

TL;DR: TrialPanorama is a large-scale structured resource aggregating 1.6M clinical trial records from global registries, linked with biomedical ontologies and literature, enabling AI development for clinical research tasks.

DetailsMotivation: Developing effective AI for clinical research requires comprehensive data foundations for model training and rigorous evaluation, which current resources lack.

Method: Created TrialPanorama by aggregating 1.6M clinical trial records from 15 global registries, linking them with biomedical ontologies and literature. Built pipeline to construct 152K training/testing samples for 8 clinical research tasks, then developed an 8B LLM using supervised finetuning and reinforcement learning on this data.

Result: The 8B LLM trained on TrialPanorama outperformed 70B generic LLMs across all 8 clinical research tasks with relative improvements ranging from 5.2% to 73.7%, demonstrating superior clinical reasoning capabilities.

Conclusion: TrialPanorama provides a solid foundation for scaling AI in clinical research, showing that domain-specific training on comprehensive structured data enables smaller models to outperform much larger generic models in specialized clinical reasoning tasks.

Abstract: Developing artificial intelligence (AI) for clinical research requires a comprehensive data foundation that supports model training and rigorous evaluation. Here, we introduce TrialPanorama, a large-scale structured resource that aggregates 1.6M clinical trial records from fifteen global registries and links them with biomedical ontologies and associated literature. To demonstrate its utility, we build a pipeline that constructs 152K training and testing samples for eight key clinical research tasks. Three tasks support systematic review workflows, including study search, study screening, and evidence summarization. Five tasks focus on trial design and optimization, including arm design, eligibility criteria design, endpoint selection, sample size estimation, and trial completion assessment and rationalization. Benchmarking cutting-edge large language models (LLMs) reveals that generic LLMs have limited capability in clinical reasoning. In contrast, an 8B LLM we developed on TrialPanorama using supervised finetuning and reinforcement learning wins over the 70B generic counterparts in all eight tasks, with a relative improvement of 73.7%, 67.6%, 38.4%, 37.8%, 26.5%, 20.7%, 20.0%, 18.1%, and 5.2%, respectively. We envision that TrialPanorama provides a solid foundation for future scaling of AI for clinical research.

[279] LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning

Jiaqi Wang, Binquan Ji, Haibo Luo, Yiyang Qi, Ruiting Li, Huiyan Wang, Yuantao Han, Cangyi Yang, Jiaxu Zhang, Feiliang Ren

Main category: cs.AI

TL;DR: LTA-Thinker is a Latent Thought-Augmented Training Framework that improves complex reasoning in LLMs by increasing variance in latent thought distributions and optimizing through multi-objective co-training.

DetailsMotivation: Current methods for complex reasoning in LLMs (like Coconut, SoftCoT) face bottlenecks in generating high-quality latent thoughts. The paper builds on SoftCoT++ theory showing that larger variance in latent thought distributions better approximates golden truth distributions.

Method: Two key innovations: 1) A learnable prior-based architecture for latent thought generation to increase distribution variance, 2) A distribution-based directional optimization paradigm with multi-objective co-training (SFT loss + Semantic Alignment Loss using KL divergence + Reasoning Focus Loss using contrastive learning).
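
A rough sketch of a multi-objective loss combining an SFT cross-entropy term, a KL-based alignment term, and a contrastive focus term; the tensor interfaces, loss weights, and the exact form of each term are assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(logits, labels, latent_q, latent_target, pos_sim, neg_sim,
                         w_align: float = 0.5, w_focus: float = 0.1, tau: float = 0.1):
    """Illustrative co-training objective:
    - sft: token-level cross-entropy on the answer tokens
    - align: KL divergence pulling the latent-thought distribution toward a
      question-derived target distribution (semantic alignment)
    - focus: InfoNCE-style term favoring similarity to critical reasoning steps."""
    sft = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    align = F.kl_div(F.log_softmax(latent_q, dim=-1),
                     F.softmax(latent_target, dim=-1), reduction="batchmean")
    focus = -torch.log(
        torch.exp(pos_sim / tau) / (torch.exp(pos_sim / tau) + torch.exp(neg_sim / tau))
    ).mean()
    return sft + w_align * align + w_focus * focus
```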

Result: LTA-Thinker achieves state-of-the-art performance among various baselines, demonstrating higher performance ceiling and better scaling effects.

Conclusion: The proposed framework effectively addresses bottlenecks in latent thought generation and utilization, improving reasoning performance through enhanced distribution variance and optimized training objectives.

Abstract: Complex reasoning in Large Language Models can be dynamically optimized using Test-Time Scaling (TTS) to mitigate overthinking. While methods such as Coconut, SoftCoT, and its variants are effective for continuous latent-space inference, the core bottleneck still lies in the efficient generation and utilization of high-quality Latent Thoughts. Drawing on the theory from SoftCoT++ that a larger variance in the generated Latent Thought distribution more closely approximates the golden-truth distribution, we propose a Latent Thought-Augmented Training Framework, LTA-Thinker, which improves distributional variance and enhances reasoning performance from two perspectives. First, LTA-Thinker constructs a Latent Thought generation architecture based on a learnable prior. This architecture aims to increase the variance of generated Latent Thought vectors in order to simplify the overall structure and raise the performance ceiling. Second, LTA-Thinker introduces a distribution-based directional optimization paradigm that jointly constrains both distribution locality and distribution scale. This mechanism improves information efficiency and computational cost through a multi-objective co-training strategy, which combines the standard Supervised Fine-Tuning (SFT) loss with two novel losses: a Semantic Alignment Loss, which uses KL divergence to ensure that the Latent Thought is highly relevant to the semantics of the question, and a Reasoning Focus Loss, which uses a contrastive learning mechanism to guide the model to focus on the most critical reasoning steps. Experiments show that LTA-Thinker achieves state-of-the-art (SOTA) performance among various baselines and demonstrates a higher performance ceiling and better scaling effects.

[280] MCTS-EP: Empowering Embodied Planning with Online Preference Optimization

Hang Xu, Zang Yu, Yehui Tang, Pengbo Hu, Yuhao Tang, Hao Dong

Main category: cs.AI

TL;DR: MCTS-EP is an online learning framework combining LLMs with Monte Carlo Tree Search for training embodied agents, achieving SOTA performance on benchmarks with improved efficiency.

DetailsMotivation: To develop a more efficient and effective framework for training embodied agents that can leverage the reasoning capabilities of LLMs while addressing exploration challenges in complex environments.

Method: Combines LLMs with MCTS through three key components: MCTS-guided exploration for preference data collection, efficient multi-modal reasoning mechanism, and iterative training pipeline based on preference optimization.

Result: Achieves 92% and 87% success rates for textual and visual tasks in ALFWorld, 0.81 average reward in WebShop, and reduces average interaction steps from 18.7/19.5 to 10.2/9.9 steps in visual ALFWorld.

Conclusion: MCTS-EP provides a theoretically sound and practically effective framework that outperforms conventional on-policy algorithms, demonstrating the value of combining LLMs with search-based methods for embodied agent training.

Abstract: This paper introduces MCTS-EP, an online learning framework that combines large language models (LLMs) with Monte Carlo Tree Search (MCTS) for training embodied agents. MCTS-EP integrates three key components: MCTS-guided exploration for preference data collection, an efficient multi-modal reasoning mechanism, and an iterative training pipeline based on preference optimization. We theoretically prove that MCTS-EP achieves better performance bounds than conventional on-policy algorithms when the loss function is strongly convex, and demonstrate that it can be formulated as a search-enhanced variant of GAIL. MCTS-EP achieves state-of-the-art performance across several benchmarks. In ALFWorld, it achieves 92% and 87% success rates for textual and visual tasks. In WebShop, it reaches an average reward of 0.81. MCTS-EP also reduces average interaction steps from 18.7/19.5 to 10.2/9.9 in visual ALFWorld. Code available at: https://github.com/xuhang-2/Embodied-Agent-Planning

[281] GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine

Heming Zhang, Di Huang, Wenyu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li

Main category: cs.AI

TL;DR: GALAX integrates GNNs and LLMs via reinforcement learning with Graph Process Reward Model for explainable subgraph reasoning in precision medicine target discovery.

DetailsMotivation: Existing approaches have limitations: numerical omics ignore topological context, text-centric LLMs lack quantitative reasoning, and graph-only models underuse node semantics and LLM generalization. Process Reward Models have issues with unreliable intermediate evaluation and reward hacking. Need to integrate quantitative multi-omic signals, topological structure, and literature-scale text via LLMs.

Method: Propose GALAX framework integrating pretrained GNNs into LLMs via reinforcement learning guided by Graph Process Reward Model (GPRM). Generates disease-relevant subgraphs step-wise: initiated by LLM, iteratively evaluated by pretrained GNN and schema-based rule check. Also introduce Target-QA benchmark combining CRISPR targets, multi-omic profiles, and biomedical graph knowledge for GNN pretraining.

Result: Enables process-level supervision without explicit labels, supports long-context reasoning over text-numeric graphs (TNGs), provides scalable and biologically grounded framework for explainable, reinforcement-guided subgraph reasoning.

Conclusion: GALAX offers reliable and interpretable target discovery in precision medicine by bridging numeric evidence, topological knowledge, and language context through subgraph reasoning.

Abstract: In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets. Existing pipelines capture only part of these: numerical omics ignore topological context, text-centric LLMs lack quantitatively grounded reasoning, and graph-only models underuse node semantics and the generalization of LLMs, limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by unreliable intermediate evaluation, vulnerability to reward hacking, and computational cost. These gaps motivate integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principal bridge linking numeric evidence, topological knowledge, and language context. Therefore, we propose GALAX (Graph Augmented LAnguage model with eXplainability), an innovative framework that integrates pretrained Graph Neural Networks (GNNs) into Large Language Models (LLMs) via reinforcement learning guided by a Graph Process Reward Model (GPRM), which generates disease-relevant subgraphs in a step-wise manner initiated by an LLM and iteratively evaluated by a pretrained GNN and a schema-based rule check, enabling process-level supervision without explicit labels. As an application, we also introduce Target-QA, a benchmark combining CRISPR-identified targets, multi-omic profiles, and biomedical graph knowledge across diverse cancer cell lines, which enables GNN pretraining for supervising step-wise graph construction and supports long-context reasoning over text-numeric graphs (TNGs), providing a scalable and biologically grounded framework for explainable, reinforcement-guided subgraph reasoning toward reliable and interpretable target discovery in precision medicine.

[282] Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi

Main category: cs.AI

TL;DR: PD-SSM proposes a structured sparse parametrization of transition matrices in state-space models that enables optimal finite-state automata tracking with linear computational cost.

DetailsMotivation: Current SSMs face a trade-off: diagonal/structured transition matrices are computationally efficient but limited in expressivity for FSA emulation, while unstructured matrices are expressive but computationally prohibitive for moderate state sizes.

Method: PD-SSM parametrizes transition matrix as product of column one-hot matrix (P) and complex-valued diagonal matrix (D), enabling FSA state tracking with optimal state size and depth while keeping computational cost comparable to diagonal SSMs.
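
A small sketch of the P·D parametrization: a column one-hot selection matrix times a complex diagonal matrix, used in one step of a linear recurrence. The hard argmax selection (without a straight-through estimator) and the magnitude/phase parametrization of D are assumptions.

```python
import torch

def pd_transition(perm_logits: torch.Tensor, diag_mag: torch.Tensor, diag_angle: torch.Tensor) -> torch.Tensor:
    """Build A = P @ D, where P is a column one-hot (hard selection) matrix
    and D is a complex-valued diagonal matrix."""
    n = perm_logits.size(0)
    # Column one-hot P: each column selects one row (argmax over the logits).
    idx = perm_logits.argmax(dim=0)                     # (n,)
    P = torch.zeros(n, n, dtype=torch.cfloat)
    P[idx, torch.arange(n)] = 1.0
    # Complex diagonal D with magnitude and phase parameters.
    D = torch.diag(torch.polar(diag_mag, diag_angle))
    return P @ D

# One step of the linear recurrence x_{t+1} = A x_t + B u_t (toy sizes).
n = 4
A = pd_transition(torch.randn(n, n), torch.rand(n), torch.rand(n))
x = torch.zeros(n, dtype=torch.cfloat)
u = torch.randn(n).to(torch.cfloat)
B = torch.eye(n).to(torch.cfloat)
x = A @ x + B @ u
```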

Result: Theoretically BIBO-stable and can emulate any N-state FSA with one layer of dimension N and linear readout. Experimentally outperforms modern SSM variants on FSA tracking tasks, comparable to neural controlled differential equations on time-series classification, and effectively tracks complex FSA states in Transformer-SSM architecture.

Conclusion: PD-SSM provides an optimal balance between expressivity and computational efficiency for state-space models, enabling effective FSA emulation with linear computational scaling while maintaining theoretical guarantees.

Abstract: Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model

[283] STEMS: Spatial-Temporal Enhanced Safe Multi-Agent Coordination for Building Energy Management

Huiliang Zhang, Di Wu, Arnaud Zinflou, Benoit Boulet

Main category: cs.AI

TL;DR: STEMS is a safety-constrained multi-agent RL framework for coordinated building energy management that combines spatial-temporal graph learning with control barrier functions to achieve cost/emission reductions while ensuring operational safety.

DetailsMotivation: Current multi-building energy systems face three key challenges: insufficient spatial-temporal information exploitation, lack of rigorous safety guarantees, and system complexity. Coordinated building energy management is essential for carbon reduction, occupant comfort, and energy cost savings.

Method: STEMS integrates two core components: (1) a spatial-temporal graph representation learning framework using GCN-Transformer fusion architecture to capture inter-building relationships and temporal patterns, and (2) a safety-constrained multi-agent RL algorithm incorporating Control Barrier Functions to provide mathematical safety guarantees.

Result: Extensive experiments show STEMS achieves 21% cost reduction, 18% emission reduction, and dramatically reduces safety violations from 35.1% to 5.6% while maintaining optimal comfort with only 0.13 discomfort proportion. The framework demonstrates strong robustness during extreme weather and maintains effectiveness across different building types.

Conclusion: STEMS provides an effective solution for coordinated building energy management that addresses spatial-temporal dependencies while ensuring operational safety, making it a promising approach for multi-building energy systems.

Abstract: Building energy management is essential for achieving carbon reduction goals, improving occupant comfort, and reducing energy costs. Coordinated building energy management faces critical challenges in exploiting spatial-temporal dependencies while ensuring operational safety across multi-building systems. Current multi-building energy systems face three key challenges: insufficient spatial-temporal information exploitation, lack of rigorous safety guarantees, and system complexity. This paper proposes Spatial-Temporal Enhanced Safe Multi-Agent Coordination (STEMS), a novel safety-constrained multi-agent reinforcement learning framework for coordinated building energy management. STEMS integrates two core components: (1) a spatial-temporal graph representation learning framework using a GCN-Transformer fusion architecture to capture inter-building relationships and temporal patterns, and (2) a safety-constrained multi-agent RL algorithm incorporating Control Barrier Functions to provide mathematical safety guarantees. Extensive experiments on real-world building datasets demonstrate STEMS’s superior performance over existing methods, showing that STEMS achieves 21% cost reduction, 18% emission reduction, and dramatically reduces safety violations from 35.1% to 5.6% while maintaining optimal comfort with only 0.13 discomfort proportion. The framework also demonstrates strong robustness during extreme weather conditions and maintains effectiveness across different building types.

[284] DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models

Ruofan Zhang, Bin Xia, Zhen Cheng, Cairen Jian, Minglun Yang, Ngai Wong, Yuan Cheng

Main category: cs.AI

TL;DR: DART is a difficulty-adaptive reasoning truncation framework that learns when to stop thinking by adjusting reasoning length based on problem difficulty, achieving 81.2% truncation and 5.33× computational acceleration while maintaining accuracy.

DetailsMotivation: Current chain-of-thought methods generate long explanations indiscriminately, causing inefficiency, while existing reinforcement learning approaches for adaptive thinking are unstable and reward-dependent. There's a need for a stable, efficient reasoning framework that aligns computational effort with problem difficulty.

Method: DART uses a supervised framework that: 1) distills concise reasoning patterns from stronger models, 2) interpolates them into a continuum of reasoning styles, and 3) curates optimal training data balancing correctness and compactness to learn when to stop thinking based on problem difficulty.
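
One plausible reading of the data-curation step, sketched below: for each problem, keep the shortest reasoning trace that still reaches a correct answer. The selection rule is an assumption, not the authors' exact recipe.

```python
def curate_training_traces(candidates):
    """Keep, per problem, the shortest correct reasoning trace so the model is
    trained to stop early on easy items and think longer only when needed.
    `candidates` maps problem id -> list of (trace, is_correct) pairs."""
    curated = {}
    for pid, traces in candidates.items():
        correct = [trace for trace, ok in traces if ok]
        if correct:
            curated[pid] = min(correct, key=len)   # compactness among correct traces
    return curated
```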

Result: Across multiple mathematical benchmarks, DART achieves 81.2% reasoning truncation (DeepSeek-R1-Distill-Qwen-7B on GSM8K) with 5.33× computational acceleration while preserving or improving accuracy compared to standard chain-of-thought methods.

Conclusion: DART provides a stable and general paradigm for efficient reasoning that advances adaptive intelligence in LLMs by aligning computational effort with problem difficulty, offering significant efficiency gains without sacrificing accuracy.

Abstract: Adaptive reasoning is essential for aligning the computational effort of large language models (LLMs) with the intrinsic difficulty of problems. Current chain-of-thought methods boost reasoning ability but indiscriminately generate long explanations, leading to evident inefficiency. However, existing reinforcement learning approaches to adaptive thinking remain unstable and heavily reward-dependent. Here we propose DART, a supervised Difficulty-Adaptive Reasoning Truncation framework that adjusts thinking length according to problem difficulty. By distilling concise reasoning patterns from stronger models, interpolating them into a continuum of reasoning styles, and curating optimal training data that balances correctness and compactness, DART learns when to "stop thinking". Across multiple mathematical benchmarks, experimental results demonstrate its remarkable efficiency while preserving or improving accuracy, achieving a significant 81.2% reasoning truncation (DeepSeek-R1-Distill-Qwen-7B on GSM8K dataset) with 5.33× computational acceleration. DART provides a stable and general paradigm for efficient reasoning, advancing the development of adaptive intelligence in LLMs.

[285] Advanced Black-Box Tuning of Large Language Models with Limited API Calls

Zhikang Xie, Weilin Wan, Peizhu Gong, Weizhong Zhang, Cheng Jin

Main category: cs.AI

TL;DR: Proposes a novel black-box tuning method using Gaussian Process surrogate with minimal API calls to efficiently adapt large language models without parameter access.

DetailsMotivation: Current black-box tuning methods face a dilemma: either use inefficient proxy models with limited improvement, or make expensive API calls in every iteration. There's a need for a method that balances effectiveness with computational efficiency.

Method: Trains a Gaussian Process surrogate model using “LogitMap Pairs” from querying the foundation model on a small but informative training subset. This surrogate approximates foundation model outputs to guide proxy model training, dramatically reducing direct API queries.
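
The surrogate idea can be illustrated with a short sketch. This is not the authors' code: the embedding dimension, subset size, and kernel are placeholder assumptions, and the "LogitMap Pairs" are approximated here as (input embedding, API-returned logits) pairs.

```python
# Minimal sketch of the surrogate idea: fit a Gaussian Process on a small set of
# (input embedding -> foundation-model logits) pairs, then query the surrogate
# instead of the API while training the proxy model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 64 queried examples, 128-dim sentence embeddings,
# 4-way classification logits returned by the (expensive) foundation model API.
X_queried = rng.normal(size=(64, 128))   # embeddings of the informative subset
Y_logits = rng.normal(size=(64, 4))      # logits obtained via real API calls

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_queried, Y_logits)              # one-off cost: 64 API calls in total

# Later, during proxy-model training, unseen inputs are routed to the surrogate.
X_unlabeled = rng.normal(size=(1000, 128))
pseudo_logits, std = gp.predict(X_unlabeled, return_std=True)
# pseudo_logits can supervise the proxy model without further API queries;
# std could flag inputs where a real API call is still worthwhile.
```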

Result: Achieves accuracy improvement from 55.92% to 86.85% while reducing API query frequency to only 1.38%. Outperforms offline approaches and matches/exceeds query-intensive methods with significantly lower API costs.

Conclusion: Provides a robust, high-efficiency paradigm for language model adaptation that balances performance with computational cost, offering a practical solution for black-box tuning of LLMs.

Abstract: Black-box tuning is an emerging paradigm for adapting large language models (LLMs) to better achieve desired behaviors, particularly when direct access to model parameters is unavailable. Current strategies, however, often present a dilemma of suboptimal extremes: either separately train a small proxy model and then use it to shift the predictions of the foundation model, offering notable efficiency but often yielding limited improvement; or make API calls to the foundation model in each tuning iteration, which entails prohibitive computational costs. Therefore, we propose a novel advanced black-box tuning method for LLMs with limited API calls. Our core strategy involves training a Gaussian Process (GP) surrogate model with “LogitMap Pairs” derived from querying the foundation model on a minimal but highly informative training subset. This surrogate can approximate the outputs of the foundation model to guide the training of the proxy model, thereby effectively reducing the need for direct queries to the foundation model. Extensive experiments verify that our approach elevates pre-trained language model accuracy from 55.92% to 86.85%, reducing the frequency of API queries to merely 1.38%. This significantly outperforms offline approaches that operate entirely without API access. Notably, our method also achieves comparable or superior accuracy to query-intensive approaches, while significantly reducing API costs. This offers a robust and high-efficiency paradigm for language model adaptation.

[286] Menta: A Small Language Model for On-Device Mental Health Prediction

Tianyi Zhang, Xiangyuan Xue, Lingyan Ruan, Shiya Fu, Feng Xia, Simon D’Alfonso, Vassilis Kostakos, Ting Dang, Hong Jia

Main category: cs.AI

TL;DR: Menta is an optimized small language model fine-tuned for multi-task mental health prediction from social media data, achieving better performance than larger models while being deployable on mobile devices.

DetailsMotivation: Mental health conditions affect hundreds of millions globally but early detection is limited. Large language models (LLMs) show promise but are too computationally heavy for practical deployment. Small language models (SLMs) are lightweight but underexplored for social media-based mental health prediction.

Method: Menta is jointly trained across six classification tasks using a LoRA-based framework, cross-dataset strategy, and balanced accuracy-oriented loss. It’s optimized specifically for multi-task mental health prediction from social media data.
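
The balanced accuracy-oriented loss is not specified here in detail; a common approximation is inverse-frequency class weighting inside the cross-entropy, sketched below for one hypothetical task head (the LoRA adapters and cross-dataset sampling are omitted).

```python
# Illustrative sketch only: class-balanced cross-entropy as a stand-in for a
# "balanced accuracy-oriented" objective; the paper's exact loss may differ.
import torch
import torch.nn.functional as F

def balanced_ce_loss(logits: torch.Tensor, labels: torch.Tensor, num_classes: int):
    # Weight each class by the inverse of its frequency in the current batch.
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1.0)
    weights = counts.sum() / (num_classes * counts)
    return F.cross_entropy(logits, labels, weight=weights)

# Toy usage: a binary depression-detection head over a shared encoder output.
logits = torch.randn(8, 2)                        # batch of 8, 2 classes
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced batch
loss = balanced_ce_loss(logits, labels, num_classes=2)
```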

Result: Menta achieves 15.2% average improvement across tasks (depression, stress, suicidality) compared to best non-fine-tuned SLMs. It outperforms 13B-parameter LLMs on depression and stress tasks while being 3.25x smaller. Successfully deployed on iPhone 15 Pro Max with only ~3GB RAM.

Conclusion: Menta demonstrates the potential for scalable, privacy-preserving mental health monitoring through optimized small language models that can run on mobile devices, offering practical deployment advantages over larger models.

Abstract: Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models (LLMs) have shown promise in mental health applications, their size and computational demands hinder practical deployment. Small language models (SLMs) offer a lightweight alternative, but their use for social media–based mental health prediction remains largely underexplored. In this study, we introduce Menta, the first optimized SLM fine-tuned specifically for multi-task mental health prediction from social media data. Menta is jointly trained across six classification tasks using a LoRA-based framework, a cross-dataset strategy, and a balanced accuracy–oriented loss. Evaluated against nine state-of-the-art SLM baselines, Menta achieves an average improvement of 15.2% across tasks covering depression, stress, and suicidality compared with the best-performing non–fine-tuned SLMs. It also achieves higher accuracy on depression and stress classification tasks compared to 13B-parameter LLMs, while being approximately 3.25x smaller. Moreover, we demonstrate real-time, on-device deployment of Menta on an iPhone 15 Pro Max, requiring only approximately 3GB RAM. Supported by a comprehensive benchmark against existing SLMs and LLMs, Menta highlights the potential for scalable, privacy-preserving mental health monitoring. Code is available at: https://hong-labs.github.io/menta-project/

[287] Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning

Hongye Cao, Zhixin Bai, Ziyue Peng, Boyan Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao

Main category: cs.AI

TL;DR: Proposes an efficient RL framework using semantic and token-level entropy signals to prevent entropy collapse in LLM reasoning, outperforming other entropy-based methods across multiple benchmarks.

DetailsMotivation: RLVR improves LLM reasoning but suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. There's a need to address this limitation while maintaining accuracy.

Method: Two-pronged approach: 1) Semantic entropy-guided curriculum learning organizes training data from low to high semantic entropy for progressive optimization. 2) Non-uniform token treatment applies KL regularization on low-entropy tokens (critical for exploration) with stronger constraints on high-covariance portions within these tokens.
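
A rough sketch of the token-level part follows; the entropy threshold, KL coefficient, and the stronger constraint on high-covariance tokens are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: compute per-token entropy of the policy distribution and apply a
# KL-to-reference penalty only on low-entropy tokens.
import torch
import torch.nn.functional as F

def low_entropy_kl_penalty(policy_logits, ref_logits, entropy_threshold=1.0, beta=0.1):
    # policy_logits, ref_logits: (batch, seq_len, vocab)
    logp = F.log_softmax(policy_logits, dim=-1)
    p = logp.exp()
    token_entropy = -(p * logp).sum(dim=-1)              # (batch, seq_len)

    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (p * (logp - ref_logp)).sum(dim=-1)             # KL(policy || ref) per token

    mask = (token_entropy < entropy_threshold).float()   # regularize low-entropy tokens only
    return beta * (mask * kl).mean()

penalty = low_entropy_kl_penalty(torch.randn(2, 16, 32000), torch.randn(2, 16, 32000))
```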

Result: Method outperforms other entropy-based approaches across 6 benchmarks with 3 different parameter-scale base models, effectively mitigating entropy collapse and enhancing LLM reasoning.

Conclusion: Joint optimization of data organization (curriculum learning) and algorithmic design (non-uniform token treatment) effectively addresses entropy collapse in RLVR, leading to improved reasoning capabilities in LLMs.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which reduces policy exploration and limits reasoning capabilities. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. From the data perspective, we introduce semantic entropy-guided curriculum learning, organizing training data from low to high semantic entropy to guide progressive optimization from easier to more challenging tasks. For the algorithmic design, we adopt non-uniform token treatment by imposing KL regularization on low-entropy tokens that critically impact policy exploration and applying stronger constraints on high-covariance portions within these tokens. By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experimental results across 6 benchmarks with 3 different parameter-scale base models demonstrate that our method outperforms other entropy-based approaches in improving reasoning.

[288] Can AI Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels

Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad

Main category: cs.AI

TL;DR: LLMs fail to genuinely understand abortion stigma across cognitive, interpersonal, and structural levels, revealing current alignment approaches produce appropriate language but not coherent multilevel understanding.

DetailsMotivation: As LLMs increasingly mediate stigmatized health decisions, there's a critical need to evaluate whether they can genuinely understand complex psychological and physiological phenomena that people may not be able to articulate.

Method: Systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), conducting multilevel analysis of stigma representation across cognitive, interpersonal, and structural dimensions.

Result: Models fail tests of genuine understanding across all levels: overestimate interpersonal stigma, underestimate cognitive stigma, assume uniform community condemnation, introduce demographic biases, miss validated stigma-secrecy relationships, and contradict themselves within theoretical constructs.

Conclusion: Current LLMs lack coherent multilevel understanding of psychological constructs, requiring new approaches to design (multilevel coherence), evaluation (continuous auditing), governance (mandatory audits, accountability), and AI literacy for high-stakes contexts.

Abstract: As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (anticipated judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.

cs.SD

[289] Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks, and Practical Recipes

Udayon Sen, Alka Luqman, Anupam Chattopadhyay

Main category: cs.SD

TL;DR: Paper evaluates robustness of deepfake audio detection models in noisy real-world conditions using controlled SNR testing framework with MS-SNSD noises and ASVspoof 2021 data, showing finetuning improves performance significantly.

DetailsMotivation: Deepfake audio detection models perform well in clean lab conditions but degrade in realistic scenarios with background noise, room reverberation, and consumer channels. There's a need to systematically evaluate and improve robustness in noisy environments.

Method: Developed reproducible framework mixing MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled SNR levels (35 dB to -5 dB). Studied multi-condition training and fixed-SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), measuring accuracy, ROC-AUC, and EER on binary and four-class tasks.
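
Fixed-SNR mixing of the kind described can be sketched in a few lines; the arrays below are random placeholders standing in for MS-SNSD noise clips and ASVspoof 2021 DF utterances.

```python
# Minimal sketch of mixing noise into speech at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise to the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Sweep from near-clean to very noisy, as in the evaluation protocol.
speech = np.random.randn(16000)
noise = np.random.randn(8000)
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (35, 20, 10, 0, -5)}
```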

Result: Finetuning reduces EER by 10-15 percentage points at 10-0 dB SNR across all backbones. The framework enables systematic evaluation of model degradation under varying noise conditions.

Conclusion: Robust deepfake audio detection requires explicit consideration of real-world noise conditions. The proposed SNR-based evaluation framework provides controlled testing methodology, and finetuning significantly improves performance in noisy environments across different pretrained encoders.

Abstract: Deepfake audio detection has progressed rapidly with strong pre-trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance in realistic capture conditions - background noise (domestic/office/transport), room reverberation, and consumer channels - often lags clean-lab results. We survey and evaluate robustness for state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled signal-to-noise ratios (SNRs). SNR is a measured proxy for noise severity used widely in speech; it lets us sweep from near-clean (35 dB) to very noisy (-5 dB) to quantify graceful degradation. We study multi-condition training and fixed-SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity x corruption) tasks. In our experiments, finetuning reduces EER by 10-15 percentage points at 10-0 dB SNR across backbones.

[290] Ensemble-Guided Distillation for Compact and Robust Acoustic Scene Classification on Edge Devices

Hossein Sharify, Behnam Raoufi, Mahdy Ramezani, Khosrow Hajsadeghi, Saeed Bagheri Shouraki

Main category: cs.SD

TL;DR: A compact acoustic scene classification framework using knowledge distillation from a teacher ensemble to a quantization-ready student network achieves SOTA results on TAU dataset for mobile deployment.

DetailsMotivation: To create an efficient, compact acoustic scene classification model suitable for edge/mobile deployment while maintaining high accuracy, addressing the need for practical ASC solutions on resource-constrained devices.

Method: Uses knowledge distillation where a compact student network (with depthwise-separable blocks and global response normalization) learns from a diverse teacher ensemble via two fusion heads (z1 for per-teacher weights, z2 for per-class logit fusion) with temperature-scaled soft targets and hard labels.
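
The distillation objective (temperature-scaled soft targets from the fused ensemble plus hard labels) can be sketched as follows; the temperature, mixing weight, and fusion heads z1/z2 are simplified placeholders.

```python
# Sketch of a standard KD loss combining ensemble soft targets and hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, ensemble_logits, labels, T=3.0, alpha=0.7):
    soft_targets = F.softmax(ensemble_logits / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1), soft_targets, reduction="batchmean"
    ) * (T * T)                                   # usual T^2 scaling for KD
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
```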

Result: Achieves state-of-the-art results on the TAU Urban Acoustic Scenes 2022 Mobile benchmark under matched edge-deployment constraints, demonstrating strong performance and practicality for mobile ASC.

Conclusion: The proposed framework successfully creates a compact, quantization-ready ASC model that approximates ensemble performance while being suitable for mobile deployment, offering a practical solution for edge acoustic scene classification.

Abstract: We present a compact, quantization-ready acoustic scene classification (ASC) framework that couples an efficient student network with a learned teacher ensemble and knowledge distillation. The student backbone uses stacked depthwise-separable “expand-depthwise-project” blocks with global response normalization to stabilize training and improve robustness to device and noise variability, while a global pooling head yields class logits for efficient edge inference. To inject richer inductive bias, we assemble a diverse set of teacher models and learn two complementary fusion heads: z1, which predicts per-teacher mixture weights using a student-style backbone, and z2, a lightweight MLP that performs per-class logit fusion. The student is distilled from the ensemble via temperature-scaled soft targets combined with hard labels, enabling it to approximate the ensemble’s decision geometry with a single compact model. Evaluated on the TAU Urban Acoustic Scenes 2022 Mobile benchmark, our approach achieves state-of-the-art (SOTA) results on the TAU dataset under matched edge-deployment constraints, demonstrating strong performance and practicality for mobile ASC.

[291] Memo2496: Expert-Annotated Dataset and Dual-View Adaptive Framework for Music Emotion Recognition

Qilin Li, C. L. Philip Chen, Tong Zhang

Main category: cs.SD

TL;DR: This paper introduces Memo2496, a large-scale annotated music emotion dataset, and DAMER, a novel model with three modules to address cross-track feature drift and improve music emotion recognition performance.

DetailsMotivation: Music Emotion Recognition (MER) research is limited by scarce high-quality annotated datasets and challenges with cross-track feature drift, where models struggle to generalize across different music tracks.

Method: Two main contributions: 1) Memo2496 dataset with 2496 instrumental tracks annotated by 30 music specialists using valence-arousal labels with quality control; 2) DAMER model with three modules: Dual Stream Attention Fusion for multimodal feature interaction, Progressive Confidence Labelling for pseudo-label generation, and Style Anchored Memory Learning to mitigate feature drift.
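
One ingredient of the PCL module, Jensen-Shannon divergence as a consistency score between the two spectral views, can be sketched directly; the curriculum temperature scheduling and confidence thresholds are not shown.

```python
# Sketch: JS divergence between two predictive distributions as a pseudo-label
# consistency score (lower divergence -> more reliable pseudo label).
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits):
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Low divergence between the Mel-spectrogram and cochleagram views suggests a
# reliable pseudo label for the unlabeled track.
score = js_divergence(torch.randn(4, 2), torch.randn(4, 2))
```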

Result: DAMER achieves state-of-the-art performance on Memo2496, 1000songs, and PMEmo datasets, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17% respectively. Ablation studies confirm each module’s effectiveness.

Conclusion: The paper successfully addresses MER challenges through a high-quality dataset and an innovative model architecture that handles cross-track feature drift, with both resources made publicly available to advance the field.

Abstract: Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER’s state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module’s contribution. Both the dataset and source code are publicly available.

[292] Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting

Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K

Main category: cs.SD

TL;DR: Joint multimodal contrastive learning framework for acoustic word embeddings that unifies audio-text and audio-audio supervision in shared embedding space, outperforming baselines on word discrimination while supporting both STD and KWS tasks.

DetailsMotivation: Existing acoustic word embedding approaches suffer from unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and require task-specific models, limiting their effectiveness and flexibility.

Method: Proposes joint multimodal contrastive learning framework with two simultaneous optimization objectives: (1) audio-text contrastive learning (CLAP-inspired loss) to align audio and text representations, and (2) audio-audio contrastive learning (Deep Word Discrimination loss) to enhance intra-class compactness and inter-class separation in shared embedding space.
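
The audio-text term can be sketched as a standard symmetric contrastive (CLAP-style) loss; the audio-audio DWD term and the paper's exact temperature are omitted and the embedding dimensions are placeholders.

```python
# Sketch of a symmetric audio-text contrastive term over paired embeddings.
import torch
import torch.nn.functional as F

def audio_text_contrastive(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim); row i of each is a matched pair.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy over rows (audio->text) and columns (text->audio).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = audio_text_contrastive(torch.randn(16, 256), torch.randn(16, 256))
```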

Result: Outperforms existing AWE baselines on word discrimination task while flexibly supporting both Spoken Term Detection (STD) and Keyword Spotting (KWS) applications.

Conclusion: The proposed joint multimodal contrastive learning framework represents the first comprehensive approach to address limitations of existing AWE methods, achieving superior performance on word discrimination while providing flexible support for multiple speech retrieval tasks.

Abstract: Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.

[293] GLM-TTS Technical Report

Jiayan Cui, Zhihan Yang, Naihan Li, Jiankun Tian, Xingyu Ma, Yi Zhang, Guangyu Chen, Runxuan Yang, Yuqing Cheng, Yizhi Zhou, Guochen Yu, Xiaotao Gu, Jie Tang

Main category: cs.SD

TL;DR: GLM-TTS is a production-ready text-to-speech system using a two-stage autoregressive + diffusion architecture that achieves SOTA performance with 100k hours of training data, featuring optimized tokenization, multi-reward RL, efficient voice customization, and precise pronunciation control.

DetailsMotivation: To create a production-level TTS system that balances efficiency, controllability, and high-fidelity speech generation for real-world deployment needs.

Method: Two-stage architecture: 1) text-to-token autoregressive model, 2) token-to-waveform diffusion model. Uses optimized speech tokenizer with F0 constraints, GRPO-based multi-reward RL framework for joint optimization of pronunciation, speaker similarity, and prosody. Includes LoRA-based voice customization and hybrid phoneme-text input for control.

Result: Achieves state-of-the-art performance on multiple open-source benchmarks with only 100k hours of training data. Enables efficient deployment with real-time synthesis capabilities.

Conclusion: GLM-TTS provides a production-ready solution that successfully balances high-quality speech synthesis with practical deployment requirements including efficiency, customization, and precise control.

Abstract: This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai), the Zhipu Qingyan app/web (chatglm.cn).

[294] Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis

Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer

Main category: cs.SD

TL;DR: AMT systems struggle with generalization beyond classical piano music, showing significant performance drops under musical distribution shifts like genre, dynamics, and polyphony.

DetailsMotivation: Current AMT systems are trained on limited datasets (mostly classical piano), raising concerns about their ability to generalize to other musical contexts with different genres, dynamics, and polyphony levels.

Method: Introduced MDS corpus with three subsets (Genre, Random, MAEtest) to simulate distribution shifts. Evaluated state-of-the-art AMT systems using both traditional IR metrics and musically-informed metrics to isolate performance degradation.

Result: Found significant performance drops: 20 percentage point F1 drop due to sound variations and 14 due to genre shifts. Dynamics estimation was more vulnerable than onset prediction. Random non-musical sequences revealed system limitations under extreme shifts.

Conclusion: Deep AMT systems suffer from persistent Corpus Bias, showing limited generalization to musical contexts beyond their training data, highlighting the need for more diverse training data and better generalization techniques.

Abstract: Automatic Music Transcription (AMT) – the task of converting music audio into note representations – has seen rapid progress, driven largely by deep learning systems. Due to the limited availability of richly annotated music datasets, much of the progress in AMT has been concentrated on classical piano music, and even a few very specific datasets. Whether these systems can generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension – specifically, variations in genre, dynamics, and polyphony levels. To this end, we introduce the MDS corpus, comprising three distinct subsets – (1) Genre, (2) Random, and (3) MAEtest – to emulate different axes of distribution shift. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically-informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 performance drop of 20 percentage points due to sound, and 14 due to genre. Generally, we find that dynamics estimation proves more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.

[295] MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation

Yash Vishe, Eric Xue, Xunyi Jiang, Zachary Novack, Junda Wu, Julian McAuley, Xin Xu

Main category: cs.SD

TL;DR: The paper introduces MuseCPBench, the first benchmark for evaluating Music Context Preservation (MCP) in music editing tasks, addressing inconsistent evaluation protocols in existing works.

DetailsMotivation: Current music editing research lacks consistent evaluation of Music Context Preservation (MCP) - the ability to preserve unchanged musical facets during editing. Existing studies use inconsistent protocols and metrics, making comparisons unreliable.

Method: The authors create MuseCPBench, a comprehensive MCP evaluation benchmark covering four categories of musical facets. They use this benchmark to systematically evaluate five representative music editing baselines.

Result: The systematic analysis reveals consistent preservation gaps in current music editing methods. The benchmark enables comprehensive comparisons and provides insightful explanations about MCP capabilities.

Conclusion: The findings offer practical guidance for developing more effective and reliable music editing strategies with strong MCP capability. MuseCPBench addresses the evaluation gap in music editing research.

Abstract: Music editing plays a vital role in modern music production, with applications in film, broadcasting, and game development. Recent advances in music generation models have enabled diverse editing tasks such as timbre transfer, instrument substitution, and genre transformation. However, many existing works overlook the evaluation of their ability to preserve musical facets that should remain unchanged during editing, a property we define as Music Context Preservation (MCP). While some studies do consider MCP, they adopt inconsistent evaluation protocols and metrics, leading to unreliable and unfair comparisons. To address this gap, we introduce the first MCP evaluation benchmark, MuseCPBench, which covers four categories of musical facets and enables comprehensive comparisons across five representative music editing baselines. Through systematic analysis along musical facets, methods, and models, we identify consistent preservation gaps in current music editing methods and provide insightful explanations. We hope our findings offer practical guidance for developing more effective and reliable music editing strategies with strong MCP capability.

[296] Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty

Yiwen Zhao, Jiatong Shi, Yuxun Tang, William Chen, Shinji Watanabe

Main category: cs.SD

TL;DR: Proposes uncertainty-based optimization for singing voice synthesis to address data scarcity issues, using differentiable data augmentation and frame-level uncertainty prediction.

DetailsMotivation: Public singing datasets are limited compared to speech/audio data, causing performance degradation in long-tail scenarios like imbalanced pitch distributions or rare singing styles.

Method: Two-part approach: 1) Differentiable data augmentation in adversarial training to increase prior uncertainty; 2) Frame-level uncertainty prediction module to estimate posterior uncertainty and allocate more learning capacity to low-confidence segments.

Result: Empirical results on the Opencpop and Ofuton-P datasets, covering Chinese and Japanese, demonstrate improved performance across multiple evaluation perspectives.

Conclusion: Uncertainty-based optimization effectively mitigates data scarcity challenges in singing voice synthesis, improving model performance on limited datasets.

Abstract: Singing voice synthesis (SVS) has seen remarkable advancements in recent years. However, compared to speech and general audio data, publicly available singing datasets remain limited. In practice, this data scarcity often leads to performance degradation in long-tail scenarios, such as imbalanced pitch distributions or rare singing styles. To mitigate these challenges, we propose uncertainty-based optimization to improve the training process of end-to-end SVS models. First, we introduce differentiable data augmentation in the adversarial training, which operates in a sample-wise manner to increase the prior uncertainty. Second, we incorporate a frame-level uncertainty prediction module that estimates the posterior uncertainty, enabling the model to allocate more learning capacity to low-confidence segments. Empirical results on the Opencpop and Ofuton-P datasets, across Chinese and Japanese, demonstrate that our approach improves performance from various perspectives.

[297] Adapting Speech Language Model to Singing Voice Synthesis

Yiwen Zhao, Jiatong Shi, Jinchuan Tian, Yuxun Tang, Jiarui Hai, Jionghao Han, Shinji Watanabe

Main category: cs.SD

TL;DR: A 1.7B parameter speech language model pretrained for TTS is successfully adapted for singing voice synthesis using only 135 hours of synthetic singing data, achieving performance comparable to leading discrete token-based SVS models.

DetailsMotivation: While Speech Language Models (SLMs) have shown promise as a unified framework for various speech tasks, their generalization capabilities to new domains like singing voice synthesis remain underexplored. The authors aim to investigate whether large-scale pre-trained SLMs can effectively generalize to SVS with limited domain-specific data.

Method: The approach adapts ESPNet-SpeechLM (1.7B parameters) for SVS using a 135-hour synthetic singing corpus. The method involves: (1) tokenizing music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching for mel-spectrogram generation, and (4) a mel-to-wave vocoder.

Result: The adapted SLM demonstrates strong generalization to singing voice synthesis, achieving performance comparable to leading discrete token-based SVS models despite using only synthetic singing data and building on a TTS-pretrained foundation.

Conclusion: Large-scale pre-trained speech language models can effectively generalize to singing voice synthesis with limited domain-specific data, validating SLMs as a versatile paradigm that can extend beyond their original training domains with appropriate adaptation techniques.

Abstract: Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B parameter TTS pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon the ESPNet-SpeechLM, our recipe involves the following procedure: (1) tokenization of music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete token-based SVS models.

[298] PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation

Longshen Ou, Ye Wang

Main category: cs.SD

TL;DR: PhraseVAE and PhraseLDM: A phrase-level latent diffusion framework for full-song symbolic music generation that compresses variable-length polyphonic sequences into compact latent representations, enabling efficient non-autoregressive generation of complete multi-track songs.

DetailsMotivation: Existing symbolic music models suffer from extremely long sequences, limited context length, and weak support for long-range structure due to operating on note-attribute tokens. These limitations hinder full-song generation with coherent global structure.

Method: Introduces PhraseVAE to compress arbitrary variable-length polyphonic note sequences into 64-dimensional phrase-level latent representations with high reconstruction fidelity. Then builds PhraseLDM, a latent diffusion model on this space that generates entire multi-track songs in a single pass without autoregressive components, eliminating bar-wise sequential modeling.

Result: The framework supports up to 128 bars of music (8 minutes at 64 bpm), generates full songs within seconds with only 45M parameters, produces coherent local texture, idiomatic instrument patterns, and clear global structure while maintaining competitive musical quality and generation diversity.

Conclusion: Phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation, encouraging future research to move beyond note-attribute tokens and consider phrase-level units as more effective and musically meaningful modeling targets.

Abstract: This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses an arbitrary variable-length polyphonic note sequence into a single compact 64-dimensional phrase-level latent representation with high reconstruction fidelity, allowing a well-structured latent space and efficient generative modeling. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.

cs.LG

[299] Physics-Guided Deep Learning for Heat Pump Stress Detection: A Comprehensive Analysis on When2Heat Dataset

Md Shahabub Alam, Md Asifuzzaman Jishan, Ayan Kumar Ghosh

Main category: cs.LG

TL;DR: A physics-guided deep neural network achieves 78.1% accuracy for heat pump stress classification using the When2Heat dataset, outperforming baseline methods by 2-5%.

DetailsMotivation: Heat pump operational stress detection is challenging due to complex thermodynamics and limited real-world data, requiring improved methods for energy-efficient building management.

Method: Physics-Guided Deep Neural Network (PG-DNN) with physics-guided feature selection, 5 hidden layers, dual regularization strategies, and variable thresholding for realistic class distribution using the When2Heat dataset (131,483 samples, 656 features).
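
As a rough sketch under stated assumptions (the hidden-layer widths, dropout rate, and three-class output are guesses, and the physics-guided feature selection is not shown), the described 5-hidden-layer network with dual regularization might look like this:

```python
# Illustrative sketch only: MLP with 5 hidden layers, dropout, and L2 weight decay
# standing in for the "dual regularization strategies".
import torch
import torch.nn as nn

class PGDNN(nn.Module):
    def __init__(self, in_features=656, num_classes=3, hidden=(512, 256, 128, 64, 32)):
        super().__init__()
        layers, prev = [], in_features
        for h in hidden:                          # 5 hidden layers, as described
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(0.3)]
            prev = h
        layers.append(nn.Linear(prev, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = PGDNN()
# Second regularizer applied via the optimizer's weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```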

Result: 78.1% test accuracy and 78.5% validation accuracy, outperforming baselines by +5.0% over shallow networks, +4.0% over limited feature sets, and +2.0% over single regularization strategies.

Conclusion: The PG-DNN provides a production-ready solution for heat pump stress detection, validated through comprehensive ablation studies and cross-country energy pattern analysis.

Abstract: Heat pump systems are critical components in modern energy-efficient buildings, yet their operational stress detection remains challenging due to complex thermodynamic interactions and limited real-world data. This paper presents a novel Physics-Guided Deep Neural Network (PG-DNN) approach for heat pump stress classification using the When2Heat dataset, containing 131,483 samples with 656 features across 26 European countries. The methodology integrates physics-guided feature selection and class definition with a deep neural network architecture featuring 5 hidden layers and dual regularization strategies. The model achieves 78.1% test accuracy and 78.5% validation accuracy, demonstrating significant improvements over baseline approaches: +5.0% over shallow networks, +4.0% over limited feature sets, and +2.0% over single regularization strategies. Comprehensive ablation studies validate the effectiveness of physics-guided feature selection, variable thresholding for realistic class distribution, and cross-country energy pattern analysis. The proposed system provides a production-ready solution for heat pump stress detection with 181,348 parameters and 720 seconds training time on AMD Ryzen 9 7950X with RTX 4080 hardware.

[300] Scaling and Transferability of Annealing Strategies in Large Language Model Training

Siqi Wang, Zhengyu Chen, Teng Xiao, Zheqi Lv, Jinluan Yang, Xunliang Cai, Jingang Wang, Xiaomeng Li

Main category: cs.LG

TL;DR: The paper presents a framework for optimizing learning rate annealing strategies in LLM training, showing that optimal annealing ratios follow consistent patterns that can be transferred across model sizes and configurations.

DetailsMotivation: Learning rate scheduling is critical for training large language models, but finding optimal annealing strategies requires extensive hyperparameter searches. The authors aim to understand the transferability of annealing dynamics across different model configurations to reduce this search burden.

Method: The authors develop an improved predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. The framework incorporates training steps, maximum learning rate, and annealing behavior. They validate their approach using both Dense and Mixture-of-Experts (MoE) models across different training configurations.
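
A minimal sketch of a Warmup-Steady-Decay schedule is shown below; the warmup fraction, annealing ratio, and linear decay shape are placeholders rather than the paper's fitted values.

```python
# Sketch of a WSD learning-rate schedule as a function of the training step.
def wsd_lr(step, total_steps, max_lr, warmup_frac=0.01, anneal_ratio=0.1, min_lr=0.0):
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(anneal_ratio * total_steps)   # the "annealing ratio" being studied
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                         # linear warmup
        return max_lr * step / max(warmup_steps, 1)
    if step < decay_start:                          # steady phase at max_lr
        return max_lr
    progress = (step - decay_start) / max(decay_steps, 1)
    return max_lr + (min_lr - max_lr) * progress    # linear decay (annealing)

schedule = [wsd_lr(s, total_steps=10_000, max_lr=3e-4) for s in range(10_000)]
```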

Result: The framework enables more efficient optimization of learning rate schedules without exhaustive hyperparameter searches. The authors demonstrate that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations, and that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models.

Conclusion: The work provides practical guidance for selecting optimal annealing strategies in LLM training, showing that annealing dynamics are transferable across model sizes and configurations, which reduces the need for extensive hyperparameter tuning.

Abstract: Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments using both Dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.

[301] Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training

John Graham Reynolds

Main category: cs.LG

TL;DR: Mixed training with interleaved math and NLI examples eliminates catastrophic forgetting while maintaining math performance, showing specialization doesn’t require forgetting general capabilities.

DetailsMotivation: When finetuning large language models for specialized tasks like mathematical reasoning, models suffer from catastrophic forgetting, losing previously learned general capabilities like natural language inference.

Method: Finetuned Flan-T5-Base on DeepMind Mathematics dataset, measured forgetting on MultiNLI. Proposed mixed training strategies that interleave mathematical and NLI examples during training, systematically exploring mixing ratios from 1:1 to 15:1.
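
The ratio-controlled interleaving can be sketched as a simple generator (the Flan-T5 finetuning loop itself is omitted); see below.

```python
# Sketch: yield `ratio` math examples for every 1 NLI example, cycling the NLI data.
import itertools

def interleave(math_examples, nli_examples, ratio):
    nli_cycle = itertools.cycle(nli_examples)
    for i, ex in enumerate(math_examples):
        yield ("math", ex)
        if (i + 1) % ratio == 0:
            yield ("nli", next(nli_cycle))

mixed = list(interleave(range(12), ["premise/hypothesis"] * 3, ratio=3))
# -> 12 math examples with an NLI example inserted after every 3rd one
```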

Result: Math-only training improved math accuracy from 3.1% to 12.0% but caused NLI accuracy to collapse from 81.0% to 16.5% (64.5 percentage point drop). Mixed training with 1:1 ratio achieved 12.0% math accuracy (matching math-only) while preserving 86.2% NLI accuracy. Even minimal NLI exposure (6.2%) provided effective regularization.

Conclusion: Specialization need not require forgetting general capabilities. Mixed training completely eliminates catastrophic forgetting while maintaining equivalent mathematical performance, with implications for scaling to larger models where it may confer additional benefits beyond forgetting prevention.

Abstract: When finetuning large language models for specialized tasks such as mathematical reasoning, models exhibit catastrophic forgetting, losing previously learned capabilities. We investigate this by finetuning Flan-T5-Base (250M parameters) on the DeepMind Mathematics dataset and measuring forgetting on MultiNLI. Math-only training improves mathematical accuracy from 3.1% to 12.0% but causes NLI accuracy to collapse from 81.0% to 16.5%, a 64.5 percentage point drop occurring within the first 1,000 training steps. We propose mixed training strategies that interleave mathematical and NLI examples during training. Our results demonstrate that mixed training completely eliminates catastrophic forgetting while maintaining equivalent mathematical performance: the balanced 1:1 ratio achieves 12.0% math accuracy (matching math-only) while preserving 86.2% NLI accuracy. We systematically explore mixing ratios from 1:1 to 15:1, finding that even minimal NLI exposure (6.2%) provides effective regularization. These findings demonstrate that specialization need not require forgetting general capabilities, with implications for scaling to larger models where mixed training may confer additional benefits beyond forgetting prevention.

[302] Variational Physics-Informed Ansatz for Reconstructing Hidden Interaction Networks from Steady States

Kaiming Luo

Main category: cs.LG

TL;DR: VPIA is a variational physics-informed method that reconstructs complex interaction networks from heterogeneous steady-state data without requiring temporal trajectories or derivative estimation.

DetailsMotivation: Existing methods struggle to reconstruct interaction structures in complex dynamical systems with nonlinear, heterogeneous, and higher-order couplings, especially when only steady states (not temporal trajectories) are observable.

Method: VPIA embeds steady-state constraints into a differentiable variational representation and minimizes a physics-derived steady-state residual. It uses residual sampling with natural-gradient optimization for scalable learning of large, higher-order networks.
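
As a heavily simplified sketch, the steady-state residual idea can be shown for an assumed dynamics form dx/dt = -x + W tanh(x); the paper's operator ansatz, residual sampling, and natural-gradient optimization are more general than this toy example.

```python
# Toy sketch: learn a coupling matrix W by driving the steady-state residual of an
# assumed dynamics form to zero over observed snapshot states.
import torch

n_nodes, n_states = 20, 200
X_star = torch.randn(n_states, n_nodes)              # observed steady states (snapshots)
W = torch.zeros(n_nodes, n_nodes, requires_grad=True)
optimizer = torch.optim.Adam([W], lr=1e-2)

for _ in range(500):
    # Residual of dx/dt = -x + W @ tanh(x), evaluated at the snapshot states.
    residual = -X_star + torch.tanh(X_star) @ W.T
    loss = residual.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```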

Result: VPIA accurately recovers directed, weighted, and multi-body interaction structures across diverse nonlinear systems, even under substantial noise conditions.

Conclusion: VPIA provides a unified, robust framework for physics-constrained inference of complex interaction networks using only snapshot observations, overcoming limitations of existing reconstruction methods.

Abstract: The interaction structure of a complex dynamical system governs its collective behavior, yet existing reconstruction methods struggle with nonlinear, heterogeneous, and higher-order couplings, especially when only steady states are observable. We propose a Variational Physics-Informed Ansatz (VPIA) that infers general interaction operators directly from heterogeneous steady-state data. VPIA embeds the steady-state constraints of the dynamics into a differentiable variational representation and reconstructs the underlying couplings by minimizing a physics-derived steady-state residual, without requiring temporal trajectories, derivative estimation, or supervision. Residual sampling combined with natural-gradient optimization enables scalable learning of large and higher-order networks. Across diverse nonlinear systems, VPIA accurately recovers directed, weighted, and multi-body structures under substantial noise, providing a unified and robust framework for physics-constrained inference of complex interaction networks in settings where only snapshot observations are available.

[303] Predictive Modeling of Flood-Prone Areas Using SAR and Environmental Variables

Edwin Oluoch Awino, Denis Machanda

Main category: cs.LG

TL;DR: Combines SAR imagery with environmental data to model flood susceptibility in Kenya’s River Nyando watershed using machine learning, with Random Forest achieving best performance.

DetailsMotivation: Flooding is a destructive natural hazard worldwide, posing serious risks to ecosystems, infrastructure, and human livelihoods. There's a need for effective flood susceptibility mapping, particularly in data-limited regions like western Kenya.

Method: Uses Sentinel-1 SAR data from May 2024 flood event to create binary flood inventory. Combines six conditioning factors (slope, elevation, aspect, land use/land cover, soil type, distance from streams) with SAR data to train four ML models: Logistic Regression, Classification and Regression Trees, Support Vector Machines, and Random Forest.
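
A minimal sketch of the supervised setup follows; the feature names mirror the six conditioning factors above, while the data are random placeholders standing in for SAR-derived flood labels.

```python
# Sketch: train a Random Forest on conditioning factors with a binary flood label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score

features = ["slope", "elevation", "aspect", "land_cover", "soil_type", "dist_to_stream"]
X = np.random.rand(5000, len(features))       # placeholder pixel samples
y = np.random.randint(0, 2, size=5000)        # 1 = flooded (from the SAR inventory)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print(accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred))
```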

Result: Random Forest achieved highest predictive performance (accuracy = 0.762; Kappa = 0.480), outperforming other models. RF-based susceptibility map identified low-lying Kano Plains near Lake Victoria as highest flood vulnerability area, consistent with historical records and May 2024 event impacts.

Conclusion: Demonstrates value of combining SAR data with ensemble ML methods for flood susceptibility mapping in data-limited regions. Resulting maps provide important insights for disaster risk reduction, land-use planning, and early warning system development.

Abstract: Flooding is one of the most destructive natural hazards worldwide, posing serious risks to ecosystems, infrastructure, and human livelihoods. This study combines Synthetic Aperture Radar (SAR) imagery with environmental and hydrological data to model flood susceptibility in the River Nyando watershed, western Kenya. Sentinel-1 dual-polarization SAR data from the May 2024 flood event were processed to produce a binary flood inventory, which served as training data for machine learning (ML) models. Six conditioning factors – slope, elevation, aspect, land use/land cover, soil type, and distance from streams – were integrated with the SAR-derived flood inventory to train four supervised classifiers: Logistic Regression (LR), Classification and Regression Trees (CART), Support Vector Machines (SVM), and Random Forest (RF). Model performance was assessed using accuracy, Cohen’s Kappa, and Receiver Operating Characteristic (ROC) analysis. Results indicate that RF achieved the highest predictive performance (accuracy = 0.762; Kappa = 0.480), outperforming LR, CART, and SVM. The RF-based susceptibility map showed that low-lying Kano Plains near Lake Victoria have the highest flood vulnerability, consistent with historical flood records and the impacts of the May 2024 event. These findings demonstrate the value of combining SAR data and ensemble ML methods for flood susceptibility mapping in regions with limited data. The resulting maps offer important insights for disaster risk reduction, land-use planning, and early warning system development.

[304] Delete and Retain: Efficient Unlearning for Document Classification

Aadya Goel, Mayuri Sridhar

Main category: cs.LG

TL;DR: Hessian Reassignment: A two-step, model-agnostic class-level unlearning method for document classifiers that uses Hessian-vector products and Top-1 classification to efficiently remove target class influence while maintaining accuracy.

DetailsMotivation: Machine unlearning for LLMs has seen progress, but document classification models remain understudied. There's a need for efficient methods to remove specific training data influence without full retraining, particularly for class-level unlearning in document classifiers.

Method: Two-step approach: 1) Single influence-style update subtracting target class contributions via Hessian-vector system solved with conjugate gradients (only needs gradient and Hessian-vector products). 2) Decision-space guarantee via Top-1 classification instead of random reclassification of deleted-class samples.
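
The influence-style update relies on solving a Hessian-vector system with conjugate gradients; a generic sketch of that machinery (not the authors' implementation, and without damping or the Top-1 reassignment step) is shown below.

```python
# Sketch: solve H v = g using only Hessian-vector products (no explicit Hessian).
import torch

def make_hvp(loss, params):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    def hvp(v):
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv])
    return hvp

def conjugate_gradient(hvp, b, iters=50, tol=1e-6):
    x = torch.zeros_like(b)
    r = b.clone()
    p = r.clone()
    rs = r @ r
    for _ in range(iters):
        Ap = hvp(p)
        alpha = rs / (p @ Ap + 1e-12)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Tiny demo on a quadratic loss so the functions run end-to-end.
w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()                 # Hessian is 2I
hvp = make_hvp(loss, [w])
g = torch.randn(5)                    # stands in for the deleted-class gradient
update = conjugate_gradient(hvp, g)   # solves (2I) x = g  ->  x ≈ g / 2
```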

Result: Achieves retained-class accuracy close to full retrain-without-class while running orders of magnitude faster. Consistently lowers membership-inference advantage on removed class measured with pooled multi-shadow attacks.

Conclusion: Hessian Reassignment demonstrates a practical, principled path to efficient class unlearning in document classification, offering model-agnostic solution with strong performance guarantees.

Abstract: Machine unlearning aims to efficiently remove the influence of specific training data from a model without full retraining. While much progress has been made in unlearning for LLMs, document classification models remain relatively understudied. In this paper, we study class-level unlearning for document classifiers and present Hessian Reassignment, a two-step, model-agnostic solution. First, we perform a single influence-style update that subtracts the contribution of all training points from the target class by solving a Hessian-vector system with conjugate gradients, requiring only gradient and Hessian-vector products. Second, in contrast to common unlearning baselines that randomly reclassify deleted-class samples, we enforce a decision-space guarantee via Top-1 classification. On standard text benchmarks, Hessian Reassignment achieves retained-class accuracy close to full retrain-without-class while running orders of magnitude faster. Additionally, it consistently lowers membership-inference advantage on the removed class, measured with pooled multi-shadow attacks. These results demonstrate a practical, principled path to efficient class unlearning in document classification.

[305] Prediction of Respiratory Syncytial Virus-Associated Hospitalizations Using Machine Learning Models Based on Environmental Data

Eric Guo

Main category: cs.LG

TL;DR: Machine learning framework using wastewater, weather, and air quality data predicts RSV hospitalization risk levels in the US, with wastewater RSV as strongest predictor and identification of disparities among Native populations.

DetailsMotivation: RSV is a major cause of child hospitalizations with outbreaks influenced by environmental factors. Current prediction methods may not adequately integrate multiple data sources for timely outbreak forecasting and resource allocation.

Method: Developed ML framework combining weekly hospitalization rates, wastewater RSV levels, meteorological data, and air pollutant concentrations. Trained classification models (CART, Random Forest, Boosting) to predict three risk levels: Low risk, Alert, and Epidemic.

Result: Wastewater RSV level was strongest predictor, followed by temperature, ozone, and specific humidity. Found significantly higher hospitalization rates among Native Americans/Alaska Natives and in high-altitude states. Created interactive R Shiny dashboard for practical use.

Conclusion: Integration of environmental and surveillance data enables effective RSV outbreak forecasting. Identified health disparities requiring further research. Dashboard provides accessible tool for public health planning and intervention timing.

Abstract: Respiratory syncytial virus (RSV) is a leading cause of hospitalization among young children, with outbreaks strongly influenced by environmental conditions. This study developed a machine learning framework to predict RSV-associated hospitalizations in the United States (U.S.) by integrating wastewater surveillance, meteorological, and air quality data. The dataset combined weekly hospitalization rates, wastewater RSV levels, daily meteorological measurements, and air pollutant concentrations. Classification models, including CART, Random Forest, and Boosting, were trained to predict weekly RSV-associated hospitalization rates classified as Low risk, Alert, and Epidemic levels. The wastewater RSV level was identified as the strongest predictor, followed by meteorological and air quality variables such as temperature, ozone levels, and specific humidity. Notably, the analysis also revealed significantly higher RSV-associated hospitalization rates among Native Americans and Alaska Natives. Further research is needed to better understand the drivers of RSV disparity in these communities to improve prevention strategies. Furthermore, states at high altitudes, characterized by lower surface pressure, showed consistently higher RSV-associated hospitalization rates. These findings highlight the value of combining environmental and community surveillance data to forecast RSV outbreaks, enabling more timely public health interventions and resource allocation. In order to provide accessibility and practical use of the models, we have developed an interactive R Shiny dashboard (https://f6yxlu-eric-guo.shinyapps.io/rsv_app/), which allows users to explore RSV-associated hospitalization risk levels across different states, visualize the impact of key predictors, and interactively generate RSV outbreak forecasts.

[306] Federated Few-Shot Learning for Epileptic Seizure Detection Under Privacy Constraints

Ekaterina Sysoykova, Bernhard Anzengruber-Tanase, Michael Haslgrubler, Philipp Seidl, Alois Ferscha

Main category: cs.LG

TL;DR: Two-stage federated few-shot learning framework for personalized EEG seizure detection that addresses data scarcity, privacy constraints, and patient-specific adaptation using only 5 labeled EEG segments per patient.

DetailsMotivation: EEG data for seizure detection is often scarce, distributed across institutions, and governed by strict privacy regulations that prevent data pooling, making traditional centralized deep learning approaches impractical in clinical settings.

Method: Two-stage approach: 1) Federated learning fine-tunes a pretrained biosignal transformer (BIOT) across non-IID hospital sites without centralizing data; 2) Federated few-shot personalization adapts the classifier to each patient using only five labeled EEG segments while retaining seizure-specific information.
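
A minimal sketch of the two stages is given below, assuming a generic PyTorch model: Stage 1 as plain FedAvg-style weight averaging across sites, Stage 2 as few-shot adaptation of only the classifier head. The attribute name `model.classifier` and all hyperparameters are illustrative; the paper's BIOT-based setup is more involved.

```python
# Illustrative two-stage sketch (not the authors' code).
import copy
import torch
import torch.nn.functional as F

def fedavg_round(global_model, client_loaders, local_steps=1, lr=1e-4):
    """One federated round: each site fine-tunes a copy, the server averages weights."""
    client_states, sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.Adam(local.parameters(), lr=lr)
        for _ in range(local_steps):
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(local(x), y).backward()
                opt.step()
        client_states.append(local.state_dict())
        sizes.append(len(loader.dataset))
    total = sum(sizes)
    avg = {k: sum(s[k].float() * (n / total) for s, n in zip(client_states, sizes))
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model

def personalize_head(model, support_x, support_y, steps=100, lr=1e-3):
    """Few-shot personalization: freeze the backbone, tune only the final head
    on e.g. five labeled EEG segments from one patient."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.classifier.parameters():   # assumed attribute name
        p.requires_grad_(True)
    opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(support_x), support_y).backward()
        opt.step()
    return model
```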

Result: Federated fine-tuning achieved balanced accuracy of 0.43 (vs 0.52 centralized), Cohen’s kappa of 0.42 (vs 0.49), and weighted F1 of 0.69 (vs 0.74). In FFSL stage, client-specific models reached average balanced accuracy of 0.77, Cohen’s kappa of 0.62, and weighted F1 of 0.73 across four sites with heterogeneous distributions.

Conclusion: FFSL framework enables effective patient-adaptive seizure detection under realistic data-availability and privacy constraints, demonstrating practical feasibility for clinical deployment where data cannot be centralized.

Abstract: Many deep learning approaches have been developed for EEG-based seizure detection; however, most rely on access to large centralized annotated datasets. In clinical practice, EEG data are often scarce, patient-specific, distributed across institutions, and governed by strict privacy regulations that prohibit data pooling. As a result, creating usable AI-based seizure detection models remains challenging in real-world medical settings. To address these constraints, we propose a two-stage federated few-shot learning (FFSL) framework for personalized EEG-based seizure detection. The method is trained and evaluated on the TUH Event Corpus, which includes six EEG event classes. In Stage 1, a pretrained biosignal transformer (BIOT) is fine-tuned across non-IID simulated hospital sites using federated learning, enabling shared representation learning without centralizing EEG recordings. In Stage 2, federated few-shot personalization adapts the classifier to each patient using only five labeled EEG segments, retaining seizure-specific information while still benefiting from cross-site knowledge. Federated fine-tuning achieved a balanced accuracy of 0.43 (centralized: 0.52), Cohen’s kappa of 0.42 (0.49), and weighted F1 of 0.69 (0.74). In the FFSL stage, client-specific models reached an average balanced accuracy of 0.77, Cohen’s kappa of 0.62, and weighted F1 of 0.73 across four sites with heterogeneous event distributions. These results suggest that FFSL can support effective patient-adaptive seizure detection under realistic data-availability and privacy constraints.

[307] Privacy-Enhancing Infant Cry Classification with Federated Transformers and Denoising Regularization

Geofrey Owino, Bernard Shibwabo

Main category: cs.LG

TL;DR: Privacy-preserving infant cry classification system using federated learning with denoising autoencoder and Transformer, achieving high accuracy while reducing communication overhead and enabling real-time edge inference.

DetailsMotivation: Infant cry classification can help assess infant needs, but deployment faces privacy concerns with audio data, sensitivity to background noise, and domain shift across recording environments. Current solutions are limited by these practical constraints.

Method: End-to-end pipeline integrating denoising autoencoder (DAE), convolutional tokenizer, and Transformer encoder trained with communication-efficient federated learning. System performs on-device denoising, adaptive segmentation, post hoc calibration, and energy-based OOD abstention. Federated training uses regularized control variate update with 8-bit adapter deltas under secure aggregation.
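
One concrete ingredient, quantizing adapter weight deltas to 8 bits before upload, can be sketched as follows. The symmetric per-tensor scheme shown here is an assumption for illustration; the paper's full protocol (control variates, secure aggregation) is not reproduced.

```python
# Hedged sketch: symmetric 8-bit quantization of adapter weight deltas before
# upload, and dequantization at the server (scheme assumed, not from the paper).
import numpy as np

def quantize_delta(delta: np.ndarray):
    """Map a float32 adapter delta to int8 plus a per-tensor scale."""
    scale = np.abs(delta).max() / 127.0 + 1e-12
    q = np.clip(np.round(delta / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_delta(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

delta = np.random.randn(256, 64).astype(np.float32) * 0.01   # toy adapter update
q, s = quantize_delta(delta)
recovered = dequantize_delta(q, s)
print("bytes fp32:", delta.nbytes, "bytes int8:", q.nbytes,
      "max abs error:", float(np.abs(delta - recovered).max()))
```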

Result: Achieved macro F1 score of 0.938, AUC of 0.962, and Expected Calibration Error (ECE) of 0.032 on Baby Chillanto and Donate-a-Cry datasets with ESC-50 noise overlays. Reduced per-round client upload from ~36-42 MB to 3.3 MB. Real-time edge inference on NVIDIA Jetson Nano achieves 96 ms per one-second spectrogram frame.

Conclusion: The system demonstrates a practical path toward privacy-preserving, noise-robust, and communication-efficient infant cry classification suitable for federated deployment, addressing key limitations of traditional audio-based approaches.

Abstract: Infant cry classification can aid early assessment of infant needs. However, deployment of such solutions is limited by privacy concerns around audio data, sensitivity to background noise, and domain shift across recording environments. We present an end-to-end infant cry analysis pipeline that integrates a denoising autoencoder (DAE), a convolutional tokenizer, and a Transformer encoder trained using communication-efficient federated learning (FL). The system performs on-device denoising, adaptive segmentation, post hoc calibration, and energy-based out-of-distribution (OOD) abstention. Federated training employs a regularized control variate update with 8-bit adapter deltas under secure aggregation. Using the Baby Chillanto and Donate-a-Cry datasets with ESC-50 noise overlays, the model achieves a macro F1 score of 0.938, an AUC of 0.962, and an Expected Calibration Error (ECE) of 0.032, while reducing per-round client upload from approximately 36-42 MB to 3.3 MB. Real-time edge inference on an NVIDIA Jetson Nano (4 GB, TensorRT FP16) achieves 96 ms per one-second spectrogram frame. These results demonstrate a practical path toward privacy-preserving, noise-robust, and communication-efficient infant cry classification suitable for federated deployment.

[308] Time-Constrained Recommendations: Reinforcement Learning Strategies for E-Commerce

Sayak Chakrabarty, Souradip Pal

Main category: cs.LG

TL;DR: The paper proposes using reinforcement learning to optimize slate recommendations under user time budget constraints, showing RL methods outperform traditional bandit approaches when users have limited time for evaluation.

DetailsMotivation: Traditional recommendation systems do not account for user time budgets, which are critical in real-world scenarios such as mobile shopping, where users have limited time to evaluate items. Highly relevant items with high evaluation costs may not fit within users' time constraints, reducing engagement.

Method: Formulates time-constrained slate recommendation as Markov Decision Processes with budget-aware utilities, develops a simulation framework to study policy behavior on re-ranking data, and evaluates both on-policy and off-policy reinforcement learning control methods using Alibaba’s Personalized Re-ranking dataset.
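
A toy version of the budget-aware utility helps make the trade-off concrete: under a tight budget, a highly relevant but expensive item can block the rest of the slate. The scan-and-click user model below is an illustrative assumption, not the paper's simulator.

```python
# Toy sketch of a budget-aware slate utility (assumed user model: scan items in
# order, pay each item's evaluation cost, click with probability equal to its
# relevance, stop when the time budget is exhausted).
import numpy as np

rng = np.random.default_rng(0)

def slate_utility(relevance, eval_cost, budget, rng):
    """One stochastic rollout of a slate under a time budget."""
    reward, spent = 0.0, 0.0
    for rel, cost in zip(relevance, eval_cost):
        if spent + cost > budget:          # item does not fit in the remaining time
            break
        spent += cost
        reward += float(rng.random() < rel)    # stochastic click
    return reward

relevance = np.array([0.9, 0.8, 0.6, 0.4])    # hypothetical item relevances
eval_cost = np.array([4.0, 1.0, 1.0, 0.5])    # seconds needed to evaluate each item
print("tight budget :", np.mean([slate_utility(relevance, eval_cost, 3.0, rng) for _ in range(1000)]))
print("loose budget :", np.mean([slate_utility(relevance, eval_cost, 10.0, rng) for _ in range(1000)]))
```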

Result: Empirical evidence shows that reinforcement learning approaches (both on-policy and off-policy control) can improve performance under tight time budgets compared to traditional contextual bandit-based methods.

Conclusion: Reinforcement learning is effective for optimizing slate recommendations under user time constraints, balancing item relevance with evaluation costs to maximize engagement within limited user time budgets.

Abstract: Unlike traditional recommendation tasks, finite user time budgets introduce a critical resource constraint, requiring the recommender system to balance item relevance and evaluation cost. For example, in a mobile shopping interface, users interact with recommendations by scrolling, where each scroll triggers a list of items called slate. Users incur an evaluation cost - time spent assessing item features before deciding to click. Highly relevant items having higher evaluation costs may not fit within the user’s time budget, affecting engagement. In this position paper, our objective is to evaluate reinforcement learning algorithms that learn patterns in user preferences and time budgets simultaneously, crafting recommendations with higher engagement potential under resource constraints. Our experiments explore the use of reinforcement learning to recommend items for users using Alibaba’s Personalized Re-ranking dataset supporting slate optimization in e-commerce contexts. Our contributions include (i) a unified formulation of time-constrained slate recommendation modeled as Markov Decision Processes (MDPs) with budget-aware utilities; (ii) a simulation framework to study policy behavior on re-ranking data; and (iii) empirical evidence that on-policy and off-policy control can improve performance under tight time budgets than traditional contextual bandit-based methods.

[309] RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing

Yuhan Tang, Kangxin Cui, Jung Ho Park, Yibo Zhao, Xuan Jiang, Haoze He, Dingyi Zhuang, Shenhao Wang, Jiangbo Yu, Haris Koutsopoulos, Jinhua Zhao

Main category: cs.LG

TL;DR: RAST-MoE framework uses regime-aware MDP with self-attention Mixture-of-Experts to optimize ride-hailing matching, reducing delays by 10-15% on Uber data.

DetailsMotivation: Ride-hailing platforms struggle with balancing passenger waiting times and system efficiency under uncertain supply-demand conditions. Existing RL approaches oversimplify traffic dynamics or use shallow encoders that miss complex spatiotemporal patterns.

Method: Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE) formalizes adaptive delayed matching as a regime-aware MDP with self-attention MoE encoder. Experts specialize automatically for better representation. Includes physics-informed congestion surrogate for realistic density-speed feedback and adaptive reward scheme.

Result: With only 12M parameters, outperforms strong baselines. On Uber San Francisco data: improves total reward by over 13%, reduces average matching delay by 10% and pickup delay by 15%. Shows robustness across unseen demand regimes and stable training.

Conclusion: MoE-enhanced RL shows strong potential for large-scale decision-making with complex spatiotemporal dynamics in ride-hailing and similar systems.

Abstract: Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply-demand conditions. Adaptive delayed matching creates a trade-off between matching and pickup delays by deciding whether to assign drivers immediately or batch requests. Since outcomes accumulate over long horizons with stochastic dynamics, reinforcement learning (RL) is a suitable framework. However, existing approaches often oversimplify traffic dynamics or use shallow encoders that miss complex spatiotemporal patterns. We introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE), which formalizes adaptive delayed matching as a regime-aware MDP equipped with a self-attention MoE encoder. Unlike monolithic networks, our experts specialize automatically, improving representation capacity while maintaining computational efficiency. A physics-informed congestion surrogate preserves realistic density-speed feedback, enabling millions of efficient rollouts, while an adaptive reward scheme guards against pathological strategies. With only 12M parameters, our framework outperforms strong baselines. On real-world Uber trajectory data (San Francisco), it improves total reward by over 13%, reducing average matching and pickup delays by 10% and 15% respectively. It demonstrates robustness across unseen demand regimes and stable training. These findings highlight the potential of MoE-enhanced RL for large-scale decision-making with complex spatiotemporal dynamics.

[310] CurvaDion: Curvature-Adaptive Distributed Orthonormalization

Bhavesh Kumar, Roger Jin, Jeffrey Quesnelle

Main category: cs.LG

TL;DR: CurvaDion reduces communication in distributed training by detecting when synchronization is needed using momentum-based curvature detection, achieving 99% communication reduction while maintaining convergence.

DetailsMotivation: As language models scale to trillions of parameters, gradient synchronization becomes a critical bottleneck in distributed training. Current methods synchronize at every step regardless of optimization landscape, but synchronization needs vary dramatically throughout training - redundant in flat regions but essential in high-curvature regions.

Method: Introduces CurvaDion which uses Relative Maximum Momentum Change (RMMC) to detect high-curvature regions requiring synchronization. RMMC leverages momentum dynamics already computed during optimization as a computationally tractable proxy for directional curvature, adding only O(d) operations per layer.
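
The summary does not give the exact RMMC formula, but the idea of using momentum shifts as a curvature proxy can be sketched as below: take the maximum relative change of each layer's momentum between steps and synchronize only when it exceeds a threshold. Treat this as one plausible reading, not the paper's definition.

```python
# Hedged sketch of a momentum-based synchronization trigger in the spirit of RMMC
# (the exact definition is not reproduced here).
import torch

def relative_momentum_change(prev_momenta, curr_momenta, eps=1e-12):
    """O(d) per layer: max over layers of ||m_t - m_{t-1}|| / ||m_{t-1}||."""
    changes = [(curr - prev).norm() / (prev.norm() + eps)
               for prev, curr in zip(prev_momenta, curr_momenta)]
    return max(changes)

def maybe_synchronize(prev_momenta, curr_momenta, threshold=0.5):
    """Return True when the momentum shift suggests a high-curvature region,
    i.e. when workers should perform a full synchronization this step."""
    rmmc = relative_momentum_change(prev_momenta, curr_momenta)
    return bool(rmmc > threshold), float(rmmc)

# Usage inside a training loop (momenta taken from the optimizer state, one tensor
# per layer); communicate only on steps where `sync` is True:
# sync, score = maybe_synchronize(prev_momenta, curr_momenta)
```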

Result: CurvaDion achieves 99% communication reduction while matching baseline convergence across models from 160M to 1.3B parameters. Theoretical connections between RMMC and loss curvature are established.

Conclusion: CurvaDion provides an efficient method to dramatically reduce communication overhead in distributed training by intelligently synchronizing only when necessary, based on optimization landscape characteristics detected through momentum dynamics.

Abstract: As language models scale to trillions of parameters, distributed training across many GPUs becomes essential, yet gradient synchronization over high-bandwidth, low-latency networks remains a critical bottleneck. While recent methods like Dion reduce per-step communication through low-rank updates, they synchronize at every step regardless of the optimization landscape. We observe that synchronization requirements vary dramatically throughout training: workers naturally compute similar gradients in flat regions, making frequent synchronization redundant, while high-curvature regions require coordination to prevent divergence. We introduce CurvaDion, which uses Relative Maximum Momentum Change (RMMC) to detect high-curvature regions requiring synchronization. RMMC leverages momentum dynamics which are already computed during optimization as a computationally tractable proxy for directional curvature, adding only $\mathcal{O}(d)$ operations per layer. We establish theoretical connections between RMMC and loss curvature and demonstrate that CurvaDion achieves 99% communication reduction while matching baseline convergence across models from 160M to 1.3B parameters.

[311] Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution

Jacob Schnell, Aditya Makkar, Gunadi Gani, Aniket Srinivasan Ashok, Darren Lo, Mike Optis, Alexander Wong, Yuhao Chen

Main category: cs.LG

TL;DR: The paper introduces WindDM, a diffusion model for wind data super-resolution that uses novel composite classifier-free guidance (CCFG) to handle multiple conditioning variables, achieving state-of-the-art results at dramatically lower cost than classical methods.

DetailsMotivation: High-resolution wind data is essential for weather modeling applications but expensive to acquire. Traditional reconstruction methods face a trade-off between cost-effectiveness and accuracy, while existing deep learning approaches don't handle the multi-channel nature of wind data well (10+ channels vs. 3-channel RGB images).

Method: Developed WindDM, a diffusion model for wind dynamics reconstruction. Introduced composite classifier-free guidance (CCFG) - a generalization of standard classifier-free guidance that can handle multiple conditioning inputs. CCFG can be integrated into any pre-trained diffusion model trained with standard CFG dropout.
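
One plausible reading of composite classifier-free guidance is to add a separately weighted guidance term per conditioning input on top of the unconditional prediction, as sketched below; the paper's exact weighting scheme may differ.

```python
# Hedged sketch: standard CFG vs. a multi-condition composite combination.
import torch

def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance with a single condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def composite_cfg(eps_uncond, eps_conds, weights):
    """Combine several conditioning signals, each with its own guidance weight."""
    guided = eps_uncond.clone()
    for eps_c, w in zip(eps_conds, weights):
        guided = guided + w * (eps_c - eps_uncond)
    return guided

# eps_uncond              : model output with all conditions dropped
# eps_conds = [e1, e2, ..]: outputs with each condition (e.g. coarse wind field,
#                           terrain, time of day) enabled on its own
# denoised = composite_cfg(eps_uncond, eps_conds, weights=[1.5, 1.0, 0.5])
```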

Result: CCFG produces higher-fidelity outputs than standard CFG on wind super-resolution tasks. WindDM achieves state-of-the-art reconstruction quality among deep learning models and costs up to 1000× less than classical methods for industrial-scale wind dynamics reconstruction.

Conclusion: The proposed CCFG approach effectively handles multiple conditioning variables in diffusion models, enabling high-quality wind data reconstruction at dramatically reduced costs compared to traditional methods, making high-resolution wind data more accessible for various applications.

Abstract: Various weather modelling problems (e.g., weather forecasting, optimizing turbine placements, etc.) require ample access to high-resolution, highly accurate wind data. Acquiring such high-resolution wind data, however, remains a challenging and expensive endeavour. Traditional reconstruction approaches are typically either cost-effective or accurate, but not both. Deep learning methods, including diffusion models, have been proposed to resolve this trade-off by leveraging advances in natural image super-resolution. Wind data, however, is distinct from natural images, and wind super-resolvers often use upwards of 10 input channels, significantly more than the usual 3-channel RGB inputs in natural images. To better leverage a large number of conditioning variables in diffusion models, we present a generalization of classifier-free guidance (CFG) to multiple conditioning inputs. Our novel composite classifier-free guidance (CCFG) can be dropped into any pre-trained diffusion model trained with standard CFG dropout. We demonstrate that CCFG outputs are higher-fidelity than those from CFG on wind super-resolution tasks. We present WindDM, a diffusion model trained for industrial-scale wind dynamics reconstruction and leveraging CCFG. WindDM achieves state-of-the-art reconstruction quality among deep learning models and costs up to $1000\times$ less than classical methods.

[312] PIS: A Generalized Physical Inversion Solver for Arbitrary Sparse Observations via Set-Conditioned Diffusion

Weijie Yang, Xun Zhang

Main category: cs.LG

TL;DR: PIS is a set-conditioned diffusion framework that enables robust physical parameter inversion from arbitrary, sparse observations with uncertainty quantification, outperforming existing methods even at extreme sparsity levels (0.29% observation rate).

DetailsMotivation: Physical parameter estimation from limited indirect measurements is ill-posed in many real-world applications (fluid mechanics, seismic inversion, structural monitoring). Existing deep learning methods fail under sparse, irregular observations due to fixed-grid assumptions, poor reconstruction, lack of robustness, and no uncertainty quantification.

Method: PIS uses a set-conditioned diffusion framework with Set Transformer-based encoder to handle arbitrary observation geometry/number, cosine-annealed sparsity curriculum for robustness, and information-theoretic analysis to understand inversion limits under extreme sparsity.

Result: PIS reduces inversion error by 12.28%-88.73% across Darcy flow, Helmholtz wavefield inversion, and structural health monitoring tasks, remaining stable even at 0.29% observation rate while producing calibrated posterior samples that reflect data scarcity and physical ambiguity.

Conclusion: PIS is a powerful, general-purpose solution for physical inversion under arbitrary and severely undersampled observations, offering sparsity-resilient performance with uncertainty quantification where existing methods fail.

Abstract: Estimation of PDE-constrained physical parameters from limited indirect measurements is inherently ill-posed, particularly when observations are sparse, irregular, and constrained by real-world sensor placement. This challenge is ubiquitous in fields such as fluid mechanics, seismic inversion, and structural health monitoring. Existing deep and operator-learning models collapse under these conditions: fixed-grid assumptions fail, reconstruction deteriorates sharply, and inversion becomes unreliable with limited robustness and no uncertainty quantification (UQ). We propose the Physical Inversion Solver (PIS), a set-conditioned diffusion framework enabling inversion from truly arbitrary observation sets. PIS employs a Set Transformer-based encoder to handle measurements of any number or geometry, and a cosine-annealed sparsity curriculum for exceptional robustness. An accompanying information-theoretic analysis provides insight into the limits of inversion under extreme sparsity by revealing how observation entropy varies across physical systems. PIS is evaluated on three challenging PDE inverse problems: Darcy flow, wavefield inversion (Helmholtz), and structural health monitoring (Hooke’s Law). Across all tasks and sparsity regimes – including extreme cases with an observation rate of only 0.29% – existing operator-learning baselines fail to reconstruct meaningful fields, often diverging or collapsing entirely. In stark contrast, PIS remains stable and accurate, reducing inversion error by 12.28%–88.73% and reliably producing calibrated posterior samples. These samples accurately reflect both data scarcity and intrinsic physical ambiguity. These results position PIS as a powerful, general-purpose, and uniquely sparsity-resilient solution for physical inversion under arbitrary and severely undersampled observations.

[313] Low-Rank Compression of Language Models via Differentiable Rank Selection

Sidhant Sundrani, Francesco Tudisco, Pasquale Minervini

Main category: cs.LG

TL;DR: LLRC is a gradient-based method that learns mask weights to select optimal singular values for low-rank compression of LLMs without fine-tuning, outperforming existing fine-tuning-free approaches.

DetailsMotivation: Current methods for selecting optimal ranks in low-rank LLM compression either use heuristics with limited search space or gradient-based approaches that require post-compression fine-tuning to be effective. There's a need for a fine-tuning-free approach that can directly learn optimal rank selection.

Method: LLRC learns mask weights that select singular values using a calibration dataset. It trains only the mask weights to minimize activation divergence from the original model while selecting fewer singular values, without requiring post-compression fine-tuning.
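
The rank-selection idea can be sketched as a sigmoid-gated mask over the singular values of a frozen weight matrix, where only the mask logits are trained against an activation-matching loss plus a sparsity penalty. The loss terms and their weighting below are illustrative, not the paper's exact objective.

```python
# Hedged sketch of a differentiable mask over singular values (only the mask
# logits are trained; the decomposed weight stays frozen).
import torch
import torch.nn as nn

class MaskedLowRankLinear(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.mask_logits = nn.Parameter(torch.full_like(S, 3.0))  # start nearly open

    def forward(self, x):
        gate = torch.sigmoid(self.mask_logits)        # soft selection of singular values
        W = self.U @ torch.diag(self.S * gate) @ self.Vh
        return x @ W.T

    def sparsity(self):
        return torch.sigmoid(self.mask_logits).sum()  # ~number of kept singular values

# Calibration loop (outline): match the original layer's activations while pushing
# the gate toward keeping fewer singular values.
# layer = MaskedLowRankLinear(original_linear.weight.data)
# loss  = ((layer(x) - original_linear(x)).pow(2).mean() + lam * layer.sparsity())
```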

Result: LLRC outperforms competing fine-tuning-free rank selection methods across various compression rates on common-sense reasoning and QA tasks. At 20% compression on Llama-2-13B, it beats STRS by 12% on MMLU, 3.5% on BoolQ, and 4.4% on OpenbookQA. It also consistently outperforms fine-tuning-free variants of SVD-LLM and LLM-Pruner.

Conclusion: LLRC provides an effective gradient-based approach for low-rank compression that doesn’t require fine-tuning, achieving better performance than existing fine-tuning-free methods and competitive performance with fine-tuning-based approaches.

Abstract: Approaches for compressing large-language models using low-rank decomposition have made strides, particularly with the introduction of activation and loss-aware SVD, which improves the trade-off between decomposition rank and downstream task performance. Despite these advancements, a persistent challenge remains–selecting the optimal ranks for each layer to jointly optimise compression rate and downstream task accuracy. Current methods either rely on heuristics that can yield sub-optimal results due to their limited discrete search space or are gradient-based but are not as performant as heuristic approaches without post-compression fine-tuning. To address these issues, we propose Learning to Low-Rank Compress (LLRC), a gradient-based approach which directly learns the weights of masks that select singular values in a fine-tuning-free setting. Using a calibration dataset, we train only the mask weights to select fewer and fewer singular values while minimising the divergence of intermediate activations from the original model. Our approach outperforms competing ranking selection methods that similarly require no post-compression fine-tuning across various compression rates on common-sense reasoning and open-domain question-answering tasks. For instance, with a compression rate of 20% on Llama-2-13B, LLRC outperforms the competitive Sensitivity-based Truncation Rank Searching (STRS) on MMLU, BoolQ, and OpenbookQA by 12%, 3.5%, and 4.4%, respectively. Compared to other compression techniques, our approach consistently outperforms fine-tuning-free variants of SVD-LLM and LLM-Pruner across datasets and compression rates. Our fine-tuning-free approach also performs competitively with the fine-tuning variant of LLM-Pruner.

[314] Plug-and-Play Parameter-Efficient Tuning of Embeddings for Federated Recommendation

Haochen Yuan, Yang Zhang, Xiang He, Quan Z. Sheng, Zhongjie Wang

Main category: cs.LG

TL;DR: Proposes a federated recommendation framework with parameter-efficient fine-tuning for embeddings to reduce communication overhead while maintaining accuracy.

DetailsMotivation: Federated recommendation systems face communication inefficiency due to large embedding parameters, but existing approaches overlook embedding parameter overhead.

Method: A PEFT-based embedding framework that integrates LoRA, Hash-based encoding, and novel RQ-VAE techniques as lightweight plugins for existing FR methods.
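
As an illustration of the plugin idea, a LoRA-style wrapper around a frozen item-embedding table is sketched below; only the small low-rank factors would need to be trained and transmitted each round. Shapes and rank are hypothetical, and the hash-based and RQ-VAE variants are not shown.

```python
# Hedged sketch of a LoRA-style embedding plugin for federated recommendation:
# the full item embedding table stays frozen and is never uploaded.
import torch
import torch.nn as nn

class LoRAEmbedding(nn.Module):
    def __init__(self, num_items: int, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Embedding(num_items, dim)
        self.base.weight.requires_grad_(False)            # frozen, never transmitted
        self.A = nn.Parameter(torch.zeros(num_items, rank))
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.01)

    def forward(self, item_ids):
        return self.base(item_ids) + self.A[item_ids] @ self.B

emb = LoRAEmbedding(num_items=100_000, dim=64, rank=8)
full = emb.base.weight.numel()
lora = emb.A.numel() + emb.B.numel()
print(f"parameters uploaded per round: {lora:,} vs full table {full:,}")
```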

Result: Extensive experiments show significant communication overhead reduction with improved accuracy across various FR backbones and datasets.

Conclusion: The framework provides an effective solution for communication-efficient federated recommendation through parameter-efficient embedding fine-tuning.

Abstract: With the rise of cloud-edge collaboration, recommendation services are increasingly trained in distributed environments. Federated Recommendation (FR) enables such multi-end collaborative training while preserving privacy by sharing model parameters instead of raw data. However, the large number of parameters, primarily due to the massive item embeddings, significantly hampers communication efficiency. While existing studies mainly focus on improving the efficiency of FR models, they largely overlook the issue of embedding parameter overhead. To address this gap, we propose a FR training framework with Parameter-Efficient Fine-Tuning (PEFT) based embedding designed to reduce the volume of embedding parameters that need to be transmitted. Our approach offers a lightweight, plugin-style solution that can be seamlessly integrated into existing FR methods. In addition to incorporating common PEFT techniques such as LoRA and Hash-based encoding, we explore the use of Residual Quantized Variational Autoencoders (RQ-VAE) as a novel PEFT strategy within our framework. Extensive experiments across various FR model backbones and datasets demonstrate that our framework significantly reduces communication overhead while improving accuracy. The source code is available at https://github.com/young1010/FedPEFT.

[315] DARTs: A Dual-Path Robust Framework for Anomaly Detection in High-Dimensional Multivariate Time Series

Xuechun Liu, Heli Sun, Xuecheng Wu, Ruichen Cao, Yunyun Shi, Dingkang Yang, Haoran Li

Main category: cs.LG

TL;DR: DARTs is a robust dual-path framework for multivariate time series anomaly detection that captures both short-term and long-term spatiotemporal patterns in high-dimensional noisy data through adaptive graph learning and window-aware fusion mechanisms.

DetailsMotivation: Existing MTSAD approaches work well in low-dimensional scenarios but fail to robustly capture long-range spatiotemporal dependencies in high-dimensional noisy time series, which is critical for industrial control systems.

Method: A dual-path framework with short-term path (Multi-View Sparse Graph Learner + Diffusion Multi-Relation Graph Unit) for hierarchical short-term patterns, long-term path (Multi-Scale Spatiotemporal Graph Constructor) for long-term dynamics, and window-aware spatiotemporal soft-fusion mechanism to integrate patterns while filtering noise.

Result: Extensive experiments across mainstream datasets demonstrate superiority and robustness of DARTs, with ablation studies validating the importance of proposed components.

Conclusion: DARTs effectively addresses limitations of existing approaches by robustly capturing both short-term and long-term spatiotemporal dependencies in high-dimensional noisy multivariate time series for industrial anomaly detection.

Abstract: Multivariate time series anomaly detection (MTSAD) aims to accurately identify and localize complex abnormal patterns in the large-scale industrial control systems. While existing approaches excel in recognizing the distinct patterns under the low-dimensional scenarios, they often fail to robustly capture long-range spatiotemporal dependencies when learning representations from the high-dimensional noisy time series. To address these limitations, we propose DARTs, a robust long short-term dual-path framework with window-aware spatiotemporal soft fusion mechanism, which can be primarily decomposed into three complementary components. Specifically, in the short-term path, we introduce a Multi-View Sparse Graph Learner and a Diffusion Multi-Relation Graph Unit that collaborate to adaptively capture hierarchical discriminative short-term spatiotemporal patterns in the high-noise time series. While in the long-term path, we design a Multi-Scale Spatiotemporal Graph Constructor to model salient long-term dynamics within the high-dimensional representation space. Finally, a window-aware spatiotemporal soft-fusion mechanism is introduced to filter the residual noise while seamlessly integrating anomalous patterns. Extensive qualitative and quantitative experimental results across mainstream datasets demonstrate the superiority and robustness of our proposed DARTs. A series of ablation studies are also conducted to explore the crucial design factors of our proposed components. Our code and model will be made publicly open soon.

[316] TF-MCL: Time-frequency Fusion and Multi-domain Cross-Loss for Self-supervised Depression Detection

Li-Xuan Zhao, Chen-Yang Xu, Wen-Qiang Li, Bo Wang, Rong-Xing Wei, Qing-Hao Menga

Main category: cs.LG

TL;DR: A time-frequency fusion and multi-domain cross-loss (TF-MCL) model for major depressive disorder detection using EEG signals, leveraging contrastive learning to overcome labeling challenges and improve time-frequency representation learning.

DetailsMotivation: Supervised MDD detection methods heavily rely on labeled EEG data, which is challenging to obtain. Existing contrastive learning methods fail to adequately capture the time-frequency characteristics of EEG signals and produce insufficiently discriminative representations for MDD detection tasks.

Method: Proposes TF-MCL model with: 1) Fusion Mapping Head (FMH) to remap time-frequency domain information to a fusion domain, creating hybrid representations; 2) Multi-domain cross-loss function optimization to reconstruct representation distributions across time, frequency, and fusion domains.

Result: Significant performance improvements on MODMA and PRED+CT datasets, outperforming state-of-the-art methods by 5.87% and 9.96% accuracy respectively.

Conclusion: TF-MCL effectively addresses limitations of existing contrastive learning methods for EEG-based MDD detection by better capturing time-frequency characteristics and improving representation learning, demonstrating superior performance on benchmark datasets.

Abstract: In recent years, there has been a notable increase in the use of supervised detection methods of major depressive disorder (MDD) based on electroencephalogram (EEG) signals. However, the process of labeling MDD remains challenging. As a self-supervised learning method, contrastive learning could address the shortcomings of supervised learning methods, which are unduly reliant on labels in the context of MDD detection. However, existing contrastive learning methods are not specifically designed to characterize the time-frequency distribution of EEG signals, and their capacity to acquire low-semantic data representations is still inadequate for MDD detection tasks. To address the problem of contrastive learning method, we propose a time-frequency fusion and multi-domain cross-loss (TF-MCL) model for MDD detection. TF-MCL generates time-frequency hybrid representations through the use of a fusion mapping head (FMH), which efficiently remaps time-frequency domain information to the fusion domain, and thus can effectively enhance the model’s capacity to synthesize time-frequency information. Moreover, by optimizing the multi-domain cross-loss function, the distribution of the representations in the time-frequency domain and the fusion domain is reconstructed, thereby improving the model’s capacity to acquire fusion representations. We evaluated the performance of our model on the publicly available datasets MODMA and PRED+CT and show a significant improvement in accuracy, outperforming the existing state-of-the-art (SOTA) method by 5.87% and 9.96%, respectively.

[317] The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models

Md. Hasib Ur Rahman

Main category: cs.LG

TL;DR: The paper introduces the Laminar Flow Hypothesis and Semantic Turbulence concept to detect jailbreaking attacks on LLMs by analyzing internal reasoning dynamics, showing it works as a lightweight real-time detector and diagnostic tool.

DetailsMotivation: Current LLM defense strategies against adversarial jailbreaking attacks rely on expensive external classifiers or brittle lexical filters, overlooking the intrinsic dynamics of the model's reasoning process. There's a need for more effective detection methods that understand internal model behavior.

Method: Introduces the Laminar Flow Hypothesis: benign inputs cause smooth transitions in LLM’s latent space, while adversarial prompts trigger chaotic trajectories (Semantic Turbulence). Formalizes this with a zero-shot metric: variance of layer-wise cosine velocity. Evaluates across diverse small language models to measure turbulence patterns.
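
The metric itself is simple to compute from hidden states. The sketch below uses a HuggingFace causal LM and takes the variance of cosine similarities between consecutive layers' last-token states; the checkpoint name and the choice of the last token are assumptions for illustration.

```python
# Sketch of the turbulence metric as described (variance of layer-wise cosine
# velocity); not the author's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-1.5B-Instruct"          # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def semantic_turbulence(prompt: str) -> float:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states        # tuple: embeddings + each layer
    last_tok = [h[0, -1] for h in hidden]             # last-token state per layer
    velocities = [
        torch.nn.functional.cosine_similarity(a, b, dim=0)
        for a, b in zip(last_tok[:-1], last_tok[1:])
    ]
    return torch.stack(velocities).var().item()       # high variance = "turbulent"

print(semantic_turbulence("How do I bake a loaf of bread?"))
```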

Result: RLHF-aligned Qwen2-1.5B shows 75.4% increase in turbulence under attack (p<0.001), validating internal conflict hypothesis. Gemma-2B shows 22.0% decrease in turbulence, indicating a distinct “reflex-based” refusal mechanism. Semantic Turbulence serves as effective jailbreak detector and diagnostic tool.

Conclusion: Semantic Turbulence provides a lightweight, real-time jailbreak detection method and non-invasive diagnostic tool for categorizing black-box model safety architectures, offering insights into internal model dynamics during adversarial attacks.

Abstract: As Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial “jailbreaking” attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical filters, overlooking the intrinsic dynamics of the model’s reasoning process. In this work, the Laminar Flow Hypothesis is introduced, which posits that benign inputs induce smooth, gradual transitions in an LLM’s high-dimensional latent space, whereas adversarial prompts trigger chaotic, high-variance trajectories - termed Semantic Turbulence - resulting from the internal conflict between safety alignment and instruction-following objectives. This phenomenon is formalized through a novel, zero-shot metric: the variance of layer-wise cosine velocity. Experimental evaluation across diverse small language models reveals a striking diagnostic capability. The RLHF-aligned Qwen2-1.5B exhibits a statistically significant 75.4% increase in turbulence under attack (p < 0.001), validating the hypothesis of internal conflict. Conversely, Gemma-2B displays a 22.0% decrease in turbulence, characterizing a distinct, low-entropy “reflex-based” refusal mechanism. These findings demonstrate that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool for categorizing the underlying safety architecture of black-box models.

[318] Comparative Evaluation of Embedding Representations for Financial News Sentiment Analysis

Joyjit Roy, Samaresh Kumar Singh

Main category: cs.LG

TL;DR: Embedding-based methods for financial sentiment analysis fail when data is scarce, performing worse than trivial baselines despite good validation metrics.

DetailsMotivation: Financial sentiment analysis is valuable but standard NLP approaches struggle with small datasets, creating challenges for resource-constrained environments.

Method: Comparative evaluation of Word2Vec, GloVe, and sentence transformer embeddings combined with gradient boosting on manually labeled financial headlines.

Result: Models performed worse than trivial baselines on test data despite strong validation metrics, showing pretrained embeddings have diminishing returns below critical data thresholds.

Conclusion: Embedding quality alone cannot solve data scarcity; practitioners need alternative approaches like few-shot learning, data augmentation, or lexicon-enhanced methods for small datasets.

Abstract: Financial sentiment analysis enhances market understanding; however, standard natural language processing approaches encounter significant challenges when applied to small datasets. This study provides a comparative evaluation of embedding-based methods for financial news sentiment classification in resource-constrained environments. Word2Vec, GloVe, and sentence transformer representations are evaluated in combination with gradient boosting on manually labeled headlines. Experimental results identify a substantial gap between validation and test performance, with models performing worse than trivial baselines despite strong validation metrics. The analysis demonstrates that pretrained embeddings yield diminishing returns below a critical data sufficiency threshold, and that small validation sets contribute to overfitting during model selection. Practical application is illustrated through weekly sentiment aggregation and narrative summarization for market monitoring workflows. The findings offer empirical evidence that embedding quality alone cannot address fundamental data scarcity in sentiment classification. For practitioners operating with limited resources, the results indicate the need to consider alternative approaches such as few-shot learning, data augmentation, or lexicon-enhanced hybrid methods when labeled samples are scarce.

[319] MIDUS: Memory-Infused Depth Up-Scaling

Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song

Main category: cs.LG

TL;DR: MIDUS introduces head-wise memory layers instead of FFNs for depth up-scaling, improving efficiency and performance over traditional DUS approaches.

DetailsMotivation: Current Depth Up-Scaling (DUS) methods duplicate layers with feed-forward networks (FFNs), which limits efficiency and performance gains. There's a need for more efficient scaling approaches that don't incur excessive parameter growth or inference costs.

Method: MIDUS replaces FFNs in duplicated blocks with head-wise memory (HML) layers. Each attention head gets an independent memory bank for head-wise retrieval, preserving functional structure while injecting information into subsequent layers. Includes efficient per-head value factorization module for sparse memory access.

Result: MIDUS shows robust performance improvements over strong DUS baselines while maintaining highly efficient parameter footprint during Continual Pre-training experiments.

Conclusion: MIDUS establishes itself as a compelling, resource-efficient alternative to conventional FFN replication for depth up-scaling by leveraging head-wise memory design that relaxes the efficiency-performance trade-off.

Abstract: Scaling large language models (LLMs) demands approaches that increase capacity without incurring excessive parameter growth or inference cost. Depth Up-Scaling (DUS) has emerged as a promising strategy by duplicating layers and applying Continual Pre-training (CPT), but its reliance on feed-forward networks (FFNs) limits efficiency and attainable gains. We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces FFNs in duplicated blocks with a head-wise memory (HML) layer. Motivated by observations that attention heads have distinct roles both across and within layers, MIDUS assigns an independent memory bank to each head, enabling head-wise retrieval and injecting information into subsequent layers while preserving head-wise functional structure. This design combines sparse memory access with head-wise representations and incorporates an efficient per-head value factorization module, thereby relaxing the usual efficiency-performance trade-off. Across our CPT experiments, MIDUS exhibits robust performance improvements over strong DUS baselines while maintaining a highly efficient parameter footprint. Our findings establish MIDUS as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling by leveraging its head-wise memory design.

[320] Network-Wide Traffic Volume Estimation from Speed Profiles using a Spatio-Temporal Graph Neural Network with Directed Spatial Attention

Léo Hein, Giovanni de Nunzio, Giovanni Chierchia, Aurélie Pirayre, Laurent Najman

Main category: cs.LG

TL;DR: HDA-STGNN is a graph neural network that uses speed data and road attributes to estimate network-wide traffic volumes without needing volume sensors at inference time.

DetailsMotivation: Existing traffic volume estimation methods either focus on forecasting (ignoring unmonitored roads) or spatial imputation (requiring volume data at inference time). Both approaches are limited in sensor-scarce cities, while probe vehicle speeds and static road attributes are more widely available.

Method: Hybrid Directed-Attention Spatio-Temporal Graph Neural Network (HDA-STGNN) - an inductive deep learning framework that leverages speed profiles, static road attributes, and road network topology to predict daily traffic volume profiles across all road segments.

Result: Extensive ablation studies demonstrate the model’s capacity to capture complex spatio-temporal dependencies and highlight the value of topological information for accurate network-wide traffic volume estimation without relying on volume data at inference time.

Conclusion: HDA-STGNN provides a practical solution for network-wide traffic volume estimation in sensor-scarce cities by leveraging widely available speed data and road attributes instead of requiring volume sensors at inference time.

Abstract: Existing traffic volume estimation methods typically address either forecasting traffic on sensor-equipped roads or spatially imputing missing volumes using nearby sensors. While forecasting models generally disregard unmonitored roads by design, spatial imputation methods explicitly address network-wide estimation; yet this approach relies on volume data at inference time, limiting its applicability in sensor-scarce cities. Unlike traffic volume data, probe vehicle speeds and static road attributes are more broadly accessible and support full coverage of road segments in most urban networks. In this work, we present the Hybrid Directed-Attention Spatio-Temporal Graph Neural Network (HDA-STGNN), an inductive deep learning framework designed to tackle the network-wide volume estimation problem. Our approach leverages speed profiles, static road attributes, and road network topology to predict daily traffic volume profiles across all road segments in the network. To evaluate the effectiveness of our approach, we perform extensive ablation studies that demonstrate the model’s capacity to capture complex spatio-temporal dependencies and highlight the value of topological information for accurate network-wide traffic volume estimation without relying on volume data at inference time.

[321] Enhancing Semi-Supervised Multi-View Graph Convolutional Networks via Supervised Contrastive Learning and Self-Training

Huaiyuan Xiao, Fadi Dornaika, Jingjun Bi

Main category: cs.LG

TL;DR: MV-SupGCN: A semi-supervised GCN model that integrates complementary components for multi-view learning, combining supervised contrastive loss, robust graph construction, and unified contrastive learning with pseudo-labeling.

DetailsMotivation: Existing GCN-based multi-view learning methods fail to fully exploit complementary information across views, leading to suboptimal feature representations and limited performance. There's a need to better integrate structural information from heterogeneous views while improving model generalization.

Method: 1) Joint loss combining Cross-Entropy with Supervised Contrastive loss to minimize intra-class variance and maximize inter-class separability. 2) Combined KNN-based and semi-supervised graph construction for robustness. 3) Unified framework integrating contrastive learning for multi-view consistency and pseudo-labeling for additional supervision.
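
The first component, a joint cross-entropy plus supervised contrastive objective, can be sketched with a standard SupCon-style loss over batch embeddings, as below; the paper's exact variant and weighting may differ.

```python
# Hedged sketch of the joint objective: cross-entropy plus a supervised
# contrastive term over batch embeddings (standard SupCon-style formulation).
import torch
import torch.nn.functional as F

def supervised_contrastive(z, labels, temperature=0.1):
    """Pull same-class embeddings together, push different-class embeddings apart."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temperature
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float("-inf"))      # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    pos_counts = positives.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~positives, 0.0)).sum(dim=1) / pos_counts
    return loss.mean()

def joint_loss(logits, z, labels, lam=0.5):
    """Cross-entropy on the classifier outputs plus the contrastive term."""
    return F.cross_entropy(logits, labels) + lam * supervised_contrastive(z, labels)
```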

Result: MV-SupGCN consistently surpasses state-of-the-art methods across multiple benchmarks, demonstrating the effectiveness of the integrated approach in improving multi-view learning performance.

Conclusion: The proposed MV-SupGCN framework successfully addresses limitations in existing multi-view GCN methods by integrating complementary components that mutually reinforce each other, leading to superior performance through better exploitation of view complementarity and improved generalization.

Abstract: The advent of graph convolutional network (GCN)-based multi-view learning provides a powerful framework for integrating structural information from heterogeneous views, enabling effective modeling of complex multi-view data. However, existing methods often fail to fully exploit the complementary information across views, leading to suboptimal feature representations and limited performance. To address this, we propose MV-SupGCN, a semi-supervised GCN model that integrates several complementary components with clear motivations and mutual reinforcement. First, to better capture discriminative features and improve model generalization, we design a joint loss function that combines Cross-Entropy loss with Supervised Contrastive loss, encouraging the model to simultaneously minimize intra-class variance and maximize inter-class separability in the latent space. Second, recognizing the instability and incompleteness of single graph construction methods, we combine both KNN-based and semi-supervised graph construction approaches on each view, thereby enhancing the robustness of the data structure representation and reducing generalization error. Third, to effectively utilize abundant unlabeled data and enhance semantic alignment across multiple views, we propose a unified framework that integrates contrastive learning in order to enforce consistency among multi-view embeddings and capture meaningful inter-view relationships, together with pseudo-labeling, which provides additional supervision applied to both the cross-entropy and contrastive loss functions to enhance model generalization. Extensive experiments demonstrate that MV-SupGCN consistently surpasses state-of-the-art methods across multiple benchmarks, validating the effectiveness of our integrated approach. The source code is available at https://github.com/HuaiyuanXiao/MVSupGCN

[322] Constrained Policy Optimization via Sampling-Based Weight-Space Projection

Shengfan Cao, Francesco Borrelli

Main category: cs.LG

TL;DR: SCPO: A safe constrained policy optimization method that projects gradient updates in parameter space using sampling and smoothness bounds to ensure safety without constraint gradients.

DetailsMotivation: Safety-critical learning requires policies that improve performance while staying within safe operating regimes, especially when dealing with unknown, rollout-based safety constraints that cannot be violated during training.

Method: Proposes SCPO, a sampling-based weight-space projection method that constructs local safe regions using trajectory rollouts and smoothness bounds. Each gradient update is projected via a convex Second-Order Cone Program (SOCP) to produce safe first-order steps without requiring gradient access to constraint functions.
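
A toy version of the projection step is sketched below with cvxpy: the raw gradient step is projected onto a region where each sampled, linearized safety constraint holds with a smoothness margin, which is a small second-order cone program. The constraint construction and numbers are illustrative, not the paper's formulation.

```python
# Hedged toy version of the safe weight-space projection (SOCP).
import cvxpy as cp
import numpy as np

d = 10                                  # toy parameter dimension
g = np.random.randn(d)                  # raw (potentially unsafe) gradient step

# Each sampled constraint i: a_i @ delta + L_i * ||delta|| <= b_i
# (linearized change in the safety metric plus a smoothness bound stays within slack b_i)
A = np.random.randn(3, d)
L = np.array([0.5, 0.5, 0.5])
b = np.array([0.2, 0.3, 0.25])

delta = cp.Variable(d)
constraints = [cp.SOC(b[i] - A[i] @ delta, L[i] * delta) for i in range(len(b))]
problem = cp.Problem(cp.Minimize(cp.sum_squares(delta - g)), constraints)
problem.solve()
print("projected safe step:", delta.value)
```

Because delta = 0 trivially satisfies every constraint when the slacks are positive, the projection is always feasible in this toy setup, mirroring the safe-by-induction argument that a safe policy can always remain where it is.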

Result: The approach provides safe-by-induction guarantees from any safe initialization, maintains feasibility throughout training, rejects unsafe updates, and achieves meaningful primal objective improvement in constrained control settings with stabilizing backup policies.

Conclusion: SCPO enables safe constrained policy learning by directly enforcing safety in parameter space through sampling-based projections, ensuring both safety and performance improvement without requiring constraint gradients.

Abstract: Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful primal objective improvement.

[323] EEG-D3: A Solution to the Hidden Overfitting Problem of Deep Learning Models

Siegfried Ludwig, Stylianos Bakas, Konstantinos Barmpas, Georgios Zoumpourlis, Dimitrios A. Adamos, Nikolaos Laskaris, Yannis Panagakis, Stefanos Zafeiriou

Main category: cs.LG

TL;DR: D3 is a weakly supervised method that disentangles EEG components to prevent hidden overfitting and improve generalization to real applications.

DetailsMotivation: Deep learning models for EEG decoding show good benchmark performance but fail to generalize to real applications due to hidden overfitting to task-correlated artifacts.

Method: Disentangled Decoding Decomposition (D3) - a weakly supervised method that predicts trial sequence position to separate latent brain activity components, using independent sub-networks for interpretability.

Result: D3 reliably separates latent brain activity components, prevents hidden overfitting, enables few-shot learning on sleep stage classification, and improves generalization to real-world applications.

Conclusion: D3 provides a tool to distinguish genuine brain activity from artifacts, addressing the hidden overfitting problem and enabling better generalization with minimal labeled data.

Abstract: Deep learning for decoding EEG signals has gained traction, with many claims to state-of-the-art accuracy. However, despite the convincing benchmark performance, successful translation to real applications is limited. The frequent disconnect between performance on controlled BCI benchmarks and its lack of generalisation to practical settings indicates hidden overfitting problems. We introduce Disentangled Decoding Decomposition (D3), a weakly supervised method for training deep learning models across EEG datasets. By predicting the place in the respective trial sequence from which the input window was sampled, EEG-D3 separates latent components of brain activity, akin to non-linear ICA. We utilise a novel model architecture with fully independent sub-networks for strict interpretability. We outline a feature interpretation paradigm to contrast the component activation profiles on different datasets and inspect the associated temporal and spatial filters. The proposed method reliably separates latent components of brain activity on motor imagery data. Training downstream classifiers on an appropriate subset of these components prevents hidden overfitting caused by task-correlated artefacts, which severely affects end-to-end classifiers. We further exploit the linearly separable latent space for effective few-shot learning on sleep stage classification. The ability to distinguish genuine components of brain activity from spurious features results in models that avoid the hidden overfitting problem and generalise well to real-world applications, while requiring only minimal labelled data. Of interest to the neuroscience community, the proposed method gives researchers a tool to separate individual brain processes and potentially even uncover heretofore unknown dynamics.

[324] The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Subramanyam Sahoo, Jared Junkin

Main category: cs.LG

TL;DR: CTVP is a verification framework that detects backdoors in LLM-generated code by analyzing predicted execution traces across semantically equivalent program transformations, using semantic orbit analysis to identify behavioral anomalies without executing potentially malicious code.

DetailsMotivation: As LLMs increasingly generate code with minimal human oversight, there are critical concerns about backdoor injection and malicious behavior that need to be addressed through reliable verification methods.

Method: Cross-Trace Verification Protocol (CTVP) verifies untrusted code-generating models by analyzing the model’s own predictions of execution traces across semantically equivalent program transformations. Instead of direct execution, it examines consistency patterns in predicted traces to detect behavioral anomalies indicative of backdoors.

Result: The approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation and shows exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds demonstrating non-gamifiability: adversaries cannot improve through training due to fundamental space-complexity constraints.

Conclusion: Semantic orbit analysis provides a scalable, theoretically grounded approach to AI control for code generation tasks, offering a practical solution for verifying LLM-generated code against backdoor attacks.

Abstract: Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model’s own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability – adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a scalable, theoretically grounded approach to AI control for code generation tasks.

[325] Explainable reinforcement learning from human feedback to improve alignment

Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu

Main category: cs.LG

TL;DR: A method to improve RLHF by identifying and unlearning training data that cause unsatisfactory LM responses, using post-hoc explanation and targeted unlearning.

DetailsMotivation: Human improvement strategy of finding and correcting causes of unsatisfactory outcomes can be applied to improve RLHF for language model alignment, as RLHF-tuned LMs still produce unsatisfactory responses.

Method: Two-part approach: 1) Post-hoc explanation method to identify training data causing unsatisfactory responses via constrained combinatorial optimization (finding training data closest to prompt-response pair in feature space), solved with efficient iterative data selection algorithm. 2) Unlearning method that improves unsatisfactory responses by unlearning identified problematic training data while preserving satisfactory responses to other prompts.
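
To make the attribution step concrete, the sketch below approximates a prompt-response feature vector as a convex combination of training-example features and returns the largest-weight examples as candidate causes. It uses nonnegative least squares plus renormalization as a crude surrogate for the paper's constrained combinatorial optimization; the feature sizes and the `attribute_response` helper are illustrative, not the authors' iterative data selection algorithm.

```python
import numpy as np
from scipy.optimize import nnls

def attribute_response(query_feat, train_feats, top_k=5):
    """Approximate the prompt-response feature as a convex combination of
    training-example features; the heaviest weights point at candidate causes."""
    # NNLS gives nonnegative weights; renormalizing to sum to 1 is a cheap
    # surrogate for the convex-combination constraint described in the paper.
    w, _ = nnls(train_feats.T, query_feat)
    if w.sum() > 0:
        w = w / w.sum()
    top = np.argsort(w)[::-1][:top_k]
    return top, w[top]

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((100, 16))           # features of 100 training pairs
query = 0.6 * train_feats[3] + 0.4 * train_feats[42]   # query built from two known examples
idx, weights = attribute_response(query, train_feats)
print(idx, weights)                                     # examples 3 and 42 dominate
```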

Result: Experimental results demonstrate that the proposed algorithm can effectively improve RLHF performance by reducing unsatisfactory responses.

Conclusion: The human strategy of finding and correcting causes can be successfully applied to improve RLHF for language model alignment through targeted training data identification and unlearning.

Abstract: A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest to this prompt-response pair in a feature representation space, and the constraint is that the prompt-response pair can be decomposed as a convex combination of this set of training data in the feature space. We propose an efficient iterative data selection algorithm to solve this problem. The second part proposes an unlearning method that improves unsatisfactory responses to some prompts by unlearning the training data that lead to these unsatisfactory responses and, meanwhile, does not significantly degrade satisfactory responses to other prompts. Experimental results demonstrate that our algorithm can improve RLHF.

[326] Topologically-Stabilized Graph Neural Networks: Empirical Robustness Across Domains

Jelena Losic

Main category: cs.LG

TL;DR: A novel GNN framework that integrates persistent homology features with stability regularization to enhance robustness against structural perturbations.

DetailsMotivation: Graph Neural Networks (GNNs) have become standard for graph representation learning but remain vulnerable to structural perturbations, creating a need for more robust approaches.

Method: Integrates persistent homology features with stability regularization, combining GIN architectures with multi-scale topological features from persistence images, enforced by Hiraoka-Kusano-inspired stability constraints based on stability theorems of persistent homology.

Result: Demonstrates exceptional robustness to edge perturbations across six diverse datasets (biochemical, social, collaboration networks) with minimal performance degradation (0-4% on most datasets), significantly outperforming baseline stability methods.

Conclusion: Provides a theoretically-grounded and empirically-validated approach to robust graph learning that aligns with recent advances in topological regularization.

Abstract: Graph Neural Networks (GNNs) have become the standard for graph representation learning but remain vulnerable to structural perturbations. We propose a novel framework that integrates persistent homology features with stability regularization to enhance robustness. Building on the stability theorems of persistent homology \cite{cohen2007stability}, our method combines GIN architectures with multi-scale topological features extracted from persistence images, enforced by Hiraoka-Kusano-inspired stability constraints. Across six diverse datasets spanning biochemical, social, and collaboration networks, our approach demonstrates exceptional robustness to edge perturbations while maintaining competitive accuracy. Notably, we observe minimal performance degradation (0-4% on most datasets) under perturbation, significantly outperforming baseline stability methods. Our work provides both a theoretically-grounded and empirically-validated approach to robust graph learning that aligns with recent advances in topological regularization.

[327] Dropout Neural Network Training Viewed from a Percolation Perspective

Finley Devlin, Jaron Sanders

Main category: cs.LG

TL;DR: Dropout in neural networks exhibits percolation effects that can cause training breakdown when connections are randomly removed, particularly in networks without biases.

DetailsMotivation: To investigate whether dropout's random connection removal creates percolation effects that could disrupt neural network training by potentially breaking all paths between input and output layers.

Method: Develop new percolation models that mimic dropout in neural networks, analyze the relationship between network topology and path connectivity, and study training breakdown in networks without biases.
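
The path-connectivity question at the heart of the analysis is easy to simulate. The Monte Carlo sketch below drops each connection of a layered network independently and checks whether any input-to-output path survives; the layer widths, keep probability, and trial count are illustrative and are not the percolation models studied in the paper.

```python
import numpy as np

def path_survives(layer_widths, keep_prob, rng):
    """Return True if at least one input-to-output path survives after each
    connection is kept independently with probability keep_prob."""
    reachable = np.ones(layer_widths[0], dtype=bool)          # every input unit is a source
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        kept = rng.random((n_in, n_out)) < keep_prob          # surviving connections
        reachable = (reachable[:, None] & kept).any(axis=0)   # reachable units in next layer
        if not reachable.any():
            return False
    return True

def survival_probability(layer_widths, keep_prob, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    return np.mean([path_survives(layer_widths, keep_prob, rng) for _ in range(n_trials)])

# Narrow, deep networks lose input-output connectivity at keep rates where wide ones do not.
print(survival_probability([4] * 20, keep_prob=0.3))
print(survival_probability([64] * 20, keep_prob=0.3))
```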

Result: Theoretical demonstration of percolation effects in dropout, showing that these effects can cause training breakdown in networks without biases, with heuristic arguments suggesting similar issues in networks with biases.

Conclusion: Dropout exhibits percolative behavior that can disrupt neural network training by breaking connectivity paths, highlighting important topological considerations for dropout regularization.

Abstract: In this work, we investigate the existence and effect of percolation in training deep Neural Networks (NNs) with dropout. Dropout methods are regularisation techniques for training NNs, first introduced by G. Hinton et al. (2012). These methods temporarily remove connections in the NN, randomly at each stage of training, and update the remaining subnetwork with Stochastic Gradient Descent (SGD). The process of removing connections from a network at random is similar to percolation, a paradigm model of statistical physics. If dropout were to remove enough connections such that there is no path between the input and output of the NN, then the NN could not make predictions informed by the data. We study new percolation models that mimic dropout in NNs and characterise the relationship between network topology and this path problem. The theory shows the existence of a percolative effect in dropout. We also show that this percolative effect can cause a breakdown when training NNs without biases with dropout; and we argue heuristically that this breakdown extends to NNs with biases.

[328] Let’s (not) just put things in Context: Test-Time Training for Long-Context LLMs

Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Devvrit Khatri, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Sanjay Kale, Samy Jelassi

Main category: cs.LG

TL;DR: Current inference-time compute strategies (like generating thinking tokens) fail for long-context LLMs due to score dilution in static self-attention. Instead, targeted gradient updates on the specific context yield much better performance improvements.

DetailsMotivation: Long-context LLMs can process millions of tokens but struggle to reliably use all that information. While inference-time compute (like generating thinking tokens) helps with multi-step reasoning, it fails for long-context tasks due to fundamental limitations of static self-attention.

Method: Proposes targeted gradient updates on the given context instead of generating thinking tokens. This approach overcomes the limitations of static self-attention by specifically adapting the model to the context through context-specific training during inference.
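
A minimal sketch of the general idea follows: spend inference compute on a few gradient steps over the given context rather than on thinking tokens. The checkpoint name, step count, and learning rate are placeholders, and the paper's exact recipe (which parameters are updated, how the context is chunked) is not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def adapt_to_context(model, tok, context, steps=8, lr=1e-5, max_len=4096):
    """A few gradient steps of next-token loss on the given context,
    i.e. context-specific training at inference time."""
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    ids = tok(context, return_tensors="pt", truncation=True, max_length=max_len).input_ids
    for _ in range(steps):
        loss = model(input_ids=ids, labels=ids).loss   # standard causal-LM loss on the context
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model

name = "Qwen/Qwen3-4B"                                  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
long_document = "..."                                   # the long context to adapt to
model = adapt_to_context(model, tok, long_document)
model.eval()                                            # then answer queries as usual
```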

Result: The method achieves large performance improvements: 12.6 and 14.1 percentage point average improvements for Qwen3-4B on LongBench-v2 and ZeroScrolls benchmarks. Consistently outperforms current inference-time scaling strategies across models and long-context benchmarks.

Conclusion: For long-context tasks, context-specific training via targeted gradient updates is a more effective use of inference compute than generating thinking tokens. This represents a practical shift in how to allocate inference-time resources for long-context LLMs.

Abstract: Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.

[329] Measuring Uncertainty Calibration

Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian, Juan Elenter Litwin, Francesco Tonolini, David Gustaffson, Eva Garcia Martin, Carmen Barcena Gonzales, Raphaëlle Bertrand-Lalo

Main category: cs.LG

TL;DR: The paper provides non-asymptotic, distribution-free methods for estimating L₁ calibration error of binary classifiers with bounded variation assumptions and practical modification techniques.

DetailsMotivation: The motivation is to address the practical challenge of accurately estimating calibration error from finite datasets without restrictive assumptions, enabling reliable calibration assessment in real-world applications.

Method: Two main contributions: 1) An upper bound for classifiers with bounded variation calibration functions, and 2) A method to modify any classifier to enable efficient upper bounding of calibration error without significantly impacting performance.
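
The quantity being bounded is the L₁ calibration error. A common plug-in estimate bins predictions and is sketched below for orientation; this binning estimator is standard practice and is not the paper's bound or its classifier-modification procedure.

```python
import numpy as np

def binned_l1_calibration_error(probs, labels, n_bins=15):
    """Plug-in estimate of the L1 calibration error of a binary classifier:
    |P(y=1 | bin) - mean predicted probability in bin|, weighted by bin mass."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if in_bin.any():
            gap = abs(labels[in_bin].mean() - probs[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

rng = np.random.default_rng(0)
p = rng.uniform(size=10_000)
y = rng.uniform(size=10_000) < p              # perfectly calibrated synthetic scores
print(binned_l1_calibration_error(p, y))      # close to 0
```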

Result: The results provide non-asymptotic, distribution-free bounds and practical procedures that can be efficiently run on real-world datasets with modest computational overhead.

Conclusion: The paper concludes with practical advice on measuring calibration error and demonstrates that the methods yield practical procedures applicable to real-world datasets.

Abstract: We make two contributions to the problem of estimating the $L_1$ calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.

[330] OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction

Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh

Main category: cs.LG

TL;DR: OPTIMA is a practical one-shot post-training pruning method that uses row-wise Quadratic Programming to find globally optimal weight updates, balancing accuracy and scalability for large language models.

DetailsMotivation: Current post-training pruning methods face a trade-off: simple heuristics are fast but degrade accuracy, while principled optimization methods recover accuracy but are computationally infeasible at modern scale. There's a need for a method that balances accuracy and scalability.

Method: OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. It implements an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot pruning without fine-tuning.
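
Under the simplest reading (zero-equality constraints on the pruned weights only), each per-row problem reduces to a linear solve against the shared Hessian H = X^T X. The dense NumPy sketch below illustrates that per-row update; the ridge term, sizes, and calibration data are illustrative, and this is not the batched accelerator-friendly solver described in the paper.

```python
import numpy as np

def row_optimal_update(w0, keep_mask, H, ridge=1e-6):
    """Optimal reconstruction of one weight row after pruning:
    min_w (w - w0)^T H (w - w0)  s.t.  w[pruned] = 0, with H = X^T X shared across rows."""
    K = np.where(keep_mask)[0]
    P = np.where(~keep_mask)[0]
    H_KK = H[np.ix_(K, K)] + ridge * np.eye(len(K))
    H_KP = H[np.ix_(K, P)]
    w = np.zeros_like(w0)
    # Closed-form solution: shift the kept weights to absorb the pruned mass.
    w[K] = w0[K] + np.linalg.solve(H_KK, H_KP @ w0[P])
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))              # calibration activations
H = X.T @ X
w0 = rng.standard_normal(8)
mask = np.ones(8, dtype=bool)
mask[[1, 5]] = False                           # prune two weights in this row
w = row_optimal_update(w0, mask, H)
print(np.linalg.norm(X @ w - X @ w0))              # optimal update: small reconstruction error
print(np.linalg.norm(X @ (w0 * mask) - X @ w0))    # naive zeroing: larger error
```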

Result: OPTIMA achieves up to 3.97% absolute accuracy improvement across multiple LLM families and sparsity regimes. On an NVIDIA H100, it prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory, setting new state-of-the-art accuracy-efficiency trade-offs.

Conclusion: OPTIMA provides a practical solution for one-shot post-training pruning that effectively balances accuracy and scalability, making large-scale model pruning feasible on single accelerators without fine-tuning.

Abstract: Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set new state-of-the-art accuracy-efficiency trade-offs for one-shot post-training pruning.

[331] Exploring Machine Learning, Deep Learning, and Explainable AI Methods for Seasonal Precipitation Prediction in South America

Matheus Corrêa Domingos, Valdivino Alexandre de Santiago Júnior, Juliana Aparecida Anochi, Elcio Hideiti Shiguemori, Luísa Mirelle Costa dos Santos, Hércules Carlos dos Santos Pereira, André Estevam Costa Oliveira

Main category: cs.LG

TL;DR: Deep learning models, particularly LSTM, outperform traditional dynamic models for precipitation forecasting in South America, with XGBoost offering a good balance of accuracy and computational efficiency.

DetailsMotivation: There's a lack of comprehensive investigations into purely data-driven approaches for precipitation forecasting, despite the growing relevance of AI/ML techniques as alternatives or complements to traditional dynamic modeling. Accurate precipitation forecasts are crucial for societal climate impact mitigation.

Method: Comparative study of classical ML (Random Forests, XGBoost) and DL (1D CNN, LSTM, GRU) approaches for precipitation forecasting across all 2019 seasons in South America, using the Brazilian Global Atmospheric Model (BAM) as traditional dynamic modeling baseline, with explainable AI techniques to understand model behaviors.

Result: LSTM showed the strongest predictive performance, particularly for heavy precipitation, while BAM (traditional dynamic model) had the worst results. XGBoost offered lower latency with only slight accuracy loss compared to LSTM.

Conclusion: Deep learning models are viable for climate forecasting, confirming a global trend in meteorological centers, with LSTM being most accurate for heavy precipitation and XGBoost providing a cost-effective alternative with good performance.

Abstract: Forecasting meteorological variables is challenging due to the complexity of their processes, requiring advanced models for accuracy. Accurate precipitation forecasts are vital for society. Reliable predictions help communities mitigate climatic impacts. Based on the current relevance of artificial intelligence (AI), classical machine learning (ML) and deep learning (DL) techniques have been used as an alternative or complement to dynamic modeling. However, there is still a lack of broad investigations into the feasibility of purely data-driven approaches for precipitation forecasting. This study addresses this issue through a detailed investigation of different classical ML and DL approaches for forecasting precipitation in South America, taking into account all 2019 seasons. The selected classical ML techniques were Random Forests and extreme gradient boosting (XGBoost), while the DL counterparts were a 1D convolutional neural network (CNN 1D), a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. Additionally, the Brazilian Global Atmospheric Model (BAM) was used as a representative of the traditional dynamic modeling approach. We also relied on explainable artificial intelligence (XAI) to provide some explanations for the models' behaviors. LSTM showed strong predictive performance while BAM, the traditional dynamic model representative, had the worst results. Despite presenting higher latency, LSTM was the most accurate for heavy precipitation. If cost is a concern, XGBoost offers lower latency with a slight accuracy loss. The results of this research confirm the viability of DL models for climate forecasting, solidifying a global trend in major meteorological and climate forecasting centers.

[332] RePo: Language Models with Context Re-Positioning

Huayang Li, Tianyu Zhao, Richard Sproat

Main category: cs.LG

TL;DR: RePo introduces a differentiable context re-positioning mechanism that assigns token positions based on contextual dependencies rather than fixed linear indices, reducing extraneous cognitive load and improving performance on tasks with noisy contexts, structured data, and longer sequences.

DetailsMotivation: Current LLMs use rigid, fixed positional indices (linear or constant) that create uninformative structure, increasing extraneous cognitive load according to Cognitive Load Theory. This consumes working memory that should be used for deep reasoning and attention allocation.

Method: RePo uses a differentiable module f_φ to assign token positions that capture contextual dependencies, rather than relying on pre-defined integer ranges. The approach is implemented by continually pre-training on the OLMo-2 1B backbone.

Result: RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context lengths, while maintaining competitive performance on general short-context tasks. Analysis shows it allocates higher attention to distant but relevant information, assigns positions in dense non-linear space, and captures intrinsic input structure.

Conclusion: RePo’s context re-positioning mechanism effectively reduces extraneous cognitive load by replacing rigid positional indices with learned contextual dependencies, improving LLM performance on complex contextual tasks without sacrificing general short-context capabilities.

Abstract: In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_\phi$, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context lengths, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocates higher attention to distant but relevant information, assigns positions in a dense and non-linear space, and captures the intrinsic structure of the input context. Our code is available at https://github.com/SakanaAI/repo.

[333] Capturing reduced-order quantum many-body dynamics out of equilibrium via neural ordinary differential equations

Patrick Egenlauf, Iva Březinová, Sabine Andergassen, Miriam Klopotek

Main category: cs.LG

TL;DR: Neural ODE can learn 2RDM dynamics from exact data only when two- and three-particle cumulants are strongly correlated; fails in anti-correlated/uncorrelated regimes, revealing limitations of time-local functionals and need for memory-dependent kernels.

DetailsMotivation: To understand when time-local reconstruction functionals for the three-particle cumulant in TD2RDM formalism can accurately capture quantum many-body dynamics, and to develop a diagnostic tool for assessing the validity of cumulant expansion methods.

Method: Train neural ODE models on exact 2RDM data (without dimensionality reduction) to reproduce dynamics, analyze correlation between two- and three-particle cumulants, and compare with existing TD2RDM reconstructions across different dynamical regimes.

Result: Neural ODE succeeds only when Pearson correlation between two- and three-particle cumulants is large; fails in anti-correlated/uncorrelated regimes. Time-averaged three-particle-correlation buildup predicts success: moderate buildup yields accurate predictions, strong buildup causes systematic breakdowns.

Conclusion: No simple time-local functional of instantaneous two-particle cumulant can capture evolution in anti-correlated/uncorrelated regimes; memory-dependent kernels are needed. Neural ODE serves as model-agnostic diagnostic tool mapping applicability regimes of cumulant expansion methods and enabling data-driven simulation of correlated quantum matter.

Abstract: Out-of-equilibrium quantum many-body systems exhibit rapid correlation buildup that underlies many emerging phenomena. Exact wave-function methods to describe this scale exponentially with particle number; simpler mean-field approaches neglect essential two-particle correlations. The time-dependent two-particle reduced density matrix (TD2RDM) formalism offers a middle ground by propagating the two-particle reduced density matrix (2RDM) and closing the BBGKY hierarchy with a reconstruction of the three-particle cumulant. But the validity and existence of time-local reconstruction functionals ignoring memory effects remain unclear across different dynamical regimes. We show that a neural ODE model trained on exact 2RDM data (no dimensionality reduction) can reproduce its dynamics without any explicit three-particle information – but only in parameter regions where the Pearson correlation between the two- and three-particle cumulants is large. In the anti-correlated or uncorrelated regime, the neural ODE fails, indicating that no simple time-local functional of the instantaneous two-particle cumulant can capture the evolution. The magnitude of the time-averaged three-particle-correlation buildup appears to be the primary predictor of success: For a moderate correlation buildup, both neural ODE predictions and existing TD2RDM reconstructions are accurate, whereas stronger values lead to systematic breakdowns. These findings pinpoint the need for memory-dependent kernels in the three-particle cumulant reconstruction for the latter regime. Our results place the neural ODE as a model-agnostic diagnostic tool that maps the regime of applicability of cumulant expansion methods and guides the development of non-local closure schemes. More broadly, the ability to learn high-dimensional RDM dynamics from limited data opens a pathway to fast, data-driven simulation of correlated quantum matter.

[334] Adaptive digital twins for predictive decision-making: Online Bayesian learning of transition dynamics

Eugenio Varetti, Matteo Torzoni, Marco Tezzele, Andrea Manzoni

Main category: cs.LG

TL;DR: Adaptive digital twins using probabilistic graphical models with hierarchical online learning of state transitions, applied to railway bridge maintenance.

DetailsMotivation: To enhance value realization of digital twins in civil engineering by making them adaptive, enabling better personalization, robustness, and cost-effectiveness.

Method: Uses probabilistic graphical models (dynamic Bayesian networks) with state transition probabilities as random variables with conjugate priors for hierarchical online learning. Solves parametric Markov decision processes through reinforcement learning for dynamic policies.
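
The conjugate-update idea can be shown in a few lines for the Dirichlet-categorical case, the simplest member of the larger class of distributions the framework covers: each row of the transition matrix keeps a Dirichlet posterior updated in closed form from observed transitions. The states and counts below are illustrative, not the paper's case study.

```python
import numpy as np

class DirichletTransitionModel:
    """Online Bayesian learning of a Markov transition matrix: each row gets an
    independent Dirichlet prior, updated from observed (state -> next_state) pairs."""
    def __init__(self, n_states, prior_concentration=1.0):
        self.alpha = np.full((n_states, n_states), prior_concentration)

    def update(self, state, next_state):
        self.alpha[state, next_state] += 1.0   # conjugate Dirichlet-categorical update

    def posterior_mean(self):
        return self.alpha / self.alpha.sum(axis=1, keepdims=True)

# Example: degradation states 0 (healthy) to 2 (damaged), observed over inspections.
model = DirichletTransitionModel(n_states=3)
for s, s_next in [(0, 0), (0, 1), (1, 1), (1, 2), (0, 0)]:
    model.update(s, s_next)
print(model.posterior_mean())
```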

Result: Proposed framework provides enhanced personalization, increased robustness, and improved cost-effectiveness. Applied to structural health monitoring and maintenance planning of a railway bridge.

Conclusion: Adaptive digital twins with hierarchical online learning of transition dynamics can significantly improve value realization in civil engineering applications like infrastructure maintenance.

Abstract: This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from one state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature. To compute dynamic policies with precision updates, we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.

[335] Sliding Window Recurrences for Sequence Models

Dragos Secrieru, Garyk Brixi, Yoshua Bengio, Taiji Suzuki, Michael Poli, Stefano Massaroli

Main category: cs.LG

TL;DR: Phalanx layers using Sliding Window Recurrences achieve 10-40% speedup over optimized Transformers while matching perplexity in 1B parameter models.

DetailsMotivation: Multi-hybrid architectures offer better quality and performance for language modeling, but need efficient algorithms aligned with GPU memory hierarchies.

Method: Introduce hierarchical decomposition framework for linear recurrences, develop Sliding Window Recurrences (SWR) that truncate recurrences to hardware-aligned windows, and create Phalanx layers as drop-in replacements for windowed attention or linear recurrences.
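
A reference (non-kernel) sketch of the window truncation: a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t whose state is reset at fixed window boundaries, so windows can be processed independently of each other. The tensor shapes and the Python loop are illustrative; the paper's contribution is the GPU-memory-hierarchy-aligned algorithm, not this reference code.

```python
import torch

def sliding_window_recurrence(a, b, window):
    """Elementwise linear recurrence h_t = a_t * h_{t-1} + b_t, truncated to windows:
    the state resets at window boundaries, so windows are mutually independent."""
    T = a.shape[0]
    h = torch.zeros_like(b)
    for start in range(0, T, window):
        state = torch.zeros_like(b[0])
        for t in range(start, min(start + window, T)):
            state = a[t] * state + b[t]
            h[t] = state
    return h

T, d = 16, 4
a, b = 0.9 * torch.rand(T, d), torch.randn(T, d)
print(sliding_window_recurrence(a, b, window=4).shape)   # torch.Size([16, 4])
```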

Result: Phalanx achieves over 10-40% speedup across 4K to 32K context length over optimized Transformers while matching perplexity in 1B parameter multi-hybrid models.

Conclusion: Sliding Window Recurrences enable efficient multi-hybrid architectures with significant performance gains while maintaining model quality.

Abstract: Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences. We focus specifically on truncating recurrences to hardware-aligned windows which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B parameter multi-hybrid models, Phalanx achieves over 10-40% speedup across 4K to 32K context length over optimized Transformers while matching perplexity.

[336] A Complete Guide to Spherical Equivariant Graph Transformers

Sophia Tang

Main category: cs.LG

TL;DR: A comprehensive guide to spherical equivariant graph neural networks (EGNNs) that provides mathematical foundations and practical implementations for learning on 3D molecular systems while respecting rotational symmetries.

DetailsMotivation: To enable principled learning on 3D molecular and biomolecular systems where predictions must respect the rotational symmetries inherent in physics, as traditional GNNs and Transformers don't naturally handle these symmetries.

Method: Develops a complete foundation for spherical equivariant modeling using group representations, spherical harmonics, tensor products, Clebsch-Gordan decomposition, and SO(3)-equivariant kernels. Constructs Tensor Field Network and SE(3)-Transformer architectures that perform equivariant message-passing and attention on geometric graphs.
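
The simplest instance of the tensor-product machinery is the decomposition of two l=1 features (ordinary 3-vectors) into l=0, l=1, and l=2 parts; the NumPy check below verifies how each part transforms under a random rotation. This is a toy illustration in Cartesian form, not the guide's general spherical-harmonic implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def cg_decompose(u, v):
    """Decompose the product of two l=1 features (3-vectors) into its
    l=0 (scalar), l=1 (vector), and l=2 (symmetric traceless) parts."""
    scalar = u @ v                                   # l=0: dot product
    vector = np.cross(u, v)                          # l=1: cross product
    outer = np.outer(u, v)
    l2 = 0.5 * (outer + outer.T) - (np.trace(outer) / 3.0) * np.eye(3)   # l=2 part
    return scalar, vector, l2

R = Rotation.random(random_state=0).as_matrix()      # a random proper rotation
u, v = np.random.default_rng(1).standard_normal((2, 3))

s1, w1, T1 = cg_decompose(R @ u, R @ v)              # rotate inputs, then decompose
s2, w2, T2 = cg_decompose(u, v)                      # decompose, then rotate outputs
print(np.isclose(s1, s2))                            # scalar part is invariant
print(np.allclose(w1, R @ w2))                       # vector part rotates like a vector
print(np.allclose(T1, R @ T2 @ R.T))                 # l=2 part rotates like a rank-2 tensor
```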

Result: Provides a self-contained guide with clear mathematical derivations and annotated code excerpts that enables researchers to understand and implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.

Conclusion: Spherical EGNNs offer a principled framework for learning on 3D molecular systems by representing features as spherical tensors that transform under irreducible representations of SO(3), ensuring physically meaningful predictions under rotations, with practical architectures available for implementation.

Abstract: Spherical equivariant graph neural networks (EGNNs) provide a principled framework for learning on three-dimensional molecular and biomolecular systems, where predictions must respect the rotational symmetries inherent in physics. These models extend traditional message-passing GNNs and Transformers by representing node and edge features as spherical tensors that transform under irreducible representations of the rotation group SO(3), ensuring that predictions change in physically meaningful ways under rotations of the input. This guide develops a complete, intuitive foundation for spherical equivariant modeling - from group representations and spherical harmonics, to tensor products, Clebsch-Gordan decomposition, and the construction of SO(3)-equivariant kernels. Building on this foundation, we construct the Tensor Field Network and SE(3)-Transformer architectures and explain how they perform equivariant message-passing and attention on geometric graphs. Through clear mathematical derivations and annotated code excerpts, this guide serves as a self-contained introduction for researchers and learners seeking to understand or implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.

[337] EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training

Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang

Main category: cs.LG

TL;DR: EDGC is an entropy-driven dynamic gradient compression framework that adapts compression rates during LLM training based on gradient entropy trends, reducing communication overhead by up to 46.45% while maintaining model accuracy.

DetailsMotivation: Training large language models requires massive computational resources and suffers from significant communication overhead in distributed training. Static gradient compression methods fail to adapt to the dynamic nature of evolving gradients during training, leading to performance degradation.

Method: EDGC uses three key components: 1) down-sampling method to efficiently estimate gradient entropy, 2) theoretical model linking compression rate with gradient entropy for informed compression decisions, and 3) window-based adjustment mechanism that dynamically adapts compression rates across pipeline stages.
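
The down-sampled entropy estimate is straightforward to sketch. Both the histogram estimator below and the monotone entropy-to-compression-rate map are illustrative stand-ins; the paper derives a theoretical model linking compression rate and gradient entropy, which is not reproduced here.

```python
import torch

def estimated_gradient_entropy(grad, sample_size=4096, n_bins=256):
    """Histogram entropy of a random subsample of the gradient (the down-sampling
    step keeps the estimate cheap relative to the full tensor)."""
    flat = grad.flatten()
    idx = torch.randint(0, flat.numel(), (min(sample_size, flat.numel()),))
    hist = torch.histc(flat[idx], bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log()).sum().item()

def compression_rate_from_entropy(entropy, low=0.01, high=0.30, max_entropy=5.55):
    """Illustrative monotone map from entropy to a top-k keep ratio: concentrated
    (low-entropy) gradients tolerate stronger compression. max_entropy ~ log(n_bins)."""
    frac = min(entropy / max_entropy, 1.0)
    return low + frac * (high - low)

g = torch.randn(1_000_000) * 0.05         # stand-in for one layer's gradient
H = estimated_gradient_entropy(g)
print(H, compression_rate_from_entropy(H))
```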

Result: Implemented on 32-NVIDIA-V100 and 64-NVIDIA-H100 clusters training GPT2-2.5B and GPT2-12.1B respectively, EDGC reduced communication latency by up to 46.45% and training time by up to 16.13% while preserving LLM accuracy.

Conclusion: EDGC successfully addresses the challenge of accelerating LLM training through dynamic gradient compression that adapts to evolving gradient entropy, significantly improving communication efficiency without sacrificing model performance.

Abstract: Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components. First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13% while preserving LLM accuracy.

[338] Informing Acquisition Functions via Foundation Models for Molecular Discovery

Qi Chen, Fabio Ramos, Alán Aspuru-Guzik, Florian Shkurti

Main category: cs.LG

TL;DR: A likelihood-free Bayesian Optimization method that leverages LLM priors and tree-structured space partitioning for efficient molecular discovery without explicit surrogate modeling.

DetailsMotivation: Traditional Bayesian Optimization struggles in low-data regimes with vast molecular search spaces. While LLMs and chemistry foundation models offer rich priors, their high-dimensional features, costly in-context learning, and computational burden of deep Bayesian surrogates limit their effective utilization.

Method: Proposes a likelihood-free BO method that bypasses explicit surrogate modeling, directly leverages priors from LLMs and chemistry foundation models, learns tree-structured partitions of molecular space with local acquisition functions, and uses Monte Carlo Tree Search for candidate selection. Incorporates coarse-grained LLM-based clustering to restrict evaluations to promising clusters.

Result: The method substantially improves scalability, robustness, and sample efficiency in LLM-guided Bayesian Optimization for molecular discovery, as demonstrated through extensive experiments and ablations.

Conclusion: The proposed likelihood-free approach effectively addresses limitations of traditional BO by leveraging LLM priors without explicit surrogate modeling, enabling more efficient molecular discovery through scalable tree-structured search and intelligent clustering.

Abstract: Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.

[339] Pattern-Guided Diffusion Models

Vivian Lin, Kuk Jin Jang, Wenwen Si, Insup Lee

Main category: cs.LG

TL;DR: PGDM uses pattern guidance from archetypal analysis to improve diffusion model forecasting of multivariate time series by leveraging recurring patterns in the data.

DetailsMotivation: Existing diffusion models for time series forecasting often fail to account for recurring patterns in temporal data, which limits their ability to make realistic predictions that fit within known patterns.

Method: PGDM extracts patterns using archetypal analysis, estimates the most likely next pattern, and guides diffusion model predictions with this pattern estimate. It also introduces uncertainty quantification based on archetypal analysis and dynamically scales guidance based on pattern uncertainty.

Result: Pattern guidance improves PGDM’s performance by up to 40.67%/56.26% (MAE/CRPS) on visual field measurements and 14.12%/14.10% on motion capture frames. PGDM outperforms baselines by up to 65.58%/84.83% and 93.64%/92.55% respectively.

Conclusion: Pattern-guided diffusion models effectively leverage recurring patterns in temporal data to improve forecasting accuracy and uncertainty quantification, demonstrating significant performance gains over existing methods.

Abstract: Diffusion models have shown promise in forecasting future data from multivariate time series. However, few existing methods account for recurring structures, or patterns, that appear within the data. We present Pattern-Guided Diffusion Models (PGDM), which leverage inherent patterns within temporal data for forecasting future time steps. PGDM first extracts patterns using archetypal analysis and estimates the most likely next pattern in the sequence. By guiding predictions with this pattern estimate, PGDM makes more realistic predictions that fit within the set of known patterns. We additionally introduce a novel uncertainty quantification technique based on archetypal analysis, and we dynamically scale the guidance level based on the pattern estimate uncertainty. We apply our method to two well-motivated forecasting applications, predicting visual field measurements and motion capture frames. On both, we show that pattern guidance improves PGDM’s performance (MAE / CRPS) by up to 40.67% / 56.26% and 14.12% / 14.10%, respectively. PGDM also outperforms baselines by up to 65.58% / 84.83% and 93.64% / 92.55%.

[340] A Single Architecture for Representing Invariance Under Any Space Group

Cindy Y. Zhang, Elif Ertekin, Peter Orbanz, Ryan P. Adams

Main category: cs.LG

TL;DR: Single ML architecture adapts weights automatically to enforce invariance to any of 230 space groups using symmetry-adapted Fourier bases and constraint encoding.

DetailsMotivation: Current approaches require designing bespoke architectures for each symmetry group, limiting scalability and preventing knowledge transfer across related symmetries, especially challenging for 230 space groups in crystallography.

Method: Construct symmetry-adapted Fourier bases by characterizing constraints that group operations impose on Fourier coefficients, then encode these constraints into a neural network layer to enable weight sharing across different space groups.

Result: Achieves competitive performance on material property prediction tasks and enables zero-shot learning to generalize to unseen space groups.

Conclusion: The approach overcomes limitations of bespoke architectures by creating a single adaptable model that leverages structural similarities between space groups and handles data sparsity through weight sharing.

Abstract: Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.

[341] Accelerating MHC-II Epitope Discovery via Multi-Scale Prediction in Antigen Presentation

Yue Wan, Jiayi Yuan, Zhiwei Feng, Xiaowei Jia

Main category: cs.LG

TL;DR: This paper introduces a well-curated MHC-II dataset and establishes a comprehensive machine learning framework for three key tasks in computational immunotherapy: peptide binding, peptide presentation, and antigen presentation.

DetailsMotivation: MHC-II antigenic epitope studies face significant challenges compared to MHC-I due to complex binding specificity and ambiguous motif patterns, with existing datasets being smaller and less standardized. This hinders computational immunotherapy research for MHC-II.

Method: Created a well-curated dataset from IEDB and other public sources, formulated three ML tasks (peptide binding, peptide presentation, antigen presentation), employed multi-scale evaluation framework, and conducted comprehensive analysis with modular framework.

Result: Developed an extended and standardized peptide-MHC-II dataset plus a novel antigen-MHC-II dataset with richer biological context, providing a valuable resource for advancing computational immunotherapy research.

Conclusion: This work serves as a foundation for future ML-guided epitope discovery and predictive modeling of immune responses, addressing critical gaps in MHC-II computational immunotherapy research.

Abstract: Antigenic epitope presented by major histocompatibility complex II (MHC-II) proteins plays an essential role in immunotherapy. However, compared to the more widely studied MHC-I in computational immunotherapy, the study of MHC-II antigenic epitope poses significantly more challenges due to its complex binding specificity and ambiguous motif patterns. Consequently, existing datasets for MHC-II interactions are smaller and less standardized than those available for MHC-I. To address these challenges, we present a well-curated dataset derived from the Immune Epitope Database (IEDB) and other public sources. It not only extends and standardizes existing peptide-MHC-II datasets, but also introduces a novel antigen-MHC-II dataset with richer biological context. Leveraging this dataset, we formulate three major machine learning (ML) tasks of peptide binding, peptide presentation, and antigen presentation, which progressively capture the broader biological processes within the MHC-II antigen presentation pathway. We further employ a multi-scale evaluation framework to benchmark existing models, along with a comprehensive analysis over various modeling designs to this problem with a modular framework. Overall, this work serves as a valuable resource for advancing computational immunotherapy, providing a foundation for future research in ML guided epitope discovery and predictive modeling of immune responses.

[342] EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment

Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee, Jongseong Jang, Soonyoung Lee

Main category: cs.LG

TL;DR: EXAONE Path 2.5 is a pathology foundation model that integrates histologic, genomic, epigenetic, and transcriptomic data to create comprehensive patient representations for cancer analysis, outperforming existing models in clinical and benchmark settings.

DetailsMotivation: Cancer progression involves interactions across multiple biological layers beyond just morphology, and image-only models miss important molecular information. There's a need for models that capture the broader biological landscape by integrating multiple modalities to better reflect tumor biology.

Method: Three key components: (1) multimodal SigLIP loss for all-pairwise contrastive learning across heterogeneous modalities, (2) fragment-aware rotary positional encoding (F-RoPE) to preserve spatial structure in whole slide images, and (3) domain-specialized internal foundation models for WSI and RNA-seq to provide biologically grounded embeddings for robust multimodal alignment.
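
A minimal sketch of the all-pairwise sigmoid contrastive objective over per-modality patient embeddings follows. The temperature, bias, and the uniform average over modality pairs are assumptions for illustration; the F-RoPE module and the internal foundation models are not sketched.

```python
import torch
import torch.nn.functional as F

def siglip_pair_loss(za, zb, temperature=10.0, bias=-10.0):
    """Sigmoid contrastive loss for one modality pair: matched (diagonal) pairs are
    pushed to positive logits, all mismatched pairs to negative logits."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = temperature * za @ zb.T + bias
    labels = 2.0 * torch.eye(za.size(0)) - 1.0            # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

def multimodal_siglip_loss(embeddings):
    """All-pairwise version over per-modality embeddings of the same patients."""
    names = list(embeddings)
    losses = [siglip_pair_loss(embeddings[a], embeddings[b])
              for i, a in enumerate(names) for b in names[i + 1:]]
    return torch.stack(losses).mean()

batch, dim = 8, 64
emb = {m: torch.randn(batch, dim) for m in ["wsi", "rna", "methylation", "mutation"]}
print(multimodal_siglip_loss(emb))
```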

Result: Achieved on-par performance with state-of-the-art foundation models on Patho-Bench (80 tasks) while showing highest adaptability in internal clinical settings. Demonstrated high data and parameter efficiency compared to six leading pathology foundation models.

Conclusion: The biologically informed multimodal design is valuable for cancer analysis, and integrated genotype-to-phenotype modeling has significant potential for next-generation precision oncology.

Abstract: Cancer progression arises from interactions across multiple biological layers, extending beyond morphology to molecular layers that remain invisible to image-only models. To capture this broader biological landscape, we present EXAONE Path 2.5, a pathology foundation model that jointly models histologic, genomic, epigenetic and transcriptomic modalities, producing an integrated patient representation that reflects tumor biology more comprehensively. Our approach incorporates three key components: (1) multimodal SigLIP loss enabling all-pairwise contrastive learning across heterogeneous modalities, (2) a fragment-aware rotary positional encoding (F-RoPE) module that preserves spatial structure and tissue-fragment topology in WSI, and (3) domain-specialized internal foundation models for both WSI and RNA-seq to provide biologically grounded embeddings for robust multimodal alignment. We evaluate EXAONE Path 2.5 against six leading pathology foundation models across two complementary benchmarks: an internal real-world clinical dataset and the Patho-Bench benchmark covering 80 tasks. Our framework demonstrates high data and parameter efficiency, achieving on-par performance with state-of-the-art foundation models on Patho-Bench while exhibiting the highest adaptability in the internal clinical setting. These results highlight the value of biologically informed multimodal design and underscore the potential of integrated genotype-to-phenotype modeling for next-generation precision oncology.

[343] Multivariate Time Series Forecasting with Hybrid Euclidean-SPD Manifold Graph Neural Networks

Yong Fang, Na Li, Hangguan Shan, Eryun Liu, Xinyu Li, Wei Ni, Er-Ping Li

Main category: cs.LG

TL;DR: HSMGNN is a novel graph neural network that uses hybrid Euclidean-Riemannian geometry for multivariate time series forecasting, achieving up to 13.8% improvement over SOTA methods.

DetailsMotivation: Existing MTS forecasting approaches are limited to either Euclidean or Riemannian space, which restricts their ability to capture diverse geometric structures and complex spatio-temporal dependencies in real-world data.

Method: Proposes HSMGNN with three key components: 1) Submanifold-Cross-Segment embedding to project MTS into both Euclidean and Riemannian spaces, 2) Adaptive-Distance-Bank layer to reduce Riemannian distance computation cost, and 3) Fusion Graph Convolutional Network to integrate features from dual spaces via learnable fusion operator.
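
For orientation, the sketch below maps a multivariate segment to an SPD matrix via its covariance and computes one standard Riemannian distance (Log-Euclidean) between two such matrices. The specific metric, the SCS embedding, and the Adaptive-Distance-Bank used in HSMGNN are not reproduced; this is an assumed, generic SPD pipeline.

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def spd_from_segment(x, eps=1e-6):
    """Map a time-series segment (T x d) to an SPD matrix via its regularized covariance."""
    c = np.cov(x, rowvar=False)
    return c + eps * np.eye(c.shape[0])

def log_euclidean_distance(A, B):
    """One standard Riemannian distance on the SPD manifold: ||log A - log B||_F."""
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")

rng = np.random.default_rng(0)
seg_a, seg_b = rng.standard_normal((50, 5)), 2.0 * rng.standard_normal((50, 5))
print(log_euclidean_distance(spd_from_segment(seg_a), spd_from_segment(seg_b)))
```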

Result: Experiments on three benchmark datasets show HSMGNN achieves up to 13.8% improvement in forecasting accuracy over state-of-the-art baselines.

Conclusion: HSMGNN is the first work to leverage hybrid geometric representations for MTS forecasting, enabling more expressive and comprehensive modeling of geometric properties through a hybrid Euclidean-Riemannian framework.

Abstract: Multivariate Time Series (MTS) forecasting plays a vital role in various real-world applications, such as traffic management and predictive maintenance. Existing approaches typically model MTS data in either Euclidean or Riemannian space, limiting their ability to capture the diverse geometric structures and complex spatio-temporal dependencies inherent in real-world data. To overcome this limitation, we propose the Hybrid Symmetric Positive-Definite Manifold Graph Neural Network (HSMGNN), a novel graph neural network-based model that captures data geometry within a hybrid Euclidean-Riemannian framework. To the best of our knowledge, this is the first work to leverage hybrid geometric representations for MTS forecasting, enabling expressive and comprehensive modeling of geometric properties. Specifically, we introduce a Submanifold-Cross-Segment (SCS) embedding to project input MTS into both Euclidean and Riemannian spaces, thereby capturing spatio-temporal variations across distinct geometric domains. To alleviate the high computational cost of Riemannian distance, we further design an Adaptive-Distance-Bank (ADB) layer with a trainable memory mechanism. Finally, a Fusion Graph Convolutional Network (FGCN) is devised to integrate features from the dual spaces via a learnable fusion operator for accurate prediction. Experiments on three benchmark datasets demonstrate that HSMGNN achieves up to a 13.8 percent improvement over state-of-the-art baselines in forecasting accuracy.

[344] FusAD: Time-Frequency Fusion with Adaptive Denoising for General Time Series Analysis

Da Zhang, Bingyu Li, Zhiyuan Zhao, Feiping Nie, Junyu Gao, Xuelong Li

Main category: cs.LG

TL;DR: FusAD is a unified time series analysis framework that integrates adaptive time-frequency fusion and denoising mechanisms to handle multiple tasks (classification, forecasting, anomaly detection) across diverse time series types.

DetailsMotivation: Existing deep learning approaches for time series analysis are often task-specific or data-type-specific, lacking a unified framework that can handle multiple tasks simultaneously while dealing with real-world challenges like noise, complex frequency components, and multi-scale patterns.

Method: FusAD features: 1) Adaptive time-frequency fusion mechanism integrating Fourier and Wavelet transforms to capture global-local and multi-scale dynamic features; 2) Adaptive denoising mechanism to automatically sense and filter various noise types; 3) General information fusion and decoding structure with masked pre-training for multi-granularity representation learning.
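
The two branches of the time-frequency fusion can be sketched with an FFT spectrum (global) and a Haar wavelet pyramid (multi-scale, local). The decomposition level and the plain Haar filters are illustrative; FusAD's adaptive fusion and denoising modules are not shown.

```python
import torch

def time_frequency_features(x, levels=3):
    """Global spectrum from the FFT plus a multi-scale Haar wavelet pyramid;
    x has shape (batch, variables, time) with time divisible by 2**levels."""
    fft_feat = torch.fft.rfft(x, dim=-1).abs()            # Fourier branch: global frequencies
    details, approx = [], x
    for _ in range(levels):                               # wavelet branch: local, multi-scale
        even, odd = approx[..., ::2], approx[..., 1::2]
        details.append((even - odd) / 2 ** 0.5)           # high-frequency detail coefficients
        approx = (even + odd) / 2 ** 0.5                  # low-frequency approximation
    return fft_feat, details, approx

x = torch.randn(8, 4, 64)
fft_feat, details, approx = time_frequency_features(x)
print(fft_feat.shape, [d.shape for d in details], approx.shape)
```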

Result: Extensive experiments show FusAD consistently outperforms state-of-the-art models on mainstream time series benchmarks for classification, forecasting, and anomaly detection tasks, while maintaining high efficiency and scalability.

Conclusion: FusAD provides an effective unified framework for diverse time series analysis tasks, addressing key challenges in multi-task modeling, noise handling, and multi-scale feature extraction through its adaptive time-frequency fusion and denoising mechanisms.

Abstract: Time series analysis plays a vital role in fields such as finance, healthcare, industry, and meteorology, underpinning key tasks including classification, forecasting, and anomaly detection. Although deep learning models have achieved remarkable progress in these areas in recent years, constructing an efficient, multi-task compatible, and generalizable unified framework for time series analysis remains a significant challenge. Existing approaches are often tailored to single tasks or specific data types, making it difficult to simultaneously handle multi-task modeling and effectively integrate information across diverse time series types. Moreover, real-world data are often affected by noise, complex frequency components, and multi-scale dynamic patterns, which further complicate robust feature extraction and analysis. To ameliorate these challenges, we propose FusAD, a unified analysis framework designed for diverse time series tasks. FusAD features an adaptive time-frequency fusion mechanism, integrating both Fourier and Wavelet transforms to efficiently capture global-local and multi-scale dynamic features. With an adaptive denoising mechanism, FusAD automatically senses and filters various types of noise, highlighting crucial sequence variations and enabling robust feature extraction in complex environments. In addition, the framework integrates a general information fusion and decoding structure, combined with masked pre-training, to promote efficient learning and transfer of multi-granularity representations. Extensive experiments demonstrate that FusAD consistently outperforms state-of-the-art models on mainstream time series benchmarks for classification, forecasting, and anomaly detection tasks, while maintaining high efficiency and scalability. Code is available at https://github.com/zhangda1018/FusAD.

[345] SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao

Main category: cs.LG

TL;DR: SonicMoE: A memory-efficient MoE training system with novel kernels and token rounding that reduces activation memory by 45% and improves compute throughput by 1.86x on Hopper GPUs.

DetailsMotivation: Current fine-grained MoE models suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels.

Method: 1) Memory-efficient algorithm for MoE forward/backward passes with minimal activation caching; 2) GPU kernels that overlap memory IO with computation; 3) Novel “token rounding” method to minimize wasted compute from padding in Grouped GEMM kernels.
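
As a back-of-the-envelope illustration of point 3: Grouped GEMM pads each expert's token batch up to a tile multiple, so compute grows with the per-expert remainders. The rounding rule below (dropping sub-tile remainders) only shows the arithmetic being targeted and is not SonicMoE's actual routing algorithm.

```python
import math

def padded_tokens(counts, tile=128):
    """Tokens actually computed when each expert's batch is padded to a tile multiple."""
    return sum(math.ceil(c / tile) * tile for c in counts if c > 0)

def round_counts_down(counts, tile=128):
    """Illustrative rounding: shave each expert's batch down to a tile multiple."""
    return [(c // tile) * tile for c in counts]

counts = [37, 900, 260, 5, 1410, 96]            # tokens routed to each expert
print("useful tokens:          ", sum(counts))
print("computed with padding:  ", padded_tokens(counts))
print("computed after rounding:", padded_tokens(round_counts_down(counts)))
```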

Result: 45% reduction in activation memory and 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE. Achieves 213B tokens/day on 64 H100s, comparable to ScatterMoE’s 225B tokens/day on 96 H100s, i.e., similar throughput with a third fewer GPUs. Token rounding yields an additional 1.16x speedup while maintaining similar downstream performance.

Conclusion: SonicMoE addresses key efficiency bottlenecks in MoE training through memory optimization, IO-computation overlap, and token rounding, enabling faster and more efficient MoE model training with open-sourced kernels.

Abstract: Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation benefiting all MoE architectures. Finally, we propose a novel “token rounding” method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE’s BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day comparable to ScatterMoE’s 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.

[346] Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization

Boyuan Yao, Dingcheng Luo, Lianghao Cao, Nikola Kovachki, Thomas O’Leary-Roseberry, Omar Ghattas

Main category: cs.LG

TL;DR: DIFNOs are Fourier neural operators trained with both output and derivative data to accurately emulate PDE operators and their sensitivities, enabling efficient PDE-constrained optimization.

DetailsMotivation: Accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fréchet derivatives, which conventional FNOs may not provide. The authors aim to develop neural operators that can closely approximate both PDE solution operators and their sensitivities for optimization applications.

Method: Derivative-informed Fourier neural operators (DIFNOs) trained by minimizing prediction error on both output and Fréchet derivative samples. Developed efficient training schemes using dimension reduction and multi-resolution techniques to reduce memory/computational costs for derivative learning.
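
In the abstract's terms, the training objective penalizes output and Fréchet-derivative mismatch jointly; schematically (the weighting and norms here are illustrative):

```latex
\mathcal{L}(\theta) \;=\; \mathbb{E}_{m\sim\mu}\!\left[
  \big\| F(m) - F_\theta(m) \big\|^{2}
  \;+\; \lambda\,\big\| DF(m) - DF_\theta(m) \big\|^{2}
\right],
```

where $F$ is the high-fidelity (PDE solution) operator, $F_\theta$ the FNO, $DF$ its Fréchet derivative, and $\lambda>0$ balances the two terms.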

Result: Established theoretical approximation guarantees: (i) simultaneous universal approximation of FNOs and their Fréchet derivatives on compact sets, (ii) universal approximation in weighted Sobolev spaces with unbounded input measures. Numerical experiments on diffusion-reaction, Helmholtz, and Navier-Stokes equations show DIFNOs achieve superior sample complexity for operator learning and solving PDE-constrained inverse problems.

Conclusion: DIFNOs provide an effective framework for derivative-informed operator learning, enabling accurate solution of PDE-constrained optimization problems with high accuracy at low training sample sizes, certified by theoretical approximation guarantees.

Abstract: We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Fréchet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator’s response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fréchet derivatives. Then, for continuously differentiable operators, we establish (i) simultaneous universal approximation of FNOs and their Fréchet derivatives on compact sets, and (ii) universal approximation of FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and accurate solution of PDE-constrained optimization. Furthermore, we develop efficient training schemes using dimension reduction and multi-resolution techniques that significantly reduce memory and computational costs for Fréchet derivative learning. Numerical examples on nonlinear diffusion–reaction, Helmholtz, and Navier–Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.

[347] Arithmetic-Intensity-Aware Quantization

Taig Singh, Shreshth Rajan, Nikhil Iyer

Main category: cs.LG

TL;DR: AIQ is a mixed precision quantization framework that optimizes per-layer bit-widths to maximize arithmetic intensity and minimize accuracy loss, improving inference throughput on memory-bound neural networks.

DetailsMotivation: Modern neural networks are increasingly memory-bound, with inference throughput limited by DRAM bandwidth rather than compute. There's a need for quantization methods that can improve arithmetic intensity while maintaining accuracy.

Method: AIQ is a post-training quantization framework that uses search algorithms to find optimal per-layer bit-widths. It minimizes a weighted loss function that balances arithmetic intensity (AI) and accuracy degradation.
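
A rough sketch of the two quantities being traded off for a single dense layer, under a simple roofline-style cost model: arithmetic intensity rises as weight bit-width falls, and the search minimizes a weighted combination of (negative) AI and accuracy degradation. The cost model and weighting below are illustrative assumptions, not the paper's exact formulation.

```python
def layer_arithmetic_intensity(in_dim, out_dim, batch, weight_bits, act_bits=16):
    """FLOPs per byte moved for a dense layer (weights plus activations)."""
    flops = 2 * batch * in_dim * out_dim                 # multiply-accumulate count
    weight_bytes = in_dim * out_dim * weight_bits / 8
    act_bytes = batch * (in_dim + out_dim) * act_bits / 8
    return flops / (weight_bytes + act_bytes)

def aiq_objective(bit_config, layers, batch, acc_drop, alpha=0.1):
    """Weighted loss: negative mean arithmetic intensity plus an accuracy penalty."""
    ai = [layer_arithmetic_intensity(i, o, batch, b)
          for (i, o), b in zip(layers, bit_config)]
    return -sum(ai) / len(ai) + alpha * acc_drop(bit_config)

layers = [(512, 512), (512, 2048), (2048, 512)]
print(layer_arithmetic_intensity(512, 2048, batch=32, weight_bits=4))
print(aiq_objective([8, 4, 4], layers, batch=32, acc_drop=lambda cfg: 0.6))
```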

Result: On ResNet-20/CIFAR-10, AIQ increases arithmetic intensity by ~50% over FP32 baseline while keeping test accuracy within ~1 percentage point. On MobileNetV2, AIQ configurations achieve 1.66x higher throughput than FP32 baseline while maintaining accuracy within 1 percentage point. The method naturally quantizes larger layers more aggressively.

Conclusion: AIQ effectively addresses memory-bound inference limitations by optimizing per-layer quantization to maximize arithmetic intensity while minimizing accuracy loss, outperforming uniform quantization schemes and providing practical throughput improvements.

Abstract: As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that uses search algorithms over per-layer quantization schemes to minimize a weighted loss over AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, and outperforming global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations give a 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.

[348] Cornserve: Efficiently Serving Any-to-Any Multimodal Models

Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury

Main category: cs.LG

TL;DR: Cornserve is an efficient online serving system for Any-to-Any multimodal models that automatically optimizes deployment and execution, achieving up to 3.81× throughput improvement and 5.79× tail latency reduction.

DetailsMotivation: Any-to-Any models introduce serving challenges due to heterogeneity in request types, computation paths, and scaling requirements when handling combinations of text and multimodal data (image, video, audio) as both input and output.

Method: Cornserve allows developers to describe the computation graph of Any-to-Any models. Its planner automatically finds optimized deployment plans, including model disaggregation decisions based on model and workload characteristics. A distributed runtime then executes the model efficiently.
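
The class and method names below are hypothetical (the summary does not expose Cornserve's real developer API); the sketch only illustrates what describing the computation graph of an Any-to-Any model might look like before a planner decides how to disaggregate it.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str          # e.g. "vision_encoder", "llm", "image_dit"
    kind: str          # "encoder" | "autoregressive" | "generator"
    gpu_mem_gb: float  # rough resource hint a planner could use

@dataclass
class AnyToAnyGraph:
    components: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (producer, consumer) pairs

    def add(self, comp):
        self.components.append(comp)
        return comp

    def connect(self, src, dst):
        self.edges.append((src.name, dst.name))

graph = AnyToAnyGraph()
enc = graph.add(Component("vision_encoder", "encoder", 8))
llm = graph.add(Component("llm", "autoregressive", 40))
dit = graph.add(Component("image_dit", "generator", 24))
graph.connect(enc, llm)
graph.connect(llm, dit)
print(graph.edges)
```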

Result: Cornserve delivers up to 3.81× throughput improvement and up to 5.79× tail latency reduction compared to existing solutions, while efficiently serving diverse Any-to-Any models and workloads.

Conclusion: Cornserve provides an effective solution for serving heterogeneous Any-to-Any multimodal models by automating deployment optimization and efficient distributed execution, significantly outperforming existing serving systems.

Abstract: We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving. Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve’s planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve’s distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81$\times$ throughput improvement and up to 5.79$\times$ tail latency reduction over existing solutions.

[349] A First-Order Logic-Based Alternative to Reward Models in RLHF

Chunjin Jian, Xinhua Zhu

Main category: cs.LG

TL;DR: S-GRPO: A supervised variant of GRPO that uses logic-similarity-based reward mechanism instead of conventional reward modeling for LLM alignment, preventing model collapse while improving performance and robustness.

DetailsMotivation: Existing RLHF approaches such as PPO depend heavily on reward models to align LLMs with human preferences, so reward-model quality and stability largely determine final alignment performance. Conventional reward modeling relies on heuristic reward estimation, which may not be optimal.

Method: Proposes logic-similarity-based reward mechanism using formal logical consistency instead of heuristic reward estimation. Introduces S-GRPO (supervised variant of GRPO) with additional supervised component to prevent model collapse in multi-perspective scenarios. Jointly optimizes generation term, KL-divergence regularization, and label-based objective during training.
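
To make the jointly optimized terms concrete, the sketch below standardizes rewards within a group of sampled responses (the GRPO-style advantage) and adds a KL penalty and a supervised log-likelihood term; the logic-similarity reward and the paper's exact weighting are only stubbed here.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def s_grpo_style_loss(logp_new, logp_old, logp_ref, rewards, logp_label,
                      beta=0.04, lam=1.0):
    """Illustrative combination: policy term + KL-to-reference penalty + supervised term."""
    adv = group_relative_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)          # importance ratios per sampled response
    policy_term = -(ratio * adv).mean()
    kl_term = (logp_new - logp_ref).mean()       # crude estimate of KL to the reference
    supervised_term = -logp_label                # negative log-likelihood of the labeled answer
    return policy_term + beta * kl_term + lam * supervised_term

logp_new = np.array([-12.0, -15.5, -11.2, -14.0])
logp_old = np.array([-12.5, -15.0, -11.0, -14.2])
logp_ref = np.array([-13.0, -15.2, -11.5, -14.1])
rewards = np.array([0.9, 0.2, 0.7, 0.1])         # e.g. logic-similarity scores
print(s_grpo_style_loss(logp_new, logp_old, logp_ref, rewards, logp_label=-10.3))
```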

Result: S-GRPO consistently outperforms standard supervised fine-tuning (SFT) in both performance and robustness. Extends existing preference-learning frameworks (GRPO and DPO) offering more flexible and task-adaptive approach to alignment training.

Conclusion: Logic-similarity-based reward mechanism with S-GRPO framework provides effective alternative to conventional reward modeling for LLM alignment, addressing model collapse issues while improving alignment performance and robustness.

Abstract: Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. However, the quality and stability of the trained reward model largely determine the final alignment performance. Existing approaches such as Proximal Policy Optimization (PPO) rely heavily on reward models to guide LLMs toward human-aligned behaviors. In this work, we propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling. Instead of relying on heuristic reward estimation, our method leverages formal logical consistency to steer model alignment with human preferences. Since real-world questions can be interpreted from multiple perspectives, to ensure that logic-based reinforcement learning does not cause model collapse, we introduce S-GRPO, a supervised variant of the GRPO framework. S-GRPO incorporates an additional supervised component and jointly optimizes the generation term, KL-divergence regularization, and label-based objective during training. Experimental results demonstrate that S-GRPO consistently outperforms standard supervised fine-tuning (SFT) in both performance and robustness. Furthermore, it extends existing preference-learning frameworks such as GRPO and DPO, offering a more flexible and task-adaptive approach to alignment training. Our code is available at https://github.com/ChunjinJiang/sgrpo.

[350] PathFinder: Advancing Path Loss Prediction for Single-to-Multi-Transmitter Scenario

Zhijie Zhong, Zhiwen Yu, Pengyu Li, Jianming Lv, C. L. Philip Chen, Min Chen

Main category: cs.LG

TL;DR: PathFinder: A novel deep learning architecture for radio path loss prediction that addresses limitations in environmental modeling, multi-transmitter scenarios, and distribution shift generalization through disentangled feature encoding and attention mechanisms.

DetailsMotivation: Current deep learning-based radio path loss prediction methods have three key limitations: 1) passive environmental modeling that overlooks transmitters and key features, 2) focus on single-transmitter scenarios despite real-world multi-transmitter prevalence, and 3) poor generalization under distribution shifts when training/testing environments differ in building density or transmitter configurations.

Method: Proposes PathFinder architecture with: 1) Disentangled feature encoding to actively model buildings and transmitters separately, 2) Mask-Guided Low-rank Attention to independently focus on receiver and building regions, 3) Transmitter-Oriented Mixup strategy for robust training, and 4) A new benchmark called single-to-multi-transmitter RPP (S2MT-RPP) to evaluate extrapolation performance.

Result: PathFinder significantly outperforms state-of-the-art methods, especially in challenging multi-transmitter scenarios, demonstrating superior generalization capabilities under distribution shifts.

Conclusion: The proposed PathFinder architecture effectively addresses key limitations in radio path loss prediction by enabling proactive environmental modeling, handling realistic multi-transmitter scenarios, and improving generalization under distribution shifts through novel architectural components and training strategies.

Abstract: Radio path loss prediction (RPP) is critical for optimizing 5G networks and enabling IoT, smart city, and similar applications. However, current deep learning-based RPP methods lack proactive environmental modeling, struggle with realistic multi-transmitter scenarios, and generalize poorly under distribution shifts, particularly when training/testing environments differ in building density or transmitter configurations. This paper identifies three key issues: (1) passive environmental modeling that overlooks transmitters and key environmental features; (2) overemphasis on single-transmitter scenarios despite real-world multi-transmitter prevalence; (3) excessive focus on in-distribution performance while neglecting distribution shift challenges. To address these, we propose PathFinder, a novel architecture that actively models buildings and transmitters via disentangled feature encoding and integrates Mask-Guided Low-rank Attention to independently focus on receiver and building regions. We also introduce a Transmitter-Oriented Mixup strategy for robust training and a new benchmark, single-to-multi-transmitter RPP (S2MT-RPP), tailored to evaluate extrapolation performance (multi-transmitter testing after single-transmitter training). Experimental results show PathFinder outperforms state-of-the-art methods significantly, especially in challenging multi-transmitter scenarios. Our code and project site are available at: https://emorzz1g.github.io/PathFinder/.

[351] On Improving Deep Active Learning with Formal Verification

Jonathan Spiegelman, Guy Amir, Guy Katz

Main category: cs.LG

TL;DR: Adversarial examples from formal verification significantly boost Deep Active Learning performance by augmenting training data with robustness-violating inputs.

DetailsMotivation: To enhance data efficiency in Deep Active Learning beyond sample selection, by exploring how adversarial inputs that violate robustness constraints can improve model generalization without requiring additional manual labeling.

Method: Extends DAL techniques by augmenting training data with adversarial examples generated via formal verification (rather than standard gradient-based attacks), and proposes a new DAL technique incorporating this approach.
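
A schematic of the augmentation loop, with `find_robustness_counterexample` standing in for a call to a formal verifier (a hypothetical helper; real verification backends expose their own interfaces, and the random probing here is only a placeholder for verification):

```python
import numpy as np

def find_robustness_counterexample(model, x, eps, tries=256):
    """Placeholder for a verifier query: look for an input within an L-inf ball of
    radius eps around x on which the model's prediction flips; return it or None."""
    rng = np.random.default_rng(0)
    base = model(x)
    for _ in range(tries):
        cand = x + rng.uniform(-eps, eps, size=x.shape)
        if model(cand) != base:
            return cand
    return None

def augment_with_counterexamples(model, labeled, eps=0.1):
    """Add (counterexample, original label) pairs to the training set."""
    extra = []
    for x, y in labeled:
        cex = find_robustness_counterexample(model, x, eps)
        if cex is not None:
            extra.append((cex, y))             # robustness says the label should not change
    return labeled + extra

toy_model = lambda x: int(x.sum() > 0)         # a deliberately brittle classifier
data = [(np.array([0.01, -0.005]), 1), (np.array([-0.4, 0.3]), 0)]
print(len(augment_with_counterexamples(toy_model, data)))
```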

Result: Formal verification-based adversarial examples contribute substantially more than gradient-based attacks, yielding significant improvements in model generalization across standard benchmarks when applied to multiple modern DAL techniques.

Conclusion: Augmenting training data with formally verified adversarial inputs is an effective strategy for improving Deep Active Learning performance and model generalization.

Abstract: Deep Active Learning (DAL) aims to reduce labeling costs in neural-network training by prioritizing the most informative unlabeled samples for annotation. Beyond selecting which samples to label, several DAL approaches further enhance data efficiency by augmenting the training set with synthetic inputs that do not require additional manual labeling. In this work, we investigate how augmenting the training data with adversarial inputs that violate robustness constraints can improve DAL performance. We show that adversarial examples generated via formal verification contribute substantially more than those produced by standard, gradient-based attacks. We apply this extension to multiple modern DAL techniques, as well as to a new technique that we propose, and show that it yields significant improvements in model generalization across standard benchmarks.

[352] Optimizing the Adversarial Perturbation with a Momentum-based Adaptive Matrix

Wei Tao, Sheng Long, Xin Liu, Wei Li, Qing Tao

Main category: cs.LG

TL;DR: The paper proposes AdaMI, a novel momentum-based adversarial attack that uses an adaptive matrix to scale perturbations, addressing theoretical optimization issues in existing attacks like PGD and MI-FGSM.

DetailsMotivation: Existing gradient-based attacks (PGD, MI-FGSM) use the sign function to scale perturbations, which raises theoretical optimization concerns. The authors aim to address these issues by developing a more theoretically sound attack method.

Method: The authors first analyze PGD as a reformulation of projected gradient method. They show that using an adaptive matrix with accumulated gradients transforms PGD into AdaGrad. Building on this, they propose AdaMI - a momentum-based attack that optimizes perturbations using a momentum-based adaptive matrix.
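
The snippet below contrasts the sign-based step of PGD/MI-FGSM with a momentum-plus-adaptive-matrix update of the kind the analysis motivates; the diagonal accumulated-gradient scaling is an illustrative choice, not the paper's exact matrix or projection.

```python
import numpy as np

def adaptive_momentum_perturbation(grad_fn, x, eps=0.03, lr=0.01, mu=0.9, steps=10):
    """Build a perturbation with momentum and a diagonal adaptive scaling,
    instead of the sign() step used by PGD / MI-FGSM."""
    delta = np.zeros_like(x)
    m = np.zeros_like(x)          # momentum buffer
    v = np.zeros_like(x)          # accumulated squared gradients (adaptive diagonal)
    for _ in range(steps):
        g = grad_fn(x + delta)
        m = mu * m + g
        v += g * g
        delta += lr * m / (np.sqrt(v) + 1e-12)
        delta = np.clip(delta, -eps, eps)       # project back into the L-inf ball
    return delta

grad_fn = lambda z: 2.0 * (z - 1.0)             # gradient of a toy quadratic loss
print(adaptive_momentum_perturbation(grad_fn, np.zeros(3)))
```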

Result: AdaMI achieves optimal convergence for convex problems, addressing the non-convergence issue of MI-FGSM. Experiments show it boosts adversarial transferability across different networks while maintaining better stability and imperceptibility compared to state-of-the-art methods.

Conclusion: The momentum-based adaptive matrix serves as an effective general technique for improving adversarial transferability. AdaMI provides a more theoretically grounded approach to adversarial example generation with better optimization properties.

Abstract: Generating adversarial examples (AEs) can be formulated as an optimization problem. Among various optimization-based attacks, the gradient-based PGD and the momentum-based MI-FGSM have garnered considerable interest. However, all these attacks use the sign function to scale their perturbations, which raises several theoretical concerns from the point of view of optimization. In this paper, we first reveal that PGD is actually a specific reformulation of the projected gradient method using only the current gradient to determine its step-size. Further, we show that when we utilize a conventional adaptive matrix with the accumulated gradients to scale the perturbation, PGD becomes AdaGrad. Motivated by this analysis, we present a novel momentum-based attack AdaMI, in which the perturbation is optimized with an interesting momentum-based adaptive matrix. AdaMI is proved to attain optimal convergence for convex problems, indicating that it addresses the non-convergence issue of MI-FGSM, thereby ensuring stability of the optimization process. The experiments demonstrate that the proposed momentum-based adaptive matrix can serve as a general and effective technique to boost adversarial transferability over the state-of-the-art methods across different networks while maintaining better stability and imperceptibility.

[353] Random-Bridges as Stochastic Transports for Generative Models

Stefano Goria, Levent A. Mengütürk, Murat C. Mengütürk, Berkan Sesen

Main category: cs.LG

TL;DR: Random-bridges (stochastic processes conditioned on target distributions) enable efficient generative modeling with faster sampling and competitive quality compared to traditional methods.

DetailsMotivation: To leverage random-bridges as stochastic transports between probability distributions for generative modeling, offering computational efficiency and flexibility in pattern representation (Markovian/non-Markovian, continuous/discontinuous/hybrid).

Method: Start from general probabilistic statements, derive specific representations for learning and simulation algorithms using information processing, with empirical implementation based on Gaussian random bridges.
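
The simplest Gaussian random bridge is the Brownian bridge, which already shows the transport idea: a path started at a source sample is conditioned to hit a target sample at the terminal time. The learned components (drift models, information processing) are omitted in this sketch.

```python
import numpy as np

def brownian_bridge_path(x0, x1, n_steps=20, sigma=1.0, seed=0):
    """Simulate a Brownian bridge pinned to x0 at t=0 and x1 at t=1."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x, t = np.array(x0, dtype=float), 0.0
    path = [x.copy()]
    for _ in range(n_steps):
        remaining = 1.0 - t
        drift = (x1 - x) / remaining                        # pull toward the conditioned endpoint
        noise_std = sigma * np.sqrt(dt * (remaining - dt) / remaining)
        x = x + drift * dt + noise_std * rng.normal(size=x.shape)
        t += dt
        path.append(x.copy())
    return np.stack(path)

source = np.zeros(2)                                        # e.g. a noise sample
target = np.array([2.0, -1.0])                              # e.g. a data sample
path = brownian_bridge_path(source, target)
print(path[0], path[-1])                                    # endpoints are pinned
```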

Result: Gaussian random bridges produce high-quality samples in significantly fewer steps than traditional approaches while achieving competitive Fréchet inception distance scores, demonstrating computational efficiency for high-speed generation.

Conclusion: Random-bridges provide a computationally cheap framework suitable for high-speed generative tasks, bridging theoretical probabilistic foundations with practical learning and simulation algorithms.

Abstract: This paper motivates the use of random-bridges – stochastic processes conditioned to take target distributions at fixed timepoints – in the realm of generative modelling. Herein, random-bridges can act as stochastic transports between two probability distributions when appropriately initialized, and can display either Markovian or non-Markovian, and either continuous, discontinuous or hybrid patterns depending on the driving process. We show how one can start from general probabilistic statements and then branch out into specific representations for learning and simulation algorithms in terms of information processing. Our empirical results, built on Gaussian random bridges, produce high-quality samples in significantly fewer steps compared to traditional approaches, while achieving competitive Fréchet inception distance scores. Our analysis provides evidence that the proposed framework is computationally cheap and suitable for high-speed generation tasks.

[354] Understanding and Improving Hyperbolic Deep Reinforcement Learning

Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak, Lukas Miklautz, Claudia Plant, Yllka Velaj, Sebastian Tschiatschek

Main category: cs.LG

TL;DR: Hyper++ is a new hyperbolic PPO agent that stabilizes training in hyperbolic feature spaces by addressing gradient instability issues, outperforming prior hyperbolic and Euclidean baselines on ProcGen and Atari benchmarks.

DetailsMotivation: Hyperbolic feature spaces are well-suited for RL as they capture hierarchical structure in complex environments, but they face optimization challenges due to nonstationarity and gradient instability from large-norm embeddings.

Method: Hyper++ introduces three components: (1) stable critic training using categorical value loss instead of regression, (2) feature regularization for bounded norms without dimensionality curse from clipping, and (3) optimization-friendly hyperbolic network layer formulations.
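
For component (ii), a plausible minimal form is a soft penalty on embedding norms rather than hard clipping to the Poincaré ball; the threshold and penalty shape below are illustrative assumptions, not the paper's regularizer.

```python
import numpy as np

def norm_regularizer(features, max_norm=0.9):
    """Soft penalty on embedding norms, in place of hard clipping near the ball boundary."""
    norms = np.linalg.norm(features, axis=-1)
    excess = np.maximum(norms - max_norm, 0.0)
    return np.mean(excess ** 2)

feats = np.random.default_rng(0).normal(scale=0.6, size=(8, 16))
print(norm_regularizer(feats))
```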

Result: Hyper++ guarantees stable learning, outperforms prior hyperbolic agents on ProcGen, reduces wall-clock time by ~30%, and strongly outperforms Euclidean and hyperbolic baselines on Atari-5 with Double DQN.

Conclusion: The paper successfully identifies key factors for hyperbolic RL training success, proposes Hyper++ as a solution to gradient instability, and demonstrates its effectiveness across multiple RL benchmarks with improved stability and performance.

Abstract: The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl .

[355] Estimating problem difficulty without ground truth using Large Language Model comparisons

Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis

Main category: cs.LG

TL;DR: Proposes LLM compare, a new method for estimating problem difficulty using pairwise comparisons by LLMs and Bradley-Terry scoring, addressing limitations of existing approaches for out-of-distribution problems.

DetailsMotivation: Current difficulty estimation methods (human calibration, performance-based scoring) fail for out-of-distribution problems because they do not scale, are time-consuming, and depend on ground truth. Better estimators are needed to drive the generation of increasingly difficult synthetic data.

Method: LLM compare: uses LLMs to perform pairwise difficulty comparisons between problems, then computes Bradley-Terry scores from the outcomes. This creates a continuous, dynamic, model-agnostic measure independent of ground truth.
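
The scoring step is standard and easy to sketch: given pairwise "which problem is harder" outcomes, Bradley-Terry strengths can be fit with the classic minorization-maximization update (the comparison prompting itself is not shown).

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of times problem i was judged harder than problem j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        new_p = np.zeros(n)
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p[i] = total_wins / max(denom, 1e-12)
        p = new_p / new_p.sum()                 # fix the scale ambiguity
    return p

# toy outcome matrix for three problems judged pairwise by an LLM
wins = np.array([[0, 4, 5],
                 [1, 0, 3],
                 [0, 2, 0]])
print(bradley_terry(wins))                      # higher score = judged harder more often
```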

Result: 1) Conceptual framework shows LLM compare occupies all desirable quadrants for scoring out-of-distribution problems. 2) Strong alignment with human annotations (Pearson r ≥ 0.80 for n=1876). 3) Robust to hallucinations (<6% degradation with 10% noise injection).

Conclusion: LLM compare represents significant progress toward replacing human annotations and synthetic data generation, with applications in curriculum design, model evaluation, and AI-assisted research ideation.

Abstract: Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are not scalable, time-consuming, and ground truth dependent. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes–construction, scale and dependence–identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Thirdly, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.

[356] Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Divyansh Pareek, Sewoong Oh, Simon S. Du

Main category: cs.LG

TL;DR: Teacher-based data filtering improves contrastive learning performance by reducing error rates, especially when data quality is low.

DetailsMotivation: Internet-scale multimodal datasets contain low-quality data, making data curation critical. Teacher-based filtering (using pre-trained models to score data quality) works empirically but lacks theoretical understanding of why it's effective.

Method: Theoretical analysis using a linear contrastive learning setup under a bimodal data generation model. Compares error bounds with and without filtering, where η represents the fraction of correctly matched modality pairs.

Result: Without filtering, error is bounded by 1/(η√n). With teacher-based filtering: in large η regime, error bounded by 1/√(ηn); in small η regime, error bounded by 1/√n. Shows filtering provides provable benefits.
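
Written out, the bounds being compared are (with $\eta$ the fraction of correctly matched pairs and $n$ the number of pairs):

```latex
\text{no filtering:}\quad \mathrm{err} \;\asymp\; \frac{1}{\eta\sqrt{n}},
\qquad
\text{teacher filtering:}\quad \mathrm{err} \;\lesssim\;
\begin{cases}
  \dfrac{1}{\sqrt{\eta n}} & \text{large } \eta,\\[1ex]
  \dfrac{1}{\sqrt{n}}      & \text{small } \eta,
\end{cases}
```

so the relative gain from filtering grows as the clean fraction $\eta$ shrinks.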

Conclusion: Teacher-based filtering theoretically improves contrastive learning performance by reducing error rates, with greater benefits when data quality is low (small η). Provides theoretical justification for empirical success of this filtering approach.

Abstract: The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta\in(0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.

[357] Physically consistent model learning for reaction-diffusion systems

Erion Morina, Martin Holler

Main category: cs.LG

TL;DR: This paper develops methods to learn physically consistent reaction-diffusion systems from data by enforcing mass conservation and quasipositivity constraints, ensuring well-posedness and alignment with fundamental physical laws.

DetailsMotivation: The motivation is to create data-driven models for reaction-diffusion systems that are not only accurate but also physically consistent, preserving key properties like mass conservation and non-negativity, which are essential for reliable and interpretable models.

Method: The method builds on a regularization-based framework for structured model learning, proposing techniques to systematically modify parameterized reaction terms to inherently satisfy mass conservation and quasipositivity. The approach extends existing theoretical results to reaction-diffusion systems with these physically consistent terms.
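
For a reaction term $f=(f_1,\dots,f_m)$ acting on concentrations $u=(u_1,\dots,u_m)$, the two properties being built into the parameterization are, in their standard form:

```latex
\text{mass conservation:}\quad \sum_{i=1}^{m} f_i(u) = 0 \ \ \text{for all admissible } u,
\qquad
\text{quasipositivity:}\quad f_i(u) \ge 0 \ \ \text{whenever } u \ge 0 \text{ and } u_i = 0,
```

which respectively keep total mass constant and prevent concentrations from becoming negative.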

Result: The paper proves that solutions to the learning problem converge to a unique, regularization-minimizing solution even when conservation laws and quasipositivity are enforced. It also provides approximation results for quasipositive functions essential for constructing physically consistent parameterizations.

Conclusion: The work advances interpretable and reliable data-driven models for reaction-diffusion systems by ensuring physical consistency and well-posedness, bridging the gap between data-driven learning and fundamental physical principles.

Abstract: This paper addresses the problem of learning reaction-diffusion (RD) systems from data while ensuring physical consistency and well-posedness of the learned models. Building on a regularization-based framework for structured model learning, we focus on learning parameterized reaction terms and investigate how to incorporate key physical properties, such as mass conservation and quasipositivity, directly into the learning process. Our main contributions are twofold: First, we propose techniques to systematically modify a given class of parameterized reaction terms such that the resulting terms inherently satisfy mass conservation and quasipositivity, ensuring that the learned RD systems preserve non-negativity and adhere to physical principles. These modifications also guarantee well-posedness of the resulting PDEs under additional regularity and growth conditions. Second, we extend existing theoretical results on regularization-based model learning to RD systems using these physically consistent reaction terms. Specifically, we prove that solutions to the learning problem converge to a unique, regularization-minimizing solution of a limit system even when conservation laws and quasipositivity are enforced. In addition, we provide approximation results for quasipositive functions, essential for constructing physically consistent parameterizations. These results advance the development of interpretable and reliable data-driven models for RD systems that align with fundamental physical laws.

[358] Beyond MMD: Evaluating Graph Generative Models with Geometric Deep Learning

Salvatore Romano, Marco Grassia, Giuseppe Mangioni

Main category: cs.LG

TL;DR: The paper introduces RGM, a novel methodology for evaluating Graph Generative Models that addresses limitations of Maximum Mean Discrepancy, and demonstrates it by evaluating GRAN and EDGE models.

DetailsMotivation: Current evaluation of Graph Generative Models relies heavily on Maximum Mean Discrepancy, which has significant limitations in assessing how well generated graphs preserve structural characteristics of real-world networks.

Method: Proposes RGM (Representation-aware Graph-generation Model evaluation) methodology and demonstrates it using a Geometric Deep Learning model trained on custom datasets of synthetic and real-world graphs for classification tasks.

Result: Evaluation of GRAN and EDGE shows both can generate graphs with certain topological properties but have significant limitations in preserving structural characteristics that distinguish different graph domains.

Conclusion: MMD is inadequate for evaluating GGMs, and alternative approaches like RGM are needed; future research should focus on better evaluation metrics for graph generative models.

Abstract: Graph generation is a crucial task in many fields, including network science and bioinformatics, as it enables the creation of synthetic graphs that mimic the properties of real-world networks for various applications. Graph Generative Models (GGMs) have emerged as a promising solution to this problem, leveraging deep learning techniques to learn the underlying distribution of real-world graphs and generate new samples that closely resemble them. Examples include approaches based on Variational Auto-Encoders, Recurrent Neural Networks, and more recently, diffusion-based models. However, the main limitation often lies in the evaluation process, which typically relies on Maximum Mean Discrepancy (MMD) as a metric to assess the distribution of graph properties in the generated ensemble. This paper introduces a novel methodology for evaluating GGMs that overcomes the limitations of MMD, which we call RGM (Representation-aware Graph-generation Model evaluation). As a practical demonstration of our methodology, we present a comprehensive evaluation of two state-of-the-art Graph Generative Models: Graph Recurrent Attention Networks (GRAN) and Efficient and Degree-guided graph GEnerative model (EDGE). We investigate their performance in generating realistic graphs and compare them using a Geometric Deep Learning model trained on a custom dataset of synthetic and real-world graphs, specifically designed for graph classification tasks. Our findings reveal that while both models can generate graphs with certain topological properties, they exhibit significant limitations in preserving the structural characteristics that distinguish different graph domains. We also highlight the inadequacy of Maximum Mean Discrepancy as an evaluation metric for GGMs and suggest alternative approaches for future research.

[359] FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting

Xingjian Wu, Hanyin Cheng, Xiangfei Qiu, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang

Main category: cs.LG

TL;DR: FLAME is a lightweight time series foundation model that achieves state-of-the-art zero-shot performance on both deterministic and probabilistic forecasting using Legendre Memory variants and Normalization Flow.

DetailsMotivation: To create an extremely lightweight yet capable time series foundation model that supports both deterministic and probabilistic forecasting while ensuring efficiency and robustness.

Method: Uses Legendre Memory variants (LegT and LegS) in Encoding and Decoding phases to capture inductive bias and enable efficient long-range inference. Employs Normalization Flow based forecasting head for generative probabilistic modeling of complex distributions.

Result: Demonstrates consistent state-of-the-art zero-shot performance on TSFM-Bench and ProbTS benchmarks for both deterministic and probabilistic forecasting tasks.

Conclusion: FLAME provides an efficient, lightweight foundation model for time series forecasting that excels in both deterministic and probabilistic settings with strong generalization capabilities.

Abstract: In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. Through adapting variants of Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the Encoding and Decoding phases, FLAME can effectively capture the inherent inductive bias within data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while keeping efficient, FLAME adopts a Normalization Flow based forecasting head, which can model the arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.

[360] Differentially Private Knowledge Distillation via Synthetic Text Generation

James Flemings, Murali Annavaram

Main category: cs.LG

TL;DR: DistilDP: A differentially private knowledge distillation method that uses synthetic data from a DP-trained teacher LLM to compress models while preserving privacy, achieving better utility than existing baselines.

DetailsMotivation: There's growing need to train LLMs with differential privacy for data protection while also compressing models for deployment on resource-constrained devices. Both DP and compression typically cause utility loss, and applying them together compounds this degradation.

Method: Proposes DistilDP: a DP knowledge distillation algorithm using synthetic data generated by a differentially private teacher LLM. Knowledge is transferred via hard labels from synthetic data and soft labels from teacher’s output distribution. When teacher and student share similar architecture, also aligns hidden representations.
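
On the student side, the two transfer paths amount to a standard distillation loss: cross-entropy against the synthetic tokens (hard labels) plus a KL term against the DP teacher's output distribution (soft labels). Temperature, weighting, and the hidden-state alignment term are omitted or chosen arbitrarily in this sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, hard_labels, alpha=0.5):
    """Hard-label cross-entropy + KL(teacher || student), averaged over tokens."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    rows = np.arange(len(hard_labels))
    ce = -np.log(p_s[rows, hard_labels] + 1e-12).mean()
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    return alpha * ce + (1.0 - alpha) * kl

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 10))      # 4 synthetic tokens, vocabulary of 10
teacher = rng.normal(size=(4, 10))
labels = np.array([3, 1, 7, 7])         # the synthetic tokens themselves (hard labels)
print(distill_loss(student, teacher, labels))
```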

Result: Substantially improves utility over existing baselines, achieving at least 9.0 PPL improvement on Big Patent dataset with strong privacy parameters (ε=2).

Conclusion: DistilDP enables effective privacy-preserving compression of autoregressive LLMs, advancing the field of deploying private, efficient language models.

Abstract: Large Language models (LLMs) are achieving state-of-the-art performance in many different downstream tasks. However, the increasing urgency of data privacy puts pressure on practitioners to train LLMs with Differential Privacy (DP) on private data. Concurrently, the exponential growth in parameter size of LLMs necessitates model compression before deployment of LLMs on resource-constrained devices or latency-sensitive applications. Differential privacy and model compression generally must trade off utility loss to achieve their objectives. Moreover, simultaneously applying both schemes can compound the utility degradation. To this end, we propose DistilDP: a novel differentially private knowledge distillation algorithm that exploits synthetic data generated by a differentially private teacher LLM. The knowledge of a teacher LLM is transferred onto the student in two ways: one way from the synthetic data itself – the hard labels, and the other way by the output distribution of the teacher evaluated on the synthetic data – the soft labels. Furthermore, if the teacher and student share a similar architectural structure, we can further distill knowledge by aligning the hidden representations between both. Our experimental results demonstrate that DistilDP can substantially improve the utility over existing baselines, at least $9.0$ PPL on the Big Patent dataset, with strong privacy parameters, $\varepsilon=2$. These promising results progress privacy-preserving compression of autoregressive LLMs. Our code can be accessed here: https://github.com/james-flemings/dp_compress.

[361] Explainable Preference Learning: a Decision Tree-based Surrogate Model for Preferential Bayesian Optimization

Nick Leenders, Thomas Quadt, Boris Cule, Roy Lindelauf, Herman Monsuur, Joost van Oijen, Mark Voskuijl

Main category: cs.LG

TL;DR: The paper introduces an interpretable decision tree-based surrogate model for Preferential Bayesian Optimization that handles categorical/continuous data, scales to large datasets, and outperforms GP-based methods on spiky functions.

DetailsMotivation: Current GP-based Preferential Bayesian Optimization methods have three main limitations: they are hard to interpret, struggle with categorical data, and are computationally complex, which restricts their real-world usability.

Method: The authors propose an inherently interpretable decision tree-based surrogate model that can handle both categorical and continuous data, and is scalable to large datasets. They also explore using historical preference data to accelerate optimization for new users.

Result: Extensive experiments on eight spiky optimization functions show the model outperforms GP-based alternatives on spiky functions and has only marginally lower performance for non-spiky functions. Application to the real-world Sushi dataset demonstrates its ability to learn individual sushi preferences.

Conclusion: The decision tree-based approach provides an interpretable, scalable alternative to GP-based Preferential Bayesian Optimization that handles mixed data types effectively and shows promising results for both synthetic and real-world preference learning tasks.

Abstract: Current Preferential Bayesian Optimization methods rely on Gaussian Processes (GPs) as surrogate models. These models are hard to interpret, struggle with handling categorical data, and are computationally complex, limiting their real-world usability. In this paper, we introduce an inherently interpretable decision tree-based surrogate model capable of handling both categorical and continuous data, and scalable to large datasets. Extensive numerical experiments on eight increasingly spiky optimization functions show that our model outperforms GP-based alternatives on spiky functions and has only marginally lower performance for non-spiky functions. Moreover, we apply our model to the real-world Sushi dataset and show its ability to learn an individual’s sushi preferences. Finally, we show some initial work on using historical preference data to speed up the optimization process for new unseen users.

[362] Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits

Michael Murray, Tenzin Chan, Kedar Karhadker, Christopher J. Hillar

Main category: cs.LG

TL;DR: Hopfield networks can learn graph isomorphism classes from small samples via norm-efficient solutions and implicit bias toward invariant subspaces.

DetailsMotivation: To understand how invariance emerges implicitly in neural networks when trained on group-structured data, particularly in classical Hopfield networks learning graph isomorphism classes.

Method: Study Hopfield networks learning graph isomorphism classes, analyze gradient descent minimizing energy flow (MEF), examine implicit bias toward norm-efficient solutions, and track parameter convergence toward invariant subspaces.

Result: Graph isomorphism classes can be represented in 3D invariant subspace; MEF gradient descent has implicit bias toward norm-efficient solutions enabling polynomial sample complexity; parameters converge toward invariant subspace with increasing samples.

Conclusion: Generalization in Hopfield networks is driven by norm-efficiency bias that leads to approximate invariance emergence under group-structured data.

Abstract: Many learning problems involve symmetries, and while invariance can be built into neural architectures, it can also emerge implicitly when training on group-structured data. We study this phenomenon in classical Hopfield networks and show they can infer the full isomorphism class of a graph from a small random sample. Our results reveal that: (i) graph isomorphism classes can be represented within a three-dimensional invariant subspace, (ii) using gradient descent to minimize energy flow (MEF) has an implicit bias toward norm-efficient solutions, which underpins a polynomial sample complexity bound for learning isomorphism classes, and (iii) across multiple learning rules, parameters converge toward the invariant subspace as sample sizes grow. Together, these findings highlight a unifying mechanism for generalization in Hopfield networks: a bias toward norm efficiency in learning drives the emergence of approximate invariance under group-structured data.

[363] Causal Structure Learning for Dynamical Systems with Theoretical Score Analysis

Nicholas Tagliapietra, Katharina Ensinger, Christoph Zimmer, Osman Mian

Main category: cs.LG

TL;DR: CaDyT is a novel causal discovery method for dynamical systems that models continuous-time dynamics using Difference-based causal models and Gaussian Process inference, outperforming state-of-the-art methods on both regularly and irregularly-sampled data.

DetailsMotivation: Real-world systems evolve in continuous-time with unknown dynamics, but existing approaches either discretize time (performing poorly on irregularly sampled data) or ignore underlying causality. There's a need for methods that address both challenges.

Method: CaDyT uses Difference-based causal models instead of discrete-time Dynamic Bayesian networks, leveraging exact Gaussian Process inference for continuous-time dynamics. It identifies causal structure via greedy search guided by the Algorithmic Markov Condition and Minimum Description Length principle.

Result: CaDyT outperforms state-of-the-art methods on both regularly and irregularly-sampled data, discovering causal networks closer to the true underlying dynamics.

Conclusion: CaDyT successfully addresses the limitations of existing approaches by modeling continuous-time dynamics more accurately while incorporating causality, making it effective for real-world dynamical systems with irregular sampling.

Abstract: Real world systems evolve in continuous-time according to their underlying causal relationships, yet their dynamics are often unknown. Existing approaches to learning such dynamics typically either discretize time – leading to poor performance on irregularly sampled data – or ignore the underlying causality. We propose CaDyT, a novel method for causal discovery on dynamical systems addressing both these challenges. In contrast to state-of-the-art causal discovery methods that model the problem using discrete-time Dynamic Bayesian networks, our formulation is grounded in Difference-based causal models, which allow milder assumptions for modeling the continuous nature of the system. CaDyT leverages exact Gaussian Process inference for modeling the continuous-time dynamics which is more aligned with the underlying dynamical process. We propose a practical instantiation that identifies the causal structure via a greedy search guided by the Algorithmic Markov Condition and Minimum Description Length principle. Our experiments show that CaDyT outperforms state-of-the-art methods on both regularly and irregularly-sampled data, discovering causal networks closer to the true underlying dynamics.

[364] Black-Box Auditing of Quantum Model: Lifted Differential Privacy with Quantum Canaries

Baobao Song, Shiva Raj Pokhrel, Athanasios V. Vasilakos, Tianqing Zhu, Gang Li

Main category: cs.LG

TL;DR: First black-box privacy auditing framework for Quantum Machine Learning using quantum canaries to detect memorization and quantify privacy leakage, bridging theoretical QDP guarantees with practical verification.

DetailsMotivation: Quantum ML models risk memorizing sensitive training data, creating privacy vulnerabilities. While Quantum Differential Privacy provides theoretical guarantees, there's a critical lack of empirical verification tools for deployed QML models.

Method: Introduces a black-box privacy auditing framework based on Lifted Quantum Differential Privacy, using quantum canaries (strategically offset-encoded quantum states) to detect memorization. Establishes mathematical connection between canary offset and trace distance bounds to derive empirical lower bounds on privacy budget consumption.

Result: Comprehensive evaluations across both simulated and physical quantum hardware demonstrate the framework’s effectiveness in measuring actual privacy loss in QML models, enabling robust privacy verification in QML systems.

Conclusion: The framework bridges the critical gap between theoretical Quantum Differential Privacy guarantees and practical privacy verification, providing the first empirical auditing tool for QML privacy.

Abstract: Quantum machine learning (QML) promises significant computational advantages, yet models trained on sensitive data risk memorizing individual records, creating serious privacy vulnerabilities. While Quantum Differential Privacy (QDP) mechanisms provide theoretical worst-case guarantees, they critically lack empirical verification tools for deployed models. We introduce the first black-box privacy auditing framework for QML based on Lifted Quantum Differential Privacy, leveraging quantum canaries (strategically offset-encoded quantum states) to detect memorization and precisely quantify privacy leakage during training. Our framework establishes a rigorous mathematical connection between canary offset and trace distance bounds, deriving empirical lower bounds on privacy budget consumption that bridge the critical gap between theoretical guarantees and practical privacy verification. Comprehensive evaluations across both simulated and physical quantum hardware demonstrate our framework’s effectiveness in measuring actual privacy loss in QML models, enabling robust privacy verification in QML systems.

[365] SuperWing: a comprehensive transonic wing dataset for data-driven aerodynamic design

Yunjia Yang, Weishao Tang, Mengxin Liu, Nils Thuerey, Yufei Zhang, Haixin Chen

Main category: cs.LG

TL;DR: SuperWing is a comprehensive open dataset of 4,239 parameterized wing geometries with 28,856 RANS flow solutions for transonic swept-wing aerodynamics, enabling accurate ML surrogate models that generalize well to complex benchmark wings.

DetailsMotivation: Current ML surrogate models for aerodynamic design are limited by scarce and non-diverse datasets for 3D wings, hindering progress toward generalizable predictors.

Method: Created SuperWing dataset using expressive geometry parameterization with spanwise variations in airfoil shape, twist, and dihedral. Generated 4,239 wing shapes simulated across broad Mach numbers and angles of attack using RANS flow field solutions.

Result: Benchmarked state-of-the-art Transformers achieve 2.5 drag-count error on held-out samples. Models pretrained on SuperWing show strong zero-shot generalization to complex benchmark wings (DLR-F6, NASA CRM).

Conclusion: SuperWing provides a diverse, comprehensive dataset that enables accurate ML surrogate models for transonic wing aerodynamics with strong generalization capabilities, addressing previous dataset limitations.

Abstract: Machine-learning surrogate models have shown promise in accelerating aerodynamic design, yet progress toward generalizable predictors for three-dimensional wings has been limited by the scarcity and restricted diversity of existing datasets. Here, we present SuperWing, a comprehensive open dataset of transonic swept-wing aerodynamics comprising 4,239 parameterized wing geometries and 28,856 Reynolds-averaged Navier-Stokes flow field solutions. The wing shapes in the dataset are generated using a simplified yet expressive geometry parameterization that incorporates spanwise variations in airfoil shape, twist, and dihedral, allowing for an enhanced diversity without relying on perturbations of a baseline wing. All shapes are simulated under a broad range of Mach numbers and angles of attack covering the typical flight envelope. To demonstrate the dataset’s utility, we benchmark two state-of-the-art Transformers that accurately predict surface flow and achieve a 2.5 drag-count error on held-out samples. Models pretrained on SuperWing further exhibit strong zero-shot generalization to complex benchmark wings such as DLR-F6 and NASA CRM, underscoring the dataset’s diversity and potential for practical usage.

[366] FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao

Main category: cs.LG

TL;DR: FreeKV is an algorithm-system co-optimization framework that improves KV cache efficiency for long-context LLMs by combining speculative retrieval with fine-grained correction and optimized memory layouts.

DetailsMotivation: Long contexts in LLMs create deployment challenges due to the KV cache size growing proportionally with context length. Existing KV cache compression methods either lose accuracy (KV dropping) or suffer efficiency bottlenecks (KV retrieval).

Method: FreeKV uses algorithm-system co-optimization: (1) Algorithm side: speculative retrieval to move KV selection/recall out of critical path + fine-grained correction for accuracy; (2) System side: hybrid KV layouts across CPU/GPU memory to eliminate fragmented transfers + double-buffered streamed recall.

Result: FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13× speedup compared to state-of-the-art KV retrieval methods.

Conclusion: FreeKV effectively addresses KV cache efficiency challenges for long-context LLMs by balancing accuracy preservation with significant performance improvements through algorithm-system co-design.

Abstract: Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13$\times$ speedup compared to SOTA KV retrieval methods.
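A minimal numpy sketch of the speculative-retrieval idea described above, under assumed simplifications: the KV cache is split into pages summarized by a proxy vector, pages are pre-selected with the previous decoding step's query so the recall can overlap with other work, and a small correction budget re-fetches pages the current query actually needs. The page-summary proxy, function names, and budgets are illustrative, not FreeKV's implementation.

```python
import numpy as np

def select_pages(query, page_summaries, k):
    """Score each KV page against a query and return the top-k page ids."""
    scores = page_summaries @ query
    return set(np.argsort(scores)[-k:].tolist())

def decode_step(q_prev, q_curr, page_summaries, k=8, correction_budget=2):
    # Speculative phase: pages are chosen with the *previous* query, so the
    # recall (e.g., CPU-to-GPU transfer) can run off the critical path.
    speculative = select_pages(q_prev, page_summaries, k)
    # Correction phase: rescore with the *current* query and fetch a small
    # number of pages the speculation missed.
    exact = select_pages(q_curr, page_summaries, k)
    missed = list(exact - speculative)[:correction_budget]
    return speculative | set(missed)

rng = np.random.default_rng(0)
summaries = rng.normal(size=(64, 128))            # 64 pages, head dim 128
q_prev, q_curr = rng.normal(size=128), rng.normal(size=128)
print(sorted(decode_step(q_prev, q_curr, summaries)))
```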

[367] GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion

Fangzhou Lin, Guoshun He, Zhenyu Guo, Zhe Huang, Jinsong Tao

Main category: cs.LG

TL;DR: GRAFT is a text-enhanced load forecasting model that aligns daily news, social media, and policy texts with half-hour electricity load data, using cross-attention for text-guided fusion and providing a plug-and-play memory interface for real-world deployment.

DetailsMotivation: Electric load forecasting is complex due to multiple time-scale influences from weather, calendar rhythms, sudden events, and policies. Existing models need better support for grid-aware forecasting and multi-source textual interventions to capture these diverse factors.

Method: GRAFT modifies STanHOP to strictly align daily-aggregated texts (news, social media, policy) with half-hour load data, implements text-guided fusion via cross-attention during training and rolling forecasting, and provides a plug-and-play external-memory interface for different information sources.

Result: GRAFT significantly outperforms strong baselines and reaches/surpasses state-of-the-art across multiple Australian regions and forecasting horizons (hourly, daily, monthly). The model shows robustness in event-driven scenarios and enables temporal localization and source-level interpretation of text-to-load effects.

Conclusion: GRAFT effectively integrates multi-source textual information with load forecasting, providing improved accuracy, interpretability, and real-world deployment flexibility. The released benchmark and tools facilitate standardized evaluation in power grid load forecasting research.

Abstract: Electric load is simultaneously affected across multiple time scales by exogenous factors such as weather and calendar rhythms, sudden events, and policies. Therefore, this paper proposes GRAFT (GRid-Aware Forecasting with Text), which modifies and improves STanHOP to better support grid-aware forecasting and multi-source textual interventions. Specifically, GRAFT strictly aligns daily-aggregated news, social media, and policy texts with half-hour load, and realizes text-guided fusion to specific time positions via cross-attention during both training and rolling forecasting. In addition, GRAFT provides a plug-and-play external-memory interface to accommodate different information sources in real-world deployment. We construct and release a unified aligned benchmark covering 2019–2021 for five Australian states (half-hour load, daily-aligned weather/calendar variables, and three categories of external texts), and conduct systematic, reproducible evaluations at three scales – hourly, daily, and monthly – under a unified protocol for comparison across regions, external sources, and time scales. Experimental results show that GRAFT significantly outperforms strong baselines and reaches or surpasses the state of the art across multiple regions and forecasting horizons. Moreover, the model is robust in event-driven scenarios and enables temporal localization and source-level interpretation of text-to-load effects through attention read-out. We release the benchmark, preprocessing scripts, and forecasting results to facilitate standardized empirical evaluation and reproducibility in power grid load forecasting.
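A minimal PyTorch sketch of the text-guided fusion step described in the method: half-hourly load tokens attend, via cross-attention, to daily-aggregated text embeddings (one token per source). Shapes, dimensions, and the residual layout are assumptions for illustration, not GRAFT's actual architecture.

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, load_tokens, text_tokens):
        # load_tokens: (B, 48, d), one token per half-hour slot of the day
        # text_tokens: (B, 3, d), one token per text source (news / social / policy)
        fused, _ = self.cross_attn(query=load_tokens, key=text_tokens, value=text_tokens)
        return self.norm(load_tokens + fused)       # residual, text-conditioned update

load = torch.randn(2, 48, 64)
text = torch.randn(2, 3, 64)
print(TextGuidedFusion()(load, text).shape)         # torch.Size([2, 48, 64])
```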

[368] Dual-Axis RCCL: Representation-Complete Convergent Learning for Organic Chemical Space

Dejun Hu, Zhiming Li, Jia-Rui Shen, Jia-Ning Tu, Zi-Hao Ye, Junliang Zhang

Main category: cs.LG

TL;DR: The paper introduces a Dual-Axis Representation-Complete Convergent Learning (RCCL) strategy with FD25 dataset to achieve near-complete coverage of organic chemical space, enabling convergent learning and strong generalization in molecular property prediction.

DetailsMotivation: Despite machine learning's impact on molecular modeling, it's unclear whether models can achieve convergent learning across the vast chemical space (10^30-10^60 molecules). Current approaches lack systematic coverage and principled frameworks for representation completeness.

Method: Dual-Axis RCCL strategy combining: 1) GCN encoding of local valence environments based on valence bond theory, and 2) NBG encoding of ring/cage topologies. This provides quantitative chemical-space coverage measurement. Applied to create FD25 dataset covering 13,302 local valence units and 165,726 ring/cage topologies for H/C/N/O/F organic molecules.

Result: Graph neural networks trained on FD25 achieve representation-complete convergent learning with strong out-of-distribution generalization. Overall prediction error ~1.0 kcal/mol MAE across external benchmarks, establishing quantitative link between representation completeness and model generalization.

Conclusion: The RCCL framework provides principled basis for constructing datasets that support convergent learning for large models, establishing foundation for interpretable, transferable, and data-efficient molecular intelligence through systematic chemical space coverage.

Abstract: Machine learning is profoundly reshaping molecular and materials modeling; however, given the vast scale of chemical space (10^30-10^60), it remains an open scientific question whether models can achieve convergent learning across this space. We introduce a Dual-Axis Representation-Complete Convergent Learning (RCCL) strategy, enabled by a molecular representation that integrates graph convolutional network (GCN) encoding of local valence environments, grounded in modern valence bond theory, together with no-bridge graph (NBG) encoding of ring/cage topologies, providing a quantitative measure of chemical-space coverage. This framework formalizes representation completeness, establishing a principled basis for constructing datasets that support convergent learning for large models. Guided by this RCCL framework, we develop the FD25 dataset, systematically covering 13,302 local valence units and 165,726 ring/cage topologies, achieving near-complete combinatorial coverage of organic molecules with H/C/N/O/F elements. Graph neural networks trained on FD25 exhibit representation-complete convergent learning and strong out-of-distribution generalization, with an overall prediction error of approximately 1.0 kcal/mol MAE across external benchmarks. Our results establish a quantitative link between molecular representation, structural completeness, and model generalization, providing a foundation for interpretable, transferable, and data-efficient molecular intelligence.

[369] Bridging Artificial Intelligence and Data Assimilation: The Data-driven Ensemble Forecasting System ClimaX-LETKF

Akira Takeshima, Kenta Shiraishi, Atsushi Okazaki, Tadashi Tsuyuki, Shunji Kotsuki

Main category: cs.LG

TL;DR: First purely data-driven ML-based ensemble weather forecasting system that assimilates real observations and operates stably for years without NWP models

DetailsMotivation: Limited research on assimilating real observations or ensemble forecasts within MLWP models despite significant advancements in ML-based weather prediction

Method: ClimaX-LETKF system that assimilates NCEP ADP Global Upper Air and Surface Weather Observations, using relaxation to prior perturbation (RTPP) instead of relaxation to prior spread (RTPS)

Result: System demonstrates greater stability and accuracy with RTPP than RTPS (opposite of NWP models), and MLWP models are less capable of restoring atmospheric field to its attractor than NWP models

Conclusion: Provides valuable insights for enhancing MLWP ensemble forecasting systems and represents a substantial step toward practical applications of MLWP

Abstract: While machine learning-based weather prediction (MLWP) has achieved significant advancements, research on assimilating real observations or ensemble forecasts within MLWP models remains limited. We introduce ClimaX-LETKF, the first purely data-driven ML-based ensemble weather forecasting system. It operates stably over multiple years, independently of numerical weather prediction (NWP) models, by assimilating the NCEP ADP Global Upper Air and Surface Weather Observations. The system demonstrates greater stability and accuracy with relaxation to prior perturbation (RTPP) than with relaxation to prior spread (RTPS), while NWP models tend to be more stable with RTPS. RTPP replaces an analysis perturbation with a weighted blend of analysis and background perturbations, whereas RTPS simply rescales the analysis perturbation. Our experiments reveal that MLWP models are less capable of restoring the atmospheric field to its attractor than NWP models. This work provides valuable insights for enhancing MLWP ensemble forecasting systems and represents a substantial step toward their practical applications.
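Since the RTPP/RTPS comparison is central to the result, here is a short numpy sketch of the two relaxation schemes as they are usually defined in the data assimilation literature; the relaxation weight alpha and ensemble shapes are illustrative.

```python
import numpy as np

def rtpp(analysis_pert, background_pert, alpha=0.5):
    """Relaxation To Prior Perturbation: blend analysis and background perturbations."""
    return (1.0 - alpha) * analysis_pert + alpha * background_pert

def rtps(analysis_pert, background_pert, alpha=0.5, eps=1e-12):
    """Relaxation To Prior Spread: rescale analysis perturbations toward the background spread."""
    spread_a = analysis_pert.std(axis=0, ddof=1)
    spread_b = background_pert.std(axis=0, ddof=1)
    return analysis_pert * (alpha * (spread_b - spread_a) / (spread_a + eps) + 1.0)

rng = np.random.default_rng(0)
bg = rng.normal(size=(20, 5))                 # 20 ensemble members, 5 grid points
an = 0.5 * bg                                 # analysis step has shrunk the spread
an_pert, bg_pert = an - an.mean(0), bg - bg.mean(0)
print(rtpp(an_pert, bg_pert).std(0, ddof=1))  # spread relaxed toward the background
print(rtps(an_pert, bg_pert).std(0, ddof=1))
```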

[370] Retrieval Enhanced Feedback via In-context Neural Error-book

Jongyeop Hyun, Bumsoo Kim

Main category: cs.LG

TL;DR: REFINE is a teacher-student framework that uses structured error analysis and targeted feedback to enhance multimodal reasoning in MLLMs, improving efficiency and performance through systematic error-book construction.

DetailsMotivation: Existing methods for learning from errors in multimodal reasoning lack structured frameworks, particularly for MLLMs where visual-textual integration adds complexity. Current approaches have inefficient retrieval mechanisms and don't systematically analyze or mitigate errors.

Method: Proposes REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework with three systematic queries (Feed-Target, Feed-Check, Feed-Path) to construct structured feedback that prioritizes visual information, diagnoses failure points, and formulates corrective actions.

Result: Demonstrates substantial speedup, reduced computational costs, and successful generalization. The framework improves inference efficiency, token usage, and scalability compared to prior approaches with redundant retrievals.

Conclusion: REFINE provides an effective structured approach for error analysis and feedback in multimodal reasoning, showing potential for enhancing MLLM performance through systematic error-book construction and optimized feedback retrieval.

Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback – Feed-Target, Feed-Check, and Feed-Path – to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.

[371] AnySleep: a channel-agnostic deep learning system for high-resolution sleep staging in multi-center cohorts

Niklas Grieger, Jannik Raskob, Siamak Mehrkanoon, Stephan Bialonski

Main category: cs.LG

TL;DR: AnySleep is a deep neural network that scores sleep at adjustable temporal resolutions using any EEG/EOG data, trained on 19k+ recordings across 21 datasets for robust generalization.

DetailsMotivation: Manual sleep staging is labor-intensive, and traditional 30-second epoch scoring is based on pragmatic rather than physiological reasons. Multi-center studies face challenges due to variability in electrode setups, montages, and subject characteristics, limiting harmonized research and biomarker discovery at shorter timescales.

Method: Developed a deep neural network model that can use any EEG or EOG data to score sleep at adjustable temporal resolutions. Trained and validated on over 19,000 overnight recordings from 21 datasets (nearly 200,000 hours of data) to ensure robust generalization across different clinical sites and recording setups.

Result: Achieves state-of-the-art performance, matching or surpassing established baselines at 30-second epochs. Performance improves with more channels but remains strong with limited inputs (EOG only or single EEG derivations). At sub-30-second timescales, the model captures short wake intrusions consistent with arousals and improves prediction of physiological characteristics (age, sex) and pathophysiological conditions (sleep apnea).

Conclusion: AnySleep enables large-scale sleep studies with heterogeneous electrode setups and accelerates discovery of novel biomarkers by providing flexible, high-resolution sleep scoring that generalizes well across different clinical settings and recording configurations.

Abstract: Sleep is essential for good health throughout our lives, yet studying its dynamics requires manual sleep staging, a labor-intensive step in sleep research and clinical care. Across centers, polysomnography (PSG) recordings are traditionally scored in 30-s epochs for pragmatic, not physiological, reasons and can vary considerably in electrode count, montage, and subject characteristics. These constraints present challenges in conducting harmonized multi-center sleep studies and discovering novel, robust biomarkers on shorter timescales. Here, we present AnySleep, a deep neural network model that uses any electroencephalography (EEG) or electrooculography (EOG) data to score sleep at adjustable temporal resolutions. We trained and validated the model on over 19,000 overnight recordings from 21 datasets collected across multiple clinics, spanning nearly 200,000 hours of EEG and EOG data, to promote robust generalization across sites. The model attains state-of-the-art performance and surpasses or equals established baselines at 30-s epochs. Performance improves as more channels are provided, yet remains strong when EOG is absent or when only EOG or single EEG derivations (frontal, central, or occipital) are available. On sub-30-s timescales, the model captures short wake intrusions consistent with arousals and improves prediction of physiological characteristics (age, sex) and pathophysiological conditions (sleep apnea), relative to standard 30-s scoring. We make the model publicly available to facilitate large-scale studies with heterogeneous electrode setups and to accelerate the discovery of novel biomarkers in sleep.

[372] Kinetic-Mamba: Mamba-Assisted Predictions of Stiff Chemical Kinetics

Additi Pandey, Liang Wei, Hessam Babaee, George Em Karniadakis

Main category: cs.LG

TL;DR: Kinetic-Mamba: A Mamba-based neural operator framework for accurate chemical kinetics modeling in combustion simulations, using three complementary models to predict thermochemical state evolution from initial conditions.

DetailsMotivation: Accurate chemical kinetics modeling is essential for combustion simulations as it governs complex reaction pathways and thermochemical states. Current approaches need improvement in efficiency and accuracy for predicting kinetic behavior.

Method: Three complementary Mamba-based models: (1) standalone Mamba for state evolution prediction, (2) constrained Mamba enforcing mass conservation, (3) regime-informed architecture with two Mamba models for temperature-dependent regimes. Also developed latent Kinetic-Mamba variant for reduced latent space evolution with physical manifold reconstruction.

Result: High fidelity in predicting complex kinetic behavior using only initial conditions. Demonstrated accuracy and robustness with time-decomposition and recursive-prediction strategies. Good extrapolation capabilities on out-of-distribution datasets. Validated on Syngas and GRI-Mech 3.0 reaction mechanisms.

Conclusion: Kinetic-Mamba framework successfully integrates neural operators with Mamba architectures for efficient and accurate chemical kinetics modeling, achieving high prediction fidelity while maintaining physical constraints like mass conservation.

Abstract: Accurate chemical kinetics modeling is essential for combustion simulations, as it governs the evolution of complex reaction pathways and thermochemical states. In this work, we introduce Kinetic-Mamba, a Mamba-based neural operator framework that integrates the expressive power of neural operators with the efficient temporal modeling capabilities of Mamba architectures. The framework comprises three complementary models: (i) a standalone Mamba model that predicts the time evolution of thermochemical state variables from given initial conditions; (ii) a constrained Mamba model that enforces mass conservation while learning the state dynamics; and (iii) a regime-informed architecture employing two standalone Mamba models to capture dynamics across temperature-dependent regimes. We additionally develop a latent Kinetic-Mamba variant that evolves dynamics in a reduced latent space and reconstructs the full state on the physical manifold. We evaluate the accuracy and robustness of Kinetic-Mamba using both time-decomposition and recursive-prediction strategies. We further assess the extrapolation capabilities of the model on varied out-of-distribution datasets. Computational experiments on Syngas and GRI-Mech 3.0 reaction mechanisms demonstrate that our framework achieves high fidelity in predicting complex kinetic behavior using only the initial conditions of the state variables.

[373] Improving Slow Transfer Predictions: Generative Methods Compared

Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim

Main category: cs.LG

TL;DR: The paper addresses class imbalance in ML models for predicting data transfer performance in scientific networks, comparing augmentation strategies and finding that advanced techniques like CTGAN don’t significantly outperform simple stratified sampling.

DetailsMotivation: Class imbalance is a key bottleneck limiting the predictive power of ML models for monitoring data transfer performance in scientific computing networks. Early prediction of sluggish transfers could optimize network usage, but imbalanced data hinders model accuracy.

Method: The study analyzes and compares various augmentation strategies including traditional oversampling methods and generative techniques (like CTGAN). Researchers also adjust class imbalance ratios in training datasets to evaluate their impact on model performance.

Result: Augmentation may improve performance in some cases, but as the imbalance ratio increases the improvements remain insignificant. Even the most advanced technique, CTGAN, does not significantly improve over simple stratified sampling.

Conclusion: Advanced generative augmentation techniques like CTGAN don’t provide significant advantages over simpler stratified sampling approaches for addressing class imbalance in data transfer performance prediction models, especially as imbalance ratios increase.

Abstract: Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified and selectively monitored, optimizing network usage and overall performance. A key bottleneck to improving the predictive power of machine learning (ML) models in this context is the issue of class imbalance. This project focuses on addressing the class imbalance problem to enhance the accuracy of performance predictions. In this study, we analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. Additionally, we adjust the class imbalance ratios in training datasets to evaluate their impact on model performance. While augmentation may improve performance, as the imbalance ratio increases, the performance does not significantly improve. We conclude that even the most advanced technique, such as CTGAN, does not significantly improve over simple stratified sampling.
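A small, self-contained sketch of the kind of comparison the study runs: training a classifier on a class-balanced (stratified) subsample versus on randomly oversampled minority data for an imbalanced binary task. The synthetic data, model, and metric are stand-ins; CTGAN-based augmentation is omitted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rng = np.random.default_rng(0)

def balanced_subsample(X, y):
    """Keep all minority rows plus an equal-sized random draw of majority rows."""
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = np.concatenate([minority, rng.choice(majority, size=len(minority), replace=False)])
    return X[keep], y[keep]

def random_oversample(X, y):
    """Duplicate minority rows at random until both classes have equal counts."""
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

for name, (Xa, ya) in [("balanced subsample", balanced_subsample(X_tr, y_tr)),
                       ("random oversampling", random_oversample(X_tr, y_tr))]:
    clf = RandomForestClassifier(random_state=0).fit(Xa, ya)
    print(name, round(f1_score(y_te, clf.predict(X_te)), 3))
```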

[374] Synthetic Electrogram Generation with Variational Autoencoders for ECGI

Miriam Gutiérrez Fernández, Karen López-Linares, Carlos Fambuena Santos, María S. Guillem, Andreu M. Climent, Óscar Barquero Pérez

Main category: cs.LG

TL;DR: VAE-based generative models create synthetic atrial electrograms to address data scarcity in electrocardiographic imaging, with two models achieving different trade-offs between fidelity and rhythm-specific generation.

DetailsMotivation: Atrial fibrillation assessment requires accurate atrial electrical activity characterization, but deep learning approaches for electrocardiographic imaging are limited by scarce paired body surface potential and intracardiac electrogram datasets.

Method: Two variational autoencoder models: VAE-S for sinus rhythm-specific generation and VAE-C for class-conditioned generation trained on both sinus rhythm and AF signals. Generated electrograms evaluated using morphological, spectral, and distributional similarity metrics.

Result: VAE-S achieves higher fidelity with in silico electrograms, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. Using the generated electrograms for data augmentation moderately improves downstream noninvasive electrogram reconstruction performance.

Conclusion: VAE-based generative modeling shows potential to alleviate data scarcity and enhance deep learning-based electrocardiographic imaging pipelines for atrial fibrillation assessment.

Abstract: Atrial fibrillation (AF) is the most prevalent sustained cardiac arrhythmia, and its clinical assessment requires accurate characterization of atrial electrical activity. Noninvasive electrocardiographic imaging (ECGI) combined with deep learning (DL) approaches for estimating intracardiac electrograms (EGMs) from body surface potentials (BSPMs) has shown promise, but progress is hindered by the limited availability of paired BSPM-EGM datasets. To address this limitation, we investigate variational autoencoders (VAEs) for the generation of synthetic multichannel atrial EGMs. Two models are proposed: a sinus rhythm-specific VAE (VAE-S) and a class-conditioned VAE (VAE-C) trained on both sinus rhythm and AF signals. Generated EGMs are evaluated using morphological, spectral, and distributional similarity metrics. VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. As a proof of concept, the generated EGMs are used for data augmentation in a downstream noninvasive EGM reconstruction task, where moderate augmentation improves estimation performance. These results demonstrate the potential of VAE-based generative modeling to alleviate data scarcity and enhance deep learning-based ECGI pipelines.
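A compact PyTorch sketch of the class-conditioned VAE idea (VAE-C): the rhythm label is concatenated to both the encoder input and the decoder latent, so sampling can be steered to sinus rhythm or AF. Channel count, signal length, and layer sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, in_dim=8 * 256, n_classes=2, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim + n_classes, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, latent_dim), nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x, y_onehot):
        h = self.enc(torch.cat([x, y_onehot], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        x_hat = self.dec(torch.cat([z, y_onehot], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return F.mse_loss(x_hat, x) + kl                          # ELBO-style training loss

x = torch.randn(4, 8 * 256)                        # 4 EGMs, 8 channels x 256 samples, flattened
y = F.one_hot(torch.tensor([0, 1, 0, 1]), num_classes=2).float()
print(ConditionalVAE()(x, y).item())
```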

[375] Counterfactual Explanations for Time Series Should be Human-Centered and Temporally Coherent in Interventions

Emmanuel C. Chukwu, Rianne M. Schouten, Monique Tabak, Mykola Pechenizkiy

Main category: cs.LG

TL;DR: Current counterfactual methods for time series classification are inadequate for clinical settings due to static assumptions, minimal perturbation focus, and lack of causal plausibility and temporal coherence. The paper calls for more realistic, actionable interventions aligned with clinical reasoning.

DetailsMotivation: Existing counterfactual explanation methods for time series classification fail to address the needs of clinical recommendation settings, where interventions must be causally plausible, temporally coherent, and aligned with patient-specific dynamics over time.

Method: The paper conducts a robustness analysis of state-of-the-art time series counterfactual methods, examining their sensitivity to stochastic noise and identifying critical gaps in temporal coherence and user-centered design.

Result: Current counterfactual methods are highly sensitive to stochastic noise, making them unreliable for real-world clinical applications where minor measurement variations are inevitable. They lack temporal coherence and fail to consider feasibility or actionability.

Conclusion: The field needs a paradigm shift toward counterfactuals that reflect sustained, goal-directed interventions aligned with clinical reasoning. Future methods must prioritize actionable, purpose-driven interventions that are feasible in real-world contexts with proper evaluation frameworks.

Abstract: Counterfactual explanations are increasingly proposed as interpretable mechanisms to achieve algorithmic recourse. However, current counterfactual techniques for time series classification are predominantly designed with static data assumptions and focus on generating minimal input perturbations to flip model predictions. This paper argues that such approaches are fundamentally insufficient in clinical recommendation settings, where interventions unfold over time and must be causally plausible and temporally coherent. We advocate for a shift towards counterfactuals that reflect sustained, goal-directed interventions aligned with clinical reasoning and patient-specific dynamics. We identify critical gaps in existing methods that limit their practical applicability, specifically, temporal blind spots and the lack of user-centered considerations in both method design and evaluation metrics. To support our position, we conduct a robustness analysis of several state-of-the-art methods for time series and show that the generated counterfactuals are highly sensitive to stochastic noise. This finding highlights their limited reliability in real-world clinical settings, where minor measurement variations are inevitable. We conclude by calling for methods and evaluation frameworks that go beyond mere prediction changes without considering feasibility or actionability. We emphasize the need for actionable, purpose-driven interventions that are feasible in real-world contexts for the users of such applications.

[376] The Instability of Safety: How Random Seeds and Temperature Expose Inconsistent LLM Refusal Behavior

Erik Larsen

Main category: cs.LG

TL;DR: Single-shot safety evaluations of LLMs are unreliable because safety refusal decisions flip 18-28% of the time across different random seeds and temperatures, showing models are stochastic rather than deterministic in safety alignment.

DetailsMotivation: Current safety evaluations assume LLM responses are deterministic and representative of safety alignment, but this assumption hasn't been tested for stability across stochastic sampling variations.

Method: Tested 4 instruction-tuned models on 876 harmful prompts across 20 sampling configurations (4 temperatures × 5 random seeds), measuring decision flips and developing Safety Stability Index (SSI). Used Claude 3.5 Haiku as external judge for validation.

Result: 18-28% of prompts exhibit decision flips across configurations; higher temperatures significantly reduce stability (SSI drops from 0.977 at temp 0.0 to 0.942 at temp 1.0). Models “waver” more on borderline requests (compliance-stability correlation: -0.47 to -0.70). Single-shot evaluation agrees with multi-sample ground truth only 92.4% of the time.

Conclusion: Single-shot safety evaluations are insufficient; evaluation protocols must account for stochastic variation. Recommend using at least 3 samples per prompt for reliable safety assessment.

Abstract: Current safety evaluations of large language models rely on single-shot testing, implicitly assuming that model responses are deterministic and representative of the model’s safety alignment. We challenge this assumption by investigating the stability of safety refusal decisions across random seeds and temperature settings. Testing four instruction-tuned models from three families (Llama 3.1 8B, Qwen 2.5 7B, Qwen 3 8B, Gemma 3 12B) on 876 harmful prompts across 20 different sampling configurations (4 temperatures x 5 random seeds), we find that 18-28% of prompts exhibit decision flips–the model refuses in some configurations but complies in others–depending on the model. Our Safety Stability Index (SSI) reveals that higher temperatures significantly reduce decision stability (Friedman chi-squared = 396.81, p < 0.001), with mean within-temperature SSI dropping from 0.977 at temperature 0.0 to 0.942 at temperature 1.0. We validate our findings across all model families using Claude 3.5 Haiku as a unified external judge, achieving 89.0% inter-judge agreement with our primary Llama 70B judge (Cohen’s kappa = 0.62). Within each model, prompts with higher compliance rates exhibit lower stability (Spearman rho = -0.47 to -0.70, all p < 0.001), indicating that models “waver” more on borderline requests. These findings demonstrate that single-shot safety evaluations are insufficient for reliable safety assessment and that evaluation protocols must account for stochastic variation in model behavior. We show that single-shot evaluation agrees with multi-sample ground truth only 92.4% of the time when pooling across temperatures (94.2-97.7% at fixed temperature depending on setting), and recommend using at least 3 samples per prompt for reliable safety assessment.
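A short sketch of how flip rates and a stability index can be computed from repeated refusal decisions. Here `decisions[i, j]` marks whether prompt i was refused under sampling configuration j; the index below (mean agreement with each prompt's majority decision) is one plausible reading of an SSI-style metric, not necessarily the paper's exact formula.

```python
import numpy as np

def flip_rate(decisions):
    """Fraction of prompts whose refuse/comply decision is not unanimous across configs."""
    return float(np.mean(decisions.min(axis=1) != decisions.max(axis=1)))

def stability_index(decisions):
    """Mean fraction of configurations agreeing with each prompt's majority decision."""
    majority = (decisions.mean(axis=1) >= 0.5).astype(int)
    return float((decisions == majority[:, None]).mean())

rng = np.random.default_rng(0)
decisions = (rng.random((876, 20)) > 0.25).astype(int)   # 876 prompts x 20 configurations
print(flip_rate(decisions), stability_index(decisions))
```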

[377] Residual GRU+MHSA: A Lightweight Hybrid Recurrent Attention Model for Cardiovascular Disease Detection

Tejaswani Dash, Gautam Datla, Anudeep Vurity, Tazeem Ahmad, Mohd Adnan, Saima Rafi, Saisha Patro, Saina Patro

Main category: cs.LG

TL;DR: A lightweight deep learning model combining residual GRU and multi-head self-attention achieves state-of-the-art performance for cardiovascular disease prediction on clinical tabular data, outperforming traditional and modern baselines.

DetailsMotivation: Cardiovascular disease is the leading cause of death worldwide, but existing predictive tools either rely on handcrafted features (traditional methods) or struggle to generalize across noisy clinical data (machine learning methods). There's a need for reliable, efficient predictive tools that can support early intervention in clinical settings.

Method: Proposed Residual GRU with Multi-Head Self-Attention architecture: integrates residual bidirectional gated recurrent units for sequential modeling of feature columns, channel reweighting block, and multi-head self-attention pooling with learnable classification token to capture global context. Evaluated on UCI Heart Disease dataset using 5-fold stratified cross-validation.

Result: Achieves accuracy of 0.861, macro-F1 of 0.860, ROC-AUC of 0.908, and PR-AUC of 0.904, outperforming all baselines (Logistic Regression, Random Forest, SVM, DeepMLP, CNNs, RNNs, Transformers). Ablation studies confirm contributions of residual recurrence, channel gating, and attention pooling. t-SNE visualizations show clearer class separation in learned embeddings.

Conclusion: Lightweight hybrid recurrent and attention-based architectures provide strong balance between accuracy and efficiency for clinical risk prediction, supporting deployment in resource-constrained healthcare settings.

Abstract: Cardiovascular disease (CVD) remains the leading cause of mortality worldwide, underscoring the need for reliable and efficient predictive tools that support early intervention. Traditional diagnostic approaches rely on handcrafted features and clinician expertise, while machine learning methods improve reproducibility but often struggle to generalize across noisy and heterogeneous clinical data. In this work, we propose Residual GRU with Multi-Head Self-Attention, a compact deep learning architecture designed for tabular clinical records. The model integrates residual bidirectional gated recurrent units for sequential modeling of feature columns, a channel reweighting block, and multi-head self-attention pooling with a learnable classification token to capture global context. We evaluate the model on the UCI Heart Disease dataset using 5-fold stratified cross-validation and compare it against classical methods such as Logistic Regression, Random Forest, and Support Vector Machines, as well as modern deep learning baselines including DeepMLP, convolutional networks, recurrent networks, and Transformers. The proposed model achieves an accuracy of 0.861, macro-F1 of 0.860, ROC-AUC of 0.908, and PR-AUC of 0.904, outperforming all baselines. Ablation studies confirm the individual contributions of residual recurrence, channel gating, and attention pooling. t-SNE visualizations further indicate that the learned embeddings exhibit clearer separation between disease and non-disease classes compared to raw features. These results demonstrate that lightweight hybrid recurrent and attention-based architectures provide a strong balance between accuracy and efficiency for clinical risk prediction, supporting deployment in resource-constrained healthcare settings.
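A minimal PyTorch sketch of the described architecture for tabular records: each feature column becomes a token, a residual bidirectional GRU models the token sequence, a gating block reweights channels, and multi-head self-attention pools through a learnable classification token. Dimensions and the exact gating form are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ResidualGRUMHSA(nn.Module):
    def __init__(self, n_features=13, d_model=32, n_heads=4, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                     # scalar feature -> token
        self.gru = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))    # learnable [CLS] token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                      # x: (B, n_features)
        tokens = self.embed(x.unsqueeze(-1))                   # (B, F, d)
        h, _ = self.gru(tokens)
        h = tokens + h                                         # residual recurrence
        h = h * self.gate(h.mean(dim=1, keepdim=True))         # channel reweighting
        seq = torch.cat([self.cls.expand(x.size(0), -1, -1), h], dim=1)
        pooled, _ = self.attn(seq, seq, seq)                   # self-attention pooling
        return self.head(pooled[:, 0])                         # classify from the [CLS] slot

print(ResidualGRUMHSA()(torch.randn(8, 13)).shape)             # torch.Size([8, 2])
```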

[378] Hybrid Iterative Solvers with Geometry-Aware Neural Preconditioners for Parametric PDEs

Youngkyu Lee, Francesc Levrero Florencio, Jay Pathak, George Em Karniadakis

Main category: cs.LG

TL;DR: Geo-DeepONet: A geometry-aware deep operator network that enables accurate operator learning across arbitrary unstructured meshes without retraining, used to develop robust hybrid preconditioned iterative solvers for parametric PDEs.

DetailsMotivation: Classical iterative solvers for parametric PDEs are sensitive to domain geometry and discretization. Previous hybrid solvers combining classical methods with neural operators underperform on geometries not encountered during training.

Method: Introduce Geo-DeepONet that incorporates domain information from finite element discretizations. Use it to develop geometry-aware hybrid preconditioned iterative solvers by coupling Geo-DeepONet with traditional methods like relaxation schemes and Krylov subspace algorithms.

Result: Demonstrate enhanced robustness and efficiency of proposed hybrid solvers through numerical experiments on parametric PDEs posed over diverse unstructured domains for multiple real-world applications.

Conclusion: Geo-DeepONet enables geometry-aware operator learning across arbitrary meshes without retraining, leading to more robust and efficient hybrid solvers for parametric PDEs on diverse geometries.

Abstract: The convergence behavior of classical iterative solvers for parametric partial differential equations (PDEs) is often highly sensitive to the domain and specific discretization of PDEs. Previously, we introduced hybrid solvers by combining the classical solvers with neural operators for a specific geometry [1], but they tend to under-perform in geometries not encountered during training. To address this challenge, we introduce Geo-DeepONet, a geometry-aware deep operator network that incorporates domain information extracted from finite element discretizations. Geo-DeepONet enables accurate operator learning across arbitrary unstructured meshes without requiring retraining. Building on this, we develop a class of geometry-aware hybrid preconditioned iterative solvers by coupling Geo-DeepONet with traditional methods such as relaxation schemes and Krylov subspace algorithms. Through numerical experiments on parametric PDEs posed over diverse unstructured domains, we demonstrate the enhanced robustness and efficiency of the proposed hybrid solvers for multiple real-world applications.
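A small numpy sketch of the hybrid-iteration pattern the paper builds on: interleaving classical relaxation sweeps with a learned correction applied to the residual. The "neural operator" below is a stand-in callable on a toy 1D Poisson system; the actual method conditions a Geo-DeepONet on the mesh geometry and couples it with relaxation or Krylov solvers.

```python
import numpy as np

def hybrid_solve(A, b, neural_op, n_outer=20, n_jacobi=5, omega=0.6):
    """Alternate damped Jacobi sweeps with a surrogate correction of the residual."""
    x = np.zeros_like(b)
    d_inv = 1.0 / np.diag(A)
    for _ in range(n_outer):
        for _ in range(n_jacobi):                   # smooth high-frequency error
            x = x + omega * d_inv * (b - A @ x)
        x = x + neural_op(b - A @ x)                # learned low-frequency correction
    return x

n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)    # 1D Poisson matrix
b = np.ones(n)
surrogate = lambda r: 0.8 * np.linalg.solve(A, r)       # stand-in for an approximate inverse
x = hybrid_solve(A, b, surrogate)
print(np.linalg.norm(b - A @ x))
```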

[379] Hierarchical Persistence Velocity for Network Anomaly Detection: Theory and Applications to Cryptocurrency Markets

Omid Khormali

Main category: cs.LG

TL;DR: OW-HNPV is a new topological data analysis method that uses velocity-based persistence diagrams to detect anomalies in time-varying networks, outperforming existing methods for cryptocurrency price prediction.

DetailsMotivation: Existing topological methods measure cumulative topological presence, but there's a need for velocity-based approaches that track the rate at which topological features appear and disappear in dynamic networks, especially for anomaly detection applications like cryptocurrency monitoring.

Method: Introduces Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), which measures the rate of topological feature appearance/disappearance in persistence diagrams. Uses overlap-based weighting to automatically downweight noise and proves mathematical stability properties for comparing diagrams with different feature types.

Result: Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV achieved up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Outperformed established methods (VAB, persistence landscapes, persistence images) for medium- to long-range forecasting (4-7 days) with most consistent performance across horizons.

Conclusion: Modeling topological velocity is crucial for detecting structural anomalies in dynamic networks. OW-HNPV provides a stable, velocity-based perspective that excels at anomaly detection in time-varying systems like cryptocurrency networks.

Abstract: We introduce the Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a novel topological data analysis method for detecting anomalies in time-varying networks. Unlike existing methods that measure cumulative topological presence, we introduce the first velocity-based perspective on persistence diagrams, measuring the rate at which features appear and disappear, automatically downweighting noise through overlap-based weighting. We also prove that OW-HNPV is mathematically stable. It behaves in a controlled, predictable way, even when comparing persistence diagrams from networks with different feature types. Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV demonstrates superior performance for cryptocurrency anomaly detection, achieving up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Compared with established methods, including Vector of Averaged Bettis (VAB), persistence landscapes, and persistence images, velocity-based summaries excel at medium- to long-range forecasting (4-7 days), with OW-HNPV providing the most consistent and stable performance across prediction horizons. Our results show that modeling topological velocity is crucial for detecting structural anomalies in dynamic networks.

[380] Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

Alessandro Trapasso, Luca Iocchi, Fabio Patrizi

Main category: cs.LG

TL;DR: QR-MAX is a model-based RL algorithm for non-Markovian reward decision processes that factorizes Markovian transitions from non-Markovian rewards via reward machines, achieving PAC convergence with polynomial sample complexity.

DetailsMotivation: Many real-world decision problems depend on entire system histories rather than just final states. While NMRDPs can handle such temporal dependencies, existing approaches lack formal guarantees on optimality and sample efficiency.

Method: QR-MAX uses reward machines to factorize Markovian transition learning from non-Markovian reward handling. For continuous state spaces, Bucket-QR-MAX employs SimHash-based discretization that preserves the factorized structure without manual gridding or function approximation.

Result: QR-MAX is the first model-based RL algorithm for discrete-action NMRDPs with PAC convergence to ε-optimal policies and polynomial sample complexity. Experiments show significant improvement in sample efficiency and robustness compared to state-of-the-art model-based RL approaches.

Conclusion: The factorization approach via reward machines enables formal guarantees for NMRDP learning, addressing long-standing issues of optimality and sample efficiency in temporal-dependency tasks.

Abstract: Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
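A toy sketch of the factorization the algorithm exploits: a reward machine turns a history-dependent reward ("visit A, then B") into a Markovian one on the product state (environment state, machine state). QR-MAX adds R-MAX-style optimistic model learning on top of this construction; the line-world and task below are purely illustrative.

```python
import random

class RewardMachine:
    def __init__(self):
        # transitions[(u, label)] = (next_u, reward); unlisted pairs stay put with reward 0
        self.transitions = {("u0", "A"): ("u1", 0.0),
                            ("u1", "B"): ("u2", 1.0)}   # reward only after A then B
        self.u = "u0"

    def step(self, label):
        self.u, reward = self.transitions.get((self.u, label), (self.u, 0.0))
        return self.u, reward

def run_episode(env_step, labeler, rm, policy, s0, horizon=50):
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy((s, rm.u))            # the policy conditions on the product state
        s = env_step(s, a)
        _, r = rm.step(labeler(s))       # reward now depends on rm.u, not the full history
        total += r
    return total

# toy line-world: states 0..4, positions 1 and 3 emit labels "A" and "B"
labels = {1: "A", 3: "B"}
env_step = lambda s, a: max(0, min(4, s + a))
labeler = lambda s: labels.get(s, "")
policy = lambda product_state: random.choice([-1, 1])
random.seed(0)
print(run_episode(env_step, labeler, RewardMachine(), policy, s0=0))
```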

[381] ParaFormer: A Generalized PageRank Graph Transformer for Graph Representation Learning

Chaohao Yuan, Zhenjie Song, Ercan Engin Kuruoglu, Kangfei Zhao, Yang Liu, Deli Zhao, Hong Cheng, Yu Rong

Main category: cs.LG

TL;DR: ParaFormer addresses over-smoothing in Graph Transformers by using PageRank-enhanced attention as an adaptive-pass filter, improving performance on node and graph classification tasks.

DetailsMotivation: Graph Transformers were introduced to capture global information and address over-smoothing in deep GNNs, but empirical/theoretical analysis shows they actually suffer from severe over-smoothing due to inherent low-pass filtering in global attention.

Method: Proposes PageRank Transformer (ParaFormer) with a PageRank-enhanced attention module designed to mimic deep Transformers’ behavior, functioning as an adaptive-pass filter to mitigate over-smoothing.

Result: ParaFormer achieves consistent performance improvements across both node classification and graph classification tasks on 11 datasets ranging from thousands to millions of nodes.

Conclusion: ParaFormer effectively mitigates over-smoothing in Graph Transformers through adaptive-pass filtering, demonstrating superior performance on various graph learning tasks.

Abstract: Graph Transformers (GTs) have emerged as a promising graph learning tool, leveraging their all-pair connected property to effectively capture global information. To address the over-smoothing problem in deep GNNs, global attention was initially introduced, eliminating the necessity for using deep GNNs. However, through empirical and theoretical analysis, we verify that the introduced global attention exhibits severe over-smoothing, causing node representations to become indistinguishable due to its inherent low-pass filtering. This effect is even stronger than that observed in GNNs. To mitigate this, we propose PageRank Transformer (ParaFormer), which features a PageRank-enhanced attention module designed to mimic the behavior of deep Transformers. We theoretically and empirically demonstrate that ParaFormer mitigates over-smoothing by functioning as an adaptive-pass filter. Experiments show that ParaFormer achieves consistent performance improvements across both node classification and graph classification tasks on 11 datasets ranging from thousands to millions of nodes, validating its efficacy. The supplementary material, including code and appendix, can be found in https://github.com/chaohaoyuan/ParaFormer.
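One way to realize a PageRank-enhanced attention operator, sketched in PyTorch: treat the softmax attention matrix as a row-stochastic transition operator and propagate with a personalized-PageRank (APPNP-style) recursion, whose teleport term keeps node representations anchored to their inputs and counteracts pure low-pass smoothing. This illustrates the idea, not necessarily ParaFormer's exact formulation.

```python
import torch
import torch.nn.functional as F

def pagerank_attention(Q, K, V, alpha=0.15, n_iters=8):
    """Z_{k+1} = (1 - alpha) * P @ Z_k + alpha * V, with P the softmax attention matrix."""
    d = Q.size(-1)
    P = F.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)   # row-stochastic attention
    Z = V
    for _ in range(n_iters):
        Z = (1.0 - alpha) * (P @ Z) + alpha * V                # teleport back to the input features
    return Z

x = torch.randn(1, 100, 32)                                    # 100 nodes, 32-dim features
print(pagerank_attention(x, x, x).shape)                       # torch.Size([1, 100, 32])
```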

[382] gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation

Alban Puech, Matteo Mazzonelli, Celia Cintas, Tamara R. Govindasamy, Mangaliso Mngomezulu, Jonas Weiss, Matteo Baù, Anna Varbella, François Mirallès, Kibaek Kim, Le Xie, Hendrik F. Hamann, Etienne Vos, Thomas Brunschwiler

Main category: cs.LG

TL;DR: gridfm-datakit-v1 is a Python library for generating realistic Power Flow and Optimal Power Flow datasets with diverse scenarios, including violations of operating limits and varying generator costs, scaling to large grids up to 10,000 buses.

DetailsMotivation: Existing PF/OPF datasets have limitations: lack realistic stochastic load/topology perturbations, restrict PF data to OPF-feasible points only, and use fixed generator cost functions, which hinders ML solver generalization.

Method: Combines global load scaling from real-world profiles with localized noise, supports arbitrary N-k topology perturbations, generates PF samples beyond operating limits, and produces OPF data with varying generator costs.

Result: Creates diverse yet realistic datasets that address the three main challenges, scales efficiently to large grids (up to 10,000 buses), and is compared with existing tools like OPFData, OPF-Learn, PGLearn, and PFΔ.

Conclusion: gridfm-datakit-v1 provides a comprehensive solution for generating realistic PF/OPF datasets to improve ML solver training, available as open-source software under Apache 2.0 license.

Abstract: We introduce gridfm-datakit-v1, a Python library for generating realistic and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets for training Machine Learning (ML) solvers. Existing datasets and libraries face three main challenges: (1) lack of realistic stochastic load and topology perturbations, limiting scenario diversity; (2) PF datasets are restricted to OPF-feasible points, hindering generalization of ML solvers to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets use fixed generator cost functions, limiting generalization across varying costs. gridfm-datakit addresses these challenges by: (1) combining global load scaling from real-world profiles with localized noise and supporting arbitrary N-k topology perturbations to create diverse yet realistic datasets; (2) generating PF samples beyond operating limits; and (3) producing OPF data with varying generator costs. It also scales efficiently to large grids (up to 10,000 buses). Comparisons with OPFData, OPF-Learn, PGLearn, and PF$\Delta$ are provided. Available on GitHub at https://github.com/gridfm/gridfm-datakit under Apache 2.0 and via pip install gridfm-datakit.
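The two perturbation ideas in point (1) are easy to illustrate without the library itself: scale all bus loads by a global profile factor, add localized multiplicative noise per bus, and drop k branches for an N-k topology scenario. This sketch does not use gridfm-datakit's API; the function names and noise distribution are assumptions.

```python
import numpy as np

def perturb_loads(base_load, profile_scale, local_sigma=0.05, rng=None):
    """Global scaling from a load profile plus localized log-normal noise per bus."""
    rng = rng or np.random.default_rng()
    local = rng.lognormal(mean=0.0, sigma=local_sigma, size=base_load.shape)
    return base_load * profile_scale * local

def nk_topology(branches, k, rng=None):
    """Return the branch list with k randomly selected branches removed (an N-k scenario)."""
    rng = rng or np.random.default_rng()
    dropped = set(rng.choice(len(branches), size=k, replace=False).tolist())
    return [b for i, b in enumerate(branches) if i not in dropped]

rng = np.random.default_rng(0)
loads = np.array([90.0, 120.0, 60.0])                      # MW at three buses
print(perturb_loads(loads, profile_scale=1.12, rng=rng))
print(nk_topology([(0, 1), (1, 2), (0, 2), (2, 3)], k=1, rng=rng))
```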

[383] Beyond Lipschitz Continuity and Monotonicity: Fractal and Chaotic Activation Functions in Echo State Networks

Rae Chipera, Jenny Du, Irene Tsapara

Main category: cs.LG

TL;DR: Non-smooth activation functions (chaotic, stochastic, fractal) in reservoir computing can outperform traditional smooth functions, with Cantor function showing 2.6x faster convergence and 10x higher spectral radius tolerance.

DetailsMotivation: Traditional reservoir computing relies on smooth activation functions, limiting applications in extreme conditions (defense, disaster response, pharmaceuticals). Need robust operation under extreme conditions requires exploring non-smooth alternatives.

Method: Systematic investigation of non-smooth activation functions in echo state networks through comprehensive parameter sweeps across 36,610 reservoir configurations. Introduced theoretical framework for quantized functions with Degenerate Echo State Property (d-ESP) and critical crowding ratio Q=N/k.

Result: Several non-smooth functions maintain Echo State Property and outperform smooth activations. Cantor function achieves ESP up to spectral radii ~10 (10x beyond typical bounds) with 2.6x faster convergence than tanh/ReLU. Preprocessing topology, not continuity, determines stability.

Conclusion: Non-smooth activation functions can be superior to smooth ones in reservoir computing, challenging traditional design assumptions. Preprocessing topology is key to stability, but mechanism behind fractal function performance remains unexplained, indicating gaps in understanding geometric properties’ influence on dynamics.

Abstract: Contemporary reservoir computing relies heavily on smooth, globally Lipschitz continuous activation functions, limiting applications in defense, disaster response, and pharmaceutical modeling where robust operation under extreme conditions is critical. We systematically investigate non-smooth activation functions, including chaotic, stochastic, and fractal variants, in echo state networks. Through comprehensive parameter sweeps across 36,610 reservoir configurations, we demonstrate that several non-smooth functions not only maintain the Echo State Property (ESP) but outperform traditional smooth activations in convergence speed and spectral radius tolerance. Notably, the Cantor function (continuous everywhere and flat almost everywhere) maintains ESP-consistent behavior up to spectral radii of rho ~ 10, an order of magnitude beyond typical bounds for smooth functions, while achieving 2.6x faster convergence than tanh and ReLU. We introduce a theoretical framework for quantized activation functions, defining a Degenerate Echo State Property (d-ESP) that captures stability for discrete-output functions and proving that d-ESP implies traditional ESP. We identify a critical crowding ratio Q=N/k (reservoir size / quantization levels) that predicts failure thresholds for discrete activations. Our analysis reveals that preprocessing topology, rather than continuity per se, determines stability: monotone, compressive preprocessing maintains ESP across scales, while dispersive or discontinuous preprocessing triggers sharp failures. While our findings challenge assumptions about activation function design in reservoir computing, the mechanism underlying the exceptional performance of certain fractal functions remains unexplained, suggesting fundamental gaps in our understanding of how geometric properties of activation functions influence reservoir dynamics.
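A small numpy sketch of the unusual ingredient here: the Cantor function used as a reservoir activation, applied after a monotone squash to [0, 1], inside a standard echo state update with the recurrent matrix rescaled to a chosen spectral radius. Reservoir size, squash, and input signal are illustrative.

```python
import numpy as np

def cantor(x, depth=30):
    """Ternary-digit construction of the Cantor (devil's staircase) function on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    total, factor = 0.0, 0.5
    for _ in range(depth):
        x *= 3.0
        digit = int(x)
        x -= digit
        if digit == 1:                    # a ternary digit of 1 terminates the construction
            return total + factor
        total += factor * (digit // 2)    # digit 2 contributes the current binary weight
        factor *= 0.5
    return total

def esn_step(state, u, W, W_in):
    """One reservoir update: squash pre-activations to [0, 1], then apply the Cantor function."""
    pre = 1.0 / (1.0 + np.exp(-(W @ state + W_in @ u)))
    return np.array([cantor(v) for v in pre])

rng = np.random.default_rng(0)
N, rho = 100, 5.0
W = rng.normal(size=(N, N))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to the target spectral radius
W_in, state = rng.normal(size=(N, 1)), np.zeros(N)
for t in range(20):
    state = esn_step(state, np.array([np.sin(0.3 * t)]), W, W_in)
print(state[:5])
```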

[384] Bias-Variance Trade-off for Clipped Stochastic First-Order Methods: From Bounded Variance to Infinite Mean

Chuan He

Main category: cs.LG

TL;DR: Clipped stochastic first-order methods achieve improved complexity guarantees for heavy-tailed noise across full tail index range α∈(0,2], including infinite mean cases rarely studied before.

DetailsMotivation: Existing stochastic optimization theory for heavy-tailed noise only covers α∈(1,2] (finite mean), with complexity bounds diverging as α→1. The infinite mean regime (α∈(0,1]) has been scarcely studied despite practical relevance.

Method: Novel analysis of bias-variance trade-off in gradient clipping, showing that when noise tail symmetry is controlled, clipped SFOMs achieve improved complexity guarantees. The analysis is straightforward and can be combined with classical light-tailed noise analyses.

Result: Unified complexity guarantees for clipped SFOMs across full tail index range α∈(0,2], including regimes from bounded variance to infinite mean noise. Numerical experiments validate theoretical findings.

Conclusion: Clipped stochastic first-order methods provide robust optimization guarantees for heavy-tailed noise across the complete spectrum of tail indices, addressing previously unstudied infinite mean cases while offering practical analysis techniques.

Abstract: Stochastic optimization is fundamental to modern machine learning. Recent research has extended the study of stochastic first-order methods (SFOMs) from light-tailed to heavy-tailed noise, which frequently arises in practice, with clipping emerging as a key technique for controlling heavy-tailed gradients. Extensive theoretical advances have further shown that the oracle complexity of SFOMs depends on the tail index $\alpha$ of the noise. Nonetheless, existing complexity results often cover only the case $\alpha\in (1,2]$, that is, the regime where the noise has a finite mean, while the complexity bounds tend to infinity as $\alpha$ approaches $1$. This paper tackles the general case of noise with tail index $\alpha\in(0,2]$, covering regimes ranging from noise with bounded variance to noise with an infinite mean, where the latter case has been scarcely studied. Through a novel analysis of the bias-variance trade-off in gradient clipping, we show that when a symmetry measure of the noise tail is controlled, clipped SFOMs achieve improved complexity guarantees in the presence of heavy-tailed noise for any tail index $\alpha\in (0,2]$. Our analysis of the bias-variance trade-off not only yields new unified complexity guarantees for clipped SFOMs across this full range of tail indices, but is also straightforward to apply and can be combined with classical analyses under light-tailed noise to establish oracle complexity guarantees under heavy-tailed noise. Finally, numerical experiments validate our theoretical findings.
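The clipping operator at the center of the analysis is one line: rescale the stochastic gradient so its norm never exceeds a threshold, trading a small bias for bounded variance. The toy objective and Cauchy gradient noise (a heavy-tailed case without a finite mean) below are illustrative only.

```python
import numpy as np

def clipped_sgd_step(w, grad_fn, lr=0.01, clip=1.0):
    """One clipped stochastic gradient step: g <- g * min(1, clip / ||g||)."""
    g = grad_fn(w)
    g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
    return w - lr * g

rng = np.random.default_rng(0)
grad_fn = lambda w: 2.0 * w + rng.standard_cauchy(size=w.shape)  # heavy-tailed noise
w = np.ones(10)
for _ in range(2000):
    w = clipped_sgd_step(w, grad_fn)
print(np.linalg.norm(w))
```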

[385] Early Warning Index for Patient Deteriorations in Hospitals

Dimitris Bertsimas, Yu Ma, Kimberly Villalobos Carballo, Gagan Singh, Michal Laskowski, Jeff Mather, Dan Kombert, Howard Haronian

Main category: cs.LG

TL;DR: EWI is a multimodal ML framework that predicts ICU admission, emergency response, and mortality risk using EHR data, with human-in-the-loop design and SHAP explanations for clinical interpretability.

DetailsMotivation: Hospitals lack automated systems to leverage heterogeneous clinical data for forecasting critical events. Early identification of at-risk patients is essential for quality care and physician management, but inconsistent data formats make accurate risk assessment challenging.

Method: Developed Early Warning Index (EWI) - a multimodal ML framework with human-in-the-loop design where clinicians help set alert thresholds and interpret outputs. Uses SHAP for explainable AI to highlight clinical/operational risk factors. Deployed in hospital dashboard with three risk tiers, automatically extracting features from structured/unstructured EHR data.

Result: Achieved a C-statistic of 0.796 on a dataset of 18,633 unique patients at a large U.S. hospital. Currently used as a triage tool for proactive patient management. Saves physician time by automatically sorting patients by risk level.

Conclusion: EWI enables data-informed adjustments to caregiver scheduling and resource allocation, helping avert downstream complications like costly procedures and high readmission rates while improving overall patient flow through proactive risk management.

Abstract: Hospitals lack automated systems to harness the growing volume of heterogeneous clinical and operational data to effectively forecast critical events. Early identification of patients at risk for deterioration is essential not only for patient care quality monitoring but also for physician care management. However, translating varied data streams into accurate and interpretable risk assessments poses significant challenges due to inconsistent data formats. We develop a multimodal machine learning framework, the Early Warning Index (EWI), to predict the aggregate risk of ICU admission, emergency response team dispatch, and mortality. Key to EWI’s design is a human-in-the-loop process: clinicians help determine alert thresholds and interpret model outputs, which are enhanced by explainable outputs using Shapley Additive exPlanations (SHAP) to highlight clinical and operational factors (e.g., scheduled surgeries, ward census) driving each patient’s risk. We deploy EWI in a hospital dashboard that stratifies patients into three risk tiers. Using a dataset of 18,633 unique patients at a large U.S. hospital, our approach automatically extracts features from both structured and unstructured electronic health record (EHR) data and achieves C-statistics of 0.796. It is currently used as a triage tool for proactively managing at-risk patients. The proposed approach saves physicians valuable time by automatically sorting patients of varying risk levels, allowing them to concentrate on patient care rather than sifting through complex EHR data. By further pinpointing specific risk drivers, the proposed model provides data-informed adjustments to caregiver scheduling and allocation of critical resources. As a result, clinicians and administrators can avert downstream complications, including costly procedures or high readmission rates and improve overall patient flow.

[386] Online Multi-modal Root Cause Identification in Microservice Systems

Lecheng Zheng, Zhengzhang Chen, Haifeng Chen

Main category: cs.LG

TL;DR: OCEAN is an online multi-modal causal structure learning method for root cause localization in microservice systems that handles complex multi-modal data interactions in real-time.

DetailsMotivation: Traditional RCA methods are limited to offline applications due to high computational demands, and existing online RCA methods only handle single-modal data, overlooking complex interactions in multi-modal microservice systems.

Method: OCEAN uses dilated convolutional neural networks for long-term temporal dependencies, graph neural networks for causal relationships, multi-factor attention mechanism for metric/log analysis, and contrastive mutual information maximization-based graph fusion for multi-modal integration.

Result: Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of the proposed OCEAN method for online root cause localization.

Conclusion: OCEAN provides an effective online solution for multi-modal root cause analysis in microservice systems, overcoming limitations of traditional offline and single-modal approaches.

Abstract: Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN employs a dilated convolutional neural network to capture long-term temporal dependencies and graph neural networks to learn causal relationships among system entities and key performance indicators. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method.

[387] Text-guided multi-property molecular optimization with a diffusion language model

Yida Xiong, Kun Li, Jiameng Chen, Hongzhi Zhang, Di Lin, Yan Che, Wenbin Hu

Main category: cs.LG

TL;DR: TransDLM: A text-guided multi-property molecular optimization method using transformer-based diffusion language model that leverages chemical nomenclature as semantic representations to mitigate error propagation in drug discovery.

DetailsMotivation: Existing molecular optimization approaches rely on external property predictors that introduce errors and noise due to approximation, leading to discrepancy accumulation, reduced generalization, and suboptimal molecular candidates in drug discovery.

Method: Proposes TransDLM (transformer-based diffusion language model) that uses standardized chemical nomenclature as semantic representations of molecules and embeds property requirements into textual descriptions to mitigate error propagation during diffusion process.

Result: TransDLM surpasses state-of-the-art methods in maintaining molecular structural similarity and enhancing chemical properties on benchmark datasets, with a successful case study demonstrating practical problem-solving ability.

Conclusion: The text-guided approach effectively integrates diverse information sources to balance structural retention and property enhancement, offering a robust solution for molecular optimization in drug discovery.

Abstract: Molecular optimization (MO) is a crucial stage in drug discovery in which task-oriented generated molecules are optimized to meet practical industrial requirements. Existing mainstream MO approaches primarily utilize external property predictors to guide iterative property optimization. However, learning all molecular samples in the vast chemical space is unrealistic for predictors. As a result, errors and noise are inevitably introduced during property prediction due to the nature of approximation. This leads to discrepancy accumulation, generalization reduction and suboptimal molecular candidates. In this paper, we propose a text-guided multi-property molecular optimization method utilizing transformer-based diffusion language model (TransDLM). TransDLM leverages standardized chemical nomenclature as semantic representations of molecules and implicitly embeds property requirements into textual descriptions, thereby mitigating error propagation during diffusion process. By fusing physically and chemically detailed textual semantics with specialized molecular representations, TransDLM effectively integrates diverse information sources to guide precise optimization, which enhances the model’s ability to balance structural retention and property enhancement. Additionally, the success of a case study further demonstrates TransDLM’s ability to solve practical problems. Experimentally, our approach surpasses state-of-the-art methods in maintaining molecular structural similarity and enhancing chemical properties on the benchmark dataset.

[388] Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting

Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Gang Chen, Huan Li

Main category: cs.LG

TL;DR: SCAM introduces a self-supervised approach to re-label time series datasets by constructing candidate datasets during reconstruction network optimization, using intermediates as pseudo labels to improve generalization for any predictor.

DetailsMotivation: Existing time series forecasting models rely heavily on high-quality data and insufficiently exploit all available data. There's a need for better data utilization and generalization improvement in TSF models.

Method: Proposes Self-Correction with Adaptive Mask (SCAM) which discards overfitted components and selectively replaces them with pseudo labels generated from reconstructions. Also incorporates Spectral Norm Regularization (SNR) to suppress overfitting from a loss landscape perspective.

Result: Experiments on eleven real-world datasets demonstrate that SCAM consistently improves the performance of various backbone models.

Conclusion: This work offers a new perspective on constructing datasets and enhancing the generalization of TSF models through self-supervised learning.

Abstract: Time Series Forecasting (TSF) is a crucial task in various domains, yet existing TSF models rely heavily on high-quality data and insufficiently exploit all available data. This paper explores a novel self-supervised approach to re-label time series datasets by inherently constructing candidate datasets. During the optimization of a simple reconstruction network, intermediates are used as pseudo labels in a self-supervised paradigm, improving generalization for any predictor. We introduce the Self-Correction with Adaptive Mask (SCAM), which discards overfitted components and selectively replaces them with pseudo labels generated from reconstructions. Additionally, we incorporate Spectral Norm Regularization (SNR) to further suppress overfitting from a loss landscape perspective. Our experiments on eleven real-world datasets demonstrate that SCAM consistently improves the performance of various backbone models. This work offers a new perspective on constructing datasets and enhancing the generalization of TSF models through self-supervised learning. The code is available at https://github.com/SuDIS-ZJU/SCAM.

[389] Theoretical Guarantees of Learning Ensembling Strategies with Applications to Time Series Forecasting

Hilaf Hasson, Danielle C. Maddix, Yuyang Wang, Gaurav Gupta, Youngsuk Park

Main category: cs.LG

TL;DR: The paper analyzes stacked generalization (ensemble methods), proving theoretical guarantees for cross-validated stacking and proposing a novel stacking approach for probabilistic forecasting with adaptive ensemble weights.

DetailsMotivation: Stacking is widely used in practice for ensemble learning but lacks strong theoretical foundations. The authors aim to bridge this gap by providing theoretical guarantees for stacked generalization and developing improved stacking methods for probabilistic forecasting applications.

Method: The authors prove a theoretical result showing that cross-validated stacked generalization performs nearly as well as the oracle best. They then propose a family of stacked generalizations for probabilistic forecasting where ensemble weights can vary across items, timestamps, and quantiles with different sensitivity levels.

Result: Theoretical proof establishes that cross-validated stacking doesn’t perform “much worse” than the oracle best. Experimental results demonstrate performance gains from the proposed adaptive stacking method for probabilistic forecasting.

Conclusion: The paper provides important theoretical foundations for stacked generalization while also developing a practical, improved stacking approach for probabilistic forecasting that outperforms existing methods.

Abstract: Ensembling is among the most popular tools in machine learning (ML) due to its effectiveness in minimizing variance and thus improving generalization. Most ensembling methods for black-box base learners fall under the umbrella of “stacked generalization,” namely training an ML algorithm that takes the inferences from the base learners as input. While stacking has been widely applied in practice, its theoretical properties are poorly understood. In this paper, we prove a novel result, showing that choosing the best stacked generalization from a (finite or finite-dimensional) family of stacked generalizations based on cross-validated performance does not perform “much worse” than the oracle best. Our result strengthens and significantly extends the results in Van der Laan et al. (2007). Inspired by the theoretical analysis, we further propose a particular family of stacked generalizations in the context of probabilistic forecasting, each one with a different sensitivity for how much the ensemble weights are allowed to vary across items, timestamps in the forecast horizon, and quantiles. Experimental results demonstrate the performance gain of the proposed method.
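
As a concrete illustration of choosing a stacked generalization by cross-validated performance, the sketch below builds a small family of stacks with scikit-learn and keeps the one with the best cross-validation score; the base learners, meta-learners, and synthetic regression data are illustrative stand-ins, not the paper's probabilistic forecasting setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
base_learners = [("rf", RandomForestRegressor(random_state=0)),
                 ("gb", GradientBoostingRegressor(random_state=0))]

# A small family of stacked generalizations (here, different meta-learners).
# The "oracle best" is whichever truly generalizes best; in practice we can
# only pick the candidate with the best cross-validated score.
candidates = {
    "linear_meta": StackingRegressor(estimators=base_learners,
                                     final_estimator=LinearRegression()),
    "ridge_meta": StackingRegressor(estimators=base_learners,
                                    final_estimator=Ridge(alpha=1.0)),
}
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in candidates.items()}
best = max(cv_scores, key=cv_scores.get)
print(best, cv_scores)
```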

[390] Group-robust Machine Unlearning

Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini

Main category: cs.LG

TL;DR: This paper addresses fairness issues in machine unlearning when forget sets are non-uniformly distributed across groups, proposing group-robust machine unlearning with a new method called MIU that preserves fairness while removing data influence.

DetailsMotivation: Previous machine unlearning approaches assume uniformly distributed forget sets, but when data to unlearn is dominant in specific groups (like ethnicity or gender), performance degrades for those groups, creating fairness issues that need to be addressed.

Method: The paper introduces group-robust machine unlearning and presents two approaches: 1) a simple exact unlearning strategy using sample distribution reweighting, and 2) MIU (Mutual Information-aware Machine Unlearning) - the first approximate unlearning method that minimizes mutual information between model features and group information while using reweighting and mutual information calibration.

Result: Experiments on three datasets show that MIU outperforms standard methods, achieving effective unlearning without compromising model robustness and fairness, successfully reducing performance degradation in dominant groups of the forget set.

Conclusion: The work formalizes the problem of non-uniformly distributed forget sets in machine unlearning and provides effective solutions that maintain fairness while removing data influence, with MIU representing a significant advancement for group-robust approximate unlearning.

Abstract: Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly distributed from all training datapoints. However, if the data to unlearn is dominant in one group (e.g., ethnicity, gender), we empirically show that performance for this group degrades, leading to fairness issues. To perform unlearning while preserving fairness, this work addresses the overlooked problem of non-uniformly distributed forget sets, which we refer to as group-robust machine unlearning. We formalize the problem and present a simple and effective exact unlearning strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness. Source code available at https://github.com/tdemin16/group-robust_machine_unlearning
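
The summary does not spell out the reweighting scheme, so the sketch below shows one natural instantiation of sample distribution reweighting for the exact-unlearning strategy: after removing the forget set, retained samples are weighted so that each group's effective share matches its share in the original data. The function and its arguments are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def group_reweights(groups_original, groups_retain):
    """Per-sample weights for the retain set so that, after the forget set is
    removed, each group's effective share matches its share in the original
    training data. One plausible reading of 'sample distribution reweighting';
    the paper's exact scheme may differ."""
    groups_original = np.asarray(groups_original)
    groups_retain = np.asarray(groups_retain)
    weights = np.ones(len(groups_retain))
    for g in np.unique(groups_original):
        p_orig = np.mean(groups_original == g)   # share before unlearning
        p_ret = np.mean(groups_retain == g)      # share after removing forget set
        if p_ret > 0:
            weights[groups_retain == g] = p_orig / p_ret  # up-weight depleted groups
    return weights

# Example: group 1 dominates the forget set, so it is up-weighted on retraining.
orig = [0] * 80 + [1] * 20
retain = [0] * 78 + [1] * 5
w = group_reweights(orig, retain)
```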

[391] I-Diff: Structural Regularization for High-Fidelity Diffusion Models

Shakthi Perera, Dilum Fernando, H. L. P. Malshan, H. M. P. S. Madushan, Roshan Godaliyadda, M. P. B. Ekanayake, Dhananjaya Jayasundara, Roshan Ragel

Main category: cs.LG

TL;DR: I-Diff improves DDPMs by adding a structural regularizer that preserves data distribution fidelity, achieving significant fidelity gains across multiple models and datasets.

DetailsMotivation: While DDPMs have advanced generative AI, enhancing fidelity without compromising semantic content remains challenging. Current approaches haven't effectively integrated structural information about data distributions into the DDPM framework.

Method: I-Diff introduces a carefully designed regularizer that enables DDPMs to encode structural information, preserving inherent data distribution fidelity. The method is model-agnostic and tested across DDPM, Improved DDPM, and Latent Diffusion Model.

Result: Significant fidelity improvements across multiple datasets: Density and Precision increased by 10% and 37% respectively on CIFAR-100. Consistent improvements across different models demonstrate the method’s effectiveness and model-agnostic nature.

Conclusion: I-Diff successfully addresses the fidelity challenge in DDPMs by incorporating structural information through a novel regularizer, offering a model-agnostic solution that significantly enhances generative fidelity across various diffusion models.

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have significantly advanced generative AI, achieving impressive results in high-quality image and data generation. However, enhancing fidelity without compromising semantic content remains a key challenge in the field. Recent diffusion research in multiple disciplines has introduced objectives and architectural refinements that tighten the match between generated and real data distributions, yielding higher fidelity than earlier generative frameworks. Multi-stage architectures, physics-guided modeling, semantic conditioning, and rarity-aware generation are some of the explored works to achieve this task. However, the integration of structural information of the data distribution into DDPM has largely been unexplored. The conventional DDPM framework relies solely on the $L^2$ norm between the additive and predicted noise to generate new data distributions. We introduce I-Diff, an improved version of DDPM that incorporates a carefully designed regularizer, effectively enabling the model to encode structural information, thereby preserving the inherent fidelity of the data distribution. The proposed approach is validated through extensive experiments on DDPM, Improved DDPM and Latent Diffusion Model across multiple datasets. Empirical results demonstrate significant improvements in fidelity (Density and Precision increase 10% and 37% in CIFAR-100 dataset respectively) across other tested datasets. These results highlight the effectiveness of our method in enhancing the fidelity of the generated data. Notably, improvements across different models highlight the model-agnostic nature of our proposed method in the wider field of generative AI.

[392] Unsupervised Representation Learning from Sparse Transformation Analysis

Yue Song, Thomas Anderson Keller, Yisong Yue, Pietro Perona, Max Welling

Main category: cs.LG

TL;DR: The paper proposes learning representations from sequence data by factorizing latent transformations into sparse components using a probability flow model with rotational and potential flow fields.

DetailsMotivation: Existing representation learning approaches use principles like coding efficiency, statistical independence, causality, controllability, or symmetry. The authors aim to learn representations that capture not just independent factors but also independent transformation primitives.

Method: Input data is encoded as latent distributions, transformed using a probability flow model decomposed into rotational (divergence-free) and potential flow (curl-free) vector fields. A sparsity prior ensures only few fields are active at any time. Training uses standard variational objective.

Result: The model achieves state-of-the-art performance in both data likelihood and unsupervised approximate equivariance errors on datasets composed of sequence transformations.

Conclusion: The approach learns a new form of disentangled representations where inputs are represented by combinations of independent factors AND independent transformation primitives, which can be interpreted as learning approximately equivariant representations.

Abstract: There is a vast literature on representation learning based on principles such as coding efficiency, statistical independence, causality, controllability, or symmetry. In this paper we propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components. Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model, before being decoded to predict a future input state. The flow model is decomposed into a number of rotational (divergence-free) vector fields and a number of potential flow (curl-free) fields. Our sparsity prior encourages only a small number of these fields to be active at any instant and infers the speed with which the probability flows along these fields. Training this model is completely unsupervised using a standard variational objective and results in a new form of disentangled representations where the input is not only represented by a combination of independent factors, but also by a combination of independent transformation primitives given by the learned flow fields. When viewing the transformations as symmetries one may interpret this as learning approximately equivariant representations. Empirically we demonstrate that this model achieves state of the art in terms of both data likelihood and unsupervised approximate equivariance errors on datasets composed of sequence transformations.

[393] Universal Approximation with Softmax Attention

Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu

Main category: cs.LG

TL;DR: Self-attention layers (with linear transformations) are universal approximators for sequence-to-sequence functions, and can approximate ReLU-like functions, enabling attention-only Transformers without feed-forward networks.

DetailsMotivation: Prior works rely on feed-forward networks to establish universal approximation in Transformers. This paper aims to show that self-attention layers alone (without feed-forward networks) can achieve universal approximation for sequence-to-sequence functions.

Method: Uses a new interpolation-based method to analyze attention’s internal mechanism, showing self-attention can approximate a generalized version of ReLU to arbitrary precision. This enables proving that two-layer multi-head attention suffices as a universal approximator.

Result: Proves that: (i) two-layer self-attention and (ii) one-layer self-attention followed by softmax are universal approximators for continuous sequence-to-sequence functions on compact domains. Also shows attention-only layers can approximate various statistical models in-context.

Conclusion: Self-attention layers alone are sufficient for universal approximation in Transformers, challenging the necessity of feed-forward networks. The interpolation-based analysis technique provides new insights into attention’s capabilities and has independent interest.

Abstract: We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention’s internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that, (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.

[394] Transparent Networks for Multivariate Time Series

Minkyu Kim, Suan Lee, Jinho Kim

Main category: cs.LG

TL;DR: GATSM is a transparent neural network model for time series that combines feature networks with a transparent temporal module, achieving performance comparable to black-box models while maintaining interpretability.

DetailsMotivation: Despite growing interest in transparent models for high-stakes domains, there's a lack of transparent models for time series data, which is prevalent in real-world applications.

Method: Proposes Generalized Additive Time Series Model (GATSM) with two components: 1) independent feature networks to learn feature representations, and 2) a transparent temporal module to learn temporal patterns across different time steps using those feature representations.

Result: GATSM significantly outperforms existing generalized additive models and achieves comparable performance to black-box time series models like RNNs and Transformers. It also discovers interesting patterns in time series data.

Conclusion: GATSM effectively bridges the gap between transparency and performance in time series modeling, offering an interpretable alternative to black-box models while maintaining competitive accuracy.

Abstract: Transparent models, which provide inherently interpretable predictions, are receiving significant attention in high-stakes domains. However, despite much real-world data being collected as time series, there is a lack of studies on transparent time series models. To address this gap, we propose a novel transparent neural network model for time series called Generalized Additive Time Series Model (GATSM). GATSM consists of two parts: 1) independent feature networks to learn feature representations, and 2) a transparent temporal module to learn temporal patterns across different time steps using the feature representations. This structure allows GATSM to effectively capture temporal patterns and handle varying-length time series while preserving transparency. Empirical experiments show that GATSM significantly outperforms existing generalized additive models and achieves comparable performance to black-box time series models, such as recurrent neural networks and Transformer. In addition, we demonstrate that GATSM finds interesting patterns in time series.
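
A rough PyTorch sketch of the additive structure described above: one small network per feature plus a simple, inspectable temporal weighting over time steps. The concrete temporal module here (a learned softmax over a fixed-length window) and all sizes are illustrative stand-ins rather than the authors' design.

```python
import torch
import torch.nn as nn

class GATSMSketch(nn.Module):
    """Additive-by-feature time series model: each feature gets its own small
    network, and a transparent temporal weighting aggregates per-feature
    contributions over time. Illustrative only."""
    def __init__(self, n_features, seq_len, hidden=16):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
             for _ in range(n_features)]
        )
        self.time_logits = nn.Parameter(torch.zeros(seq_len))  # inspectable temporal weights

    def forward(self, x):                        # x: (batch, seq_len, n_features)
        contribs = torch.stack(
            [net(x[..., j:j + 1]).squeeze(-1) for j, net in enumerate(self.feature_nets)],
            dim=-1)                              # (batch, seq_len, n_features)
        time_w = torch.softmax(self.time_logits, dim=0)
        # The prediction is an additive sum of per-feature, per-time contributions,
        # so each term can be read off directly for interpretation.
        return (contribs * time_w[None, :, None]).sum(dim=(1, 2))

model = GATSMSketch(n_features=4, seq_len=20)
y_hat = model(torch.randn(8, 20, 4))             # shape: (8,)
```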

[395] Shared DIFF Transformer

Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Li Shi, Wenge Que

Main category: cs.LG

TL;DR: Shared DIFF Transformer improves upon DIFF Transformer by introducing a shared base matrix with low-rank updates, reducing parameter redundancy while maintaining noise suppression for better performance in long-sequence tasks.

DetailsMotivation: DIFF Transformer's independent signal generation leads to parameter redundancy and suboptimal information utilization, motivating a more efficient differential attention mechanism design.

Method: Introduces Shared DIFF Transformer with a shared base matrix to model global patterns and low-rank updates for task-specific flexibility, reducing parameter redundancy while retaining noise suppression.

Result: Outperforms DIFF Transformer in long-sequence modeling, key information retrieval, and in-context learning tasks while being more parameter-efficient.

Conclusion: Provides an efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures with better parameter utilization.

Abstract: DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method achieves better performance in tasks such as long-sequence modeling, key information retrieval, and in-context learning. Our work provides a novel and efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures.
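
To illustrate the shared-base-plus-low-rank idea, here is a simplified single-head sketch of differential attention in which both branches derive their query projections from one shared matrix plus branch-specific low-rank updates; which projections are shared, how lambda is set, and the multi-head details are assumptions for illustration rather than the authors' exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_diff_attention(x, W_base, A1, B1, A2, B2, W_k, W_v, lam=0.5):
    """Differential attention with a shared query base matrix: branch-specific
    queries use W_base + A_i @ B_i (low-rank updates), so the two attention maps
    share global structure while keeping per-branch flexibility; subtracting
    them suppresses common-mode noise. Simplified single-head sketch."""
    d = x.shape[-1]
    q1 = x @ (W_base + A1 @ B1)
    q2 = x @ (W_base + A2 @ B2)
    k, v = x @ W_k, x @ W_v
    attn1 = softmax(q1 @ k.T / np.sqrt(d))
    attn2 = softmax(q2 @ k.T / np.sqrt(d))
    return (attn1 - lam * attn2) @ v

rng = np.random.default_rng(0)
n, d, r = 6, 16, 2                                # sequence length, model dim, low rank
x = rng.normal(size=(n, d))
W_base, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
A1, A2 = (rng.normal(size=(d, r)) * 0.1 for _ in range(2))
B1, B2 = (rng.normal(size=(r, d)) * 0.1 for _ in range(2))
out = shared_diff_attention(x, W_base, A1, B1, A2, B2, W_k, W_v)
```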

[396] Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification

Reyhaneh Keshavarzpour, Eghbal Mansoori

Main category: cs.LG

TL;DR: The paper presents an improved deep learning method for antimicrobial peptide prediction by combining optimal coding techniques and addressing dataset imbalance, achieving superior accuracy over existing methods.

DetailsMotivation: Antimicrobial peptides (AMPs) are crucial as antibiotic alternatives in biomedical applications, drug design, and innate immunity. AI algorithms can facilitate AMP identification, but current methods need improvement in prediction accuracy and efficiency.

Method: The proposed method combines the best coding techniques from different perspectives and uses a deep neural network to handle imbalanced datasets. This integrated approach aims to improve AMP prediction performance.

Result: The method shows significant improvement in accuracy and efficiency for AMP prediction compared to existing methods, providing the best results in the field.

Conclusion: The developed prediction and classification method for antimicrobial peptides has high effectiveness and practical applications in medicine and pharmaceutical industries.

Abstract: Identification of antimicrobial peptides is an important and pressing problem. Antimicrobial peptides are essential as an alternative to antibiotics in biomedical applications and many other practical settings: these oligopeptides are useful in drug design and contribute to innate immunity against microorganisms. Artificial intelligence algorithms have played a significant role in easing the identification of these peptides. This research improves on a previously proposed method for antimicrobial peptide prediction by combining the best coding methods from different perspectives and by using a deep neural network to balance the imbalanced combined datasets. The results show that the proposed method significantly improves the accuracy and efficiency of antimicrobial peptide prediction and provides the best results compared with existing methods. These developments in the prediction and classification of antimicrobial peptides are highly effective and applicable, particularly in medicine and the pharmaceutical industry.

[397] Investigating the Generalizability of ECG Noise Detection Across Diverse Data Sources and Noise Types

Sharmad Kalpande, Nilesh Kumar Sahu, Haroon Lone

Main category: cs.LG

TL;DR: Proposes an HRV-based ML approach for detecting noisy ECG segments with cross-dataset validation showing over 90% accuracy and AUPRC, plus release of curated labeled dataset.

DetailsMotivation: ECG signals from wearables are often corrupted by motion/muscle artifacts that distort R-peaks and QRS complex, hindering HRV analysis and risking clinical misinterpretation. Existing methods lack generalizability as they're typically evaluated on single datasets.

Method: HRV-based machine learning approach for detecting noisy ECG segments, evaluated using cross-dataset experiments on four datasets collected in both controlled and uncontrolled settings.

Result: Method achieves over 90% average accuracy and AUPRC exceeding 90%, even on previously unseen datasets, demonstrating robust performance across heterogeneous data sources.

Conclusion: The proposed approach shows strong generalizability for ECG noise detection across diverse sensors and recording conditions, with released curated dataset supporting reproducibility and further research.

Abstract: Electrocardiograms (ECGs) are vital for monitoring cardiac health, enabling the assessment of heart rate variability (HRV), detection of arrhythmias, and diagnosis of cardiovascular conditions. However, ECG signals recorded from wearable devices are frequently corrupted by noise artifacts, particularly those arising from motion and large muscle activity, which distort R-peaks and the QRS complex. These distortions hinder reliable HRV analysis and increase the risk of clinical misinterpretation. Existing studies on ECG noise detection typically evaluate performance on a single dataset, limiting insight into the generalizability of such methods across diverse sensors and recording conditions. In this work, we propose an HRV-based machine learning approach to detect noisy ECG segments and evaluate its generalizability using cross-dataset experiments on four datasets collected in both controlled and uncontrolled settings. Our method achieves over 90% average accuracy and an AUPRC exceeding 90%, even on previously unseen datasets-demonstrating robust performance across heterogeneous data sources. To support reproducibility and further research, we also release a curated and labeled ECG dataset annotated for noise artifacts.
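
A minimal sketch of the HRV-feature pipeline implied above: compute a few standard HRV statistics per ECG segment from its R-R intervals and fit a classifier on labeled clean/noisy segments. The specific features, the random-forest classifier, and the synthetic R-R data below are illustrative assumptions, not the paper's exact feature set or datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hrv_features(rr_intervals_ms):
    """Basic HRV descriptors from successive R-R intervals (in ms). Noise that
    distorts R-peak detection shows up strongly in these statistics."""
    rr = np.asarray(rr_intervals_ms, dtype=float)
    diff = np.diff(rr)
    return np.array([
        rr.mean(),                         # mean R-R interval
        rr.std(ddof=1),                    # SDNN
        np.sqrt(np.mean(diff ** 2)),       # RMSSD
        np.mean(np.abs(diff) > 50.0),      # pNN50: fraction of |delta RR| > 50 ms
    ])

# Synthetic stand-in for labeled segments (0 = clean, 1 = noisy); real training
# would use the curated, annotated datasets described in the paper.
rng = np.random.default_rng(0)
clean = [800 + 30 * rng.standard_normal(60) for _ in range(50)]
noisy = [800 + 200 * rng.standard_normal(60) for _ in range(50)]
X = np.stack([hrv_features(s) for s in clean + noisy])
y = np.array([0] * 50 + [1] * 50)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
```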

[398] Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery

Rebecca J. Herman, Jonas Wahl, Urmi Ninad, Jakob Runge

Main category: cs.LG

TL;DR: The paper addresses artifacts in causal discovery evaluation, proposes improved SCM generation methods to avoid unrealistic sortability patterns, and extends the approach to time series.

DetailsMotivation: Current causal discovery evaluation relies on simulated data, but existing data generation techniques create unrealistic artifacts (var- and R2-sortability) that some methods exploit, leading to inflated performance expectations for real-world applications.

Method: The authors analyze sortability patterns expected in real data, propose a new method for drawing coefficients that better samples the space of SCMs, and extend this approach to time series settings.

Result: The proposed method effectively addresses var-sortability and R2-sortability artifacts, with the iSCM approach showing limitations for denser graphs that exhibit reversed R2-sortability patterns.

Conclusion: Improved SCM generation methods are needed for realistic causal discovery evaluation, and the proposed coefficient sampling approach provides a better foundation for assessing algorithm performance on real-world data.

Abstract: Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables’ variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.

[399] Stable Trajectory Clustering: An Efficient Split and Merge Algorithm

Atieh Rahmani, Mansoor Davoodi, Justin M. Calabrese

Main category: cs.LG

TL;DR: This paper introduces trajectory clustering algorithms based on DBSCAN line segment clustering, including whole-trajectory clustering, sub-trajectory clustering, and a novel stable trajectory clustering method that handles transient anomalies.

DetailsMotivation: Existing trajectory clustering algorithms often split trajectories in response to temporary anomalies, which obscures consistent clustering patterns and leads to less reliable insights about object movement behavior.

Method: The paper presents three algorithms: (1) whole-trajectory clustering using DBSCAN line segment clustering with split/merge events, (2) sub-trajectory clustering with sliding window model, and (3) stable trajectory clustering that uses mean absolute deviation to selectively omit transient deviations while preserving cluster integrity.

Result: The proposed algorithms are evaluated on real trajectory datasets, demonstrating their effectiveness and sensitivity to parameter variations, with stable trajectory clustering showing improved stability and interpretability.

Conclusion: Selective omission of transient deviations in trajectory clustering not only preserves cluster integrity but also improves stability and interpretability, offering more reliable insights into object movement patterns.

Abstract: Clustering algorithms fundamentally group data points by characteristics to identify patterns. Over the past two decades, researchers have extended these methods to analyze trajectories of humans, animals, and vehicles, studying their behavior and movement across applications. This paper presents whole-trajectory clustering and sub-trajectory clustering algorithms based on DBSCAN line segment clustering, which encompasses two key events: split and merge of line segments. The events are utilized to capture object movement history based on the average Euclidean distance between line segments. In this framework, whole-trajectory clustering considers entire entities’ trajectories, whereas sub-trajectory clustering employs a sliding window model to identify local similarity patterns. Many existing trajectory clustering algorithms respond to temporary anomalies in data by splitting trajectories, which often obscures otherwise consistent clustering patterns and leads to less reliable insights. To address this, we introduce the stable trajectory clustering algorithm, which leverages the mean absolute deviation concept to demonstrate that selective omission of transient deviations not only preserves the integrity of clusters but also improves their stability and interpretability. We evaluate all proposed algorithms on real trajectory datasets to illustrate their effectiveness and sensitivity to parameter variations.
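
One plausible reading of the selective-omission step, sketched below: per-step distances of a trajectory to its cluster's reference segments are screened with a mean-absolute-deviation rule, and transient outliers are omitted before any split decision. The threshold k and the toy distance values are illustrative, not the paper's parameterization.

```python
import numpy as np

def transient_deviation_mask(distances, k=1.5):
    """Flag time steps whose distance to the cluster is a transient outlier,
    using the mean absolute deviation (MAD) of the distances. Omitting these
    steps before deciding on a split is one reading of 'selective omission'."""
    d = np.asarray(distances, dtype=float)
    mad = np.mean(np.abs(d - d.mean()))
    return np.abs(d - d.mean()) > k * mad        # True = transient deviation to omit

# Example: a trajectory that briefly swerves away from its cluster.
dist = np.array([1.0, 1.2, 0.9, 1.1, 12.0, 14.0, 1.0, 1.1])
keep = ~transient_deviation_mask(dist)
stable_mean_dist = dist[keep].mean()             # split decision uses only stable steps
```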

[400] Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles

Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal

Main category: cs.LG

TL;DR: MORPHEUS is a multimodal pre-training framework that integrates histopathology images with multi-omics data using a transformer architecture and masked omics modeling for cross-modal learning.

DetailsMotivation: Current self-supervised learning in computational pathology focuses on tissue analysis alone, missing complementary molecular information from omics profiles (transcriptomics, methylomics, genomics). There's a need for multimodal approaches that can capture broader molecular complexity.

Method: Introduces MORPHEUS, a multimodal pre-training strategy with a shared transformer-based architecture. Uses novel masked omics modeling objective to learn cross-modal relationships between histopathology and multi-omics data. Supports flexible any-to-any omics reconstruction.

Result: Pre-trained on large pan-cancer cohort, MORPHEUS shows substantial improvements over supervised and SSL baselines across diverse tasks and modality combinations. Enables histopathology-only or multimodal inference and omics reconstruction.

Conclusion: MORPHEUS represents a promising direction for multimodal foundation models in oncology, offering general-purpose pre-trained encoder for various modality combinations and reconstruction capabilities.

Abstract: Self-supervised learning (SSL) has driven major advances in computational pathology by enabling the learning of rich representations from histopathology data. Yet, tissue analysis alone may fall short in capturing broader molecular complexity, as key complementary information resides in high-dimensional omics profiles such as transcriptomics, methylomics, and genomics. To address this gap, we introduce MORPHEUS, the first multimodal pre-training strategy that integrates histopathology images and multi-omics data within a shared transformer-based architecture. At its core, MORPHEUS relies on a novel masked omics modeling objective that encourages the model to learn meaningful cross-modal relationships. This yields a general-purpose pre-trained encoder that can be applied to histopathology alone or in combination with any subset of omics modalities. Beyond inference, MORPHEUS also supports flexible any-to-any omics reconstruction, enabling one or more omics profiles to be reconstructed from any modality subset that includes histopathology. Pre-trained on a large pan-cancer cohort, MORPHEUS shows substantial improvements over supervised and SSL baselines across diverse tasks and modality combinations. Together, these capabilities position it as a promising direction for the development of multimodal foundation models in oncology. Code is publicly available at https://github.com/Lucas-rbnt/MORPHEUS

[401] Manifold Learning for Personalized and Label-Free Detection of Cardiac Arrhythmias

Amir Reza Vazifeh, Jason W. Fleischer

Main category: cs.LG

TL;DR: NLDR methods like t-SNE and UMAP can identify medically relevant ECG patterns without training, enabling accurate heartbeat classification and arrhythmia detection from raw signals.

DetailsMotivation: Manual ECG analysis is time-consuming and error-prone, while supervised ML struggles with signal variations across individuals/leads and labeling inconsistencies. Unsupervised methods like PCA miss subtle clinical patterns.

Method: Apply nonlinear dimensionality reduction (t-SNE and UMAP) to raw ECG signals from MIT-BIH dataset (lead II and V1). Generate 2D embeddings and analyze clustering patterns for both mixed populations and single subjects.

Result: UMAP and t-SNE produce visually separable clusters corresponding to different individuals or arrhythmia patterns. Simple classifiers on embeddings achieve ≥90% accuracy for individual identification and 98.96% median accuracy with 91.02% F1-score for arrhythmia detection.

Conclusion: NLDR shows promise for automated cardiac monitoring without training, applicable to both single-lead and 12-lead ECG standards, with potential for personalized healthcare beyond cardiology.

Abstract: Electrocardiograms (ECGs) provide direct, non-invasive measurements of heart activity and are well-established tools for detecting and monitoring cardiovascular disease. However, manual ECG analysis can be time-consuming and prone to errors. Machine learning has emerged as a promising approach for automated heartbeat recognition and classification, but substantial variations in ECG signals make it challenging to develop generalizable supervised models. ECG signals vary widely across individuals and leads, while datasets often follow different labeling standards and may be biased, greatly hindering supervised methods. Conventional unsupervised methods, such as principal component analysis, prioritize large (often obvious) variances and typically overlook subtle yet clinically relevant patterns. When labels are missing or variations are small, both approaches fail. Here, we show that nonlinear dimensionality reduction (NLDR) algorithms, namely t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), can address these challenges and identify medically relevant features in ECG signals without training or prior information. Using lead II and V1 signals from the MIT-BIH dataset, UMAP and t-SNE generate rich two-dimensional latent spaces with visually separable clusters. Applied to mixed populations of heartbeats, these clusters correspond to different individuals, while for single subjects they reveal distinct arrhythmia patterns. A simple classifier on these embeddings discriminates individual recordings with >= 90% accuracy and identifies arrhythmias in single patients with a median accuracy of 98.96% and median F1-score of 91.02%. The results show that NLDR holds much promise for cardiac monitoring, including the limiting cases of single-lead ECG and the current 12-lead standard of care, and for personalized health care beyond cardiology.
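
A minimal sketch of the label-free workflow: embed segmented heartbeats with t-SNE (UMAP is used the same way) and read the clusters off with a trivial classifier. The synthetic beats below stand in for the MIT-BIH lead II / V1 segments used in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for segmented heartbeats (rows = beats, columns = samples).
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 180))
abnormal = rng.normal(0.5, 1.5, size=(200, 180))
beats = np.vstack([normal, abnormal])
labels = np.array([0] * 200 + [1] * 200)

# The 2-D embedding is built without any labels or training signal.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(beats)

# A simple classifier on the embedding stands in for "reading off the clusters".
clf = KNeighborsClassifier(n_neighbors=5).fit(emb, labels)
acc = clf.score(emb, labels)
```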

[402] Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful

Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, Micah Goldblum

Main category: cs.LG

TL;DR: Small batch sizes (down to batch size 1) can train stably in language models when Adam hyperparameters are properly scaled, challenging conventional wisdom that requires gradient accumulation for stability.

DetailsMotivation: Conventional wisdom suggests small batch sizes make language model training unstable, requiring gradient accumulation. This work challenges this assumption and explores whether small batch sizes can be made stable with proper hyperparameter scaling.

Method: Proposes a rule for scaling Adam hyperparameters to small batch sizes: instead of fixing the decay rate of the second moment, fix its half-life in terms of tokens. Tests small batch sizes down to batch size 1 across various training scenarios.

Result: Small batch sizes (1) train stably, (2) are more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) enable stable training with vanilla SGD without momentum.

Conclusion: Provides practical recommendations for batch size selection and optimizer hyperparameters. Recommends against gradient accumulation unless training on multiple devices. Shows small batch sizes with small-state optimizers can achieve full fine-tuning performance with LoRA-like memory footprint.

Abstract: Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. In particular, rather than holding the decay rate of the second moment fixed across batch sizes, we propose to hold its half-life fixed in terms of tokens. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas. Finally, we show that a small batch size combined with an optimizer with a small state size can provide the performance benefits of full fine-tuning while maintaining a similar memory footprint to LoRA.
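
One way to read the token-half-life rule, assuming tokens per optimizer step equals batch size times sequence length: convert the reference beta2 into a half-life measured in tokens, hold that half-life fixed, and solve for the new decay rate. The reference values below are hypothetical, and the paper's exact formulation may differ.

```python
import math

def scale_beta2(beta2_ref, tokens_per_step_ref, tokens_per_step_new):
    """Rescale Adam's second-moment decay so the EMA half-life, measured in
    tokens rather than optimizer steps, stays fixed as the batch size changes.
    Half-life in steps is log(0.5) / log(beta2); multiplying by tokens per step
    gives the half-life in tokens that is held constant. Equivalent to
    beta2_new = beta2_ref ** (tokens_per_step_new / tokens_per_step_ref)."""
    half_life_tokens = (math.log(0.5) / math.log(beta2_ref)) * tokens_per_step_ref
    return math.exp(math.log(0.5) * tokens_per_step_new / half_life_tokens)

# Example: beta2 = 0.95 tuned at batch size 512 with 1024-token sequences,
# rescaled for batch size 4 at the same sequence length.
beta2_small = scale_beta2(0.95, tokens_per_step_ref=512 * 1024,
                          tokens_per_step_new=4 * 1024)   # ~0.9996
```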

[403] Multi-Model Ensemble and Reservoir Computing for River Discharge Prediction in Ungauged Basins

Mizuki Funato, Yohei Sawada

Main category: cs.LG

TL;DR: HYPER combines Bayesian model averaging of 47 uncalibrated hydrological models with reservoir computing error correction for efficient, accurate river discharge prediction in data-scarce regions.

DetailsMotivation: Many regions lack sufficient river discharge data, and existing models struggle to achieve high accuracy, interpretability, and efficiency under data-scarce conditions.

Method: Uses Bayesian model averaging on 47 uncalibrated conceptual hydrological models, then applies reservoir computing (via linear regression) to correct errors. For ungauged basins, maps weights to catchment attributes from gauged basins.

Result: On 87 Japanese basins: In data-rich scenario, HYPER (NSE 0.59) performed comparably to LSTM (NSE 0.64) with only 3% computational time. In data-scarce scenario (20% gauged), HYPER maintained NSE 0.51 while LSTM degraded to NSE -0.61.

Conclusion: HYPER provides robust, efficient, generalizable discharge prediction without basin-specific calibration, offering scalable interpretable framework for data-scarce regions.

Abstract: Despite the necessity for accurate flood prediction, many regions lack sufficient river discharge observations. Although numerous models for daily river discharge prediction exist, achieving high accuracy, interpretability, and efficiency under data-scarce conditions remains a major challenge. We address this with a novel method, HYdrological Prediction with multi-model Ensemble and Reservoir computing (HYPER). Our approach applies Bayesian model averaging (BMA) to 47 “uncalibrated” catchment-based conceptual hydrological models. A reservoir computing (RC) model, a type of machine learning model, is then trained via linear regression to correct BMA output errors, a non-iterative process ensuring computational efficiency. For ungauged basins, we infer the required BMA and RC weights by mapping them to catchment attributes from gauged basins, creating a generalizable framework. Evaluated on 87 Japanese basins, in a data-rich scenario, HYPER (median Nash Sutcliffe Efficiency, NSE, of 0.59) performed comparably to a benchmark LSTM (NSE 0.64) but required only 3 % of its computational time. In a data-scarce scenario (where only ~20 % of basins are gauged), HYPER maintained robust performance (NSE 0.51) by leveraging the physical structure of the ensemble. In contrast, the LSTM’s performance degraded substantially (NSE -0.61) due to data insufficiency. These results demonstrate that calibrating individual conceptual hydrological models is unnecessary when using a sufficiently large ensemble that is assembled and combined with machine-learning-based bias correction. HYPER provides a robust, efficient, and generalizable solution for discharge prediction, particularly in ungauged basins. By eliminating basin-specific calibration, HYPER offers a scalable, interpretable framework for accurate hydrological prediction in diverse data-scarce regions.
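
A compact sketch of the two-stage idea on synthetic discharge data: weight an uncalibrated ensemble (here a simplified, likelihood-style stand-in for BMA) and then fit a linear model to the residuals, playing the role of the reservoir-computing readout, which the paper also fits by linear regression. Member construction, weighting, and features are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
T, M = 300, 5                                    # time steps, ensemble members
truth = np.sin(np.linspace(0, 12, T)) + 2.0
# Uncalibrated ensemble members: biased, noisy versions of the true discharge.
ensemble = np.stack([truth * rng.uniform(0.7, 1.3) + rng.normal(0, 0.2, T)
                     for _ in range(M)], axis=1)

# 1) Ensemble averaging with error-based weights on a calibration window
#    (a crude stand-in for Bayesian model averaging).
calib = slice(0, 150)
sse = ((ensemble[calib] - truth[calib, None]) ** 2).sum(axis=0)
weights = np.exp(-0.5 * sse / sse.min())
weights /= weights.sum()
bma = ensemble @ weights

# 2) Error correction by linear regression on the calibration window,
#    standing in for the reservoir-computing readout.
features_calib = np.column_stack([bma[calib], ensemble[calib]])
residual_model = LinearRegression().fit(features_calib, truth[calib] - bma[calib])
corrected = bma + residual_model.predict(np.column_stack([bma, ensemble]))
```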

[404] An Unsupervised Deep Explainable AI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals

Konstantinos Vasili, Zachery T. Dahm, Stylianos Chatzidakis

Main category: cs.LG

TL;DR: Unsupervised XAI framework combining autoencoder and customized windowSHAP detects and characterizes replay attacks on nuclear reactor time series data with 95%+ accuracy.

DetailsMotivation: Next-gen nuclear reactors generate multivariate time series data vulnerable to replay attacks. Current approaches rely on synthetic data, linear assumptions, or lack explainability for root cause analysis in real nuclear cyber-physical systems.

Method: Unsupervised explainable AI framework combining autoencoder for anomaly detection with customized windowSHAP algorithm to characterize replay attacks (detection, source identification, timing, and type).

Result: Framework benchmarked on real datasets from Purdue’s PUR-1 reactor with up to six signals concurrently replayed. Achieved 95%+ accuracy in detecting attacks and in identifying the source, number, and falsification duration of the replayed signals.

Conclusion: Proposed XAI framework successfully addresses replay attack characterization in real nuclear reactor data, overcoming limitations of current synthetic-data approaches and providing explainable predictions for cyber-physical system security.

Abstract: Next generation advanced nuclear reactors are expected to be smaller both in size and power output, relying extensively on fully digital instrumentation and control systems. These reactors will generate a large flow of information in the form of multivariate time series data, conveying simultaneously various non linear cyber physical, process, control, sensor, and operational states. Ensuring data integrity against deception attacks is becoming increasingly important for networked communication and a requirement for safe and reliable operation. Current efforts to address replay attacks, almost universally focus on watermarking or supervised anomaly detection approaches without further identifying and characterizing the root cause of the anomaly. In addition, these approaches rely mostly on synthetic data with uncorrelated Gaussian process and measurement noise and full state feedback or are limited to univariate signals, signal stationarity, linear quadratic regulators, or other linear-time invariant state-space which may fail to capture any unmodeled system dynamics. In the realm of regulated nuclear cyber-physical systems, additional work is needed on characterization of replay attacks and explainability of predictions using real data. Here, we propose an unsupervised explainable AI framework based on a combination of autoencoder and customized windowSHAP algorithm to fully characterize real-time replay attacks, i.e., detection, source identification, timing and type, of increasing complexity during a dynamic time evolving reactor process. The proposed XAI framework was benchmarked on several real world datasets from Purdue’s nuclear reactor PUR-1 with up to six signals concurrently being replayed. In all cases, the XAI framework was able to detect and identify the source and number of signals being replayed and the duration of the falsification with 95 percent or better accuracy.
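
A minimal sketch of the unsupervised detection stage: train a small autoencoder on windows of normal signals and flag windows whose reconstruction error exceeds a threshold; an attribution step (such as the customized windowSHAP mentioned above, not shown here) would then localize which signals and time spans were replayed. Architecture, window size, and the threshold rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SignalAE(nn.Module):
    """Small autoencoder over flattened windows of multivariate signals.
    High reconstruction error on a window flags a potential replay/anomaly."""
    def __init__(self, n_inputs, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_inputs, 32), nn.ReLU(), nn.Linear(32, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, n_inputs))

    def forward(self, x):
        return self.dec(self.enc(x))

n_signals, window = 6, 20
x_normal = torch.randn(512, n_signals * window)      # stand-in for normal-operation windows
model = SignalAE(n_signals * window)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                                 # train on normal data only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_normal), x_normal)
    loss.backward()
    opt.step()

# Detection threshold from the distribution of errors on normal data.
with torch.no_grad():
    err = ((model(x_normal) - x_normal) ** 2).mean(dim=1)
threshold = err.mean() + 3 * err.std()

def is_replay_suspect(windows):
    """Flag windows whose reconstruction error exceeds the normal-data threshold."""
    with torch.no_grad():
        e = ((model(windows) - windows) ** 2).mean(dim=1)
    return e > threshold
```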

[405] Beyond Gaussian Initializations: Signal Preserving Weight Initialization for Odd-Sigmoid Activations

Hyunwoo Lee, Hayoung Choi, Hyunju Kim

Main category: cs.LG

TL;DR: The paper introduces odd-sigmoid activation functions and an activation-aware initialization method that preserves signal variance and gradient norms in deep narrow networks, outperforming standard Gaussian initializations.

DetailsMotivation: Standard Gaussian i.i.d. initializations are designed for wide/infinite width networks but fail in deep narrow networks with sigmoidal activations, causing preactivations to saturate and gradients to collapse. There's a need for better initialization schemes that work with various nonlinearities.

Method: Proposes odd-sigmoid activation functions and develops an activation-aware initialization method tailored to this class. The method maintains forward signal variance and backpropagated gradient norms across a wide range of variance scales, even in deep narrow networks.

Result: The proposed initialization is substantially less sensitive to depth, width, and activation scale than Gaussian initializations. In PINNs, scaled odd-sigmoid activations with the new initialization achieve lower losses than Gaussian-based setups, showing diagonal-plus-noise weights work when Gaussian initialization fails.

Conclusion: Odd-sigmoid activations combined with activation-aware initialization provide a robust alternative to standard Gaussian schemes, especially beneficial for deep narrow networks and applications like PINNs where traditional methods break down.

Abstract: Activation functions critically influence trainability and expressivity, and recent work has therefore explored a broad range of nonlinearities. However, widely used Gaussian i.i.d. initializations are designed to preserve activation variance under wide or infinite width assumptions. In deep and relatively narrow networks with sigmoidal nonlinearities, these schemes often drive preactivations into saturation and collapse gradients. To address this, we introduce a class of odd-sigmoid activations and propose an activation-aware initialization tailored to any function in this class. Our method remains robust over a wide band of variance scales, preserving both forward signal variance and backpropagated gradient norms even in very deep and narrow networks. Empirically, across standard image benchmarks we find that the proposed initialization is substantially less sensitive to depth, width, and activation scale than Gaussian initializations. In physics-informed neural networks (PINNs), scaled odd-sigmoid activations combined with our initialization achieve lower losses than Gaussian-based setups, suggesting that diagonal-plus-noise weights provide a practical alternative when Gaussian initialization breaks down.
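
The paper's scheme reportedly uses diagonal-plus-noise weights, which is not reproduced here; the sketch below only illustrates the variance-preservation criterion that an activation-aware initialization targets, estimating the activation's second moment numerically for a given odd sigmoid and scaling the weight standard deviation accordingly. The function names and the tanh example are illustrative.

```python
import numpy as np

def activation_aware_std(act, fan_in, n_samples=200_000, seed=0):
    """Weight std chosen so pre-activation variance is preserved layer to layer:
    with preactivations ~ N(0, 1), the next layer's preactivation variance is
    fan_in * Var(w) * E[act(z)^2], so set Var(w) = 1 / (fan_in * E[act(z)^2]).
    E[act(z)^2] is estimated by Monte Carlo. A sketch of the criterion only,
    not the paper's exact (non-Gaussian) initialization."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    second_moment = np.mean(act(z) ** 2)
    return 1.0 / np.sqrt(fan_in * second_moment)

def init_layer(fan_in, fan_out, act=np.tanh, seed=0):
    std = activation_aware_std(act, fan_in)
    return np.random.default_rng(seed).normal(0.0, std, size=(fan_in, fan_out))

# tanh is one odd-sigmoid example; any odd, monotone, bounded activation
# could be plugged in the same way.
W = init_layer(fan_in=64, fan_out=64, act=np.tanh)
```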

[406] Understanding Sampler Stochasticity in Training Diffusion Models for RLHF

Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, Wenpin Tang

Main category: cs.LG

TL;DR: The paper analyzes the reward gap between stochastic training and deterministic inference in RLHF for diffusion models, providing theoretical bounds and empirical validation that higher-stochasticity training improves inference quality.

DetailsMotivation: There's a mismatch between stochastic samplers used during RLHF training of diffusion models (for exploration) and deterministic samplers used during inference (for efficiency/stability), creating a reward gap that raises concerns about inference quality.

Method: Theoretical characterization of reward gap with non-vacuous bounds for general diffusion models, using gDDIM framework to support arbitrary stochasticity while preserving data marginals. Empirical validation through large-scale text-to-image experiments with DDPO and MixGRPO.
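
For context on the stochasticity knob being analyzed: the standard DDIM update already interpolates between a deterministic ODE-like sampler (eta = 0) and an ancestral SDE-like sampler (eta = 1) while keeping the per-step marginals; gDDIM generalizes this idea to arbitrarily high stochasticity. The sketch below is the standard DDIM step in the usual alpha-bar/epsilon-prediction notation, not the paper's gDDIM construction.

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0):
    """One reverse step of a standard DDIM-style sampler.

    eta = 0.0 : deterministic ODE-like update (typical at inference)
    eta = 1.0 : ancestral, SDE-like update (typical for exploratory RLHF fine-tuning)
    """
    alpha_bar_t = torch.as_tensor(alpha_bar_t, dtype=x_t.dtype)
    alpha_bar_prev = torch.as_tensor(alpha_bar_prev, dtype=x_t.dtype)
    # Predicted clean sample from the noise prediction (standard DDIM algebra).
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
    # Injected-noise scale, modulated by the stochasticity level eta.
    sigma = eta * torch.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t)) \
                * torch.sqrt(1 - alpha_bar_t / alpha_bar_prev)
    # Deterministic direction toward x_{t-1}.
    dir_xt = torch.sqrt(torch.clamp(1 - alpha_bar_prev - sigma**2, min=0.0)) * eps_pred
    return torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt + sigma * torch.randn_like(x_t)
```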

Result: Theoretical bounds show reward gap narrows over training, with sharper convergence rates for VE and VP Gaussian models. Empirical results confirm reward gaps consistently narrow, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.

Conclusion: Higher-stochasticity training during RLHF fine-tuning of diffusion models leads to better deterministic inference quality, addressing the training-inference mismatch problem and providing theoretical guarantees for practical applications.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training and that ODE sampling quality improves when models are updated using higher-stochasticity SDE training.

[407] Forking-Sequences

Willa Potosnak, Malcolm Wolff, Mengfei Cao, Ruijun Ma, Tatiana Konstantinova, Dmitry Efimov, Michael W. Mahoney, Boris Oreshkin, Kin G. Olivares

Main category: cs.LG

TL;DR: Forking-sequences neural architecture improves forecast stability across forecast creation dates while maintaining accuracy, outperforming conventional methods that process FCDs independently.

DetailsMotivation: Forecast stability across forecast creation dates is crucial for downstream decision-making, but conventional methods produce erratic revisions. Forking-sequences architecture addresses this stability issue while maintaining accuracy.

Method: Formalize forking-sequences design that jointly encodes and decodes entire time series across all forecast creation dates, producing multi-horizon forecast grid in single forward pass. Compare with baseline window-sampling on M-series benchmark using 16 datasets.
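
A minimal sketch of the contrast with window-sampling: encode the whole series once and apply a shared multi-horizon head at every time step, so one forward pass yields the full (FCD x horizon) forecast grid. The GRU encoder and linear head are placeholder choices, not the MQCNN/MQT/SPADE internals.

```python
import torch
import torch.nn as nn

class ForkingSequencesForecaster(nn.Module):
    """Sketch: encode the whole series once, then decode a multi-horizon forecast
    at every time step (every forecast creation date), yielding a (T, horizon) grid
    in a single forward pass instead of one window per pass."""

    def __init__(self, d_in=1, d_hidden=64, horizon=12):
        super().__init__()
        self.encoder = nn.GRU(d_in, d_hidden, batch_first=True)
        self.decoder = nn.Linear(d_hidden, horizon)   # shared head applied at every FCD

    def forward(self, y):                   # y: (batch, T, d_in)
        h, _ = self.encoder(y)              # (batch, T, d_hidden): state at each FCD
        return self.decoder(h)              # (batch, T, horizon): full forecast grid

model = ForkingSequencesForecaster()
series = torch.randn(8, 100, 1)
grid = model(series)                        # forecasts for all 100 FCDs at once
loss = grid.pow(2).mean()                   # placeholder loss over the whole grid
```

Training on a loss summed over the whole grid is what the abstract credits with the ensembling, gradient-variance-reduction, and inference-efficiency benefits.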

Result: Median accuracy improvements: 29.7% (MLP), 46.2% (RNN), 49.3% (LSTM), 28.6% (CNN), 24.7% (Transformer), 6.4% (State Space). Forecast ensembling improves stability by 10.8-13.2% while maintaining accuracy.

Conclusion: Forking-sequences architecture provides three key benefits: increased forecast stability through ensembling, gradient variance reduction for stable training, and computational efficiency during inference, making it superior to conventional independent FCD processing.

Abstract: While accuracy is a critical requirement for time series forecasting, an equally important desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, disrupting downstream decision-making. To improve forecast stability, several state-of-the-art models including MQCNN, MQT, and SPADE employ a powerful yet underexplored neural network architectural design known as forking-sequences. This architectural design jointly encodes and decodes the entire time series across all FCDs, producing an entire multi-horizon forecast grid in a single forward pass. This approach contrasts with conventional statistical and neural forecasting methods that process FCDs independently, generating only a single multi-horizon forecast per forward pass. In this work, we formalize the forking-sequences design and motivate its broader adoption by introducing a metric for quantifying excess volatility in forecast revisions and by providing theoretical and empirical analysis. We theoretically motivate three key benefits of forking-sequences: (i) increased forecast stability through ensembling; (ii) gradient variance reduction, leading to more stable and consistent training steps; and (iii) improved computational efficiency during inference. We validate the benefits of forking-sequences compared to baseline window-sampling on the M-series benchmark, using 16 datasets from the M1, M3, M4, and Tourism competitions. We observe median accuracy improvements across datasets of 29.7%, 46.2%, 49.3%, 28.6%, 24.7%, and 6.4% for MLP, RNN, LSTM, CNN, Transformer, and State Space-based architectures, respectively. We then show that forecast ensembling during inference can improve median forecast stability by 10.8%, 13.2%, 13.0%, 10.9%, 10.2%, and 11.2% for these respective models trained with forking-sequences, while maintaining accuracy.

[408] HeSRN: Representation Learning On Heterogeneous Graphs via Slot-Aware Retentive Network

Yifan Lu, Ziyun Zou, Belal Alsinglawi, Islam Al-Qudah, Izzat Alsmadi, Feilong Tang, Pengfei Jiao, Shoaib Jameel, Imran Razzak

Main category: cs.LG

TL;DR: HeSRN is a heterogeneous graph transformer that uses slot-aware retention networks for efficient and expressive representation learning with linear complexity.

DetailsMotivation: Graph Transformers have quadratic complexity and struggle with heterogeneous semantics, limiting scalability and generalization on real-world heterogeneous graphs.

Method: Introduces slot-aware structure encoder to disentangle node-type semantics via independent slots with slot normalization and retention-based fusion, plus retention-based encoder replacing self-attention for linear complexity.
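
A rough sketch of how the two named ingredients could compose: per-node-type "slot" projections with slot normalization, followed by a retention-style decayed linear recurrence in place of quadratic self-attention. The actual HeSRN layer (multi-scale retention, retention-based fusion) is more elaborate; the dimensions and node ordering below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SlotRetentionLayer(nn.Module):
    """Sketch: project each node type into its own slot, normalize per slot,
    then mix a sequence of node embeddings with a retention-style linear recurrence
    (decayed running state) instead of quadratic self-attention."""

    def __init__(self, in_dims, d_slot=64, decay=0.9):
        super().__init__()
        # One projection ("slot") per node type; in_dims maps type name -> feature dim.
        self.slots = nn.ModuleDict({t: nn.Linear(d, d_slot) for t, d in in_dims.items()})
        self.norm = nn.LayerNorm(d_slot)
        self.q, self.k, self.v = (nn.Linear(d_slot, d_slot) for _ in range(3))
        self.decay = decay

    def forward(self, feats, types):        # feats: list of (d_i,) tensors, types: list of str
        x = torch.stack([self.norm(self.slots[t](f)) for f, t in zip(feats, types)])  # (N, d_slot)
        q, k, v = self.q(x), self.k(x), self.v(x)
        state = torch.zeros(x.size(1), x.size(1))          # (d_slot, d_slot) running state
        outs = []
        for i in range(x.size(0)):                          # linear in the number of nodes
            state = self.decay * state + torch.outer(k[i], v[i])
            outs.append(q[i] @ state)
        return torch.stack(outs)                            # (N, d_slot)

layer = SlotRetentionLayer({"paper": 128, "author": 64})
feats = [torch.randn(128), torch.randn(64), torch.randn(128)]
out = layer(feats, ["paper", "author", "paper"])            # (3, 64)
```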

Result: Outperforms state-of-the-art heterogeneous GNNs and Graph Transformers on node classification across four real-world datasets with significantly lower computational complexity.

Conclusion: HeSRN provides an efficient and expressive solution for heterogeneous graph representation learning by addressing semantic entanglement and computational bottlenecks of previous approaches.

Abstract: Graph Transformers have recently achieved remarkable progress in graph representation learning by capturing long-range dependencies through self-attention. However, their quadratic computational complexity and inability to effectively model heterogeneous semantics severely limit their scalability and generalization on real-world heterogeneous graphs. To address these issues, we propose HeSRN, a novel Heterogeneous Slot-aware Retentive Network for efficient and expressive heterogeneous graph representation learning. HeSRN introduces a slot-aware structure encoder that explicitly disentangles node-type semantics by projecting heterogeneous features into independent slots and aligning their distributions through slot normalization and retention-based fusion, effectively mitigating the semantic entanglement caused by forced feature-space unification in previous Transformer-based models. Furthermore, we replace the self-attention mechanism with a retention-based encoder, which models structural and contextual dependencies in linear time complexity while maintaining strong expressive power. A heterogeneous retentive encoder is further employed to jointly capture both local structural signals and global heterogeneous semantics through multi-scale retention layers. Extensive experiments on four real-world heterogeneous graph datasets demonstrate that HeSRN consistently outperforms state-of-the-art heterogeneous graph neural networks and Graph Transformer baselines on node classification tasks, achieving superior accuracy with significantly lower computational complexity.

[409] TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting

Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter

Main category: cs.LG

TL;DR: TempoPFN is a univariate time series foundation model using linear RNNs with GatedDeltaProduct architecture, pre-trained on synthetic data, achieving top-tier zero-shot performance while being more efficient than existing baselines.

DetailsMotivation: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks.

Method: Uses linear Recurrent Neural Networks (RNNs) with GatedDeltaProduct architecture and state-weaving for fully parallelizable training across sequence lengths. Pre-trained exclusively on synthetic data from a comprehensive pipeline unifying diverse generators including stochastic differential equations, Gaussian processes, and audio synthesis with novel augmentations.
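
Two of the synthetic generators named in the abstract are easy to illustrate: an Ornstein-Uhlenbeck SDE simulated with Euler-Maruyama, and a smooth random sinusoid mixture standing in for a Gaussian-process sample. The actual pipeline unifies many more generators and augmentations; the parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def ou_process(n=512, theta=0.5, mu=0.0, sigma=0.3, dt=0.05):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck SDE dx = theta*(mu-x)dt + sigma*dW."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = x[t-1] + theta * (mu - x[t-1]) * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

def random_sinusoid_mixture(n=512, k=4):
    """Cheap stand-in for a smooth GP sample: a random mixture of sinusoids plus noise."""
    t = np.arange(n)
    freqs = rng.uniform(0.002, 0.05, k)
    amps = rng.uniform(0.2, 1.0, k)
    phases = rng.uniform(0, 2 * np.pi, k)
    return (amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t + phases[:, None])).sum(0) \
           + 0.05 * rng.normal(size=n)

# A synthetic pre-training batch: each series comes from a randomly chosen generator.
batch = np.stack([ou_process() if rng.random() < 0.5 else random_sinusoid_mixture()
                  for _ in range(32)])
```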

Result: Achieves top-tier competitive performance in zero-shot evaluations on Gift-Eval, fev-bench and Chronos-ZS benchmarks, outperforming all existing synthetic-only approaches and surpassing majority of models trained on real-world data, while being more efficient than existing baselines.

Conclusion: TempoPFN provides a reproducible foundation for future research with open-sourced complete data generation pipeline and training code, demonstrating that synthetic-only pre-training can achieve competitive zero-shot time series forecasting performance.

Abstract: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval, fev-bench and Chronos-ZS benchmarks, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.

[410] Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective

Xingjian Wu, Xiangfei Qiu, Hanyin Cheng, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang

Main category: cs.LG

TL;DR: SRS module introduces selective patching and dynamic reassembly to create flexible representation spaces for time series forecasting, improving patch-based models’ performance.

DetailsMotivation: Conventional patching uses adjacent patches with fixed representation spaces, leading to insufficiently expressive representations. The paper aims to create selective representation spaces that can flexibly include the most informative patches.

Method: Proposes Selective Representation Space (SRS) module with learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle patches from contextual time series. Also introduces SRSNet combining SRS with an MLP head.
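
A hedged sketch of what "selective patching" could look like: a small learnable scorer ranks patches, the top-k are gathered and softly re-weighted, and the result feeds the forecasting head. The real SRS module's selection and Dynamic Reassembly parameterization are not specified in the summary, so the details below are assumptions.

```python
import torch
import torch.nn as nn

class SelectivePatching(nn.Module):
    """Sketch: split the context into patches, score each patch with a small learnable
    scorer, keep the top-k patches, and reassemble them so the downstream head sees a
    flexible rather than fixed representation space."""

    def __init__(self, patch_len=16, d_model=64, k=8):
        super().__init__()
        self.patch_len, self.k = patch_len, k
        self.embed = nn.Linear(patch_len, d_model)
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, x):                                        # x: (batch, context_len)
        patches = x.unfold(1, self.patch_len, self.patch_len)    # (b, n_patches, patch_len)
        z = self.embed(patches)                                  # (b, n_patches, d_model)
        scores = self.scorer(z).squeeze(-1)                      # (b, n_patches)
        top = scores.topk(self.k, dim=1).indices                 # indices of selected patches
        idx = top.unsqueeze(-1).expand(-1, -1, z.size(-1))
        selected = z.gather(1, idx)                              # (b, k, d_model)
        # Soft re-weighting keeps a gradient path to the scorer.
        return selected * torch.softmax(scores.gather(1, top), dim=1).unsqueeze(-1)

srs = SelectivePatching()
out = srs(torch.randn(4, 256))                                   # (4, 8, 64)
```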

Result: SRSNet achieves state-of-the-art performance on real-world datasets from multiple domains. SRS module also enhances existing patch-based models as a plug-and-play component.

Conclusion: The SRS module effectively addresses limitations of conventional patching by creating flexible representation spaces, improving forecasting performance while being compatible with existing patch-based architectures.

Abstract: Time Series Forecasting has made significant progress with the help of the Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information in a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressive representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of the SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plug-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.

[411] Improving Conditional VAE with approximation using Normalizing Flows

Tuhin Subhra De

Main category: cs.LG

TL;DR: CVAEs with normalizing flows for conditional latent distribution estimation outperform traditional CVAEs, reducing FID by 4% and increasing log likelihood by 7.6%.

DetailsMotivation: Traditional generative models (VAEs, GANs) have been superseded by diffusion models, but there's still value in improving CVAEs for conditional image generation. Existing CVAEs make unrealistic assumptions about conditional latent distributions.

Method: Proposes using normalizing flows to estimate the conditional distribution of latent space given labels in CVAEs, rather than assuming it equals the prior distribution. Also uses learnable variance in the Gaussian decoder to address blurry image generation.
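
One concrete way to model the conditional latent distribution p(z|y) rather than assuming it equals the prior is a conditional coupling flow; a RealNVP-style layer conditioned on a one-hot label is sketched below. The paper's specific flow architecture is not given in the summary, so this is only illustrative.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One RealNVP-style coupling layer conditioned on a label embedding:
    half of z is transformed with a scale/shift predicted from the other half and y."""

    def __init__(self, d_latent=16, d_label=10, d_hidden=64):
        super().__init__()
        self.d_half = d_latent // 2
        self.net = nn.Sequential(
            nn.Linear(self.d_half + d_label, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 2 * (d_latent - self.d_half)),
        )

    def forward(self, z, y):                        # z: (b, d_latent), y: (b, d_label) one-hot
        z1, z2 = z[:, :self.d_half], z[:, self.d_half:]
        s, t = self.net(torch.cat([z1, y], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)                           # keep scales bounded for stability
        z2 = z2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                      # contribution to log|det J|
        return torch.cat([z1, z2], dim=1), log_det

# Stacking several couplings (with permuted halves) and accumulating the log-determinants
# under a N(0, I) base gives a tractable, expressive estimate of log p(z|y).
flow = ConditionalAffineCoupling()
z = torch.randn(8, 16)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), 10).float()
z_out, log_det = flow(z, y)
```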

Result: The method achieves better image generation than existing CVAE approaches, reducing FID by 4% and increasing log likelihood by 7.6% compared to previous methods.

Conclusion: Improving traditional generative models like CVAEs through better conditional distribution modeling with normalizing flows can still yield significant performance gains, even in the era of diffusion models.

Abstract: Variational Autoencoders and Generative Adversarial Networks remained the state-of-the-art (SOTA) generative models until 2022. Now they are superseded by diffusion-based models. Efforts to improve traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAE) to incorporate desired attributes within the images. VAEs are known to produce blurry images with limited diversity; we adopt a method that addresses this issue by treating the variance of the Gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels is equal to the prior distribution, which is not the case in reality. We show that estimating it using normalizing flows results in better image generation than existing methods, reducing the FID by 4% and increasing the log likelihood by 7.6% over the previous approach.

[412] Scale-Agnostic Kolmogorov-Arnold Geometry in Neural Networks

Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams

Main category: cs.LG

TL;DR: Kolmogorov-Arnold geometric structure emerges in 2-layer MLPs trained on MNIST, showing scale-invariant properties from local neighborhoods to full images.

DetailsMotivation: Previous work showed KAG structure develops in shallow MLPs on 3D synthetic tasks, but it was unclear if this phenomenon extends to realistic high-dimensional data like MNIST and what spatial properties it exhibits.

Method: Extended KAG analysis to MNIST (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales, examining both standard training and training with spatial augmentation.

Result: KAG emerges during training and appears consistently across spatial scales (from 7-pixel neighborhoods to full 28x28 images), with the same qualitative pattern across different training procedures.

Conclusion: Neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data, demonstrating that KAG phenomenon persists beyond synthetic tasks.

Abstract: Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits. We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.

[413] Doubly Wild Refitting: Model-Free Evaluation of High Dimensional Black-Box Predictions under Convex Losses

Haichen Hu, David Simchi-Levi

Main category: cs.LG

TL;DR: Efficient refitting procedure for excess risk evaluation of ERM under convex losses using wild response perturbations, providing model-free high-probability upper bounds without complexity assumptions.

DetailsMotivation: Traditional capacity-based learning theory fails for modern opaque ML systems (deep networks, generative models) due to extreme hypothesis class complexity. Need model-free methods to evaluate excess risk without prior knowledge of function class complexity.

Method: 1) Generate two sets of wild responses by stochastically perturbing gradient vectors with careful scaling; 2) Refit black-box training procedure twice on pseudo-labeled datasets to get two wild predictors; 3) Combine original predictor, wild predictors, and wild responses to derive efficient excess risk upper bound.
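
For intuition, in the special case of squared loss the gradient with respect to a prediction is the residual, so "perturbing the gradient vector" reduces to a wild-bootstrap-style construction. The sketch below is that special case only; the general convex-loss construction, the choice of the scaling rho, and the final bound assembled from the three predictors are where the actual method's content lies.

```python
import numpy as np

def doubly_wild_refit(X, y, fit, rho=1.0, seed=0):
    """Sketch for squared loss: build two wild-response datasets by adding +/- a
    Rademacher-signed, rho-scaled residual to the fitted values, refit the black-box
    `fit` on each, and return the original and two wild predictors."""
    rng = np.random.default_rng(seed)
    f_hat = fit(X, y)                            # original black-box predictor
    resid = y - f_hat(X)
    eps = rng.choice([-1.0, 1.0], size=len(y))   # Rademacher signs
    y_plus = f_hat(X) + rho * eps * resid        # first set of wild responses
    y_minus = f_hat(X) - rho * eps * resid       # second set of wild responses
    f_plus, f_minus = fit(X, y_plus), fit(X, y_minus)
    return f_hat, f_plus, f_minus, (y_plus, y_minus)

# `fit` is any training routine returning a callable predictor, e.g. a ridge solver:
def ridge_fit(X, y, lam=1.0):
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda Xq: Xq @ w

X = np.random.default_rng(1).normal(size=(200, 5))
y = X @ np.ones(5) + 0.5 * np.random.default_rng(2).normal(size=200)
f_hat, f_plus, f_minus, _ = doubly_wild_refit(X, y, ridge_fit)
```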

Result: Developed an efficient refitting procedure that computes excess risk and provides high-probability upper bounds under fixed-design setting, requiring only black-box access to training algorithm and single dataset.

Conclusion: The method is essentially model-free, requires no prior knowledge of function class complexity, and holds significant promise for theoretically evaluating modern opaque ML systems where traditional learning theory becomes infeasible.

Abstract: We study the problem of excess risk evaluation for empirical risk minimization (ERM) under general convex loss functions. Our contribution is an efficient refitting procedure that computes the excess risk and provides high-probability upper bounds under the fixed-design setting. Assuming only black-box access to the training algorithm and a single dataset, we begin by generating two sets of artificially modified pseudo-outcomes, termed wild responses, created by stochastically perturbing the gradient vectors with carefully chosen scaling. Using these two pseudo-labeled datasets, we then refit the black-box procedure twice to obtain two corresponding wild predictors. Finally, leveraging the original predictor, the two wild predictors, and the constructed wild responses, we derive an efficient excess risk upper bound. A key feature of our analysis is that it requires no prior knowledge of the complexity of the underlying function class. As a result, the method is essentially model-free and holds significant promise for theoretically evaluating modern opaque machine learning systems, such as deep neural networks and generative models, where traditional capacity-based learning theory becomes infeasible due to the extreme complexity of the hypothesis class.

[414] ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

Hang Yu, Di Zhang, Qiwei Du, Yanping Zhao, Hai Zhang, Guang Chen, Eduardo E. Veas, Junqiao Zhao

Main category: cs.LG

TL;DR: ASTRO is a data augmentation framework for offline RL that generates novel, dynamics-consistent trajectories through temporal-distance representation learning and dynamics-guided stitching with Rollout Deviation Feedback.

DetailsMotivation: Offline RL struggles with suboptimal, fragmented datasets that cause inaccurate value estimation. Existing augmentation methods either stay too close to behavior policy support or violate dynamics constraints, limiting policy improvement.

Method: ASTRO learns temporal-distance representations to identify reachable stitch targets, then uses a dynamics-guided stitch planner with Rollout Deviation Feedback (gap between target and actual state sequences) to generate connecting action sequences for feasible trajectory stitching.
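
The Rollout Deviation Feedback signal is defined in the summary as the gap between the intended stitch-target states and the states actually reached by executing the planner's actions. A minimal way to compute such a signal, assuming a learned one-step dynamics model dynamics_step (a hypothetical callable, not part of the paper's interface):

```python
import numpy as np

def rollout_deviation(dynamics_step, s0, actions, target_states):
    """Sketch of Rollout Deviation Feedback: execute the planner's actions through a
    learned dynamics model and measure the gap to the intended stitch-target states."""
    s, gaps = s0, []
    for a, s_target in zip(actions, target_states):
        s = dynamics_step(s, a)                       # predicted next state
        gaps.append(np.linalg.norm(s - s_target))
    return float(np.mean(gaps))                       # feedback signal for the stitch planner
```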

Result: ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving significant performance gains on challenging OGBench suite and consistent improvements on standard D4RL benchmarks.

Conclusion: ASTRO effectively addresses trajectory fragmentation in offline RL by generating distributionally novel yet dynamics-consistent trajectories through intelligent stitching, leading to enhanced policy learning performance.

Abstract: Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between target state sequence and the actual arrived state sequence by executing predicted actions, to improve trajectory stitching’s feasibility and reachability. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gain on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.

[415] MSTN: Fast and Efficient Multivariate Time Series Prediction Model

Sumit S Shevtekar, Chandresh K Maurya

Main category: cs.LG

TL;DR: MSTN is a multi-scale temporal network that combines convolutional encoding, sequence modeling, and self-gated fusion to handle non-stationary time series with complex dynamics across multiple temporal scales, achieving SOTA performance across diverse benchmarks.

DetailsMotivation: Real-world time series exhibit strong non-stationarity, complex nonlinear dynamics, and multi-scale behavior, but existing architectures impose rigid structural priors that limit adaptability to abrupt events and over-regularize temporal dynamics.

Method: MSTN integrates three components: (1) multi-scale convolutional encoder for fine-grained local structure, (2) sequence modeling module (recurrent or attention-based) for long-range dependencies, and (3) self-gated fusion with squeeze-excitation and multi-head attention to dynamically modulate cross-scale representations.
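
A sketch of the general flavor of the first and third components: parallel convolutions at several temporal scales fused through a squeeze-and-excitation gate. The real MSTN also includes the sequence-modeling module, multi-head attention in the fusion stage, and a dual-head output, none of which are shown; sizes are placeholders.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Sketch: parallel 1-D convolutions at several kernel sizes (temporal scales),
    concatenated and gated channel-wise by a squeeze-and-excitation module."""

    def __init__(self, d_in=1, d_model=32, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(d_in, d_model, k, padding=k // 2) for k in kernel_sizes
        )
        d_total = d_model * len(kernel_sizes)
        self.se = nn.Sequential(                     # squeeze-excitation over channels
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(d_total, d_total // 4), nn.ReLU(),
            nn.Linear(d_total // 4, d_total), nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (batch, d_in, T)
        z = torch.cat([b(x) for b in self.branches], dim=1)   # (batch, d_total, T)
        gate = self.se(z).unsqueeze(-1)                        # (batch, d_total, 1)
        return z * gate                                        # channel-wise gated features

feats = MultiScaleBlock()(torch.randn(4, 1, 96))
```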

Result: MSTN achieves state-of-the-art performance across forecasting, imputation, classification, and cross-dataset generalization benchmarks, outperforming recent leading approaches (TIME-LLM, HiMTM, SOFTS, LLM4TS, TimesNet, PatchTST) and establishing new best results on 24 out of 32 datasets.

Conclusion: MSTN provides a lightweight, efficient solution for multi-scale temporal modeling that avoids computational burdens of long-context models while maintaining strong performance and fast inference suitable for edge deployment.

Abstract: Real-world time series often exhibit strong non-stationarity, complex nonlinear dynamics, and behaviour expressed across multiple temporal scales, from rapid local fluctuations to slow-evolving long-range trends. However, many contemporary architectures impose rigid, fixed-scale structural priors – such as patch-based tokenization, predefined receptive fields, or frozen backbone encoders – which can over-regularize temporal dynamics and limit adaptability to abrupt high-magnitude events. To handle this, we introduce the \emph{Multi-scale Temporal Network} (MSTN), a hybrid neural architecture grounded in an \emph{Early Temporal Aggregation} principle. MSTN integrates three complementary components: (i) a multi-scale convolutional encoder that captures fine-grained local structure; (ii) a sequence modeling module that learns long-range dependencies through either recurrent or attention-based mechanisms; and (iii) a self-gated fusion stage incorporating squeeze-excitation and multi-head attention to dynamically modulate cross-scale representations. This design enables MSTN to flexibly model temporal patterns spanning milliseconds to extended horizons, while avoiding the computational burden typically associated with long-context models. Across extensive benchmarks covering forecasting, imputation, classification, and cross-dataset generalization, MSTN consistently delivers state-of-the-art performance, outperforming recent leading approaches including TIME-LLM, HiMTM, SOFTS, LLM4TS, TimesNet, and PatchTST, and establishing new best results on 24 out of 32 datasets. Despite its strong performance, MSTN remains lightweight and supports fast inference, making it well suited for deployment on edge devices and resource-constrained environments.

[416] Light-Weight Benchmarks Reveal the Hidden Hardware Cost of Zero-Shot Tabular Foundation Models

Ishaan Gangwani, Aayam Bansal

Main category: cs.LG

TL;DR: Benchmark shows tabular foundation models (FMs) have huge hardware costs vs. tree ensembles: FMs need 40,000x more latency and 9GB VRAM for only 0.8% accuracy gain, while tree models match or beat FM accuracy with near-zero resource usage.

DetailsMotivation: Zero-shot foundation models promise training-free prediction on tabular data, but their hardware footprint remains poorly characterized. There's a need to quantify the hardware-versus-accuracy trade-offs in current tabular FMs.

Method: Created a fully reproducible benchmark reporting test accuracy, wall-clock latency, peak CPU RAM, and peak GPU VRAM on four public datasets (Adult-Income, Higgs-100k, Wine-Quality, California-Housing). Compared two open FMs (TabPFN-1.0 and TabICL-base) against tuned XGBoost, LightGBM, and Random Forest baselines on a single NVIDIA T4 GPU.
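
A measurement harness in the same spirit is straightforward to assemble from standard APIs (time.perf_counter, tracemalloc, and torch.cuda peak-memory counters); the paper's exact instrumentation may differ, e.g. it likely tracks process RSS rather than Python-heap allocations.

```python
import time, tracemalloc
import torch

def profile_inference(predict_fn, X):
    """Sketch of a measurement harness: wall-clock latency, peak Python-heap RAM
    (via tracemalloc), and peak GPU VRAM (via torch.cuda) for one full-test-set call."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    tracemalloc.start()
    t0 = time.perf_counter()
    preds = predict_fn(X)                     # single zero-shot / batch prediction call
    latency_s = time.perf_counter() - t0
    _, peak_ram = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak_vram = torch.cuda.max_memory_allocated() if torch.cuda.is_available() else 0
    return preds, {"latency_s": latency_s,
                   "peak_ram_mb": peak_ram / 2**20,
                   "peak_vram_mb": peak_vram / 2**20}
```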

Result: Tree ensembles equal or surpass FM accuracy on three datasets while completing full-test batches in ≤0.40s and ≤150MB RAM, using zero VRAM. TabICL achieves 0.8 percentage-point gain on Higgs but requires ~40,000x more latency (960s) and 9GB VRAM. TabPFN matches tree-model accuracy on Wine and Housing but peaks at 4GB VRAM and cannot process full 100k-row Higgs table.

Conclusion: Results quantify substantial hardware-versus-accuracy trade-offs in current tabular FMs and provide an open baseline for future efficiency-oriented research. Tree ensembles remain highly efficient alternatives with comparable or better accuracy in most cases.

Abstract: Zero-shot foundation models (FMs) promise training-free prediction on tabular data, yet their hardware footprint remains poorly characterized. We present a fully reproducible benchmark that reports test accuracy together with wall-clock latency, peak CPU RAM, and peak GPU VRAM on four public datasets: Adult-Income, Higgs-100k, Wine-Quality, and California-Housing. Two open FMs (TabPFN-1.0 and TabICL-base) are compared against tuned XGBoost, LightGBM, and Random Forest baselines on a single NVIDIA T4 GPU. The tree ensembles equal or surpass FM accuracy on three datasets while completing full-test batches in <= 0.40 s and <= 150 MB RAM, using zero VRAM. TabICL achieves a 0.8 percentage-point gain on Higgs but requires roughly 40,000 times more latency (960 s) and 9 GB VRAM. TabPFN matches tree-model accuracy on Wine and Housing but peaks at 4 GB VRAM and cannot process the full 100k-row Higgs table. These results quantify the substantial hardware-versus-accuracy trade-offs in current tabular FMs and provide an open baseline for future efficiency-oriented research.

[417] Difficulties with Evaluating a Deception Detector for AIs

Lewis Smith, Bilal Chughtai, Neel Nanda

Main category: cs.LG

TL;DR: Current deception detectors for AI systems lack reliable labeled examples for evaluation, with several obstacles preventing collection of such data.

DetailsMotivation: To mitigate risks from advanced AI systems by building reliable deception detectors that can predict when AI is being strategically deceptive without requiring behavioral evidence.

Method: Analysis through conceptual arguments, examination of existing empirical works, and novel illustrative case studies to identify obstacles in collecting labeled deception examples.

Result: Identified several concrete obstacles preventing collection of reliable labeled deception examples for AI systems, and found that proposed empirical workarounds are valuable but insufficient alone.

Conclusion: Progress on deception detection requires further consideration of the identified problems in collecting reliable labeled examples for evaluation.

Abstract: Building reliable deception detectors for AI systems – methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence – would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and further identify several concrete obstacles in collecting them. We provide evidence from conceptual arguments, analysis of existing empirical works, and analysis of novel illustrative case studies. We also discuss the potential of several proposed empirical workarounds to these problems and argue that while they seem valuable, they also seem insufficient alone. Progress on deception detection likely requires further consideration of these problems.

[418] GraphBench: Next-generation graph learning benchmarking

Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris

Main category: cs.LG

TL;DR: GraphBench is a comprehensive benchmarking suite for graph machine learning that standardizes evaluation across diverse domains and tasks to address fragmentation in current practices.

DetailsMotivation: Current graph ML benchmarking is fragmented with narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress in the field.

Method: Introduces GraphBench, a comprehensive benchmarking suite spanning diverse domains and prediction tasks (node-level, edge-level, graph-level, generative settings) with standardized evaluation protocols, consistent dataset splits, performance metrics accounting for OOD generalization, and a unified hyperparameter tuning framework.

Result: The paper benchmarks GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing reference performance metrics for the community.

Conclusion: GraphBench addresses the fragmentation in graph ML benchmarking by providing a standardized, comprehensive evaluation framework that should improve reproducibility and accelerate progress in the field.

Abstract: Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols – with consistent dataset splits and performance metrics that account for out-of-distribution generalization – as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.

[419] Reliable Statistical Guarantees for Conformal Predictors with Small Datasets

Miguel Sánchez-Domínguez, Lucas Lacasa, Javier de Vicente, Gonzalo Rubio, Eusebio Valero

Main category: cs.LG

TL;DR: This paper addresses limitations of standard conformal prediction for small calibration sets by proposing a new statistical guarantee that provides probabilistic information about coverage for individual predictors, with convergence to classic CP for large datasets.

DetailsMotivation: Standard conformal prediction offers marginal coverage guarantees that can be unreliable for small calibration set sizes, which are common in surrogate modeling applications. The dispersion of coverage distribution around its average for small datasets makes the statistical guarantee less applicable in practice.

Method: The authors propose a new statistical guarantee framework that provides probabilistic information about the coverage of a single conformal predictor. They develop methodology that converges to standard CP for large calibration sets but remains relevant for small data sizes, and implement an open-source software solution compatible with existing CP libraries.
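
For background, the classical split-conformal facts the paper starts from are easy to make concrete: the prediction-set threshold is an order statistic of the calibration scores, and the coverage conditional on a fixed calibration set follows a Beta distribution, which is exactly what disperses when n is small. The sketch below illustrates that standard theory, not the paper's new guarantee.

```python
import numpy as np
from scipy import stats

def split_conformal_qhat(cal_scores, alpha=0.1):
    """Standard split-CP threshold: the ceil((n+1)(1-alpha))/n empirical quantile
    of the calibration nonconformity scores."""
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q_level, 1.0), method="higher")

def coverage_probability(n, alpha=0.1, target=0.85):
    """Classical split-CP fact: coverage conditional on the calibration set is
    Beta(k, n+1-k) with k = ceil((n+1)(1-alpha)). Returns P(coverage >= target),
    the kind of probabilistic statement about a single predictor the paper argues
    is more informative than the marginal guarantee when n is small."""
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return 1.0 - stats.beta.cdf(target, k, n + 1 - k)

print(coverage_probability(n=30))     # small calibration set: appreciable undercoverage risk
print(coverage_probability(n=3000))   # large n: probability close to 1
```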

Result: The proposed framework is validated through a suite of examples, showing that it maintains relevant coverage information for small calibration sets while converging to standard CP guarantees for large datasets. The software implementation enables practical deployment of uncertainty models with the new guarantee.

Conclusion: The paper successfully bridges the gap in conformal prediction for small calibration sets by introducing a more informative statistical guarantee that remains applicable across different data sizes, enhancing the reliability of uncertainty quantification in safety-critical surrogate modeling applications.

Abstract: Surrogate models (including deep neural networks and other machine learning algorithms in supervised learning) are capable of approximating arbitrarily complex, high-dimensional input-output problems in science and engineering, but require a thorough data-agnostic uncertainty quantification analysis before they can be deployed for any safety-critical application. The standard approach for data-agnostic uncertainty quantification is to use conformal prediction (CP), a well-established framework to build uncertainty models with proven statistical guarantees that do not assume any shape for the error distribution of the surrogate model. However, the classic statistical guarantee offered by CP is given in terms of bounds for the marginal coverage. For small calibration set sizes, which are frequent in realistic surrogate modelling that aims to quantify error in different regions, the potentially strong dispersion of the coverage distribution around its average weakens the practical relevance of the uncertainty model's statistical guarantee: realized coverages often fall below the expected value, making the framework less applicable. After providing a gentle presentation of uncertainty quantification for surrogate models for machine learning practitioners, in this paper we bridge the gap by proposing a new statistical guarantee that offers probabilistic information about the coverage of a single conformal predictor. We show that the proposed framework converges to the standard solution offered by CP for large calibration set sizes and, unlike the classic guarantee, still offers relevant information about the coverage of a conformal predictor for small data sizes. We validate the methodology in a suite of examples, and implement an open-access software solution that can be used alongside common conformal prediction libraries to obtain uncertainty models that fulfil the new guarantee.

[420] IdealTSF: Can Non-Ideal Data Contribute to Enhancing the Performance of Time Series Forecasting Models?

Hua Wang, Jinghao Lu, Fan Zhang

Main category: cs.LG

TL;DR: IdealTSF framework leverages non-ideal negative samples to enhance time series forecasting by pretraining on negative data, transforming sequences into ideal positives, and applying adversarial optimization.

DetailsMotivation: Missing values and anomalies in time series data hinder deep learning performance. Previous approaches either extract features or treat suboptimal data as positive samples, but leveraging negative samples could better enhance event prediction.

Method: IdealTSF framework with three progressive steps: 1) Pretraining by extracting knowledge from negative sample data, 2) Training by transforming sequence data into ideal positive samples, and 3) Optimization with negative optimization mechanism using adversarial disturbances.

Result: Extensive experiments show negative sample data unlocks significant potential within basic attention architecture for time series forecasting, demonstrating the framework’s effectiveness.

Conclusion: IdealTSF is particularly well-suited for applications with noisy samples or low-quality data, as it effectively leverages both ideal positive and negative samples to improve forecasting performance.

Abstract: Deep learning has shown strong performance in time series forecasting tasks. However, issues such as missing values and anomalies in sequential data hinder its further development in prediction tasks. Previous research has primarily focused on extracting feature information from sequence data or addressing these suboptimal data as positive samples for knowledge transfer. A more effective approach would be to leverage these non-ideal negative samples to enhance event prediction. In response, this study highlights the advantages of non-ideal negative samples and proposes the IdealTSF framework, which integrates both ideal positive and negative samples for time series forecasting. IdealTSF consists of three progressive steps: pretraining, training, and optimization. It first pretrains the model by extracting knowledge from negative sample data, then transforms the sequence data into ideal positive samples during training. Additionally, a negative optimization mechanism with adversarial disturbances is applied. Extensive experiments demonstrate that negative sample data unlocks significant potential within the basic attention architecture for time series forecasting. Therefore, IdealTSF is particularly well-suited for applications with noisy samples or low-quality data.

[421] Learnability Window in Gated Recurrent Neural Networks

Lorenzo Livi

Main category: cs.LG

TL;DR: The paper develops a theoretical framework showing that gating mechanisms in RNNs determine learnability windows through effective learning rates, not just numerical stability, with sample complexity scaling inversely with these rates under heavy-tailed gradient noise.

DetailsMotivation: Classical analyses of RNN learnability focus on numerical stability of Jacobian products, but this is insufficient. The paper aims to understand what truly governs the learnability window - the maximum temporal horizon over which gradient information remains statistically recoverable in recurrent networks.

Method: The authors develop a theoretical framework analyzing gating mechanisms through effective learning rates μ_{t,ℓ} derived from first-order expansions of gate-induced Jacobian products in BPTT. They study how these rates act as multiplicative filters controlling gradient transport, and analyze sample complexity under heavy-tailed (α-stable) gradient noise.

Result: The minimal sample size to detect dependencies at lag ℓ scales as N(ℓ) ∝ f(ℓ)^{-α}, where f(ℓ) = ||μ_{t,ℓ}||₁ is the effective learning rate envelope. This yields explicit formulas for learnability window ℋ_N and scaling laws for different decay patterns (logarithmic, polynomial, exponential). Broader time-scale spectra enlarge learnability windows, while heavy-tailed noise compresses them.
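
A short worked instance, using only the stated scaling relation N(ℓ) ∝ f(ℓ)^{-α} with illustrative constants, shows where the logarithmic and polynomial window laws come from:

```latex
% Worked instance of the stated scaling law N(\ell) \propto f(\ell)^{-\alpha},
% assuming an exponentially decaying envelope f(\ell) = c\, e^{-\lambda \ell}
% (constants C, c, \lambda are illustrative).
\[
  N(\ell) \;=\; C\, f(\ell)^{-\alpha} \;=\; C c^{-\alpha} e^{\alpha\lambda\ell}
  \qquad\Longrightarrow\qquad
  \mathcal{H}_N \;=\; \max\{\ell : N(\ell) \le N\}
  \;=\; \left\lfloor \frac{1}{\alpha\lambda}\,\log\frac{N}{C c^{-\alpha}} \right\rfloor .
\]
% With exponential decay the learnability window grows only logarithmically in N;
% a polynomial envelope f(\ell) = c\,\ell^{-p} instead gives
% \mathcal{H}_N \propto N^{1/(\alpha p)}.
```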

Conclusion: Effective learning rates, not numerical stability alone, are the primary determinants of RNN learnability. These rates integrate gate-induced time-scale geometry with gradient noise and sample complexity, determining whether, when, and over what horizons recurrent networks can learn long-range temporal dependencies.

Abstract: We develop a theoretical framework that explains how gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks, defined as the largest temporal horizon over which gradient information remains statistically recoverable. While classical analyses emphasize numerical stability of Jacobian products, we show that stability alone is insufficient: learnability is governed instead by the \emph{effective learning rates} $\mu_{t,\ell}$, per-lag and per-neuron quantities obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. These effective learning rates act as multiplicative filters that control both the magnitude and anisotropy of gradient transport. Under heavy-tailed ($\alpha$-stable) gradient noise, we prove that the minimal sample size required to detect a dependency at lag $\ell$ satisfies $N(\ell)\propto f(\ell)^{-\alpha}$, where $f(\ell)=\|\mu_{t,\ell}\|_1$ is the effective learning rate envelope. This leads to an explicit formula for $\mathcal{H}_N$ and closed-form scaling laws for logarithmic, polynomial, and exponential decay of $f(\ell)$. The theory shows that the time-scale spectra induced by the effective learning rates are the dominant determinants of learnability. Broader or more heterogeneous spectra slow the decay of $f(\ell)$, enlarging the learnability window, while heavy-tailed noise compresses $\mathcal{H}_N$ by limiting statistical concentration. By integrating gate-induced time-scale geometry with gradient noise and sample complexity, the framework identifies the effective learning rates as the primary objects that determine whether, when, and over what horizons recurrent networks can learn long-range temporal dependencies.

[422] Unsupervised Learning of Density Estimates with Topological Optimization

Sunia Tanweer, Firas A. Khasawneh

Main category: cs.LG

TL;DR: Unsupervised bandwidth selection for kernel density estimation using topological data analysis loss function.

DetailsMotivation: Kernel density estimation requires tuning the crucial bandwidth hyperparameter, which controls bias-variance trade-off and affects topological features. Current methods lack automated, unsupervised approaches that consider topological characteristics, especially in high dimensions where visualization is impossible.

Method: Propose an unsupervised learning approach using a topology-based loss function for automated bandwidth selection. Uses topological data analysis to mathematically quantify topological characteristics (connected components, loops, voids) even in high dimensions.

Result: Benchmarked against classical techniques and demonstrated potential across different dimensions.

Conclusion: Topology-based loss function enables automated, unsupervised selection of optimal bandwidth for kernel density estimation, outperforming classical methods across various dimensions.

Abstract: Kernel density estimation is a key component of a wide variety of algorithms in machine learning, Bayesian inference, stochastic dynamics and signal processing. However, the unsupervised density estimation technique requires tuning a crucial hyperparameter: the kernel bandwidth. The choice of bandwidth is critical as it controls the bias-variance trade-off by over- or under-smoothing the topological features. Topological data analysis provides methods to mathematically quantify topological characteristics, such as connected components, loops, voids et cetera, even in high dimensions where visualization of density estimates is impossible. In this paper, we propose an unsupervised learning approach using a topology-based loss function for the automated and unsupervised selection of the optimal bandwidth and benchmark it against classical techniques – demonstrating its potential across different dimensions.

[423] CFLight: Enhancing Safety with Traffic Signal Control through Counterfactual Learning

Mingyuan Li, Chunyu Liu, Zhuojun Li, Xiao Liu, Guangsheng Yu, Bo Du, Jun Shen, Qiang Wu

Main category: cs.LG

TL;DR: CFLight is a novel reinforcement learning framework for traffic signal control that uses counterfactual learning to improve intersection safety while maintaining traffic efficiency, reducing collisions through near-zero collision control.

DetailsMotivation: Current RL-based traffic signal control methods prioritize efficiency over safety and lack interpretability, failing to address the critical balance needed to reduce intersection accidents that cause millions of injuries and fatalities globally.

Method: Proposes a counterfactual learning framework with a structural causal model to predict outcomes of alternative actions, asking “what if” scenarios when unsafe events occur. Integrates CF module with other modules to promote safe RL practices.

Result: CFLight reduces collisions and improves overall traffic performance compared to conventional RL methods and recent safe RL models, demonstrated through extensive experiments on real-world and synthetic datasets.

Conclusion: The framework provides a generalized safe RL approach for traffic signal control that effectively balances safety and efficiency, with potential applications in other domains beyond traffic management.

Abstract: Traffic accidents result in millions of injuries and fatalities globally, with a significant number occurring at intersections each year. Traffic Signal Control (TSC) is an effective strategy for enhancing safety at these urban junctures. Despite the growing popularity of Reinforcement Learning (RL) methods in optimizing TSC, these methods often prioritize driving efficiency over safety, thus failing to address the critical balance between these two aspects. Additionally, these methods usually lack interpretability. CounterFactual (CF) learning is a promising approach for various causal analysis fields. In this study, we introduce a novel framework to improve RL for safety aspects in TSC. This framework introduces a novel method based on CF learning to address the question: "What if, when an unsafe event occurs, we backtrack to perform alternative actions; will this unsafe event still occur in the subsequent period?" To answer this question, we propose a new structural causal model to predict the result after executing different actions, and we propose a new CF module that integrates with additional "X" modules to promote safe RL practices. Our new algorithm, CFLight, which is derived from this framework, effectively tackles challenging safety events and significantly improves safety at intersections through a near-zero collision control strategy. Through extensive numerical experiments on both real-world and synthetic datasets, we demonstrate that CFLight reduces collisions and improves overall traffic performance compared to conventional RL methods and recent safe RL models. Moreover, our method represents a generalized and safe framework for RL methods, opening possibilities for applications in other domains. The data and code are available on GitHub at https://github.com/AdvancedAI-ComplexSystem/SmartCity/tree/main/CFLight.

[424] Supporting Migration Policies with Forecasts: Illegal Border Crossings in Europe through a Mixed Approach

C. Bosco, U. Minora, D. de Rigo, J. Pingsdorf, R. Cortinovis

Main category: cs.LG

TL;DR: A mixed-methodology combining machine learning with expert qualitative insights to forecast illegal border crossings in Europe across five migratory routes with one-year horizon, addressing EU migration policy needs.

DetailsMotivation: To address challenges posed by sudden shifts in migration patterns and limitations in traditional datasets, while responding to forecasting needs outlined in the EU Pact on Migration and Asylum and supporting the Asylum and Migration Management Regulation (AMMR).

Method: Integrates machine learning techniques with qualitative insights from migration experts, including a human-assessed covariate to improve predictive capacity of data-driven models.

Result: Methodology is tested and validated with known data to demonstrate its applicability and reliability in migration-related policy context, providing policy-relevant forecasts.

Conclusion: The approach introduces a novel operational tool for EU migration governance that aligns with academic recommendations by combining data-driven modeling with expert judgment to inform strategic decisions, early warning systems, and solidarity mechanisms.

Abstract: This paper presents a mixed-methodology to forecast illegal border crossings in Europe across five key migratory routes, with a one-year time horizon. The methodology integrates machine learning techniques with qualitative insights from migration experts. This approach aims at improving the predictive capacity of data-driven models through the inclusion of a human-assessed covariate, an innovation that addresses challenges posed by sudden shifts in migration patterns and limitations in traditional datasets. The proposed methodology responds directly to the forecasting needs outlined in the EU Pact on Migration and Asylum, supporting the Asylum and Migration Management Regulation (AMMR). It is designed to provide policy-relevant forecasts that inform strategic decisions, early warning systems, and solidarity mechanisms among EU Member States. By joining data-driven modeling with expert judgment, this work aligns with existing academic recommendations and introduces a novel operational tool tailored for EU migration governance. The methodology is tested and validated with known data to demonstrate its applicability and reliability in migration-related policy context.

[425] Features Emerge as Discrete States: The First Application of SAEs to 3D Representations

Albert Miao, Chenliang Zhou, Jiawei Zhou, Cengiz Oztireli

Main category: cs.LG

TL;DR: First application of Sparse Autoencoders to 3D domain reveals discrete feature encoding in 3D reconstruction VAEs, with phase-like transitions explaining several unintuitive model behaviors.

DetailsMotivation: Sparse Autoencoders have been powerful for dictionary learning in text but rarely applied outside textual domain, limiting theoretical explorations of feature decomposition. The authors want to extend SAEs to 3D to analyze feature decomposition in 3D reconstruction models.

Method: Applied SAEs to analyze features in a state-of-the-art 3D reconstruction VAE trained on 53k 3D models from Objaverse dataset. Used state transition framework to understand feature activation patterns.
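
The SAE itself is the standard dictionary-learning architecture: an overcomplete ReLU encoder, a linear decoder, and an L1 sparsity penalty on the feature activations, trained on activations captured from the model under study. Dimensions below are placeholders, not those of the 3D reconstruction VAE.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard SAE used for dictionary learning on activations: an overcomplete
    ReLU encoder and a linear decoder, trained with reconstruction + L1 sparsity."""

    def __init__(self, d_act=512, d_dict=4096):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act)

    def forward(self, a):                         # a: (batch, d_act) captured activations
        f = torch.relu(self.encoder(a))           # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                       # stand-in for captured VAE activations
recon, feats = sae(acts)
l1_coeff = 1e-3
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
```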

Result: Found that 3D reconstruction models encode discrete rather than continuous features, approximating a discrete state space with phase-like transitions. This explains three unintuitive behaviors: preference for positional encoding representations, sigmoidal reconstruction loss from feature ablation, and bimodal distribution of phase transition points.

Conclusion: The work provides the first SAE application to 3D domain, revealing discrete feature encoding and phase transition dynamics that explain model behaviors. Offers a framework for understanding feature learning dynamics in 3D models and explains previously unintuitive phenomena.

Abstract: Sparse Autoencoders (SAEs) are a powerful dictionary learning technique for decomposing neural network activations, translating the hidden state into human ideas with high semantic value despite no external intervention or guidance. However, this technique has rarely been applied outside of the textual domain, limiting theoretical explorations of feature decomposition. We present the first application of SAEs to the 3D domain, analyzing the features used by a state-of-the-art 3D reconstruction VAE applied to 53k 3D models from the Objaverse dataset. We observe that the network encodes discrete rather than continuous features, leading to our key finding: such models approximate a discrete state space, driven by phase-like transitions from feature activations. Through this state transition framework, we address three otherwise unintuitive behaviors - the inclination of the reconstruction model towards positional encoding representations, the sigmoidal behavior of reconstruction loss from feature ablation, and the bimodality in the distribution of phase transition points. This final observation suggests the model redistributes the interference caused by superposition to prioritize the saliency of different features. Our work not only compiles and explains unexpected phenomena regarding feature decomposition, but also provides a framework to explain the model’s feature learning dynamics. The code and dataset of encoded 3D objects will be available on release.

[426] Pace: Physics-Aware Attentive Temporal Convolutional Network for Battery Health Estimation

Sara Sameer, Wei Zhang, Dhivya Dharshini Kannan, Xin Lou, Yulin Gao, Terence Goh, Qingyu Yan

Main category: cs.LG

TL;DR: Pace is a physics-aware attentive temporal convolutional network that integrates sensor data with battery physics for accurate health estimation, outperforming baselines by 6.5% and achieving 2x speedup with real-time edge deployment.

DetailsMotivation: Batteries are critical for modern energy systems (EVs, grid storage), requiring effective health management for safety, cost-efficiency, and sustainability. Current methods need improvement in accuracy and practical deployment.

Method: Pace integrates raw sensor measurements with battery physics features from equivalent circuit model. Uses three specialized modules: dilated temporal blocks for temporal encoding, chunked attention blocks for context modeling, and dual-head output block to fuse short- and long-term degradation patterns.
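
A sketch of what a dilated temporal block typically looks like: stacked causal dilated 1-D convolutions with residual connections, giving a long receptive field at low cost, which suits edge deployment. The channel counts and dilation schedule are assumptions, and Pace's chunked attention and dual-head output are not shown.

```python
import torch
import torch.nn as nn

class DilatedTemporalBlock(nn.Module):
    """Sketch: stacked causal dilated convolutions with exponentially growing dilation
    and residual connections, for efficient temporal encoding of sensor streams."""

    def __init__(self, channels=32, kernel=3, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel, dilation=d, padding=d * (kernel - 1))
            for d in dilations
        )

    def forward(self, x):                     # x: (batch, channels, T)
        T = x.size(-1)
        for conv in self.convs:
            y = conv(x)[..., :T]              # trim the right side to keep the conv causal
            x = torch.relu(y) + x             # residual connection
        return x

out = DilatedTemporalBlock()(torch.randn(2, 32, 128))
```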

Result: Outperforms existing models on large public dataset with 6.5% average performance improvement and 2.0x speedup compared to two best-performing baselines. Successfully deployed in real-time on Raspberry Pi edge device.

Conclusion: Pace establishes a practical, high-performance solution for battery health analytics that works accurately across various battery usage conditions and is viable for real-time edge deployment.

Abstract: Batteries are critical components in modern energy systems such as electric vehicles and power grid energy storage. Effective battery health management is essential for battery system safety, cost-efficiency, and sustainability. In this paper, we propose Pace, a physics-aware attentive temporal convolutional network for battery health estimation. Pace integrates raw sensor measurements with battery physics features derived from the equivalent circuit model. We develop three battery-specific modules, including dilated temporal blocks for efficient temporal encoding, chunked attention blocks for context modeling, and a dual-head output block for fusing short- and long-term battery degradation patterns. Together, the modules enable Pace to predict battery health accurately and efficiently in various battery usage conditions. In a large public dataset, Pace performs much better than existing models, achieving an average performance improvement of 6.5% and a 2.0x speedup compared to the two best-performing baseline models. We further demonstrate its practical viability with a real-time edge deployment on a Raspberry Pi. These results establish Pace as a practical and high-performance solution for battery health analytics.
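
The dilated temporal blocks are not specified beyond the abstract; a minimal sketch of one such causal dilated convolution block is given below, where the channel count, kernel size, dilation, and residual connection are assumptions for illustration rather than Pace's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedTemporalBlock(nn.Module):
    """Causal dilated 1D convolution with a residual connection.
    Channel count, kernel size, and dilation are illustrative, not Pace's values."""
    def __init__(self, channels: int = 32, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation     # pad on the left to stay causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. sensor readings plus ECM-derived physics features
        y = self.conv(F.pad(x, (self.left_pad, 0)))
        return x + self.act(y)                           # residual connection

x = torch.randn(8, 32, 128)
print(DilatedTemporalBlock()(x).shape)   # torch.Size([8, 32, 128])
```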

[427] On the Design of One-step Diffusion via Shortcutting Flow Paths

Haitao Lin, Peiyan Hu, Minsi Ren, Zhifeng Gao, Zhi-Ming Ma, Guolin ke, Tailin Wu, Stan Z. Li

Main category: cs.LG

TL;DR: The paper proposes a unified design framework for shortcut diffusion models that enables systematic component-level improvements, achieving state-of-the-art FID scores on ImageNet-256x256 with one-step generation.

DetailsMotivation: Current few-step diffusion models have theoretical derivation and practical implementation closely coupled, which obscures the design space and makes it difficult to systematically identify improvements.

Method: Proposes a common design framework for representative shortcut models that provides theoretical justification and disentangles concrete component-level choices, enabling systematic identification of improvements.

Result: Achieves new state-of-the-art FID50k of 2.85 on ImageNet-256x256 with one-step generation under classifier-free guidance, and further reaches 2.52 with 2x training steps, without requiring pre-training, distillation, or curriculum learning.

Conclusion: The proposed framework lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.

Abstract: Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (\emph{a.k.a.} shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting with one step generation, and further reaches FID50k of 2.52 with 2x training steps. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.

[428] Evaluating Adversarial Attacks on Federated Learning for Temperature Forecasting

Karina Chichifoi, Fabio Merizzi, Michele Colajanni

Main category: cs.LG

TL;DR: Federated learning for weather forecasting is vulnerable to data poisoning attacks, where compromised clients can significantly distort temperature predictions across large regions, with patch attacks being particularly effective and resistant to trimmed mean defenses.

DetailsMotivation: While federated learning enables collaborative weather forecasting without sharing raw data, its distributed nature creates vulnerabilities to data poisoning attacks that can degrade performance or introduce systematic biases, especially given the spatial dependencies in meteorological data.

Method: Simulated geographically distributed clients using the Copernicus European Regional ReAnalysis (CERRA) dataset, evaluated patch-based and global biasing attacks on regional temperature forecasts, and assessed trimmed mean aggregation as a defense mechanism.

Result: Even a small fraction of poisoned clients can mislead predictions across large areas: single-client global bias attacks shift predictions by up to -1.7K, while coordinated patch attacks triple MSE and produce persistent regional anomalies exceeding +3.5K. Trimmed mean defends against global bias (2-13% degradation) but fails against patch attacks (281-603% amplification).

Conclusion: Federated weather forecasting is highly vulnerable to data poisoning attacks, particularly patch attacks that exploit spatial correlations, and current outlier-based defenses like trimmed mean aggregation are insufficient for spatially correlated meteorological data, highlighting the need for more robust security mechanisms.

Abstract: Deep learning and federated learning (FL) are becoming powerful partners for next-generation weather forecasting. Deep learning enables high-resolution spatiotemporal forecasts that can surpass traditional numerical models, while FL allows institutions in different locations to collaboratively train models without sharing raw data, addressing efficiency and security concerns. While FL has shown promise across heterogeneous regions, its distributed nature introduces new vulnerabilities. In particular, data poisoning attacks, in which compromised clients inject manipulated training data, can degrade performance or introduce systematic biases. These threats are amplified by spatial dependencies in meteorological data, allowing localized perturbations to influence broader regions through global model aggregation. In this study, we investigate how adversarial clients distort federated surface temperature forecasts trained on the Copernicus European Regional ReAnalysis (CERRA) dataset. We simulate geographically distributed clients and evaluate patch-based and global biasing attacks on regional temperature forecasts. Our results show that even a small fraction of poisoned clients can mislead predictions across large, spatially connected areas. A global temperature bias attack from a single compromised client shifts predictions by up to -1.7 K, while coordinated patch attacks more than triple the mean squared error and produce persistent regional anomalies exceeding +3.5 K. Finally, we assess trimmed mean aggregation as a defense mechanism, showing that it successfully defends against global bias attacks (2-13% degradation) but fails against patch attacks (281-603% amplification), exposing limitations of outlier-based defenses for spatially correlated data.
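
The trimmed mean defense evaluated in the paper is a standard coordinate-wise robust aggregator; the small sketch below (the 10% trim ratio and toy data are illustrative) shows the basic mechanism, which is effective when a poisoned client shifts every coordinate but offers no special protection for perturbations confined to a small, spatially localized subset of coordinates.

```python
import numpy as np

def trimmed_mean_aggregate(client_updates: np.ndarray, trim_ratio: float = 0.1) -> np.ndarray:
    """Coordinate-wise trimmed mean over client model updates.

    client_updates: (n_clients, n_params) array of per-client parameter updates.
    trim_ratio: fraction of extreme values removed at each end, per coordinate.
    The trim ratio is illustrative; the paper's exact setting is not stated here.
    """
    n_clients = client_updates.shape[0]
    k = int(np.floor(trim_ratio * n_clients))
    sorted_updates = np.sort(client_updates, axis=0)     # sort each coordinate independently
    kept = sorted_updates[k:n_clients - k]               # drop the k smallest and k largest values
    return kept.mean(axis=0)

# Example: 10 clients, one of them pushing a large negative bias into every coordinate.
updates = np.random.normal(0.0, 0.1, size=(10, 5))
updates[0] -= 1.7                                        # poisoned client
print(trimmed_mean_aggregate(updates, trim_ratio=0.1))
```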

[429] KD-PINN: Knowledge-Distilled PINNs for ultra-low-latency real-time neural PDE solvers

Karim Bounja, Lahcen Laayouni, Abdeljalil Sakat

Main category: cs.LG

TL;DR: KD-PINN framework transfers accuracy from high-capacity teacher to compact student using KL divergence adaptation, achieving 4.8-6.9x speedup while preserving accuracy, enabling ultra-low-latency real-time PDE solving.

DetailsMotivation: To develop accurate ultra-low-latency neural PDE solvers by reducing inference latency in Physics-Informed Neural Networks (PINNs) while maintaining predictive accuracy.

Method: Knowledge distillation framework that transfers knowledge from a high-capacity teacher PINN to a compact student model through continuous adaptation of Kullback-Leibler divergence, evaluated on various PDE benchmarks.

Result: Student models achieve 4.8x (Navier-Stokes) to 6.9x (Burgers) inference speedups with preserved accuracy, ~1% accuracy improvement when tuned, average 5.3 ms inference latency on CPU (sub-10 ms ultra-low-latency regime), and show regularizing effects.

Conclusion: Knowledge distillation effectively reduces inference latency in PINNs while maintaining accuracy, enabling development of accurate ultra-low-latency real-time neural PDE solvers for various dynamics and dimensionalities.

Abstract: This work introduces Knowledge-Distilled Physics-Informed Neural Networks (KD-PINN), a framework that transfers the predictive accuracy of a high-capacity teacher model to a compact student through a continuous adaptation of the Kullback-Leibler divergence. In order to confirm its generality for various dynamics and dimensionalities, the framework is evaluated on a representative set of partial differential equations (PDEs). Across the considered benchmarks, the student model achieves inference speedups ranging from x4.8 (Navier-Stokes) to x6.9 (Burgers), while preserving accuracy. Accuracy is improved by on the order of 1% when the model is properly tuned. The distillation process also revealed a regularizing effect. With an average inference latency of 5.3 ms on CPU, the distilled models enter the ultra-low-latency real-time regime defined by sub-10 ms performance. Finally, this study examines how knowledge distillation reduces inference latency in PINNs, to contribute to the development of accurate ultra-low-latency neural PDE solvers.
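
The exact form of the "continuous adaptation of the Kullback-Leibler divergence" is not spelled out in the abstract; one plausible reading, treating teacher and student predictions as fixed-variance Gaussians so the KL term collapses to a scaled squared error, is sketched below. The Gaussian assumption, the weights, and the function name are illustrative, not the paper's formulation.

```python
import torch

def kd_pinn_loss(student_u: torch.Tensor,
                 teacher_u: torch.Tensor,
                 physics_residual: torch.Tensor,
                 alpha: float = 0.5,
                 sigma: float = 0.1) -> torch.Tensor:
    """Sketch of a distillation objective for a compact student PINN.

    Teacher and student predictions are treated as Gaussians with a shared fixed
    variance sigma**2, so the KL term reduces to a scaled squared error.
    alpha and sigma are illustrative choices.
    """
    kl_term = ((student_u - teacher_u) ** 2).mean() / (2.0 * sigma ** 2)
    pde_term = (physics_residual ** 2).mean()            # standard PINN residual penalty
    return alpha * kl_term + (1.0 - alpha) * pde_term

# Dummy usage with random tensors standing in for PDE predictions and residuals.
student_u = torch.randn(128, 1)
teacher_u = student_u + 0.05 * torch.randn(128, 1)
residual = 0.01 * torch.randn(128, 1)
print(kd_pinn_loss(student_u, teacher_u, residual).item())
```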

[430] Scalable Formal Verification via Autoencoder Latent Space Abstraction

Robert Reed, Luca Laurenti, Morteza Lahijanian

Main category: cs.LG

TL;DR: The paper presents a formal approach using convex autoencoders and kernel-based methods to reduce dimensionality for scalable finite abstraction verification while maintaining correctness guarantees.

DetailsMotivation: Finite abstraction methods face scalability challenges for high-dimensional systems due to exponential state-space growth. Learning-based dimensionality reduction approaches show promise but lack formal correctness guarantees for verification results.

Method: Uses convex autoencoders for dimensionality reduction, learns dynamics in latent space via kernel-based methods, constructs finite abstraction from learned latent model, and guarantees abstraction contains true system behaviors.

Result: Demonstrates effectiveness on multiple systems including a 26D neural network-controlled system, showing significant scalability improvements without loss of verification rigor.

Conclusion: Provides a formal framework that combines learning-based dimensionality reduction with rigorous verification guarantees, enabling scalable verification of high-dimensional systems while maintaining correctness.

Abstract: Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.

cs.MA

[431] Multi-Agent Collaborative Framework for Intelligent IT Operations: An AOI System with Context-Aware Compression and Dynamic Task Scheduling

Zishan Bai, Enze Ge, Junfeng Hao

Main category: cs.MA

TL;DR: AOI is a multi-agent framework with LLM-based context compression that reduces information overload in cloud operations by 72.4% while maintaining 92.8% critical info, achieving 94.2% task success rate and 34.4% MTTR reduction.

DetailsMotivation: Cloud-native architectures create overwhelming operational data complexity, causing inefficient processing, poor task coordination, and context loss during fault diagnosis in modern IT infrastructures.

Method: AOI framework integrates three specialized agents with LLM-based Context Compressor, featuring dynamic task scheduling and three-layer memory architecture (Working, Episodic, Semantic) for optimized context retention.

Result: Achieves 72.4% context compression ratio while preserving 92.8% critical information, 94.2% task success rate, and 34.4% reduction in Mean Time to Repair compared to best baseline.

Conclusion: Presents paradigm shift toward scalable, adaptive, context-aware autonomous operations for next-generation IT infrastructures with minimal human intervention.

Abstract: The proliferation of cloud-native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI-Oriented Operations), a novel multi-agent collaborative framework that integrates three specialized agents with an LLM-based Context Compressor. Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real-time system states, and (2) a three-layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AOI effectively mitigates information overload, achieving a 72.4% context compression ratio while preserving 92.8% of critical information and significantly enhances operational efficiency, attaining a 94.2% task success rate and reducing the Mean Time to Repair (MTTR) by 34.4% compared to the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context-aware autonomous operations, enabling robust management of next-generation IT infrastructures with minimal human intervention.
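
The Working/Episodic/Semantic memory split can be pictured with a small data structure; the sketch below is one possible reading, with capacities, the promotion rule, and method names invented for illustration rather than taken from AOI.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ThreeLayerMemory:
    """Illustrative reading of a Working / Episodic / Semantic memory split.
    Capacities and the promotion rule are assumptions, not details from the paper."""
    working: deque = field(default_factory=lambda: deque(maxlen=20))  # recent turns and alerts
    episodic: list = field(default_factory=list)                      # compressed incident episodes
    semantic: dict = field(default_factory=dict)                      # long-lived facts about the system

    def observe(self, event: str) -> None:
        self.working.append(event)

    def close_incident(self, summary: str) -> None:
        # An LLM-based context compressor would produce 'summary' from the working layer.
        self.episodic.append(summary)
        self.working.clear()

    def learn_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

mem = ThreeLayerMemory()
mem.observe("latency spike on service A")
mem.close_incident("Service A latency traced to a cache eviction storm")
mem.learn_fact("service A", "depends on redis cluster r-3")
print(len(mem.episodic), mem.semantic)
```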

[432] Multi-Agent Medical Decision Consensus Matrix System: An Intelligent Collaborative Framework for Oncology MDT Consultations

Xudong Han, Xianglun Gao, Xiaoyi Qu, Zhenyu Yu

Main category: cs.MA

TL;DR: A multi-agent LLM system simulates multidisciplinary cancer care teams with specialized agents, uses consensus matrices and reinforcement learning to improve decision quality, and achieves high accuracy and consensus rates on medical benchmarks.

DetailsMotivation: Current multidisciplinary team (MDT) consultations for cancer care lack structured mechanisms for quantifying consensus and ensuring decision traceability, creating a need for systematic approaches to improve clinical decision-making quality and efficiency.

Method: Developed a Multi-Agent Medical Decision Consensus Matrix System with seven specialized LLM agents (oncologist, radiologist, nurse, psychologist, patient advocate, nutritionist, rehabilitation therapist) simulating MDT workflows. Integrated a mathematically grounded consensus matrix using Kendall’s coefficient of concordance and reinforcement learning methods (Q-Learning, PPO, DQN) to enhance treatment recommendations and consensus efficiency.

Result: Achieved 87.5% average accuracy across five medical benchmarks (MedQA, PubMedQA, DDXPlus, MedBullets, SymCat) vs. 83.8% for strongest baseline, with 89.3% consensus achievement rate, mean Kendall’s W of 0.823, and expert clinical appropriateness rating of 8.9/10. System provides full evidence traceability through mandatory citations following GRADE principles.

Conclusion: This work advances medical AI by providing structured consensus measurement, role-specialized multi-agent collaboration, and evidence-based explainability to improve the quality and efficiency of clinical decision-making in cancer care.

Abstract: Multidisciplinary team (MDT) consultations are the gold standard for cancer care decision-making, yet current practice lacks structured mechanisms for quantifying consensus and ensuring decision traceability. We introduce a Multi-Agent Medical Decision Consensus Matrix System that deploys seven specialized large language model agents, including an oncologist, a radiologist, a nurse, a psychologist, a patient advocate, a nutritionist and a rehabilitation therapist, to simulate realistic MDT workflows. The framework incorporates a mathematically grounded consensus matrix that uses Kendall’s coefficient of concordance to objectively assess agreement. To further enhance treatment recommendation quality and consensus efficiency, the system integrates reinforcement learning methods, including Q-Learning, PPO and DQN. Evaluation across five medical benchmarks (MedQA, PubMedQA, DDXPlus, MedBullets and SymCat) shows substantial gains over existing approaches, achieving an average accuracy of 87.5% compared with 83.8% for the strongest baseline, a consensus achievement rate of 89.3% and a mean Kendall’s W of 0.823. Expert reviewers rated the clinical appropriateness of system outputs at 8.9/10. The system guarantees full evidence traceability through mandatory citations of clinical guidelines and peer-reviewed literature, following GRADE principles. This work advances medical AI by providing structured consensus measurement, role-specialized multi-agent collaboration and evidence-based explainability to improve the quality and efficiency of clinical decision-making.
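
Kendall's coefficient of concordance, the agreement measure at the core of the consensus matrix, has a simple closed form for m raters ranking n items without ties: W = 12S / (m^2 (n^3 - n)), where S is the squared deviation of the per-item rank sums from their mean. A small sketch with hypothetical agent rankings:

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance W (no tie correction).

    ranks: (m_raters, n_items) matrix of ranks assigned by each rater/agent.
    """
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Seven agents ranking four candidate treatment plans (1 = most preferred); values are hypothetical.
agent_ranks = np.array([
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
    [1, 2, 4, 3],
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 2, 3, 4],
])
print(round(kendalls_w(agent_ranks), 3))   # ~0.80 here; values near 1.0 indicate strong consensus
```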

[433] Optimizing Highway Traffic Flow in Mixed Autonomy: A Multiagent Truncated Rollout Approach

Lu Liu, Chi Xie, Xi Xiong

Main category: cs.MA

TL;DR: Multiagent truncated rollout approach for CAV speed coordination in mixed autonomy traffic to improve highway throughput while reducing computational overhead.

DetailsMotivation: In mixed autonomy environments where CAVs coexist with human-driven vehicles, achieving efficient coordination among CAVs is challenging due to heterogeneous driving behaviors, requiring solutions that balance performance with computational efficiency.

Method: Proposes a multiagent truncated rollout approach with: 1) traffic density evolution equation accounting for CAV presence, 2) distributed coordination control framework, 3) agent-by-agent sequential solution mechanism using neighbor kinematic information, and 4) adaptive truncated rollout scheme that shortens optimization horizon based on control sequence evaluation.

Result: Theoretical analysis provides stability and performance guarantees. Simulations on real-world bottleneck scenarios show the method outperforms conventional MPC by reducing both average travel time in bottleneck areas and overall computational time in large-scale mixed traffic flows.

Conclusion: The proposed truncated rollout approach enables explicit cooperation among CAVs while significantly reducing computational complexity, demonstrating strong potential for practical deployment in mixed autonomy traffic systems.

Abstract: The development of connected and autonomous vehicles (CAVs) offers substantial opportunities to enhance traffic efficiency. However, in mixed autonomy environments where CAVs coexist with human-driven vehicles (HDVs), achieving efficient coordination among CAVs remains challenging due to heterogeneous driving behaviors. To address this, this paper proposes a multiagent truncated rollout approach that enhances CAV speed coordination to improve highway throughput while reducing computational overhead. In this approach, a traffic density evolution equation is formulated that comprehensively accounts for the presence or absence of CAVs, and a distributed coordination control framework is established accordingly. By incorporating kinematic information from neighbor agents and employing an agent-by-agent sequential solution mechanism, our method enables explicit cooperation among CAVs. Furthermore, we introduce a truncated rollout scheme that adaptively shortens the optimization horizon based on the evaluation of control sequences. This significantly reduces the time complexity, thereby improving real-time performance and scalability. Theoretical analysis provides rigorous guarantees on the stability and performance improvement of the system. Simulations conducted on real-world bottleneck scenarios demonstrate that, in large-scale mixed traffic flows, the proposed method outperforms conventional model predictive control methods by reducing both the average travel time in the bottleneck area and overall computational time, highlighting its strong potential for practical deployment.

[434] Bayesian Ego-graph Inference for Networked Multi-Agent Reinforcement Learning

Wei Duan, Jie Lu, Junyu Xuan

Main category: cs.MA

TL;DR: BayesG: A decentralized MARL framework that learns dynamic communication graphs via Bayesian variational inference for scalable multi-agent coordination under local observability.

DetailsMotivation: Existing Networked-MARL methods assume static communication neighborhoods, which limits adaptability to dynamic/heterogeneous environments. Centralized approaches that learn dynamic graphs require global state access and centralized infrastructure, making them impractical for real-world decentralized systems.

Method: Proposes a stochastic graph-based policy where each agent conditions decisions on sampled subgraphs over local physical neighborhoods. Introduces BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples latent communication masks to guide message passing and policy computation, trained end-to-end with ELBO objective.

Result: BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.

Conclusion: BayesG enables agents to jointly learn both interaction topology and decision-making strategies in a decentralized manner, addressing limitations of static neighborhood assumptions and centralized approaches in Networked-MARL.

Abstract: In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.

[435] Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives

Romain Cosentino, Sarath Shekkizhar, Adam Earle

Main category: cs.MA

TL;DR: Theoretical analysis of agent interactions in linear regression shows misaligned objectives cause biased equilibria, while adaptive objectives enable faster convergence.

DetailsMotivation: To develop a theoretical framework for understanding agent-to-agent interactions in simplified in-context learning settings, linking prompt geometry and objective misalignment to stability and bias in multi-agent systems.

Method: Uses single-layer transformers with linear self-attention trained for gradient-descent-like updates on quadratic regression. Studies coupled dynamics of two agents alternately updating from each other’s outputs under misaligned fixed objectives, and contrasts with adaptive multi-agent settings.

Result: Misalignment leads to biased equilibrium where neither agent reaches its target, with predictable residual errors. Adaptive objectives (helper agent implementing Newton-like steps) eliminate plateaus and accelerate convergence. Experiments with trained LSA agents and GPT-5-mini confirm theoretical predictions.

Conclusion: Provides a mechanistic framework connecting prompt geometry and objective misalignment to stability, bias, and robustness in multi-agent systems, serving as a stepping stone for analyzing more realistic LLM interactions.

Abstract: We develop and analyze a theoretical framework for agent-to-agent interactions in a simplified in-context linear regression setting. In our model, each agent is instantiated as a single-layer transformer with linear self-attention (LSA) trained to implement gradient-descent-like updates on a quadratic regression objective from in-context examples. We then study the coupled dynamics when two such LSA agents alternately update from each other’s outputs under potentially misaligned fixed objectives. Within this framework, we characterize the generation dynamics and show that misalignment leads to a biased equilibrium where neither agent reaches its target, with residual errors predictable from the objective gap and the prompt-induced geometry. We further contrast this fixed objective regime with an adaptive multi-agent setting, wherein a helper agent updates a turn-based objective to implement a Newton-like step for the main agent, eliminating the plateau and accelerating its convergence. Experiments with trained LSA agents, as well as black-box GPT-5-mini runs on in-context linear regression tasks, are consistent with our theoretical predictions within this simplified setting. We view our framework as a mechanistic framework that links prompt geometry and objective misalignment to stability, bias, and robustness, and as a stepping stone toward analyzing more realistic multi-agent LLM systems.
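
The core mechanism, a linear self-attention layer emulating one gradient step on an in-context least-squares objective, can be illustrated directly with the explicit update it is trained to mimic. In the alternating loop below, the learning rate, data, and the two misaligned targets are made up, but the qualitative outcome matches the paper's claim: the shared estimate settles between the two objectives.

```python
import numpy as np

def in_context_gd_step(X: np.ndarray, y: np.ndarray, w: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One gradient step on the in-context least-squares objective
    L(w) = 0.5 * ||X w - y||^2, i.e. the update a trained linear self-attention
    layer is shown to emulate in this line of work. The learning rate is illustrative."""
    grad = X.T @ (X @ w - y)
    return w - lr * grad

# Two agents alternately refining a shared estimate toward misaligned targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                               # in-context inputs
w_star_a = np.array([1.0, -2.0, 0.5, 0.0])                 # agent A's target weights
w_star_b = w_star_a + np.array([0.3, 0.0, 0.0, -0.3])      # agent B's misaligned target
w = np.zeros(4)
for _ in range(200):
    w = in_context_gd_step(X, X @ w_star_a, w)             # agent A's turn
    w = in_context_gd_step(X, X @ w_star_b, w)             # agent B's turn
print(w)   # settles between the two targets: neither agent reaches its own objective
```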

[436] Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, Yu Tian

Main category: cs.MA

TL;DR: LLM-based agents show stronger skepticism to erroneous messages than traditional agents, enabling better Byzantine fault tolerance. The paper proposes CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism that leverages LLMs’ reflective capabilities to enhance multi-agent system reliability.

DetailsMotivation: While LLM-based agents have advanced multi-agent systems for complex problem solving, their reliability implications remain unexplored. The paper investigates whether substituting traditional agents with LLM-based agents can enhance MAS reliability, particularly from a Byzantine fault tolerance perspective.

Method: The authors first conduct pilot experiments showing LLM-based agents demonstrate stronger skepticism when processing erroneous message flows. Based on this observation, they design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism that uses probe-based, weighted information flow transmission to leverage LLMs’ intrinsic reflective and discriminative capabilities.

Result: Extensive experiments show CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). The approach surpasses traditional methods with remarkable accuracy on various topologies and maintains strong reliability in both mathematical reasoning and safety assessment tasks.

Conclusion: LLM-based agents can effectively enhance MAS reliability through their stronger skepticism and reflective capabilities. The proposed CP-WBFT mechanism successfully leverages these characteristics to provide robust Byzantine fault tolerance, demonstrating that LLM-based agents offer significant reliability advantages over traditional agents in multi-agent systems.

Abstract: Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored. i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
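
The probe-based weighting can be pictured as a confidence-weighted vote; the sketch below is a drastic simplification of CP-WBFT (the actual mechanism involves probing and weighted information flow across a topology), with hypothetical answers and confidence scores.

```python
def weighted_consensus(proposals: list[str], confidences: list[float]) -> str:
    """Confidence-weighted vote over agent proposals, a simplified reading of a
    weighted consensus step; confidences are assumed to lie in [0, 1]."""
    totals: dict[str, float] = {}
    for answer, conf in zip(proposals, confidences):
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)

# Five agents, two of them Byzantine but assigned low probe confidence.
answers = ["42", "42", "17", "17", "42"]
confs = [0.9, 0.8, 0.2, 0.1, 0.7]
print(weighted_consensus(answers, confs))   # '42'
```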

[437] Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems

Sreemaee Akshathala, Bassam Adnan, Mahisha Ramesh, Karthik Vaidhyanathan, Basil Muhammed, Kannan Parthasarathy

Main category: cs.MA

TL;DR: Proposes an Agent Assessment Framework with four evaluation pillars (LLMs, Memory, Tools, Environment) to address limitations of current binary task completion metrics for assessing non-deterministic agentic AI systems.

DetailsMotivation: Current evaluation methods for agentic AI systems fail to capture behavioral uncertainty from non-deterministic LLMs and overlook critical dimensions like tool invocation, memory management, agent collaboration, and environmental interaction. These limitations became apparent during production deployment with MontyCloud Inc.

Method: Develops an end-to-end Agent Assessment Framework with four evaluation pillars: LLMs (model capabilities), Memory (ingestion/retrieval), Tools (invocation and usage), and Environment (interaction effectiveness). Validates the framework on an Autonomous CloudOps use case.

Result: The framework successfully captures behavioral deviations and runtime uncertainties that conventional binary task completion metrics overlook. Experiments on the CloudOps use case demonstrate the framework’s effectiveness in providing more comprehensive agent assessment.

Conclusion: A systematic, multi-dimensional assessment framework is needed for agentic AI systems to address their non-deterministic nature and complex interactions. The proposed four-pillar framework provides a more comprehensive evaluation approach than current methods.

Abstract: Recent advances in agentic AI have shifted the focus from standalone Large Language Models (LLMs) to integrated systems that combine LLMs with tools, memory, and other agents to perform complex tasks. These multi-agent architectures enable coordinated reasoning, planning, and execution across diverse domains, allowing agents to collaboratively automate complex workflows. Despite these advances, evaluation and assessment of LLM agents and the multi-agent systems they constitute remain a fundamental challenge. Although various approaches have been proposed in the software engineering literature for evaluating conventional software components, existing methods for AI-based systems often overlook the non-deterministic nature of models. This non-determinism introduces behavioral uncertainty during execution, yet existing evaluations rely on binary task completion metrics that fail to capture it. Evaluating agentic systems therefore requires examining additional dimensions, including the agent ability to invoke tools, ingest and retrieve memory, collaborate with other agents, and interact effectively with its environment. These challenges emerged during our ongoing industry collaboration with MontyCloud Inc., when we deployed an agentic system in production. These limitations surfaced during deployment, highlighting practical gaps in the current evaluation methods and the need for a systematic assessment of agent behavior beyond task outcomes. Informed by these observations and established definitions of agentic systems, we propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations overlooked by conventional metrics, demonstrating its effectiveness in capturing runtime uncertainties.

cs.MM

[438] Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing

Amirkia Rafiei Oskooei, Eren Caglar, Ibrahim Sahin, Ayse Kayabay, Mehmet S. Aktas

Main category: cs.MM

TL;DR: A system-level framework that reduces computational complexity from O(N²) to O(N) for real-time generative AI pipelines in video translation, enabling scalable multi-user video conferencing with perceptually real-time performance.

DetailsMotivation: Real-time deployment of cascaded generative AI pipelines for video translation faces critical bottlenecks: cumulative latency from sequential model inference and quadratic computational complexity (O(N²)) that makes multi-user video conferencing applications unscalable.

Method: Proposes a practical system-level framework with two key components: 1) a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and 2) a segmented processing protocol to manage inference latency for perceptually real-time experience. Implemented a proof-of-concept pipeline and conducted performance analysis across multi-tiered hardware (RTX 4060, T4, A100 GPUs).

Result: The system achieves real-time throughput (τ<1.0) on modern hardware. A subjective user study shows users accept predictable initial processing delay in exchange for smooth, uninterrupted playback. The framework enables scalable real-time performance across different GPU tiers.

Conclusion: Presents a validated end-to-end system design that offers a practical roadmap for deploying scalable, real-time generative AI applications in multilingual communication platforms, addressing critical system-level bottlenecks for real-time video translation pipelines.

Abstract: The real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. These include the cumulative latency of sequential model inference and the quadratic ($\mathcal{O}(N^2)$) computational complexity that renders multi-user video conferencing applications unscalable. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and a segmented processing protocol to manage inference latency for a perceptually real-time experience. We implement a proof-of-concept pipeline and conduct a rigorous performance analysis across a multi-tiered hardware setup, including commodity (NVIDIA RTX 4060), cloud (NVIDIA T4), and enterprise (NVIDIA A100) GPUs. Our objective evaluation demonstrates that the system achieves real-time throughput ($\tau < 1.0$) on modern hardware. A subjective user study further validates the approach, showing that a predictable, initial processing delay is highly acceptable to users in exchange for a smooth, uninterrupted playback experience. The work presents a validated, end-to-end system design that offers a practical roadmap for deploying scalable, real-time generative AI applications in multilingual communication platforms.
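
The quadratic-to-linear argument is easy to make concrete with a back-of-envelope count of concurrent translation pipelines; the counting rule below is an illustrative reading of the scaling claim, not the paper's exact model.

```python
def concurrent_pipelines(n_users: int, turn_taking: bool) -> int:
    """Back-of-envelope count of simultaneous translation pipelines.

    Without turn-taking, every speaker's stream is translated for every other
    listener, so pipelines grow as n * (n - 1); with a single active speaker,
    only n - 1 listener-specific pipelines run at any moment.
    """
    return (n_users - 1) if turn_taking else n_users * (n_users - 1)

for n in (2, 4, 8, 16):
    print(n, concurrent_pipelines(n, turn_taking=False), concurrent_pipelines(n, turn_taking=True))
# At 16 participants: 240 concurrent pipelines without turn-taking vs 15 with it.
```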

[439] End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach

Emanuele Artioli, Farzad Tashtarian, Christian Timmerer

Main category: cs.MM

TL;DR: ELVIS is an end-to-end video streaming system that uses generative AI to remove redundant video data at the server and reconstruct it at the client, achieving up to 11 VMAF quality improvement without increasing bandwidth.

DetailsMotivation: Traditional video codecs must encode and transmit entire video data due to their inability to use context, creating a trade-off between video quality and smooth playback. There's a need for more efficient video streaming that can leverage contextual understanding to reduce bandwidth while maintaining quality.

Method: ELVIS combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. It features a modular design that can integrate different codecs, inpainting models, and quality metrics, making it adaptable to future innovations.

Result: Current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, demonstrating significant quality enhancement without increased bandwidth requirements.

Conclusion: ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth. However, computational demands remain a challenge for real-time applications.

Abstract: The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, inpainting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.

[440] CrossPT-EEG: A Benchmark for Cross-Participant and Cross-Time Generalization of EEG-based Visual Decoding

Shuqi Zhu, Ziyi Ye, Qingyao Ai, Yiqun Liu

Main category: cs.MM

TL;DR: CrossPT-EEG: A benchmark dataset for cross-participant and cross-time EEG-based visual decoding using 4,000 ImageNet images from 16 participants, addressing limitations of traditional block-design experiments.

DetailsMotivation: EEG offers low cost and excellent temporal resolution for studying visual perception, but its potential has been limited by scarce high-quality datasets and block-design experiments that introduce temporal confounds. There's a need for better EEG datasets to advance visual brain-computer interfaces and understanding of biological vision.

Method: Collected EEG data from 16 participants viewing 4,000 ImageNet images with multi-level annotations. Used a two-stage design separated in time to enable cross-time generalization and avoid block-design artifacts. Created benchmarks for non-block design classification and conducted pre-training experiments to assess cross-time and cross-participant generalization.

Result: Developed the CrossPT-EEG benchmark dataset that enables evaluation of cross-participant and cross-time generalization in EEG-based visual decoding. The dataset addresses previous limitations by avoiding block-design artifacts and providing temporal separation for robust generalization testing.

Conclusion: CrossPT-EEG has significant potential to enhance EEG-based visual brain-computer interfaces, deepen understanding of visual perception in biological systems, and may have promising applications for improving machine vision models through better brain activity decoding.

Abstract: Exploring brain activity in relation to visual perception provides insights into the biological representation of the world. While functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) have enabled effective image classification and reconstruction, their high cost and bulk limit practical use. Electroencephalography (EEG), by contrast, offers low cost and excellent temporal resolution, but its potential has been limited by the scarcity of large, high-quality datasets and by block-design experiments that introduce temporal confounds. To fill this gap, we present CrossPT-EEG, a benchmark for cross-participant and cross-time generalization of visual decoding from EEG. We collected EEG data from 16 participants while they viewed 4,000 images sampled from ImageNet, with image stimuli annotated at multiple levels of granularity. Our design includes two stages separated in time to allow cross-time generalization and avoid block-design artifacts. We also introduce benchmarks tailored to non-block design classification, as well as pre-training experiments to assess cross-time and cross-participant generalization. These findings highlight the dataset’s potential to enhance EEG-based visual brain-computer interfaces, deepen our understanding of visual perception in biological systems, and suggest promising applications for improving machine vision models.

eess.AS

[441] Scalable Frameworks for Real-World Audio-Visual Speech Recognition

Sungnyun Kim

Main category: eess.AS

TL;DR: This dissertation proposes a hierarchical approach to build robust and scalable Audio-Visual Speech Recognition systems that can handle real-world noise and visual interference through representation, architecture, and system-level innovations.

DetailsMotivation: AVSR systems suffer significant performance degradation in real-world environments with unpredictable acoustic noise and visual interference. Current approaches lack systematic solutions for robustness and scalability across different levels of system design.

Method: Three-level hierarchical approach: 1) Representation level - unified model learning audio-visual features robust to diverse corruptions; 2) Architecture level - efficient model expansion with adaptive multimodal input processing and intelligent computational resource allocation; 3) System level - modular integration with large-scale foundation models to leverage their cognitive and generative capabilities.

Result: The dissertation aims to build a next-generation AVSR system with high reliability in real-world applications by systematically addressing robustness and scalability challenges at representation, architecture, and system levels.

Conclusion: A systematic hierarchical approach across representation, architecture, and system levels is essential to overcome real-world challenges in AVSR systems, enabling robust scalability and high reliability for practical deployment.

Abstract: The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving the robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system’s functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.

[442] Investigating the impact of stereo processing – a study for extending the Open Dataset of Audio Quality (ODAQ)

Sascha Dick, Christoph Thompson, Chih-Wei Wu, Pablo Delgado, Phillip A. Williams, Matteo Torcoli

Main category: eess.AS

TL;DR: ODAQ dataset extended to study stereo processing effects on audio quality perception, finding listeners prioritize timbral impairments over spatial characteristics when evaluating quality.

DetailsMotivation: To extend the Open Dataset of Audio Quality (ODAQ) beyond monaural artifacts by investigating how stereo processing (LR and MS) affects audio quality perception, and to examine methodological aspects of listening tests for stereo audio quality assessment.

Method: Adapted monaural artifacts from ODAQ with left-right (LR) and mid-side (MS) stereo processing across various stimuli types (solo instruments, wide stereo mixes, hard-panned mixes). Conducted listening tests with 16 expert listeners in different presentation contexts (with/without direct comparison of MS and LR conditions). Extended ODAQ dataset with new materials and subjective scores.

Result: Substantial influences of stimuli’s spatial characteristics and presentation context on quality perception. Significant disparities between LR and MS processing only appear when presented in direct comparison. Listeners primarily assess timbral impairments when spatial characteristics are consistent, and focus on stereo image only when timbral quality is similar. Mono anchor ratings were consistent across stereo characteristics (average 65 on MUSHRA scale), confirming listeners prioritize timbral over spatial impressions.

Conclusion: Stereo processing significantly impacts audio quality perception, but listeners prioritize timbral quality over spatial characteristics when evaluating overall audio quality. Presentation context (direct comparison vs. separate evaluation) affects perception of differences between stereo processing methods. The extended ODAQ dataset provides valuable resources for stereo audio quality research.

Abstract: In this paper, we present an initial study for extending the Open Dataset of Audio Quality (ODAQ) towards the impact of stereo processing. Monaural artifacts from ODAQ were adapted in combinations with left-right (LR) and mid-side (MS) stereo processing, across stimuli including solo instruments, typical wide stereo mixes, and hard-panned mixes. Listening tests in different presentation contexts – with and without direct comparison of MS and LR conditions – were conducted to collect subjective data beyond monaural artifacts while also scrutinizing the listening test methodology. The ODAQ dataset is extended with new material along with subjective scores from 16 expert listeners. The listening test results show substantial influences of the stimuli’s spatial characteristics as well as the presentation context. Notably, several significant disparities between LR and MS only occur when presented in direct comparison. The findings suggest that listeners primarily assess timbral impairments when spatial characteristics are consistent and focus on the stereo image only when timbral quality is similar. The rating of an additional mono anchor was overall consistent across different stereo characteristics, averaging at 65 on the MUSHRA scale, further corroborating that listeners prioritize timbral over spatial impressions.
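
For readers unfamiliar with the two stereo representations compared here, the standard left-right to mid-side transform and its inverse are shown below; the 0.5 scaling convention and the toy signals are illustrative.

```python
import numpy as np

def lr_to_ms(left: np.ndarray, right: np.ndarray):
    """Mid-side encoding: mid carries the content common to both channels,
    side carries the stereo difference. The 0.5 scaling is one common convention."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_to_lr(mid: np.ndarray, side: np.ndarray):
    """Inverse transform; artifacts applied in the MS domain before this step
    distribute across the stereo image differently than LR-domain processing."""
    return mid + side, mid - side

t = np.linspace(0.0, 1.0, 48_000, endpoint=False)
left = np.sin(2 * np.pi * 440 * t)
right = 0.8 * np.sin(2 * np.pi * 440 * t + 0.3)     # slightly decorrelated right channel
mid, side = lr_to_ms(left, right)
l2, r2 = ms_to_lr(mid, side)
print(np.allclose(left, l2), np.allclose(right, r2))   # True True
```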

[443] Segmental Attention Decoding With Long Form Acoustic Encodings

Pawel Swietojanski, Xinwei Li, Mingbin Xu, Takaaki Hori, Dogan Can, Xiaodan Zhuang

Main category: eess.AS

TL;DR: Proposed four modifications to fix attention-based encoder-decoder models’ incompatibility with long-form acoustic encodings by addressing positional encoding issues in cross-attention.

DetailsMotivation: AED models trained on segmented utterances fail to generalize to long-form acoustic encodings because they learn to encode absolute frame positions using limited acoustic context beyond segment boundaries, but these cues vanish in long-form decoding, causing loss of ordering ability due to permutation invariance in cross-attention.

Method: Four key modifications: (1) inject explicit absolute positional encodings into cross-attention for each decoded segment, (2) use long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) apply segment concatenation to cover diverse segmentations needed during training, and (4) implement semantic segmentation to align AED-decoded segments with training segments.

Result: The proposed modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder with long-form acoustic inputs.

Conclusion: The fundamental incompatibility of AED models with long-form acoustic encodings can be resolved through explicit positional encoding strategies and training modifications that address the permutation invariance issue in cross-attention, making AED models viable for long-form speech processing.

Abstract: We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
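
Modification (1), injecting explicit absolute positions into the cross-attention inputs for each decoded segment, can be sketched as adding sinusoidal encodings offset by the segment's start frame; the encoding choice, dimensions, and function names below are assumptions, not the paper's exact design.

```python
import math
import torch

def sinusoidal_pe(num_frames: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal absolute positional encodings (dim must be even)."""
    pos = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_frames, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def add_segment_positions(encoder_frames: torch.Tensor, seg_start: int) -> torch.Tensor:
    """Add absolute positions, offset by the segment's first frame index, so the
    cross-attention keys/values are no longer permutation-invariant within the
    long-form encoding."""
    num_frames, dim = encoder_frames.shape
    pe = sinusoidal_pe(seg_start + num_frames, dim)[seg_start:]
    return encoder_frames + pe

segment = torch.randn(200, 256)                     # 200 encoder frames, 256-dim
keys = add_segment_positions(segment, seg_start=1000)
print(keys.shape)                                   # torch.Size([200, 256])
```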

[444] Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation

Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou

Main category: eess.AS

TL;DR: Proposes JSA-SPG, a latent variable method for crosslingual speech recognition that eliminates need for pronunciation lexicons by treating phonemes as discrete latent variables, achieving 5% error rate reduction with minimal phoneme supervision.

DetailsMotivation: Existing phoneme-based crosslingual speech recognition requires pronunciation lexicons, which limits applicability. The paper aims to eliminate this requirement while maintaining the benefits of phonetic supervision for crosslingual transfer.

Method: Uses a latent variable model with phonemes as discrete latent variables, consisting of S2P (speech-to-phoneme), P2G (phoneme-to-grapheme), and G2P (grapheme-to-phoneme) models. Trained jointly using JSA (joint stochastic approximation) algorithm, with MLS decoding and P2G augmentation for robustness.

Result: JSA-SPG achieves 5% error rate reduction compared to best crosslingual fine-tuning approaches using only 10 minutes of phoneme supervision. In language domain adaptation, outperforms standard language model fusion by 9% error rate reduction.

Conclusion: The proposed JSA-SPG method effectively eliminates the need for pronunciation lexicons in crosslingual speech recognition while improving performance with minimal supervision, demonstrating advantages in both crosslingual transfer and language domain adaptation scenarios.

Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Furthermore, we propose marginal likelihood scoring (MLS) decoding to align inference with the training objective and P2G augmentation to improve the robustness of P2G mapping. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.

eess.IV

[445] Improving the Plausibility of Pressure Distributions Synthesized from Depth through Generative Modeling

Neevkumar Manavar, Hanno Gerd Meyer, Joachim Waßmuth, Barbara Hammer, Axel Schneider

Main category: eess.IV

TL;DR: Proposes a framework using Informed Latent Space and Weight Optimization Loss with generative modeling to produce physically plausible pressure maps for hospital bed monitoring, with diffusion-based BBDM and its latent counterpart LBBDM for faster inference.

DetailsMotivation: Current methods for monitoring contact pressure in hospital beds lack physical plausibility, limiting clinical reliability for pressure ulcer prevention and real-time patient assessment.

Method: Uses Informed Latent Space (ILS) and Weight Optimization Loss (WOL) with generative modeling. Applies a conditional Brownian Bridge Diffusion Model (BBDM) and proposes a training strategy for its latent counterpart, the Latent Brownian Bridge Diffusion Model (LBBDM), for pressure synthesis in lying postures.

Result: BBDM with ILS delivers highly detailed pressure maps but with higher computational cost and inference time, while LBBDM provides faster inference with competitive performance. Overall improves physical plausibility and performance over baselines.

Conclusion: The approach enables non-invasive, vision-based, real-time patient monitoring in clinical environments with physically consistent pressure estimates for pressure ulcer prevention.

Abstract: Monitoring contact pressure in hospital beds is essential for preventing pressure ulcers and enabling real-time patient assessment. Current methods can predict pressure maps but often lack physical plausibility, limiting clinical reliability. This work proposes a framework that enhances plausibility via Informed Latent Space (ILS) and Weight Optimization Loss (WOL) with generative modeling to produce high-fidelity, physically consistent pressure estimates. This study also applies a diffusion-based conditional Brownian Bridge Diffusion Model (BBDM) and proposes a training strategy for its latent counterpart, the Latent Brownian Bridge Diffusion Model (LBBDM), tailored to pressure synthesis in lying postures. Experimental results show that the proposed method improves physical plausibility and performance over baselines: BBDM with ILS delivers highly detailed maps at higher computational cost and longer inference time, whereas LBBDM provides faster inference with competitive performance. Overall, the approach supports non-invasive, vision-based, real-time patient monitoring in clinical environments.

[446] Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Chiara Spota, Jakub Grzelak, Oleg Aslanidi

Main category: eess.IV

TL;DR: Deep learning framework replaces computationally expensive physics-based models for predicting ECG signals from cardiac voltage maps with high accuracy (R²=0.99).

DetailsMotivation: Traditional physics-based models for electrocardiology forward problems are accurate but computationally expensive, limiting real-time clinical applications. Need for efficient surrogate models.

Method: Time-dependent attention-based sequence-to-sequence architecture with convolutional encoders. Uses hybrid loss combining Huber loss with spectral entropy term to preserve temporal and frequency-domain fidelity.
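
One plausible way to combine the two loss terms is sketched below; the exact entropy definition and weighting `alpha` used by the authors are not given in the summary, so treat these as assumptions.

```python
import torch
import torch.nn.functional as F

def hybrid_ecg_loss(pred, target, alpha=0.1, eps=1e-8):
    """Huber loss plus a spectral-entropy mismatch term (illustrative weighting).

    pred, target: (batch, time) predicted and reference ECG traces.
    The spectral term compares the Shannon entropy of the normalized power
    spectra, encouraging frequency-domain fidelity alongside temporal accuracy.
    """
    huber = F.huber_loss(pred, target)

    def spectral_entropy(x):
        power = torch.fft.rfft(x, dim=-1).abs() ** 2
        p = power / (power.sum(dim=-1, keepdim=True) + eps)
        return -(p * (p + eps).log()).sum(dim=-1)

    spec = (spectral_entropy(pred) - spectral_entropy(target)).abs().mean()
    return huber + alpha * spec
```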

Result: Achieved high accuracy with mean R² = 0.99 ± 0.01 on 2D tissue simulations including healthy, fibrotic, and gap junction-remodelled conditions. Ablation studies confirmed importance of key components.

Conclusion: Deep learning provides scalable, cost-effective alternative to physics-based solvers with potential for clinical applications and digital twin implementations.

Abstract: The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.

[447] Synthetic Aperture for High Spatial Resolution Acoustoelectric Imaging

Wei Yi Oon, Yuchen Tang, Baiqian Qi, Wei-Ning Lee

Main category: eess.IV

TL;DR: Synthetic Aperture Acoustoelectric imaging with coherence weighting improves resolution and SNR over conventional focused ultrasound methods for imaging electric fields in biological tissues.

DetailsMotivation: Conventional focused ultrasound (FUS) AE imaging has limited depth-of-field that doesn't span centimeter-scale organs, restricting practical imaging of biological currents.

Method: Proposed Synthetic Aperture AE (SA-AE) with pixel-based delay-and-sum reconstruction from unfocused signals, enhanced with coherence factor (CF) and pulse-length coherence factor (CFPL) weighting for noise suppression.
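
The coherence factor is a standard beamforming quantity; a minimal per-pixel computation under generic assumptions (delay-aligned single-element signals stacked along the first axis) might look like the following. How the paper computes the pulse-length variant (CFPL) is not detailed in the summary.

```python
import numpy as np

def coherence_factor(aligned):
    """Per-pixel coherence factor from delay-aligned single-element AE signals.

    aligned: (N, H, W) array of the N delayed element/transmit contributions
    to each pixel.  CF = |sum_i s_i|^2 / (N * sum_i |s_i|^2); the pulse-length
    variant (CFPL) additionally sums each term over a pulse-length window
    before taking the ratio.
    """
    n = aligned.shape[0]
    num = np.abs(aligned.sum(axis=0)) ** 2
    den = n * (np.abs(aligned) ** 2).sum(axis=0) + 1e-12
    return num / den

# Coherence-weighted SA-AE image: sa_image * coherence_factor(aligned_signals)
```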

Result: SA-AE improved spatial resolution throughout depth-of-field but introduced noise; coherence weighting (CF/CFPL) further boosted resolution, contrast, and SNR beyond FUS-AE, with CFPL showing stronger noise suppression.

Conclusion: Coherence-weighted SA-AE using unfocused waves offers high-resolution, noise-robust solution for practical imaging of fast biological currents, overcoming FUS-AE’s depth limitations.

Abstract: Acoustoelectric (AE) imaging provides electro-anatomical contrast by mapping the distribution of electric fields in biological tissues, by delivering ultrasound waves which spatially modulate the medium resistivity via the AE effect. The conventional method in AE imaging is to transmit focused ultrasound (FUS) beams; however, the depth-of-field (DOF) of FUS-AE is limited to the size of the focal spot, which does not span across the centimeter-scale of organs. Instead of fixing the focal depth on transmission, we propose to dynamically synthesize the AE modulation regions via a Synthetic Aperture approach (SA-AE). SA-AE involves a straightforward pixel-based delay-and-sum reconstruction of AE images from unfocused AE signals. In saline and ex vivo lobster nerve experiments, FUS-AE was shown to perform well only at the focal depth, with poor spatial resolution for out-of-focus electric sources. Meanwhile, SA-AE generally improved spatial resolution throughout the DOF, but introduced strong background noise. The flexibility of uncoupled, single-element induced AE signals in SA-AE was further leveraged to quantify their spatial coherence across the transmit aperture, obtaining maps of the coherence factor (CF) and pulse-length coherence factor (CFPL). Weighting SA-AE images with their derived CF and CFPL maps resulted in further improvement in image resolution and contrast, and notably, boosted the image SNR beyond that of FUS-AE. CFPL exhibited stronger noise suppression over CF. Using unfocused wave transmissions, the proposed coherence-weighted SA-AE strategy offers a high resolution yet noise-robust solution towards the practical imaging of fast biological currents.

[448] Test Time Optimized Generalized AI-based Medical Image Registration Method

Sneha Sree C., Dattesh Shanbhag, Sudhanya Chatterjee

Main category: eess.IV

TL;DR: A novel AI-driven framework for 3D non-rigid medical image registration that generalizes across multiple imaging modalities and anatomical regions without requiring anatomy- or modality-specific customization.

DetailsMotivation: Medical image registration is critical but challenging, especially non-rigid registration which must handle complex anatomical deformations. Traditional methods require extensive parameter tuning and are computationally expensive, while recent deep learning approaches lack scalability due to task-specific retraining needs. There's a need for efficient, generalizable registration frameworks that can handle heterogeneous imaging contexts.

Method: Introduces an AI-driven framework for 3D non-rigid registration that eliminates anatomy- or modality-specific customization. Unlike conventional methods that rely on application-specific models, this approach enables streamlined integration into diverse clinical environments.

Result: The abstract doesn’t provide specific quantitative results, but claims the framework generalizes across multiple imaging modalities (CT, MRI, ultrasound) and anatomical regions while eliminating the need for application-specific customization.

Conclusion: The proposed framework addresses key limitations in medical image registration by providing a generalizable solution that can handle complex non-rigid deformations across diverse imaging contexts without requiring extensive customization, potentially enabling more efficient integration into clinical workflows.

Abstract: Medical image registration is critical for aligning anatomical structures across imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. Among existing techniques, non-rigid registration (NRR) is particularly challenging due to the need to capture complex anatomical deformations caused by physiological processes like respiration or contrast-induced signal variations. Traditional NRR methods, while theoretically robust, often require extensive parameter tuning and incur high computational costs, limiting their use in real-time clinical workflows. Recent deep learning (DL)-based approaches have shown promise; however, their dependence on task-specific retraining restricts scalability and adaptability in practice. These limitations underscore the need for efficient, generalizable registration frameworks capable of handling heterogeneous imaging contexts. In this work, we introduce a novel AI-driven framework for 3D non-rigid registration that generalizes across multiple imaging modalities and anatomical regions. Unlike conventional methods that rely on application-specific models, our approach eliminates anatomy- or modality-specific customization, enabling streamlined integration into diverse clinical environments.

[449] An Energy-Efficient Adiabatic Capacitive Neural Network Chip

Himadri Singh Raghav, Sachin Maheshwari, Mike Smart, Patrick Foster, Alex Serb

Main category: eess.IV

TL;DR: Mixed-signal adiabatic capacitive neural network chip achieves 2.1-6.8x energy savings for image classification compared to conventional CMOS capacitive implementation.

DetailsMotivation: Growing demand for high computational performance under stringent energy constraints in battery-powered and edge devices, especially for AI applications like video processing and high-resolution sensing.

Method: Designed a mixed-signal adiabatic capacitive neural network chip in 130nm CMOS technology with dual-layer hardware incorporating 16 single-cycle multiply-accumulate engines for classifying 4 classes of 8x8 1-bit images.

Result: Chip achieves over 95% classification accuracy (within 2.7% of equivalent software version) and demonstrates average energy savings between 2.1x and 6.8x compared to equivalent CMOS capacitive implementation.

Conclusion: The mixed-signal adiabatic capacitive approach provides significant energy efficiency improvements for neural network inference in edge devices while maintaining high classification accuracy.

Abstract: Recent advances in artificial intelligence, coupled with increasing data bandwidth requirements, in applications such as video processing and high-resolution sensing, have created a growing demand for high computational performance under stringent energy constraints, especially for battery-powered and edge devices. To address this, we present a mixed-signal adiabatic capacitive neural network chip, designed in a 130$nm$ CMOS technology, to demonstrate significant energy savings coupled with high image classification accuracy. Our dual-layer hardware chip, incorporating 16 single-cycle multiply-accumulate engines, can reliably distinguish between 4 classes of 8x8 1-bit images, with classification results over 95%, within 2.7% of an equivalent software version. Energy measurements reveal average energy savings between 2.1x and 6.8x, compared to an equivalent CMOS capacitive implementation.

[450] Configurable γ Photon Spectrometer to Enable Precision Radioguided Tumor Resection

Rahul Lall, Youngho Seo, Ali M. Niknejad, Mekhail Anwar

Main category: eess.IV

TL;DR: A 9.9 mm² CMOS-integrated gamma spectrometer with sub-keV resolution for radioguided surgery to detect microscopic cancer clusters.

DetailsMotivation: During tumor resection, microscopic cancer clusters are hard to visualize and often left behind, increasing recurrence risk. Current radioguided surgery needs mm-scale gamma spectrometers to localize cancer cells with high specificity.

Method: Developed a 180 nm CMOS IC-based gamma spectrometer using 2×2 μm reverse-biased diodes with low capacitance. Instead of measuring voltage directly, it measures decay time of voltage signals after gamma detection. Three pixel architectures allow configurable sensitivity, resolution, and dynamic range.

Result: The spectrometer resolves activities down to 1 μCi with sub-keV energy resolution and 1.315 MeV dynamic range using 5-minute acquisitions, tested with ⁶⁴Cu, ¹³³Ba, and ¹⁷⁷Lu radioisotopes.

Conclusion: The integrated gamma spectrometer enables precise localization of microscopic cancer cells during surgery, potentially reducing recurrence by improving tumor margin assessment.

Abstract: Surgical tumor resection aims to remove all cancer cells in the tumor margin and at centimeter-scale depths below the tissue surface. During surgery, microscopic clusters of disease are intraoperatively difficult to visualize and are often left behind, significantly increasing the risk of cancer recurrence. Radioguided surgery (RGS) has shown the ability to selectively tag cancer cells with gamma (γ) photon emitting radioisotopes to identify them, but requires a mm-scale γ photon spectrometer to localize the position of these cells in the tissue margin (i.e., a function of incident γ photon energy) with high specificity. Here we present a 9.9 mm² integrated circuit (IC)-based γ spectrometer implemented in 180 nm CMOS, to enable the measurement of single γ photons and their incident energy with sub-keV energy resolution. We use small 2×2 μm reverse-biased diodes that have low depletion region capacitance, and therefore produce millivolt-scale voltage signals in response to the small charge generated by incident γ photons. A low-power energy spectrometry method is implemented by measuring the decay time it takes for the generated voltage signal to settle back to DC after a γ detection event, instead of measuring the voltage drop directly. This spectrometry method is implemented in three different pixel architectures that allow for configurable pixel sensitivity, energy resolution, and energy dynamic range based on the widely heterogeneous surgical and patient presentation in RGS. The spectrometer was tested with three common γ-emitting radioisotopes (⁶⁴Cu, ¹³³Ba, ¹⁷⁷Lu), and is able to resolve activities down to 1 μCi with sub-keV energy resolution and 1.315 MeV energy dynamic range, using 5-minute acquisitions.

[451] Translating Electrocardiograms to Cardiac Magnetic Resonance Imaging Useful for Cardiac Assessment and Disease Screening: A Multi-Center Study

Zhengyao Ding, Ziyu Li, Yujian Hu, Youyao Xu, Chengchen Zhao, Yiheng Mao, Haitao Li, Zhikang Li, Qian Li, Jing Wang, Yue Chen, Mengjia Chen, Longbo Wang, Xuesen Chu, Weichao Pan, Ziyi Liu, Fei Wu, Hongkun Zhang, Ting Chen, Zhengxing Huang

Main category: eess.IV

TL;DR: CardioNets is a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, enabling scalable cardiac assessment without expensive CMR.

DetailsMotivation: Cardiovascular diseases are the leading cause of global mortality, but gold-standard CMR is expensive and complex, while widely available ECG lacks granularity. There's a need for accessible, accurate diagnostic tools for large-scale screening.

Method: Deep learning framework integrating cross-modal contrastive learning and generative pretraining. Aligns ECG with CMR-derived cardiac phenotypes and synthesizes high-resolution CMR images via masked autoregressive model. Trained on 159,819 samples from five cohorts including UK Biobank and MIMIC-IV-ECG.
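
The cross-modal contrastive alignment is described only at a high level; the symmetric InfoNCE loss below is a generic CLIP-style sketch of such ECG-to-CMR alignment, not the authors' code, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(ecg_emb, cmr_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired ECG and CMR-derived embeddings.

    ecg_emb, cmr_emb: (B, D) embeddings of the same B patients from the two
    modalities; matched rows are treated as positives, all others as negatives.
    """
    ecg = F.normalize(ecg_emb, dim=-1)
    cmr = F.normalize(cmr_emb, dim=-1)
    logits = ecg @ cmr.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```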

Result: Strong performance across disease screening and phenotype estimation: 24.8% improvement in cardiac phenotype regression R2, up to 39.3% improvement in cardiomyopathy AUC, 5.6% increase in pulmonary hypertension detection AUC. Generated CMR images showed 36.6% higher SSIM and 8.7% higher PSNR. ECG-only CardioNets achieved 13.9% higher accuracy than human physicians using both ECG and real CMR.

Conclusion: CardioNets offers a promising, low-cost alternative to CMR for large-scale CVD screening, particularly in resource-limited settings. Future work will focus on clinical deployment and regulatory validation of ECG-based synthetic imaging.

Abstract: Cardiovascular diseases (CVDs) are the leading cause of global mortality, necessitating accessible and accurate diagnostic tools. While cardiac magnetic resonance imaging (CMR) provides gold-standard insights into cardiac structure and function, its clinical utility is limited by high cost and complexity. In contrast, electrocardiography (ECG) is inexpensive and widely available but lacks the granularity of CMR. We propose CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, enabling scalable cardiac assessment. CardioNets integrates cross-modal contrastive learning and generative pretraining, aligning ECG with CMR-derived cardiac phenotypes and synthesizing high-resolution CMR images via a masked autoregressive model. Trained on 159,819 samples from five cohorts, including the UK Biobank (n=42,483) and MIMIC-IV-ECG (n=164,550), and externally validated on independent clinical datasets (n=3,767), CardioNets achieved strong performance across disease screening and phenotype estimation tasks. In the UK Biobank, it improved cardiac phenotype regression R2 by 24.8% and cardiomyopathy AUC by up to 39.3% over baseline models. In MIMIC, it increased AUC for pulmonary hypertension detection by 5.6%. Generated CMR images showed 36.6% higher SSIM and 8.7% higher PSNR than prior approaches. In a reader study, ECG-only CardioNets achieved 13.9% higher accuracy than human physicians using both ECG and real CMR. These results suggest that CardioNets offers a promising, low-cost alternative to CMR for large-scale CVD screening, particularly in resource-limited settings. Future efforts will focus on clinical deployment and regulatory validation of ECG-based synthetic imaging.

[452] Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data

Saeed Shurrab, Aadim Nepal, Terrence J. Lee-St. John, Nicola G. Ghazi, Bartlomiej Piechowski-Jozwiak, Farah E. Shamout

Main category: eess.IV

TL;DR: Multimodal deep learning framework combining retinal OCT/IR scans with clinical data improves stroke detection and risk prediction, achieving 5-8% AUROC gains over baselines.

DetailsMotivation: Stroke is a major global health issue, but current diagnostic methods rely on expensive imaging like CT scans. Retinal imaging offers a cost-effective alternative due to shared pathways between retina and brain, potentially enabling better stroke detection and risk assessment.

Method: Proposed multimodal deep neural network that processes both Optical Coherence Tomography (OCT) and infrared reflectance retinal scans, combined with clinical data (demographics, vital signs, diagnosis codes). Used self-supervised pretraining on 37k unlabeled scans, then fine-tuned on smaller labeled subset.
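
As a rough illustration of fusing a retinal-scan embedding with tabular clinical features, consider the minimal late-fusion module below; encoder choices, feature dimensions, and the self-supervised pretraining objective are not specified in the summary, so every name and number here is an assumption.

```python
import torch
import torch.nn as nn

class RetinaClinicalFusion(nn.Module):
    """Late-fusion head combining a retinal-scan embedding with clinical data (sketch only)."""

    def __init__(self, img_dim=512, clin_dim=32, hidden=128, n_classes=2):
        super().__init__()
        # Project tabular clinical features (demographics, vitals, codes) to a hidden space.
        self.clin_mlp = nn.Sequential(nn.Linear(clin_dim, hidden), nn.ReLU())
        # Classify from the concatenated image and clinical representations.
        self.head = nn.Sequential(
            nn.Linear(img_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb, clin_feats):
        z = torch.cat([img_emb, self.clin_mlp(clin_feats)], dim=-1)
        return self.head(z)
```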

Result: Model demonstrated predictive ability for detecting lasting retinal effects of acute stroke and forecasting future stroke risk. Achieved 5% AUROC improvement over unimodal image-only baseline and 8% improvement over existing state-of-the-art foundation model.

Conclusion: Retinal imaging combined with clinical data shows significant potential for identifying high-risk stroke patients and improving long-term outcomes, offering a cost-effective alternative to traditional imaging methods.

Abstract: Stroke is a major public health problem, affecting millions worldwide. Deep learning has recently demonstrated promise for enhancing the diagnosis and risk prediction of stroke. However, existing methods rely on costly medical imaging modalities, such as computed tomography. Recent studies suggest that retinal imaging could offer a cost-effective alternative for cerebrovascular health assessment due to the shared clinical pathways between the retina and the brain. Hence, this study explores the impact of leveraging retinal images and clinical data for stroke detection and risk prediction. We propose a multimodal deep neural network that processes Optical Coherence Tomography (OCT) and infrared reflectance retinal scans, combined with clinical data, such as demographics, vital signs, and diagnosis codes. We pretrained our model using a self-supervised learning framework using a real-world dataset consisting of 37k scans, and then fine-tuned and evaluated the model using a smaller labeled subset. Our empirical findings establish the predictive ability of the considered modalities in detecting lasting effects in the retina associated with acute stroke and forecasting future risk within a specific time horizon. The experimental results demonstrate the effectiveness of our proposed framework by achieving 5% AUROC improvement as compared to the unimodal image-only baseline, and 8% improvement compared to an existing state-of-the-art foundation model. In conclusion, our study highlights the potential of retinal imaging in identifying high-risk patients and improving long-term outcomes.

[453] High Volume Rate 3D Ultrasound Reconstruction with Diffusion Models

Tristan S. W. Stevens, Oisín Nolan, Oudom Somphone, Jean-Luc Robert, Ruud J. G. van Sloun

Main category: eess.IV

TL;DR: DM-based 3D ultrasound reconstruction from undersampled elevation planes outperforms traditional interpolation methods in image quality and downstream tasks.

DetailsMotivation: 3D ultrasound offers real-time volumetric visualization but faces trade-offs between volume rates and image quality. Current methods using diverging waves achieve high rates but suffer from degraded quality, while elevation-focused approaches require interpolation when undersampling elevation planes.

Method: Proposes diffusion models (DMs) for 3D ultrasound reconstruction from reduced elevation planes. Compares traditional and supervised deep learning interpolation methods on 3D cardiac ultrasound data. Accelerates inference using temporal consistency and explores robustness via diffusion posterior sampling for uncertainty quantification.
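
The uncertainty quantification mentioned above amounts to drawing repeated diffusion posterior samples; a minimal sketch, assuming a hypothetical `sample_posterior` callable that maps the undersampled elevation planes to one full-volume sample, estimates a mean reconstruction and a per-voxel uncertainty map as follows.

```python
import torch

def posterior_mean_and_uncertainty(sample_posterior, observed_planes, n_samples=8):
    """Monte-Carlo reconstruction statistics from a diffusion posterior sampler.

    sample_posterior: hypothetical callable returning one full-volume posterior
                      sample (tensor) conditioned on the observed elevation planes.
    Returns the sample mean (reconstruction) and per-voxel std (uncertainty map).
    """
    samples = torch.stack(
        [sample_posterior(observed_planes) for _ in range(n_samples)]
    )
    return samples.mean(dim=0), samples.std(dim=0)
```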

Result: DM-based reconstruction consistently outperforms baseline methods in image quality and downstream task performance. The method shows improved recall on out-of-distribution data with synthetic anomalies under strong subsampling.

Conclusion: Diffusion models provide an effective approach for high-quality 3D ultrasound reconstruction from undersampled elevation planes, offering better image quality, task performance, and robustness with uncertainty quantification capabilities.

Abstract: Three-dimensional ultrasound enables real-time volumetric visualization of anatomical structures. Unlike traditional 2D ultrasound, 3D imaging reduces reliance on precise probe orientation, potentially making ultrasound more accessible to clinicians with varying levels of experience and improving automated measurements and post-exam analysis. However, achieving both high volume rates and high image quality remains a significant challenge. While 3D diverging waves can provide high volume rates, they suffer from limited tissue harmonic generation and increased multipath effects, which degrade image quality. One compromise is to retain focus in elevation while leveraging unfocused diverging waves in the lateral direction to reduce the number of transmissions per elevation plane. Reaching the volume rates achieved by full 3D diverging waves, however, requires dramatically undersampling the number of elevation planes. Subsequently, to render the full volume, simple interpolation techniques are applied. This paper introduces a novel approach to 3D ultrasound reconstruction from a reduced set of elevation planes by employing diffusion models (DMs) to achieve increased spatial and temporal resolution. We compare both traditional and supervised deep learning-based interpolation methods on a 3D cardiac ultrasound dataset. Our results show that DM-based reconstruction consistently outperforms the baselines in image quality and downstream task performance. Additionally, we accelerate inference by leveraging the temporal consistency inherent to ultrasound sequences. Finally, we explore the robustness of the proposed method by exploiting the probabilistic nature of diffusion posterior sampling to quantify reconstruction uncertainty and demonstrate improved recall on out-of-distribution data with synthetic anomalies under strong subsampling.

[454] Papanicolaou Stain Unmixing for RGB Image Using Weighted Nucleus Sparsity and Total Variation Regularization

Nanxin Gong, Saori Takeyama, Masahiro Yamaguchi, Takumi Urata, Fumikazu Kimura, Keiko Ishii

Main category: eess.IV

TL;DR: A training-free stain unmixing method for Pap-stained RGB images that converts subjective color observations into quantitative dye amounts for improved cervical cancer diagnosis.

DetailsMotivation: Pap stain provides essential color information for cervical cancer screening, but visual observation is subjective and RGB quantification is unreliable due to staining/imaging variations. While multispectral imaging can estimate dye amounts, applying this to RGB images is challenging because there are 5 dyes but only 3 RGB channels.

Method: Proposes a novel training-free Pap stain unmixing method for RGB images using convex optimization with three constraints: (i) nonnegativity, (ii) weighted nucleus sparsity for hematoxylin, and (iii) total variation smoothness.
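
The convex program itself is not reproduced in the summary; the per-pixel projected-gradient sketch below illustrates the data-fidelity, nonnegativity, and weighted hematoxylin-sparsity terms (the total-variation term is omitted for brevity, and the stain matrix `S`, weights, and step size are assumptions).

```python
import numpy as np

def unmix_pixel(rgb_od, S, w_h=1.0, lam=0.01, h_idx=0, n_iter=200, lr=0.05):
    """Projected-gradient stain unmixing for one pixel (illustrative only).

    rgb_od : (3,) optical-density RGB observation.
    S      : (3, 5) stain matrix mapping the 5 dye abundances to RGB optical density.
    Solves min_a 0.5*||S a - rgb_od||^2 + lam * w_h * |a[h_idx]|  s.t. a >= 0,
    i.e. data fidelity plus weighted sparsity on the hematoxylin abundance.
    """
    a = np.zeros(S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ a - rgb_od)
        a = a - lr * grad
        # Soft-threshold the hematoxylin coefficient (weighted L1 prox) ...
        a[h_idx] = np.sign(a[h_idx]) * max(abs(a[h_idx]) - lr * lam * w_h, 0.0)
        # ... then project onto the nonnegative orthant.
        a = np.clip(a, 0.0, None)
    return a
```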

Result: Method achieved excellent performance in stain quantification validated against multispectral imaging results. When applied to distinguish LEGH (precancerous gastric-type adenocarcinoma) from normal endocervical cells, stain abundance features clearly separated groups, with a classifier achieving 98.0% accuracy.

Conclusion: The technique successfully converts subjective color impressions into numerical markers, demonstrating strong promise for RGB-based stain unmixing in quantitative diagnosis of cervical lesions.

Abstract: The Papanicolaou stain, consisting of five dyes, provides extensive color information essential for cervical cancer cytological screening. The visual observation of these colors is subjective and difficult to characterize. Direct RGB quantification is unreliable because RGB intensities vary with staining and imaging conditions. Stain unmixing offers a promising alternative by quantifying dye amounts. In previous work, multispectral imaging was utilized to estimate the dye amounts of Papanicolaou stain. However, its application to RGB images presents a challenge since the number of dyes exceeds the three RGB channels. This paper proposes a novel training-free Papanicolaou stain unmixing method for RGB images. This model enforces (i) nonnegativity, (ii) weighted nucleus sparsity for hematoxylin, and (iii) total variation smoothness, resulting in a convex optimization problem. Our method achieved excellent performance in stain quantification when validated against the results of multispectral imaging. We further used it to distinguish cells in lobular endocervical glandular hyperplasia (LEGH), a precancerous gastric-type adenocarcinoma lesion, from normal endocervical cells. Stain abundance features clearly separated the two groups, and a classifier based on stain abundance achieved 98.0% accuracy. By converting subjective color impressions into numerical markers, this technique highlights the strong promise of RGB-based stain unmixing for quantitative diagnosis.

[455] TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks

Johannes Kiechle, Stefan M. Fischer, Daniel M. Lang, Cosmin I. Bercea, Matthew J. Nyflot, Lina Felsner, Julia A. Schnabel, Jan C. Peeken

Main category: eess.IV

TL;DR: TomoGraphView: A framework that uses omnidirectional volume slicing and spherical graph-based feature aggregation for 3D medical image analysis, overcoming limitations of traditional slice-based approaches.

DetailsMotivation: Current methods using 2D foundation models for 3D medical scans have limitations: standard axial/sagittal/coronal slicing fails to capture structures not aligned with these views, and slice features are aggregated independently, losing 3D spatial coherence.

Method: Proposes TomoGraphView with two key components: 1) Omnidirectional volume slicing that samples both canonical and non-canonical cross-sections from uniformly distributed points on a sphere enclosing the volume, and 2) Spherical graph-based feature aggregation that preserves 3D geometry and spatial relationships across slices.
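
Uniformly distributed points on a sphere are commonly generated with a Fibonacci lattice; the sketch below produces such slicing directions purely as background. How TomoGraphView/OmniSlicer actually parameterize the cross-sections is not stated in the summary.

```python
import numpy as np

def fibonacci_sphere_directions(n=64):
    """Approximately uniform unit vectors on the sphere (Fibonacci lattice).

    Each returned direction could serve as the normal of one oblique slicing
    plane through the volume centre, covering canonical and non-canonical views.
    """
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - (2 * i + 1) / n                  # evenly spaced heights in (-1, 1)
    theta = 2 * np.pi * i / golden           # golden-ratio longitude increments
    r = np.sqrt(1 - z ** 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
```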

Result: The framework is implemented with publicly accessible code and a user-friendly library (OmniSlicer) for omnidirectional volume slicing, providing tools for the research community.

Conclusion: TomoGraphView addresses fundamental limitations of current 2D foundation model approaches to 3D medical scans by capturing true spatial extent of structures regardless of orientation and preserving 3D spatial coherence through graph-based aggregation.

Abstract: The sharp rise in medical tomography examinations has created a demand for automated systems that can reliably extract informative features for downstream tasks such as tumor characterization. Although 3D volumes contain richer information than individual slices, effective 3D classification remains difficult: volumetric data encode complex spatial dependencies, and the scarcity of large-scale 3D datasets has constrained progress toward 3D foundation models. As a result, many recent approaches rely on 2D vision foundation models trained on natural images, repurposing them as feature extractors for medical scans with surprisingly strong performance. Despite their practical success, current methods that apply 2D foundation models to 3D scans via slice-based decomposition remain fundamentally limited. Standard slicing along axial, sagittal, and coronal planes often fails to capture the true spatial extent of a structure when its orientation does not align with these canonical views. More critically, most approaches aggregate slice features independently, ignoring the underlying 3D geometry and losing spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. Instead of restricting the model to axial, sagittal, or coronal planes, our method samples both canonical and non-canonical cross-sections generated from uniformly distributed points on a sphere enclosing the volume. We publicly share our accessible code base at http://github.com/compai-lab/2025-MedIA-kiechle and provide a user-friendly library for omnidirectional volume slicing at https://pypi.org/project/OmniSlicer.

Last updated: 2025-12-19