Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning
Ruixiang Mao, Xiangnan Ma, Dan Chen, Ziming Zhu, Yuan Ge, Aokai Hao, Haishu Zhao, Yifu Huo, Qing Yang, Kaiyan Chang, Xiaoqian Liu, Chenglong Wang, Qiaozhi He, Tong Xiao, Jingbo Zhu
Main category: cs.SD
TL;DR: MPAR² improves audio reasoning in Large Audio-Language Models by addressing perception decay through dynamic perceptual reasoning and reinforcement learning, achieving significant accuracy gains on audio reasoning benchmarks.
Details
Motivation: Large Audio-Language Models show counterintuitive behavior where post-training for structured reasoning trajectories yields marginal or negative gains compared to direct answering. The authors investigate this phenomenon and identify that LALMs struggle with audio perception during reasoning, with performance decaying as reasoning length increases.
Method: 1) Introduce the CAFE evaluation framework to quantify audio reasoning errors. 2) Propose the MPAR² paradigm, which encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. 3) Use reinforcement learning to train the model to attend to the audio input and adapt its reasoning budget to task complexity.
Result: MPAR² improves perception performance on CAFE from 31.74% to 63.51%, effectively mitigates perception decay, and achieves 74.59% accuracy on MMAU benchmark. The method reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.
Conclusion: The paper addresses a critical bottleneck in audio reasoning where perception decays with reasoning length. MPAR² provides an effective solution through dynamic perceptual reasoning and reinforcement learning, significantly improving both perception and reasoning capabilities in Large Audio-Language Models.
Abstract: Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), an unintuitive phenomenon exists: post-training models for structured reasoning trajectories results in marginal or even negative gains compared to post-training for direct answering. To investigate it, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length extends. To address it, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.
Relevance: 9/10
[2] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki
Main category: cs.CL
TL;DR: Real-time video commentary generation using MLLMs, with a novel dynamic interval-based decoding approach that improves timing alignment without fine-tuning.
Details
Motivation: Current MLLM-based approaches for video commentary generation focus on content but ignore timing, which is crucial for real-time applications in sports, esports, and livestreaming.
Method: Two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts prediction timing based on the estimated duration of the previous utterance, enabling pause-aware generation without fine-tuning.
Result: Dynamic interval-based decoding generates commentary more closely aligned with human utterance timing and content using prompting alone, validated on Japanese and English datasets of racing and fighting games.
Conclusion: Prompting-based approaches can effectively handle both the content and timing aspects of real-time video commentary generation, with dynamic interval decoding showing superior performance.
Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
Relevance: 9/10
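The dynamic-interval idea is simple enough to sketch: the next prediction is scheduled only once the previous utterance has, by estimate, finished being spoken. A minimal Python sketch, assuming a constant speaking rate and a minimum gap (both illustrative constants, not values from the paper, which uses model-based duration estimates):

```python
def estimate_duration(text, chars_per_second=12.0):
    """Crude utterance-duration model: constant speaking rate
    (an illustrative assumption standing in for a real estimator)."""
    return len(text) / chars_per_second

def dynamic_interval_schedule(utterances, start=0.0, min_gap=0.5):
    """Assign a timestamp to each utterance so the next prediction fires
    after the estimated duration of the previous one (pause-aware decoding,
    sketched without any model calls)."""
    t, schedule = start, []
    for utterance in utterances:
        schedule.append((round(t, 2), utterance))
        t += max(estimate_duration(utterance), min_gap)
    return schedule
```

In a live system the same loop would wrap the MLLM call, with each call issued at its scheduled timestamp rather than at a fixed interval.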
[3] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen
Main category: cs.LG
TL;DR: MUSE is a multimodal safety evaluation platform that tests whether LLM alignment generalizes to audio, image, and video inputs using cross-modal payload generation, multi-turn attack algorithms, and modality switching.
Details
Motivation: Current safety evaluation of LLMs is mostly text-centric, lacking infrastructure to systematically test whether alignment generalizes to multimodal inputs (audio, image, video). There is a need for comprehensive cross-modal safety testing.
Method: Developed the MUSE platform with automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy. Introduced Inter-Turn Modality Switching (ITMS) to augment attacks with per-turn modality rotation. Uses a dual-metric framework distinguishing hard vs. soft Attack Success Rate.
Result: Multi-turn strategies achieved up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS accelerated convergence by destabilizing early-turn defenses, though it did not uniformly raise final ASR. Modality effects were model-family-specific rather than universal.
Conclusion: Multimodal safety evaluation reveals vulnerabilities not captured by text-only testing. Cross-modal attacks can bypass defenses, and modality effects vary by model family, highlighting need for provider-aware cross-modal safety testing.
Abstract: Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
Relevance: 9/10
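At its core, the ITMS augmentation is a per-turn rotation over a modality list. A minimal Python sketch (the modality names and the turn representation are illustrative assumptions, not the platform's API):

```python
from itertools import cycle

def itms_turns(turn_prompts, modalities=("text", "image", "audio")):
    """Pair each turn of a multi-turn attack with the next modality in a
    fixed rotation, so consecutive turns arrive over different channels."""
    rotation = cycle(modalities)
    return [(next(rotation), prompt) for prompt in turn_prompts]
```

Each (modality, prompt) pair would then be rendered into that modality (e.g., text-to-speech for audio) before being sent to the target model.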
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 100]
- cs.CV [Total: 190]
- cs.AI [Total: 117]
- cs.SD [Total: 17]
- cs.LG [Total: 203]
- cs.MA [Total: 5]
- cs.MM [Total: 2]
- eess.AS [Total: 14]
- eess.IV [Total: 9]
cs.CL
[1] A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences
Marcelo A. Montemurro, Mirko Degli Esposti
Main category: cs.CL
TL;DR: A surrogate model that preserves both symbol frequencies and long-range correlations in symbolic sequences like language and DNA, using fractional Gaussian noise mapping to match empirical histograms and DFA scaling.
Details
Motivation: Existing surrogate models for symbolic sequences (like language and DNA) typically preserve either frequency distributions (e.g., Zipf's law) or correlation properties, but not both simultaneously. This limitation hinders comprehensive analysis of structural features in symbolic systems.
Method: The method generates surrogates by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. This preserves the empirical symbol frequencies and reproduces the long-range correlation structure quantified by the detrended fluctuation analysis (DFA) exponent, while randomizing short-range dependencies.
Result: The model successfully generates surrogates that match original sequences in both first-order statistics (symbol frequencies) and long-range scaling (DFA exponents). Validation on English and Latin texts, as well as genomic DNA, shows accurate reproduction of base composition and DFA scaling properties.
Conclusion: This approach provides a principled tool for disentangling structural features of symbolic systems and testing hypotheses about the origin of scaling laws and memory effects across language, DNA, and other symbolic domains, addressing a key limitation in existing surrogate modeling techniques.
Abstract: Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf’s law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.
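The frequency-preserving assignment can be sketched independently of how the correlated noise is produced: each symbol receives a contiguous block of ranks of the noise series whose size equals that symbol's empirical count. A minimal Python sketch (the rank-block ordering and the AR(1) stand-in for fractional Gaussian noise are illustrative assumptions, not the paper's construction):

```python
import numpy as np

def surrogate_from_noise(noise, counts):
    """Map a real-valued correlated series onto symbols so the output has
    exactly the requested symbol frequencies; long-range structure of the
    noise is inherited through the rank ordering."""
    noise = np.asarray(noise)
    assert sum(counts.values()) == len(noise)
    order = np.argsort(noise, kind="stable")   # positions sorted by value
    out = np.empty(len(noise), dtype=object)
    start = 0
    for symbol, count in counts.items():       # contiguous rank block per symbol
        out[order[start:start + count]] = symbol
        start += count
    return out.tolist()

# AR(1) series as a correlated stand-in for FGN (illustrative only)
rng = np.random.default_rng(0)
x = np.zeros(1000)
for i in range(1, 1000):
    x[i] = 0.9 * x[i - 1] + rng.normal()
surrogate = surrogate_from_noise(x, {"a": 500, "b": 300, "c": 200})
```

Because only the ranks of the noise matter, the symbol frequencies are matched exactly by construction, while the temporal ordering (and hence the DFA scaling of the noise) carries over to the symbol sequence.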
[2] Universal Conceptual Structure in Neural Translation: Probing NLLB-200’s Multilingual Geometry
Kyle Elliott Mathewson
Main category: cs.CL
TL;DR: NLLB-200 neural translation model learns genealogical language structure and universal conceptual representations, showing language-neutral semantic organization similar to human bilingual cognition.
Details
Motivation: To determine whether neural machine translation models learn language-universal conceptual representations or merely cluster languages by surface similarity, bridging NLP interpretability with cognitive science theories of multilingual lexical organization.
Method: Six experiments probing the representation geometry of Meta's NLLB-200 (a 200-language encoder-decoder Transformer) using Swadesh core vocabulary across 135 languages, analyzing embedding distances, phylogenetic correlations, colexification patterns, mean-centering techniques, and semantic offset vectors.
Result: Model embedding distances correlate with phylogenetic distances (ρ=0.13, p=0.020); frequently colexified concept pairs show higher embedding similarity (U=42656, p=1.33×10⁻¹¹, d=0.96); mean-centering improves between-to-within concept distance ratio by 1.19×; semantic offset vectors show high cross-lingual consistency (mean cosine=0.84).
Conclusion: NLLB-200 has implicitly learned genealogical language structure and internalized universal conceptual associations, with geometric evidence for language-neutral conceptual organization analogous to human bilingual cognition, suggesting models capture deep semantic structure beyond surface patterns.
Abstract: Do neural machine translation models learn language-universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta’s NLLB-200, a 200-language encoder-decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model’s embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program ($ρ= 0.13$, $p = 0.020$), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non-colexified pairs ($U = 42656$, $p = 1.33 \times 10^{-11}$, $d = 0.96$), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, providing geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross-lingual consistency (mean cosine = 0.84), suggesting that second-order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
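Two of the probes above, per-language mean-centering and cross-lingual offset consistency, are easy to reproduce on any set of concept-aligned embeddings. A minimal numpy sketch (the `lang -> (n_concepts, d)` dictionary layout is an assumption about data shape, not the paper's pipeline):

```python
import numpy as np

def mean_center(embeddings):
    """Subtract each language's mean vector, removing the language-specific
    component and leaving a (hopefully) language-neutral concept space."""
    return {lang: X - X.mean(axis=0, keepdims=True)
            for lang, X in embeddings.items()}

def offset_consistency(embeddings, i, j):
    """Mean pairwise cosine similarity of the concept-pair offset vector
    (concept i minus concept j) across languages."""
    offsets = np.stack([X[i] - X[j] for X in embeddings.values()])
    offsets /= np.linalg.norm(offsets, axis=1, keepdims=True)
    sims = offsets @ offsets.T
    upper = np.triu_indices(len(offsets), k=1)
    return float(sims[upper].mean())
```

A consistency near 1.0 would indicate that the relational structure between the two concepts points the same way in every language, as the paper reports (mean cosine = 0.84) for fundamental pairs.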
[3] Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects
Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva
Main category: cs.CL
TL;DR: Diffusion language models (DLMs) have lower memorization and PII leakage than autoregressive models (ARMs), with memorization increasing with sampling resolution.
Details
Motivation: Autoregressive language models are known to memorize and reproduce training data, raising privacy and copyright concerns. Diffusion language models have emerged as an alternative, but their memorization behavior remains unexplored due to different generation dynamics.
Method: Proposed a generalized probabilistic extraction framework unifying prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Established a theoretical relationship between sampling resolution and memorization, and conducted extensive experiments across model scales and sampling strategies.
Result: Theoretical analysis shows monotonic relationship: increasing sampling resolution strictly increases probability of exact training data extraction. Experiments demonstrate DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.
Conclusion: Diffusion language models offer better privacy protection with lower memorization of training data compared to autoregressive models, making them a promising alternative for privacy-sensitive applications.
Abstract: Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation by setting the sampling resolution maximal. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.
[4] Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
Jiangang Hao
Main category: cs.CL
TL;DR: This paper examines AI-generated essay detection methods and their generalization across different LLMs, focusing on writing assessment challenges in the age of AI writing tools.
Details
Motivation: The rise of LLMs enabling easy generation of high-quality essays raises concerns about academic integrity and authentic student work, necessitating effective detection methods for AI-generated content.
Method: The paper provides an overview of current AI-generated essay detectors, presents empirical analyses evaluating detector generalization across different LLMs using essays generated from public GRE writing prompts, and offers guidelines for responsible detector use.
Result: Findings show limitations in detector generalization across different LLMs, providing guidance for developing and retraining detectors for practical applications in educational settings.
Conclusion: Effective AI-generated essay detection requires understanding detector limitations and developing robust methods that generalize across different language models, with responsible implementation guidelines.
Abstract: Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.
[5] RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe
Main category: cs.CL
TL;DR: RO-N3WS is a Romanian speech dataset with 126+ hours of diverse audio content for improving ASR generalization in low-resource and OOD conditions, showing fine-tuning yields substantial WER improvements over zero-shot baselines.
Details
Motivation: To address the need for better generalization in automatic speech recognition, particularly for low-resource languages like Romanian and in out-of-distribution conditions, by creating a diverse benchmark dataset.
Method: Created the RO-N3WS dataset with 126+ hours of transcribed audio from diverse sources (broadcast news, audiobooks, film dialogue, children's stories, podcasts). Evaluated state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in zero-shot and fine-tuned settings, and conducted comparisons using synthetic data from expressive TTS models.
Result: Even limited fine-tuning on real speech from RO-N3WS yields substantial Word Error Rate (WER) improvements over zero-shot baselines. The diverse dataset enables robust training across stylistically distinct domains.
Conclusion: RO-N3WS provides a valuable resource for multilingual ASR research, domain adaptation, and lightweight deployment, with demonstrated effectiveness in improving ASR performance for Romanian through fine-tuning.
Abstract: We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children’s stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.
[6] GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR
Pouya Mehralian, Melissa Farasyn, Anne Breitbarth, Anne-Sophie Ghyselen, Hugo Van hamme
Main category: cs.CL
TL;DR: GLoRIA: A parameter-efficient adaptation framework for dialectal ASR that uses metadata (coordinates) to modulate low-rank updates in pre-trained encoders, achieving state-of-the-art performance with under 10% parameter updates.
Details
Motivation: Automatic Speech Recognition in dialect-heavy settings is challenging due to strong regional variation and limited labeled data. Existing methods struggle with dialect diversity and require extensive parameter updates.
Method: GLoRIA injects low-rank matrices into each feed-forward layer of a pre-trained encoder, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. This yields parameter-efficient adaptation conditioned on geographical information.
Result: On GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. It generalizes well to unseen dialects and enables interpretable adaptation patterns.
Conclusion: Metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR, demonstrating the value of geographical conditioning for speech recognition in diverse linguistic settings.
Abstract: Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. GLoRIA injects low-rank matrices into each feed-forward layer, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. GLoRIA also generalizes well to unseen dialects, including in extrapolation scenarios, and enables interpretable adaptation patterns that can be visualized geospatially. These results show metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR.
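The gated update itself is compact: the adapted weight is W plus a sum of rank-1 components, each scaled by a non-negative gate produced from the location metadata. A minimal numpy sketch (the layer shapes, the softplus output, and the tiny two-layer MLP are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def gates_from_metadata(coords, W1, b1, W2, b2):
    """Tiny gating MLP: coords -> ReLU hidden -> softplus output, so every
    rank-1 component receives a non-negative gate."""
    h = np.maximum(coords @ W1 + b1, 0.0)
    return np.log1p(np.exp(h @ W2 + b2))   # softplus > 0 everywhere

def gated_lora_update(W, U, V, gates):
    """W: (d_out, d_in); U: (r, d_out); V: (r, d_in); gates: (r,).
    Returns W + sum_r gates[r] * outer(U[r], V[r])."""
    delta = np.einsum("r,ro,ri->oi", gates, U, V)
    return W + delta
```

Because the gates are a function of the metadata alone, which rank-1 components are active for a given location can be read off directly, which is the source of the geospatially interpretable adaptation patterns the paper describes.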
[7] CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think
Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin
Main category: cs.CL
TL;DR: CoDAR improves continuous diffusion language models by replacing token rounding with a contextual autoregressive decoder that cross-attends to denoised embeddings.
Details
Motivation: Continuous diffusion language models have underperformed compared to discrete diffusion approaches despite their appealing continuous generative dynamics. The paper aims to identify the bottleneck and improve continuous diffusion for language modeling.
Method: A two-stage framework: 1) continuous diffusion in embedding space, and 2) a contextual autoregressive decoder that cross-attends to denoised embeddings and performs contextualized rounding to tokens (CoDAR). A controlled token-recovery study identifies token rounding as the primary bottleneck.
Result: CoDAR substantially improves generation quality over latent diffusion on LM1B and OpenWebText datasets, becoming competitive with strong discrete diffusion language models. Exposes simple decoder-temperature knob for fluency-diversity trade-off.
Conclusion: Continuous diffusion language models can be competitive with discrete approaches by addressing the token rounding bottleneck through contextualized autoregressive decoding, offering better control over generation quality trade-offs.
Abstract: We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token–recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two–stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context–conditional discretizer: an autoregressive Transformer decoder that cross–attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder–temperature knob to navigate the fluency–diversity trade off.
[8] How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
Main category: cs.CL
TL;DR: SteerEval is a hierarchical benchmark for evaluating LLM controllability across language features, sentiment, and personality domains with three specification levels.
Details
Motivation: LLMs are increasingly used in socially sensitive domains but exhibit unpredictable behaviors (misaligned intent, inconsistent personality) that pose significant risks, necessitating better evaluation of controllability.
Method: Introduces SteerEval, a hierarchical benchmark with three domains (language features, sentiment, personality), each structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate). Uses this framework to systematically evaluate contemporary steering methods.
Result: Evaluation reveals that control often degrades at finer-grained levels (L2 and L3), showing current steering methods struggle with nuanced control specifications.
Conclusion: SteerEval provides a principled, interpretable framework for evaluating LLM controllability, serving as a foundation for future research in safe and controllable LLM behavior.
Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
[9] ExpGuard: LLM Content Moderation in Specialized Domains
Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh, Jaegul Choo, Jungmin Son
Main category: cs.CL
TL;DR: ExpGuard: A specialized guardrail model for protecting LLMs against harmful content in financial, medical, and legal domains, with a curated dataset of 58,928 labeled prompts.
Details
Motivation: Current guardrail models focus on general human-LLM interactions, leaving LLMs vulnerable to harmful content in domain-specific contexts with technical jargon and specialized concepts. There is a need for robust safety mechanisms in specialized domains such as finance, medicine, and law.
Method: Introduces ExpGuard, a specialized guardrail model for domain-specific safety, and the ExpGuardMix dataset of 58,928 labeled prompts from the financial, medical, and legal domains. The dataset is split into ExpGuardTrain for training and ExpGuardTest for evaluation, with expert annotations for domain-specific adversarial content.
Result: ExpGuard delivers competitive performance across eight public benchmarks and shows exceptional resilience to domain-specific adversarial attacks, outperforming state-of-the-art models like WildGuard by up to 8.9% in prompt classification and 15.3% in response classification.
Conclusion: ExpGuard addresses the critical gap in domain-specific LLM safety, providing robust protection against harmful content in specialized domains. The open-sourced code, data, and model enable adaptation to additional domains and support development of more robust guardrail systems.
Abstract: With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
[10] GPUTOK: GPU Accelerated Byte Level BPE Tokenization
Venu Gopal Kadamba, Kanishkha Jaisankar
Main category: cs.CL
TL;DR: A GPU-based byte-level BPE tokenizer that accelerates tokenization for large language models with million-token contexts, achieving 1.7-7.6x speedup over CPU tokenizers while maintaining output quality.
Details
Motivation: As LLMs move to million-token contexts, CPU tokenizers become a bottleneck because they process text sequentially while powerful GPUs remain idle, creating a need for GPU-accelerated tokenization.
Method: Built a GPU-based byte-level BPE tokenizer following GPT-2’s merge rules, with two versions: a basic BlockBPE-style kernel and an optimized version using cuCollections static map, CUB reductions, and pybind11 interface for Python integration.
Result: On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer matches CPU tokenizer outputs and achieves ~1.7x speedup over tiktoken and ~7.6x speedup over HuggingFace GPT-2 tokenizer. Profiling shows 70-80% of CUDA time spent on memory allocation.
Conclusion: GPU tokenization enables more practical long-context inference by significantly accelerating tokenization while maintaining output quality, with memory pooling identified as the next key optimization for further speed improvements.
Abstract: As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2’s merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer’s outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
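For intuition, the sequential merge loop that a GPU tokenizer like this parallelizes can be sketched as a minimal CPU reference of byte-level BPE. The merge table below is a toy stand-in, not GPT-2's ~50k learned merges, and the function name is illustrative:

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Reference byte-level BPE: repeatedly apply the best-ranked
    (lowest-rank) merge rule over adjacent token pairs until none applies.
    This is the sequential algorithm a GPU kernel spreads across the sequence."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        # Find the adjacent pair with the lowest merge rank.
        best = min(
            range(len(tokens) - 1),
            key=lambda i: merges.get((tokens[i], tokens[i + 1]), float("inf")),
        )
        pair = (tokens[best], tokens[best + 1])
        if pair not in merges:
            break  # no applicable rule remains
        tokens[best:best + 2] = [pair[0] + pair[1]]
    return tokens

# Toy merge table: rank = priority (lower merges first).
toy_merges = {(b"l", b"o"): 0, (b"lo", b"w"): 1, (b"e", b"r"): 2}
print(bpe_encode("lower", toy_merges))  # [b'low', b'er']
```

The data dependence between successive merges is exactly what makes this loop slow on CPUs for long inputs, motivating the kernel-level parallelization the paper describes.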
[11] Think, But Don’t Overthink: Reproducing Recursive Language Models
Daren Wang
Main category: cs.CL
TL;DR: This study reproduces and extends Recursive Language Models (RLMs) by scaling recursion depth beyond the original depth=1, finding that deeper recursion causes models to “overthink”: it paradoxically degrades performance, even on tasks where depth-1 helps, while exponentially increasing execution time and token costs.
Details
Motivation: The original RLM framework enables LLMs to process near-infinite contexts by offloading prompts to external REPL environments, but only used depth=1 recursion. This study aims to explore what happens when scaling recursion depth deeper, particularly investigating whether deeper recursion improves or harms performance on reasoning and retrieval tasks.
Method: Reproduced and extended the RLM framework, evaluating pure LLMs, RLMs with depth=1, and RLMs with depth=2 using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2) on S-NIAH and OOLONG benchmarks for complex reasoning and retrieval tasks.
Result: Depth-1 RLMs effectively boost accuracy on complex reasoning tasks, but deeper recursion (depth=2) causes models to “overthink” and paradoxically degrades performance. On simple retrieval tasks, RLMs degrade performance regardless of depth. Deeper recursion exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs.
Conclusion: While RLMs with depth=1 can improve performance on complex reasoning tasks, scaling recursion depth deeper is counterproductive due to “overthinking” behavior, leading to performance degradation and exponential increases in computational costs. The optimal recursion depth appears to be task-dependent and shallow.
Abstract: This project reproduces and extends the recently proposed “Recursive Language Models” (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: Deeper recursion causes models to “overthink”. While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction
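Why call counts blow up with recursion depth can be seen from the control flow alone. This toy sketch (the chunking scheme, fanout, and fake model are illustrative, not the paper's implementation) counts model invocations per depth budget:

```python
def rlm_answer(query, context, llm, max_depth, fanout=4, min_len=4, depth=0):
    """Recursive-language-model sketch: while depth budget remains and the
    context is long, split it, answer each chunk recursively, then make one
    final call to synthesize the partial answers."""
    if depth >= max_depth or len(context) <= min_len:
        return llm(query, context)
    step = max(1, len(context) // fanout)
    chunks = [context[i:i + step] for i in range(0, len(context), step)]
    partials = [rlm_answer(query, c, llm, max_depth, fanout, min_len, depth + 1)
                for c in chunks]
    return llm(query, " ".join(partials))  # synthesis call

def count_calls(max_depth, context_len=64):
    """Count how many model invocations a given recursion budget incurs."""
    n = [0]
    def fake_llm(q, c):
        n[0] += 1
        return "a"
    rlm_answer("q", "x" * context_len, fake_llm, max_depth)
    return n[0]

print(count_calls(0), count_calls(1), count_calls(2))  # 1 5 21
```

Each extra level multiplies the number of sub-calls, which matches the reported jump in wall-clock time and token cost from depth=1 to depth=2.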
[12] Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
Shubhangi Upasani, Ravi Shanker Raju, Bo Li, Mengmeing Ji, John Long, Chen Wu, Urmish Thakker, Guangtao Wang
Main category: cs.CL
TL;DR: Cross-family speculative prefill enables prompt compression using draft models from different families than target models, maintaining performance while reducing time to first token.
Details
Motivation: Prompt length is a bottleneck in agentic LLM workloads where repeated inference steps incur substantial prefill costs. Existing speculative prefill requires draft and target models to share the same tokenizer, but agentic pipelines often use models without smaller in-family draft models.
Method: Study cross-family speculative prefill where lightweight draft models from one family compress prompts for target models from different families. Evaluate Qwen, LLaMA, and DeepSeek model combinations using attention-based token importance estimation across diverse tasks.
Result: Attention-based token importance estimation transfers reliably across different model families despite architectural and tokenizer differences. Cross-model compression retains 90-100% of baseline performance, sometimes slightly improving accuracy due to denoising effects, while substantially reducing time to first token.
Conclusion: Speculative prefill depends mainly on task priors and semantic structure, serving as a generalizable prompt compression primitive. Cross-model compression is both necessary and practical for agentic systems with repeated long-context inference and heterogeneous model stacks.
Abstract: Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90-100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.
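Once per-token importance scores exist, the compression step itself is simple: keep the top-scoring prompt tokens in their original order. A sketch, with made-up scores standing in for the attention mass a draft model would assign:

```python
def compress_prompt(tokens, importance, keep_ratio=0.5):
    """Speculative-prefill-style compression sketch: keep the top-k prompt
    tokens by importance (e.g. attention mass from a small draft model),
    preserving original order for the target model's prefill."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

toks = ["The", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
scores = [0.1, 0.9, 0.2, 0.8, 0.7, 0.1, 0.05, 0.6]
print(compress_prompt(toks, scores, 0.5))  # ['quick', 'fox', 'jumps', 'dog']
```

The cross-family finding is that these importance scores transfer: scores computed by a draft from one model family select tokens that a target from another family can still answer from.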
[13] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki
Main category: cs.CL
TL;DR: Real-time video commentary generation with MLLMs using a novel dynamic interval-based decoding strategy for better timing alignment without fine-tuning.
Details
Motivation: Current MLLM-based approaches for video commentary generation focus on content but ignore timing, which is crucial for real-time applications in sports, esports, and livestreaming.
Method: Two prompting-based decoding strategies: 1) fixed-interval approach, and 2) novel dynamic interval-based decoding that adjusts prediction timing based on estimated duration of previous utterance, enabling pause-aware generation without fine-tuning.
Result: Dynamic interval-based decoding generates commentary more closely aligned with human utterance timing and content using prompting alone, validated on Japanese and English datasets of racing and fighting games
Conclusion: Prompting-based approaches can effectively handle both content and timing aspects of real-time video commentary generation, with dynamic interval decoding showing superior performance
Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
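The dynamic-interval idea reduces to scheduling the next generation step for when the previous utterance is estimated to finish speaking, rather than at a fixed tick. A minimal sketch; the speaking-rate constant, minimum gap, and function name are assumptions, not values from the paper:

```python
def next_decode_time(now, utterance, chars_per_sec=12.0, min_gap=0.5):
    """Dynamic interval-based decoding sketch: the next commentary
    prediction is scheduled after the previous utterance has roughly
    finished being spoken, yielding pause-aware timing without fine-tuning."""
    est_duration = len(utterance) / chars_per_sec  # crude duration estimate
    return now + max(min_gap, est_duration)

print(next_decode_time(10.0, "The red car takes the lead!"))  # 12.25
```

A fixed-interval baseline would instead return `now + interval` regardless of how long the previous utterance takes to say, which is what causes overlap or dead air.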
[14] Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, Koh Takeuchi
Main category: cs.CL
TL;DR: M3IRT is a multimodal item response theory framework that identifies genuinely cross-modal questions in MLLM benchmarks by decomposing model ability and question difficulty into image-only, text-only, and cross-modal components.
Details
Motivation: Current multimodal benchmarks contain many shortcut questions that can be solved using only a single modality, leading to unreliable model rankings and unnecessarily large, computationally expensive evaluations.
Method: Extends classical item response theory to multimodal settings by decomposing both model ability and item difficulty into three components: image-only, text-only, and cross-modal. This allows identification of genuinely cross-modal questions and estimation of models’ true cross-modal reasoning ability.
Result: M3IRT effectively prioritizes cross-modal questions over shortcuts, preserves ranking fidelity even when 50% of items are low-quality, and reduces evaluation costs while improving reliability across 24 vision-language models on three benchmarks.
Conclusion: M3IRT provides a practical framework for assessing true cross-modal reasoning in MLLMs and refining multimodal benchmarks by focusing on high-quality, genuinely multimodal questions.
Abstract: Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question’s cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
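The decomposition can be written as a small logistic model. The exact parameterization in the paper likely differs (e.g. discrimination parameters, per-dimension weights), so this is only an illustrative form:

```python
import math

def p_correct(ability, difficulty):
    """M3IRT-style sketch: ability and difficulty each decompose into
    image-only, text-only, and cross-modal components; response probability
    is a logistic over the summed ability-difficulty gaps."""
    z = sum(ability[k] - difficulty[k] for k in ("image", "text", "cross"))
    return 1.0 / (1.0 + math.exp(-z))

# A model strong unimodally but weak cross-modally passes a shortcut item
# yet fails a genuinely cross-modal one -- the gap the framework exposes.
model = {"image": 1.0, "text": 1.0, "cross": 0.0}
easy_shortcut = {"image": 0.0, "text": 0.0, "cross": 0.0}
cross_modal = {"image": 0.0, "text": 0.0, "cross": 4.0}
print(p_correct(model, easy_shortcut))  # ~0.88
print(p_correct(model, cross_modal))    # ~0.12
```

Fitting such a model over many (model, item) outcomes is what lets M3IRT score each item's cross-modal difficulty and flag shortcut questions.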
[15] ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya
Main category: cs.CL
TL;DR: Novel method reduces content effects in multilingual reasoning by transforming syllogisms into canonical logical representations with deterministic parsing for validity determination.
Details
Motivation: Large language models suffer from content effects in reasoning tasks, especially in multilingual contexts, where biases can affect logical validity judgments.
Method: Introduces explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity, reducing reliance on content-based biases.
Result: Achieves top-5 rankings across all subtasks on SemEval-2026 Task 11 multilingual benchmark while substantially reducing content effects, offering competitive alternative to complex fine-tuning or activation-level interventions.
Conclusion: The approach effectively mitigates content effects in multilingual reasoning tasks through structural abstraction and deterministic parsing, providing a practical solution without requiring extensive model modifications.
Abstract: Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
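Once a syllogism is abstracted into canonical (quantifier, subject, predicate) form, validity can be decided deterministically and content-free. One illustrative determinization (not necessarily the authors' procedure) is exhaustive counter-model search over the three terms:

```python
from itertools import chain, combinations, product

def holds(stmt, model, idx):
    """Evaluate a categorical statement over a finite model. A model is a
    set of element types, each a 0/1 tuple over the terms; idx maps a term
    name to its position in the tuple."""
    q, x, y = stmt
    xs = [t for t in model if t[idx[x]]]
    if q == "all":
        return all(t[idx[y]] for t in xs)
    if q == "no":
        return not any(t[idx[y]] for t in xs)
    if q == "some":
        return any(t[idx[y]] for t in xs)
    if q == "some_not":
        return any(not t[idx[y]] for t in xs)

def valid(premises, conclusion, terms=("S", "M", "P")):
    """A syllogism is valid iff no model makes every premise true while
    falsifying the conclusion; with 3 terms there are only 2^8 models."""
    idx = {t: i for i, t in enumerate(terms)}
    types = list(product((0, 1), repeat=len(terms)))
    models = chain.from_iterable(combinations(types, r) for r in range(len(types) + 1))
    for model in models:
        if all(holds(p, model, idx) for p in premises) and not holds(conclusion, model, idx):
            return False  # found a counter-model
    return True

# Barbara: All M are P, All S are M |- All S are P
print(valid([("all", "M", "P"), ("all", "S", "M")], ("all", "S", "P")))  # True
```

Because the check operates only on the abstract structure, the content words that trigger LLM biases never enter the validity decision, which is the point of the approach.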
[16] HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse
Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar
Main category: cs.CL
TL;DR: HateMirage: A dataset of 4,530 faux hate comments from misinformation-related YouTube discussions, annotated with multi-dimensional explanations (Target, Intent, Implication) to advance reasoning about subtle hate speech emerging from fake narratives.
Details
Motivation: Existing hate speech datasets focus on overt toxicity and underrepresent subtle hate embedded in misinformation narratives. There's a need for datasets that capture nuanced hate speech emerging from fake or distorted narratives to advance reasoning and explainability research.
Method: Constructed dataset by identifying debunked misinformation claims from fact-checking sources, tracing related YouTube discussions, and collecting 4,530 user comments. Annotated each comment along three dimensions: Target (who is affected), Intent (underlying motivation), and Implication (potential social impact).
Result: Benchmarked multiple open-source language models using ROUGE-L F1 and Sentence-BERT similarity. Found that explanation quality depends more on pretraining diversity and reasoning-oriented data rather than model scale alone.
Conclusion: HateMirage establishes a new benchmark for interpretable hate detection by coupling misinformation reasoning with harm attribution, advancing responsible AI research on subtle hate speech.
Abstract: Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data rather than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
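The ROUGE-L F1 used to score explanation coherence is computable from the longest common subsequence of the two texts. A minimal word-level sketch (real evaluation pipelines usually add tokenization and stemming first):

```python
def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1: precision and recall of the longest common subsequence
    of words between a reference explanation and a generated one."""
    ref, hyp = reference.split(), hypothesis.split()
    # LCS length via standard dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == h else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r / (p + r)

print(round(rouge_l_f1("the cat sat on the mat", "the cat on the mat"), 3))  # 0.909
```

Sentence-BERT similarity, the paper's second metric, complements this by comparing embeddings rather than surface word overlap.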
[17] Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi
Main category: cs.CL
TL;DR: Graph-GRPO: A novel topology optimization framework for LLM-based multi-agent systems using group relative policy optimization to address gradient variance and credit assignment problems in communication graph learning.
Details
Motivation: Current reinforcement learning approaches for optimizing communication topology in LLM-based multi-agent systems suffer from severe gradient variance and credit assignment problems due to reliance on single-sample policy gradients with absolute rewards, leading to non-informative signals for both simple and difficult queries.
Method: Proposes Graph-GRPO which integrates Group Relative Policy Optimization. Instead of evaluating single topologies in isolation, it samples a group of diverse communication graphs for each query and computes edge advantages based on relative performance within the group, normalizing rewards to mitigate task difficulty variance.
Result: Extensive experiments on reasoning and code generation benchmarks show Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
Conclusion: Graph-GRPO effectively addresses fundamental challenges in LLM-based multi-agent system topology optimization through group-based relative reward normalization, enabling more stable training and better discovery of optimal communication structures.
Abstract: Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
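The group-relative normalization at the core of GRPO-style methods is compact: each sampled topology's advantage is its reward relative to the group mean, scaled by the group standard deviation. A sketch (population std; the small epsilon guarding the all-tied case is a common convention):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization over one query's group of sampled
    topologies: advantage = (reward - group mean) / (group std + eps).
    Absolute correctness rewards become informative relative signals."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

print(group_relative_advantages([1.0, 1.0, 0.0, 0.0]))  # ~[1, 1, -1, -1]
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0, 0, 0, 0]: no gradient
```

The second case shows the failure mode the abstract describes: when every sampled graph succeeds (or every one fails), a single-sample absolute reward would still push the policy, while the group-relative advantage correctly yields no signal.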
[18] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Main category: cs.CL
TL;DR: MedXIAOHE is a medical vision-language foundation model that achieves SOTA performance on medical benchmarks through entity-aware pretraining, reinforcement learning for reasoning, and tools to reduce hallucinations in clinical applications.
Details
Motivation: To advance general-purpose medical understanding and reasoning for real-world clinical applications, addressing challenges like heterogeneous medical data, rare diseases, and the need for reliable, verifiable diagnostic reasoning.
Method: Uses entity-aware continual pretraining to organize heterogeneous medical corpora, incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, and integrates user-preference rubrics with evidence-grounded reasoning for low-hallucination report generation.
Result: Achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities, with improved adherence to medical instructions and reduced hallucinations.
Conclusion: MedXIAOHE demonstrates practical design choices for medical vision-language models that enable expert-level reasoning, verifiable decision traces, and reliable real-world clinical applications, with released insights to inspire further research.
Abstract: We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
[19] Sensory-Aware Sequential Recommendation via Review-Distilled Representations
Yeo Chan Yoon
Main category: cs.CL
TL;DR: ASEGR framework enhances sequential recommendation by extracting sensory attributes from product reviews using LLMs and distilling them into embeddings that improve recommendation performance.
Details
Motivation: Current sequential recommendation models rely primarily on behavioral interaction patterns and lack rich semantic understanding of items. Product reviews contain valuable sensory information (color, scent, texture) that could enhance item representations and recommendation quality.
Method: Two-stage pipeline: 1) Fine-tune LLM as teacher to extract structured sensory attribute-value pairs from unstructured reviews, 2) Distill extracted structures into compact student transformer to produce fixed-dimensional sensory embeddings, then integrate these embeddings into standard sequential recommender architectures.
Result: Sensory-enhanced models consistently outperform identifier-based counterparts across four Amazon domains. Integration with SASRec, BERT4Rec, and BSARec shows improved performance. Extracted attributes align closely with human perceptions, enabling interpretable connections between language descriptions and recommendation behavior.
Conclusion: Sensory attribute distillation provides a principled, scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning, offering complementary signals to behavioral patterns.
Abstract: We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, ASEGR (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute–value pairs, such as “color: matte black” and “scent: vanilla”, from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.
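The role of the distilled student, collapsing a variable set of extracted attribute-value pairs into one fixed-dimensional item vector, can be mimicked with a simple hashing trick. This is purely illustrative: the paper trains a compact transformer to learn this mapping rather than hashing.

```python
import random
import zlib

def sensory_embedding(attr_pairs, dim=8):
    """Illustrative stand-in for ASEGR's student model: map each extracted
    (attribute, value) pair to a stable pseudo-random vector via a hash-seeded
    RNG, then average into one fixed-dimensional sensory embedding."""
    if not attr_pairs:
        return [0.0] * dim
    vecs = []
    for attr, val in attr_pairs:
        rng = random.Random(zlib.crc32(f"{attr}:{val}".encode()))
        vecs.append([rng.gauss(0, 1) for _ in range(dim)])
    return [sum(col) / len(vecs) for col in zip(*vecs)]

emb = sensory_embedding([("color", "matte black"), ("scent", "vanilla")])
print(len(emb))  # 8
```

In the actual pipeline such a vector would be added alongside the identifier embedding inside SASRec, BERT4Rec, or BSARec as an extra item-level representation.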
[20] Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen
Main category: cs.CL
TL;DR: DiSE is a self-evaluation confidence quantification method for diffusion LLMs that uses token regeneration probabilities for quality assessment and enables flexible-length generation based on self-assessment.
Details
Motivation: Diffusion LLMs have non-sequential, bidirectionally masked generation that makes quality assessment difficult, creating a need for effective self-evaluation methods to enhance reliability and controllability.
Method: DiSE quantifies confidence by computing the probability of regenerating all tokens in the generated sequence given full context. This enables likelihood estimation and uncertainty quantification, and is extended to a flexible-length generation framework that adaptively controls sequence length based on self-assessment.
Result: DiSE shows positive correlation with semantic coherence and answer accuracy. Experiments demonstrate effectiveness in likelihood evaluation, uncertainty quantification, and flexible-length generation tasks.
Conclusion: DiSE provides a simple yet effective self-evaluation method for diffusion LLMs, enabling better quality assessment and adaptive generation control through token regeneration probability analysis.
Abstract: Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model’s self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
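The confidence signal reduces to a length-normalized regeneration likelihood. A sketch using the geometric mean of per-token regeneration probabilities; the paper's exact aggregation may differ, and the probabilities here are placeholders for what the dLLM assigns when asked to re-predict each token given the full unmasked context:

```python
import math

def dise_confidence(regen_probs):
    """DiSE-style sketch: score a generated sequence by how probable the
    model finds regenerating each of its own tokens given the full context,
    aggregated as a length-normalized (geometric-mean) likelihood."""
    log_sum = sum(math.log(p) for p in regen_probs)
    return math.exp(log_sum / len(regen_probs))

print(dise_confidence([0.9, 0.9, 0.9]))  # ~0.9: confidently regenerated
print(dise_confidence([0.9, 0.9, 0.1]))  # ~0.43: one shaky token drags it down
```

Length normalization is what lets scores of candidate continuations of different lengths be compared, which is the hook for the flexible-length generation framework built on top of DiSE.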
[21] From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang, Zimu Lu, Yunqiao Yang, Yuxuan Hu, Linda Wei, Mingjie Zhan, Hongsheng Li
Main category: cs.CL
TL;DR: KMP-Bench: A comprehensive K-8 mathematical pedagogical benchmark for evaluating LLMs’ teaching effectiveness through multi-turn dialogues and granular skill assessments, revealing that while LLMs excel at verifiable solutions, they struggle with nuanced pedagogical principles.
Details
Motivation: Current evaluations of LLMs for AI mathematical tutoring rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. There's a need for a more holistic benchmark that evaluates both pedagogical principles and foundational tutoring abilities.
Method: Introduces KMP-Bench with two modules: 1) KMP-Dialogue evaluates holistic pedagogical capabilities against six core principles (Challenge, Explanation, Feedback, etc.) using multi-turn dialogue datasets, and 2) KMP-Skills provides granular assessment of foundational tutoring abilities including multi-turn problem-solving, error detection/correction, and problem generation. Also presents KMP-Pile, a large-scale 150K dialogue dataset for fine-tuning.
Result: Evaluations reveal a key disparity: leading LLMs excel at tasks with verifiable solutions but struggle with nuanced application of pedagogical principles. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, demonstrating the value of pedagogically-rich training data.
Conclusion: KMP-Bench provides a comprehensive framework for assessing LLMs’ mathematical pedagogical capabilities, highlighting the gap between technical problem-solving and effective teaching. Pedagogically-rich training data significantly improves AI math tutoring performance.
Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.
[22] OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier
Main category: cs.CL
TL;DR: MLLMs can match traditional OCR+MLLM performance for document information extraction, with image-only input achieving comparable results to OCR-enhanced approaches when using proper schema, exemplars, and instructions.
Details
Motivation: To determine whether simpler MLLM-only pipelines can truly match traditional OCR+MLLM setups for document information extraction, and to provide practical guidance for advancing this field.
Method: Large-scale benchmarking study evaluating various out-of-the-box MLLMs on business-document information extraction, plus an automated hierarchical error analysis framework using LLMs to systematically diagnose error patterns.
Result: OCR may not be necessary for powerful MLLMs as image-only input achieves comparable performance to OCR-enhanced approaches; carefully designed schema, exemplars, and instructions further enhance MLLM performance.
Conclusion: MLLM-only pipelines can be effective for document information extraction, offering practical guidance for advancing the field while simplifying the processing pipeline.
Abstract: Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline–while simpler–can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.
[23] Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs
Prarthana Bhattacharyya, Joshua Mitton, Ralph Abboud, Simon Woodhead
Main category: cs.CL
TL;DR: LLMs underperform domain-specific knowledge tracing models in predicting student responses, with lower accuracy, slower inference, and higher deployment costs.
Details
Motivation: To evaluate whether large language models can effectively replace specialized knowledge tracing models for predicting student responses in educational platforms, comparing performance, scalability, and cost.
Method: Comparative analysis of multiple LLMs and knowledge tracing models across predictive performance metrics (accuracy, F1 scores), deployment costs, and inference speed on student question-response prediction tasks.
Result: Knowledge tracing models outperform LLMs in accuracy and F1 scores on domain-specific educational prediction tasks, with LLMs being orders of magnitude slower and more expensive to deploy.
Conclusion: Domain-specific models remain superior for educational prediction tasks, and current closed-source LLMs should not be treated as universal solutions for all specialized applications.
Abstract: Predicting future student responses to questions is particularly valuable for educational learning platforms where it enables effective interventions. One of the key approaches to do this has been through the use of knowledge tracing (KT) models. These are small, domain-specific, temporal models trained on student question-response data. KT models are optimised for high accuracy on specific educational domains and have fast inference and scalable deployments. The rise of Large Language Models (LLMs) motivates us to ask the following questions: (1) How well can LLMs perform at predicting students’ future responses to questions? (2) Are LLMs scalable for this domain? (3) How do LLMs compare to KT models on this domain-specific task? In this paper, we compare multiple LLMs and KT models across predictive performance, deployment cost, and inference speed to answer the above questions. We show that KT models outperform LLMs with respect to accuracy and F1 scores on this domain-specific task. Further, we demonstrate that LLMs are orders of magnitude slower than KT models and cost orders of magnitude more to deploy. This highlights the importance of domain-specific models for education prediction tasks and the fact that current closed source LLMs should not be used as a universal solution for all tasks.
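The abstract does not name which KT models were compared; classic Bayesian Knowledge Tracing illustrates why such models are tiny and fast: a handful of parameters per skill and a constant-time update per response (parameter values below are illustrative):

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian Knowledge Tracing step: Bayesian posterior over skill
    mastery given the observed response, then the learning transition."""
    if correct:
        post = p_know * (1 - slip) / (p_know * (1 - slip) + (1 - p_know) * guess)
    else:
        post = p_know * slip / (p_know * slip + (1 - p_know) * (1 - guess))
    return post + (1 - post) * learn

def predict_correct(p_know, slip=0.1, guess=0.2):
    """Probability the student answers the next item correctly."""
    return p_know * (1 - slip) + (1 - p_know) * guess

# Trace one student's mastery estimate over four responses.
p = 0.3
for resp in [True, True, False, True]:
    p = bkt_update(p, resp)
```

The entire per-skill model is four scalars, which is one reason inference can be orders of magnitude cheaper than an LLM call.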
[24] A Browser-based Open Source Assistant for Multimodal Content Verification
Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini, Denis Teyssou, Kalina Bontcheva
Main category: cs.CL
TL;DR: A browser-based verification assistant tool that integrates multiple NLP classifiers to help journalists detect disinformation and AI-generated content through a unified interface.
Details
Motivation: Journalists and fact-checkers face challenges in rapidly verifying digital media due to disinformation and AI-generated content. Existing NLP models for credibility detection remain inaccessible and unintegrated into daily workflows.
Method: Developed the VERIFICATION ASSISTANT as a browser-based tool that allows users to submit URLs/media files, automatically extracts content, routes it to backend NLP classifiers, and delivers actionable credibility signals in an easy-to-digest format.
Result: The tool is a core component of the widely adopted VERIFICATION PLUGIN with 140,000+ users, providing a unified framework for detecting disinformation, estimating AI-generated content, and offering verification guidance.
Conclusion: The VERIFICATION ASSISTANT successfully bridges the gap between advanced NLP detection models and non-expert users by integrating multiple credibility analysis services into a practical, accessible tool for real-world disinformation detection.
Abstract: Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.
[25] The Distribution of Phoneme Frequencies across the World’s Languages: Macroscopic and Microscopic Information-Theoretic Models
Fermín Moscoso del Prado Martín, Suchir Salhan
Main category: cs.CL
TL;DR: Phoneme frequency distributions follow predictable patterns: macroscopically via Dirichlet statistics scaling with inventory size, microscopically via Maximum Entropy models incorporating linguistic constraints.
Details
Motivation: To understand the universal patterns in phoneme frequency distributions across languages and provide an information-theoretic explanation for why certain phonemes are more frequent than others.
Method: Two-level analysis: 1) Macroscopic analysis using order statistics of symmetric Dirichlet distribution to model rank-frequency distributions, 2) Microscopic analysis using Maximum Entropy models incorporating articulatory, phonotactic, and lexical constraints.
Result: Phoneme rank-frequency distributions follow Dirichlet statistics with concentration parameter scaling systematically with inventory size (larger inventories have lower relative entropy). Maximum Entropy models accurately predict language-specific phoneme probabilities.
Conclusion: Phoneme frequency structure can be explained through unified information-theoretic principles, revealing robust compensation effects and systematic patterns across languages.
Abstract: We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.
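The macroscopic claim is easy to simulate: draw a rank-ordered frequency vector from a symmetric Dirichlet (via normalized Gamma variates) and compute its relative entropy. The inventory size and concentration value below are illustrative, not the paper's fitted parameters:

```python
import math
import random

def sample_symmetric_dirichlet(k, alpha, rng):
    """One draw from Dir(alpha, ..., alpha) via normalized Gamma(alpha)
    variates, returned in rank-frequency (descending) order."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return sorted((x / total for x in g), reverse=True)

def relative_entropy(p):
    """Shannon entropy of p divided by its maximum, log(len(p))."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

rng = random.Random(0)
freqs = sample_symmetric_dirichlet(30, 0.5, rng)  # a 30-phoneme inventory
```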
[26] Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui
Main category: cs.CL
TL;DR: LVLMs struggle with understanding relationships in diagrams, particularly edges and arrows. Probing reveals edge information isn’t linearly separable in vision encoder but emerges later in language model, while node info is encoded earlier in vision encoder.
Details
Motivation: Large vision-language models perform well on diagram understanding but still struggle with relational understanding between elements, especially directed edges like arrows. The researchers want to investigate why LVLMs have this limitation by examining their internal representations.
Method: Created a synthetic diagram dataset based on directed graphs. Conducted probing experiments on LVLMs’ internal representations to analyze how different visual information (nodes vs edges) is encoded at different stages of processing.
Result: Edge information is not linearly separable in the vision encoder and only becomes linearly encoded in text tokens in the language model. Node information and global structural features are already linearly encoded in individual hidden states of the vision encoder.
Conclusion: The stage at which linearly separable representations form varies by visual information type. Delayed emergence of edge representations may explain why LVLMs struggle with relational understanding like interpreting edge directions, which require more abstract, compositionally integrated processing.
Abstract: Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.
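Probing of this kind amounts to fitting a linear classifier on frozen hidden states and reading its accuracy as a measure of linear separability. A minimal logistic-regression probe on toy "hidden states" (the data, dimensions, and hyperparameters are invented for illustration):

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen hidden states with plain
    SGD; high accuracy indicates the feature is linearly encoded."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, states, labels):
    hits = sum(int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y))
               for x, y in zip(states, labels))
    return hits / len(states)

# Toy "hidden states": dimension 0 linearly encodes the label, dimension 1 is noise.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
states = [[rng.gauss(1.0 if y else -1.0, 0.3), rng.gauss(0.0, 1.0)] for y in labels]
w, b = train_linear_probe(states, labels)
acc = probe_accuracy(w, b, states, labels)
```

In the paper's setting, low probe accuracy on vision-encoder states for edge features, versus high accuracy for node features, is what supports the "nodes are early, edges are late" claim.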
[27] LaTeX Compilation: Challenges in the Era of LLMs
Tianyou Liu, Ziqiang Li, Yansong Li, Xurui Liu
Main category: cs.CL
TL;DR: Mogan STEM is a WYSIWYG structured editor that addresses TeX’s limitations for LLM-assisted scientific writing, offering faster compilation, better error handling, and more efficient LLM fine-tuning through its lower-entropy .tmu format.
Details
Motivation: TeX has fundamental defects in compilation efficiency, generated semantics, error localization, and tool ecosystem that become more visible as LLMs increasingly assist scientific writing. The significant token cost of TeX also presents limitations in the LLM era.
Method: Introduces Mogan STEM, a WYSIWYG structured editor with efficient data structure, fast rendering, and on-demand plugin loading. Compares performance with TeX through extensive experiments on compilation/rendering time and LLM task performance. Also analyzes information entropy differences between formats.
Result: Mogan outperforms TeX in compilation efficiency, error localization, and tool ecosystem. The .tmu format has lower information entropy than TeX, making it more efficient for fine-tuning LLMs. Experiments verify benefits in compilation/rendering time and LLM task performance.
Conclusion: Mogan STEM provides a superior alternative to TeX for LLM-assisted scientific writing. The authors appeal for larger experiments on LLM training using the .tmu format due to its efficiency advantages over TeX.
Abstract: As large language models (LLMs) increasingly assist scientific writing, the limitations and significant token cost of TeX become ever more visible. This paper analyzes TeX’s fundamental defects in compilation and user experience design to illustrate its limitations in compilation efficiency, generated semantics, error localization, and tool ecosystem in the era of LLMs. As an alternative, Mogan STEM, a WYSIWYG structured editor, is introduced. Mogan outperforms TeX in the above aspects through its efficient data structure, fast rendering, and on-demand plugin loading. Extensive experiments are conducted to verify the benefits in compilation/rendering time and performance on LLM tasks. Moreover, we show that due to Mogan’s lower information entropy, it is more efficient to use .tmu (the document format of Mogan) to fine-tune LLMs than TeX. Therefore, we launch an appeal for larger experiments on LLM training using the .tmu format.
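The entropy argument can be made concrete: the relevant quantity is the empirical Shannon entropy of a format's symbol distribution. The snippet below measures character-level entropy; the example strings are hypothetical and not drawn from the paper's corpora:

```python
import math
from collections import Counter

def char_entropy(text):
    """Empirical Shannon entropy (bits per character) of a string's
    character distribution, a crude proxy for a format's verbosity."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical snippets: the same fraction in TeX markup vs. a terser encoding.
tex_snippet = r"\frac{a+b}{c}"
terse_snippet = "(a+b)/c"
tex_bits = char_entropy(tex_snippet)
terse_bits = char_entropy(terse_snippet)
```

On real corpora, lower per-symbol entropy means fewer effective bits (and hence tokens) per document, which is the basis of the fine-tuning-efficiency claim.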
[28] Eval4Sim: An Evaluation Framework for Persona Simulation
Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar
Main category: cs.CL
TL;DR: Eval4Sim is an evaluation framework for persona-grounded LLM conversations that measures alignment with human conversational patterns across three dimensions: adherence to persona backgrounds, consistency of identity across conversations, and naturalness of dialogue flow.
Details
Motivation: Current evaluation practices for persona-grounded LLM simulations rely heavily on LLM-as-a-judge approaches, which lack grounding in observable human behavior and produce opaque scalar scores. There's a need for more comprehensive evaluation that measures how closely simulated conversations align with real human conversational patterns.
Method: Eval4Sim evaluates persona-grounded conversations across three complementary dimensions: 1) Adherence - measures how effectively persona backgrounds are implicitly encoded in generated utterances using dense retrieval with speaker-aware representations; 2) Consistency - evaluates whether a persona maintains distinguishable identity across conversations via authorship verification; 3) Naturalness - assesses whether conversations exhibit human-like flow rather than overly rigid structure using dialogue-focused Natural Language Inference distributions. The framework uses a human conversational corpus (PersonaChat) as reference baseline and penalizes deviations in both directions.
Result: The framework provides a more comprehensive evaluation of persona-grounded LLM conversations by measuring alignment with human conversational patterns across multiple dimensions, distinguishing between insufficient persona encoding and over-optimized, unnatural behavior.
Conclusion: Eval4Sim addresses limitations of current LLM-as-a-judge evaluation approaches by providing a framework grounded in observable human conversational behavior, offering more nuanced assessment of persona-grounded simulations that can be applied to any conversational corpus with speaker-level annotations.
Abstract: Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.
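The abstract does not name the exact measure used for Naturalness; one natural choice that "penalizes deviations in both directions" from the human reference is the Jensen-Shannon divergence between simulated and human NLI label distributions (the distributions below are made up for illustration):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions:
    symmetric, finite, and zero only when the distributions match, so it
    penalizes deviation from the reference in either direction."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical NLI label distributions (entail / neutral / contradict).
human = [0.15, 0.75, 0.10]   # reference derived from a human corpus
sim_a = [0.18, 0.72, 0.10]   # close to human conversational flow
sim_b = [0.60, 0.35, 0.05]   # over-optimized: every turn "entails" the last
```

Unlike a one-sided score, this flags sim_b as unnatural even though its dialogue is "more coherent" than the human baseline.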
[29] Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction
Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao, Ru Li, Hongye Tan
Main category: cs.CL
TL;DR: Multi-agent collaboration framework for zero-shot document-level event argument extraction using generation and evaluation agents with reinforcement learning optimization.
Details
Motivation: Existing methods for zero-shot document-level event argument extraction rely on LLMs with event-type-only prompts, which fail to capture contextual relationships of unseen events and lack quality evaluation mechanisms for synthetic data.
Method: Proposes a multi-agent framework simulating “Propose-Evaluate-Revise” cognitive process: generation agent synthesizes data for unseen events using knowledge from seen events, and evaluation agent extracts arguments and assesses semantic consistency with context. Uses reinforcement learning with event structure constraints in reward design for iterative optimization.
Result: Achieves improvements in data generation quality and argument extraction performance across three zero-shot scenarios constructed from RAMS and WikiEvents datasets. Generated data also enhances zero-shot performance of other DEAE models.
Conclusion: The multi-agent collaboration framework effectively addresses challenges in zero-shot DEAE by improving synthetic data quality and extraction performance through iterative optimization with evaluation mechanisms.
Abstract: Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of “Propose-Evaluate-Revise.” Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. In three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.
[30] ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu, Yao Shu, Chengwei Qin
Main category: cs.CL
TL;DR: ACE-M introduces a data-free model merging method that estimates task-specific input covariance from parameter differences to mitigate inter-task interference without data access or retraining.
Details
Motivation: Model merging suffers from interference when combining task-specific experts, especially with different objectives, causing performance degradation. Existing methods require data access, retraining, or architectural changes, limiting practical application.
Method: ACE-M uses theoretical analysis showing input covariance for each task can be implicitly estimated from parameter differences of fine-tuned models. Provides adaptive covariance estimation framework with closed-form solution to mitigate interference.
Result: Achieves state-of-the-art among data-free methods, with 4% average absolute improvement over previous methods across seven GPT-2 tasks. Efficient closed-form solution provides superior performance with modest computational cost.
Conclusion: ACE-M offers practical, theoretically grounded solution for model merging that works without data access, retraining, or architectural modifications, effectively addressing inter-task interference.
Abstract: Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-M, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-M sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-M achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-M delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.
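ACE-M's exact closed form is not given in the abstract. For intuition, a RegMean-style covariance-weighted merge of linear layers, W* = (sum_i W_i C_i)(sum_i C_i)^(-1), reduces to a per-column weighted average when each task's input covariance C_i is diagonal. In ACE-M those covariances would be estimated from parameter differences rather than supplied directly, as in this sketch:

```python
def merge_linear_diag(weights, diag_covs):
    """Covariance-weighted merge of linear-layer weight matrices. With
    diagonal per-task input covariances, the closed form
    W* = (sum_i W_i C_i)(sum_i C_i)^(-1) reduces to averaging each input
    column, weighted by that task's input variance on that dimension."""
    rows, cols = len(weights[0]), len(weights[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        total = sum(c[j] for c in diag_covs)
        for r in range(rows):
            merged[r][j] = sum(w[r][j] * c[j]
                               for w, c in zip(weights, diag_covs)) / total
    return merged

# Input dimension 0 has 3x the variance under task 2, so its weight dominates there.
merged = merge_linear_diag([[[1.0, 0.0]], [[3.0, 0.0]]],
                           [[1.0, 1.0], [3.0, 1.0]])
```

With equal covariances this degenerates to plain weight averaging, which is why covariance estimates are the key ingredient for interference mitigation.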
[31] MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling
Jinwoong Kim, Sangjin Park
Main category: cs.CL
TL;DR: MaBERT: A hybrid encoder combining Transformer layers for global dependency modeling with Mamba layers for linear-time state updates, enabling efficient long-context processing with padding-safe masking.
Details
Motivation: BERT's quadratic scaling with sequence length makes long context modeling expensive, while linear-time state space models like Mamba have limitations in modeling global interactions and suffer from padding-induced state contamination.
Method: Interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates, introduces padding-safe masking to block state propagation through padded positions, and mask-aware attention pooling to aggregate information only from valid tokens.
Result: On GLUE, achieves best mean score on 5 of 8 tasks, with strong performance on CoLA and sentence pair inference. When extending context from 512 to 4,096 tokens, reduces training time by 2.36x and inference latency by 2.43x relative to encoder baselines.
Conclusion: MaBERT demonstrates a practical long-context efficient encoder that combines the strengths of Transformers and state space models for efficient training and inference on long inputs.
Abstract: Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on the CoLA and sentence-pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical long-context efficient encoder.
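The two mechanisms are simple to sketch with a scalar recurrent state: padded positions neither read their input nor update the state, and pooling averages only valid tokens. The decay value and inputs below are illustrative, not MaBERT's actual recurrence:

```python
def masked_scan(inputs, mask, decay=0.9):
    """Linear-time state accumulation with padding-safe masking: valid
    positions update state = decay * state + x, while padded positions
    carry the previous state forward unchanged."""
    state, states = 0.0, []
    for x, m in zip(inputs, mask):
        if m:
            state = decay * state + x
        states.append(state)
    return states

def mask_aware_mean_pool(states, mask):
    """Aggregate information only from valid (non-padded) tokens."""
    valid = [s for s, m in zip(states, mask) if m]
    return sum(valid) / len(valid)

x = [1.0, 2.0, 0.0, 0.0]  # last two positions are padding
mask = [1, 1, 0, 0]
states = masked_scan(x, mask)
pooled = mask_aware_mean_pool(states, mask)
```

Because padded positions are inert, the values stored in the padding cannot contaminate the state, which is the failure mode the abstract attributes to plain state space models under variable-length batching.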
[32] TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
Zixin Xiong, Ziteng Wang, Haotian Fan, Xinjie Zhang, Wenxuan Wang
Main category: cs.CL
TL;DR: TrustMH-Bench: A comprehensive framework for evaluating trustworthiness of mental health LLMs across 8 dimensions, revealing significant deficiencies in current models.
Details
Motivation: Existing LLM evaluation paradigms fail to capture mental health-specific requirements, creating urgent need to assess trustworthiness in this high-stakes domain where safety is critical.
Method: Proposes TrustMH-Bench framework that maps domain-specific norms to quantitative metrics, evaluating models across 8 pillars: Reliability, Crisis Identification, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics.
Result: Models underperform across various trustworthiness dimensions in mental health scenarios; even powerful models like GPT-5.1 fail to maintain consistent high performance across all dimensions.
Conclusion: Systematic improvement of LLM trustworthiness is critical for mental health applications; current models show significant deficiencies requiring targeted enhancements.
Abstract: While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.
[33] PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Sudip Bhujel
Main category: cs.CL
TL;DR: PrivMedChat: A differentially private RLHF framework for medical dialogue that protects patient privacy while maintaining clinical utility
Details
Motivation: Medical LLMs need doctor-patient conversation data that contains sensitive information, creating privacy risks. Conventional fine-tuning and RLHF can amplify memorization and enable training data extraction, which is especially problematic in healthcare settings.
Method: End-to-end differentially private RLHF framework with DP-SGD for medical SFT and reward model learning, plus DP-SGD for PPO actor/critic on dialogue prompts. Also introduces annotation-free preference construction pairing physician responses with filtered non-expert generations.
Result: At ε=7, achieves highest ROUGE-L (0.156) among DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, gets highest LLM-jury score (2.86/3), with membership inference near chance (AUC 0.510-0.555).
Conclusion: PrivMedChat enables privacy-preserving medical dialogue systems that maintain clinical utility while protecting sensitive patient information through differential privacy at all training stages.
Abstract: Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
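The privacy guarantee rests on DP-SGD's per-example gradient clipping and Gaussian noise addition. A minimal sketch of that aggregation step (our own simplification with illustrative names, not PrivMedChat's implementation):

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, seed=0):
    """One DP-SGD gradient aggregation step: clip each per-example
    gradient to L2 norm <= clip_norm, sum the clipped gradients,
    add Gaussian noise calibrated to the clipping bound, and
    average over the batch."""
    rng = random.Random(seed)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip, never amplify
        for i in range(dim):
            summed[i] += g[i] * scale
    sigma = noise_multiplier * clip_norm  # noise scales with sensitivity
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch = len(per_example_grads)
    return [x / batch for x in noisy]
```

Because each example's influence on the sum is bounded by `clip_norm`, the added noise yields a formal (epsilon, delta) guarantee; the abstract's $\varepsilon=7$ corresponds to the accumulated privacy budget across training, which this sketch does not account for.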
[34] TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
Main category: cs.CL
TL;DR: TAO-Attack: A two-stage optimization-based jailbreak method for LLMs that suppresses refusals and penalizes pseudo-harmful outputs using direction-priority token optimization.
Details
Motivation: Current optimization-based jailbreak attacks on LLMs suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates, limiting their effectiveness.
Method: Two-stage loss function: first stage suppresses refusals to ensure model continues harmful prefixes; second stage penalizes pseudo-harmful outputs and encourages harmful completions. Uses direction-priority token optimization (DPTO) that aligns candidates with gradient direction before considering update magnitude.
Result: TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and reaching 100% in certain scenarios across multiple LLMs.
Conclusion: TAO-Attack provides an effective optimization-based jailbreak method that addresses key limitations of existing approaches through its two-stage loss and efficient DPTO strategy.
Abstract: Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
[35] Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection
Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata
Main category: cs.CL
TL;DR: Instruction-tuned LLMs reformulate argumentative component detection as a generative task, outperforming state-of-the-art systems on standard benchmarks.
Details
Motivation: Argumentative component detection (ACD) is a challenging core subtask of Argument Mining that requires both delimiting argumentative spans and classifying them. Existing approaches oversimplify the problem as sequence labeling or pipeline segmentation-classification, lacking end-to-end solutions.
Method: Proposes using instruction-tuned Large Language Models with compact instruction-based prompts to reframe ACD as a language generation task, enabling direct identification of arguments from plain text without pre-segmented components.
Result: Experiments on standard benchmarks show the approach achieves higher performance compared to state-of-the-art systems, demonstrating the effectiveness of generative modeling for ACD.
Conclusion: This is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex Argument Mining problems.
Abstract: Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.
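The generative reframing can be pictured with a hypothetical compact prompt builder: instead of token-level labels, the model is asked to reproduce the text with inline component tags. The tag names and instruction wording below are illustrative, not the authors' actual prompt:

```python
def acd_prompt(text: str) -> str:
    """Build a hypothetical compact instruction prompt that casts joint
    argumentative component detection as generation: the model rewrites
    the input verbatim, wrapping spans in component tags."""
    return (
        "Identify the argumentative components in the text below. "
        "Reproduce the text verbatim, wrapping each claim in "
        "<claim>...</claim> and each premise in <premise>...</premise>. "
        "Leave non-argumentative spans untagged.\n\n"
        f"Text: {text}"
    )
```

Under this framing, span delimitation and classification happen in a single decoding pass, and the output can be checked against the input for faithfulness before parsing out the tagged spans.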
[36] Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan
Main category: cs.CL
TL;DR: Multi-turn LLM systems switching models mid-interaction causes context mismatch, leading to silent performance drift that can swing outcomes by significant margins comparable to model tier differences.
Details
Motivation: Deployed LLM systems frequently switch models during interactions due to upgrades, cross-provider routing, and fallbacks, creating context mismatches where later models must condition on dialogue history authored by different models, potentially causing silent performance degradation that single-model benchmarks miss.
Method: Introduces a switch-matrix benchmark that measures handoff effects by running a prefix model for early turns and a suffix model for the final turn, comparing against no-switch baselines using paired episode-level bootstrap confidence intervals across CoQA conversational QA and Multi-IF benchmarks.
Result: Single-turn handoffs yield prevalent, statistically significant directional effects, swinging outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to gaps between common model tiers. Found systematic compatibility patterns and decomposed switch-induced drift into per-model prefix influence and suffix susceptibility terms accounting for ~70% of variance.
Conclusion: Handoff robustness is an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn LLM systems.
Abstract: Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
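The paired episode-level bootstrap the benchmark relies on resamples whole episodes so the switch/no-switch pairing is preserved. A simplified percentile-bootstrap sketch (function and argument names are our own):

```python
import random

def paired_bootstrap_ci(switch_scores, baseline_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-episode difference
    (switch minus no-switch). Episodes are resampled with replacement
    as units, keeping each score paired with its own baseline."""
    rng = random.Random(seed)
    diffs = [s - b for s, b in zip(switch_scores, baseline_scores)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lo, hi)
```

A handoff effect is called significant when the resulting interval excludes zero, which is how "prevalent and statistically significant directional effects" would be read off such a matrix of prefix/suffix pairs.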
[37] UniSkill: A Dataset for Matching University Curricula to Professional Competencies
Nurlan Musazade, Joszef Mezei, Mike Zhang
Main category: cs.CL
TL;DR: Released annotated and synthetic datasets for matching university courses with ESCO taxonomy skills, trained BERT models for course-skill matching achieving 87% F1-score.
Details
Motivation: Addressing the scarcity of publicly available datasets for skill extraction and recommendation systems, particularly deficiencies in the instructed skills side despite AI applications in job advertisements receiving broad attention.
Method: Created manually annotated and synthetic datasets matching graduate-level university courses with ESCO taxonomy skills at two granularities (course title with skill, course sentence with skill). Trained language models (BERT) as baselines for retrieval and recommendation systems for course-to-skill and skill-to-course matching.
Result: BERT model achieved 87% F1-score on the annotated data, demonstrating that course and skill matching is a feasible task.
Conclusion: The released datasets and baseline models provide valuable resources for skill extraction and recommendation systems, showing promising results for matching educational content with standardized skill taxonomies.
Abstract: Skill extraction and recommendation systems have been studied from recruiter, applicant, and education perspectives. While AI applications in job advertisements have received broad attention, deficiencies in the instructed skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing both manually annotated and synthetic datasets of skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and university course pairs and publishing corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as a baseline for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves 87% F1-score, showing that course and skill matching is a feasible task.
[38] APRES: An Agentic Paper Revision and Evaluation System
Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman, Abhishek Charnalia, Despoina Magka, Tatiana Shavrina, Derek Dunfield, Oisin Mac Aodha, Yoram Bachrach
Main category: cs.CL
TL;DR: APRES uses LLMs to automatically revise scientific papers based on a citation-predictive rubric to improve paper quality and impact without altering core scientific content.
Details
Motivation: Current peer review systems provide inconsistent feedback that hinders manuscript improvement. There's a need for automated tools to help authors enhance paper quality and impact before submission while preserving scientific content.
Method: APRES uses LLMs to discover a rubric predictive of future citation counts, then integrates this rubric with an automated system that revises papers to enhance quality and impact while maintaining core scientific content.
Result: APRES improves future citation prediction by 19.6% in mean average error over baselines. Human expert evaluators prefer revised papers over originals 79% of the time.
Conclusion: LLMs can effectively augment human reviewers by helping authors stress-test manuscripts before submission, improving paper quality and potential impact while preserving scientific integrity.
Abstract: Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce a novel method APRES, powered by Large Language Models (LLMs), to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts and integrates it into APRES, an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean average error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.
[39] BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?
Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
Main category: cs.CL
TL;DR: BeyondSWE is a comprehensive benchmark for code agents that evaluates cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation, revealing significant capability gaps in current models.
Details
Motivation: Current benchmarks for code agents focus too narrowly on repository-specific fixes and fail to capture real-world challenges like cross-repository reasoning, domain expertise requirements, dependency management, and full repository generation tasks.
Method: Introduces BeyondSWE benchmark with 500 real-world instances across four settings, broadening evaluation along resolution scope and knowledge scope axes. Also develops SearchSWE framework integrating deep search with coding abilities to investigate external knowledge integration.
Result: Even frontier models plateau below 45% success rate, with no single model performing consistently across task types. Search augmentation yields inconsistent gains and can sometimes degrade performance, highlighting challenges in emulating developer workflows.
Conclusion: The work provides a realistic, challenging evaluation benchmark and flexible framework to advance research toward more capable code agents, revealing significant gaps in current capabilities for complex real-world coding tasks.
Abstract: Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
[40] Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung
Main category: cs.CL
TL;DR: Code agents can autonomously evolve math problems into more complex variations using a multi-agent framework with validation, creating structurally distinct and more challenging problems for LLM training.
Details
Motivation: The scarcity of challenging, high-quality math problems for training and evaluating LLMs at IMO level, combined with the observation that code agents have sophisticated reasoning skills that could be leveraged for mathematical experimentation.
Method: A multi-agent framework where code agents autonomously evolve existing math problems into more complex variations, with validation mechanisms to ensure solvability and increased difficulty of generated problems.
Result: Code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals, given sufficient test-time exploration.
Conclusion: Code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments.
Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.
[41] Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
Main category: cs.CL
TL;DR: MOSAIC is a post-training framework that aligns language model agents for safe multi-step tool use through explicit safety reasoning and refusal mechanisms, using preference-based RL with pairwise trajectory comparisons.
Details
Motivation: Agentic language models operate in a fundamentally different safety regime than chat models, requiring planning, tool calling, and long-horizon actions where single missteps can cause irreversible harm. Existing alignment methods optimized for static generation break down in sequential decision-making settings with adversarial tool feedback and overconfident reasoning.
Method: MOSAIC structures inference as a plan, check, then act or refuse loop with explicit safety reasoning and refusal as first-class actions. It uses preference-based reinforcement learning with pairwise trajectory comparisons to train without trajectory-level labels, capturing safety distinctions often missed by scalar rewards.
Result: MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance across three model families (Qwen2.5-7B, Qwen3-4B-Thinking, Phi-4) and out-of-distribution benchmarks.
Conclusion: MOSAIC demonstrates robust generalization across models, domains, and agentic settings, providing an effective framework for aligning language model agents for safe multi-step tool use through explicit safety reasoning mechanisms.
Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
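The plan, check, then act-or-refuse structure can be pictured as a control loop in which refusal is a first-class action rather than a fallback. This is a schematic of the inference-time loop only, with our own function names, not MOSAIC's implementation or its training procedure:

```python
def plan_check_act(task, plan_fn, is_safe_fn, act_fn, max_steps=10):
    """Schematic plan -> check -> act-or-refuse loop. Each proposed
    step passes through an explicit safety check; a rejected step
    becomes a recorded refusal that ends the trajectory, so unsafe
    tool calls are never executed."""
    history = []
    for _ in range(max_steps):
        step = plan_fn(task, history)
        if step is None:  # planner signals task completion
            return history
        if not is_safe_fn(step):
            history.append(("refuse", step))  # refusal is an action
            return history
        history.append(("act", act_fn(step)))
    return history
```

Making the check explicit is what gives the preference-learning stage something to compare: two trajectories can differ only in whether a risky step was refused, a distinction a scalar task-completion reward would miss.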
[42] Using Learning Progressions to Guide AI Feedback for Science Learning
Xin Xia, Nejla Yuruk, Yun Wang, Xiaoming Zhai
Main category: cs.CL
TL;DR: LP-driven rubric generation for AI feedback produces comparable quality to expert-authored rubrics for student science explanations
Details
Motivation: Current AI-generated feedback relies on time-consuming expert-authored rubrics; learning progressions offer a scalable alternative for formative feedback across instructional contexts.
Method: Compared two AI feedback pipelines: one using expert-designed task-specific rubric vs. one using automatically derived rubric from learning progression; evaluated feedback quality for 207 middle school chemistry explanations using multi-dimensional rubric with human coders
Result: No significant differences between pipelines across Clarity, Relevance, Engagement/Motivation, or Reflectiveness dimensions; high inter-rater reliability (89-100% agreement, κ=.66-.88)
Conclusion: LP-driven rubric generation provides viable alternative to expert-authored rubrics for scalable AI feedback in educational contexts
Abstract: Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students’ developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen’s kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
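The study's inter-rater reliability figures use Cohen's kappa, which corrects observed agreement for the agreement expected by chance from each rater's label frequencies. A minimal two-rater implementation:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items:
    (observed - expected) / (1 - expected), where expected agreement
    comes from the product of each rater's marginal label frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[l] * freq_b.get(l, 0) for l in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

This is why the paper can report 89-100% raw agreement alongside kappa of .66 to .88: kappa discounts the agreement that skewed label distributions would produce by chance, so it is always at or below raw agreement.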
[43] Reproduction and Replication of an Adversarial Stylometry Experiment
Haining Wang, Patrick Juola, Allen Riddell
Main category: cs.CL
TL;DR: Reproduction and replication of Brennan et al.’s (2012) study on adversarial stylometry defenses against authorship attribution, finding that round-trip translation may reduce attribution effectiveness more than previously thought.
Details
Motivation: To verify and extend findings from a seminal study on adversarial stylometry defenses against authorship attribution, addressing concerns about deanonymization in natural language communication.
Method: Reproduced original experiments using original data, then replicated the online field experiment following original procedures, adding a control group missing in the original study.
Result: Reached same conclusion as original paper but found defenses may be overstated in effectiveness; round-trip translation appears to reduce effectiveness of established authorship attribution methods.
Conclusion: Adversarial stylometry defenses warrant re-examination, particularly round-trip translation, which shows promise in reducing authorship attribution effectiveness despite limitations in original study design.
Abstract: Maintaining anonymity in natural language communication remains a challenging task. Even when the number of candidate authors is large, standard authorship attribution techniques that analyze writing style predict the original author with uncomfortably high accuracy. Adversarial stylometry provides a defense against authorship attribution, helping users avoid unwanted deanonymization. This paper reproduces and replicates experiments from a seminal study of defenses against authorship attribution (Brennan et al., 2012). After reproducing the experiment using the original data, we then replicate the experiment by repeating the online field experiment using the procedures described in the original paper. Although we reach the same conclusion as the original paper, our results suggest that the defenses studied may be overstated in their effectiveness. This is largely due to the absence of a control group in the original study. In our replication, we find evidence suggesting that an entirely automatic method, round-trip translation, warrants re-examination because it appears to reduce the effectiveness of established authorship attribution methods.
[44] Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou
Main category: cs.CL
TL;DR: A method for estimating LLM-generated text in large corpora, applied to scientific peer reviews showing 6.5-16.9% LLM-modified text, with patterns linking LLM use to reviewer confidence, timing, and engagement.
Details
Motivation: To develop a systematic approach for measuring LLM-generated text at the corpus level in real-world settings, specifically examining how LLMs are being used in scientific peer review processes after ChatGPT's release.
Method: Maximum likelihood model using expert-written and AI-generated reference texts to estimate the fraction of LLM-modified text in large corpora, applied to peer reviews from four AI conferences (ICLR 2024, NeurIPS 2023, CoRL 2023, EMNLP 2023).
Result: Found 6.5-16.9% of peer review text substantially modified by LLMs; LLM use correlated with lower reviewer confidence, submissions close to deadlines, and reviewers less likely to respond to rebuttals; detected corpus-level trends not visible at individual level.
Conclusion: LLMs are significantly impacting scientific peer review, with measurable patterns in their usage; interdisciplinary research needed to understand how LLMs are changing information and knowledge practices.
Abstract: We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
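The corpus-level maximum-likelihood idea can be sketched as a two-component mixture: each document has likelihood `alpha * p_ai(x) + (1 - alpha) * p_human(x)` under reference distributions fit to AI-generated and expert-written text, and the mixing fraction `alpha` is chosen to maximize the total log-likelihood. A simplified grid-search version (the paper's estimator is more involved; names are our own):

```python
import math

def estimate_llm_fraction(p_ai, p_human, grid=1000):
    """MLE of the corpus-level fraction alpha of AI-modified documents
    under a two-component mixture. p_ai and p_human are per-document
    likelihoods under the AI and human reference distributions."""
    best_alpha, best_ll = 0.0, float("-inf")
    for k in range(grid + 1):
        alpha = k / grid
        ll = sum(math.log(alpha * a + (1 - alpha) * h + 1e-300)
                 for a, h in zip(p_ai, p_human))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

The key property, which the paper exploits, is that `alpha` is estimated over the whole corpus at once: reliable aggregate trends (like the 6.5-16.9% range) can emerge even when no individual review can be classified with confidence.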
[45] Diverging Preferences: When do Annotators Disagree and do Models Know?
Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin
Main category: cs.CL
TL;DR: Analysis of human preference disagreements in LLM training datasets reveals most disagreements stem from task underspecification and response style differences, challenging standard reward modeling assumptions.
Details
Motivation: To understand the nature of disagreements in human-labeled preference datasets for LLM training, challenging the common assumption that annotator disagreements are simple noise, and to explore implications for reward modeling and LLM evaluation.
Method: Developed a taxonomy of disagreement sources across ten categories in four high-level classes, analyzed human preference datasets, and conducted experiments examining standard reward modeling (Bradley-Terry) and LLM-as-Judge evaluation methods.
Result: Found majority of disagreements are due to factors like task underspecification and response style differences, not simple noise. Standard reward modeling and evaluation methods fail to account for annotator divergence, highlighting challenges in LLM evaluations and pluralistic alignment.
Conclusion: Developed methods for identifying diverging preferences to mitigate their influence in evaluations and LLM training, addressing challenges in developing pluralistically aligned LLMs and improving evaluation robustness.
Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward modeling training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
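The Bradley-Terry assumption the paper critiques reduces every comparison to a single scalar margin; a minimal sketch of why genuinely diverging annotator groups get collapsed:

```python
import math

def bradley_terry_prob(reward_chosen, reward_rejected):
    """Bradley-Terry preference probability:
    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Two annotator groups that disagree (e.g. on response style) are folded
# into one scalar margin: the model can only express "55% prefer A",
# never "group 1 prefers A, group 2 prefers B".
p = bradley_terry_prob(1.2, 1.0)
```

This is why, as the paper argues, treating disagreement as noise discards systematic divergence rather than modeling it.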
[46] A Survey of Query Optimization in Large Language Models
Mingyang Song, Mao Zheng
Main category: cs.CL
TL;DR: A comprehensive survey on query optimization techniques for LLMs in RAG systems, introducing a lifecycle framework, query complexity taxonomy, and analyzing four core optimization operations.
Details
Motivation: Query optimization is crucial for improving LLM effectiveness in RAG systems, where query quality directly impacts retrieval and response performance. Current research lacks systematic frameworks and comprehensive analysis of optimization techniques.
Method: 1) Introduces Query Optimization Lifecycle (QOL) Framework with five phases; 2) Proposes Query Complexity Taxonomy based on evidence type and quantity; 3) Analyzes four atomic operations: Query Expansion, Decomposition, Disambiguation, and Abstraction; 4) Examines evaluation methodologies and identifies research gaps.
Result: Provides a systematic foundation for query optimization research, synthesizes representative methods from premier venues, identifies critical gaps in benchmarks, and offers actionable guidance for practitioners.
Conclusion: The survey establishes structured frameworks for understanding query optimization in RAG systems, highlights open challenges including multi-modal query handling, and provides both research foundation and practical guidance.
Abstract: Query Optimization (QO) has become essential for enhancing Large Language Model (LLM) effectiveness, particularly in Retrieval-Augmented Generation (RAG) systems where query quality directly determines retrieval and response performance. This survey provides a systematic and comprehensive analysis of query optimization techniques with three principal contributions. \textit{First}, we introduce the \textbf{Query Optimization Lifecycle (QOL) Framework}, a five-phase pipeline covering Intent Recognition, Query Transformation, Retrieval Execution, Evidence Integration, and Response Synthesis, providing a unified lens for understanding the optimization process. \textit{Second}, we propose a \textbf{Query Complexity Taxonomy} that classifies queries along two dimensions, namely evidence type (explicit vs.\ implicit) and evidence quantity (single vs.\ multiple), establishing principled mappings between query characteristics and optimization strategies. \textit{Third}, we conduct an in-depth analysis of four atomic operations, namely \textbf{Query Expansion}, \textbf{Query Decomposition}, \textbf{Query Disambiguation}, and \textbf{Query Abstraction}, synthesizing a broad spectrum of representative methods from premier venues. We further examine evaluation methodologies, identify critical gaps in existing benchmarks, and discuss open challenges including process reward models, efficiency optimization, and multi-modal query handling. This survey offers both a structured foundation for research and actionable guidance for practitioners.
[47] Param$Δ$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost
Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu
Main category: cs.CL
TL;DR: ParamΔ enables zero-shot transfer of post-training capabilities from existing models to updated base models by computing and applying weight differences, eliminating the need for repeated post-training.
Details
Motivation: Traditional post-training of LLMs requires extensive data, risks overfitting, and incurs high computational costs, especially when repeated for each base model update. The open-weight community has many available checkpoints that could be better leveraged.
Method: Compute weight difference between post-trained model (Θ_post) and base model (Θ_base), then add this difference to an updated base model (Θ’_base): Θ_ParamΔ = Θ_post - Θ_base + Θ’_base. This transfers post-training capabilities without additional training.
Result: ParamΔ models achieve ~95% of Llama3.1-inst performance on average when transferring from Llama3-inst to Llama3.1-base. The method works across Llama3, Llama3.1, Qwen, and DeepSeek-distilled models, effectively replicating traditional post-training results.
Conclusion: ParamΔ provides a cost-free framework to accelerate model development cycles by leveraging existing checkpoints in the open-weight community, enabling efficient knowledge transfer without retraining.
Abstract: The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces $ParamΔ$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ($Θ_\text{post}$) and base model weights ($Θ_\text{base}$), and adding this to the updated base model ($Θ’_\text{base}$), we define the $ParamΔ$ Model as: $Θ_{\text{Param}Δ} = Θ_\text{post} - Θ_\text{base} + Θ’_\text{base}$. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We analyzed Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate the $ParamΔ$ Model effectively replicates traditional post-training. For example, the $ParamΔ$ Model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95% of the Llama3.1-inst model’s performance on average. $ParamΔ$ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
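The ParamΔ update itself is a three-term weight merge, Θ_ParamΔ = Θ_post - Θ_base + Θ’_base. A minimal sketch over toy state dicts (real checkpoints would use tensors, and matching architectures are assumed):

```python
def param_delta(theta_post, theta_base, theta_base_new):
    """ParamDelta merge: carry a model's post-training delta over to an
    updated base model with no training:
        theta = theta_post - theta_base + theta_base_new
    Weights are represented here as flat {name: list-of-floats} dicts.
    """
    assert theta_post.keys() == theta_base.keys() == theta_base_new.keys()
    return {
        name: [p - b + bn for p, b, bn in
               zip(theta_post[name], theta_base[name], theta_base_new[name])]
        for name in theta_post
    }

# Toy one-tensor "model": the instruct delta (+0.5) carries over to the
# updated base weights unchanged.
base     = {"w": [1.0, 2.0]}
inst     = {"w": [1.5, 2.5]}   # base plus a post-training delta of +0.5
base_new = {"w": [1.1, 2.2]}
merged = param_delta(inst, base, base_new)
```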
[48] Hallucination, Monofacts, and Miscalibration: An Empirical Investigation
Miranda Muqing Miao, Michael Kearns
Main category: cs.CL
TL;DR: Empirical investigation of hallucination bounds in language models shows selective upweighting (strategic repetition of 5% training data) reduces hallucinations by 40% while maintaining accuracy, challenging universal deduplication policies.
Details
Motivation: To empirically validate theoretical hallucination bounds in language models and explore practical interventions to reduce hallucinations while maintaining model accuracy.
Method: Systematic experiments with n-gram models and fine-tuned Transformers using Pareto-distributed training data to control monofact rates, empirical derivation of hallucination bounds using bin-wise KL divergence, and introduction of selective upweighting technique.
Result: Selective upweighting reduces hallucinations by up to 40% while maintaining pre-injection accuracy levels, revealing a critical trade-off between accuracy optimization and hallucination reduction.
Conclusion: Strategic miscalibration through selective upweighting effectively reduces hallucinations, challenging conventional deduplication approaches and highlighting inherent tension between accuracy and hallucination objectives in language model training.
Abstract: Hallucinated facts in large language models (LLMs) have recently been shown to obey a statistical lower bound determined by the monofact rate (related to the classical Good-Turing missing mass estimator) minus model miscalibration (Kalai & Vempala, 2024). We present the first empirical investigation of this three-way relationship in classical n-gram models and fine-tuned encoder-decoder Transformers. By generating training data from Pareto distributions with varying shape parameters, we systematically control the monofact rates and establish its positive relationship with hallucination. To bridge theory and practice, we derive an empirical analog of the hallucination bound by replacing the population miscalibration term (Section 2.1) with an empirical bin-wise KL divergence and confirm its practical viability. We then introduce selective upweighting – a simple yet effective technique that strategically repeats as little as 5% of training examples – to deliberately inject miscalibration into the model. This intervention reduces hallucination by up to 40%, challenging universal deduplication policies. Our experiments reveal a critical trade-off: selective upweighting maintains pre-injection levels of accuracy while substantially reducing hallucination, whereas standard training gradually improves accuracy but fails to address persistently high hallucination, indicating an inherent tension in optimization objectives.
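Selective upweighting is strategic repetition of a small training slice; the monofact rate is the Good-Turing-style quantity the bound depends on. A sketch of both (the paper's criterion for *which* examples to repeat is not specified here, so the first-k slice is a placeholder assumption):

```python
from collections import Counter

def monofact_rate(facts):
    """Fraction of distinct facts that appear exactly once in training --
    the Good-Turing-style quantity tied to the hallucination lower bound."""
    counts = Counter(facts)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

def selective_upweight(examples, fraction=0.05, repeats=1):
    """Deliberately duplicate a small slice of training examples (~5% in
    the paper) to inject miscalibration.  Which slice to repeat is an
    assumption; first-k is used as a placeholder."""
    k = max(1, int(len(examples) * fraction))
    return examples + examples[:k] * repeats

facts = ["a", "b", "b", "c", "d", "d", "e"]   # 3 of 5 facts are monofacts
train = selective_upweight(list(range(100)), fraction=0.05)
```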
[49] Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu, Wei Wei, Yu Cheng
Main category: cs.CL
TL;DR: GOAT is a framework that improves LoRA Mixture-of-Experts performance by adaptively integrating SVD priors and deriving a theoretical scaling factor to align optimization with full fine-tuning.
Details
Motivation: Current LoRA methods for parameter-efficient fine-tuning underperform compared to Full Fine-Tuning, and existing approaches using static SVD initialization or MoE architectures have limitations like weight misalignment and complex gradient dynamics.
Method: GOAT adaptively integrates relevant priors using an SVD-structured Mixture-of-Experts and aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor, without modifying architecture or training algorithms.
Result: Experiments across 25 datasets in natural language understanding, commonsense reasoning, image classification, and natural language generation show GOAT achieves state-of-the-art performance, closing the gap with Full Fine-Tuning.
Conclusion: GOAT demonstrates that proper scaling can significantly boost LoRA MoE’s efficiency and performance, making parameter-efficient fine-tuning more competitive with full fine-tuning across diverse tasks.
Abstract: While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE’s efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT’s state-of-the-art performance, closing the gap with Full FT.
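The SVD-prior idea, seeding low-rank adapters from the pretrained weight's top singular directions, can be sketched at rank 1. Power iteration stands in for a full SVD, and `scale` is a placeholder, not GOAT's derived scaling factor; this is an illustration of the prior, not the paper's MoE architecture:

```python
import math, random

def rank1_svd(W, iters=200):
    """Leading singular triple (u, sigma, v) of a small matrix via power
    iteration -- a stand-in for the truncated SVD used to seed adapters."""
    random.seed(0)
    n = len(W[0])
    v = [random.random() for _ in range(n)]
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        v = [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(n)]
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / nv for x in v]
    return u, nv, v

def svd_lora_init(W, scale=1.0):
    """Seed a rank-1 LoRA pair (delta = B @ A) from W's top singular
    direction, so the adapter starts aligned with the dominant pretrained
    subspace.  `scale` is an assumed placeholder for a derived factor."""
    u, s, v = rank1_svd(W)
    B = [[ui * math.sqrt(s) * scale] for ui in u]        # shape (m, 1)
    A = [[vj * math.sqrt(s) * scale for vj in v]]        # shape (1, n)
    return B, A

W = [[3.0, 0.0], [0.0, 1.0]]
B, A = svd_lora_init(W)
delta = [[B[i][0] * A[0][j] for j in range(2)] for i in range(2)]
```

For this diagonal W the rank-1 adapter recovers the dominant component (singular value 3) and ignores the rest.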
[50] $\texttt{SEM-CTRL}$: Semantically Controlled Decoding
Mohammad Albinhassan, Pranava Madhyastha, Alessandra Russo
Main category: cs.CL
TL;DR: SEM-CTRL is a method for enforcing syntactic and semantic constraints on LLM outputs using Answer Set Grammars and token-level MCTS, enabling constraint satisfaction without fine-tuning.
Details
Motivation: LLMs often produce outputs that violate syntactic or semantic constraints, which is problematic for real-world deployment where correctness is critical. Current approaches struggle to guarantee both types of validity simultaneously.
Method: Integrates token-level Monte Carlo Tree Search guided by Answer Set Grammars, a logic-based formalism that generalizes context-sensitive grammars with background knowledge for task-specific semantics.
Result: Enables small pre-trained LLMs to outperform larger models and state-of-the-art reasoning models (like o4-mini) while guaranteeing semantic validity across tasks including grammar synthesis, combinatorial reasoning, JSON parsing, and planning.
Conclusion: SEM-CTRL provides a unified approach for enforcing rich context-sensitive constraints on LLM outputs without fine-tuning, improving correctness guarantees for real-world deployment.
Abstract: Ensuring both syntactic and semantic correctness in Large Language Model (LLM) outputs remains a significant challenge, despite being critical for real-world deployment. In this paper, we introduce \texttt{SEM-CTRL}, a unified approach that allows for enforcing rich context-sensitive constraints, and task and instance specific semantics directly on the LLM decoder. Our approach integrates token-level MCTS which is guided by specific syntactic and semantic constraints. The constraints over desired outputs are expressed using Answer Set Grammars, which is a logic-based formalism that generalizes context sensitive grammars while incorporating background knowledge to represent task-specific semantics. We show that our approach helps guarantee valid completions for any off-the-shelf LLM without the need for fine-tuning. We evaluate \texttt{SEM-CTRL} on a range of tasks, including synthetic grammar synthesis, combinatorial reasoning, JSON parsing, and planning. Our experimental results demonstrate that \texttt{SEM-CTRL} allows even small pre-trained LLMs to efficiently outperform larger variants and state-of-the-art reasoning models (e.g., \textit{o4-mini}) while simultaneously guaranteeing semantic validity.
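The token-level constraint step, masking out logits the grammar disallows at each position, can be sketched with a greedy loop. MCTS and Answer Set Grammars are replaced here by a toy callable "grammar" over balanced parentheses; all names are illustrative, not SEM-CTRL's API:

```python
import math

VOCAB = ["(", ")", "<eos>"]

def balanced_allowed(prefix):
    """Toy 'grammar': a balanced parenthesis string of length <= 4,
    terminated by <eos>."""
    if "<eos>" in prefix:
        return set()
    depth = prefix.count("(") - prefix.count(")")
    allowed = set()
    if len(prefix) + depth < 4:        # room to open and still close all
        allowed.add("(")
    if depth > 0:
        allowed.add(")")
    if depth == 0 and prefix:
        allowed.add("<eos>")
    return allowed

def constrained_greedy_decode(logits_fn, allowed_fn, max_steps=10):
    """Greedy decoding with grammar masking: disallowed tokens get -inf
    logits, so every emitted sequence satisfies the constraint."""
    out = []
    for _ in range(max_steps):
        allowed = allowed_fn(out)
        if not allowed:
            break
        scores = {t: (logits_fn(out, t) if t in allowed else -math.inf)
                  for t in VOCAB}
        out.append(max(scores, key=scores.get))
    return out

# An unconstrained model that always prefers ")" -- invalid on its own;
# the mask forces it into a valid completion anyway.
biased_logits = lambda prefix, tok: {"(": 0.0, ")": 5.0, "<eos>": 1.0}[tok]
seq = constrained_greedy_decode(biased_logits, balanced_allowed)
```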
[51] Adaptive Social Learning via Mode Policy Optimization for Language Agents
Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
Main category: cs.CL
TL;DR: ASL framework enables language agents to adaptively adjust reasoning depth in social interactions using hierarchical reasoning modes and AMPO algorithm for context-aware mode switching.
Details
Motivation: Current language agents lack dynamic reasoning depth adjustment in social intelligence tasks, using either no explicit reasoning or uniform lengthy Chain-of-Thought reasoning, leading to excessive token usage and inflexible social behaviors.
Method: Proposes Adaptive Social Learning (ASL) framework with hierarchical reasoning modes (from intuitive response to deep deliberation) and Adaptive Mode Policy Optimization (AMPO) algorithm for context-aware mode adaptation and reasoning.
Result: ASL achieves 15.6% higher task performance than GPT-4o on social intelligence benchmarks, and AMPO outperforms GRPO by 7.0% with 32.8% shorter thinking chains.
Conclusion: The ASL framework successfully enables adaptive reasoning in language agents for social interactions, demonstrating improved performance and token efficiency through context-aware reasoning depth adjustment.
Abstract: Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack explicit reasoning or employ lengthy Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social behaviors in tasks such as negotiation or collaboration. To address this, we propose an $\textbf{A}$daptive $\textbf{S}$ocial $\textbf{L}$earning ($\textbf{ASL}$) framework in this paper, aiming to improve the adaptive reasoning ability of language agents in dynamic social interactions. To this end, we first identify the hierarchical reasoning modes under such context, ranging from intuitive response to deep deliberation based on the cognitive control theory. We then develop the $\textbf{A}$daptive $\textbf{M}$ode $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{AMPO}$) algorithm to learn the context-aware mode adaptation and reasoning. Our framework advances existing research in three key aspects: (1) Multi-granular reasoning mode design, (2) Context-aware mode switching in rich social interaction, and (3) Token-efficient reasoning with depth adaptation. Extensive experiments on the benchmark social intelligence environment verify that ASL achieves 15.6% higher task performance than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter thinking chains, demonstrating the advantages of our AMPO and the learned adaptive reasoning ability over GRPO’s solution.
[52] Talk to Your Slides: High-Efficiency Slide Editing via Language-Driven Structured Data Manipulation
Kyudan Jung, Hojun Cho, Jooyeol Yun, Soyoung Yang, Jaehyeok Jang, Jaegul Choo
Main category: cs.CL
TL;DR: A slide editing agent that uses language-driven structured data manipulation instead of visual perception for faster, cheaper, and more precise slide editing compared to GUI-based MLLM agents.
Details
Motivation: Current GUI-based agents using Multimodal LLMs are computationally expensive and slow for text-centric and batch processing tasks in slide editing, despite excelling at visual layout adjustments.
Method: Proposes Talk-to-Your-Slides, which operates via language-driven structured data manipulation using the underlying object model rather than screen pixels, with a hierarchical architecture bridging user instructions to execution codes.
Result: Achieves 34% faster processing, 34% better instruction fidelity, and 87% lower cost compared to GUI-based baselines for text-centric and formatting tasks.
Conclusion: Language-driven structured data manipulation is more efficient than visual-based approaches for text-centric slide editing tasks, while also introducing a benchmark dataset for evaluation.
Abstract: Editing presentation slides is a frequent yet tedious task, ranging from creative layout design to repetitive text maintenance. While recent GUI-based agents powered by Multimodal LLMs (MLLMs) excel at tasks requiring visual perception, such as spatial layout adjustments, they often incur high computational costs and latency when handling structured, text-centric, or batch processing tasks. In this paper, we propose Talk-to-Your-Slides, a high-efficiency slide editing agent that operates via language-driven structured data manipulation rather than relying on the image modality. By leveraging the underlying object model instead of screen pixels, our approach ensures precise content modification while preserving style fidelity, addressing the limitations of OCR-based visual agents. Our system features a hierarchical architecture that effectively bridges high-level user instructions with low-level execution codes. Experiments demonstrate that for text-centric and formatting tasks, our method enables 34% faster processing, achieves 34% better instruction fidelity, and operates at an 87% lower cost compared to GUI-based baselines. Furthermore, we introduce TSBench, a human-verified benchmark dataset comprising 379 instructions, including a Hard subset designed to evaluate robustness against complex and visually dependent queries. Our code and benchmark are available at https://github.com/KyuDan1/Talk-to-Your-Slides.
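The core manipulation idea, editing the slide object model instead of pixels, can be sketched with a toy object model. Real decks would go through the PowerPoint object/XML layer, and this is not the paper's pipeline, only an illustration of why object-level batch edits are cheap and style-preserving:

```python
from dataclasses import dataclass, field

@dataclass
class TextBox:
    text: str
    bold: bool = False      # styling travels with the object

@dataclass
class Slide:
    shapes: list = field(default_factory=list)

def apply_instruction(deck, find, replace):
    """Batch text edit over the slide object model (no screenshots, no
    OCR): walk every shape and rewrite matching text, leaving styling
    intact.  Returns the number of shapes edited."""
    edits = 0
    for slide in deck:
        for shape in slide.shapes:
            if find in shape.text:
                shape.text = shape.text.replace(find, replace)
                edits += 1
    return edits

deck = [Slide([TextBox("Q3 Revenue", bold=True), TextBox("Q3 targets")]),
        Slide([TextBox("Summary")])]
n = apply_instruction(deck, "Q3", "Q4")
```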
[53] Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig
Main category: cs.CL
TL;DR: Go-Browse: A method for automated web agent data collection through structured graph-based exploration, achieving state-of-the-art performance on WebArena benchmark with a 7B parameter model.
Details
Motivation: Digital agents lack understanding of their environments, particularly web browsing agents that get lost in unfamiliar websites and struggle to know what pages to visit to achieve goals.
Method: Frames data collection as graph search for efficient exploration, enabling reuse of information across episodes. Collects diverse web agent data through structured exploration of web environments.
Result: Collected 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuned 7B parameter model achieves 21.7% success rate on WebArena, beating GPT-4o mini by 2.4% and exceeding SOTA for sub-10B models by 2.9%.
Conclusion: Go-Browse enables scalable collection of realistic web agent data through structured exploration, significantly improving web agent performance with relatively small models.
Abstract: One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.
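Framing collection as graph search amounts to a frontier with a visited set, so each page is processed once and its information is reused rather than rediscovered per episode. A sketch over a toy site graph (task proposal and solving on each visited page are elided):

```python
from collections import deque

def explore(start, neighbors_fn, max_pages=100):
    """BFS over site pages with a visited set: each URL is expanded once,
    so exploration work is shared across episodes.  The site graph here
    is a toy dict; a real agent would discover links by browsing."""
    visited, order = set(), []
    frontier = deque([start])
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)      # <- propose & attempt tasks on this page
        for nxt in neighbors_fn(url):
            if nxt not in visited:
                frontier.append(nxt)
    return order

site = {"/": ["/login", "/products"],
        "/login": ["/"],
        "/products": ["/products/1", "/products/2"],
        "/products/1": [], "/products/2": []}
pages = explore("/", lambda u: site.get(u, []))
```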
[54] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li
Main category: cs.CL
TL;DR: HSSBench: A multimodal benchmark for evaluating MLLMs on Humanities and Social Sciences tasks across multiple languages, featuring over 13,000 samples and revealing significant challenges for current models.
Details
Motivation: Current MLLM benchmarks focus too much on STEM-style reasoning and general knowledge, overlooking the unique needs of Humanities and Social Sciences (HSS) which require horizontal, interdisciplinary thinking and linking abstract concepts with visual representations.
Method: Created HSSBench with a novel data generation pipeline where domain experts and automated agents collaborate to generate and iteratively refine samples. Contains over 13,000 samples across six key categories in multiple languages including the six UN official languages.
Result: Benchmarked over 20 mainstream MLLMs and found that HSSBench poses significant challenges even for state-of-the-art models, highlighting limitations in cross-disciplinary reasoning.
Conclusion: HSSBench addresses a critical gap in MLLM evaluation and should inspire further research into enhancing cross-disciplinary reasoning abilities, particularly the capacity to internalize and connect knowledge across fields.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
[55] Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Main category: cs.CL
TL;DR: Search Arena: A large-scale human-preference dataset for evaluating search-augmented LLMs, revealing insights about citation influence and source preferences.
Details
Motivation: Existing datasets for analyzing search-augmented language models are limited in scale and scope, focusing mainly on static, single-turn, fact-checking questions. There's a need for more comprehensive evaluation frameworks that capture real-world multi-turn interactions and diverse user intents.
Method: Introduced Search Arena - a crowd-sourced dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, contains full system traces, and includes around 12,000 human preference votes. Conducted cross-arena analyses to test models in different settings.
Result: User preferences are influenced by citation quantity even when citations don’t directly support claims, revealing a gap between perceived and actual credibility. Community-driven platforms are generally preferred over static encyclopedic sources. Web search doesn’t degrade and may improve performance in non-search settings, but search-intensive settings suffer when relying solely on parametric knowledge.
Conclusion: Search Arena provides a valuable resource for evaluating search-augmented LLMs, revealing important insights about user preferences and citation behavior. The dataset supports future research in improving the grounding and credibility assessment of language models with web search integration.
Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model’s parametric knowledge. We open-sourced the dataset to support future research. Our dataset and code are available at: https://github.com/lmarena/search-arena.
[56] CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling
Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu
Main category: cs.CL
TL;DR: CyclicReflex: A training-free decoding strategy that dynamically modulates reflection token logits using a triangular waveform to optimize test-time compute performance of large reasoning models.
Details
Motivation: Large reasoning models use reflection tokens (like "wait", "but") for self-evaluative reasoning, but both excessive and insufficient use degrades performance. The paper treats reflection tokens as a "resource" and introduces resource allocation to improve test-time compute performance.
Method: Proposes cyclical reflection token scheduling (CyclicReflex) - a training-free decoding strategy that dynamically modulates reflection token logits with a bidirectional, position-dependent triangular waveform, incurring no additional computation cost.
Result: Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond and LiveCodeBench show CyclicReflex consistently improves performance across model sizes (1.5B-14B), outperforming standard decoding and recent approaches like TIP and S1.
Conclusion: CyclicReflex effectively addresses the reflection token allocation problem, improving reasoning model performance without additional training or computational overhead.
Abstract: Large reasoning models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens that prompt self-evaluative reflection. These transition markers and reflective cues are referred to as “reflection tokens” (e.g., “wait”, “but”, “alternatively”). In this work, we treat reflection tokens as a “resource” and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, We propose cyclical reflection token scheduling (termed CyclicReflex), a training-free decoding strategy that dynamically modulates reflection token logits with a bidirectional, position-dependent triangular waveform, incurring no additional computation cost. Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond and LiveCodeBench demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-14B), outperforming standard decoding and recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML-Group/CyclicReflex.
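The scheduling itself is a position-dependent triangular offset added only to reflection-token logits; a sketch with assumed period and amplitude (the paper's exact waveform parameters may differ):

```python
def triangular_boost(position, period=100, amplitude=2.0):
    """CyclicReflex-style logit offset: a triangular waveform over decoding
    position, cycling between -amplitude and +amplitude each `period`
    steps.  Period/amplitude values here are assumptions."""
    phase = (position % period) / period        # in [0, 1)
    tri = 1.0 - 4.0 * abs(phase - 0.5)          # rises to +1 mid-cycle
    return amplitude * tri

def adjust_logits(logits, reflection_ids, position):
    """Add the cyclical offset to reflection-token logits only, leaving
    all other tokens untouched -- no training, no extra compute."""
    boost = triangular_boost(position)
    return [l + boost if i in reflection_ids else l
            for i, l in enumerate(logits)]
```

Early in each cycle reflection tokens (e.g. "wait") are suppressed; mid-cycle they are encouraged, alternating between the over- and under-reflection regimes.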
[57] You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Models
Wenchong He, Liqian Peng, Zhe Jiang, Alex Go
Main category: cs.CL
TL;DR: Many-Shot In-Context Fine-tuning (ManyICL) improves LLM performance by treating many-shot examples as training targets rather than just prompts, narrowing the gap between in-context learning and dedicated fine-tuning.
Details
Motivation: Current few-shot in-context fine-tuning of LLMs still lags behind dedicated fine-tuning where separate models are trained for each task. The authors aim to bridge this performance gap by extending in-context learning to many-shot settings.
Method: Proposes Many-Shot In-Context Fine-tuning (ManyICL) with a novel training objective that treats every answer within the context as a supervised training target, shifting many-shot examples from prompts to targets for autoregressive learning.
Result: ManyICL substantially outperforms zero/few-shot fine-tuning and approaches dedicated fine-tuning performance across diverse tasks (classification, summarization, QA, NLI, math). It also significantly mitigates catastrophic forgetting issues.
Conclusion: ManyICL effectively bridges the performance gap between in-context learning and dedicated fine-tuning while addressing efficiency issues of processing long sequences with many examples, offering a promising direction for multi-task LLM adaptation.
Abstract: Large language models (LLMs) possess a remarkable ability to perform in-context learning (ICL), which enables them to handle multiple downstream tasks simultaneously without requiring task-specific fine-tuning. Recent studies have shown that even moderately sized LLMs, such as Mistral 7B, Gemma 7B and Llama-3 8B, can achieve ICL through few-shot in-context fine-tuning of all tasks at once. However, this approach still lags behind dedicated fine-tuning, where a separate model is trained for each individual task. In this paper, we propose a novel approach, Many-Shot In-Context Fine-tuning (ManyICL), which significantly narrows this performance gap by extending the principles of ICL to a many-shot setting. To unlock the full potential of ManyICL and address the inherent inefficiency of processing long sequences with numerous in-context examples, we propose a novel training objective. Instead of solely predicting the final answer, our approach treats every answer within the context as a supervised training target. This effectively shifts the role of many-shot examples from prompts to targets for autoregressive learning. Through extensive experiments on diverse downstream tasks, including classification, summarization, question answering, natural language inference, and math, we demonstrate that ManyICL substantially outperforms zero/few-shot fine-tuning and approaches the performance of dedicated fine-tuning. Furthermore, ManyICL significantly mitigates catastrophic forgetting issues observed in zero/few-shot fine-tuning. The code will be made publicly available upon publication.
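The shift from "predict only the final answer" to "treat every in-context answer as a target" amounts to changing the loss mask over the packed sequence. A minimal sketch, with span bookkeeping as a simplifying assumption:

```python
def manyicl_loss_mask(answer_spans, seq_len):
    """Return a 0/1 mask marking where cross-entropy loss is applied.
    Standard ICL fine-tuning would mark only the final answer span;
    ManyICL (per the summary) marks every answer span so that all
    in-context examples supervise the model in one forward pass.
    `answer_spans` are half-open [start, end) token index ranges."""
    mask = [0] * seq_len
    for start, end in answer_spans:
        for i in range(start, end):
            mask[i] = 1
    return mask
```

This is also where the efficiency gain comes from: one long sequence yields many supervised targets instead of one.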
[58] LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Chenghao Yang, Sida Li, Ari Holtzman
Main category: cs.CL
TL;DR: The paper introduces Branching Factor (BF) to measure output diversity in LLMs, finding alignment reduces BF by 2-5x and CoT reasoning exploits low-BF stages for stable outputs.
Details
Motivation: Aligned LLMs often generate outputs lacking diversity, but the underlying mechanisms driving this consistency are not well understood. The authors aim to investigate why aligned models produce less diverse outputs and how this affects generation behavior.
Method: Introduces Branching Factor (BF) - a token-invariant measure of effective plausible next steps during generation. Conducts empirical analysis of BF across different model stages, compares base vs. aligned models, and performs nudging experiments to test hypotheses about stylistic tokens unlocking low-entropy trajectories.
Result: Two key findings: 1) BF decreases as generation progresses (models become more predictable), 2) Alignment tuning reduces BF by factor of 2-5 overall, and up to 10x at beginning positions. Aligned CoT models exploit low-BF stages for stable outputs. Nudging experiments show base models can be steered similarly with stylistic tokens.
Conclusion: BF serves as powerful diagnostic tool for understanding LLM outputs. Alignment doesn’t fundamentally change behavior but steers toward stylistic tokens that unlock existing low-entropy trajectories. This explains reduced variability in aligned models and how CoT promotes stable generations.
Abstract: Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this consistency in the generation? We investigate this phenomenon through the lens of probability concentration in the model’s output distribution. To quantify this concentration, we introduce the Branching Factor (BF) – a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model’s output distribution from the outset, reducing BF by a factor of 2-5 overall, and up to an order of magnitude (e.g., from 12 to 1.2) at the beginning positions. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this consistency has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model’s behavior, but instead steers it toward stylistic tokens (e.g., “Sure”) that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
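One natural token-invariant instantiation of such a measure is the exponential of the next-token entropy; this is a sketch of that idea, and the paper's exact estimator may differ:

```python
import math

def branching_factor(probs):
    """Effective number of plausible next tokens: exp of the Shannon
    entropy of the next-token distribution. A one-hot distribution
    gives 1.0 (fully deterministic); a uniform distribution over
    k tokens gives k."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)
```

Under this reading, alignment reducing BF from 12 to 1.2 at early positions means the model effectively chooses among ~1 next token where the base model chose among ~12.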
[59] LEDOM: Reverse Language Model
Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan
Main category: cs.CL
TL;DR: LEDOM is a reverse autoregressive language model trained right-to-left that develops unique capabilities like abductive inference and question synthesis, and enables bidirectional scoring through noisy channel duality to improve reasoning by penalizing hallucinated chains.
Details
Motivation: Current autoregressive language models are trained exclusively left-to-right, missing the complementary perspective of conditioning on future context to predict the past. The authors explore what reasoning patterns emerge from reverse training and how bidirectional scoring can improve reasoning quality.
Method: Train LEDOM, a purely reverse autoregressive language model (2B/7B parameters, 435B tokens) that predicts tokens right-to-left. Then apply noisy channel duality to combine forward likelihood P(y|x) with reverse posterior P(x|y) through Reverse Reward, which reranks forward outputs using reverse posterior estimates to penalize hallucinated reasoning chains.
Result: LEDOM develops distinct capabilities including abductive inference, question synthesis, and natural resolution of the reversal curse. Reverse Reward yields gains of up to 6.6% on AIME 2024 and 15% on AMC 2023 across multiple strong baselines by effectively penalizing hallucinated reasoning chains whose backward reconstruction degrades.
Conclusion: Reverse autoregressive models offer complementary reasoning patterns to forward models, and bidirectional scoring through noisy channel duality provides a principled way to improve reasoning quality by detecting and penalizing hallucinated content.
Abstract: Autoregressive language models are trained exclusively left-to-right. We explore the complementary factorization, training right-to-left at scale, and ask what reasoning patterns emerge when a model conditions on future context to predict the past. We train LEDOM, an open-source purely reverse autoregressive language model (2B/7B parameters, 435B tokens), and find it develops capabilities distinct from forward models, including abductive inference, question synthesis, and natural resolution of the reversal curse. We then explore one application of the reverse model: combining forward likelihood $P(y \mid x)$ with reverse posterior $P(x \mid y)$ through noisy channel duality. We propose Reverse Reward, which reranks forward outputs using reverse posterior estimates, and prove that bidirectional scoring penalizes hallucinated reasoning chains whose backward reconstruction degrades. Reverse Reward yields gains of up to 6.6% on AIME 2024 and 15% on AMC 2023 across multiple strong baselines. We release all models, code, and data here.
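The noisy-channel reranking step can be sketched as follows, with log-probabilities from the forward and reverse models supplied externally. The interpolation weight `lam` is an assumption; the paper may combine the two terms differently:

```python
def reverse_reward_rerank(candidates, lam=0.5):
    """Pick the forward generation y maximizing the combined score
    log P(y|x) + lam * log P(x|y). A hallucinated reasoning chain whose
    backward reconstruction of the prompt x degrades receives a low
    reverse term and is demoted.
    `candidates`: list of (text, forward_logp, reverse_logp) tuples."""
    return max(candidates, key=lambda c: c[1] + lam * c[2])[0]
```

Note that the reverse model is only used for scoring here; generation still runs left-to-right.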
[60] Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
Main category: cs.CL
TL;DR: Skywork-Reward-V2 is a suite of reward models (0.6B-8B parameters) trained on SynPref-40M, a large-scale human-AI curated preference dataset, achieving SOTA performance across multiple benchmarks.
Details
Motivation: Current open reward models perform poorly on evaluation benchmarks due to limitations in preference datasets (narrow scope, synthetic labels, poor quality control). The authors aim to address these data quality issues through large-scale, high-quality human-AI collaborative curation.
Method: Developed SynPref-40M, a 40M preference pair dataset using a human-AI synergistic two-stage pipeline: humans provide verified annotations while LLMs perform automatic curation based on human guidance. Trained Skywork-Reward-V2 models (8 models from 0.6B to 8B parameters) on a curated subset of 26M pairs.
Result: Skywork-Reward-V2 achieves state-of-the-art performance across seven major reward model benchmarks, outperforms generative reward models, and demonstrates strong downstream performance. Models show versatility across human preference alignment, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling.
Conclusion: The work demonstrates substantial progress in open reward models through human-AI curation synergy, showing that effectiveness stems from both data scale and high-quality curation. The Skywork-Reward-V2 series represents a significant advancement in reward modeling capabilities.
Abstract: Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling. These reward models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance. Ablation studies confirm that effectiveness stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, demonstrating how human-AI curation synergy can unlock significantly higher data quality.
[61] Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo
Main category: cs.CL
TL;DR: A framework for generating and validating psychometric survey items for LLMs using virtual respondent simulation that accounts for mediators (factors influencing how traits manifest in responses).
Details
Motivation: As LLMs are increasingly assessed using psychometric surveys, there's a need for scalable survey item generation with construct validity, but traditional validation requires costly human data collection.
Method: Proposes a framework using LLMs to simulate virtual respondents with diverse mediators (factors through which traits influence responses). Methods include mediator generation from trait definitions and simulation of respondent behavior for item validation.
Result: Experiments on three psychological trait theories (Big5, Schwartz, VIA) show the mediator generation methods and simulation framework effectively identify high-validity survey items. LLMs demonstrate ability to generate plausible mediators and simulate respondent behavior.
Conclusion: The framework enables cost-effective survey development for LLM assessment and provides insights into how LLMs simulate human survey responses. Dataset and code are publicly released.
Abstract: As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs simulate human survey responses. We publicly release our dataset and code to support future work.
[62] Not All Errors Are Created Equal: ASCoT Addresses Late-Stage Fragility in Efficient LLM Reasoning
Dongxu Zhang, Yujun Wu, Yiding Sun, Jinnan Yang, Ning Yang, Jihua Zhu, Miao Xin, Baoliang Tian
Main category: cs.CL
TL;DR: ASCoT: Adaptive Self-Correction Chain-of-Thought method that identifies late-stage reasoning errors as most critical and uses semantic pruning with adaptive verification to improve efficiency and reliability.
Details
Motivation: Current Chain-of-Thought prompting lacks reliability, and contrary to the cascading failure hypothesis, the authors identify that errors in later reasoning stages are actually more detrimental to final answers than early errors.
Method: ASCoT uses semantic pruning to compress redundant steps, then an Adaptive Verification Manager with positional impact scoring to prioritize high-risk late-stage steps, triggering a Multi-Perspective Self-Correction Engine only when necessary.
Result: On GSM8K and MATH-500 benchmarks, ASCoT reduces token usage by 21-30% for LLaMA-3.1-8B with negligible accuracy drops (<1.8%), achieving superior trade-off between inference efficiency and reasoning fidelity.
Conclusion: Late-stage fragility is a critical vulnerability in reasoning chains, and ASCoT’s adaptive verification approach effectively addresses this while maintaining computational efficiency.
Abstract: While Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs), ensuring reasoning reliability remains an open challenge. Contrary to the prevailing cascading failure hypothesis which posits that early errors are most detrimental, we identify a counter-intuitive phenomenon termed \textbf{Late-Stage Fragility}: errors introduced in later reasoning stages are significantly more prone to corrupting final answers. To address this, we introduce ASCoT (Adaptive Self-Correction Chain-of-Thought), a method harmonizing efficiency with robust verification. ASCoT first employs semantic pruning to compress redundant steps, then utilizes an Adaptive Verification Manager (AVM) to prioritize high-risk, late-stage steps via a positional impact score, triggering a Multi-Perspective Self-Correction Engine (MSCE) only when necessary. Experiments on GSM8K and MATH-500 demonstrate that ASCoT effectively reallocates computational resources: it reduces token usage by 21%–30% for LLaMA-3.1-8B with negligible accuracy drops ($<1.8\%$), achieving a superior trade-off between inference efficiency and reasoning fidelity.
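A minimal sketch of position-weighted verification follows; the power-law scoring and the threshold are illustrative assumptions, not the paper's actual AVM formula:

```python
def positional_impact_scores(num_steps, alpha=2.0):
    """Risk score per reasoning step, growing toward the end of the
    chain to reflect Late-Stage Fragility (alpha is assumed)."""
    return [((i + 1) / num_steps) ** alpha for i in range(num_steps)]

def steps_to_verify(num_steps, threshold=0.5):
    """Indices of high-risk steps that would trigger the self-correction
    engine; early, low-impact steps are skipped to save compute."""
    return [i for i, s in enumerate(positional_impact_scores(num_steps))
            if s >= threshold]
```

The efficiency gain comes from verifying only the late tail of the chain rather than every step.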
[63] Link Prediction for Event Logs in the Process Industry
Anastasia Zhukova, Thomas Walton, Christian E. Lobmüller, Bela Gipp
Main category: cs.CL
TL;DR: A record linking model for German process industry shift logs that combines cross-document coreference resolution with NLP techniques to connect fragmented event records, improving graph-based RAG data quality.
Details
Motivation: Fragmented event logs in process industry shift books hinder effective knowledge retrieval and problem-solving, as related records are kept separate despite belonging to single events, reducing the quality of graph-based RAG systems.
Method: Develops a record linking model defined as a cross-document coreference resolution task, combining two state-of-the-art CDCR models with natural language inference and semantic text similarity principles for link prediction.
Result: The RL model outperformed baseline NLP and STS approaches by 28% (11.43 percentage points) and 27.4% (11.21 percentage points) respectively, demonstrating significant improvement in linking fragmented records.
Conclusion: Common NLP tasks can be effectively combined and adapted for domain-specific settings like the German process industry, improving data quality and connectivity in shift logs for better graph-based RAG applications.
Abstract: In the era of graph-based retrieval-augmented generation (RAG), link prediction is a significant preprocessing step for improving the quality of fragmented or incomplete domain-specific data for the graph retrieval. Knowledge management in the process industry uses RAG-based applications to optimize operations, ensure safety, and facilitate continuous improvement by effectively leveraging operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records are often kept separate, even though they belong to a single event or process. This fragmentation hinders the recommendation of previously implemented solutions to users, which is crucial in the timely problem-solving at live production sites. To address this problem, we develop a record linking (RL) model, which we define as a cross-document coreference resolution (CDCR) task. RL adapts the task definition of CDCR and combines two state-of-the-art CDCR models with the principles of natural language inference (NLI) and semantic text similarity (STS) to perform link prediction. The evaluation shows that our RL model outperformed the best versions of our baselines, i.e., NLP and STS, by 28% (11.43 p) and 27.4% (11.21 p), respectively. Our work demonstrates that common NLP tasks can be combined and adapted to a domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.
[64] No Text Needed: Forecasting MT Quality and Inequity from Fertility and Metadata
Jessica M. Lundin, Ada Zhang, David Adelani, Cody Carroll
Main category: cs.CL
TL;DR: Predicting translation quality without running translation systems using token fertility ratios, counts, and linguistic metadata features
Details
Motivation: To develop methods for forecasting translation quality without computationally expensive translation system execution, enabling efficient quality estimation across many languages.
Method: Use token fertility ratios, token counts, and basic linguistic metadata (language family, script, region) as features in gradient boosting models to predict ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark.
Result: Gradient boosting models achieve R²=0.66 for XX→English and R²=0.72 for English→XX translations; feature importance shows typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages.
Conclusion: Translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation without running translation systems.
Abstract: We show that translation quality can be predicted with surprising accuracy \textit{without ever running the translation system itself}. Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance ($R^{2}=0.66$ for XX$\rightarrow$English and $R^{2}=0.72$ for English$\rightarrow$XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.
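The text-free feature extraction can be sketched as below. The exact feature set and categorical encoding are assumptions, and the gradient-boosting fit itself is omitted:

```python
def fertility_ratio(src_tokens, tgt_tokens):
    """Token fertility: target-to-source token count ratio."""
    return len(tgt_tokens) / max(len(src_tokens), 1)

def feature_row(src_tokens, tgt_tokens, family, script, region):
    """One feature row for the ChrF predictor: fertility, token counts,
    and linguistic metadata -- no translation output is needed."""
    return {
        "fertility": fertility_ratio(src_tokens, tgt_tokens),
        "src_len": len(src_tokens),
        "tgt_len": len(tgt_tokens),
        "family": family,
        "script": script,
        "region": region,
    }
```

Rows like these, over the 203 FLORES-200 languages, would then be fed to a standard gradient-boosting regressor to predict ChrF.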
[65] No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
Main category: cs.CL
TL;DR: LLMs contain internal activations that predict answer correctness before generation, with probes trained on trivia questions generalizing to diverse knowledge tasks but failing on mathematical reasoning.
Details
Motivation: To understand whether LLMs internally anticipate their own answer correctness before generating responses, and to explore if this internal signal can be detected and generalized across different types of knowledge tasks.
Method: Extract activations after question reading but before token generation, train linear probes to predict forthcoming answer correctness, test across three open-source model families (7-70B parameters), evaluate on in-distribution trivia and out-of-distribution knowledge datasets, and compare with black-box baselines and verbalized confidence.
Result: Probes trained on generic trivia questions successfully predict correctness across diverse knowledge datasets, outperforming baselines, with predictive power saturating in intermediate layers. However, generalization fails on mathematical reasoning questions. The same direction also captures confidence for “I don’t know” responses.
Conclusion: LLMs contain internal signals that anticipate answer correctness before generation, providing insights into model internals and confidence mechanisms, though mathematical reasoning presents unique challenges for this approach.
Abstract: Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model’s forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this “in-advance correctness direction” trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, indicating a deeper signal than dataset-specific spurious features, and outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers and, notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding “I don’t know”, doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.
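At inference time such a probe is just a logistic scorer over the pre-generation hidden state; fitting the weights (on generic trivia questions, per the paper) is omitted, and the sigmoid form is the standard linear-probe assumption:

```python
import math

def probe_score(activation, weights, bias=0.0):
    """Estimated probability that the forthcoming answer will be correct,
    computed from the hidden state taken after the question is read but
    before any token is generated: sigmoid(w . h + b)."""
    z = sum(w * h for w, h in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The paper's finding is that a single such direction, fit once, transfers across knowledge datasets but not to mathematical reasoning.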
[66] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity
Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo
Main category: cs.CL
TL;DR: A prior-based data filtering method using corpus-level term frequency statistics that serves as a fast proxy to perplexity-based filtering, achieving similar performance with 1000x speedup.
Details
Motivation: Perplexity-based filtering for LLM pretraining data selection is time-consuming and unreliable on noisy/out-of-distribution samples, creating a need for faster, more reliable alternatives.
Method: Uses token priors estimated from corpus-level term frequency statistics, filtering documents based on mean and standard deviation of token priors without requiring model inference.
Result: Achieves highest average performance across 20 downstream benchmarks while reducing time cost by over 1000x compared to PPL-based filtering; works for code, math, and multilingual corpora.
Conclusion: Prior-based filtering is a simple, efficient alternative to perplexity-based methods that maintains performance while dramatically reducing computational costs.
Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
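The filter can be sketched in a few lines. The threshold values are illustrative assumptions, and tokens unseen in the corpus are simply skipped here:

```python
import math
from collections import Counter

def token_log_priors(corpus_tokens):
    """Log-priors from corpus-level term frequencies -- no model
    inference is involved, which is the source of the speedup."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: math.log(n / total) for tok, n in counts.items()}

def keep_document(doc_tokens, priors, mean_range=(-8.0, -1.0), max_std=3.0):
    """Keep a document only if the mean of its tokens' log-priors lies
    within `mean_range` and their standard deviation stays below
    `max_std` (both thresholds are assumed values)."""
    vals = [priors[t] for t in doc_tokens if t in priors]
    if not vals:
        return False
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return mean_range[0] <= mean <= mean_range[1] and std <= max_std
```

Because the priors are a single corpus-wide lookup table, filtering a document costs a handful of dictionary reads rather than a forward pass.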
[67] Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty
Main category: cs.CL
TL;DR: N-gram novelty alone is insufficient for measuring textual creativity; expert annotations reveal that 91% of top-quartile n-gram novel expressions aren’t creative, and LLMs struggle with pragmaticality in creative generation.
Details
Motivation: Current evaluation of language models' creativity relies heavily on n-gram novelty metrics, but theoretical creativity research emphasizes both novelty and appropriateness (sensicality and pragmaticality). The authors therefore validate whether n-gram novelty aligns with human expert judgments of creativity.
Method: Collected 8,618 expert writer annotations via close reading of human- and AI-generated text, rating novelty, pragmaticality, and sensicality. Analyzed relationship between n-gram novelty and expert judgments. Tested LLMs' ability to identify novel/non-pragmatic expressions using zero-shot, few-shot, and fine-tuned approaches. Compared LLM-as-a-Judge ratings with n-gram metrics.
Result: N-gram novelty positively correlates with expert-judged creativity, but 91% of top-quartile n-gram novel expressions aren’t creative. Higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. Frontier closed-source models produce fewer creative expressions than humans. LLMs perform above random but struggle with identifying non-pragmatic expressions. LLM-as-a-Judge aligns better with expert preferences than n-gram metrics.
Conclusion: N-gram novelty alone is inadequate for measuring creativity; need to consider appropriateness dimensions. LLMs have limitations in creative generation and evaluation, especially regarding pragmaticality. LLM-as-a-Judge shows promise but needs improvement for reliable creativity assessment.
Abstract: N-gram novelty is widely used to evaluate language models’ ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity’s dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 8,618 expert writer annotations of novelty, pragmaticality, and sensicality via close reading of human- and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, approximately 91% of top-quartile n-gram novel expressions are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike in human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify expressions perceived as novel by experts (a positive aspect of writing) or non-pragmatic (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty ratings align with expert writer preferences in an out-of-distribution dataset, more so than an n-gram based metric.
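For reference, the critiqued metric itself is straightforward to compute; this is a sketch, with the reference n-grams assumed to come from the model's training data:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text_tokens, corpus_ngrams, n=3):
    """Fraction of the text's n-grams absent from the reference corpus.
    Per the paper's caution: this captures only novelty, and says
    nothing about sensicality or pragmaticality."""
    grams = ngrams(text_tokens, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in corpus_ngrams) / len(grams)
```

A string of gibberish scores a perfect 1.0 on this metric, which is exactly the failure mode the expert annotations expose.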
[68] ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
Main category: cs.CL
TL;DR: ManagerBench evaluates LLM safety in agentic decision-making scenarios where operational goals conflict with human safety, revealing models struggle with safety-pragmatism trade-offs.
Details
Motivation: Current safety benchmarks focus on harmful content generation but overlook agents taking harmful actions when operational goals conflict with human safety. There's a need to evaluate LLM decision-making in realistic scenarios where safety and pragmatism must be balanced.
Method: Introduces ManagerBench with human-validated managerial scenarios forcing choices between pragmatic but harmful actions vs. safe but operationally worse actions. Includes parallel control set with harm directed only at inanimate objects to measure pragmatism and identify overly safe tendencies.
Result: Frontier LLMs perform poorly on safety-pragmatism trade-offs: many consistently choose harmful options for operational goals, while others become overly safe and ineffective. Misalignment stems from flawed prioritization, not inability to perceive harm (models’ harm assessments align with human judgments).
Conclusion: ManagerBench is a challenging benchmark for agentic behavior, evaluating safe choices when operational goals and alignment values conflict. Current LLMs show significant safety-pragmatism misalignment in decision-making scenarios.
Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model’s pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models’ harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://technion-cs-nlp.github.io/ManagerBench-website/.
[69] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications
Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo, Dat Quoc Nguyen
Main category: cs.CL
TL;DR: AccurateRAG is a comprehensive framework for building high-performance question-answering systems using retrieval-augmented generation, featuring tools for data processing, fine-tuning, evaluation, and local deployment.
Details
Motivation: To address the need for efficient development of high-performance RAG-based question-answering applications with comprehensive tools for the entire pipeline from data processing to deployment.
Method: Provides a complete pipeline with tools for raw dataset processing, fine-tuning data generation, text embedding and LLM fine-tuning, output evaluation, and building RAG systems locally.
Result: Outperforms previous strong baselines and achieves new state-of-the-art question-answering performance on benchmark datasets.
Conclusion: AccurateRAG offers an effective framework for developing high-performance RAG-based QA systems with comprehensive tooling for the entire development lifecycle.
Abstract: We introduce AccurateRAG – a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.
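The retrieve-then-generate loop at the core of any such framework can be sketched in a few lines. This is our own bag-of-words stand-in, not AccurateRAG's code; a real system would use the fine-tuned text embeddings and LLM the paper describes.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the top-k passages for a query (toy lexical retriever)."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

docs = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
]
context = retrieve("What is the capital of France?", docs, k=1)
prompt = f"Context: {context[0]}\nQuestion: What is the capital of France?"
print(context[0])  # Paris is the capital of France.
```

The retrieved passage is stuffed into the prompt and handed to the generator; the framework's contribution is tooling around each stage of this loop (data processing, fine-tuning, evaluation).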
[70] Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
Main category: cs.CL
TL;DR: C2C enables direct semantic communication between LLMs via KV-cache projection and fusion, bypassing text generation for improved performance and speed.
Details
Motivation: Existing multi-LLM systems communicate through text, which loses rich semantic information and incurs token-by-token generation latency. The authors want to enable LLMs to communicate beyond text for better efficiency and performance.
Method: Proposes Cache-to-Cache (C2C) paradigm using neural networks to project and fuse source model’s KV-cache with target model’s KV-cache. Includes learnable gating mechanism to select target layers that benefit from cache communication.
Result: C2C achieves 6.4-14.2% higher average accuracy than individual models and outperforms text communication by 3.1-5.4% with 2.5x speedup in latency.
Conclusion: Direct semantic communication via KV-cache is more effective than text communication for multi-LLM systems, enabling better performance and efficiency.
Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains that are not attainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model’s KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 6.4-14.2% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.1-5.4%, while delivering an average 2.5x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
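The core projection-and-fusion step can be illustrated with numpy stand-ins for a single layer's cache. Shapes, names, and the scalar gate below are made up for illustration; the paper's learned projector and per-layer gating are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_kv(src_cache, tgt_cache, W, gate):
    """Gate-blend a projected source cache into the target cache.
    Toy stand-in for C2C's learned projector + gating mechanism."""
    projected = src_cache @ W                  # (seq, tgt_dim)
    g = 1.0 / (1.0 + np.exp(-gate))            # sigmoid gate in [0, 1]
    return g * projected + (1.0 - g) * tgt_cache

src = rng.standard_normal((10, 64))      # source model's keys for one layer
tgt = rng.standard_normal((10, 32))      # target model's keys for one layer
W = 0.1 * rng.standard_normal((64, 32))  # projection (learned in the paper)
fused = fuse_kv(src, tgt, W, gate=0.0)
print(fused.shape)  # (10, 32)
```

Because the fused cache replaces the target's cache in place, the target model decodes normally afterward; no intermediate text is ever generated, which is where the reported latency savings come from.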
[71] A Set of Quebec-French Corpus of Regional Expressions and Terms
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
Main category: cs.CL
TL;DR: The paper introduces new benchmark datasets for testing dialect understanding through regional idioms in Quebec French vs. Metropolitan French, revealing significant performance gaps in LLMs favoring the prestige dialect.
Details
Motivation: To address the gap in dialect understanding evaluation by combining idiom comprehension with dialect analysis, specifically focusing on Quebec French as a case study for regional dialect competence in language models.
Method: Created three benchmark datasets: QFrCoRE (4,633 Quebec idiomatic phrases), QFrCoRT (171 Quebec regional idiomatic words), and MFrCoE (4,938 Metropolitan French expressions). Tested 111 LLMs on these benchmarks to quantify dialectal competence disparities.
Result: Models performed well on Metropolitan French but 65.8% performed significantly worse on Quebec idioms, with only 9.0% favoring the regional dialect, confirming a critical dialect gap in LLM capabilities.
Conclusion: The benchmarks reliably quantify dialect gaps in language models, demonstrating that prestige-language proficiency doesn’t guarantee regional dialect understanding, with implications for dialect-aware NLP systems.
Abstract: The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose two new benchmark datasets for the Quebec dialect of French: QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words, and a new benchmark for French Metropolitan expressions, MFrCoE, which comprises 4,938 phrases. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 111 LLMs reveal a critical disparity in dialectal competence: while models perform well on Metropolitan French, 65.8% of them perform significantly worse on Quebec idioms, with only 9.0% favoring the regional dialect. These results confirm that our benchmarks are a reliable tool for quantifying the dialect gap and that prestige-language proficiency does not guarantee regional dialect understanding.
[72] Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability
Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, Yejin Choi
Main category: cs.CL
TL;DR: Post-training hurts models’ ability to steer to diverse output distributions; Spectrum Tuning improves steerability and distributional coverage.
Details
Motivation: Current language model post-training improves instruction-following but harms performance on tasks requiring diverse valid answers like creative writing, synthetic data generation, or preference steering.
Method: Introduces Spectrum Suite (40+ data sources, 90+ tasks) to evaluate distributional modeling, and proposes Spectrum Tuning as a post-training method to improve steerability and coverage
Result: Post-training reduces in-context steerability and distributional coverage; Spectrum Tuning improves over pretrained and instruction-tuned models on steerability, output space coverage, and distributional alignment
Conclusion: Standard post-training harms models’ ability to flexibly steer to diverse distributions; Spectrum Tuning addresses this limitation and improves distributional modeling capabilities
Abstract: Language model post-training has enhanced instruction-following and performance on many downstream tasks, but also comes with an often-overlooked cost on tasks with many possible valid answers. On many tasks such as creative writing, synthetic data generation, or steering to diverse preferences, models must cover an entire distribution of outputs, rather than a single correct answer. We characterize three desiderata for conditional distributional modeling: in-context steerability, valid output space coverage, and distributional alignment, and document across three model families how current post-training can reduce these properties. In particular, we disambiguate between two kinds of in-context learning: ICL for eliciting existing underlying knowledge or capabilities, and in-context steerability, where a model must use in-context information to override its priors and steer to a novel data generating distribution. To better evaluate and improve these desiderata, we introduce Spectrum Suite, a large-scale resource compiled from >40 data sources and spanning >90 tasks requiring models to steer to and match diverse distributions ranging from varied human preferences to numerical distributions and more. We find that while current post-training techniques elicit underlying capabilities and knowledge, they hurt models’ ability to flexibly steer in-context. To mitigate these issues, we propose Spectrum Tuning, a post-training method using Spectrum Suite to improve steerability and distributional coverage. We find that Spectrum Tuning often improves over pretrained and typical instruction-tuned models, enhancing steerability, spanning more of the output space, and improving distributional alignment on held-out datasets.
[73] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda
Main category: cs.CL
TL;DR: Narrow finetuning creates strong biases in LLM activations that reveal training domain information, which can be analyzed through model diffing and steering techniques.
Details
Motivation: To understand how narrow finetuning creates biases in LLM activations and develop methods to interpret these biases, which has implications for AI safety, interpretability research, and model training practices.
Method: Analyze activation differences between base and finetuned models using model diffing techniques, particularly focusing on first few tokens of random text. Create an LLM-based interpretability agent to understand finetuning domains, and test across various architectures (Gemma, LLaMA, Qwen) and scales (1B-32B parameters).
Result: Narrow finetuning creates strong, interpretable biases in activations that reveal training domain information. Adding these bias differences to model activations produces text similar to finetuning data format and content. The interpretability agent performs significantly better with bias access than baseline prompting. Biases reflect overfitting and can be reduced by mixing pretraining data into finetuning corpus.
Conclusion: Narrowly finetuned models contain salient traces of their training objectives in activations, which has implications for AI safety research, interpretability studies, and model training practices. Researchers should be cautious using such models as proxies for broader finetuning studies.
Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing, the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
[74] Are Language Models Borrowing-Blind? A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva, Sina Ahmadi
Main category: cs.CL
TL;DR: Pretrained language models struggle to identify loanwords vs native words, showing bias toward loanwords despite explicit instructions and context.
Details
Motivation: To investigate whether pretrained language models (including LLMs) can identify loanwords borrowed from dominant languages into minority languages, similar to how bilingual speakers can distinguish them, with implications for NLP tools in minority language preservation.
Method: Evaluated multiple pretrained language models across 10 languages, testing their ability to distinguish loanwords from native vocabulary using explicit instructions and contextual information.
Result: Models performed poorly in distinguishing loanwords from native ones, corroborating previous evidence that modern NLP systems exhibit bias toward loanwords rather than native equivalents.
Conclusion: Pretrained language models lack the capability to identify loanwords effectively, which has important implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
Abstract: Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient’s lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
[75] STARS: Synchronous Token Alignment for Robust Supervision in Large Language Models
Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik
Main category: cs.CL
TL;DR: STARS introduces synchronous token alignment for robust supervision, addressing limitations of uncertainty-based inference-time alignment methods by using fixed-horizon verification intervals instead of model confidence for segmentation.
Details
Motivation: Current inference-time alignment techniques rely on model uncertainty for segmentation, making them vulnerable to miscalibrated confident hallucinations and causing poor hardware utilization due to asynchronous batch processing, which reduces alignment reliability and increases computational costs.
Method: STARS (Synchronous Token Alignment for Robust Supervision) is a decoding-time algorithm that steers generation by enforcing verification at fixed-horizon intervals, decoupling segmentation from confidence to enable lockstep parallel execution and robust error detection.
Result: On the HH-RLHF benchmark, STARS achieves competitive alignment quality with state-of-the-art dynamic methods while strictly bounding rejection costs and maximizing system throughput, outperforming fine-tuning and several inference-time decoding strategies.
Conclusion: STARS establishes fixed-horizon sampling as a robust, system-efficient alternative for aligning LLMs at scale, addressing both reliability and computational efficiency limitations of uncertainty-based methods.
Abstract: Aligning large language models (LLMs) with human values is crucial for safe deployment. Inference-time techniques offer granular control over generation; however, they rely on model uncertainty (the model's internal estimate of how likely its next tokens or outputs are to be correct) for segmentation. We show that this introduces two critical limitations: (a) vulnerability to miscalibrated confident hallucinations and (b) poor hardware utilization due to asynchronous, ragged batch processing. Together, these issues reduce alignment reliability while increasing token and compute costs, which limits their practical scalability. To address these limitations, building on dynamic inference-time alignment methods, we introduce STARS, Synchronous Token Alignment for Robust Supervision, a decoding-time algorithm, which steers generation by enforcing verification at fixed-horizon intervals. By decoupling segmentation from confidence, STARS enables lockstep parallel execution and robustly detects errors that uncertainty metrics miss. On the HH-RLHF benchmark, we demonstrate that STARS achieves competitive alignment quality with that of state-of-the-art dynamic methods, while strictly bounding rejection costs and maximizing system throughput. Furthermore, it outperforms fine-tuning and several state-of-the-art inference-time decoding strategies by clear margins, and establishes fixed-horizon sampling as a robust, system-efficient alternative for aligning LLMs at scale. The code is publicly available at https://github.com/purseclab/STARS.
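The fixed-horizon idea reduces to a chunked decode loop: the verification schedule is fixed up front rather than triggered by model confidence. Below is our toy sketch, where `generate_chunk` and `verify` are hypothetical stand-ins for the LLM and the alignment verifier.

```python
import random

def decode_with_verification(generate_chunk, verify, max_tokens=16,
                             horizon=4, max_retries=3):
    """Chunked decoding with verification every `horizon` tokens.
    Toy sketch of fixed-horizon verified decoding: checkpoints are on
    a fixed schedule, and rejected segments are resampled with a
    bounded retry budget."""
    output = []
    while len(output) < max_tokens:
        chunk = generate_chunk(output, horizon)
        for _ in range(max_retries):
            if verify(chunk):
                break
            chunk = generate_chunk(output, horizon)  # resample rejected segment
        output.extend(chunk)  # accept (verified, or retry budget exhausted)
    return output[:max_tokens]

random.seed(0)
toy_model = lambda ctx, n: [random.randint(0, 9) for _ in range(n)]
is_safe = lambda chunk: 7 not in chunk   # toy "harmful token" check
out = decode_with_verification(toy_model, is_safe)
print(len(out))  # 16
```

Because every sequence in a batch hits a checkpoint after exactly `horizon` tokens, the batch can run in lockstep, which is the system-throughput argument the paper makes against uncertainty-triggered (ragged) verification.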
[76] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation
Niranjan Chebrolu, Gerard Christopher Yeo, Kokil Jaidka
Main category: cs.CL
TL;DR: Targeted activation engineering can steer LLMs to exhibit more human-like emotional nuances without extensive fine-tuning.
Details
Motivation: While LLMs show conversational fluency, they lack nuanced human-like emotional expression; current alignment techniques are either surface-level or require extensive fine-tuning.
Method: Use attribution patching to identify causally influential components, derive emotional expression vectors from activation differences between contrastive text pairs (positive vs. negative emotional examples), then apply these vectors to steer responses
Result: Steered responses show increased positive sentiment (joy, trust) and more frequent first-person pronoun usage, indicating greater personal engagement
Conclusion: Targeted activation engineering offers a precise, interpretable framework for enhancing emotional expression in conversational AI without extensive model retraining
Abstract: Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
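The contrastive-pair recipe is simple to sketch with numpy arrays standing in for residual-stream activations. This is our toy example; the paper extracts these vectors from LLaMA 3.1-8B activations at a locus found via attribution patching.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between contrastive prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(acts, vec, alpha=1.0):
    """Add the steering direction at every token position."""
    return acts + alpha * vec

rng = np.random.default_rng(1)
d = 16                                  # toy hidden size
emotion_dir = rng.standard_normal(d)    # unknown "ground-truth" direction
pos = rng.standard_normal((8, d)) + emotion_dir  # e.g. joyful prompts
neg = rng.standard_normal((8, d)) - emotion_dir  # e.g. negative prompts
vec = steering_vector(pos, neg)         # recovers roughly 2 * emotion_dir
steered = apply_steering(rng.standard_normal((5, d)), vec, alpha=0.5)
print(steered.shape)  # (5, 16)
```

In a real intervention the addition happens inside a forward hook on a chosen transformer layer, and `alpha` trades steering strength against fluency.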
[77] Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety
Junyu Mao, Anthony Hills, Talia Tseriotou, Maria Liakata, Aya Shamir, Dan Sayda, Dana Atzil-Slonim, Natalie Djohari, Arpan Mandal, Silke Roth, Pamela Ugwudike, Mahesan Niranjan, Stuart E. Middleton
Main category: cs.CL
TL;DR: A Confidence-Aware Fine-Grained Debate (CFD) framework for automated multi-label annotation using LLMs, with applications to mental health and online safety datasets.
Details
Motivation: Real-world indicators (like life-events for mental health analysis and risky behavior for online safety) are costly and difficult to label in training datasets due to their dynamic nature. LLMs show promise for automated annotation but multi-label prediction remains challenging.
Method: Proposes CFD framework that simulates collaborative annotation using fine-grained information to support automated multi-label enrichment. Introduces two expert-annotated datasets: mental health Reddit well-being dataset and online safety Facebook sharenting risk dataset.
Result: CFD achieves the most robust enrichment performance compared to baseline approaches. LLM-enriched indicators consistently improve downstream tasks, with debate transcripts yielding the largest gains (outperforming non-enriched baseline by 9.9% on online safety task).
Conclusion: The CFD framework effectively leverages LLMs for automated multi-label annotation of real-world indicators, demonstrating practical value for mental health and online safety applications through improved downstream task performance.
Abstract: Real-world indicators play an important role in many natural language processing (NLP) applications, such as life-event for mental health analysis and risky behaviour for online safety, yet labelling such information in training datasets is often costly and/or difficult due to their dynamic nature. Large language models (LLMs) show promising potential for automated annotation, yet multi-label prediction remains challenging. In this work, we propose a Confidence-Aware Fine-Grained Debate (CFD) framework that simulates collaborative annotation using fine-grained information to better support automated multi-label enrichment. We introduce two new expert-annotated resources: a mental health Reddit well-being dataset and an online safety Facebook sharenting risk dataset. Experiments show that CFD achieves the most robust enrichment performance compared to a range of baseline approaches. We further evaluate various training-free enrichment incorporation strategies and demonstrate that LLM-enriched indicators consistently improve our downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 9.9% on the online safety task.
[78] Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, Tat-Seng Chua
Main category: cs.CL
TL;DR: FlyThinker is a “think-while-generating” framework for personalized long-form generation that enables concurrent reasoning and generation through parallel latent token-level reasoning.
Details
Motivation: Current preference alignment methods for LLMs optimize for population-level preferences, overlooking individual users. Early personalization approaches struggle with implicit preferences, and recent "think-then-generate" methods face challenges in long-form generation due to static one-shot reasoning that must capture all relevant information upfront.
Method: FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. The reasoning model depends only on previous responses rather than its own prior outputs, preserving training parallelism across positions.
Result: Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while maintaining training and inference efficiency.
Conclusion: FlyThinker addresses limitations of existing personalization methods by enabling concurrent reasoning and generation, making it effective for personalized long-form generation while maintaining efficiency.
Abstract: Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches, such as prompt customization or fine-tuning, struggle to reason over implicit preferences, limiting real-world effectiveness. Recent “think-then-generate” methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient “think-while-generating” framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions, allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.
[79] GUMBridge: a Corpus for Varieties of Bridging Anaphora
Lauren Levine, Amir Zeldes
Main category: cs.CL
TL;DR: GUMBridge is a new English bridging anaphora resource with 16 diverse genres, offering broad coverage and granular subtype annotations, with evaluation showing LLMs still struggle with bridging resolution tasks.
Details
Motivation: Existing bridging anaphora resources in English are limited in size, coverage of the phenomenon, and genre diversity, creating a need for a more comprehensive resource to advance research in this area.
Method: Created GUMBridge resource covering 16 diverse English genres with detailed annotations for bridging subtypes, then evaluated annotation quality and tested baseline performance using contemporary LLMs on three core tasks.
Result: The resource provides broad coverage of bridging phenomena with high annotation quality, while LLM evaluations show bridging resolution and subtype classification remain challenging tasks even for state-of-the-art models.
Conclusion: GUMBridge addresses limitations of existing resources and demonstrates that bridging anaphora resolution remains a difficult NLP challenge requiring specialized resources and approaches beyond current LLM capabilities.
Abstract: Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in “There is ‘a house’. ‘The door’ is red,” where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.
[80] Activation Steering for Masked Diffusion Language Models
Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid
Main category: cs.CL
TL;DR: The paper introduces activation steering for masked diffusion language models (MDLMs) to control safety refusal behavior via low-dimensional activation directions extracted from contrastive prompts, enabling efficient inference-time control without optimization.
Details
Motivation: Masked diffusion language models offer unique advantages like mask-parallel decoding and different controllability-efficiency tradeoffs compared to autoregressive LLMs, but lack efficient representation-level mechanisms for inference-time control. The authors aim to address this gap using safety refusal as a deployment-relevant case study.
Method: Develop an activation steering primitive for MDLMs: extract a single low-dimensional direction from contrastive prompt sets using one prompt-only forward pass, then apply global intervention on residual-stream activations throughout reverse diffusion without optimization or altering diffusion sampling.
Result: Found refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace. The method yields large systematic behavioral shifts, is more effective than prompt-based and optimization-based baselines, and reveals diffusion-specific accessibility where pre-instruction tokens (ineffective in autoregressive models) can be used.
Conclusion: Activation steering enables efficient inference-time control in MDLMs, revealing architecture-dependent representations of safety constraints. Directions transfer between languages in multilingual MDLMs but don’t generalize to autoregressive architectures, highlighting architectural differences in how safety constraints are represented.
Abstract: Masked diffusion language models (MDLMs) generate text via iterative masked-token denoising, enabling mask-parallel decoding and distinct controllability and efficiency tradeoffs from autoregressive LLMs. Yet, efficient representation-level mechanisms for inference-time control in MDLMs remain largely unexplored. To address this gap, we introduce an activation steering primitive for MDLMs: we extract a single low-dimensional direction from contrastive prompt sets using one prompt-only forward pass, and apply a global intervention on residual-stream activations throughout reverse diffusion, without performing optimization or altering the diffusion sampling procedure. Using safety refusal as a deployment-relevant case study, we find that refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace. Applying the corresponding direction yields large and systematic behavioral shifts and is substantially more effective than prompt-based and optimization-based baselines. We further uncover diffusion-specific accessibility: effective directions can be extracted not only from post-instruction tokens, but also from pre-instruction tokens that are typically ineffective in autoregressive models due to causal attention. Ablations localize maximal leverage to early denoising steps and mid-to-late transformer layers, with early diffusion blocks contributing disproportionately. Finally, in an MDLM trained on English and Chinese, extracted directions transfer strongly between English and Chinese, but do not reliably generalize to an autoregressive architecture, highlighting architecture-dependent representations of safety constraints.
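The contrastive extraction and global residual-stream intervention described above can be sketched in a few lines. This is a minimal numpy illustration under the common difference-of-means assumption; all names, shapes, and the synthetic data are hypothetical, not the authors' implementation:

```python
import numpy as np

def extract_steering_direction(pos_acts, neg_acts):
    """Difference-of-means direction from contrastive prompt activations.

    pos_acts / neg_acts: arrays of shape (n_prompts, d_model), e.g. the
    residual-stream activations collected from one prompt-only forward pass.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(residual_stream, direction, alpha):
    """Global intervention: add the scaled direction to every token's activation."""
    return residual_stream + alpha * direction

rng = np.random.default_rng(0)
d = 16
refuse = rng.normal(0.0, 1.0, (8, d)) + 2.0   # toy "refusal" prompts, shifted
comply = rng.normal(0.0, 1.0, (8, d))          # toy "compliance" prompts
v = extract_steering_direction(refuse, comply)

acts = rng.normal(0.0, 1.0, (5, d))            # 5 tokens of residual stream
steered = steer(acts, v, alpha=4.0)
# Projection onto the refusal direction increases by exactly alpha per token.
print((steered @ v).mean() > (acts @ v).mean())
```

No optimization is involved: the direction comes from a single forward pass over the contrastive prompts, and the same additive intervention would be applied at every reverse-diffusion step.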
[81] Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
Pedro Memoli Buffa, Luciano Del Corro
Main category: cs.CL
TL;DR: LLM monitoring method using output-entropy profiles from next-token probabilities to estimate slice-level accuracy under domain shift, enabling scalable performance monitoring and targeted data acquisition.
Details
Motivation: Addresses two coupled challenges in LLM deployment: (1) monitoring model performance as traffic and domains drift, and (2) prioritizing data acquisition to close performance gaps, by testing whether inference-time signals can estimate slice-level accuracy under domain shift.
Method: Computes output-entropy profiles from final-layer next-token probabilities (using top-k logprobs) for each response, summarizes with different statistics, uses lightweight classifier to predict instance correctness, and averages predicted probabilities to get domain-level accuracy estimates.
Result: Evaluated on ten STEM reasoning benchmarks with exhaustive train/test compositions across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains.
Conclusion: Output-entropy profiles provide an accessible signal for scalable monitoring and targeted data acquisition, with evidence supporting their effectiveness for estimating slice-level accuracy under domain shift.
Abstract: Deploying LLMs raises two coupled challenges: (1) monitoring–estimating where a model underperforms as traffic and domains drift–and (2) improvement–prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-$k$ logprobs) and summarize it with different statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), on different classifier models and features across nine LLMs from six families (3B–20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains, providing evidence for output-entropy profiles being an accessible signal for scalable monitoring and for targeted data acquisition.
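The per-response entropy profile is straightforward to compute from top-k logprobs. A stdlib-only sketch (the summary statistics and toy logprob values are illustrative; the paper additionally trains a lightweight classifier on these features to predict instance correctness):

```python
import math

def entropy_from_topk(logprobs):
    """Shannon entropy (nats) of a renormalized top-k next-token distribution."""
    probs = [math.exp(lp) for lp in logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs)

def entropy_profile(token_logprobs):
    """Per-token entropies for one response, summarized with simple statistics."""
    ents = [entropy_from_topk(lps) for lps in token_logprobs]
    n = len(ents)
    mean = sum(ents) / n
    var = sum((e - mean) ** 2 for e in ents) / n
    return {"mean": mean, "std": var ** 0.5, "max": max(ents), "min": min(ents)}

# A confident token (one dominant logprob) vs. an uncertain one (flat top-4).
confident = [0.0, -6.0, -6.0, -6.0]
uncertain = [-1.386, -1.386, -1.386, -1.386]   # roughly uniform over 4 tokens
profile = entropy_profile([confident, uncertain])
print(profile["max"] > profile["min"])
```

Averaging a classifier's predicted correctness probabilities over all responses in a traffic slice then gives the domain-level accuracy estimate.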
[82] Contextual Drag: How Errors in the Context Affect LLM Reasoning
Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora
Main category: cs.CL
TL;DR: LLMs exhibit “contextual drag” where failed attempts in context bias subsequent generations toward similar errors, causing 10-20% performance drops and potential self-deterioration in iterative refinement.
Details
Motivation: The paper investigates a critical limitation in self-improvement pipelines for LLMs, challenging the assumption that models can effectively learn from past mistakes. It identifies a phenomenon where exposure to failed attempts actually worsens performance rather than improving it.
Method: Evaluated 11 proprietary and open-weight models on 8 reasoning tasks, measured performance drops due to contextual drag, used tree edit distance for structural analysis of error patterns, tested mitigation strategies including fallback-behavior fine-tuning and context denoising.
Result: Contextual drag causes 10-20% performance drops across models, with iterative self-refinement potentially collapsing into self-deterioration. Neither external feedback nor successful self-verification eliminates the effect. Mitigation strategies provide only partial improvements, failing to fully restore baseline performance.
Conclusion: Contextual drag represents a persistent failure mode in current reasoning architectures that undermines self-improvement assumptions, suggesting fundamental limitations in how LLMs process and learn from contextual information containing errors.
Abstract: Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.
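The core measurement is a paired comparison: the same question answered with a clean context versus a context seeded with a prior failed attempt. A toy harness under hypothetical names (the deliberately biased `toy_model` only illustrates the inheritance effect; it is not the paper's evaluation setup):

```python
def build_prompt(question, failed_attempt=None):
    """Assemble an evaluation prompt; optionally seed it with a failed attempt."""
    parts = [f"Question: {question}"]
    if failed_attempt is not None:
        parts.append(f"Previous attempt (incorrect): {failed_attempt}")
        parts.append("The attempt above is wrong. Try again from scratch.")
    parts.append("Answer:")
    return "\n".join(parts)

def drag_gap(model, questions, failed_attempts, answers):
    """Accuracy drop when failed attempts are present in the context."""
    clean = sum(model(build_prompt(q)) == a for q, a in zip(questions, answers))
    seeded = sum(
        model(build_prompt(q, f)) == a
        for q, f, a in zip(questions, failed_attempts, answers)
    )
    n = len(questions)
    return clean / n - seeded / n

# Toy "model" that repeats any answer it has seen in context.
def toy_model(prompt):
    return "7" if "incorrect" in prompt else "5"

gap = drag_gap(toy_model, ["2+3?"], ["7"], ["5"])
print(gap)  # 1.0: the seeded context flips a correct answer to the inherited error
```

A positive gap on real models is exactly the contextual-drag effect the paper quantifies at 10-20%, even when the context explicitly labels the attempt as wrong.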
[83] Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki
Main category: cs.CL
TL;DR: LLM personalization using personality traits as latent signals behind noisy preferences, with a framework for retrieving personality-aligned preferences to improve answer generation.
Details
Motivation: User preferences are noisy and unreliable for LLM personalization; personality traits provide stable latent signals that can better guide personalized response generation.
Method: 1) Study personality as latent signal behind preferences; 2) Create PACIFIC dataset with 1200 preference statements annotated with Big-Five traits; 3) Develop framework for LLMs to retrieve personality-aligned preferences and incorporate them during generation.
Result: Conditioning on personality-aligned preferences improves personalized QA accuracy from 29.25% to 76% vs. random preferences. PACIFIC dataset enables personality-aware preference modeling.
Conclusion: Personality traits provide reliable latent signals for LLM personalization, enabling more effective use of preferences through personality alignment.
Abstract: User preferences are increasingly used to personalize Large Language Model (LLM) responses, yet how to reliably leverage preference signals for answer generation remains under-explored. In practice, preferences can be noisy, incomplete, or even misleading, which can degrade answer quality when applied naively. Motivated by the observation that stable personality traits shape everyday preferences, we study personality as a principled "latent" signal behind preference statements. Through extensive experiments, we find that conditioning on personality-aligned preferences substantially improves personalized question answering: selecting preferences consistent with a user's inferred personality increases answer-choice accuracy from 29.25% to 76%, compared to using randomly selected preferences. Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g., travel, movies, education), annotated with Big-Five (OCEAN) trait directions. Finally, we propose a framework that enables an LLM to automatically retrieve personality-aligned preferences and incorporate them during answer generation.
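Retrieving personality-aligned preferences reduces, at its simplest, to ranking preference statements by similarity between the user's inferred OCEAN profile and each statement's annotated trait direction. A stdlib sketch with hypothetical trait vectors (the paper's retrieval is LLM-driven; cosine ranking here is an illustrative stand-in):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length trait vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_aligned(user_ocean, preferences, top_k=2):
    """Rank preference statements by alignment with the user's Big-Five profile."""
    ranked = sorted(preferences, key=lambda p: cosine(user_ocean, p["traits"]),
                    reverse=True)
    return [p["text"] for p in ranked[:top_k]]

# Hypothetical trait directions: (O, C, E, A, N), each in [-1, 1].
prefs = [
    {"text": "I like spontaneous solo trips", "traits": (0.9, -0.3, -0.5, 0.0, 0.0)},
    {"text": "I prefer detailed itineraries", "traits": (-0.2, 0.9, 0.1, 0.0, 0.0)},
    {"text": "I enjoy big group tours", "traits": (0.1, 0.0, 0.9, 0.3, 0.0)},
]
user = (0.8, -0.2, -0.4, 0.1, 0.0)   # open, low-conscientiousness, introverted
print(retrieve_aligned(user, prefs, top_k=1))
```

Only the aligned preferences are then injected into the generation prompt, which is what lifts answer-choice accuracy relative to random preference selection.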
[84] Steer2Edit: From Activation Steering to Component-Level Editing
Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng
Main category: cs.CL
TL;DR: Steer2Edit transforms steering vectors from inference-time control into diagnostic signals for component-level weight editing, selectively redistributing behavioral influence across attention heads and MLP neurons to achieve better attribute-utility trade-offs.
Details
Motivation: Current steering methods use inference-time activation interventions that apply fixed, global modifications to LLM internal states, often causing unfavorable attribute-utility trade-offs by ignoring that behaviors are governed by small, heterogeneous subsets of model components.
Method: Steer2Edit is a training-free framework that converts steering vectors into diagnostic signals for rank-1 weight editing at the component level, selectively redistributing behavioral influence across individual attention heads and MLP neurons while preserving the standard forward pass.
Result: Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit achieves more favorable attribute-utility trade-offs: improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average at matched downstream performance.
Conclusion: Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates that preserve parallel inference compatibility.
Abstract: Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model’s internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates. Our code is available at https://github.com/Trustworthy-ML-Lab/Steer2Edit
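The component-selective rank-1 edit can be illustrated on a single weight matrix: score each row (standing in for an attention head or MLP neuron) by its alignment with the steering direction, and apply a rank-1 update only to the top-scoring rows. A numpy sketch under assumed scoring and update rules, not the paper's exact formulation:

```python
import numpy as np

def component_scores(weights, direction):
    """Score each component (row of W) by its alignment with the steering direction."""
    return weights @ direction

def rank1_edit(weights, direction, alpha, top_frac=0.25):
    """Rank-1 update applied only to the most behavior-aligned rows; the rest
    of the matrix, and hence the standard forward pass, is untouched."""
    scores = np.abs(component_scores(weights, direction))
    k = max(1, int(top_frac * weights.shape[0]))
    top = np.argsort(scores)[-k:]
    edited = weights.copy()
    # W_i <- W_i + alpha * (W_i . v) v   for the selected rows only
    edited[top] += alpha * np.outer(weights[top] @ direction, direction)
    return edited, top

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))
v = np.array([1.0, 0.0, 0.0, 0.0])   # unit steering direction
W2, edited_rows = rank1_edit(W, v, alpha=0.5)
untouched = np.setdiff1d(np.arange(8), edited_rows)
print(np.allclose(W2[untouched], W[untouched]))  # non-selected rows unchanged
```

Because the behavior lands in the weights once, no activation is injected at generation time, which is why the edit stays compatible with optimized parallel inference.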
[85] Rethinking the Role of LLMs in Time Series Forecasting
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang, Xiaoyu Shen
Main category: cs.CL
TL;DR: Large-scale study shows LLMs genuinely improve time series forecasting, especially for cross-domain generalization, overturning previous negative assessments.
Details
Motivation: Previous studies questioned whether LLMs provide genuine benefits for time series forecasting, often reporting comparable performance without LLMs. The authors argue these conclusions stem from limited evaluation settings and aim to conduct a comprehensive large-scale study.
Method: Conducted a large-scale study of LLM-based time series forecasting across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Analyzed pre-alignment vs post-alignment, contributions of pretrained knowledge vs model architecture, and used token-level routing analysis.
Result: LLMs indeed improve forecasting performance with especially large gains in cross-domain generalization. Pre-alignment outperforms post-alignment in over 90% of tasks. Both pretrained knowledge and model architecture contribute: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics.
Conclusion: Findings overturn prior negative assessments about LLMs for time series forecasting, establish clear conditions under which LLMs are useful, and provide practical guidance for effective model design.
Abstract: Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that \emph{LLM4TSF indeed improves forecasting performance}, with especially large gains in cross-domain generalization. Pre-alignment outperforms post-alignment in over 90% of tasks. Both the pretrained knowledge and the model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, our findings overturn prior negative assessments, establish clear conditions under which LLMs are useful, and provide practical guidance for effective model design. We release our code at https://github.com/EIT-NLP/LLM4TSF.
[86] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin, Hengyu Chang, Ancheng Xu, Zhihao Yang, Hamid Alinejad-Rokny, Qiang Qu, Bo Zheng, Min Yang
Main category: cs.CL
TL;DR: RuCL is a curriculum learning framework for multimodal LLMs that uses stratified rubrics to guide learning from basic perception to advanced reasoning, preventing reward hacking while improving visual reasoning performance.
Details
Motivation: Current RLVR approaches risk reward hacking where models learn spurious patterns to satisfy final answer checks. Rubric-based methods offer fine-grained supervision but suffer from high computational costs and inefficient training by treating all rubrics as equally learnable.
Method: Proposes Stratified Rubric-based Curriculum Learning (RuCL) that shifts curriculum learning focus from data selection to reward design. Generates generalized rubrics with broad applicability, stratifies them based on model competence, and dynamically adjusts rubric weights during training to guide learning from foundational perception to advanced logical reasoning.
Result: Extensive experiments on various visual reasoning benchmarks show RuCL yields +7.83% average improvement over Qwen2.5-VL-7B model, achieving state-of-the-art accuracy of 60.06%.
Conclusion: RuCL effectively addresses reward hacking in RLVR by providing structured curriculum learning through stratified rubrics, enabling models to progressively master visual reasoning from perception to complex logic.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model’s competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
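The "curriculum over the reward" idea can be sketched as competence-dependent rubric weighting: rubrics near the model's current competence get the most reward weight, while trivially easy or far-too-hard rubrics are down-weighted. The Gaussian-proximity scheme below is a hypothetical stand-in for the paper's stratification, shown only to make the dynamic reweighting concrete:

```python
import math

def rubric_weights(rubric_difficulties, competence, sharpness=5.0):
    """Normalized reward weights that peak at rubrics near current competence."""
    raw = [math.exp(-sharpness * (d - competence) ** 2) for d in rubric_difficulties]
    z = sum(raw)
    return [w / z for w in raw]

# Rubric difficulty estimates in [0, 1]: basic perception -> multi-step logic.
difficulties = [0.1, 0.4, 0.6, 0.9]
early = rubric_weights(difficulties, competence=0.2)   # early in training
late = rubric_weights(difficulties, competence=0.8)    # later in training
print(early.index(max(early)), late.index(max(late)))  # 0 3: focus shifts easy -> hard
```

As estimated competence rises during training, the weight mass migrates from perception rubrics toward advanced reasoning rubrics, which is the curriculum effect RuCL relies on.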
[87] Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang, Kaiyuan Liu, Bo Yang, Yang Xiang, Ming Liu, Bing Qin
Main category: cs.CL
TL;DR: Speech-guided Machine Translation framework using MLLMs with speech-text fusion and self-evolution mechanism achieves SOTA on multimodal and general translation tasks.
Details
Motivation: Existing multimodal translation research focuses on image-guided methods limited by scarce multilingual image-text pairs. Speech modality offers advantages due to natural alignment with text and abundant existing speech datasets, enabling scalable language coverage.
Method: Proposes SMT framework integrating speech and text as fused inputs into MLLM. Uses text-to-speech model to generate synthetic speech, and MLLM with self-evolution mechanism that classifies synthetic speech samples and iteratively optimizes using positive samples.
Result: Achieves new SOTA on Multi30K multimodal machine translation benchmark. On FLORES-200, achieves average SOTA performance in 108 translation directions. Ablation studies on CoVoST-2 show synthetic vs authentic speech differences have negligible impact on translation quality.
Conclusion: Speech-guided approach effectively overcomes limitations of image-based methods, leveraging abundant speech data and natural speech-text alignment to improve translation quality across multiple languages and tasks.
Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly the FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM-SRT.
[88] AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang
Main category: cs.CL
TL;DR: AuditBench is a benchmark for testing alignment auditing methods using 56 language models with 14 different implanted hidden behaviors that models don’t confess to when asked directly.
Details
Motivation: To create a standardized benchmark for evaluating alignment auditing techniques, addressing the need for quantitative, iterative science in detecting hidden model behaviors that could pose safety risks.
Method: Created 56 language models with 14 concerning implanted behaviors using various training techniques. Developed an investigator agent with configurable auditing tools to test detection methods, evaluating tool efficacy through agent performance.
Result: Found a tool-to-agent gap where standalone tools don’t translate to agent performance. Most effective tools involved scaffolded calls to auxiliary models generating diverse prompts. Models trained on synthetic documents were easier to audit than those trained on demonstrations, with adversarial training increasing difficulty.
Conclusion: AuditBench enables systematic evaluation of alignment auditing methods, revealing important patterns about tool effectiveness and training technique impacts on auditability. The benchmark supports future quantitative research in alignment auditing.
Abstract: We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors–such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties–which it does not confess to when directly asked. AuditBench models are highly diverse–some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench’s utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.
[89] Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research
Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang, Liang Zhang, Siyuan Zhao, Hehe Fan
Main category: cs.CL
TL;DR: Super Research is a benchmark for evaluating LLMs on highly complex research tasks requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources.
Details
Motivation: While LLMs have shown proficiency in simpler research tasks, their capacity to solve highly complex questions requiring extensive planning, massive evidence gathering, and synthesis across diverse sources remains unexplored. The authors aim to create a stress test for LLM capabilities in autonomous research.
Method: The authors introduce Super Research, a task framework with three components: (1) structured decomposition into research plans, (2) super wide retrieval for diverse perspectives, and (3) super deep investigation through iterative queries to resolve uncertainties. They curate a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages. They also develop a graph-anchored auditing protocol with five evaluation dimensions.
Result: The benchmark produces verifiable reports with fine-grained citations and intermediate artifacts (outlines, tables) for traceable reasoning. The evaluation protocol assesses systems along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health. A leaderboard is established for tracking performance.
Conclusion: Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities, where proficiency in this task acts as a powerful proxy for general research competence. Success suggests the robustness needed to navigate nearly any subordinate research task.
Abstract: While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions–those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources–remains largely unexplored. We introduce Super Research, a task framework for complex autonomous research that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. Leaderboard is available at: https://cnsdqd-dyb.github.io/Super-Research-Benchmark/
[90] Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
Anastasia Zhukova, Terry Ruas, Jan Philip Wahle, Bela Gipp
Main category: cs.CL
TL;DR: uCDCR is a unified dataset that consolidates diverse English CDCR corpora into consistent format with standardized metrics, enabling fair cross-dataset analysis and showing ECB+ has low lexical diversity.
Details
Motivation: Research in Cross-Document Coreference Resolution (CDCR) is fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of event coreference resolution over entity coreference resolution.
Method: Created uCDCR by consolidating diverse publicly available English CDCR corpora into a consistent format, correcting inconsistencies, enriching datasets with missing attributes, and establishing standardized evaluation protocols.
Result: Analysis shows ECB+ has one of the lowest lexical diversities among uCDCR datasets, and using all uCDCR datasets improves model generalizability. Same-head-lemma baseline performance is similar for both events and entities.
Conclusion: uCDCR provides a cohesive framework for reproducible CDCR research, showing that both entity and event coreference resolution are complex tasks that should not be limited to ECR alone.
Abstract: Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
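The same-head-lemma baseline used throughout the paper's analysis is simple enough to state in full: two mentions corefer iff their syntactic heads share a lemma. A stdlib sketch with hypothetical mention tuples (real pipelines would obtain head lemmas from a parser):

```python
from collections import defaultdict

def same_head_lemma_baseline(mentions):
    """Cluster mentions into coreference chains by their head lemma."""
    chains = defaultdict(list)
    for mention_id, head_lemma in mentions:
        chains[head_lemma].append(mention_id)
    return [sorted(ids) for ids in chains.values()]

mentions = [
    ("m1", "attack"), ("m2", "attack"),   # "the attack" / "Tuesday's attack"
    ("m3", "explosion"),
    ("m4", "bomb"), ("m5", "bomb"),
]
print(sorted(same_head_lemma_baseline(mentions)))
# [['m1', 'm2'], ['m3'], ['m4', 'm5']]
```

Datasets on which this trivial baseline scores highly (as with ECB+) have low lexical diversity among coreferent mentions, which is exactly the complexity measure the authors use to compare the uCDCR corpora.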
[91] QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
Yixuan Tang, Zhenghong Lin, Yandong Sun, Wynne Hsu, Mong Li Lee, Anthony K. H. Tung
Main category: cs.CL
TL;DR: QIME: Ontology-grounded framework for interpretable medical text embeddings where each dimension corresponds to clinically meaningful yes/no questions, outperforming prior interpretable methods and narrowing gap to black-box encoders.
Details
Motivation: Current biomedical embeddings are black-box systems that limit clinical utility, while existing interpretable embeddings use heuristic approaches and lack specialized domain knowledge. There's a need for clinically meaningful, interpretable medical text representations.
Method: QIME uses ontology-grounded framework to construct interpretable embeddings where each dimension corresponds to clinically meaningful yes/no questions. It conditions on cluster-specific medical concept signatures to generate semantically atomic questions, and supports training-free embedding construction that eliminates per-question classifier training.
Result: QIME consistently outperforms prior interpretable embedding methods across biomedical semantic similarity, clustering, and retrieval benchmarks, and substantially narrows the gap to strong black-box biomedical encoders while providing concise, clinically informative explanations.
Conclusion: QIME provides an effective framework for interpretable medical text embeddings that balances performance with clinical interpretability, making it suitable for clinical decision-making applications where transparency is crucial.
Abstract: While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
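The question-based embedding idea can be made concrete independent of QIME's ontology grounding: each dimension answers one yes/no question, so the vector is directly readable. In this sketch a keyword check stands in for the per-question answerer a real system would use; the questions and keywords are invented for illustration:

```python
QUESTIONS = [
    ("Does the text mention a cardiac condition?", "cardiac"),
    ("Does the text mention a medication?", "aspirin"),
    ("Does the text mention imaging?", "mri"),
]

def question_embed(text, questions=QUESTIONS):
    """Interpretable embedding: dimension i answers question i (1 = yes).
    Keyword matching is a toy stand-in for a trained answerer."""
    t = text.lower()
    return [1 if kw in t else 0 for _, kw in questions]

def dim_sim(a, b):
    """Fraction of questions answered identically (a simple similarity)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

e1 = question_embed("Cardiac MRI ordered for the patient.")
e2 = question_embed("Patient on aspirin after a cardiac event.")
print(e1, e2)  # [1, 0, 1] [1, 1, 0]
```

The interpretability claim is visible in the output: every agreement or mismatch between two texts is attributable to a named clinical question.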
[92] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao
Main category: cs.CL
TL;DR: ClinConsensus is a comprehensive Chinese medical benchmark with 2500 open-ended cases across 36 specialties, featuring expert validation, rubric-based grading, and a dual-judge evaluation framework for assessing LLMs in clinical workflows.
Details
Motivation: Existing medical benchmarks are static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows, necessitating a more comprehensive evaluation framework.
Method: Created ClinConsensus benchmark with 2500 expert-curated cases across care continuum, introduced rubric-based grading with Clinically Applicable Consistency Score (CACS@k), and developed dual-judge evaluation combining high-capability LLM-as-judge with distilled local judge model.
Result: Comprehensive assessment revealed substantial heterogeneity in LLM performance across task themes, care stages, and specialties; top models show comparable overall scores but differ in reasoning, evidence use, and longitudinal follow-up capabilities.
Conclusion: ClinConsensus provides an extensible benchmark for developing robust, clinically grounded medical LLMs, with clinically actionable treatment planning identified as a key bottleneck for real-world deployment.
Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care–from prevention and intervention to long-term follow-up–covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
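The entry names CACS@k but does not define it here. One plausible reading, stated purely as an assumption, is that an item counts as clinically applicable only if it passes the rubric threshold consistently across k repeated judge runs:

```python
def cacs_at_k(judge_runs, threshold=0.8):
    """Hypothetical reading of CACS@k (the digest gives no definition):
    an item counts only if ALL k judge runs score it at or above the
    rubric threshold, rewarding consistency across repeated grading.
    `judge_runs` is a list of k score lists, aligned by item."""
    n_items = len(judge_runs[0])
    consistent = sum(
        1 for i in range(n_items)
        if all(run[i] >= threshold for run in judge_runs)
    )
    return consistent / n_items

runs = [
    [0.90, 0.60, 0.85],  # judge run 1
    [0.95, 0.70, 0.80],  # judge run 2
]
print(cacs_at_k(runs))  # items 1 and 3 pass in both runs -> 2/3
```

Whatever the exact formula, the motivation is the same: a single LLM-judge pass is noisy, so consistency across runs (and across the dual judges) is part of the score.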
[93] Recursive Think-Answer Process for LLMs and VLMs
Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
Main category: cs.CL
TL;DR: R-TAP: Recursive Think-Answer Process that enables models to engage in iterative reasoning cycles with confidence evaluation, outperforming single-pass methods for LLMs and VLMs.
Details
Motivation: Current Think-Answer reasoners like DeepSeek-R1 still produce errors in single-pass inference despite having self-reflective cues. There's a need for more robust iterative reasoning approaches.
Method: Proposes Recursive Think-Answer Process (R-TAP) with confidence generator to evaluate response certainty and guide improvements. Uses two complementary rewards: Recursively Confidence Increase Reward and Final Answer Confidence Reward.
Result: R-TAP-enhanced models consistently outperform conventional single-pass methods for both LLMs and VLMs. Models show significantly fewer self-reflective patterns (“Oops”-like expressions), resulting in more stable and faster inference-time reasoning.
Conclusion: R-TAP provides an efficient method to refine reasoning processes, enabling iterative improvement and more accurate outputs with reduced self-doubt expressions.
Abstract: Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like “Oops!”, they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of “Oops”-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods for refining the reasoning processes of future AI.
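The recursion R-TAP describes, a think-answer model plus a confidence generator deciding whether to iterate, can be sketched as a plain control loop. The callables below are toy stand-ins, not the trained models or the paper's reward machinery:

```python
def recursive_think_answer(question, think_answer, confidence,
                           max_rounds=4, target=0.9):
    """Sketch of the R-TAP inference loop (not the authors' code):
    re-invoke the think-answer model, conditioning on the previous
    attempt, until the confidence generator is satisfied or the
    round budget runs out."""
    answer, history = None, []
    for _ in range(max_rounds):
        answer = think_answer(question, previous=answer)
        score = confidence(question, answer)
        history.append((answer, score))
        if score >= target:
            break
    return answer, history

# Toy stand-ins: each round "refines" the answer and confidence rises.
attempts = iter([("draft", 0.5), ("revised", 0.7), ("final", 0.95)])
scores = {}

def toy_model(question, previous=None):
    answer, score = next(attempts)
    scores[answer] = score
    return answer

def toy_confidence(question, answer):
    return scores[answer]

ans, hist = recursive_think_answer("2 + 2?", toy_model, toy_confidence)
print(ans, len(hist))  # final 3
```

The two rewards described above would shape the confidence signal during RL training; the loop only sketches the inference-time recursion and its adaptive stopping.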
[94] LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2510.04573.
[95] Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2510.08646.
[96] Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
Daniel Gomm, Cornelius Wolff, Madelon Hulsebos
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2511.04584.
[97] TransactionGPT
Yingtong Dou, Zhimeng Jiang, Tianyi Zhang, Mingzhi Hu, Zhichao Xu, Shubham Jain, Uday Singh Saini, Xiran Fan, Jiarui Sun, Menghai Pan, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yujie Fan, Yan Zheng, Vineeth Rakesh, Huiyuan Chen, Guanchu Wang, Mangesh Bendre, Zhongfang Zhuang, Xiaoting Li, Prince Aboagye, Vivian Lai, Minghua Xu, Hao Yang, Yiwei Cai, Mahashweta Das, Yuzhong Chen
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2511.08939.
[98] Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2602.19517.
[99] CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2602.22911.
[100] DeepXiv-SDK: An Agentic Data Interface for Scientific Literature
Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Junwei Lan, Sen Wang, Zhengyang Liang, Yingxia Shao, Defu Lian, Zheng Liu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) when fetching 2603.00084.
cs.CV
[101] CamDirector: Towards Long-Term Coherent Video Trajectory Editing
Zhihao Shi, Kejia Yin, Weilin Wan, Yuhongze Zhou, Yuanhao Yu, Xinxin Zuo, Qiang Sun, Juwei Lu
Main category: cs.CV
TL;DR: A novel video trajectory editing framework that enables precise camera control and long-range consistency through hybrid warping and history-guided autoregressive diffusion.
Details
Motivation: Existing video trajectory editing methods struggle with precise camera control and long-range consistency due to limited embedding capacity or single-frame warping with implicit cross-frame aggregation.
Method: 1) Hybrid warping scheme that aggregates information across entire source video: static regions fused into world cache and rendered to target poses, dynamic regions directly warped. 2) History-guided autoregressive diffusion model processes video segments jointly with history, with incremental world cache updates for temporal coherence.
Result: Achieves state-of-the-art performance on new iPhone-PTZ benchmark with diverse camera motions and large trajectory variations, using fewer parameters.
Conclusion: The framework enables professional-quality video trajectory editing with precise camera control and long-term consistency, advancing video generation capabilities.
Abstract: Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.
[102] Social-JEPA: Emergent Geometric Isomorphism
Haoran Zhang, Youjin Wang, Yi Duan, Rong Fu, Dianyu Zhao, Sicheng Fan, Shuaishuai Cao, Wentao Guo, Xiao Zhou
Main category: cs.CV
TL;DR: Separate agents trained on different viewpoints develop latent spaces with approximate linear isometry, enabling translation between viewpoints without coordination.
Details
Motivation: To understand how predictive learning objectives shape representation geometry in decentralized vision systems and whether separate agents can develop interoperable representations without coordination.
Method: Train separate agents on distinct viewpoints of the same environment without parameter sharing or coordination, using world models that compress sensory streams into latent codes. Analyze geometric relationships between learned latent spaces.
Result: Latent spaces from separate agents exhibit approximate linear isometry, enabling transparent translation between viewpoints. This alignment allows classifier transfer without retraining and accelerates learning through distillation-like migration.
Conclusion: Predictive learning imposes strong regularities on representation geometry, suggesting lightweight interoperability paths for decentralized vision systems.
Abstract: World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social-JEPA-5C57.
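The paper's central finding, that two independently learned latent spaces are related by an approximate linear isometry, can be illustrated in two dimensions, where the least-squares rotation has a closed form. Real latents are high-dimensional and would use an SVD-based orthogonal Procrustes solve instead; the data here are synthetic:

```python
import math

def fit_rotation_2d(src, dst):
    """Closed-form least-squares 2-D rotation aligning src to dst:
    theta = atan2(sum of cross products, sum of dot products).
    Stands in for the linear-isometry alignment between two agents'
    latent spaces."""
    dots = sum(x1 * x2 + y1 * y2 for (x1, y1), (x2, y2) in zip(src, dst))
    crosses = sum(x1 * y2 - y1 * x2 for (x1, y1), (x2, y2) in zip(src, dst))
    return math.atan2(crosses, dots)

def rotate(points, theta):
    ct, st = math.cos(theta), math.sin(theta)
    return [(ct * x - st * y, st * x + ct * y) for x, y in points]

agent_a = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]  # latents, viewpoint A
agent_b = rotate(agent_a, 0.7)                  # same scenes, viewpoint B

theta = fit_rotation_2d(agent_a, agent_b)       # recover the isometry
mapped = rotate(agent_a, theta)                 # translate A-latents into B-space
err = max(abs(mx - bx) + abs(my - by)
          for (mx, my), (bx, by) in zip(mapped, agent_b))
print(round(theta, 3), err < 1e-9)  # 0.7 True
```

Once such a map is fitted, a classifier trained in one agent's latent space can be applied to the other agent's latents after mapping, which is the zero-gradient-step transfer the abstract reports.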
[103] From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
Vasiliy Kudryavtsev, Kirill Borodin, German Berezin, Kirill Bubenchikov, Grach Mkrtchian, Alexander Ryzhkov
Main category: cs.CV
TL;DR: Multimodal pet re-identification framework combining visual features with synthetic textual descriptions achieves 84.28% Top-1 accuracy, outperforming unimodal baselines by 11%.
Details
Motivation: Current automated animal identification systems struggle due to limited dataset scale and reliance on unimodal visual cues alone, making pet reunification challenging.
Method: Multimodal verification framework that enhances visual features with semantic identity priors from synthetic textual descriptions, using SigLIP2-Giant (vision) and E5-Small-v2 (text) backbones with gated fusion mechanism.
Result: Achieved 84.28% Top-1 accuracy and 0.0422 Equal Error Rate on comprehensive test protocol, representing 11% improvement over leading unimodal baselines.
Conclusion: Integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification, demonstrating the value of multimodal approaches over unimodal visual systems.
Abstract: Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
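The gated fusion the method names can be sketched with a scalar gate over the concatenated vision and text features. The weights, the shared-scalar gate, and the tiny dimensionality are all illustrative assumptions; the paper's trained gate may well be per-dimension:

```python
import math

def gated_fusion(v, t, w_gate, b_gate):
    """Gated fusion of a vision embedding v and a text embedding t:
    g = sigmoid(w . [v; t] + b), fused = g*v + (1-g)*t.
    A scalar gate shared across dimensions; weights are illustrative."""
    z = sum(wi * xi for wi, xi in zip(w_gate, v + t)) + b_gate
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * vi + (1.0 - g) * ti for vi, ti in zip(v, t)], g

v = [0.2, 0.8]            # vision features (e.g. from SigLIP2-Giant)
t = [0.5, 0.1]            # text features (e.g. from E5-Small-v2)
w = [1.0, -1.0, 0.5, 0.5]
fused, gate = gated_fusion(v, t, w, b_gate=0.0)
print(round(gate, 3), [round(x, 3) for x in fused])
```

Because the gate is input-dependent, the model can lean on visual evidence for distinctive-looking animals and on the textual identity prior when appearance alone is ambiguous, which is the advantage over plain concatenation.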
[104] Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection
Yaoteng Zhang, Zhou Qing, Junyu Gao, Qi Wang
Main category: cs.CV
TL;DR: PDP is a prompt-decoupled framework for incremental object detection that uses dual-pool prompt decoupling and prototypical pseudo-label generation to mitigate prompt degradation issues in continual learning.
Details
Motivation: Existing prompt-based methods for incremental object detection suffer from prompt degradation due to prompt coupling (interference between prompts) and prompt drift (inconsistent supervision where old objects become background in new tasks).
Method: Proposes PDP with: 1) Dual-pool prompt decoupling - shared pool for task-general knowledge and private pool for task-specific features; 2) Prototypical Pseudo-Label Generation (PPG) module that dynamically updates class prototypes and filters pseudo-labels to maintain supervisory consistency.
Result: Achieves state-of-the-art performance with 9.2% AP improvement on MS-COCO and 3.3% AP improvement on PASCAL VOC benchmarks, demonstrating effective balance between stability and plasticity.
Conclusion: PDP effectively addresses prompt degradation in incremental object detection through prompt decoupling and consistent supervision, showing strong potential for continual learning applications.
Abstract: Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP_IOD/tree/main
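The PPG idea, prototypes that are dynamically updated and then used to filter pseudo-labels for old classes, can be sketched as an EMA prototype update plus a cosine-similarity gate. The momentum form, the threshold, and the toy 2-D features are assumptions, not the paper's configuration:

```python
def update_prototype(proto, feat, momentum=0.9):
    """Exponential-moving-average update of a class prototype (the
    momentum form is an assumed stand-in for PPG's dynamic update)."""
    return [momentum * p + (1 - momentum) * f for p, f in zip(proto, feat)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den

def filter_pseudo_labels(detections, prototypes, tau=0.8):
    """Keep a pseudo-label for an old class only if its feature stays
    close to that class's stored prototype, so mislabeled boxes do not
    pollute the supervisory signal of later tasks."""
    return [(cls, feat) for cls, feat in detections
            if cosine(feat, prototypes[cls]) >= tau]

prototypes = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
detections = [("dog", [0.9, 0.1]),   # consistent with the dog prototype
              ("dog", [0.1, 0.9])]   # looks like a car: filtered out
kept = filter_pseudo_labels(detections, prototypes)
print([cls for cls, _ in kept])  # ['dog']
```

Filtering against prototypes is what keeps old foreground objects from silently becoming "background" supervision in new tasks, the drift the entry describes.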
[105] From “What” to “How”: Constrained Reasoning for Autoregressive Image Generation
Ruxue Yan, Xubo Liu, Wenya Guo, Zhengkun Zhang, Ying Zhang, Xiaojie Yuan
Main category: cs.CV
TL;DR: CoR-Painter introduces a “How-to-What” paradigm for autoregressive image generation using constrained reasoning to derive visual constraints from prompts, addressing spatial ambiguity and object overlap issues through structured guidance.
Details
Motivation: Current autoregressive image generation methods only specify "What" details to depict by rewriting prompts, but fail to reason about "How" to structure the overall image, leading to persistent issues like spatial ambiguity and unrealistic object overlaps.
Method: Proposes CoR-Painter framework with “How-to-What” paradigm: 1) Derives visual constraints from input prompt governing spatial relationships, key attributes, and compositional rules (“How to draw”), 2) Uses constraints to generate detailed description (“What to draw”), 3) Introduces Dual-Objective GRPO strategy to optimize textual constrained reasoning and visual projection processes.
Result: Achieves state-of-the-art performance on T2I-CompBench, GenEval, and WISE benchmarks with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
Conclusion: CoR-Painter effectively bridges the gap between textual reasoning and visual synthesis by introducing constrained reasoning, providing structurally sound guidance for accurate image generation and addressing fundamental limitations of current autoregressive methods.
Abstract: Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify “What” details to depict by rewriting the input prompt, yet fundamentally fail to reason about “How” to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a “How-to-What” paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces “How to draw” by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description “What to draw”, providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
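The “How-to-What” pipeline reduces to two stages: derive explicit constraints, then expand them into the detailed description that conditions generation. This toy version hard-codes a single spatial rule to expose the intermediate constraint structure; a real system would derive the constraints with the model itself:

```python
def derive_constraints(prompt):
    """Stage 1 ("How to draw"): turn the prompt into explicit visual
    constraints. One hard-coded rule stands in for learned reasoning."""
    constraints = {
        "spatial": [],
        "composition": ["no unrealistic overlap between listed objects"],
    }
    if " on " in prompt:
        obj, support = prompt.split(" on ", 1)
        constraints["spatial"].append(
            f"{obj.strip()} placed above {support.strip()}")
    return constraints

def expand_description(prompt, constraints):
    """Stage 2 ("What to draw"): a detailed description conditioned on
    the constraints, which would then steer autoregressive decoding."""
    rules = "; ".join(constraints["spatial"] + constraints["composition"])
    return f"{prompt}. Layout: {rules}."

c = derive_constraints("a cat on a red sofa")
print(expand_description("a cat on a red sofa", c))
```

The point of the intermediate structure is that spatial rules are stated before any pixel (or token) is committed, which is what plain prompt rewriting cannot guarantee.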
[106] AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning
Paul Friedrich, Florentin Bieder, Florian M. Thieringer, Philippe C. Cattin
Main category: cs.CV
TL;DR: AutoFFS: A data-driven framework using adversarial free-form deformations to generate counterfactual skull morphologies for facial feminization surgery planning.
Details
Motivation: Current facial feminization surgery (FFS) planning relies on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance for transgender and gender diverse patients.
Method: Uses adversarial free-form deformations to perform targeted adversarial attacks on pre-trained binary sex classifiers, transforming individual skull shapes toward target sex characteristics.
Result: Generated counterfactual skull morphologies provide quantitative foundation for preoperative planning, validated through classifier-based evaluation and human perceptual studies.
Conclusion: AutoFFS offers a data-driven, quantitative approach to FFS planning that advances care for transgender and gender diverse patients through objective anatomical guidance.
Abstract: Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation and a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.
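The core mechanism, a targeted adversarial attack that deforms an input until a pre-trained classifier's prediction moves toward the target class, can be miniaturized to gradient ascent on a toy logistic classifier over two shape descriptors. The paper attacks a 3D classifier ensemble through constrained free-form deformations; everything below is an illustrative stand-in:

```python
import math

def classifier_female_prob(shape):
    """Toy stand-in for the pre-trained sex classifier: a fixed logistic
    model over two shape descriptors (weights are invented)."""
    w, b = [-3.0, -2.0], 2.5
    z = sum(wi * si for wi, si in zip(w, shape)) + b
    return 1.0 / (1.0 + math.exp(-z))

def targeted_deformation(shape, steps=50, lr=0.05, eps=1e-4):
    """Gradient ascent on P(target class) w.r.t. the deformation
    parameters, using finite differences: a minimal analogue of the
    adversarial deformation (the real method backpropagates through a
    3-D classifier ensemble and keeps deformations anatomically small)."""
    shape = list(shape)
    for _ in range(steps):
        for i in range(len(shape)):
            bumped = list(shape)
            bumped[i] += eps
            grad = (classifier_female_prob(bumped)
                    - classifier_female_prob(shape)) / eps
            shape[i] += lr * grad
    return shape

skull = [1.2, 0.9]  # descriptors the toy classifier reads as "male"
p0 = classifier_female_prob(skull)
deformed = targeted_deformation(skull)
p1 = classifier_female_prob(deformed)
print(round(p0, 3), round(p1, 3))  # target-class probability increases
```

The "counterfactual morphology" is the deformed shape itself: the minimal change that moves the classifier's decision, which is what makes it readable as surgical guidance.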
[107] HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
Lei Yao, Yong Chen, Yuejiao Su, Yi Wang, Moyun Liu, Lap-Pui Chau
Main category: cs.CV
TL;DR: HAMMER uses multimodal LLMs to ground 3D object affordance from images by extracting interaction intentions and transferring them to 3D space for accurate localization.
Details
Motivation: Humans can identify 3D object affordance from observed interactions in images/videos and generalize this knowledge to novel objects. The paper aims to replicate this capability using MLLMs for interaction intention-driven 3D affordance grounding.
Method: HAMMER aggregates interaction intention from images into contact-aware embeddings, infers textual affordance labels, uses hierarchical cross-modal integration to refine 3D representations, and employs multi-granular geometry lifting to infuse spatial characteristics into intention embeddings for accurate 3D localization.
Result: Extensive experiments on public datasets and a newly constructed corrupted benchmark demonstrate HAMMER’s superiority and robustness compared to existing approaches.
Conclusion: HAMMER effectively leverages MLLMs for 3D affordance grounding by extracting and transferring interaction intentions from 2D to 3D space, showing strong performance and generalization capabilities.
Abstract: Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.
[108] MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, Nandita Vijaykumar
Main category: cs.CV
TL;DR: MERG3R is a training-free divide-and-conquer framework that enables geometric foundation models to scale beyond GPU memory limits by partitioning unordered images into overlapping subsets, reconstructing them independently, then merging results through global alignment.
Details
Motivation: Current neural visual geometry models (like VGGT and Pi3) are limited by GPU memory when scaling to large, unordered image collections due to their reliance on full attention mechanisms.
Method: 1) Reorder and partition unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. 2) Merge local reconstructions through efficient global alignment and confidence-weighted bundle adjustment to produce globally consistent 3D models.
Result: Across 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks datasets, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when datasets exceed memory capacity limits.
Conclusion: MERG3R provides a model-agnostic framework that enables geometric foundation models to operate beyond their native memory limits while improving reconstruction quality, making large-scale 3D reconstruction more practical.
Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
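The partitioning stage of the divide-and-conquer idea can be sketched in a few lines. This is an illustrative stand-alone sketch, not the authors' code; the function name and parameters are hypothetical. The key property is that consecutive subsets share views, giving anchors for the later global alignment and bundle adjustment.

```python
def partition_with_overlap(images, subset_size, overlap):
    """Split an ordered image list into overlapping subsets.

    Hypothetical MERG3R-style partitioning: each subset shares `overlap`
    images with its neighbor, so independently reconstructed geometry can
    later be aligned on the shared views.
    """
    assert 0 < overlap < subset_size
    step = subset_size - overlap
    subsets = []
    for start in range(0, len(images), step):
        subsets.append(images[start:start + subset_size])
        if start + subset_size >= len(images):
            break  # last subset already reaches the end of the collection
    return subsets

views = [f"img_{i:03d}" for i in range(10)]
parts = partition_with_overlap(views, subset_size=4, overlap=2)
# Consecutive subsets share two images, usable as alignment anchors.
```

Each subset then fits within GPU memory for the base model, and the overlap regions supply the correspondences that the confidence-weighted bundle adjustment reconciles into one consistent model.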
[109] Beyond Caption-Based Queries for Video Moment Retrieval
David Pujol-Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray
Main category: cs.CV
TL;DR: This paper investigates generalization challenges in Video Moment Retrieval (VMR) methods when trained on caption-based queries but evaluated on search queries, identifying language and multi-moment gaps, and proposes architectural modifications to address decoder-query collapse.
Details
Motivation: Existing VMR methods, particularly DETR architectures, degrade when trained on caption-based queries but evaluated on search queries. The authors aim to understand and address this generalization gap between training and evaluation scenarios.
Method: Introduces three benchmarks by modifying textual queries in three public VMR datasets (HD-EPIC, YouCook2, ActivityNet-Captions). Identifies language gap (linguistic under-specification) and multi-moment gap (single to multi-moment shift). Proposes architectural modifications to mitigate decoder-query collapse by increasing active decoder queries.
Result: The approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. Demonstrates effectiveness of architectural modifications in addressing generalization challenges.
Conclusion: The paper identifies critical generalization challenges in VMR systems and provides architectural solutions that significantly improve performance on search queries, particularly for multi-moment instances.
Abstract: In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets – i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures – an active decoder-query collapse – as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available in the project webpage: https://davidpujol.github.io/beyond-vmr/
[110] Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment
Yaxi Chen, Simin Ni, Jingjing Zhang, Shaheer U. Saeed, Yipei Wang, Aleksandra Ivanova, Rikin Hargunani, Chaozong Liu, Jie Huang, Yipeng Hu
Main category: cs.CV
TL;DR: A patient-specific radiomic feature selection framework that uses two-stage retrieval to select compact, complementary feature sets per subject, outperforming marginal top-k approaches while maintaining interpretability.
Details
Motivation: Radiomics offers transparency over end-to-end DL models but underperforms due to reliance on population-level predefined features. Current adaptive radiomics uses marginal top-k ranking which selects redundant features and overlooks complementary interactions.
Method: Two-stage retrieval: 1) randomly sample diverse candidate feature sets from large radiomic pool, 2) rank these sets with learned scoring function to select high-performing feature set per patient. System includes feature-set scorer and final classifier.
Result: Outperforms top-k approach with same k values, competitive with end-to-end DL models while maintaining transparency. Validated on ACL tear detection and KL grading for osteoarthritis tasks.
Conclusion: Proposed framework achieves diagnostic performance comparable to DL models while generating auditable feature sets that link clinical outcomes to specific anatomical regions and radiomic families, enabling clinical inspection.
Abstract: Classical radiomic features are designed to quantify image appearance and intensity patterns. Compared with end-to-end deep learning (DL) models trained for disease classification, radiomics pipelines with low-dimensional parametric classifiers offer enhanced transparency and interpretability, yet often underperform because of the reliance on population-level predefined feature sets. Recent work on adaptive radiomics uses DL to predict feature weights over a radiomic pool, then thresholds these weights to retain the top-k features from a large radiomic pool F (often ~10^3). However, such marginal ranking can over-admit redundant descriptors and overlook complementary feature interactions. We propose a patient-specific feature-set selection framework that predicts a single compact feature set per subject, targeting complementary and diverse evidence rather than marginal top-k features. To overcome the intractable combinatorial search space of F-choose-k feature sets, our method utilizes a two-stage retrieval strategy: randomly sample diverse candidate feature sets, then rank these sets with a learned scoring function to select a high-performing feature set for the specific patient. The system consists of a feature-set scorer and a classifier that performs the final diagnosis. We empirically show that the proposed two-stage retrieval approximates exhaustive selection over all k-feature subsets. Validated on tasks including ACL tear detection and KL grading for osteoarthritis, the experimental results demonstrate strong diagnostic performance, outperforming the top-k approach with the same k values and remaining competitive with end-to-end DL models while maintaining high transparency. The model generates auditable feature sets that link clinical outcomes to specific anatomical regions and radiomic families, allowing clinicians to inspect which anatomical structures and quantitative descriptors drive the prediction.
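The two-stage retrieval replaces the intractable F-choose-k search with sample-then-rank. A minimal sketch (illustrative only; the real scorer is a learned network, and the toy diversity score below is a stand-in invented for this example):

```python
import random

def retrieve_feature_set(pool, k, score_fn, num_candidates=100, seed=0):
    """Two-stage retrieval over an intractable C(|F|, k) search space.

    Stage 1: randomly sample diverse candidate k-feature sets from the pool.
    Stage 2: rank the candidates with a scoring function (a learned
    feature-set scorer in the paper) and return the best set per patient.
    """
    rng = random.Random(seed)
    candidates = [rng.sample(pool, k) for _ in range(num_candidates)]
    return max(candidates, key=score_fn)

# Toy stand-in scorer: prefer sets spread over different "families",
# crudely proxied here by the last character of the feature name.
pool = [f"feat_{i}" for i in range(1000)]
score = lambda fs: len({f[-1] for f in fs})
best = retrieve_feature_set(pool, k=8, score_fn=score)
```

Sampling keeps candidates diverse while the scorer, not a marginal per-feature ranking, judges each set as a whole, which is how complementary feature interactions enter the selection.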
[111] Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples
Phillip Howard, Xin Su, Kathleen C. Fraser
Main category: cs.CV
TL;DR: Cultural Counterfactuals dataset enables measurement of cultural biases in LVLMs by providing synthetic counterfactual images where the same person appears in different cultural contexts (religion, nationality, socioeconomic status).
Details
Motivation: Prior bias studies in LVLMs focused on demographic traits visible in person's appearance (race, gender), leaving cultural biases (religion, socioeconomic status) understudied due to lack of datasets with cultural context cues.
Method: Created high-quality synthetic dataset of ~60k counterfactual images using image-editing model to place people of different demographics into real cultural context images, enabling same person to appear in multiple cultural contexts.
Result: Dataset enables precise measurement of cultural bias impact on LVLM outputs by comparing responses across counterfactual image sets with identical people in different cultural contexts.
Conclusion: Cultural Counterfactuals addresses gap in cultural bias measurement for LVLMs and demonstrates utility for quantifying cultural biases related to religion, nationality, and socioeconomic status.
Abstract: Large Vision-Language Models (LVLMs) have grown increasingly powerful in recent years, but can also exhibit harmful biases. Prior studies investigating such biases have primarily focused on demographic traits related to the visual characteristics of a person depicted in an image, such as their race or gender. This has left biases related to cultural differences (e.g., religion, socioeconomic status), which cannot be readily discerned from an individual’s appearance alone, relatively understudied. A key challenge in measuring cultural biases is that determining which group an individual belongs to often depends upon cultural context cues in images, and datasets annotated with cultural context cues are lacking. To address this gap, we introduce Cultural Counterfactuals: a high-quality synthetic dataset containing nearly 60k counterfactual images for measuring cultural biases related to religion, nationality, and socioeconomic status. To ensure that cultural contexts are accurately depicted, we generate our dataset using an image-editing model to place people of different demographics into real cultural context images. This enables the construction of counterfactual image sets which depict the same person in multiple different contexts, allowing for precise measurement of the impact that cultural context differences have on LVLM outputs. We demonstrate the utility of Cultural Counterfactuals for quantifying cultural biases in popular LVLMs.
[112] Aligning Fetal Anatomy with Kinematic Tree Log-Euclidean PolyRigid Transforms
Yingcheng Liu, Athena Taymourtash, Yang Liu, Esra Abaci Turk, William M. Wells, Leo Joskowicz, P. Ellen Grant, Polina Golland
Main category: cs.CV
TL;DR: A differentiable volumetric body model using Kinematic Tree-based Log-Euclidean PolyRigid transform for medical imaging of articulated bodies, addressing limitations of surface-based models.
Details
Motivation: Existing surface-based models for articulated body analysis in medical imaging ignore internal volumetric structures and lack anatomical consistency guarantees, requiring a more robust volumetric approach.
Method: Introduces KTPolyRigid (Kinematic Tree-based Log-Euclidean PolyRigid) transform that resolves Lie algebra ambiguities for large articulated motions and ensures smooth, bijective volumetric mappings, building on SMPL formulation.
Result: Evaluated on 53 fetal MRI volumes, KTPolyRigid yields deformation fields with significantly fewer folding artifacts, enables robust groupwise image registration, and supports label-efficient template-based segmentation of fetal organs.
Conclusion: The framework provides a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging, overcoming limitations of surface-based approaches.
Abstract: Automated analysis of articulated bodies is crucial in medical imaging. Existing surface-based models often ignore internal volumetric structures and rely on deformation methods that lack anatomical consistency guarantees. To address this problem, we introduce a differentiable volumetric body model based on the Skinned Multi-Person Linear (SMPL) formulation, driven by a new Kinematic Tree-based Log-Euclidean PolyRigid (KTPolyRigid) transform. KTPolyRigid resolves Lie algebra ambiguities associated with large, non-local articulated motions, and encourages smooth, bijective volumetric mappings. Evaluated on 53 fetal MRI volumes, KTPolyRigid yields deformation fields with significantly fewer folding artifacts. Furthermore, our framework enables robust groupwise image registration and a label-efficient, template-based segmentation of fetal organs. It provides a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging.
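For context, KTPolyRigid builds on the classical Log-Euclidean polyrigid framework (Arsigny et al.), which fuses $K$ rigid transforms $T_i$ with spatial weights $w_i(x)$ in the matrix-log domain:

$$\varphi(x) \;=\; \exp\!\Big(\sum_{i=1}^{K} w_i(x)\,\log T_i\Big)\,\tilde{x}, \qquad \sum_{i=1}^{K} w_i(x) = 1,$$

where $\tilde{x}$ is $x$ in homogeneous coordinates. Blending in the log domain keeps the fused mapping smooth and invertible, but the matrix logarithm is multivalued for large rotations; that is the Lie algebra ambiguity the kinematic-tree formulation is designed to resolve.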
[113] Advancing Earth Observation Through Machine Learning: A TorchGeo Tutorial
Caleb Robinson, Nils Lehmann, Adam J. Stewart, Burak Ekim, Heng Fang, Isaac A. Corley, Mauricio Cordeiro
Main category: cs.CV
TL;DR: TorchGeo is a PyTorch library for geospatial ML with specialized datasets, samplers, transforms, and models, demonstrated through a water segmentation tutorial using Sentinel-2 imagery.
Details
Motivation: Earth observation ML pipelines differ from standard computer vision due to large georeferenced scenes, varied coordinate systems, and need for spatially aware sampling/splitting strategies.
Method: TorchGeo provides PyTorch-based domain library with specialized abstractions for geospatial data, demonstrated through tutorial showing core concepts and end-to-end water segmentation case study using Earth Surface Water dataset and Sentinel-2 imagery.
Result: Tutorial demonstrates training semantic segmentation model on multispectral water segmentation, applying it to Rio de Janeiro Sentinel-2 scene, and saving predictions as GeoTIFF for geospatial analysis.
Conclusion: TorchGeo makes geospatial ML accessible through specialized PyTorch tools, with practical tutorial showing complete workflow from data handling to prediction output.
Abstract: Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch-based domain library that provides datasets, samplers, transforms and pre-trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates (1) the core TorchGeo abstractions through code examples, and (2) an end-to-end case study on multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel-2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html and https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html.
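The "spatially aware splitting" the abstract mentions can be illustrated with a library-free sketch (plain Python, not the TorchGeo API): assign each sample to a coarse grid cell by its coordinates and split by cell, so that spatially adjacent, correlated patches never straddle the train/test boundary.

```python
def spatial_split(samples, cell_size, test_cells):
    """Split georeferenced samples by grid cell rather than at random.

    `samples` are (x, y, payload) tuples. Every sample whose cell index
    appears in `test_cells` goes to the test set, preventing leakage
    between spatially neighboring patches.
    """
    train, test = [], []
    for x, y, payload in samples:
        cell = (int(x // cell_size), int(y // cell_size))
        (test if cell in test_cells else train).append(payload)
    return train, test

# Two nearby patches land in the same cell; the distant one is held out.
samples = [(120.0, 40.0, "a"), (121.0, 40.5, "b"), (500.0, 40.0, "c")]
train, test = spatial_split(samples, cell_size=100.0, test_cells={(5, 0)})
```

TorchGeo's own samplers and splitters handle the same concern on real coordinate reference systems and bounding boxes; this sketch only shows why a random split is the wrong default for georeferenced data.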
[114] OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
Hymalai Bello, Lala Ray, Joanna Sorysz, Sungho Suh, Paul Lukowicz
Main category: cs.CV
TL;DR: OpenMarcie is the largest multimodal dataset for human action monitoring in manufacturing, featuring 37+ hours of data from wearables and cameras across two assembly tasks with 36 participants, benchmarked on three HAR tasks.
Details
Motivation: Smart factories need accurate worker activity recognition to optimize production and ensure safety. Existing datasets lack scale and multimodal diversity for comprehensive human action monitoring in manufacturing environments.
Method: Created OpenMarcie dataset with two experimental settings: 1) 12 participants doing bicycle assembly/disassembly without fixed protocol, 2) 25 volunteers doing 3D printer assembly with manufacturer instructions and collaborative assessment. Collected multimodal data from wearables and distributed cameras, resulting in 37+ hours of egocentric/exocentric data across 8 data types and 200+ channels.
Result: OpenMarcie is the largest multimodal manufacturing action dataset with comprehensive benchmarks across three tasks: activity classification, open vocabulary captioning, and cross-modal alignment, enabling robust human activity recognition research.
Conclusion: OpenMarcie provides a valuable resource for multimodal human activity recognition in manufacturing, supporting research in smart factory optimization and worker safety through comprehensive multimodal data collection and benchmarking.
Abstract: Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the largest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer’s instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other’s progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.
[115] From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness
My H. Dinh, Aditya Sant, Akshay Malhotra, Keya Patani, Shahab Hamidi-Rad
Main category: cs.CV
TL;DR: QuADD is a unified framework for dataset distillation that jointly optimizes both sample count reduction and data precision quantization under fixed bit budgets, achieving better accuracy per bit than existing methods.
Details
Motivation: Current dataset distillation methods focus mainly on reducing sample count but overlook data precision optimization, which also impacts storage and computational efficiency. There's a need for a unified approach that considers both dimensions under fixed bit budgets.
Method: QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. It supports both uniform and adaptive non-uniform quantization, with the latter learning quantization levels from data to better represent information-dense regions.
Result: Experiments on image classification and 3GPP beam management tasks show QuADD surpasses existing dataset distillation and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
Conclusion: QuADD provides a unified framework for optimizing both sample count and precision in dataset distillation, demonstrating superior efficiency and establishing a new benchmark for information-efficient dataset compression.
Abstract: Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
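The forward pass of the uniform case is simple to sketch (an illustration, not the authors' module; in training, the non-differentiable round() is typically bypassed with a straight-through estimator so gradients flow to both the synthetic samples and the range parameters):

```python
def uniform_quantize(values, num_bits, lo, hi):
    """Uniform quantizer over [lo, hi] with 2**num_bits levels.

    Sketch of the kind of quantization a QuADD-style distillation loop
    would apply to synthetic samples under a fixed bit budget.
    """
    levels = 2 ** num_bits - 1          # number of steps between lo and hi
    scale = (hi - lo) / levels
    out = []
    for v in values:
        v = min(max(v, lo), hi)         # clamp to the representable range
        q = round((v - lo) / scale)     # snap to the nearest level index
        out.append(lo + q * scale)
    return out

x = [0.03, 0.49, 0.97]
xq = uniform_quantize(x, num_bits=2, lo=0.0, hi=1.0)  # 4 levels: 0, 1/3, 2/3, 1
```

Under a fixed bit budget, the trade-off the paper studies is then between `num_bits` per value and the number of synthetic samples stored; the adaptive non-uniform variant replaces the evenly spaced levels with learned ones.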
[116] TruckDrive: Long-Range Autonomous Highway Driving Dataset
Filippo Ghilotti, Edoardo Palladin, Samuel Brucker, Adam Sigal, Mario Bijelic, Felix Heide
Main category: cs.CV
TL;DR: TruckDrive: A highway-scale multimodal driving dataset for long-range perception up to 1,000 meters, addressing the gap in existing urban-focused datasets for heavy truck autonomy.
Details
Motivation: Heavy trucks require long-range scene understanding (hundreds of meters) due to their long braking distances, but existing driving datasets are limited to urban scenes with perception ranges under 100 meters.
Method: Created TruckDrive dataset with specialized long-range sensors: 7 long-range FMCW LiDARs, 3 high-resolution short-range LiDARs, 11 8MP surround cameras with varying focal lengths, and 10 4D FMCW radars. Includes 475K samples with 165K densely annotated frames.
Result: State-of-the-art autonomous driving models fail at ranges beyond 150 meters, showing performance drops of 31-99% in 3D perception tasks, revealing a systematic long-range gap in current architectures.
Conclusion: There is a critical need for specialized datasets and architectures for long-range perception in highway autonomy, particularly for heavy vehicles with unique safety requirements.
Abstract: Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long-range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.
[117] DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting
Rui-Feng Wang, Daniel Petti, Yue Chen, Changying Li
Main category: cs.CV
TL;DR: Evaluates DINOv3 vision foundation model as frozen backbone for blueberry robotic harvesting tasks (segmentation, detection), finding segmentation benefits from patch-level representations while detection is constrained by scale variation and localization issues.
Details
Motivation: To understand the practical role and performance limits of vision foundation models (specifically DINOv3) in agricultural settings, particularly for blueberry robotic harvesting visual tasks, as their generalization in such specialized domains remains insufficiently understood.
Method: Evaluates DINOv3 as a frozen backbone for blueberry harvesting tasks including fruit/bruise segmentation and fruit/cluster detection using a unified protocol with lightweight decoders. Analyzes how different tasks benefit from the model’s representations.
Result: Segmentation benefits consistently from stable patch-level representations and scales with backbone size, while detection is constrained by target scale variation, patch discretization, and localization compatibility. Cluster detection fails due to limitations in modeling relational targets defined by spatial aggregation.
Conclusion: DINOv3 is best viewed as a semantic backbone rather than an end-to-end task model; its effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for agricultural robotic harvesting applications.
Abstract: Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.
[118] MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer’s Disease Prediction
Guanchen Wu, Zhe Huang, Yuzhang Xie, Runze Yan, Akul Chopra, Deqiang Qiu, Xiao Hu, Fei Wang, Carl Yang
Main category: cs.CV
TL;DR: MIRAGE addresses missing MRI data in Alzheimer’s diagnosis by distilling anatomical knowledge from EHR data using knowledge graphs and frozen 3D U-Net regularization, avoiding actual 3D scan synthesis.
Details
Motivation: Multimodal AD diagnosis combining MRI and EHR is bottlenecked by frequent MRI unavailability, and synthesizing actual 3D scans from EHR data is technically challenging and clinically risky.
Method: Uses Biomedical Knowledge Graph with Graph Attention Networks to map EHR variables into unified embeddings, then employs frozen pre-trained 3D U-Net decoder as regularization engine with cohort-aggregated skip feature compensation to distill “diagnostic-surrogate” representations without 3D reconstruction.
Result: Improves AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs, successfully bridging the missing-modality gap.
Conclusion: MIRAGE provides a practical solution for missing MRI data in clinical settings by distilling anatomical knowledge from EHR without risky 3D synthesis, enabling multimodal diagnosis where MRI is unavailable.
Abstract: Reliable Alzheimer’s disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled “diagnostic-surrogate” representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.
[119] ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
Aymen Lassoued, Mohamed Ali Souibgui, Yousri Kessentini
Main category: cs.CV
TL;DR: ORCA is a multi-agent framework for Document Visual Question Answering that uses specialized AI agents coordinated through reasoning decomposition, routing, and debate mechanisms to improve complex document understanding.
Details
Motivation: Current Vision-Language Models struggle with complex reasoning and multi-step workflows in Document Visual Question Answering, failing to decompose intricate questions and leverage specialized processing for different document elements.
Method: ORCA uses a reasoning agent to decompose queries into logical steps, then routes tasks to specialized agents from an agent dock. It employs debate mechanisms with stress-testing, thesis-antithesis adjudication, and sanity checking for answer reliability.
Result: Extensive experiments on three benchmarks demonstrate significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.
Conclusion: ORCA presents an effective multi-agent framework for complex DocVQA tasks through strategic agent coordination and iterative refinement, advancing collaborative agent systems in vision-language reasoning.
Abstract: Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.
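The decompose-route-adjudicate control flow can be sketched as a toy orchestration loop (illustrative only; every name below is hypothetical, and the real agents are VLM-backed rather than lambdas):

```python
def orca_pipeline(question, decompose, agent_dock, adjudicate):
    """Toy orchestration loop in the spirit of ORCA.

    A reasoning step decomposes the query into sub-tasks, a router
    dispatches each sub-task to a specialized agent from the dock, and an
    adjudication step reconciles the partial answers into a final one.
    """
    steps = decompose(question)                # reasoning agent
    partials = [agent_dock[kind](payload)      # routed specialist agents
                for kind, payload in steps]
    return adjudicate(partials)                # debate / adjudication

# Stand-in components for illustration only.
dock = {"table": lambda q: f"table:{q}", "text": lambda q: f"text:{q}"}
decompose = lambda q: [("table", q), ("text", q)]
answer = orca_pipeline("total revenue?", decompose, dock, " | ".join)
```

The framework's reliability machinery (stress-testing, thesis-antithesis adjudication, the sanity checker) would all live inside the `adjudicate` stage of such a loop.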
[120] Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning
Emadeldeen Hamdan, Ahmad Faiz Tharima, Mohd Zahirasri Mohd Tohir, Dayang Nur Sakinah Musa, Erdem Koyuncu, Adam J. Watts, Ahmet Enis Cetin
Main category: cs.CV
TL;DR: Transfer learning approach for peatland fire detection using pretrained wildfire models fine-tuned on limited peatland data, addressing unique visual characteristics of smoldering fires.
Details
Motivation: Peatland fires have distinct visual characteristics (smoldering combustion, low flame intensity, persistent smoke, subsurface burning) that make conventional wildfire detectors ineffective, requiring specialized detection methods.
Method: Transfer learning approach that initializes a deep learning-based peatland fire detector using pretrained weights from conventional wildfire detection models, then fine-tunes the network on a dataset of Malaysian peatland images and videos.
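The warm-start-then-fine-tune idea can be illustrated with a toy linear model trained by SGD; this is a stand-in for the paper's deep detector, and all data and hyperparameters below are synthetic:

```python
# Minimal transfer-learning sketch: "pretrain" on abundant source data,
# then fine-tune from those weights on scarce target data, versus training
# from scratch on the target data alone.
def sgd(w, b, data, lr=0.1, epochs=50):
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(w, b, data):
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

source = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(10)]  # y = 2x + 1
target = [(0.2, 1.5), (0.8, 2.7)]                              # y = 2x + 1.1

w_pre, b_pre = sgd(0.0, 0.0, source)                 # pretraining
w_ft, b_ft = sgd(w_pre, b_pre, target, epochs=10)    # fine-tune, few samples
w_sc, b_sc = sgd(0.0, 0.0, target, epochs=10)        # scratch baseline
```

With the same tiny budget of target data, the warm-started model ends up far closer to the target function than the scratch baseline, which is the effect the paper reports for peatland imagery.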
Result: Transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions like low-contrast smoke, partial occlusions, and variable illumination.
Conclusion: The transfer learning approach provides a practical and scalable solution for early peatland fire detection with potential for real-time monitoring systems for fire prevention and environmental protection.
Abstract: Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics – such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning – that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.
[121] Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild
Vitor Pereira Matias, Márcus Vinícius Lobo Costa, João Batista Neto, Tiago Novello de Brito
Main category: cs.CV
TL;DR: STW dataset with 42K images labeled on 10-tone MST scale enables skin tone fairness analysis; SkinToneNet (fine-tuned ViT) achieves SOTA generalization for auditing datasets like CelebA/VGGFace2
Details
Motivation: Existing skin tone fairness methods lack granular annotated datasets, use non-visual scales (Fitzpatrick), have small private datasets, suffer from train-test leakage and imbalance, and rely on limited classic CV pipelines.
Method: 1) Create STW dataset (42,313 images, 3,564 individuals, 10-tone MST scale labels); 2) Benchmark Classic CV vs DL approaches; 3) Propose SkinToneNet (fine-tuned Vision Transformer) for SOTA generalization
Result: Classic CV models give near-random results; deep learning reaches near-annotator accuracy; SkinToneNet achieves state-of-the-art generalization on out-of-domain data, enabling fairness auditing of public datasets
Conclusion: Provides comprehensive framework for skin tone fairness with large-scale open dataset, benchmarks, and SOTA model for reliable fairness assessment of public vision datasets
Abstract: Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, use small, private datasets that prevent reproducibility, or rely on classic computer vision pipelines, with a few using deep learning. They overlook issues like train-test leakage and dataset imbalance, and are limited by small or unavailable datasets. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce the STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning nearly reaches annotator-level accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data available soon.
[122] E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition
Mubarak Olaoluwa, Hassen Drira
Main category: cs.CV
TL;DR: E2E-GNet is an end-to-end geometric deep neural network for skeleton-based human motion recognition that introduces geometric transformation and distortion-aware optimization layers to better capture discriminative features in non-Euclidean space.
Details
Motivation: Geometric deep learning can capture meaningful representations of data in non-Euclidean spaces, which is particularly relevant for skeleton-based human motion recognition where motion sequences exist in such spaces. The authors aim to enhance discriminative power between different motions by better leveraging geometric properties.
Method: E2E-GNet introduces two key layers: 1) A geometric transformation layer that jointly optimizes skeleton motion sequences in non-Euclidean space and applies a differentiable logarithm map activation to project them onto a linear space, and 2) A distortion-aware optimization layer that limits skeleton shape distortions caused by the projection to retain discriminative geometric cues.
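The logarithm-map activation projects manifold-valued data into a linear tangent space; a minimal sketch on the unit sphere, used here only as a stand-in for the paper's skeleton-motion manifold, looks like:

```python
import math

# Logarithm map on the unit sphere S^2: maps a point q into the tangent
# (linear) space at base point p. The returned vector's length equals the
# geodesic distance between p and q, which is what makes the projection
# useful for downstream linear layers.
def log_map(p, q):
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    theta = math.acos(dot)              # geodesic distance on the sphere
    if theta < 1e-8:                    # q == p: zero tangent vector
        return [0.0] * len(p)
    scale = theta / math.sin(theta)
    return [scale * (b - dot * a) for a, b in zip(p, q)]

p = [1.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0]
v = log_map(p, q)   # tangent vector at p pointing toward q, length pi/2
```

The paper's distortion-aware layer then regularizes against the shape distortion this flattening introduces; that learned component is not reproduced here.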
Result: The method outperforms other approaches with lower computational cost across five datasets spanning three domains. Ablation studies demonstrate the impact of each proposed layer.
Conclusion: E2E-GNet effectively addresses skeleton-based human motion recognition by leveraging geometric deep learning principles, with the proposed geometric transformation and distortion-aware optimization layers significantly improving performance while maintaining computational efficiency.
Abstract: Geometric deep learning has recently gained significant attention in the computer vision community for its ability to capture meaningful representations of data lying in a non-Euclidean space. To this end, we propose E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. To enhance the discriminative power between different motions in the non-Euclidean space, E2E-GNet introduces a geometric transformation layer that jointly optimizes skeleton motion sequences on this space and applies a differentiable logarithm map activation to project them onto a linear space. Building on this, we further design a distortion-aware optimization layer that limits skeleton shape distortions caused by this projection, enabling the network to retain discriminative geometric cues and achieve a higher motion recognition rate. We demonstrate the impact of each layer through ablation studies and extensive experiments across five datasets spanning three domains show that E2E-GNet outperforms other methods with lower cost.
[123] ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop
Shuangzhi Li, Lei Ma, Xingyu Li
Main category: cs.CV
TL;DR: ModalPatch is a plug-and-play module for robust multi-modal 3D object detection that handles transient modality drops using temporal history and uncertainty-guided fusion.
Details
Motivation: Real-world autonomous driving faces reliability challenges due to transient data interruptions in multi-modal sensors (LiDAR and cameras) caused by hardware glitches, adverse weather, or occlusions, creating critical risks during simultaneous modality drops.
Method: ModalPatch leverages temporal sensor data for perceptual continuity using a history-based module to predict and compensate for unavailable features, combined with an uncertainty-guided cross-modality fusion strategy that dynamically estimates feature reliability to suppress biased signals while reinforcing informative ones.
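One common way to realize uncertainty-guided fusion is inverse-uncertainty weighting; the sketch below assumes the per-modality uncertainties are already estimated (in ModalPatch they are predicted by a learned module), and the feature values are synthetic:

```python
# Inverse-uncertainty (inverse-variance) fusion: each modality's feature
# contributes in proportion to 1/uncertainty, so a history-predicted,
# unreliable feature is down-weighted relative to a live sensor feature.
def fuse(features, uncertainties):
    inv = [1.0 / u for u in uncertainties]
    z = sum(inv)
    weights = [w / z for w in inv]          # normalized fusion weights
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(dim)]

lidar = [1.0, 0.0]    # live LiDAR feature (low uncertainty)
camera = [0.0, 1.0]   # history-compensated camera feature (high uncertainty)
fused = fuse([lidar, camera], [0.1, 0.9])
```

The fused vector leans heavily toward the reliable modality, which is the "suppress biased signals while reinforcing informative ones" behavior described above.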
Result: Extensive experiments show ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions without requiring architectural changes or retraining.
Conclusion: ModalPatch provides an effective plug-and-play solution for robust multi-modal 3D object detection in autonomous driving, addressing critical reliability issues during transient modality drops through temporal continuity and uncertainty-aware fusion.
Abstract: Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.
[124] WTHaar-Net: A Hybrid Quantum-Classical Approach
Vittorio Palladino, Tsai Idden, Ahmet Enis Cetin
Main category: cs.CV
TL;DR: WTHaar-Net replaces Hadamard Transform with Haar Wavelet Transform in hybrid quantum-classical CNNs, enabling spatially localized multi-resolution representations with quantum implementation using structured Hadamard gates.
Details
Motivation: To enhance hybrid quantum-classical deep learning by replacing the Hadamard Transform with Haar Wavelet Transform, which provides better spatially localized, multi-resolution representations aligned with vision task inductive biases, while maintaining quantum circuit compatibility.
Method: Introduces WTHaar-Net, a convolutional neural network that substitutes the Haar Wavelet Transform (HWT) for Hadamard Transform in hybrid architectures. Shows HWT can be realized quantumly using structured Hadamard gates, enabling decomposition into unitary operations suitable for quantum circuits.
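A single level of the 1D Haar Wavelet Transform is just scaled pairwise sums and differences, which is what makes a realization via structured Hadamard gates natural; a classical reference implementation (on a signal vector rather than the network's feature maps):

```python
import math

# One level of the orthonormal 1D Haar Wavelet Transform: pairwise
# averages (low-pass) and differences (high-pass), scaled by 1/sqrt(2),
# plus its exact inverse.
def haar_1d(x):
    s = 1.0 / math.sqrt(2.0)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

def ihaar_1d(low, high):
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out += [s * (l + h), s * (l - h)]
    return out

x = [4.0, 2.0, 5.0, 5.0]
low, high = haar_1d(x)   # equal pair (5.0, 5.0) yields a zero detail coefficient
```

Each (sum, difference) pair is a 2x2 Hadamard matrix applied to adjacent samples, so the whole level is a unitary built from Hadamard blocks, consistent with the quantum decomposition the paper describes.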
Result: Achieves substantial parameter reduction while maintaining competitive accuracy on CIFAR-10 and Tiny-ImageNet. Outperforms both ResNet and Hadamard-based baselines on Tiny-ImageNet. Validates quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
Conclusion: WTHaar-Net successfully integrates wavelet-based representations with quantum computing, offering parameter-efficient vision models that maintain accuracy while being compatible with current quantum hardware.
Abstract: Convolutional neural networks rely on linear filtering operations that can be reformulated efficiently in suitable transform domains. At the same time, advances in quantum computing have shown that certain structured linear transforms can be implemented with shallow quantum circuits, opening the door to hybrid quantum-classical approaches for enhancing deep learning models. In this work, we introduce WTHaar-Net, a convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). Unlike the Hadamard Transform, the Haar transform provides spatially localized, multi-resolution representations that align more closely with the inductive biases of vision tasks. We show that the HWT admits a quantum realization using structured Hadamard gates, enabling its decomposition into unitary operations suitable for quantum circuits. Experiments on CIFAR-10 and Tiny-ImageNet demonstrate that WTHaar-Net achieves substantial parameter reduction while maintaining competitive accuracy. On Tiny-ImageNet, our approach outperforms both ResNet and Hadamard-based baselines. We validate the quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
[125] SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data
Lekang Wen, Liang Liao, Jing Xiao, Mi Wang
Main category: cs.CV
TL;DR: SGMA framework addresses incomplete multimodal semantic segmentation challenges through semantic-guided fusion and modality-aware sampling to handle missing modalities, intra-class variation, and cross-modal heterogeneity.
Details
Motivation: Address three key challenges in incomplete multimodal semantic segmentation: multimodal imbalance (dominant modalities suppress fragile ones), intra-class variation across modalities, and cross-modal heterogeneity with conflicting semantic cues.
Method: Proposes Semantic-Guided Modality-Aware (SGMA) framework with two modules: 1) Semantic-Guided Fusion extracts multi-scale class-wise semantic prototypes, estimates modality robustness via prototype-feature alignment, and performs adaptive fusion; 2) Modality-Aware Sampling dynamically reweights training samples based on robustness scores to prioritize challenging fragile modalities.
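Prototype-feature alignment as a robustness score can be sketched with class means and cosine similarity; the 2D features below are toy values, whereas the actual module operates on multi-scale deep features:

```python
import math

# Sketch of prototype-based robustness estimation: a class prototype is the
# mean of that class's features, and a modality's robustness for a sample is
# the cosine alignment between its feature and the prototype.
def mean(vecs):
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class_feats = [[1.0, 0.1], [0.9, -0.1]]   # one class, one modality
proto = mean(class_feats)                  # class-wise prototype

robust = cosine([1.0, 0.0], proto)         # well-aligned feature
fragile = cosine([0.0, 1.0], proto)        # mis-aligned feature
```

SGF would turn such per-modality scores into adaptive fusion weights, and MAS would reuse them to up-sample training examples from the fragile modality.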
Result: Extensive experiments across multiple datasets and backbones show SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.
Conclusion: SGMA effectively addresses incomplete multimodal segmentation challenges through semantic guidance and modality-aware learning, achieving balanced multimodal representation while handling intra-class variation and cross-modal inconsistencies.
Abstract: Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra-class variation in scale, shape, and orientation across modalities; and (3) cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning or joint optimization, which risk over-alignment, discarding modality-specific cues or imbalanced training, favoring robust modalities, while largely overlooking intra-class variation and cross-modal heterogeneity. To address these limitations, we propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance. SGMA introduces two complementary plug-and-play modules: (1) Semantic-Guided Fusion (SGF) module extracts multi-scale, class-wise semantic prototypes that capture consistent categorical representations across modalities, estimates per-modality robustness based on prototype-feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra-class variation and cross-modal heterogeneity; (2) Modality-Aware Sampling (MAS) module leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.
[126] Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks
Syeda Hareem Madani, Noureen Bibi, Adam Rafiq Jeraj, Sumra Khan, Anas Zafar, Rizwan Qureshi
Main category: cs.CV
TL;DR: Graph-based deep learning framework for ASD classification using functional vs anatomical brain parcellations, achieving 95% accuracy with GAT ensemble on ABIDE I dataset.
Details
Motivation: Anatomical brain parcellations may fail to capture idiosyncratic connectivity patterns in ASD, so the study compares anatomical vs functionally-derived parcellation strategies for better ASD classification.
Method: Three-phase pipeline: baseline GCN with AAL atlas, optimized GCN with MSDL functional atlas, and Graph Attention Network ensemble; uses FSL preprocessing, site-stratified splits, Gaussian noise augmentation, and explainability analyses (gradient-based saliency and GNNExplainer).
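The neighborhood attention at the heart of a GAT layer reduces to a softmax over pairwise scores; the sketch below uses scalar node features and a plain product score in place of the learned linear maps and LeakyReLU of a real GAT layer:

```python
import math

# Minimal single-head graph-attention aggregation: score each neighbor,
# softmax-normalize the scores over the neighborhood, and return the
# attention-weighted sum of neighbor features.
def gat_aggregate(h, neighbors, i):
    logits = [h[i] * h[j] for j in neighbors[i]]     # toy pair score e_ij
    m = max(logits)                                   # stable softmax
    exp = [math.exp(e - m) for e in logits]
    alpha = [e / sum(exp) for e in exp]               # attention over N(i)
    return sum(a * h[j] for a, j in zip(alpha, neighbors[i]))

h = [1.0, 2.0, 0.5, 3.0]       # node features (e.g., ROI connectivity stats)
neighbors = {0: [1, 2, 3]}     # node 0 attends over its neighborhood
out = gat_aggregate(h, neighbors, 0)
```

The learned attention weights are also what make the explainability analyses possible: saliency over alpha identifies which ROI edges (here, the Default Mode Network hubs) drive a prediction.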
Result: Achieved 95.0% accuracy (AUC=0.98) with GAT ensemble, outperforming recent GNN benchmarks on ABIDE I; 10.7-point gain from functional parcellation substitution alone; identified Posterior Cingulate Cortex and Precuneus as key Default Mode Network hubs.
Conclusion: Functional parcellation is the most impactful modeling decision for ASD classification; model decisions reflect ASD neuropathology rather than acquisition artifacts; all code and datasets will be publicly released.
Abstract: Anatomical brain parcellations dominate rs-fMRI-based Autism Spectrum Disorder (ASD) classification, yet their rigid boundaries may fail to capture the idiosyncratic connectivity patterns that characterise ASD. We present a graph-based deep learning framework comparing anatomical (AAL, 116 ROIs) and functionally-derived (MSDL, 39 ROIs) parcellation strategies on the ABIDE I dataset. Our FSL preprocessing pipeline handles multi-site heterogeneity across 400 balanced subjects, with site-stratified 70/15/15 splits to prevent data leakage. Gaussian noise augmentation within training folds expands samples from 280 to 1,680. A three phase pipeline progresses from a baseline GCN with AAL (73.3% accuracy, AUC=0.74), to an optimised GCN with MSDL (84.0%, AUC=0.84), to a Graph Attention Network ensemble achieving 95.0% accuracy (AUC=0.98), outperforming all recent GNN-based benchmarks on ABIDE I. The 10.7-point gain from atlas substitution alone demonstrates that functional parcellation is the most impactful modelling decision. Gradient-based saliency and GNNExplainer analyses converge on the Posterior Cingulate Cortex and Precuneus as core Default Mode Network hubs, validating that model decisions reflect ASD neuropathology rather than acquisition artefacts. All code and datasets will be publicly released upon acceptance.
[127] NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen
Main category: cs.CV
TL;DR: NeighborMAE introduces spatial dependency learning in masked image modeling for Earth Observation by jointly reconstructing neighboring images with dynamic mask ratio and loss weight adjustments.
Details
Motivation: Current masked image modeling approaches for Earth Observation overlook spatial dependencies between neighboring images, which contain rich contextual information due to the continuous nature of Earth's surface.
Method: Proposes NeighborMAE that learns spatial dependencies through joint reconstruction of neighboring Earth Observation images, using a heuristic strategy to dynamically adjust mask ratio and pixel-level loss weight to maintain reconstruction challenge.
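A heuristic that keeps reconstruction challenging might adjust the mask ratio against the loss trend; the paper's exact schedule is not given in this summary, so the step size and bounds below are purely illustrative:

```python
# Illustrative dynamic mask-ratio heuristic: if reconstruction loss fell
# (the pretext task got easier, e.g., because neighbors leak context),
# mask more; if it rose, mask less. Clamped to a sensible range.
def adjust_mask_ratio(ratio, loss, prev_loss, step=0.05, lo=0.4, hi=0.9):
    if loss < prev_loss:
        ratio += step      # easier than before: increase difficulty
    else:
        ratio -= step      # harder than before: back off
    return max(lo, min(hi, ratio))

r = 0.75
r = adjust_mask_ratio(r, loss=0.20, prev_loss=0.35)   # loss dropped
r = adjust_mask_ratio(r, loss=0.40, prev_loss=0.20)   # loss rose
```

A pixel-level loss reweighting could be driven by the same signal, emphasizing patches whose reconstruction genuinely requires cross-image context.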
Result: Experimental results across various pretraining datasets and downstream tasks show NeighborMAE significantly outperforms existing baselines.
Conclusion: Neighboring images provide valuable spatial dependencies for masked image modeling in Earth Observation, and the proposed dynamic adjustment strategy is effective.
Abstract: Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.
[128] EIMC: Efficient Instance-aware Multi-modal Collaborative Perception
Kang Yang, Peng Wang, Lantao Li, Tianci Bu, Chen Sun, Deying Li, Yongcai Wang
Main category: cs.CV
TL;DR: EIMC introduces an early collaborative paradigm for multi-modal perception in autonomous driving that uses lightweight collaborative voxels and heatmap-driven consensus to reduce bandwidth while maintaining high detection accuracy.
Details
Motivation: Current multi-modal collaborative perception approaches follow a "local fusion to communication" sequence that requires high bandwidth for transmitting individual feature data before collaborative fusion, which is inefficient for real-world autonomous driving applications.
Method: 1) Early collaborative paradigm using lightweight collaborative voxels injected into local modality-fusion; 2) Heatmap-driven consensus protocol to identify where cooperation is needed; 3) Top-K instance vector querying from peers in low-confidence regions; 4) Cross-attention fusion for completion; 5) Refinement fusion using self-attention on top-K confident instances.
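The Top-K selection in step 3 can be sketched as a filter-then-sort over heatmap cells; the confidences below are synthetic stand-ins for the per-pixel heatmaps and instance vectors of the real system:

```python
# Sketch of EIMC-style instance selection: only cells where the ego's
# confidence heatmap is low become query candidates, and only the K
# candidates the peer is most confident about are actually transmitted.
def select_queries(ego_conf, peer_conf, k, thresh=0.5):
    cands = [i for i, c in enumerate(ego_conf) if c < thresh]
    cands.sort(key=lambda i: peer_conf[i], reverse=True)
    return cands[:k]

ego_conf = [0.9, 0.2, 0.4, 0.8, 0.1]    # per-cell ego confidence heatmap
peer_conf = [0.5, 0.7, 0.9, 0.6, 0.3]   # per-cell peer confidence
picked = select_queries(ego_conf, peer_conf, k=2)
```

Because only indices of low-confidence, peer-covered cells (plus their instance vectors) cross the wire, bandwidth scales with K rather than with the full feature map, which is the source of the reported 87.98% bandwidth reduction.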
Result: Achieves 73.01% AP@0.5 on OPV2V and DAIR-V2X datasets while reducing byte bandwidth usage by 87.98% compared to the best published multi-modal collaborative detector.
Conclusion: EIMC demonstrates that early collaboration with instance-centric messaging can significantly reduce communication bandwidth while maintaining high detection performance for multi-modal autonomous driving perception.
Abstract: Multi-modal collaborative perception has drawn great attention for enhancing the safety of autonomous driving. However, current multi-modal approaches follow a "local fusion to communication" sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual's feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at https://github.com/sidiangongyuan/EIMC.
[129] ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection
Deokyun Kim, Jeongjun Lee, Jungwon Choi, Jonggeon Park, Giyoung Lee, Yookyung Kim, Myungseok Ki, Juho Lee, Jihun Cha
Main category: cs.CV
TL;DR: ForestPersons dataset for under-canopy person detection in forest SAR missions, addressing limitations of aerial UAV imagery.
Details
Motivation: Current UAV-based missing person detection in forests is limited by canopy cover obscuring ground-level views, creating need for under-canopy perspectives.
Method: Created large-scale ForestPersons dataset with 96,482 images and 204,078 annotations including bounding boxes, pose, and visibility labels collected under diverse forest conditions.
Result: Standard object detection models perform poorly on ForestPersons, showing existing datasets don’t address under-canopy detection challenges.
Conclusion: ForestPersons fills critical gap for SAR applications, enabling better person detection under forest canopy conditions.
Abstract: Detecting missing persons in forest environments remains a challenge, as dense canopy cover often conceals individuals from detection in top-down or oblique aerial imagery typically captured by Unmanned Aerial Vehicles (UAVs). While UAVs are effective for covering large, inaccessible areas, their aerial perspectives often miss critical visual cues beneath the forest canopy. This limitation underscores the need for under-canopy perspectives better suited for detecting missing persons in such environments. To address this gap, we introduce ForestPersons, a novel large-scale dataset specifically designed for under-canopy person detection. ForestPersons contains 96,482 images and 204,078 annotations collected under diverse environmental and temporal conditions. Each annotation includes a bounding box, pose, and visibility label for occlusion-aware analysis. ForestPersons provides ground-level and low-altitude perspectives that closely reflect the visual conditions encountered by Micro Aerial Vehicles (MAVs) during forest Search and Rescue (SAR) missions. Our baseline evaluations reveal that standard object detection models, trained on prior large-scale object detection datasets or SAR-oriented datasets, show limited performance on ForestPersons. This indicates that prior benchmarks are not well aligned with the challenges of missing person detection under the forest canopy. We offer this benchmark to support advanced person detection capabilities in real-world SAR scenarios. The dataset is publicly available at https://huggingface.co/datasets/etri/ForestPersons.
[130] On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding
Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener, Angela Yao
Main category: cs.CV
TL;DR: GAD classifier combines generative and discriminative approaches for efficient closed-set action understanding in MLLMs, achieving state-of-the-art results with faster inference.
Details
Motivation: Current MLLMs use inefficient autoregressive generation for action classification, suffering from semantic overlap in action labels. Discriminative classifiers are more efficient but lack generative capabilities. The paper aims to bridge this gap.
Method: Proposes Generation-Assisted Discriminative (GAD) classifier that operates only during fine-tuning, preserving MLLM pretraining compatibility. Combines generative modeling with discriminative classifiers for better performance while maintaining efficiency.
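The efficiency gap between the two classifier types is easy to see in miniature: a discriminative head needs one softmax over the label set, while a generative classifier scores every label's token sequence step by step, and shared subwords ("pour" below) create the semantic overlap mentioned above. Labels, logits, and log-probs are toy values, not from the paper:

```python
import math

labels = ["pour water", "pour milk", "cut onion"]

def discriminative(scores):
    # One forward pass -> one softmax over all labels: a single step.
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    probs = [e / sum(exp) for e in exp]
    return labels[probs.index(max(probs))], 1

def generative(token_logprob, label_tokens):
    # Score each label's token sequence token by token, counting decode steps.
    steps, best, best_lp = 0, None, -math.inf
    for lab, toks in label_tokens.items():
        lp = 0.0
        for t in toks:
            lp += token_logprob[t]
            steps += 1
        if lp > best_lp:
            best, best_lp = lab, lp
    return best, steps

d_label, d_steps = discriminative([2.0, 1.0, 0.0])
token_logprob = {"pour": -0.1, "water": -0.5, "milk": -0.6,
                 "cut": -3.0, "onion": -3.0}
label_tokens = {lab: lab.split() for lab in labels}
g_label, g_steps = generative(token_logprob, label_tokens)
```

Both routes agree here, but the generative one spends a decoding step per token per label, which is where GAD's one-step discriminative inference gets its reported 3x speedup.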
Result: Achieves state-of-the-art on 4 tasks across 5 datasets, with 2.5% average accuracy gain and 3x faster inference on COIN benchmark compared to generative methods.
Conclusion: GAD effectively combines strengths of both generative and discriminative approaches for closed-set action understanding in MLLMs, offering improved accuracy and efficiency.
Abstract: Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose the Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.
[131] SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding
Sheng Ye, Zhen-Hui Dong, Ruoyu Fan, Tian Lv, Yong-Jin Liu
Main category: cs.CV
TL;DR: SemGS: A feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs using dual-branch architecture with shared CNN layers and camera-aware attention.
Details
Motivation: Existing methods for semantic scene reconstruction and novel view synthesis require dense multi-view inputs and scene-specific optimization, limiting practicality and scalability in real-world robotics applications.
Method: Dual-branch architecture extracts color and semantic features with shared shallow CNN layers, camera-aware attention mechanism models geometric relationships between viewpoints, decodes features into dual-Gaussians with geometric consistency, and uses regional smoothness loss for semantic coherence.
Result: Achieves state-of-the-art performance on benchmark datasets, provides rapid inference, and demonstrates strong generalization capabilities across diverse synthetic and real-world scenarios.
Conclusion: SemGS enables effective semantic scene reconstruction from sparse inputs with practical applications in robotics, offering a scalable solution for real-world deployment.
Abstract: Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.
[132] Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
Chonghua Lv, Dong Zhao, Shuang Wang, Dou Quan, Ning Huyan, Nicu Sebe, Zhun Zhong
Main category: cs.CV
TL;DR: GKD is a multi-stage knowledge distillation framework for semantic segmentation that enhances out-of-domain generalization by decoupling representation learning from task learning and using selective feature distillation.
Details
Motivation: Conventional knowledge distillation methods for semantic segmentation focus on preserving in-domain accuracy but neglect out-of-domain generalization, which is crucial under distribution shifts. This limitation becomes more severe with vision foundation models (VFMs) - while VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability.
Method: GKD uses a multi-stage framework that decouples representation learning from task learning. In stage 1, the student acquires domain-agnostic representations through selective feature distillation. In stage 2, these representations are frozen for task adaptation to mitigate overfitting to visible domains. Additionally, a query-based soft distillation mechanism is introduced where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs.
Result: Extensive experiments on five domain generalization benchmarks show GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation.
Conclusion: GKD effectively addresses the generalization limitation in conventional knowledge distillation for semantic segmentation, particularly when dealing with vision foundation models, by explicitly enhancing out-of-domain robustness through its multi-stage framework and selective distillation mechanisms.
Abstract: Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at https://github.com/Younger-hua/GKD.
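GKD's query-based soft distillation, where student features act as queries over teacher representations, can be pictured with plain dot-product attention: each student feature retrieves an attention-weighted mix of teacher features, and the distillation loss pulls the student toward that mix. This is a minimal sketch under assumed details (single-head attention, MSE loss); the paper's actual formulation may differ:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def query_distill_targets(student_feats, teacher_feats, tau=1.0):
    """For each student feature (query), attend over teacher features and
    return the attention-weighted mix of teacher features. A distillation
    loss would then pull each student feature toward its retrieved mix."""
    dim = len(teacher_feats[0])
    targets = []
    for q in student_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / tau
                  for k in teacher_feats]
        w = softmax(scores)
        mixed = [sum(wi * k[d] for wi, k in zip(w, teacher_feats))
                 for d in range(dim)]
        targets.append(mixed)
    return targets

def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    return sum((x - y) ** 2
               for ra, rb in zip(a, b)
               for x, y in zip(ra, rb)) / (len(a) * len(a[0]))
```

Because each target is a convex combination of teacher features, the student is never pushed toward any single (possibly domain-specific) teacher location, which matches the stated goal of retrieving transferable knowledge.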
[133] Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye
Main category: cs.CV
TL;DR: VC-STaR is a self-improving framework for vision-language models that uses visual contrastive pairs to reduce hallucinations in reasoning paths, creating a new visual reasoning dataset that boosts VLM performance.
Details
Motivation: Self-improving techniques work well for language models but face challenges in vision-language models due to visual hallucinations that can't be effectively verified or rectified. The authors observed that VLMs identify visual cues more precisely when presented with contrastive VQA pairs (similar images with synonymous questions).
Method: Proposes the Visual Contrastive Self-Taught Reasoner (VC-STaR) framework, which leverages visual contrast to mitigate hallucinations. The authors collect diverse VQA datasets, curate contrastive pairs based on multi-modal similarity, generate rationales using VC-STaR, and build the VisCoR-55K dataset for supervised finetuning of VLMs.
Result: VC-STaR outperforms existing self-improving approaches and surpasses models finetuned on state-of-the-art visual reasoning datasets. The framework demonstrates that VLMs’ inherent contrastive ability can bootstrap their own visual reasoning.
Conclusion: Visual contrastive self-improving is an effective approach for enhancing reasoning capabilities in vision-language models by reducing hallucinations and leveraging the models’ inherent ability to process contrastive visual information.
Abstract: Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.
[134] CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
Maoyuan Shao, Yutong Gao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Guoshun Nan
Main category: cs.CV
TL;DR: CAPT is a confusion-aware prompt tuning framework for vision-language models that addresses systematic misclassifications between visually/semantically similar categories by learning from model’s own confusion patterns.
Details
Motivation: Vision-language models like CLIP suffer from systematic misclassifications among visually and semantically similar categories, revealing intrinsic bias and limited fine-grained discriminative ability. These confusion patterns are not random but persistently occur between specific category pairs.
Method: Proposes CAPT framework with: 1) Confusion Bank to model stable confusion relationships, 2) Semantic Confusion Miner (SEM) to capture global inter-class confusion via semantic difference/commonality prompts, 3) Sample Confusion Miner (SAM) to retrieve misclassified instances using Diff-Manner Adapter, and 4) Multi-Granularity Difference Expert (MGDE) to unify semantic- and sample-level confusion information.
Result: Extensive experiments on 11 benchmark datasets show significant reduction in confusion-induced errors, enhanced discriminability and generalization for both base and novel classes, successfully resolving 50.72% of confusable sample pairs.
Conclusion: CAPT effectively addresses systematic confusion in vision-language models by enabling models to learn from their own misalignment, improving fine-grained discriminative ability while maintaining generalization performance.
Abstract: Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model’s intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.
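The Confusion Bank, which records stable confusion relationships across categories along with the misclassified samples behind them, can be illustrated with a simple counter structure. A hedged sketch only; the paper's actual bank presumably stores feature embeddings for the SAM retrieval step, not bare sample IDs:

```python
from collections import Counter

class ConfusionBank:
    """Track persistent (true_class -> predicted_class) confusion pairs and
    which samples produced them, so the most confusable pairs can be mined."""

    def __init__(self):
        self.pair_counts = Counter()   # (true, pred) -> occurrence count
        self.samples = {}              # (true, pred) -> misclassified sample ids

    def update(self, sample_id, true_label, pred_label):
        """Record one prediction; only misclassifications enter the bank."""
        if true_label != pred_label:
            key = (true_label, pred_label)
            self.pair_counts[key] += 1
            self.samples.setdefault(key, []).append(sample_id)

    def top_confusable_pairs(self, k=3):
        """The k category pairs the model confuses most persistently."""
        return [pair for pair, _ in self.pair_counts.most_common(k)]
```

Mining the most frequent pairs (rather than treating every error equally) is what lets the method focus prompts and adapters on the stable, systematic confusions the abstract describes.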
[135] CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration
Huichun Liu, Xiaosong Li, Zhuangfan Huang, Tao Ye, Yang Liu, Haishu Tan
Main category: cs.CV
TL;DR: CAWM-Mamba: First end-to-end framework for multimodal image fusion with compound adverse weather restoration using unified shared weights and wavelet-domain decomposition.
Details
Motivation: Existing adverse weather fusion methods only handle single degradation types (haze, rain, snow) and fail with multiple coexisting degradations (haze+rain, rain+snow), which is crucial for autonomous driving and UAV monitoring applications.
Method: Three key components: 1) Weather-Aware Preprocess Module (WAPM) for degraded visible feature enhancement and global weather embeddings, 2) Cross-modal Feature Interaction Module (CFIM) for heterogeneous modality alignment and complementary feature exchange, 3) Wavelet Space State Block (WSSB) with wavelet-domain decomposition to decouple multi-frequency degradations, including Freq-SSM for anisotropic high-frequency degradation modeling and unified degradation representation.
Result: Extensive experiments on AWMM-100K benchmark and three standard fusion datasets show CAWM-Mamba outperforms state-of-the-art methods in both compound and single-weather scenarios. Fusion results excel in downstream tasks like semantic segmentation and object detection.
Conclusion: CAWM-Mamba is the first end-to-end framework for joint image fusion and compound weather restoration, demonstrating superior performance and practical value for real-world adverse weather perception applications.
Abstract: Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) to enhance degraded visible features and extract global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming its practical value in real-world adverse weather perception. The source code will be available at https://github.com/Feecuin/CAWM-Mamba.
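The abstract does not specify which wavelet WSSB uses. As an illustration of how a wavelet-domain decomposition decouples frequency bands, here is a one-level 2D Haar transform in plain Python: low-frequency degradations such as haze concentrate in the LL subband, while streak-like rain/snow shows up in the high-frequency LH/HL/HH subbands, so each can be handled separately:

```python
def haar2d(img):
    """One-level 2D Haar transform of an even-sized grayscale grid.
    Returns the (LL, LH, HL, HH) subbands using averaging normalization."""
    h, w = len(img), len(img[0])
    # transform rows: first half = pairwise averages, second half = differences
    row = [[(r[2 * j] + r[2 * j + 1]) / 2 for j in range(w // 2)] +
           [(r[2 * j] - r[2 * j + 1]) / 2 for j in range(w // 2)]
           for r in img]
    # transform columns the same way
    out = [[0.0] * w for _ in range(h)]
    for j in range(w):
        for i in range(h // 2):
            out[i][j] = (row[2 * i][j] + row[2 * i + 1][j]) / 2
            out[h // 2 + i][j] = (row[2 * i][j] - row[2 * i + 1][j]) / 2
    LL = [r[:w // 2] for r in out[:h // 2]]  # low-low: coarse content, haze
    LH = [r[w // 2:] for r in out[:h // 2]]  # horizontal detail
    HL = [r[:w // 2] for r in out[h // 2:]]  # vertical detail
    HH = [r[w // 2:] for r in out[h // 2:]]  # diagonal detail
    return LL, LH, HL, HH
```

A smooth region lands entirely in LL with zero detail coefficients, which is exactly the property that lets a model treat multi-frequency degradations independently.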
[136] Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels
Jiahao Lu, Jiayi Xu, Wenbo Hu, Ruijie Zhu, Chengfeng Zhao, Sai-Kit Yeung, Ying Shan, Yuan Liu
Main category: cs.CV
TL;DR: Track4World: A feedforward model for efficient holistic 3D tracking of every pixel in world-centric coordinates from monocular video using 3D correlation scheme for simultaneous 2D/3D dense flow estimation.
Details
Motivation: Current monocular 3D tracking methods are limited to either tracking sparse points on the first frame or using slow optimization-based frameworks for dense tracking. There's a need for efficient holistic 3D tracking of every pixel in world-centric coordinates for comprehensive understanding of 3D video dynamics.
Method: Built on a global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow combined with reconstructed 3D geometry enables efficient 3D tracking of every pixel.
Result: Extensive experiments on multiple benchmarks demonstrate that Track4World consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, showing robustness and scalability for real-world 4D reconstruction tasks.
Conclusion: Track4World enables efficient holistic 3D tracking of every pixel in world-centric coordinates, advancing monocular video understanding and 4D reconstruction capabilities beyond current sparse or slow optimization-based approaches.
Abstract: Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
[137] ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration
Leheng Zhang, Wei Long, Yawei Li, Xingyu Zhou, Xiaorui Zhao, Shuhang Gu
Main category: cs.CV
TL;DR: ATD is a transformer-based image restoration architecture using adaptive token dictionaries for global dependency modeling with linear complexity, achieving SOTA on super-resolution and other restoration tasks.
Details
Motivation: Transformers show promise for image restoration but suffer from quadratic complexity in self-attention, forcing local window restrictions that limit receptive field and performance. Need global dependency modeling with linear complexity.
Method: Proposes Adaptive Token Dictionary (ATD) with learnable token dictionary summarizing image priors. Uses Token Dictionary Cross-Attention (TDCA) to enhance features via dictionary interaction. Groups features by category from TDCA attention maps, integrates category info into feed-forward network for better fusion.
Result: ATD and ATD-light achieve state-of-the-art performance on multiple image super-resolution benchmarks. ATD-U variant shows strong results on denoising and JPEG artifact removal tasks.
Conclusion: ATD enables efficient global dependency modeling for image restoration with linear complexity, outperforming existing methods across multiple restoration tasks while maintaining computational efficiency.
Abstract: Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifact removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
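The key property of token dictionary cross-attention is that every image token attends to a small fixed-size dictionary, so cost grows as O(N·M·d), linear in token count N, rather than O(N²·d) as in full self-attention. A minimal sketch with an explicit operation counter to make the scaling visible (details assumed; the paper's TDCA additionally feeds its attention maps into category grouping):

```python
import math

def tdca(tokens, dictionary):
    """Token-dictionary cross-attention sketch: each of N tokens queries the
    M dictionary entries. Returns the enhanced tokens and a count of the
    multiply operations performed, which grows linearly with N."""
    d = len(tokens[0])
    out, ops = [], 0
    for q in tokens:
        scores = []
        for k in dictionary:                       # M keys
            scores.append(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d))
            ops += d                               # d multiplies per dot product
        m = max(scores)                            # numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [v / z for v in w]
        out.append([sum(wi * k[j] for wi, k in zip(w, dictionary))
                    for j in range(d)])            # weighted sum of values
        ops += d * len(dictionary)
    return out, ops
```

With a dictionary of, say, 64 entries, doubling the number of image tokens exactly doubles the work, which is what makes global modeling affordable at high resolution.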
[138] Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction
Zhe Chen, Peilin Zheng, Wenshuo Chen, Xiucheng Wang, Yutao Yue, Nan Cheng
Main category: cs.CV
TL;DR: NEMF is a framework for creating functional digital twins by solving the ill-posed physical inversion problem to reconstruct material properties from non-invasive sensing data, enabling simulatable 3D replicas beyond visual appearance.
Details
Motivation: Current methods like NeRF produce visually rich but functionally incomplete digital twins lacking underlying material properties. The key challenge is acquiring material properties (permittivity, conductivity) non-invasively, which requires solving an ill-posed physical inversion problem where standard signals like images and RF deeply entangle geometry, ambient field, and materials.
Method: NEMF uses a systematic disentanglement strategy: 1) leverages high-fidelity geometry from images as an anchor to resolve ambient field, 2) constrains both geometry and field using non-invasive data to transform the ill-posed problem into well-posed physics-supervised learning, 3) uses a decoder guided by ambient RF signals and differentiable physical reflection models to output continuous, spatially-varying material parameter fields.
Result: Validated on high-fidelity synthetic datasets, NEMF reconstructs material maps with high accuracy and enables high-fidelity physical simulation. The resulting functional digital twin moves beyond passive visual replicas to truly functional and simulatable models.
Conclusion: NEMF advances digital twin creation by enabling dense, non-invasive physical inversion of material properties, transforming ill-posed problems into well-posed learning tasks and enabling functional simulation capabilities beyond visual appearance.
Abstract: Creating functional Digital Twins, simulatable 3D replicas of the real world, is a central challenge in computer vision. Current methods like NeRF produce visually rich but functionally incomplete twins. The key barrier is the lack of underlying material properties (e.g., permittivity, conductivity). Acquiring this information for every point in a scene via non-contact, non-invasive sensing is a primary goal, but it demands solving a notoriously ill-posed physical inversion problem. Standard remote signals, like images and radio frequencies (RF), deeply entangle the unknown geometry, ambient field, and target materials. We introduce NEMF, a novel framework for dense, non-invasive physical inversion designed to build functional digital twins. Our key insight is a systematic disentanglement strategy. NEMF leverages high-fidelity geometry from images as a powerful anchor, which first enables the resolution of the ambient field. By constraining both geometry and field using only non-invasive data, the original ill-posed problem transforms into a well-posed, physics-supervised learning task. This transformation unlocks our core inversion module: a decoder. Guided by ambient RF signals and a differentiable layer incorporating physical reflection models, it learns to explicitly output a continuous, spatially-varying field of the scene’s underlying material parameters. We validate our framework on high-fidelity synthetic datasets. Experiments show our non-invasive inversion reconstructs these material maps with high accuracy, and the resulting functional twin enables high-fidelity physical simulation. This advance moves beyond passive visual replicas, enabling the creation of truly functional and simulatable models of the physical world.
[139] Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification
Rafi Hassan Chowdhury, Naimul Haque, Kaniz Fatiha
Main category: cs.CV
TL;DR: Study evaluates image augmentation techniques for Bengali handwritten character recognition using lightweight EfficientViT model, finding Random Affine + Color Jitter combination achieves best accuracy on two datasets.
Details
Motivation: Address limited data problem for resource-scarce languages like Bengali in computer vision tasks, where large datasets are often unavailable for training deep learning models effectively.
Method: Tested various image augmentation techniques (CLAHE, Random Rotation, Random Affine, Color Jitter, and combinations) with lightweight EfficientViT model on Bengali handwritten character datasets (Ekush and AIBangla).
Result: Random Affine + Color Jitter combination achieved best accuracy: 97.48% on Ekush and 97.57% on AIBangla datasets, outperforming all other individual and combined augmentation techniques.
Conclusion: Image data augmentation is effective for resource-scarce languages, with specific combinations like Random Affine + Color Jitter providing optimal performance for Bengali handwritten character recognition using lightweight models.
Abstract: Deep learning models have proven to be highly effective in computer vision, with deep convolutional neural networks achieving impressive results across various computer vision tasks. However, these models rely heavily on large datasets to avoid overfitting. When a model learns features with either low or high variance, it can lead to underfitting or overfitting on the training data. Unfortunately, large-scale datasets may not be available in many domains, particularly for resource-limited languages such as Bengali. In this experiment, a series of tests were conducted in the field of image data augmentation as an approach to addressing the limited data problem for Bengali handwritten characters. The study also provides an in-depth analysis of the performance of different augmentation techniques. Data augmentation refers to a set of techniques applied to data to increase its size and diversity, making it more suitable for training deep learning models. The image augmentation techniques evaluated in this study include CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations. The study further explores the use of augmentation methods with a lightweight model such as EfficientViT. Among the different augmentation strategies, the combination of Random Affine and Color Jitter produced the best accuracy on the Ekush [1] and AIBangla [2] datasets, achieving accuracies of 97.48% and 97.57%, respectively. This combination outperformed all other individual and combined augmentation techniques. Overall, this analysis presents a thorough examination of the impact of image data augmentation in resource-scarce languages, particularly in the context of Bengali handwritten character recognition using lightweight models.
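In practice the winning combination corresponds to torchvision's `transforms.RandomAffine` and `transforms.ColorJitter`. The dependency-free sketch below mimics the two operations in miniature (rotation-only affine with nearest-neighbour resampling, brightness-only jitter); the real transforms also cover translation/shear/scale and contrast/saturation/hue:

```python
import math
import random

def random_affine_rotate(img, max_deg, rng):
    """Rotate a grayscale grid about its centre by a random angle,
    sampling with nearest neighbours; out-of-bounds pixels become 0."""
    h, w = len(img), len(img[0])
    ang = math.radians(rng.uniform(-max_deg, max_deg))
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # inverse-map each output pixel back into the source image
            sx = math.cos(ang) * (x - cx) + math.sin(ang) * (y - cy) + cx
            sy = -math.sin(ang) * (x - cx) + math.cos(ang) * (y - cy) + cy
            si, sj = round(sy), round(sx)
            if 0 <= si < h and 0 <= sj < w:
                out[y][x] = img[si][sj]
    return out

def color_jitter_brightness(img, factor_range, rng):
    """Scale all pixel values by a random brightness factor, clipped to 255."""
    f = rng.uniform(*factor_range)
    return [[min(255.0, p * f) for p in row] for row in img]

def augment(img, rng):
    """Random Affine (rotation) followed by Color Jitter (brightness)."""
    return color_jitter_brightness(
        random_affine_rotate(img, 15, rng), (0.8, 1.2), rng)
```

Each training epoch then sees a slightly different geometric and photometric variant of every character, which is the diversity the study credits for the accuracy gain.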
[140] Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation
Taowen Zeng
Main category: cs.CV
TL;DR: Synthetic-Child: AIGC pipeline generates photorealistic child posture training data without real child photos, achieving 71.2 AP on real-child test set with 12.5 AP improvement over adult-data baseline.
Details
Motivation: Collecting annotated child posture datasets is expensive and ethically problematic due to privacy concerns. Need synthetic alternatives to reduce dependence on real child imagery while maintaining accuracy.
Method: Four-stage pipeline: 1) Programmable 3D child body model (SMPL-X) generates diverse poses with ground-truth annotations; 2) Dual ControlNet (pose + depth) conditioned on FLUX-1 Dev synthesizes photorealistic images; 3) ViTPose filtering and augmentation; 4) RTMPose-M fine-tuning with quantization for edge deployment.
Result: Achieves 71.2 AP on real-child test set (+12.5 AP improvement over baseline). After INT8 quantization: 70.4 AP at 22 FPS on edge NPU. Outperforms commercial posture corrector with higher recognition rates and 1.8x faster response.
Conclusion: Carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with applications to other privacy-sensitive domains.
Abstract: Accurate child posture estimation is critical for AI-powered study companion devices, yet collecting large-scale annotated datasets of children is both expensive and ethically prohibitive due to privacy concerns. We present Synthetic-Child, an AIGC-based synthetic data pipeline that produces photorealistic child posture training images with ground-truth-projected keypoint annotations, requiring zero real child photographs. The pipeline comprises four stages: (1) a programmable 3D child body model (SMPL-X) in Blender generates diverse desk-study poses with IK-constrained anatomical plausibility and automatic COCO-format ground-truth export; (2) a custom PoseInjectorNode feeds 3D-derived skeletons into a dual ControlNet (pose + depth) conditioned on FLUX-1 Dev, synthesizing 12,000 photorealistic images across 10 posture categories with low annotation drift; (3) ViTPose-based confidence filtering and targeted augmentation remove generation failures and improve robustness; (4) RTMPose-M (13.6M params) is fine-tuned on the synthetic data and paired with geometric feature engineering and a lightweight MLP for posture classification, then quantized to INT8 for real-time edge deployment. On a real-child test set (n~300), the FP16 model achieves 71.2 AP – a +12.5 AP improvement over the COCO-pretrained adult-data baseline at identical model capacity. After INT8 quantization the model retains 70.4 AP while running at 22 FPS on a 0.8-TOPS Rockchip RK3568 NPU. In a single-subject controlled comparison with a commercial posture corrector, our system achieves substantially higher recognition rates across most tested categories and responds ~1.8x faster on average. These results demonstrate that carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with potential applications to other privacy-sensitive domains.
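Stage 1's "ground-truth-projected keypoint annotations" boil down to projecting the 3D body model's joints through the render camera into pixel coordinates with COCO visibility flags. A minimal pinhole-camera sketch (the intrinsics below are hypothetical; the real pipeline would use Blender's camera matrices):

```python
def project_keypoints(points_3d, fx, fy, cx, cy):
    """Pinhole projection of 3D keypoints (camera coordinates, z forward)
    into a flat COCO-style keypoint list [u1, v1, vis1, u2, v2, vis2, ...].
    COCO visibility: 0 = not labeled, 2 = labeled and visible."""
    kps = []
    for x, y, z in points_3d:
        if z <= 0:
            # behind the camera: mark as not labeled
            kps.extend([0.0, 0.0, 0])
        else:
            u = fx * x / z + cx
            v = fy * y / z + cy
            kps.extend([u, v, 2])
    return kps
```

Because the 2D labels come from the same geometry that drives the pose ControlNet, annotation drift between image and label is limited to what the generator introduces, which is what the filtering stage then checks.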
[141] VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction
A. Enes Doruk, Hasan F. Ates
Main category: cs.CV
TL;DR: VLMFusionOcc3D uses vision-language models to improve 3D semantic occupancy prediction in autonomous driving by injecting semantic priors into voxel features and dynamically fusing sensor data based on weather conditions.
Details
Motivation: Current voxel-based occupancy models suffer from semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions, limiting their robustness for autonomous driving applications.
Method: Proposes a multimodal framework with: 1) dual-branch feature extraction from multi-view images and LiDAR, 2) Instance-driven VLM Attention (InstVLM) using gated cross-attention and LoRA-adapted CLIP embeddings to inject semantic priors, 3) Weather-Aware Adaptive Fusion (WeathFusion) for dynamic sensor weighting based on weather conditions, and 4) Depth-Aware Geometric Alignment (DAGA) loss for geometric consistency.
Result: Extensive experiments on nuScenes and SemanticKITTI show consistent performance improvements over state-of-the-art voxel-based baselines, with significant gains in challenging weather scenarios.
Conclusion: The framework provides a scalable and robust solution for 3D semantic occupancy prediction that effectively addresses semantic ambiguity and weather-related performance degradation through multimodal fusion and vision-language model integration.
Abstract: This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
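WeathFusion's dynamic gating can be pictured as a learned weather-conditioned weight that blends the camera and LiDAR contributions per voxel. A toy sketch with a single linear gate; the gate weights and weather embeddings below are illustrative, not from the paper (the actual module also conditions on vehicle metadata and text prompts):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weather_gated_fusion(cam_feat, lidar_feat, weather_emb, w_gate, b_gate):
    """Convex combination of camera and LiDAR voxel features, with the mixing
    weight g predicted from a weather embedding by a linear gate.
    g near 1 trusts the camera; g near 0 trusts LiDAR."""
    g = sigmoid(sum(wi * ei for wi, ei in zip(w_gate, weather_emb)) + b_gate)
    fused = [g * c + (1 - g) * l for c, l in zip(cam_feat, lidar_feat)]
    return fused, g
```

In clear weather the gate can lean on the semantically rich camera branch; in heavy rain or fog it shifts weight toward the geometrically reliable LiDAR returns, which is the behavior the abstract describes as re-weighting by real-time environmental reliability.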
[142] Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang
Main category: cs.CV
TL;DR: InterNeg: A framework for OOD detection using Vision-Language Models that addresses inconsistency between intra-modal and inter-modal distances by systematically enhancing consistent inter-modal distance from both textual and visual perspectives.
Details
Motivation: Current VLM-based OOD detection methods often incorporate intra-modal distance (comparing negative texts with ID labels or test images with image proxies), which creates inconsistency with the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance.
Method: Proposes InterNeg framework with two key components: 1) Textual perspective: inter-modal criterion for selecting negative texts; 2) Visual perspective: dynamically identify high-confidence OOD images and invert them into textual space to generate extra negative text embeddings guided by inter-modal distance.
Result: Achieves state-of-the-art performance across multiple benchmarks, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% AUROC improvement on the challenging Near-OOD benchmark.
Conclusion: InterNeg effectively addresses the inconsistency issue in VLM-based OOD detection by systematically utilizing consistent inter-modal distance enhancement, demonstrating superior performance over existing methods.
Abstract: Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
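The core contrast the paper draws, scoring a test image only against text embeddings (ID labels vs. negative texts) rather than against image proxies, can be sketched as a softmax-over-similarities score. The function name, temperature, and decision rule below are assumptions for illustration, not InterNeg's actual criterion:

```python
import numpy as np

def inter_modal_ood_score(img, id_texts, neg_texts, tau=0.07):
    """Hypothetical inter-modal OOD score: softmax over cosine similarities
    between the image embedding and all text embeddings; the mass assigned
    to ID labels is high for ID samples and low for OOD samples."""
    def cos(a, B):
        a = a / np.linalg.norm(a)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return B @ a
    sims = np.concatenate([cos(img, id_texts), cos(img, neg_texts)]) / tau
    p = np.exp(sims - sims.max())
    p = p / p.sum()
    return float(p[: len(id_texts)].sum())   # probability mass on ID labels
```

Every distance computed here is image-to-text, which is what CLIP-style contrastive training actually optimizes; that is the consistency argument in the abstract.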
[143] Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild
Seunguk Do, Minwoo Huh, Joonghyuk Shin, Jaesik Park
Main category: cs.CV
TL;DR: DrPose improves 3D human reconstruction from single images by fine-tuning multi-view diffusion models on diverse poses using only pose-image pairs, without expensive 3D data.
Details
Motivation: Current single-view 3D human reconstruction methods using multi-view diffusion models produce unnatural poses, especially for dynamic/challenging poses, due to limited diversity in 3D human datasets.
Method: DrPose uses direct reward fine-tuning on a novel DrPose15K dataset (constructed from human motion data and pose-conditioned video generation). It trains with only pose-image pairs using PoseScore - a differentiable reward measuring consistency between generated multi-view images and ground-truth poses.
Result: Consistent qualitative and quantitative improvements across conventional benchmarks, in-the-wild images, and a new benchmark focused on challenging poses.
Conclusion: DrPose effectively addresses pose diversity limitations in 3D human reconstruction by leveraging abundant pose data without requiring expensive 3D assets.
Abstract: Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing a direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: https://seunguk-do.github.io/drpose.
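A differentiable pose-consistency reward like PoseScore could, in the simplest case, be a smooth function of keypoint error between the pose detected on a generated view and the ground-truth pose. The Gaussian form and sigma below are assumptions for illustration only; the paper's actual reward operates on multi-view latents:

```python
import numpy as np

def pose_score(pred_kpts, gt_kpts, sigma=0.1):
    """Hypothetical PoseScore-style reward: per-joint Gaussian of the
    keypoint distance, averaged over joints. Smooth in pred_kpts, so its
    gradient can flow back into the generator during reward fine-tuning."""
    err = np.linalg.norm(pred_kpts - gt_kpts, axis=-1)   # per-joint distance
    return float(np.exp(-(err ** 2) / (2 * sigma ** 2)).mean())
```

The key property is differentiability: unlike a binary "pose matches / doesn't" check, this score gives a usable gradient signal even for nearly correct poses.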
[144] Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
Kaifang Long, Lianbo Ma, Jiaqi Liu, Liming Liu, Guoyang Xie
Main category: cs.CV
TL;DR: IB-IUMAD: A novel denoising framework for incremental unified multimodal anomaly detection that addresses catastrophic forgetting by disentangling spurious features and filtering redundant information using Mamba decoder and information bottleneck fusion.
Details
Motivation: Address catastrophic forgetting in incremental unified multimodal anomaly detection by identifying and mitigating the negative impact of spurious and redundant features, which previous methods have overlooked, particularly in multimodal frameworks created by naive aggregation of unimodal architectures.
Method: Proposes IB-IUMAD framework combining Mamba decoder for disentangling inter-object feature coupling to prevent spurious feature interference, and information bottleneck fusion module to filter redundant features from fused multimodal features while preserving discriminative information.
Result: Theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrate the effectiveness and competitive performance of IB-IUMAD in incremental unified multimodal anomaly detection.
Conclusion: The proposed IB-IUMAD framework successfully addresses catastrophic forgetting in multimodal anomaly detection by explicitly handling spurious and redundant features, offering a robust solution for incremental learning across modalities.
Abstract: The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former dedicated to disentangle inter-object feature coupling, preventing spurious feature interference between objects; the latter serves to filter out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.
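An information-bottleneck fusion objective of the kind described typically adds a KL penalty that compresses the fused code toward a simple prior, squeezing out redundant features. The Gaussian-posterior form below is a standard IB/VIB sketch under assumed names, not the paper's exact module:

```python
import numpy as np

def ib_loss(task_err, mu, logvar, beta=1e-3):
    """Hypothetical information-bottleneck objective for a fusion module:
    task error plus beta * KL(q(z|x) || N(0, I)). The KL term penalizes
    extra bits in the fused code z, discouraging redundant features."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return task_err + beta * kl
```

With beta = 0 this reduces to the plain task loss; raising beta trades task fit for a more compressed, and hence less redundant, fused representation.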
[145] SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation
Fengming Zhang, Tao Yan, Jianchao Huang
Main category: cs.CV
TL;DR: SEP-YOLO is a novel framework for transparent object instance segmentation that uses frequency domain enhancement and multi-scale spatial refinement to address challenges like boundary blur and low contrast in transparent objects.
Details
Motivation: Transparent object instance segmentation is challenging due to boundary blur, low contrast, and high dependence on background context. Existing methods fail because they rely on strong appearance cues and clear boundaries that transparent objects lack.
Method: Proposes SEP-YOLO with: 1) Frequency Domain Detail Enhancement Module using learnable complex weights to enhance weak high-frequency boundary components, 2) Multi-scale spatial refinement stream with Content-Aware Alignment Neck and Multi-scale Gated Refinement Block for precise feature alignment and boundary localization, and 3) Provides high-quality instance-level annotations for Trans10K dataset.
Result: Extensive experiments on Trans10K and GVD datasets show SEP-YOLO achieves state-of-the-art performance in transparent object instance segmentation.
Conclusion: SEP-YOLO effectively addresses transparent object segmentation challenges through frequency domain enhancement and multi-scale refinement, with new annotations filling a critical data gap in the field.
Abstract: Transparent object instance segmentation presents significant challenges in computer vision, due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP-YOLO, a novel framework that integrates a dual-domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high-frequency boundary components via learnable complex weights. We further design a multi-scale spatial refinement stream, which consists of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high-quality instance-level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP-YOLO achieves state-of-the-art (SOTA) performance.
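Enhancing features "via learnable complex weights" in the Fourier domain has a standard shape: FFT the feature map, multiply each frequency bin by a learned complex scalar, and transform back. The sketch below follows that pattern with assumed names; the module's real parameterization is not specified in the summary:

```python
import numpy as np

def freq_enhance(feat, w_real, w_imag):
    """Hypothetical frequency-domain detail enhancement: scale each
    frequency bin of a feature map by a learnable complex weight, so
    weak high-frequency (boundary) components can be amplified."""
    F = np.fft.fft2(feat)
    F = F * (w_real + 1j * w_imag)     # learnable complex re-weighting
    return np.real(np.fft.ifft2(F))
```

With all-ones real weights and zero imaginary weights the operation is the identity; training would push the high-frequency weights above 1 to sharpen faint transparent-object boundaries.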
[146] OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning
Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang, Zheng Wang
Main category: cs.CV
TL;DR: OmniFashion: A unified vision-language framework for fashion intelligence that bridges multiple tasks (retrieval, recommendation, recognition, dialogue) using a million-scale dataset with comprehensive fashion annotations.
Details
Motivation: Current fashion intelligence suffers from fragmented supervision and incomplete annotations, preventing vision-language models from serving as a generalist fashion brain that unifies understanding and reasoning across diverse tasks.
Method: Construct FashionX dataset (million-scale with exhaustive fashion item annotations organized from global to part-level), then develop OmniFashion framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm.
Result: OmniFashion achieves strong task-level accuracy and cross-task generalization on multi-subtasks and retrieval benchmarks, demonstrating scalable path toward universal, dialogue-oriented fashion intelligence.
Conclusion: The unified vision-language framework offers a scalable approach to universal fashion intelligence by addressing annotation limitations and enabling multi-task reasoning through dialogue.
Abstract: Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from global to part-level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multi-subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, highlighting its offering of a scalable path toward universal, dialogue-oriented fashion intelligence.
[147] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani
Main category: cs.CV
TL;DR: MoD-DPO improves modality grounding in omni-modal LLMs by enforcing modality-aware regularization and language-prior debiasing to reduce cross-modal hallucinations.
Details
Motivation: Omni-modal LLMs suffer from cross-modal hallucinations due to spurious correlations and dominant language priors, despite strong performance on audiovisual understanding tasks.
Method: Proposes Modality-Decoupled Direct Preference Optimization (MoD-DPO) with modality-aware regularization terms that enforce invariance to corruptions in irrelevant modalities and sensitivity to relevant modalities, plus language-prior debiasing penalty.
Result: Extensive experiments show MoD-DPO consistently improves perception accuracy and hallucination resistance across multiple audiovisual hallucination benchmarks, outperforming previous preference optimization baselines.
Conclusion: MoD-DPO demonstrates the importance of modality-faithful alignment and provides a scalable path toward more reliable and resilient multimodal foundation models.
Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
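The backbone of MoD-DPO is the standard DPO objective, onto which (per the summary) the modality-aware regularizers and language-prior penalty are added. The base term can be written down exactly; the extra terms are the paper's contribution and are not reproduced here:

```python
import numpy as np

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO term: -log sigmoid of the beta-scaled margin between
    the policy's log-prob advantage on the preferred response (w) and the
    rejected response (l), each relative to a frozen reference model."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))
```

MoD-DPO would then add penalties of the form "loss must not change when an irrelevant modality is corrupted" and "loss must change when a relevant one is perturbed", plus the text-only debiasing term.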
[148] DREAM: Where Visual Understanding Meets Text-to-Image Generation
Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra
Main category: cs.CV
TL;DR: DREAM is a unified multimodal framework that jointly optimizes discriminative and generative objectives for both visual representation learning and text-to-image generation within a single model.
Details
Motivation: The paper addresses the challenge of unifying visual representation learning and text-to-image generation within a single model, which remains a central problem in multimodal learning. Current approaches typically separate these two capabilities into different models or architectures.
Method: DREAM uses two key techniques: 1) Masking Warmup - a progressive masking schedule that starts with minimal masking for contrastive alignment (representation learning) and gradually transitions to full masking for stable generative training; 2) Semantically Aligned Decoding - at inference, aligns partially masked image candidates with target text and selects the best one for further decoding to improve text-image fidelity.
Result: Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. Also improves text-image fidelity by +6.3% without external rerankers.
Conclusion: Discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation, demonstrating that a single model can effectively handle both representation learning and generation tasks.
Abstract: Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
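The Masking Warmup schedule is, in its simplest possible form, a monotone ramp from near-zero masking (so contrastive alignment sees mostly intact images) to full masking (for the generative objective). The linear shape below is an assumption; the paper does not state the exact schedule:

```python
def masking_ratio(step, warmup_steps, min_ratio=0.0, max_ratio=1.0):
    """Hypothetical Masking-Warmup schedule: linear ramp from min_ratio
    (contrastive-friendly) to max_ratio (full masking for generation),
    clamped after warmup_steps."""
    t = min(step / warmup_steps, 1.0)
    return min_ratio + t * (max_ratio - min_ratio)
```

A training loop would call this once per step to decide what fraction of image tokens to mask before computing the joint discriminative-plus-generative loss.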
[149] VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation
Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao, Qinyu Yang, Qi Chen, Qin Lin, Chuyue Li, Tao Gao, Yuhao Shan, Shuai Shao, Song Guo, Qinglin Lu
Main category: cs.CV
TL;DR: VisionCreator is a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation capabilities for autonomous visual content creation through specialized training and evaluation frameworks.
Details
Motivation: Current approaches have limitations: general models lack nuanced understanding of design conventions and creative workflows, while workflow-based agents lack specialized knowledge for autonomous creative planning. There's a need for models that can autonomously plan and execute complex visual creation tasks.
Method: Proposes VisionCreator with UTPC (Understanding, Thinking, Planning, Creation) framework. Uses VisGenData-4k dataset generated via metacognition-based VisionAgent, Progressive Specialization Training (PST), Virtual Reinforcement Learning (VRL) in simulated environment, and comprehensive VisGenBench evaluation benchmark.
Result: VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. The framework enables stable acquisition of UTPC capabilities for complex visual creation tasks.
Conclusion: VisionCreator provides a foundation for future research in visual-generation agentic systems by unifying understanding, thinking, planning, and creation capabilities within an end-to-end learnable framework.
Abstract: Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows-capabilities challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.
[150] ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT
Yong Eun Choi, Hyoung Suk Park, Kiwan Jeon, Hyun-Cheol Park, Sung Ho Kang
Main category: cs.CV
TL;DR: ReCo-Diff: A residual-conditioned diffusion framework for sparse-view CT reconstruction that uses observation residuals to guide sampling without heuristic interventions.
Details
Motivation: Existing diffusion models for sparse-view CT reconstruction rely on ad hoc sampling controls or fixed schedules that are sensitive to error accumulation and sampling instability.
Method: Proposes ReCo-Diff with residual-conditioned self-guided sampling that first produces an unconditioned baseline reconstruction, then conditions subsequent predictions on the observation residual between predicted image and measured sparse-view input.
Result: Outperforms existing cold diffusion sampling baselines with higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.
Conclusion: Residual-conditioned diffusion provides continuous, measurement-aware correction while preserving deterministic sampling, offering a more stable approach for sparse-view CT reconstruction.
Abstract: Cold and generalized diffusion models have recently shown strong potential for sparse-view CT reconstruction by explicitly modeling deterministic degradation processes. However, existing sampling strategies often rely on ad hoc sampling controls or fixed schedules, which remain sensitive to error accumulation and sampling instability. We propose ReCo-Diff, a residual-conditioned diffusion framework that leverages observation residuals through residual-conditioned self-guided sampling. At each sampling step, ReCo-Diff first produces a null (unconditioned) baseline reconstruction and then conditions subsequent predictions on the observation residual between the predicted image and the measured sparse-view input. This residual-driven guidance provides continuous, measurement-aware correction while preserving a deterministic sampling schedule, without requiring heuristic interventions. Experimental results demonstrate that ReCo-Diff consistently outperforms existing cold diffusion sampling baselines, achieving higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.
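One sampling step of the residual-conditioned idea can be sketched as: denoise to a null baseline, measure the data residual under the forward (sparse-view projection) operator, and fold a correction back in. The function names, the linear operator A, and the back-projection correction are illustrative assumptions, not ReCo-Diff's exact conditioning:

```python
import numpy as np

def reco_step(x_t, denoise, A, y, lam=0.5):
    """Hypothetical residual-conditioned step: null baseline reconstruction,
    observation residual against the measured sparse-view data y, then a
    residual-driven correction (here a simple back-projection)."""
    x0 = denoise(x_t)                      # null (unconditioned) baseline
    residual = y - A @ x0                  # observation residual
    return x0 + lam * (A.T @ residual)     # measurement-aware correction
```

Because the correction vanishes when the baseline already matches the measurements, the step is a no-op on consistent reconstructions, which is what keeps the deterministic schedule stable.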
[151] FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
Aro Kim, Myeongjin Jang, Chaewon Moon, Youngjin Shin, Jinwoo Jeong, Sang-hyo Park
Main category: cs.CV
TL;DR: FiDeSR is a one-step diffusion framework for real-world image super-resolution that preserves fine details and ensures high-fidelity reconstruction through detail-aware weighting and adaptive enhancers.
Details
Motivation: Existing diffusion-based image super-resolution methods struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality.
Method: Proposes FiDeSR with: 1) detail-aware weighting strategy during training that adaptively emphasizes regions with higher prediction errors, 2) low- and high-frequency adaptive enhancers during inference for flexible enhancement control without retraining, and 3) residual-in-residual noise refinement to correct prediction errors in diffusion noise and enhance detail recovery.
Result: FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration.
Conclusion: FiDeSR presents an effective one-step diffusion framework for high-fidelity, detail-preserving image super-resolution with flexible enhancement control and improved reconstruction accuracy.
Abstract: Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: https://github.com/Ar0Kim/FiDeSR.
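The detail-aware weighting strategy, up-weighting regions where the current prediction error is high, can be sketched as an error-proportional pixel-weight map. The normalization and alpha below are assumptions; the paper's exact weighting is not given in the summary:

```python
import numpy as np

def detail_aware_weights(pred, target, alpha=1.0):
    """Hypothetical detail-aware weighting: pixels with larger current
    prediction error get larger loss weights, focusing training on hard
    fine-detail regions. Normalized so the average weight stays 1."""
    err = np.abs(pred - target)
    w = 1.0 + alpha * err / (err.mean() + 1e-8)
    return w / w.mean()
```

The resulting map would multiply the per-pixel reconstruction loss, so flat, already-correct regions stop dominating the gradient.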
[152] ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling
Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan
Main category: cs.CV
TL;DR: ShareVerse is a video generation framework for multi-agent shared world modeling that integrates large video models with CARLA simulation data, spatial concatenation of multi-view videos, and cross-agent attention mechanisms for consistent shared world generation.
Details
Motivation: Existing video generation works lack support for unified shared world construction with multi-agent interaction. The paper aims to address this gap by enabling consistent world modeling across multiple agents with overlapping and non-overlapping regions.
Method: 1) Built large-scale multi-agent interactive world modeling dataset on CARLA simulation platform with diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos. 2) Proposed spatial concatenation strategy for four-view videos of independent agents to model broader environments and ensure multi-view geometric consistency. 3) Integrated cross-agent attention blocks into pretrained video model to enable interactive transmission of spatial-temporal information across agents.
Result: ShareVerse supports 49-frame large-scale video generation, accurately perceives positions of dynamic agents, and achieves consistent shared world modeling with proper handling of overlapping and non-overlapping regions.
Conclusion: ShareVerse successfully enables multi-agent shared world modeling in video generation, addressing the limitations of existing approaches and demonstrating effective integration of simulation data, spatial concatenation strategies, and cross-agent attention mechanisms.
Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.
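The spatial concatenation strategy amounts to tiling one agent's four view frames into a single canvas so the video model processes them jointly. The 2x2 layout below is an assumption for illustration; the summary does not specify the tiling order:

```python
import numpy as np

def concat_four_views(front, rear, left, right):
    """Hypothetical spatial concatenation: tile an agent's four views into
    one 2x2 frame so a single video model attends over all views at once,
    which is what makes cross-view geometric consistency enforceable."""
    top = np.concatenate([front, rear], axis=1)
    bottom = np.concatenate([left, right], axis=1)
    return np.concatenate([top, bottom], axis=0)
```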
[153] Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model
Yuhang Liu, Yueyang Cang, Wenge Que, Xinru Bai, Xingtong Wang, Kuisheng Chen, Jingya Li, Xiaoteng Zhang, Xinmin Li, Lixia Zhang, Pingge Hu, Qiaoting Xie, Peiyu Xu, Xianxu Zeng, Li Shi
Main category: cs.CV
TL;DR: GTDoctor is an AI system for diagnosing gestational trophoblastic disease from pathology slides, performing lesion segmentation and providing diagnostic conclusions with clinical interpretability.
Details
Motivation: Current GTD diagnosis is time-consuming, experience-dependent, and has low consistency, threatening maternal health. There's a need for automated, accurate diagnostic tools.
Method: Developed GTDoctor model for pixel-based lesion segmentation on pathological slides, with diagnostic conclusions and personalized analysis. Built GTDiagnosis software system and conducted clinical trials.
Result: Achieved mean precision >0.91 for lesion detection (n=679 slides). In prospective studies, pathologists using GTDiagnosis attained 95.59% PPV (n=68 patients). Reduced diagnostic time from 56 to 16 seconds per case (n=285 patients).
Conclusion: GTDoctor and GTDiagnosis provide a novel solution for GTD pathological diagnosis, enhancing performance and efficiency while maintaining clinical interpretability.
Abstract: The pathological diagnosis of gestational trophoblastic disease(GTD) takes a long time, relies heavily on the experience of pathologists, and the consistency of initial diagnosis is low, which seriously threatens maternal health and reproductive outcomes. We developed an expert model for GTD pathological diagnosis, named GTDoctor. GTDoctor can perform pixel-based lesion segmentation on pathological slides, and output diagnostic conclusions and personalized pathological analysis results. We developed a software system, GTDiagnosis, based on this technology and conducted clinical trials. The retrospective results demonstrated that GTDiagnosis achieved a mean precision of over 0.91 for lesion detection in pathological slides (n=679 slides). In prospective studies, pathologists using GTDiagnosis attained a Positive Predictive Value of 95.59% (n=68 patients). The tool reduced average diagnostic time from 56 to 16 seconds per case (n=285 patients). GTDoctor and GTDiagnosis offer a novel solution for GTD pathological diagnosis, enhancing diagnostic performance and efficiency while maintaining clinical interpretability.
[154] MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration
Lingshun Kong, Jiawei Zhang, Zhengpeng Duan, Xiaohe Wu, Yueqi Yang, Xiaotao Wang, Dongqing Zou, Lei Lei, Jinshan Pan
Main category: cs.CV
TL;DR: A unified image restoration framework using dual-level Mixture-of-Experts with pretrained diffusion model for handling diverse degradation types like haze, blur, noise, and low-light.
Details
Motivation: Different image degradation types (haze, blur, noise, low-light) require diverse restoration strategies, making it difficult for a single model to handle all effectively. Current approaches struggle with unified restoration across multiple degradation types.
Method: Proposes a unified framework with dual-level Mixture-of-Experts (MoE) architecture integrated with pretrained diffusion model. Inter-MoE layer adaptively combines expert groups for major degradation types, while Intra-MoE layer selects specialized sub-experts for fine-grained variations within each type.
Result: Extensive experiments show the method performs favorably against state-of-the-art approaches on multiple image restoration tasks, achieving both coarse-grained adaptation across degradation categories and fine-grained modulation for specific intra-class variations.
Conclusion: The dual-level MoE architecture with pretrained diffusion model provides an effective unified solution for all-in-one image restoration, handling complex real-world corruptions through specialized expert selection at both inter and intra degradation levels.
Abstract: All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against the state-of-the-art approaches on multiple image restoration tasks.
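The two-level routing can be illustrated with a toy sketch: an Inter gate weights expert groups (one per major degradation type) and an Intra gate weights sub-experts within each group. All names and dimensions are hypothetical, and the experts here are plain linear maps rather than the paper's diffusion-transformer components.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class DualLevelMoE:
    """Toy dual-level MoE: an Inter gate picks among expert groups
    (one per major degradation type), an Intra gate picks sub-experts
    inside each group. Experts are plain linear maps for illustration."""
    def __init__(self, dim, n_groups=4, n_sub=3):
        self.inter_gate = rng.standard_normal((dim, n_groups)) * 0.1
        self.intra_gates = rng.standard_normal((n_groups, dim, n_sub)) * 0.1
        # experts[g, s]: weight matrix of sub-expert s in group g
        self.experts = rng.standard_normal((n_groups, n_sub, dim, dim)) * 0.1

    def __call__(self, x):                          # x: (dim,)
        g_w = softmax(x @ self.inter_gate)          # weights over groups
        out = np.zeros_like(x)
        for g in range(len(g_w)):
            s_w = softmax(x @ self.intra_gates[g])  # weights over sub-experts
            group_out = sum(s_w[s] * (self.experts[g, s] @ x)
                            for s in range(len(s_w)))
            out += g_w[g] * group_out
        return out

moe = DualLevelMoE(dim=8)
y = moe(rng.standard_normal(8))
print(y.shape)  # (8,)
```

In the full model the gates would be conditioned on the degraded input, so routing is coarse at the group level and fine at the sub-expert level.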
[155] TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework
Ting-Wei Zhou, Xi-Le Zhao, Sheng Liu, Wei-Hao Wu, Yu-Bang Zheng, Deyu Meng
Main category: cs.CV
TL;DR: TenExp is a mixture-of-experts framework for tensor decomposition structure search that dynamically selects suitable tensor decompositions in an unsupervised manner, going beyond fixed factor-interaction families and enabling mixtures of decompositions.
Details
Motivation: Current tensor decomposition methods are limited by fixed factor-interaction families and cannot deliver mixtures of decompositions, making it challenging to capture the complex low-rank structures in data.
Method: Proposes TenExp, a mixture-of-experts-based tensor decomposition structure search framework that dynamically selects and activates suitable tensor decompositions in an unsupervised fashion.
Result: Extensive experiments on synthetic and realistic datasets demonstrate TenExp’s superiority over state-of-the-art tensor decomposition methods, with theoretical approximation error bounds provided.
Conclusion: TenExp advances tensor decomposition structure search by enabling both single decompositions beyond fixed families and mixtures of decompositions, with proven theoretical guarantees and empirical effectiveness.
Abstract: Recently, tensor decompositions continue to emerge and receive increasing attention. Selecting a suitable tensor decomposition to exactly capture the low-rank structures behind the data is at the heart of the tensor decomposition field, which remains a challenging and relatively under-explored problem. Current tensor decomposition structure search methods are still confined by a fixed factor-interaction family (e.g., tensor contraction) and cannot deliver the mixture of decompositions. To address this problem, we elaborately design a mixture-of-experts-based tensor decomposition structure search framework (termed as TenExp), which allows us to dynamically select and activate suitable tensor decompositions in an unsupervised fashion. This framework enjoys two unique advantages over the state-of-the-art tensor decomposition structure search methods. Firstly, TenExp can provide a suitable single decomposition beyond a fixed factor-interaction family. Secondly, TenExp can deliver a suitable mixture of decompositions beyond a single decomposition. Theoretically, we also provide the approximation error bound of TenExp, which reveals the approximation capability of TenExp. Extensive experiments on both synthetic and realistic datasets demonstrate the superiority of the proposed TenExp compared to the state-of-the-art tensor decomposition-based methods.
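The core idea of a gated mixture of decompositions can be shown in a minimal numpy sketch. Here the two candidate "experts" are a CP reconstruction and a Tucker-style reconstruction, both fixed for illustration; TenExp itself learns the gate and the factors jointly, which this toy does not do.

```python
import numpy as np
rng = np.random.default_rng(1)

# Target 3-way tensor with a rank-1 CP structure.
a, b, c = rng.standard_normal((3, 4))
T = np.einsum('i,j,k->ijk', a, b, c)

# Two candidate decomposition "experts": a CP reconstruction that
# matches T exactly, and a random Tucker-style reconstruction.
cp_recon = np.einsum('i,j,k->ijk', a, b, c)
core = rng.standard_normal((2, 2, 2))
U = [rng.standard_normal((4, 2)) for _ in range(3)]
tucker_recon = np.einsum('abc,ia,jb,kc->ijk', core, U[0], U[1], U[2])

def mixture_error(logits):
    w = np.exp(logits) / np.exp(logits).sum()       # gate weights
    recon = w[0] * cp_recon + w[1] * tucker_recon   # mixture of decompositions
    return np.linalg.norm(T - recon)

# A gate that favours the CP expert fits this tensor far better.
print(mixture_error(np.array([5.0, -5.0])) < mixture_error(np.array([-5.0, 5.0])))  # True
```

The unsupervised search then amounts to optimizing the gate logits (and factors) against reconstruction error, so unhelpful decompositions are down-weighted automatically.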
[156] Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement
Hongying Zhang, ShuaiShuai Ma
Main category: cs.CV
TL;DR: SFDE is a cross-view geo-localization network that combines spatial and frequency domain features to handle viewpoint variations and texture inconsistencies between ground and aerial images.
Details
Motivation: Cross-view geo-localization is challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and degradation of discriminative local information. Existing spatial domain methods are sensitive to large viewpoint variations and local disturbances.
Method: Proposes Spatial and Frequency Domain Enhancement Network (SFDE) with three parallel branches: global semantic context modeling, local geometric structure modeling, and frequency domain statistical stability modeling. Features are jointly optimized via progressive enhancement and coupled constraints in a unified embedding space.
Result: SFDE achieves competitive performance and in many cases surpasses state-of-the-art methods while maintaining a lightweight and computationally efficient design.
Conclusion: Leveraging complementary representations from spatial and frequency domains enables learning cross-view representations with consistency across multiple granularities, addressing limitations of purely spatial domain approaches.
Abstract: Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints and constitutes a fundamental technique for visual localization in GNSS-denied environments. Nevertheless, CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. Existing methods predominantly rely on spatial domain feature alignment, which is inherently sensitive to large scale viewpoint variations and local disturbances. To alleviate these limitations, this paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from spatial and frequency domains. SFDE adopts a three branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, thereby characterizing consistency across domains from the perspectives of scene topology, multiscale structural patterns, and frequency invariance. The resulting complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, enabling the learning of cross-view representations with consistency across multiple granularities. Comprehensive experiments show that SFDE achieves competitive performance and in many cases even surpasses state-of-the-art methods, while maintaining a lightweight and computationally efficient design. Our code is available at https://github.com/Mashuaishuai669/SFDE.
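A small numpy example illustrates why a frequency-domain statistic can be more stable than spatial features under local disturbances, which is the intuition behind SFDE's frequency branch (the example uses a circular shift as the disturbance; this is an illustration, not the paper's exact formulation).

```python
import numpy as np
rng = np.random.default_rng(2)

img = rng.standard_normal((32, 32))
shifted = np.roll(img, shift=(5, 7), axis=(0, 1))  # a spatial disturbance

# Spatial features (raw pixels) change a lot under the shift...
spatial_diff = np.linalg.norm(img - shifted)

# ...but the amplitude spectrum, a simple frequency-domain statistic,
# is invariant to circular shifts (Fourier shift theorem) — the kind
# of statistical stability a frequency branch can exploit.
amp = np.abs(np.fft.fft2(img))
amp_shifted = np.abs(np.fft.fft2(shifted))
freq_diff = np.linalg.norm(amp - amp_shifted)

print(spatial_diff > 1.0, freq_diff < 1e-8)  # True True
```

Real viewpoint changes are of course more than shifts, which is why SFDE combines the frequency branch with global-semantic and local-geometric branches rather than relying on it alone.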
[157] Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
Hongbo Zheng, Afshin Bozorgpour, Dorit Merhof, Minjia Zhang
Main category: cs.CV
TL;DR: PVT-GDLA introduces a decoder-centric transformer with Gated Differential Linear Attention for efficient, high-fidelity medical image segmentation that preserves fine anatomical boundaries while maintaining linear computational complexity.
Details
Motivation: Medical image segmentation needs models that preserve fine anatomical boundaries while being efficient for clinical deployment. Transformers capture long-range dependencies but have quadratic attention costs and large data requirements, while CNNs are compute-friendly but struggle with global reasoning. Linear attention offers O(N) scaling but suffers from training instability and attention dilution.
Method: PVT-GDLA is a decoder-centric Transformer with Gated Differential Linear Attention (GDLA). GDLA computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable channel-wise scale to cancel common-mode noise. A lightweight head-specific gate injects nonlinearity and input-adaptive sparsity to mitigate attention sink. A parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions while maintaining O(N) complexity.
Result: PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines.
Conclusion: PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings by combining efficient linear attention with boundary-preserving capabilities.
Abstract: Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
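The differential-attention mechanism can be sketched in a few lines: two kernelized linear-attention paths on split query/key subspaces, subtracted with a channel-wise scale, then gated. Shapes, the ELU+1 feature map, and the gate parameterization are assumptions for illustration, not the paper's exact design (e.g., the depthwise local branch is omitted).

```python
import numpy as np
rng = np.random.default_rng(3)

def phi(x):                      # ELU+1 feature map keeps features positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def gdla(Q, K, V, lam, gate_w):
    """Differential linear attention sketch: split Q/K into two halves,
    run a kernelized linear-attention path on each, and subtract the
    second path with a learnable channel-wise scale lam. A sigmoid gate
    adds input-adaptive sparsity. All ops are O(N) in sequence length."""
    d = Q.shape[1] // 2
    def linear_attn(q, k):
        kv = phi(k).T @ V                          # (d, dv): O(N) summary
        z = phi(q) @ phi(k).sum(axis=0)            # per-token normalizer
        return (phi(q) @ kv) / (z[:, None] + 1e-6)
    path1 = linear_attn(Q[:, :d], K[:, :d])        # complementary subspaces
    path2 = linear_attn(Q[:, d:], K[:, d:])
    out = path1 - lam * path2                      # cancel common-mode noise
    gate = 1.0 / (1.0 + np.exp(-(Q @ gate_w)))     # sigmoid gate, (N, dv)
    return gate * out

N, dk, dv = 16, 8, 8
out = gdla(rng.standard_normal((N, dk)), rng.standard_normal((N, dk)),
           rng.standard_normal((N, dv)), lam=0.5 * np.ones(dv),
           gate_w=rng.standard_normal((dk, dv)))
print(out.shape)  # (16, 8)
```

Note that `kv` is computed before multiplying by the queries, which is what keeps the cost linear in the number of tokens instead of quadratic.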
[158] CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model
Waqas Ahmed, Dean Diepeveen, Ferdous Sohel
Main category: cs.CV
TL;DR: A method for generating physically plausible shadows for multiple inserted objects in image compositing using multimodal diffusion models with image and text pathways.
Details
Motivation: Existing shadow generation methods focus on single-object insertion and fail to generalize to multiple foreground objects, while real-world compositing pipelines often insert multiple objects simultaneously requiring jointly consistent shadows.
Method: Uses pre-trained text-to-image diffusion model with two pathways: image pathway injects multi-scale features for spatial guidance, text pathway encodes per-object shadow bounding boxes as learned positional tokens fused via cross-attention with attention-alignment loss.
Result: Achieves state-of-the-art performance in both single and multi-object shadow generation settings, demonstrated on augmented DESOBAv2 dataset with composite scenes.
Conclusion: The method effectively addresses multi-object shadow generation by leveraging multimodal diffusion capabilities and achieves physically plausible shadow synthesis for multiple inserted objects.
Abstract: Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.
[159] iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He
Main category: cs.CV
TL;DR: iGVLM introduces instruction-guided visual modulation with dual-branch architecture to address representation bottleneck in LVLMs, enabling task-specific visual reasoning while preserving pre-trained visual priors.
Details
Motivation: Current LVLMs suffer from representation bottleneck: they use static, instruction-agnostic vision encoders with invariant visual representations across different textual tasks, hindering fine-grained reasoning where task-specific visual cues are critical.
Method: Proposes iGVLM with decoupled dual-branch architecture: frozen representation branch preserves task-agnostic visual representations, and dynamic conditioning branch performs affine feature modulation via Adaptive Layer Normalization (AdaLN). Also introduces MM4 diagnostic probe for quantifying logical consistency.
Result: iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering plug-and-play paradigm for bridging passive perception and active reasoning. Shows improvements on standard benchmarks and new MM4 diagnostic probe.
Conclusion: iGVLM addresses representation bottleneck in LVLMs through instruction-guided visual modulation, enabling smooth transition from general-purpose perception to instruction-aware reasoning while maintaining structural integrity of pre-trained visual priors.
Abstract: Despite the success of Large Vision–Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
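AdaLN-style modulation is simple enough to sketch: the instruction embedding predicts a per-channel scale and shift that are applied to normalized visual features. The pooling, dimensions, and single modulation matrix `W` here are illustrative assumptions, not iGVLM's exact conditioning network.

```python
import numpy as np
rng = np.random.default_rng(4)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_modulate(visual, instr, W):
    """AdaLN sketch: the instruction embedding predicts per-channel
    scale (gamma) and shift (beta) that modulate normalized visual
    features; a frozen branch would keep the original task-agnostic
    features untouched alongside this conditioned path."""
    gamma_beta = instr @ W                 # (2*dim,) from the instruction
    dim = visual.shape[-1]
    gamma, beta = gamma_beta[:dim], gamma_beta[dim:]
    return (1.0 + gamma) * layer_norm(visual) + beta

dim = 8
visual = rng.standard_normal((5, dim))     # 5 visual tokens
instr = rng.standard_normal(dim)           # pooled instruction embedding
W = rng.standard_normal((dim, 2 * dim)) * 0.1
modulated = adaln_modulate(visual, instr, W)
print(modulated.shape)  # (5, 8)
```

The `1.0 + gamma` form means a zero-initialized `W` leaves features unmodulated, which is a common way to start such conditioning from the pre-trained behavior.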
[160] Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing
Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du
Main category: cs.CV
TL;DR: RADAR is a training-free inference method that uses intrinsic attention in MLLMs to guide progressive localization and fine-grained reasoning, reducing hallucinations in remote sensing visual question answering.
Details
Motivation: MLLMs suffer from pronounced hallucinations in remote sensing VQA due to visual grounding failures in large-scale scenes and misinterpretation of fine-grained small targets, requiring systematic analysis and mitigation.
Method: Proposes RADAR (Relative Attention-Driven Actively Reasoning), a training-free inference method that leverages MLLMs’ intrinsic attention to guide progressive localization and fine-grained local reasoning at test time, along with RSHBench benchmark for diagnosis.
Result: Extensive experiments show RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations across diverse MLLMs.
Conclusion: RADAR effectively addresses MLLM hallucinations in remote sensing VQA through attention-guided progressive localization and reasoning, with the RSHBench benchmark enabling systematic diagnosis of hallucination issues.
Abstract: Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR
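One plausible reading of attention-guided localization is: compare the question-conditioned attention over image patches against a baseline, then zoom into the most question-relevant region for a second pass. The sketch below is a hypothetical interpretation on toy attention maps, not RADAR's published procedure.

```python
import numpy as np
rng = np.random.default_rng(5)

def relative_attention_roi(attn_q, attn_base, grid=(8, 8), top_k=4):
    """Hedged sketch of attention-guided localization: divide the
    question-conditioned attention over image patches by a baseline
    (e.g. attention under a generic prompt), then keep the top-k
    patches as the region to crop and re-query for local reasoning."""
    rel = attn_q / (attn_base + 1e-6)        # relative attention map
    idx = np.argsort(rel.ravel())[-top_k:]   # most question-relevant patches
    rows, cols = np.unravel_index(idx, grid)
    # Bounding box (in patch coordinates) to crop and re-query the model.
    return rows.min(), rows.max(), cols.min(), cols.max()

attn_q = rng.random((8, 8))
attn_q[2:4, 5:7] += 5.0                      # a small salient target
attn_base = rng.random((8, 8))
roi = relative_attention_roi(attn_q, attn_base)
print(roi)
```

In a real pipeline the crop would be upsampled and fed back to the MLLM, making perception progressive rather than single-shot.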
[161] ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He
Main category: cs.CV
TL;DR: ITO framework improves image-text contrastive pretraining through multimodal multiple alignment and training-time fusion to eliminate modality gap while maintaining inference efficiency.
Details
Motivation: Existing image-text contrastive pretraining methods often produce representations that remain partially organized by modality, creating a modality gap that limits cross-modal understanding and alignment.
Method: Two synergistic mechanisms: 1) Multimodal multiple alignment that mines diverse image-text correspondences for richer supervision, and 2) A lightweight training-time multimodal fusion module that enforces structured cross-modal interaction (discarded at inference to maintain efficiency).
Result: ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Analysis shows multiple alignment drives discriminative power while training-time fusion acts as structural regularizer, eliminating modality gap and stabilizing training dynamics.
Conclusion: The proposed ITO framework effectively addresses modality gap in contrastive learning through synergistic alignment and fusion mechanisms, achieving superior performance while maintaining inference efficiency of dual-encoder architectures.
Abstract: Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer – eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
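The "train with fusion, infer without it" pattern can be sketched directly: the inference path is a plain dual-encoder similarity, and the fusion module only contributes an extra term during training. The fusion parameterization and its placeholder objective below are assumptions, not ITO's actual loss.

```python
import numpy as np
rng = np.random.default_rng(6)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def forward(img_feats, txt_feats, fusion_W=None, training=False):
    """Sketch of training-time fusion: during training an extra fusion
    module mixes the two modalities and contributes a loss term; at
    inference it is dropped, leaving a plain dual-encoder similarity."""
    img, txt = normalize(img_feats), normalize(txt_feats)
    sim = img @ txt.T                              # contrastive logits (always)
    if training and fusion_W is not None:
        fused = np.tanh(np.concatenate([img, txt], axis=-1) @ fusion_W)
        fusion_loss = float((fused ** 2).mean())   # placeholder objective
        return sim, fusion_loss
    return sim                                     # inference: fusion discarded

B, d = 4, 8
img, txt = rng.standard_normal((B, d)), rng.standard_normal((B, d))
W = rng.standard_normal((2 * d, d)) * 0.1
sim_train, loss = forward(img, txt, W, training=True)
sim_infer = forward(img, txt)
print(sim_train.shape, np.allclose(sim_train, sim_infer))  # (4, 4) True
```

Because the fusion module never touches the inference path, its effect survives only through how it shaped the encoders during training, which is why the paper describes it as a structural regularizer.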
[162] HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning
Zihao Peng, Nan Zou, Jiandian Zeng, Guo Li, Ke Chen, Boyuan Li, Tian Wang
Main category: cs.CV
TL;DR: HiLoRA: Hierarchical LoRA framework for federated learning with Vision Transformers that uses three adapter levels (root, cluster, leaf) to capture global, subgroup, and client-specific knowledge, improving personalization and generalization.
Details
Motivation: Existing LoRA-based federated tuning methods for Vision Transformers overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients.
Method: Proposes HiLoRA with adapters at three hierarchical levels: root (global knowledge), cluster (subgroup knowledge), and leaf (client-specific knowledge). Uses cross-tier orthogonality and cascaded optimization to separate update subspaces. Introduces LoRA-Subspace Adaptive Clustering to infer latent client groups via subspace similarity analysis.
Result: Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization compared to existing methods.
Conclusion: HiLoRA effectively addresses the limitations of existing LoRA-based federated tuning methods by capturing hierarchical knowledge structures, enabling better knowledge sharing across structurally aligned clients while maintaining personalization.
Abstract: Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA’s design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
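The three-tier composition is easy to sketch: each client's effective weight is the frozen base plus low-rank updates from root, cluster, and leaf adapters. Dimensions and initialization are illustrative; the orthogonality constraints and cascaded optimization are omitted here.

```python
import numpy as np
rng = np.random.default_rng(7)

def lora_delta(A, B):
    return B @ A                               # low-rank update: B (d,r) @ A (r,d)

def hilora_weight(W0, tiers):
    """HiLoRA sketch: the effective weight is the frozen base plus
    low-rank updates from three tiers — root (global), cluster
    (subgroup), leaf (client-specific) — summed in cascade."""
    W = W0.copy()
    for A, B in tiers:                         # root -> cluster -> leaf
        W += lora_delta(A, B)
    return W

d, r = 8, 2
W0 = rng.standard_normal((d, d))
make = lambda: (rng.standard_normal((r, d)) * 0.1,
                rng.standard_normal((d, r)) * 0.1)
root, cluster, leaf = make(), make(), make()

# Two clients in the same cluster share root + cluster but differ at leaf.
W_client = hilora_weight(W0, [root, cluster, leaf])
print(W_client.shape)  # (8, 8)
```

In federated training only the adapter tiers move over the network; the root tier aggregates across all clients, the cluster tier across inferred subgroups, and the leaf tier stays local.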
[163] Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language
Michelle Stegeman, Lena Philipp, Fennie van der Graaf, Marina D’Amato, Clément Grisi, Luc Builtjes, Joeran S. Bosma, Judith Lefkes, Rianne A. Weber, James A. Meakin, Thomas Koopman, Anne Mickan, Mathias Prokop, Ewoud J. Smit, Geert Litjens, Jeroen van der Laak, Bram van Ginneken, Maarten de Rooij, Henkjan Huisman, Colin Jacobs, Francesco Ciompi, Alessa Hering
Main category: cs.CV
TL;DR: UNICORN is a public benchmark for evaluating medical foundation models across diverse tasks, modalities, and anatomical regions with standardized few-shot adaptation and sequestered test sets.
Details
Motivation: Current medical foundation model evaluation lacks standardized, reproducible frameworks. Existing benchmarks are fragmented across tasks, organs, or modalities, limiting assessment of cross-task generalization capabilities.
Method: Two-step framework decoupling model inference from task-specific evaluation using standardized few-shot adaptation. Built sequestered test sets from clinically relevant cohorts across 17 institutions, with standardized evaluation code and submission interface.
Result: UNICORN includes data from 2,400+ patients, 3,700+ vision cases, 2,400+ clinical reports across 8 countries, spanning 8 anatomical regions and 4 imaging modalities. Introduces UNICORN Score for aggregated performance comparison.
Conclusion: UNICORN establishes foundation for reproducible benchmarking of medical foundation models by standardizing multi-task, multi-modality assessment with public data, baselines, and evaluation platform.
Abstract: Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.
[164] VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng
Main category: cs.CV
TL;DR: VSearcher transforms static multimodal models into multimodal search agents capable of long-horizon, multi-turn tool use in web environments through reinforcement learning.
Details
Motivation: Current text-only LLMs are limited to a single modality, while multimodal models lack the ability to access up-to-date web information. There's a need for multimodal agents that can actively search and use tools in real-world web environments.
Method: Proposes VSearcher with Iterative Injection Data Synthesis pipeline to generate complex multimodal QA questions, filtered with comprehensive metrics. Uses SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling (text search, image search, web browsing). Also introduces MM-SearchExam benchmark for evaluation.
Result: VSearcher achieves superior performance compared to recent multimodal search agents and surpasses several proprietary models on multimodal web search tasks. The proposed benchmark proves highly challenging for recent proprietary models.
Conclusion: VSearcher successfully transforms static multimodal models into capable multimodal search agents that can perform long-horizon, multi-turn tool use in real-world web environments, addressing limitations of current multimodal models.
Abstract: Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, which turns a static multimodal model into a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing, via reinforcement learning. Specifically, we introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments. Besides, we propose a multimodal search benchmark MM-SearchExam dedicated to evaluating search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks reveal the effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.
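The multi-turn tool loop underlying such agents can be sketched generically: the policy emits either a tool call or a final answer, and tool observations are appended to the trajectory until the turn budget runs out. `model_step`, the tool names, and the trajectory format below are hypothetical, not VSearcher's API.

```python
def run_agent(model_step, tools, question, max_turns=8):
    """Hedged sketch of a multi-turn tool loop: the policy emits either
    a tool call (e.g. text_search / image_search / browse) or a final
    answer; tool observations are appended to the trajectory until the
    agent answers or the turn budget is exhausted."""
    trajectory = [("question", question)]
    for _ in range(max_turns):
        action, arg = model_step(trajectory)        # policy decision
        if action == "answer":
            return arg, trajectory
        observation = tools[action](arg)            # execute the tool
        trajectory.append((action, arg))
        trajectory.append(("observation", observation))
    return None, trajectory                         # budget exhausted

# Toy policy: search once, then answer with whatever it observed.
def toy_policy(traj):
    if traj[-1][0] == "observation":
        return "answer", traj[-1][1]
    return "text_search", traj[0][1]

tools = {"text_search": lambda q: f"snippet about {q}"}
answer, traj = run_agent(toy_policy, tools, "capital of France")
print(answer)  # snippet about capital of France
```

In the RL setting the trajectory becomes the rollout on which the reward (answer correctness, plus any format or efficiency terms) is computed, with `model_step` played by the multimodal policy.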
[165] R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild
Margherita Lea Corona, Wieland Morgenstern, Peter Eisert, Anna Hilsmann
Main category: cs.CV
TL;DR: R3GW extends 3D Gaussian Splatting to learn relightable outdoor scene representations from unconstrained photo collections, separating foreground and sky with distinct Gaussian sets for physically-based relighting.
Details
Motivation: 3D Gaussian Splatting excels at static scene reconstruction but lacks illumination modeling, making it unsuitable for relighting tasks and struggling with scenes captured under varying lighting conditions in the wild.
Method: Separates scene into relightable foreground and non-reflective background (sky) using two distinct Gaussian sets. Combines Physically Based Rendering with 3DGS representation to model view-dependent lighting effects in varying illumination settings.
Result: Achieves state-of-the-art performance on NeRF-OSR dataset, synthesizes photorealistic novel views under arbitrary illumination conditions, and mitigates depth reconstruction artifacts at sky-foreground boundaries.
Conclusion: R3GW enables physically-based relighting of unconstrained outdoor scenes captured in the wild, extending 3DGS capabilities to handle varying illumination conditions while improving rendering quality.
Abstract: 3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary.
[166] NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si
Main category: cs.CV
TL;DR: NOVA is a framework for unpaired video editing that uses sparse keyframe guidance and dense synthesis with degradation-simulation training to achieve high fidelity and temporal consistency without requiring paired datasets.
Details
Motivation: Current video editing models require large-scale paired datasets which are difficult to collect, especially for local video editing. Existing unpaired approaches struggle with background and temporal consistency when transferring image editing techniques to video.
Method: NOVA uses a two-branch framework: sparse branch provides semantic guidance through user-edited keyframes, dense branch incorporates motion and texture from original video. Includes degradation-simulation training strategy that learns motion reconstruction and temporal consistency from artificially degraded videos without paired data.
Result: Extensive experiments show NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
Conclusion: NOVA provides an effective solution for unpaired video editing that maintains high fidelity and temporal consistency through sparse control and dense synthesis, eliminating the need for hard-to-collect paired datasets.
Abstract: Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
[167] Structure-Aware Text Recognition for Ancient Greek Critical Editions
Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice
Main category: cs.CV
TL;DR: VLMs struggle with complex layout semantics in historical Greek critical editions; synthetic dataset and benchmark created; Qwen3VL-8B achieves SOTA performance with 1.0% CER on real scans.
Details
Motivation: Current visual language models have limited ability to interpret complex layout semantics in historical scholarly texts, particularly Ancient Greek critical editions with dense reference hierarchies and extensive marginal annotations.
Method: Created two novel resources: (1) large-scale synthetic corpus of 185,000 page images from TEI/XML sources with controlled typographic/layout variation, and (2) curated benchmark of real scanned editions spanning over a century. Evaluated three state-of-the-art VLMs under zero-shot and fine-tuning regimes.
Result: VLMs show substantial limitations with highly structured historical documents in zero-shot settings, underperforming compared to established software. However, Qwen3VL-8B achieves state-of-the-art performance with median Character Error Rate of 1.0% on real scans.
Conclusion: Results highlight both current shortcomings and future potential of VLMs for structure-aware recognition of complex scholarly documents, with Qwen3VL-8B showing promising performance.
Abstract: Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
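The headline metric above, Character Error Rate, is simply edit distance normalized by reference length; a minimal pure-Python sketch (function names are mine, not from the paper):

```python
def levenshtein(ref: str, hyp: str) -> int:
    # Classic DP edit distance counting insertions, deletions, substitutions.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # Character Error Rate: edit distance over reference length.
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

A median CER of 1.0% thus means that on the typical page, one character in a hundred of the reference transcription is wrong in the model output.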
[168] ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink
Douglass Wang
Main category: cs.CV
TL;DR: ScribeTokens is a novel tokenization method for digital ink that decomposes pen movements into unit pixel steps with a fixed 10-token vocabulary, enabling efficient representation and outperforming vector approaches on both generation and recognition tasks.
Details
Motivation: Digital ink lacks a unified representation - continuous vectors produce long sequences with training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition tasks.
Method: Proposes ScribeTokens that decomposes pen movement into unit pixel steps with two pen-state tokens, creating a fixed 10-token base vocabulary that enables aggressive BPE compression. Also introduces next-ink-token prediction as a self-supervised pretraining strategy.
Result: On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER). For recognition, it’s the only token representation to outperform vectors without pretraining. With pretraining, achieves best results across all representations (8.27% CER on IAM, 9.83% on DeepWriting).
Conclusion: ScribeTokens provides an effective token representation for digital ink that enables superior performance on both generation and recognition tasks, with pretraining further accelerating convergence and improving results.
Abstract: Digital ink – the coordinate stream captured from stylus or touch input – lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
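The stated 10-token base vocabulary is consistent with eight 8-connected unit moves plus two pen-state tokens; a hypothetical sketch under that assumption (the diagonal-first stepping rule is my simplification, not the paper's exact rasterization):

```python
# Hypothetical 10-token vocabulary: 8 unit moves (8-connected) + 2 pen states.
MOVES = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
MOVE_TOKEN = {d: i for i, d in enumerate(MOVES)}
PEN_DOWN, PEN_UP = 8, 9

def tokenize_stroke(points):
    """Turn an integer-coordinate stroke into unit-step tokens by walking
    from point to point one pixel at a time (diagonal moves first)."""
    tokens = [PEN_DOWN]
    x, y = points[0]
    for tx, ty in points[1:]:
        while (x, y) != (tx, ty):
            sx = (tx > x) - (tx < x)   # sign of remaining x offset
            sy = (ty > y) - (ty < y)   # sign of remaining y offset
            x, y = x + sx, y + sy
            tokens.append(MOVE_TOKEN[(sx, sy)])
    tokens.append(PEN_UP)
    return tokens
```

Because every ink sequence reduces to these ten symbols, standard BPE can then merge frequent step runs, which is what makes the aggressive compression in the paper possible.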
[169] BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu
Main category: cs.CV
TL;DR: BrandFusion: A multi-agent framework for automatically embedding advertiser brands into text-to-video generated content while preserving semantic fidelity to user prompts.
Details
Motivation: While text-to-video models have advanced rapidly, their commercial potential remains untapped. The paper addresses the challenge of seamlessly integrating brands into AI-generated videos without compromising user intent or video quality.
Method: BrandFusion uses a two-phase multi-agent framework: 1) Offline phase builds a Brand Knowledge Base by probing model priors and fine-tuning for novel brands; 2) Online phase uses five agents to iteratively refine user prompts, leveraging the knowledge base and real-time contextual tracking for brand visibility and semantic alignment.
Result: Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models show BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations confirm higher user satisfaction.
Conclusion: BrandFusion establishes a practical pathway for sustainable T2V monetization through seamless brand integration while maintaining semantic fidelity to user intent.
Abstract: The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.
[170] Toward Early Quality Assessment of Text-to-Image Diffusion Models
Huanlei Guo, Hongxin Wei, Bingyi Jing
Main category: cs.CV
TL;DR: Probe-Select: A plug-in module for text-to-image models that predicts final image quality from early denoiser activations, enabling early termination of unpromising seeds to reduce sampling costs by 60%.
Details
Motivation: Current text-to-image systems use inefficient "generate-then-select" pipelines that sample many seeds and keep only a few, requiring expensive denoising steps for each candidate. Post-hoc evaluation metrics like CLIPScore and ImageReward add computational overhead after generation is complete.
Method: Probe-Select exploits the observation that intermediate denoiser activations at early timesteps encode stable coarse structure, object layout, and spatial arrangement that strongly correlates with final image fidelity. It predicts final quality scores directly from these early activations, allowing unpromising seeds to be terminated early.
Result: Experiments across diffusion and flow-matching backbones show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This reduces sampling cost by over 60% while improving the quality of retained images.
Conclusion: Early structural signals in denoiser activations can effectively guide selective generation without altering the underlying generative model, making text-to-image generation more efficient while maintaining or improving output quality.
Abstract: Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure, object layout, and spatial arrangement that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
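The seed-ranking step can be pictured as a linear probe over pooled early-timestep activations; a sketch under the assumption of an offline-trained probe (the weights, pooling, and names here are illustrative, not the paper's implementation):

```python
import numpy as np

def probe_select(early_acts, probe_w, probe_b=0.0, keep=4):
    """Rank candidate seeds by a linear quality probe on pooled
    early-timestep denoiser activations (captured at ~20% of the
    trajectory) and return indices of the top-`keep` seeds to continue.
    probe_w / probe_b stand for a probe fit offline against a post-hoc
    metric such as ImageReward."""
    scores = early_acts @ probe_w + probe_b   # predicted final quality
    ranked = np.argsort(scores)[::-1]         # best first
    return ranked[:keep], scores
```

In a generate-then-select loop, all seeds would be denoised to the probe timestep, scored, and only the returned indices continued to full length, which is where the reported >60% sampling-cost reduction comes from.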
[171] Scale-invariant Gaussian derivative residual networks
Andrzej Perzanowski, Tony Lindeberg
Main category: cs.CV
TL;DR: GaussDerResNets are provably scale-invariant networks using Gaussian derivative residual blocks to address scale generalization challenges in deep networks.
Details
Motivation: Deep networks struggle with scale generalization, failing to handle images at scales not seen during training (out-of-distribution problem). The paper aims to create networks that can generalize across image scales.
Method: Construct scale-invariant Gaussian derivative residual networks (GaussDerResNets) using scale-covariant Gaussian derivative residual blocks with residual skip connections. Use depthwise-separable convolutions to reduce parameters and computations while maintaining scale generalization.
Result: GaussDerResNets demonstrate strong scale generalization and scale selection properties on rescaled versions of STL-10, Fashion-MNIST, and CIFAR-10 datasets. Networks maintain good scale generalization even with increased depth and accuracy.
Conclusion: The proposed GaussDerResNets provide a provably scale-invariant architecture that addresses scale generalization challenges in deep networks while maintaining computational efficiency through depthwise-separable convolutions.
Abstract: Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem. By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions. To analyse the ability of GaussDerResNets to generalise to new scales, we apply them on the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of Fashion-MNIST and CIFAR-10 datasets. Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all the three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions allows for decreasing both the number of parameters and the amount of computations, with reasonably maintained accuracy and scale generalisation.
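The building block of such layers is the sampled Gaussian derivative filter; a small sketch of how the separable 2D derivative kernels can be constructed (the sampling and truncation choices are mine):

```python
import numpy as np

def gaussian_derivative_kernels(sigma, radius=None):
    """Sampled 1D Gaussian and its first derivative; their outer products
    give the separable 2D Gaussian derivative filters (G_x, G_y).
    Varying sigma yields the scale-covariant family the paper builds on."""
    if radius is None:
        radius = int(np.ceil(4 * sigma))       # truncate at ~4 sigma
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()                               # normalized smoothing kernel
    dg = -x / sigma**2 * g                     # d/dx of the Gaussian
    gx = np.outer(g, dg)                       # 2D derivative along x
    gy = np.outer(dg, g)                       # 2D derivative along y
    return g, dg, gx, gy
```

Scale covariance follows because filtering a rescaled image at a proportionally rescaled sigma (with appropriate scale normalization) reproduces the original responses, which is the property the paper proves carries through the residual blocks.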
[172] Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis
Kaiqiang Xiong, Zhanke Wang, Ronggang Wang
Main category: cs.CV
TL;DR: Multimodal-prior-guided importance sampling for hierarchical 3D Gaussian Splatting improves sparse-view novel view synthesis by fusing photometric, semantic, and geometric cues to guide refinement.
Details
Motivation: Sparse-view novel view synthesis with 3D Gaussian Splatting suffers from overfitting to texture-induced errors and noise from pose/appearance inconsistencies. Existing methods lack robust mechanisms to determine where to add fine details in underconstrained sparse-view settings.
Method: Proposes multimodal-prior-guided importance sampling that fuses photometric rendering residuals, semantic priors, and geometric priors to estimate local recoverability. Uses coarse-to-fine Gaussian representation with stable coarse layer and selective addition of fine primitives where recoverable. Includes geometric-aware sampling and retention policy to concentrate refinement on critical regions while protecting new primitives in underconstrained areas.
Result: Achieves state-of-the-art reconstructions on diverse sparse-view benchmarks, with up to +0.3 dB PSNR improvement on DTU dataset. Effectively alleviates overfitting and suppresses noise from inconsistencies.
Conclusion: Multimodal prior fusion provides robust guidance for hierarchical 3DGS refinement in sparse-view settings, outperforming methods relying solely on photometric residuals by prioritizing regions with consistent multimodal evidence.
Abstract: We present multimodal-prior-guided importance sampling as the central mechanism for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. Our sampler fuses complementary cues (photometric rendering residuals, semantic priors, and geometric priors) to produce a robust, local recoverability estimate that directly drives where to inject fine Gaussians. Built around this sampling core, our framework comprises (1) a coarse-to-fine Gaussian representation that encodes global shape with a stable coarse layer and selectively adds fine primitives where the multimodal metric indicates recoverable detail; and (2) a geometric-aware sampling and retention policy that concentrates refinement on geometrically critical and complex regions while protecting newly added primitives in underconstrained areas from premature pruning. By prioritizing regions supported by consistent multimodal evidence rather than raw residuals alone, our method alleviates overfitting texture-induced errors and suppresses noise from pose/appearance inconsistencies. Experiments on diverse sparse-view benchmarks demonstrate state-of-the-art reconstructions, with up to +0.3 dB PSNR on DTU.
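The fuse-then-sample loop can be sketched as normalizing and weighting the three cue maps, then importance-sampling refinement sites; the weighted sum and normalization below are illustrative stand-ins for the paper's actual metric:

```python
import numpy as np

def recoverability_map(residual, semantic, geometric, weights=(1.0, 1.0, 1.0)):
    """Fuse per-pixel cue maps into one sampling score in [0, 1].
    The simple weighted sum is a placeholder for the paper's fusion."""
    def norm(m):
        m = m - m.min()
        return m / m.max() if m.max() > 0 else m
    w1, w2, w3 = weights
    return norm(w1 * norm(residual) + w2 * norm(semantic) + w3 * norm(geometric))

def sample_refinement_sites(score, k, rng=None):
    # Importance-sample k pixel locations proportionally to the fused score;
    # these are the sites where fine Gaussians would be injected.
    rng = rng or np.random.default_rng(0)
    p = score.ravel() / score.sum()
    flat = rng.choice(score.size, size=k, replace=False, p=p)
    return np.stack(np.unravel_index(flat, score.shape), axis=1)
```

Regions where only one cue fires get a diluted score, which is the mechanism behind preferring "consistent multimodal evidence rather than raw residuals alone".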
[173] Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen
Main category: cs.CV
TL;DR: Think-as-You-See (TaYS) enables streaming video reasoning for LVLMs with parallel CoT generation, stream-constrained training, and temporally aligned reasoning units to reduce latency while maintaining performance.
Details
Motivation: Current LVLMs use batch-style processing assuming full video availability, which doesn't align with real-world streaming video data where information arrives sequentially. There's a need for streaming reasoning paradigms that can process video frames as they arrive.
Method: Proposes TaYS framework with: 1) parallelized CoT generation, 2) stream-constrained training, 3) stream-parallel inference, 4) temporally aligned reasoning units, 5) streaming attention masks and positional encodings, and 6) dual KV-cache decoupling visual encoding from textual reasoning.
Result: TaYS consistently outperforms batch and interleaved baselines on video CoT tasks (event dynamics analysis, causal reasoning, thematic understanding), improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay.
Conclusion: Data-aligned streaming reasoning enables efficient and responsive video understanding for LVLMs, with TaYS demonstrating effectiveness in reducing latency while maintaining or improving reasoning performance on streaming video data.
Abstract: Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.
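One way to picture the streaming constraint is as a block attention mask over interleaved frame and text segments; this sketch is illustrative only and does not reproduce the paper's exact masks or cache logic:

```python
import numpy as np

def streaming_mask(frame_lens, text_lens):
    """Block attention mask for an interleaved stream laid out as
    [F0, T0, F1, T1, ...]. Reasoning (text) tokens may attend to all
    frames and text received so far (causally within their own segment),
    while frame tokens never attend to text, mimicking the decoupling a
    dual KV-cache provides. 1 = may attend."""
    segs, pos = [], 0
    for f, t in zip(frame_lens, text_lens):
        segs.append(("frame", pos, f)); pos += f
        segs.append(("text", pos, t)); pos += t
    mask = np.zeros((pos, pos), dtype=int)
    for kq, qs, qn in segs:
        for kk, ks, kn in segs:
            if ks > qs:                       # no attending to future segments
                continue
            if kq == "frame" and kk == "text":
                continue                      # frames skip text entirely
            mask[qs:qs + qn, ks:ks + kn] = 1
    for kind, s, n in segs:                   # text is causal within its segment
        if kind == "text":
            mask[s:s + n, s:s + n] = np.tril(np.ones((n, n), dtype=int))
    return mask
```

Because frame tokens never read text positions, the visual KV-cache can be updated as frames arrive without waiting for reasoning to finish, which is the latency lever behind the TTFT reduction.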
[174] SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion
Xinjie Zhu, Zijing Zhao, Hui Jin, Qingxiao Guo, Yilong Ma, Yunhao Wang, Xiaobing Guo, Weifeng Zhang
Main category: cs.CV
TL;DR: SIGMark is a scalable in-generation watermarking framework for video diffusion models that enables blind extraction with high robustness against temporal and spatial disturbances.
Details
Motivation: Existing in-generation watermarking approaches for video diffusion models are non-blind (requiring template matching and storage of message-key pairs), computationally expensive at scale, and have weak robustness against temporal disturbances when applied to modern causal 3D VAEs.
Method: Proposes SIGMark with two key components: 1) GF-PRC (Global set of Frame-wise PseudoRandom Coding keys) for generating watermarked initial noise to enable blind extraction while preserving noise distribution, and 2) SGO (Segment Group-Ordering module) tailored to causal 3D VAEs to ensure robust watermark inversion under temporal disturbance.
Result: Comprehensive experiments show SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating scalability and robustness on modern diffusion models.
Conclusion: SIGMark provides a scalable, distortion-free in-generation watermarking solution for video diffusion models with blind extraction capability and strong robustness, addressing key limitations of existing approaches for AI safety and content protection.
Abstract: Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.
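The keyed-noise / blind-correlation idea can be illustrated with a toy frame-wise sign-pattern watermark; this is a drastic simplification of GF-PRC (it does not preserve the noise distribution the way the paper's construction does) and every name here is mine:

```python
import numpy as np

def watermarked_noise(bits, dim, key_seed=0, noise_seed=123):
    """Toy keyed watermark: draw one pseudorandom sign pattern per frame
    from a global key and fold one message bit per frame into the signs
    of the initial noise magnitudes."""
    rng = np.random.default_rng(key_seed)
    patterns = rng.choice([-1.0, 1.0], size=(len(bits), dim))
    mags = np.abs(np.random.default_rng(noise_seed)
                  .standard_normal((len(bits), dim)))
    flips = np.where(np.asarray(bits)[:, None] == 1, 1.0, -1.0)
    return mags * patterns * flips

def extract_bits(noise, key_seed=0):
    # Blind extraction: regenerate the keyed patterns and correlate with
    # the recovered signs -- no per-message template storage is needed.
    rng = np.random.default_rng(key_seed)
    patterns = rng.choice([-1.0, 1.0], size=noise.shape)
    corr = (np.sign(noise) * patterns).mean(axis=1)
    return [1 if c > 0 else 0 for c in corr]
```

The point of the sketch is the scalability argument: extraction only needs the global key seed, not a stored template per message, which is what makes the scheme blind.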
[175] SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
Wonsuk Jang, Thierry Tambe
Main category: cs.CV
TL;DR: SemanticDialect is a novel quantization method for video diffusion transformers that uses block-wise mixed-format quantization with semantic-aware format selection to reduce memory/compute costs while preserving video quality and temporal coherence.
Details
Motivation: Diffusion Transformers (DiT) for video generation have high memory and compute costs that hinder edge deployment. Existing quantization methods degrade video quality due to high activation variation and fail to preserve semantic/temporal coherence.
Method: 1) SemanticDialect scales formatbook with lookup tables for quantization error and quantized values for efficient per-block format selection. 2) Activation decomposition reduces quantization error by re-quantizing and adding back residual errors with attention-guided salient token selection. 3) Semantic-aware dialect assignment (SeDA) improves quantized value consistency by sharing sub-formatbook among semantically correlated tokens.
Result: Experiments on video DiT (VDiT) models show SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
Conclusion: SemanticDialect enables efficient quantization of video diffusion transformers for edge deployment while maintaining high video quality and temporal coherence through semantic-aware format selection and error reduction techniques.
Abstract: Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and the need to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization, where a per-block optimal format (a dialect) is selected from multiple candidates (a formatbook), by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
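Per-block format selection by minimum quantization error can be sketched with plain bit-width candidates standing in for the formatbook; the formats, block split, and error measure here are illustrative, not the paper's actual dialects:

```python
import numpy as np

def quantize_symmetric(x, bits):
    # Uniform symmetric quantization with a per-block scale.
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(x).max()
    scale = m / qmax if m > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def select_format_per_block(acts, blocks=4, formats=(4, 6, 8)):
    """For each activation block, pick the candidate format ("dialect")
    with the lowest reconstruction MSE -- a toy stand-in for the
    lookup-table formatbook selection."""
    chosen, out = [], []
    for block in np.array_split(acts, blocks):
        errs = {b: np.mean((block - quantize_symmetric(block, b)) ** 2)
                for b in formats}
        best = min(errs, key=errs.get)
        chosen.append(best)
        out.append(quantize_symmetric(block, best))
    return chosen, np.concatenate(out)
```

The lookup tables in the paper serve to make exactly this per-block argmin cheap at inference time, instead of re-quantizing each block under every candidate format online as the sketch does.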
[176] StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting
Guoqing Ma, Xun Lin, Hui Ma, Ajian Liu, Yizhong Liu, Wenzhong Tang, Shan Yu, Chenqi Kong, Yi Yu
Main category: cs.CV
TL;DR: StegaFFD: A steganography-based framework that hides facial images within natural cover images to protect privacy while enabling face forgery detection directly in the steganographic domain, avoiding suspicion from attackers.
Details
Motivation: Existing face forgery detection models require raw face images, creating privacy risks during transmission or from untrusted servers. Traditional privacy protection methods (anonymization, encryption, distortion) cause obvious semantic distortion that alerts attackers and introduces artifacts that confuse forgery detection models which rely on subtle traces.
Method: StegaFFD uses image steganography to hide facial images within natural cover images. It introduces three key components: 1) Low-Frequency-Aware Decomposition (LFAD) to suppress interference from low-frequency cover semantics, 2) Spatial-Frequency Differential Attention (SFDA) to enhance hidden facial feature perception, and 3) Steganographic Domain Alignment (SDA) to align representations of hidden faces with raw counterparts.
Result: Extensive experiments on seven face forgery detection datasets show StegaFFD achieves strong imperceptibility, avoids raising attackers’ suspicion, and better preserves forgery detection accuracy compared to existing facial privacy protection methods.
Conclusion: StegaFFD provides an effective privacy-preserving solution for face forgery detection that maintains detection accuracy while being imperceptible to attackers, addressing the limitations of traditional privacy protection methods.
Abstract: Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Steganography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception. Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model’s ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers’ suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.
[177] LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval
Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, Thu-Dieu Nguyen-Thi, Vu-Hung Dao
Main category: cs.CV
TL;DR: LLandMark is a modular multi-agent framework for landmark-aware multimodal video retrieval that handles complex queries through specialized agents collaborating across query parsing, landmark reasoning, multimodal retrieval, and answer synthesis stages.
Details
Motivation: The increasing diversity and scale of video data demand retrieval systems with multimodal understanding, adaptive reasoning, and domain-specific knowledge integration, particularly for handling real-world complex queries involving cultural or spatial landmarks.
Method: A modular multi-agent framework with specialized agents collaborating across four stages: query parsing/planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. Key components include a Landmark Knowledge Agent that detects landmarks and reformulates them into visual prompts, an LLM-assisted image-to-image pipeline using Gemini 2.5 Flash for autonomous landmark detection and retrieval, and an OCR refinement module using Gemini and LlamaIndex for Vietnamese text recognition.
Result: Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance for Vietnamese scenes, demonstrating effective multimodal video retrieval with landmark awareness.
Conclusion: LLandMark provides an effective framework for landmark-aware multimodal video retrieval that handles complex queries through collaborative multi-agent architecture and specialized components for landmark detection, visual prompt generation, and multimodal matching.
Abstract: The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.
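The retrieval core shared by the text and image-to-image pipelines is CLIP-style cosine ranking over embeddings. A minimal sketch with synthetic vectors (a real deployment would use actual CLIP encoders; the dimensions and the planted frame index here are illustrative):

```python
import numpy as np

def rank_frames(query_emb, frame_embs, top_k=3):
    """Cosine similarity between one query embedding and N frame embeddings,
    highest first (the core of CLIP-based semantic matching)."""
    q = query_emb / np.linalg.norm(query_emb)
    F = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = F @ q
    order = np.argsort(-sims)[:top_k]
    return list(order), sims[order]

rng = np.random.default_rng(1)
landmark_prompt = rng.normal(size=512)      # stands in for the embedded visual prompt
frames = rng.normal(size=(100, 512))        # stand-ins for embedded video frames
frames[42] = landmark_prompt + 0.1 * rng.normal(size=512)  # one frame shows the landmark
top, scores = rank_frames(landmark_prompt, frames)
print(top[0])  # 42
```

The Landmark Knowledge Agent's contribution is upstream of this step: it rewrites a named landmark into a descriptive prompt (or retrieves a representative image) so that the query embedding above actually lands near the right frames.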
[178] Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
Kaiqiang Xiong, Rui Peng, Jiahao Wu, Zhanke Wang, Jie Liang, Xiaoyun Zheng, Feng Gao, Ronggang Wang
Main category: cs.CV
TL;DR: MVD-HuGaS: A method for free-view 3D human rendering from a single image using multi-view diffusion models and 3D Gaussian optimization
Details
Motivation: Existing methods for 3D human reconstruction from single images using diffusion models produce artifacts (flattened structure, over-smoothing) and struggle with real-world generalization. Need for better quality and fidelity.
Method: 1) Enhanced multi-view diffusion model fine-tuned on 3D human datasets to generate multi-view images from single reference. 2) Alignment module for joint optimization of 3D Gaussians and camera poses. 3) Depth-based Facial Distortion Mitigation module to refine facial regions. 4) 3D Gaussian optimization using refined multi-view images and accurate poses.
Result: Achieves state-of-the-art performance on single-view 3D human rendering on Thuman2.0 and 2K2K datasets. Produces high-fidelity free-view renderings with improved geometry and reduced artifacts.
Conclusion: MVD-HuGaS effectively addresses limitations of previous diffusion-based 3D human reconstruction methods by incorporating 3D geometry priors, accurate pose estimation, and facial refinement for superior free-view rendering quality.
Abstract: 3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
[179] 3D-DRES: Detailed 3D Referring Expression Segmentation
Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Liujuan Cao
Main category: cs.CV
TL;DR: 3D-DRES task for phrase-level 3D referring expression segmentation with DetailRefer dataset and DetailBase baseline model
Details
Motivation: Current 3D visual grounding tasks only handle sentence-level detection/segmentation, missing rich compositional contextual reasoning in natural language expressions.
Method: Introduce 3D-DRES task with DetailRefer dataset (54,432 descriptions, 11,054 objects) using phrase-instance annotation paradigm, and DetailBase baseline architecture for dual-mode segmentation
Result: Models trained on DetailRefer excel at phrase-level segmentation and show surprising improvements on traditional 3D-RES benchmarks
Conclusion: 3D-DRES enhances fine-grained 3D vision-language understanding through phrase-level segmentation with comprehensive dataset and effective baseline
Abstract: Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming at enhancing fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
[180] Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang
Main category: cs.CV
TL;DR: The paper introduces methods to interpret motion understanding in Video Diffusion Transformers, focusing on how motion concepts are converted to video through spatial and temporal localization without gradient calculations.
Details
Motivation: Current Video Diffusion Transformers can generate high-quality videos from text descriptions involving motion, but there's insufficient understanding of how they convert motion words into video. Previous interpretability work focuses mainly on objects, leaving motion-related behavior largely unexplored.
Method: Two main methods: 1) GramCol - adaptively produces per-frame saliency maps for any text concept (motion and non-motion); 2) Motion-feature selection algorithm to obtain Interpretable Motion-Attentive Map (IMAP) that localizes motion both spatially and temporally. Both methods work without gradient calculations or parameter updates.
Result: The method shows outstanding localization capability on motion localization tasks and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.
Conclusion: The paper successfully addresses the gap in understanding motion interpretation in Video Diffusion Transformers, providing effective tools for visualizing how motion concepts are processed spatially and temporally without requiring model modifications or gradient computations.
Abstract: Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.
[181] ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization
Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han
Main category: cs.CV
TL;DR: ProGIC is a compact generative image compression method using residual vector quantization for progressive transmission and efficient deployment.
Details
Motivation: Current generative image compression methods rely on large, rigid models that limit flexible transmission and practical deployment in low-bitrate scenarios.
Method: Uses residual vector quantization (RVQ) with multiple vector quantizers encoding residuals stage-by-stage, paired with lightweight backbone based on depthwise-separable convolutions and small attention blocks.
Result: Achieves comparable compression performance with 57.57% bitrate savings on DISTS and 58.83% on LPIPS vs MS-ILLM on Kodak dataset, plus 10x faster encoding/decoding on GPUs.
Conclusion: ProGIC provides efficient, deployable generative image compression with progressive transmission capabilities and practical deployment on both GPU and CPU devices.
Abstract: Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains comparable compression performance compared with previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and also delivers over 10 times faster encoding and decoding compared with MS-ILLM on GPUs for efficiency.
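The RVQ mechanism the abstract describes (each stage quantizes the residual of the previous one, and the codewords sum to a coarse-to-fine reconstruction) can be sketched in a few lines. The codebook shapes and scales below are arbitrary, and including a zero codeword in every stage is a demo convenience that guarantees monotone refinement, not a claim about ProGIC's codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stage."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks, upto=None):
    """Summing codewords gives a coarse-to-fine reconstruction; truncating
    `codes` simulates decoding a partial (progressive) bitstream."""
    n = len(codes) if upto is None else upto
    return sum(codebooks[s][codes[s]] for s in range(n))

d, K, stages = 8, 64, 4
# The zero vector in each codebook ensures the residual norm can never grow,
# so each extra stage only refines the preview.
codebooks = [np.vstack([np.zeros(d),
                        rng.normal(scale=1.0 / (s + 1), size=(K - 1, d))])
             for s in range(stages)]
x = rng.normal(size=d)
codes = rvq_encode(x, codebooks)
errs = [np.linalg.norm(x - rvq_decode(codes, codebooks, upto=s))
        for s in range(1, stages + 1)]
print([round(e, 3) for e in errs])  # non-increasing reconstruction error
```

The progressive-transmission property falls out directly: decoding any prefix of the per-stage indices yields a usable, coarser reconstruction.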
[182] Harmonic Beltrami Signature Network: a Shape Prior Module in Deep Learning Framework
Chenran Lin, Lok Ming Lui
Main category: cs.CV
TL;DR: HBSN is a deep learning architecture that computes Harmonic Beltrami Signatures for shape representation, enabling shape prior integration into vision pipelines.
Details
Motivation: Need for robust shape representation that provides one-to-one correspondence with 2D shapes while being invariant to translation, scaling, and rotation, and can be efficiently integrated into deep learning models.
Method: Uses neural network architecture with pre-STN for shape normalization, UNet backbone for HBS prediction, and post-STN for angle regularization to compute Harmonic Beltrami Signatures.
Result: HBSN accurately computes HBS representations for complex shapes and improves performance of existing segmentation models when incorporated as a shape prior module.
Conclusion: HBSN serves as an effective general-purpose module for embedding geometric shape information into computer vision pipelines, enhancing model performance through shape priors.
Abstract: This paper presents the Harmonic Beltrami Signature Network (HBSN), a novel deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation that provides a one-to-one correspondence with 2D simply connected shapes, with invariance to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN enables efficient extraction and utilization of shape prior information. The proposed network architecture incorporates a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN accurately computes HBS representations, even for complex shapes. Furthermore, we demonstrate how HBSN can be directly incorporated into existing deep learning segmentation models, improving their performance through the use of shape priors. The results confirm the utility of HBSN as a general-purpose module for embedding geometric shape information into computer vision pipelines.
[183] Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement
Hao Ai, Wenjie Chang, Jianbo Jiao, Ales Leonardis, Ofek Eyal
Main category: cs.CV
TL;DR: AiM reconstructs articulated objects from interaction videos and scans, using dual-Gaussian representation and motion cues for part segmentation and articulation analysis without prior knowledge.
Details
Motivation: Current methods for articulated object reconstruction require prior knowledge of part counts and clear visibility in multiple states, limiting their applicability and robustness in real-world scenarios.
Method: Proposes a dual-Gaussian scene representation learned from initial 3DGS scan and interaction video, using motion cues for part segmentation and articulation joint assignment, followed by sequential RANSAC for part mobility analysis without structural priors.
Result: Achieves higher quality part segmentation than previous methods without prior knowledge, validated through extensive experiments on both simple and complex objects with strong generalization ability.
Conclusion: AiM provides a robust framework for articulated object reconstruction, segmentation, and articulation analysis from interaction videos, overcoming limitations of previous methods that require prior knowledge and clear multi-state visibility.
Abstract: Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyse two different articulation states and perform per-point part segmentation, optimising per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects are not clearly visible in both states. To address these issues, in this paper, we present a new framework, Articulation in Motion (AiM). We infer part-level decomposition, articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis without any part-level structural priors, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than previous methods, without prior knowledge. Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalisation ability of our approach. Project page: https://haoai-1997.github.io/AiM/.
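The sequential-RANSAC step (peel off one rigid part at a time, letting the part count emerge automatically) can be illustrated on noise-free 2-D correspondences. The sample size, inlier threshold, and Kabsch solver below are generic choices for illustration, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rigid_2d(P, Q):
    """Kabsch: least-squares rotation R and translation t with q ≈ R p + t."""
    Pc, Qc = P.mean(0), Q.mean(0)
    H = (P - Pc).T @ (Q - Qc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, Qc - R @ Pc

def sequential_ransac(P, Q, thresh=1e-3, min_part=5, iters=200):
    """Peel off one rigid part at a time; the part count falls out automatically."""
    remaining, parts = np.arange(len(P)), []
    while len(remaining) >= min_part:
        best = None
        for _ in range(iters):
            i, j = rng.choice(remaining, size=2, replace=False)
            R, t = fit_rigid_2d(P[[i, j]], Q[[i, j]])
            err = np.linalg.norm(P[remaining] @ R.T + t - Q[remaining], axis=1)
            inliers = remaining[err < thresh]
            if best is None or len(inliers) > len(best):
                best = inliers
        if len(best) < min_part:
            break
        parts.append(best)
        remaining = np.setdiff1d(remaining, best)
    return parts

# Synthetic object: a static body plus a door-like part rotated by 30 degrees.
body = rng.uniform(-1, 1, size=(40, 2))
door = rng.uniform(-1, 1, size=(30, 2)) + np.array([2.0, 0.0])
a = np.deg2rad(30.0)
Rdoor = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
P = np.vstack([body, door])                          # start-state scan
Q = np.vstack([body, door @ Rdoor.T + [0.2, 0.1]])   # after interaction
parts = sequential_ransac(P, Q)
print([len(p) for p in parts])  # [40, 30]: two rigid parts, found largest first
```

Because termination is driven by how many correspondences remain unexplained, the number of parts never has to be supplied up front, which is the property AiM relies on.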
[184] HDINO: A Concise and Efficient Open-Vocabulary Detector
Hao Zhang, Yiqun Wang, Qinran Lin, Runze Fan, Yong Li
Main category: cs.CV
TL;DR: HDINO is an efficient open-vocabulary object detector that eliminates dependence on manually curated datasets and resource-intensive cross-modal feature extraction through a two-stage training strategy with semantic alignment and lightweight feature fusion.
Details
Motivation: Most existing open-vocabulary object detection methods rely heavily on manually curated fine-grained training datasets and resource-intensive layer-wise cross-modal feature extraction, which limits their efficiency and scalability.
Method: Two-stage training strategy: 1) One-to-Many Semantic Alignment Mechanism (O2M) treats noisy samples as additional positive instances for visual-textual alignment, with Difficulty Weighted Classification Loss (DWCL) for hard example mining; 2) Lightweight feature fusion module enhances sensitivity to linguistic semantics.
Result: HDINO-T achieves 49.2 mAP on COCO using only 2.2M training images (vs 5.4M/6.5M for competitors), surpassing Grounding DINO-T by 0.8 mAP and T-Rex2 by 2.8 mAP. After fine-tuning, HDINO-T and HDINO-L achieve 56.4 mAP and 59.2 mAP respectively.
Conclusion: HDINO demonstrates an efficient and scalable approach to open-vocabulary object detection that eliminates manual data curation dependencies while achieving state-of-the-art performance with fewer training resources.
Abstract: Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation and the use of grounding data, surpassing Grounding DINO-T and T-Rex2 by 0.8 mAP and 2.8 mAP, respectively, which are trained on 5.4M and 6.5M images. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
[185] GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights
Qiming He, Jing Li, Tian Guan, Yifei Ma, Zimo Zhao, Yanxia Wang, Hongjing Chen, Yingming Xu, Shuang Ge, Yexing Zhang, Yizhi Wang, Xinrui Chen, Lianghui Zhu, Yiqing Liu, Qingxia Hou, Shuyan Zhao, Xiaoqin Wang, Lili Ma, Peizhen Hu, Qiang Huang, Zihan Wang, Zhiyuan Shen, Junru Cheng, Siqi Zeng, Jiurun Chen, Zhen Song, Chao He, Zhe Wang, Yonghong He
Main category: cs.CV
TL;DR: GloPath is an entity-centric foundation model for glomerular pathology analysis using multi-scale, multi-view self-supervised learning on over 1M glomeruli, achieving state-of-the-art performance on lesion assessment tasks and revealing morphology-clinical correlations.
Details
Motivation: Current AI approaches struggle with the heterogeneity of glomerular morphology and fine-grained lesion patterns in renal pathology, creating challenges for accurate diagnosis and prognosis of kidney diseases.
Method: Multi-scale and multi-view self-supervised learning foundation model trained on over 1 million glomeruli extracted from 14,049 renal biopsy specimens, using entity-centric approach focused on glomerular structures.
Result: Outperformed state-of-the-art methods in 42 out of 52 tasks (80.8%), achieved 91.51% ROC-AUC for lesion recognition in real-world study, and revealed statistically significant associations across 224 morphology-clinical variable pairs.
Conclusion: GloPath represents a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, advancing clinically translatable AI in renal pathology.
Abstract: Glomerular pathology is central to the diagnosis and prognosis of renal diseases, yet the heterogeneity of glomerular morphology and fine-grained lesion patterns remain challenging for current AI approaches. We present GloPath, an entity-centric foundation model trained on over one million glomeruli extracted from 14,049 renal biopsy specimens using multi-scale and multi-view self-supervised learning. GloPath addresses two major challenges in nephropathology: glomerular lesion assessment and clinicopathological insights discovery. For lesion assessment, GloPath was benchmarked across three independent cohorts on 52 tasks, including lesion recognition, grading, few-shot classification, and cross-modality diagnosis-outperforming state-of-the-art methods in 42 tasks (80.8%). In the large-scale real-world study, it achieved an ROC-AUC of 91.51% for lesion recognition, demonstrating strong robustness in routine clinical settings. For clinicopathological insights, GloPath systematically revealed statistically significant associations between glomerular morphological parameters and clinical indicators across 224 morphology-clinical variable pairs, demonstrating its capacity to connect tissue-level pathology with patient-level outcomes. Together, these results position GloPath as a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, representing a step toward clinically translatable AI in renal pathology.
[186] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval
Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, JinQiao Wang
Main category: cs.CV
TL;DR: TRACE introduces a unified multimodal retrieval framework that combines generative reasoning with discriminative representation learning, using Chain-of-Thought reasoning for complex queries and compressing it into embeddings.
Details
Motivation: Current multimodal retrieval models treat MLLMs as static encoders, underutilizing their generative reasoning capabilities and struggling with complex compositional intents that require logical deduction rather than superficial pattern matching.
Method: TRACE first generates a structured Chain-of-Thought to explicitly reason about queries, then compresses this reasoning trace into a compact embedding via a dedicated token. It uses a difficulty-aware routing strategy and is trained on the M-BEIR-CoT dataset.
Result: TRACE achieves state-of-the-art performance on the M-BEIR benchmark, demonstrates learned implicit routing behavior (activating reasoning for complex queries while bypassing it for simpler ones), and shows remarkable zero-shot transferability to unseen domains and novel constraints.
Conclusion: TRACE successfully unifies generative reasoning with discriminative representation learning for multimodal retrieval, achieving optimal balance between accuracy and efficiency while maintaining strong generalization capabilities.
Abstract: Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
[187] TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
Benlei Cui, Shaoxuan He, Bukun Huang, Zhizeng Ye, Yunyun Sun, Longtao Huang, Hui Xue, Yang Yang, Jingqun Tang, Zhou Zhao, Haiwen Hong
Main category: cs.CV
TL;DR: TC-Padé: A feature prediction framework using Padé approximation for accelerating diffusion models in low-step regimes (20-30 steps) while maintaining generation quality.
Details
Motivation: Diffusion models suffer from computational burden due to iterative sampling. Existing feature caching techniques work well at high step counts (50+ steps) but fail in practical low-step regimes (20-30 steps) due to error accumulation and trajectory drift from polynomial-based extrapolators, and they overlook distinct dynamical properties of different denoising phases.
Method: Proposes Trajectory-Consistent Padé approximation framework using rational functions (Padé approximation) to model feature evolution more accurately than Taylor-based methods. Includes: (1) adaptive coefficient modulation using historical cached residuals to detect trajectory transitions, and (2) step-aware prediction strategies tailored to early, mid, and late sampling stages.
Result: Achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics. Outperforms existing feature caching methods in extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across image and video generation.
Conclusion: TC-Padé effectively addresses limitations of existing feature caching methods in low-step regimes by using Padé approximation and trajectory-consistent sampling strategies, achieving significant acceleration while preserving generation quality.
Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
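The core argument, that rational (Padé-type) predictors track asymptotic feature behavior where same-order polynomial (Taylor-style) extrapolators drift, is easy to demonstrate on a one-dimensional toy trajectory. This illustrates the mathematical point only; it is not the paper's predictor:

```python
import numpy as np

def fit_pade_1_1(ts, fs):
    """Fit f(t) ≈ (a0 + a1*t) / (1 + b1*t) through three samples.
    Clearing the denominator gives the linear system a0 + a1*t - b1*t*f = f."""
    A = np.stack([np.ones(3), ts, -ts * fs], axis=1)
    a0, a1, b1 = np.linalg.solve(A, fs)
    return lambda t: (a0 + a1 * t) / (1.0 + b1 * t)

def f(t):
    """A toy 'feature trajectory' with an asymptote, the regime where
    polynomial extrapolation drifts."""
    return 1.0 / (1.0 + t)

ts = np.array([0.0, 0.5, 1.0])      # three cached timesteps
fs = f(ts)

pade = fit_pade_1_1(ts, fs)
poly = np.poly1d(np.polyfit(ts, fs, 2))   # same number of free parameters

t_far = 3.0   # a large step, as in the 20-30 step regime
print(abs(pade(t_far) - f(t_far)))  # ~0: the rational form captures the asymptote
print(abs(poly(t_far) - f(t_far)))  # 1.25: the quadratic overshoots badly
```

Both fits interpolate the three cached samples exactly; they differ only in how they extrapolate across the long inter-step gap, which is exactly where the abstract says Taylor-based caches accumulate trajectory drift.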
[188] Semi-Supervised Few-Shot Adaptation of Vision-Language Models
Julio Silva-Rodríguez, Ender Konukoglu
Main category: cs.CV
TL;DR: Medical vision-language model adaptation using semi-supervised learning with text-informed pseudo-labels for few-shot classification with imbalanced medical data.
Details
Motivation: Medical imaging faces high annotation costs and class imbalances, making few-shot adaptation of vision-language models challenging, especially for underrepresented categories in low-shot regimes.
Method: Proposes a semi-supervised solver that leverages unlabeled data by propagating text-informed pseudo-labels during few-shot adaptation of vision-language models.
Result: Enables lower-budget annotation pipelines, reducing labeling effort by >50% in low-shot regimes while maintaining performance.
Conclusion: Semi-supervised approach with text-informed pseudo-labels effectively addresses class imbalance in medical few-shot VLM adaptation, significantly reducing annotation costs.
Abstract: Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.
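The general mechanism of text-informed pseudo-labeling can be sketched as follows: score unlabeled image embeddings against class text embeddings and keep only confident assignments. This is a generic zero-shot-style sketch under our own naming and thresholding assumptions, not the paper's semi-supervised solver, which additionally propagates labels during adaptation.

```python
import numpy as np

def text_informed_pseudo_labels(img_emb, text_emb, threshold=0.5):
    """Assign each unlabeled image embedding the class of its most
    similar text embedding (cosine similarity), discarding assignments
    whose softmax confidence falls below the threshold."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T                                   # (n_images, n_classes)
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= threshold                # confidence filter
    return labels[keep], keep

# Two toy classes whose text embeddings align with the unlabeled images.
unlabeled = np.array([[0.9, 0.1], [0.2, 1.1]])
class_texts = np.array([[1.0, 0.0], [0.0, 1.0]])
labels, keep = text_informed_pseudo_labels(unlabeled, class_texts)
print(labels)  # [0 1]
```

In a pipeline like the paper describes, such pseudo-labels would augment the few annotated shots, which is what makes the >50% labeling-effort reduction possible.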
[189] Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention
Wensheng Wu, Zheming Lu, Ziqian Lu, Zewei He, Xuecheng Sun, Zhao Wang, Jungong Han, Yunlong Yu
Main category: cs.CV
TL;DR: FMAS pipeline generates realistic industrial anomaly samples without fine-tuning, combined with WDAM module using wavelet domain attention for enhanced anomaly detection.
Details
Motivation: Industrial anomaly detection suffers from scarce anomalous samples and complex real-world anomalies, requiring better synthesis methods and feature extraction.
Method: Proposes FMAS (foundation model-based anomaly synthesis pipeline) for realistic anomaly generation without fine-tuning, and WDAM (Wavelet Domain Attention Module) using adaptive sub-band processing in wavelet domain.
Result: Significantly improves anomaly detection sensitivity while maintaining computational efficiency; achieves substantial performance gains on MVTec AD and VisA datasets as plug-and-play module.
Conclusion: The combination of FMAS and WDAM effectively addresses industrial anomaly detection challenges through realistic anomaly synthesis and enhanced frequency-domain feature extraction.
Abstract: Industrial anomaly detection faces significant challenges due to the scarcity of anomalous samples and the complexity of real-world anomalies. In this paper, we propose a foundation model-based anomaly synthesis pipeline (FMAS) that generates highly realistic anomalous samples without fine-tuning or class-specific training. Motivated by the distinct frequency-domain characteristics of anomalies, we introduce a Wavelet Domain Attention Module (WDAM), which exploits adaptive sub-band processing to enhance anomaly feature extraction. The combination of FMAS and WDAM significantly improves anomaly detection sensitivity while maintaining computational efficiency. Comprehensive experiments on MVTec AD and VisA datasets demonstrate that WDAM, as a plug-and-play module, achieves substantial performance gains against existing baselines.
[190] TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin
Main category: cs.CV
TL;DR: TagaVLM is a topology-aware framework that enhances VLMs for Vision-Language Navigation by explicitly injecting spatial topological structures into the model architecture, enabling better global action reasoning.
Details
Motivation: Current VLMs are pretrained on static vision-language tasks, creating an architectural mismatch with the dynamic, embodied, and spatially-structured nature of navigation. Existing methods either convert visual-spatial information to text (forcing implicit inference) or limit global action capabilities.
Method: Proposes TagaVLM with three key components: 1) Spatial Topology Aware Residual Attention (STAR-Att) that directly integrates topological edge information into VLM’s self-attention, 2) Interleaved Navigation Prompt to enhance topological node information and visual-text alignment, and 3) Global action reasoning using the embedded topological graph for robust path correction.
Result: Achieves state-of-the-art performance on R2R benchmark among large-model-based methods: 51.09% Success Rate (SR) and 47.18 SPL in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL.
Conclusion: Targeted architectural enhancements on smaller open-source VLMs can be more effective than brute-force model scaling for embodied spatial reasoning tasks like VLN. Explicit topological structure injection bridges the gap between static VLM pretraining and dynamic navigation requirements.
Abstract: Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication. Project page: https://apex-bjut.github.io/Taga-VLM
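One common way to inject a topological graph into self-attention, in the spirit of STAR-Att, is an additive bias on the attention logits for connected node pairs. The sketch below uses that generic formulation with our own names and a scalar edge bias; the paper's residual integration into a pretrained VLM is more involved.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topology_biased_attention(Q, K, V, adj, edge_bias=2.0):
    """Scaled dot-product attention with an additive logit bias for
    node pairs connected in the topological graph (adj is 0/1)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = logits + edge_bias * adj      # boost attention along graph edges
    return softmax(logits, axis=-1) @ V

# With uninformative queries/keys, attention mass shifts to graph neighbours.
Q = np.zeros((2, 4))
K = np.zeros((2, 4))
V = np.eye(2)
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
out = topology_biased_attention(Q, K, V, adj)
print(out[0, 1] > 0.5)  # True: node 0 attends mostly to its neighbour, node 1
```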
[191] Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection
Ertunc Erdil, Nico Schulthess, Guney Tombak, Ender Konukoglu
Main category: cs.CV
TL;DR: A simple and efficient unsupervised anomaly detection framework that uses 2D autoregressive modeling to capture spatial dependencies between patch embeddings from DINO models, reducing memory and computational overhead compared to memory bank approaches.
Details
Motivation: Existing DINO-based anomaly detection methods treat patch embeddings independently, ignoring spatial relationships, and use memory-intensive approaches like memory banks or prototypes that require costly comparisons at inference time.
Method: Proposes a 2D autoregressive model using a convolutional neural network to explicitly model spatial and contextual dependencies between patch embeddings. Learns a compact parametric model of the normative distribution instead of storing embeddings.
Result: Achieves competitive anomaly detection performance on BMAD medical imaging benchmark while substantially reducing inference time and memory requirements compared to existing DINO-based methods.
Conclusion: Explicitly modeling spatial dependencies with autoregressive models provides an efficient alternative to memory-intensive approaches for unsupervised anomaly detection using DINO features.
Abstract: DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: https://eerdil.github.io/spatial-ar-dinov3-uad/.
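The causality constraint of a 2D autoregressive model over patch embeddings can be made concrete with a minimal raster-order predictor. This is a linear, single-layer sketch of the masked-convolution idea (the paper's AR CNN is deeper, and helper names here are ours): each cell is predicted only from cells that precede it in raster order.

```python
import numpy as np

def causal_neighborhood(grid, i, j):
    """Collect the raster-order causal 3x3 context of cell (i, j):
    the three cells above and the one to the left, zero-padded at
    borders -- the receptive field a masked convolution would see."""
    H, W = grid.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]  # strictly-past cells
    ctx = []
    for di, dj in offsets:
        r, c = i + di, j + dj
        ctx.append(grid[r, c] if 0 <= r < H and 0 <= c < W else 0.0)
    return np.array(ctx)

def ar_predict(grid, weights):
    """Predict every cell as a linear function of its causal context."""
    H, W = grid.shape
    pred = np.zeros_like(grid, dtype=float)
    for i in range(H):
        for j in range(W):
            pred[i, j] = causal_neighborhood(grid, i, j) @ weights
    return pred

# Causality check: perturbing a "future" cell leaves earlier predictions intact.
rng = np.random.default_rng(0)
g = rng.normal(size=(4, 4))
w = rng.normal(size=4)
p1 = ar_predict(g, w)
g2 = g.copy()
g2[3, 3] += 10.0                        # perturb the last cell in raster order
p2 = ar_predict(g2, w)
print(np.allclose(p1[:3], p2[:3]))      # True: no dependence on the future
```

At test time, the anomaly score at each patch would be the deviation between the AR prediction and the observed embedding, computed in one forward pass, which is where the memory and speed advantage over memory banks comes from.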
[192] The Dresden Dataset for 4D Reconstruction of Non-Rigid Abdominal Surgical Scenes
Reuben Docea, Rayan Younis, Yonghao Long, Maxime Fleury, Jinjing Xu, Chenyang Li, André Schulze, Ann Wierick, Johannes Bender, Micha Pfeiffer, Qi Dou, Martin Wagner, Stefanie Speidel
Main category: cs.CV
TL;DR: D4D Dataset provides paired endoscopic video and structured-light geometry for evaluating 3D reconstruction of deforming abdominal tissue in surgical conditions, with over 300k frames and 369 point clouds across 98 recordings.
Details
Motivation: There is a need for comprehensive benchmarks to evaluate 3D reconstruction methods for deforming soft tissue in surgical environments, particularly for non-rigid SLAM, 4D reconstruction, and depth estimation algorithms.
Method: Data acquired from six porcine cadaver sessions using da Vinci Xi stereo endoscope and Zivid structured-light camera, registered via optical tracking and manual curation. Three sequence types probe algorithm robustness to different motion patterns. Postprocessing uses ICP and semi-automatic registration with instrument masks.
Result: Dataset provides rectified stereo images, per-frame instrument masks, stereo depth, structured-light point clouds, camera poses and intrinsics. Enables quantitative geometric evaluation in visible and occluded regions with photometric view-synthesis baselines.
Conclusion: The D4D Dataset serves as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods in surgical contexts.
Abstract: The D4D Dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue in realistic surgical conditions. Data were acquired from six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment methods. Three sequence types - whole deformations, incremental deformations, and moved-camera clips - probe algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses and camera intrinsics. In postprocessing, ICP and semi-automatic registration techniques are used to register data, and instrument masks are created. The dataset enables quantitative geometric evaluation in both visible and occluded regions, alongside photometric view-synthesis baselines. Comprising over 300,000 frames and 369 point clouds across 98 curated recordings, this resource can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods.
[193] VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats
Alessio Mazzucchelli, Ivan Ojeda-Martin, Fernando Rivas-Manzaneque, Elena Garces, Adrian Penate-Sanchez, Francesc Moreno-Noguer
Main category: cs.CV
TL;DR: VIRGi enables rapid color editing of 3D Gaussian Splatting scenes while preserving view-dependent effects like specular highlights, requiring only one manually edited image for real-time propagation.
Details
Motivation: While 3D Gaussian Splatting (3DGS) excels at novel view synthesis and 3D reconstruction, there's no efficient method for photorealistic appearance editing that preserves view-dependent effects like specular highlights.
Method: Introduces a novel architecture separating color into diffuse and view-dependent components, with multi-view training using image patches from multiple viewpoints. For recoloring, uses a rapid scheme requiring only one manually edited image, fine-tuning a single MLP alongside a single-shot segmentation module for editable areas.
Result: Enables color edits to propagate across entire scenes in just two seconds, facilitating real-time interaction with control over view-dependent effects. Outperforms Neural Radiance Field-based competitors on diverse datasets both quantitatively and qualitatively.
Conclusion: VIRGi provides an efficient, photorealistic method for editing 3DGS scenes while preserving view-dependent effects, enabling real-time interaction with minimal user input.
Abstract: 3D Gaussian Splatting (3DGS) has recently transformed the fields of novel view synthesis and 3D reconstruction due to its ability to accurately model complex 3D scenes and its unprecedented rendering performance. However, a significant challenge persists: the absence of an efficient and photorealistic method for editing the appearance of the scene’s content. In this paper we introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS while preserving view-dependent effects such as specular highlights. Key to our method are a novel architecture that separates color into diffuse and view-dependent components, and a multi-view training strategy that integrates image patches from multiple viewpoints. Improving over the conventional single-view batch training, our 3DGS representation provides more accurate reconstruction and serves as a solid representation for the recoloring task. For 3DGS recoloring, we then introduce a rapid scheme requiring only one manually edited image of the scene from the end-user. By fine-tuning the weights of a single MLP, alongside a module for single-shot segmentation of the editable area, the color edits are seamlessly propagated to the entire scene in just two seconds, facilitating real-time interaction and providing control over the strength of the view-dependent effects. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors based on Neural Radiance Fields representations.
[194] Any Resolution Any Geometry: From Multi-View To Multi-Patch
Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka
Main category: cs.CV
TL;DR: URGT adapts VGGT into a multi-patch transformer for joint high-resolution depth-normal estimation using cross-patch attention and GridMix sampling to balance local detail and global consistency.
Details
Motivation: Joint estimation of surface normals and depth is crucial for 3D scene understanding, but high-resolution prediction faces challenges in preserving fine local details while maintaining global consistency across the scene.
Method: Proposes Ultra Resolution Geometry Transformer (URGT) that partitions high-resolution images into patches augmented with coarse depth/normal priors, processes them jointly via cross-patch attention for global coherence, and uses GridMix patch sampling for spatial robustness.
Result: Achieves SOTA on UnrealStereo4K: reduces AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, mean angular error from 23.36° to 18.51°, with sharper geometry and strong zero-shot/cross-domain generalization.
Conclusion: URGT provides an efficient, extensible solution for high-quality geometry refinement that scales to very high resolutions while maintaining both local detail and global consistency through multi-patch transformer architecture.
Abstract: Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth–normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
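The bookkeeping behind any multi-patch pipeline is splitting a high-resolution input into a grid of patches and reassembling the per-patch outputs. The sketch below shows only that round trip, under our own naming, with even-divisibility assumed; URGT additionally attaches coarse depth/normal priors to each patch and randomizes the grid via GridMix.

```python
import numpy as np

def patchify(img, ph, pw):
    """Split an (H, W, C) image into row-major (ph, pw, C) patches.
    Assumes H % ph == 0 and W % pw == 0."""
    H, W, C = img.shape
    return [img[i:i + ph, j:j + pw]
            for i in range(0, H, ph) for j in range(0, W, pw)]

def unpatchify(patches, H, W):
    """Reassemble row-major patches into the full (H, W, C) image."""
    ph, pw, C = patches[0].shape
    out = np.zeros((H, W, C), dtype=patches[0].dtype)
    k = 0
    for i in range(0, H, ph):
        for j in range(0, W, pw):
            out[i:i + ph, j:j + pw] = patches[k]
            k += 1
    return out

img = np.arange(4 * 6 * 3, dtype=float).reshape(4, 6, 3)
patches = patchify(img, 2, 3)
print(len(patches))                                    # 4
print(np.array_equal(unpatchify(patches, 4, 6), img))  # True
```

Sampling different (ph, pw) grid configurations per training step, as GridMix does probabilistically, would exercise the model on varying patch boundaries and improve inter-patch consistency.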
[195] TinyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference
Mhd Rashed Al Koutayni, Mohamed Selim, Gerd Reis, Alain Pagani, Didier Stricker
Main category: cs.CV
TL;DR: TinyIceNet: A compact semantic segmentation network for on-board sea ice mapping from Sentinel-1 SAR imagery, optimized for FPGA deployment with low-precision quantization to balance accuracy and energy efficiency.
Details
Motivation: Sea ice mapping is crucial for polar navigation, but conventional ground-based processing of Sentinel-1 SAR data faces bandwidth, latency, and energy constraints. On-board processing using dedicated inference chips offers a transformative solution by generating actionable products directly in orbit.
Method: Co-designed compact semantic segmentation network combining SAR-aware architectural simplifications with low-precision quantization. Trained on AI4Arctic dataset, synthesized using High-Level Synthesis, and deployed on Xilinx Zynq UltraScale+ FPGA platform for on-board processing.
Result: Achieves 75.216% F1 score on Stage of Development (SOD) segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, demonstrating near-real-time inference capabilities.
Conclusion: TinyIceNet demonstrates the potential of chip-level hardware-algorithm co-design for spaceborne and edge AI systems, enabling efficient on-board processing of remote sensing data under strict hardware and power constraints.
Abstract: Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel-1 Synthetic Aperture Radar (SAR) provides high-resolution, all-weather observations of sea ice, conventional ground-based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On-board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co-designed for on-board Stage of Development (SOD) mapping from dual-polarized Sentinel-1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR-aware architectural simplifications with low-precision quantization to balance accuracy and efficiency. The model is synthesized using High-Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near-real-time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, underscoring the potential of chip-level hardware-algorithm co-design for future spaceborne and edge AI systems.
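Low-precision quantization, the key to TinyIceNet's energy savings, can be illustrated with a generic symmetric int8 scheme. This is a sketch under our own assumptions (per-tensor scale, round-to-nearest), not the exact quantization used in the paper's HLS flow.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale so the largest
    magnitude maps to 127, round, and keep the scale for dequantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(q.dtype)                                        # int8
print(float(np.max(np.abs(dequantize(q, s) - w))) < s)  # True: error < one step
```

Storing 8-bit codes instead of 32-bit floats cuts weight memory 4x, and integer arithmetic is what makes FPGA inference both fast and power-efficient.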
[196] BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology
Xiaojing Guo, Jiatai Lin, Yumian Jia, Jingqi Huang, Zeyan Xu, Weidong Li, Longfei Wang, Jingjing Chen, Qin Li, Weiwei Wang, Lifang Cui, Wen Yue, Zhiqiang Cheng, Xiaolong Wei, Jianzhong Yu, Xia Jin, Baizhou Li, Honghong Shen, Jing Li, Chunlan Li, Yanfen Cui, Yi Dai, Yiling Yang, Xiaolong Qian, Liu Yang, Yang Yang, Guangshen Gao, Yaqing Li, Lili Zhai, Chenying Liu, Tianhua Zhang, Zhenwei Shi, Cheng Lu, Xingchen Zhou, Jing Xu, Miaoqing Zhao, Fang Mei, Jiaojiao Zhou, Ning Mao, Fangfang Liu, Chu Han, Zaiyi Liu
Main category: cs.CV
TL;DR: BRIGHT is a specialized breast pathology foundation model trained on 210M histopathology tiles from 51K breast WSIs, using a collaborative generalist-specialist framework to achieve SOTA performance across 24 clinical tasks in breast oncology.
Details
Motivation: Generalist pathology foundation models lack proficiency in specific organ systems due to insufficient large-scale validation for single organs and absence of tailored training paradigms to translate broad histomorphological knowledge into organ-specific expertise.
Method: Developed BRIGHT, the first breast pathology foundation model, trained on ~210M histopathology tiles from 51K+ breast WSIs across 40K+ patients from 19 hospitals. Used collaborative generalist-specialist framework to capture both universal and organ-specific features. Evaluated on largest multi-institutional cohorts with 25K+ WSIs across 10 hospitals covering 24 clinical tasks.
Result: BRIGHT outperforms three leading generalist PFMs, achieving SOTA performance in 21/24 internal validation tasks and 5/10 external validation tasks with excellent heatmap interpretability.
Conclusion: BRIGHT demonstrates clinical utility in breast oncology and validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs for specific organ systems.
Abstract: Generalist pathology foundation models (PFMs), pretrained on large-scale multi-organ datasets, have demonstrated remarkable predictive capabilities across diverse clinical applications. However, their proficiency on the full spectrum of clinically essential tasks within a specific organ system remains an open question due to the lack of large-scale validation cohorts for a single organ as well as the absence of a tailored training paradigm that can effectively translate broad histomorphological knowledge into the organ-specific expertise required for specialist-level interpretation. In this study, we propose BRIGHT, the first PFM specifically designed for breast pathology, trained on approximately 210 million histopathology tiles from over 51,000 breast whole-slide images derived from a cohort of over 40,000 patients across 19 hospitals. BRIGHT employs a collaborative generalist-specialist framework to capture both universal and organ-specific features. To comprehensively evaluate the performance of PFMs on breast oncology, we curate the largest multi-institutional cohorts to date for downstream task development and evaluation, comprising over 25,000 WSIs across 10 hospitals. The validation cohorts cover the full spectrum of breast pathology across 24 distinct clinical tasks spanning diagnosis, biomarker prediction, treatment response and survival prediction. Extensive experiments demonstrate that BRIGHT outperforms three leading generalist PFMs, achieving state-of-the-art (SOTA) performance in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks with excellent heatmap interpretability. By evaluating on large-scale validation cohorts, this study not only demonstrates BRIGHT’s clinical utility in breast oncology but also validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs on a specific organ system.
[197] EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education
Baoliang Chen, Xinlong Bu, Lingyu Zhu, Hanwei Zhu, Xiangjie Sui
Main category: cs.CV
TL;DR: EduAIGV-1k benchmark dataset and EduVQA framework for evaluating AI-generated educational math videos, with fine-grained annotations and a novel S2D-MoE module for quality assessment.
Details
Motivation: While AI-generated content models excel at creating photorealistic videos, their potential for educational storytelling and visual learning remains unexplored. There's a need to assess AI-generated educational videos (AIGVs) for teaching foundational math concepts to young learners.
Method: Created EduAIGV-1k dataset with 1,130 short videos from 10 state-of-the-art T2V models using 113 pedagogy-oriented prompts. Each video has fine-grained annotations along perceptual quality (spatial/temporal fidelity) and prompt alignment (word/sentence-level). Proposed EduVQA framework with Structured 2D Mixture-of-Experts (S2D-MoE) module that enhances dependency between overall quality and sub-dimensions via shared experts and dynamic 2D gating matrix.
Result: EduVQA consistently outperforms existing VQA baselines in experiments. The dataset provides multi-dimensional supervision signals beyond single quality scores, enabling detailed assessment of AI-generated educational videos.
Conclusion: The work establishes the first benchmark for evaluating AI-generated educational videos, providing both dataset and evaluation framework to advance research in educational AIGC. The fine-grained annotations and novel S2D-MoE approach offer interpretable quality assessment for educational video generation.
Abstract: While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension by shared experts and dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be publicly available.
[198] MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
Jun Yeong Park, JunYoung Seo, Minji Kang, Yu Rang Park
Main category: cs.CV
TL;DR: MoECLIP introduces a Mixture-of-Experts architecture with patch-level adaptation for Zero-Shot Anomaly Detection using CLIP, addressing patch-agnostic limitations in existing methods.
Details
Motivation: Existing Zero-Shot Anomaly Detection methods using CLIP have patch-agnostic designs that process all patches monolithically without considering their unique characteristics, limiting specialization while preserving generalization.
Method: Proposes MoECLIP with Mixture-of-Experts architecture that dynamically routes each image patch to specialized Low-Rank Adaptation (LoRA) experts based on patch characteristics. Includes Frozen Orthogonal Feature Separation (FOFS) to orthogonally separate input features and simplex ETF loss to regulate expert outputs for maximally equiangular representations.
Result: Comprehensive experiments across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods.
Conclusion: MoECLIP successfully addresses patch-agnostic limitations in ZSAD by enabling patch-level adaptation while preserving CLIP’s generalization capabilities through specialized expert routing and orthogonal feature separation.
Abstract: The CLIP model’s outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP’s powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.
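The per-patch expert routing at the heart of MoECLIP can be sketched generically: a router scores each patch embedding, and the selected LoRA expert adds its low-rank update to the frozen projection. Names, top-1 routing, and the zero-initialized B matrices are illustrative assumptions; the paper's router and its FOFS/ETF regularizers are not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_lora_forward(x, W, experts, router):
    """Route patch embedding x to its top-1 LoRA expert: the frozen
    projection W is shared, and the chosen expert (A, B) adds its
    low-rank update B @ A @ x."""
    gate = softmax(router @ x)              # one logit per expert
    k = int(gate.argmax())                  # top-1 routing
    A, B = experts[k]
    return W @ x + B @ (A @ x), k

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W = np.eye(4)                               # stand-in for a frozen CLIP weight
# Standard LoRA init: random A, zero B, so updates start from the frozen path.
experts = [(rng.normal(size=(2, 4)), np.zeros((4, 2))) for _ in range(3)]
router = rng.normal(size=(3, 4))
y, k = moe_lora_forward(x, W, experts, router)
print(np.allclose(y, x))                    # True: zero B leaves W x unchanged
```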
[199] AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis
Maryam Heidari, Nantheera Anantrasirichai, Steven Walker, Rahul Bhatnagar, Alin Achim
Main category: cs.CV
TL;DR: AWDiff is a diffusion-based augmentation framework for lung ultrasound that uses wavelet transforms to preserve fine diagnostic features and semantic conditioning with BioMedCLIP for clinical relevance.
Details
Motivation: Lung ultrasound has limited data availability for machine learning, and existing generative augmentation methods (GANs, diffusion models) lose subtle diagnostic cues like B-lines and pleural irregularities due to resolution reduction.
Method: Proposes A trous Wavelet Diffusion (AWDiff) that integrates the a trous wavelet transform to preserve fine-scale structures without destructive downsampling, and uses semantic conditioning with BioMedCLIP (a vision-language foundation model) to enforce alignment with clinically meaningful labels.
Result: On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
Conclusion: AWDiff effectively addresses the data scarcity problem in medical imaging by preserving diagnostic features while generating clinically relevant augmented data.
Abstract: Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion based augmentation framework that integrates the a trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision language foundation model trained on large scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
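The à trous ("with holes") transform at the core of AWDiff smooths with an increasingly dilated kernel instead of downsampling, so every detail plane keeps full resolution. A minimal 1-D sketch using the common B3-spline kernel (the paper applies the transform inside a diffusion framework; this only illustrates the transform itself):

```python
import numpy as np

def atrous_decompose(signal, levels=3):
    """1-D a trous (stationary) wavelet transform: smooth with an
    increasingly dilated B3-spline kernel and keep detail = prev - smooth.
    No downsampling, so every plane has the signal's full length."""
    h = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0  # B3-spline kernel
    c = signal.astype(float)
    details = []
    for j in range(levels):
        # dilate the kernel: insert 2**j - 1 zeros between taps
        k = np.zeros(4 * 2**j + 1)
        k[::2**j] = h
        smooth = np.convolve(np.pad(c, len(k) // 2, mode="reflect"),
                             k, mode="valid")
        details.append(c - smooth)
        c = smooth
    return details, c  # detail planes + residual smooth

x = np.sin(np.linspace(0, 4 * np.pi, 64))
details, residual = atrous_decompose(x)
recon = residual + sum(details)     # exact additive reconstruction
print("max reconstruction error:", float(np.abs(recon - x).max()))
```

Because reconstruction is a plain sum, fine structures such as B-lines survive in the detail planes rather than being averaged away by downsampling.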
[200] Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin
Main category: cs.CV
TL;DR: RL3DEdit uses reinforcement learning with 3D foundation model rewards to achieve multi-view consistent 3D editing from 2D diffusion priors
Details
Motivation: Maintaining multi-view consistency in 3D editing is challenging, and supervised fine-tuning is infeasible due to the lack of 3D-consistent editing paired data. The observation that verifying 3D consistency is easier than generating it makes RL a natural solution.
Method: Proposes RL3DEdit, a single-pass RL framework that uses rewards from the 3D foundation model VGGT. The method feeds edited images to VGGT and uses output confidence maps and pose estimation errors as reward signals to anchor 2D editing priors onto a 3D-consistent manifold.
Result: Extensive experiments show RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency.
Conclusion: RL3DEdit effectively addresses the multi-view consistency challenge in 3D editing by leveraging RL optimization with 3D foundation model rewards, offering a promising solution for 3D content editing.
Abstract: Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose \textbf{RL3DEdit}, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT’s robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
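The reward construction can be illustrated schematically. The function below is a hypothetical stand-in: it assumes precomputed confidence maps and per-view pose errors rather than calling VGGT, and the weighting is invented.

```python
import numpy as np

def consistency_reward(conf_maps, pose_errors, w_pose=1.0):
    """Hypothetical reward combining a 3D foundation model's outputs:
    high reconstruction confidence over the edited views is rewarded,
    pose-estimation error across views is penalised.
    conf_maps: (V, H, W) confidences in [0, 1]; pose_errors: (V,)."""
    conf_term = float(np.mean(conf_maps))   # mean reconstruction confidence
    pose_term = float(np.mean(pose_errors)) # mean pose error across views
    return conf_term - w_pose * pose_term

# a consistent edit: high confidence, no pose drift
good = consistency_reward(np.full((4, 8, 8), 0.9), np.zeros(4))
# an inconsistent edit: low confidence, large pose drift
bad = consistency_reward(np.full((4, 8, 8), 0.3), np.full(4, 0.5))
print(good, bad)
```

The point is only that consistency is cheap to score once a 3D model has processed the edited views, which is what makes RL viable where paired supervision is not.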
[201] Kling-MotionControl Technical Report
Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Kang He, Xu He, Jingyun Hua, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Fan Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Tiancheng Wen, Zhiyong Wu, Haoxian Zhang, Runze Zhao, Yuanxing Zhang, Yan Zhou
Main category: cs.CV
TL;DR: Kling-MotionControl is a unified DiT-based framework for robust, precise, and expressive holistic character animation that orchestrates heterogeneous motion representations for body, face, and hands while ensuring cross-identity generalization and appearance preservation.
Details
Motivation: The paper addresses the challenge of generating lifelike character animation by transferring motion dynamics from driving videos to reference images, aiming to achieve high-fidelity results with robust cross-identity generalization while maintaining precise appearance preservation.
Method: The method uses a unified DiT-based framework with a divide-and-conquer strategy that orchestrates heterogeneous motion representations for body, face, and hands. It incorporates adaptive identity-agnostic learning for cross-identity generalization, meticulous identity injection and fusion designs for appearance preservation, a subject library mechanism for comprehensive reference contexts, and multi-stage distillation for acceleration (10x speedup).
Result: Human preference evaluations demonstrate superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence.
Conclusion: Kling-MotionControl establishes itself as a robust solution for high-quality, controllable, and lifelike character animation with intelligent semantic motion understanding and precise text responsiveness.
Abstract: Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.
[202] Conditioned Activation Transport for T2I Safety Steering
Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński, Tomasz Trzciński, Franziska Boenisch, Adam Dziedzic
Main category: cs.CV
TL;DR: CAT framework uses geometry-based conditioning and nonlinear transport maps to steer T2I model activations only for unsafe prompts, reducing toxic content while preserving image quality for benign prompts.
Details
Motivation: Current T2I models generate unsafe/toxic content; existing activation steering methods degrade image quality for benign prompts; need for targeted intervention that only affects unsafe content generation.
Method: Created SafeSteerDataset (2300 safe/unsafe prompt pairs); proposed Conditioned Activation Transport (CAT) with geometry-based conditioning and nonlinear transport maps that activate only within unsafe activation regions.
Result: CAT reduces Attack Success Rate significantly while maintaining image fidelity compared to unsteered generations; generalizes effectively across Z-Image and Infinity architectures.
Conclusion: CAT provides effective inference-time intervention for T2I safety without compromising image quality for benign prompts, offering a practical solution to the safety-fidelity trade-off.
Abstract: Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.
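The conditioning idea (apply a transport map only inside the unsafe activation region) can be sketched as follows. The centroid-distance test and the interpolation map are illustrative stand-ins, not the paper's geometry-based mechanism or nonlinear transport maps.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
unsafe_mean = np.ones(d)    # stand-in centroid of unsafe activations
safe_mean = -np.ones(d)     # stand-in centroid of safe activations

def conditioned_transport(h, radius=2.0):
    """Steer activation h toward the safe region only if it lies within
    `radius` of the unsafe centroid; benign activations pass unchanged.
    The 'transport map' here is a simple interpolation placeholder."""
    if np.linalg.norm(h - unsafe_mean) <= radius:
        return 0.5 * h + 0.5 * safe_mean  # a nonlinear map would go here
    return h                              # no interference with benign input

benign = safe_mean + 0.1 * rng.normal(size=d)
unsafe = unsafe_mean + 0.1 * rng.normal(size=d)
print(np.allclose(conditioned_transport(benign), benign))  # benign untouched
```

Gating the intervention on where the activation sits is what avoids the quality degradation that unconditional linear steering causes on benign prompts.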
[203] Chain of World: World Model Thinking in Latent Motion
Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma
Main category: cs.CV
TL;DR: CoWVLA introduces a “Chain of World” paradigm that unifies world-model temporal reasoning with disentangled latent motion representation for Vision-Language-Action models, achieving better performance and efficiency on robotic benchmarks.
Details
Motivation: Current VLA models either waste capacity reconstructing redundant backgrounds (world-model VLAs) or lack temporally continuous dynamic modeling and world knowledge (latent-action VLAs). There's a need to combine the benefits of both approaches.
Method: Uses a pretrained video VAE as a latent motion extractor to factorize video segments into structure and motion latents. The VLA learns from an instruction and initial frame to infer a continuous latent motion chain and predict the terminal frame. Co-fine-tuning aligns latent dynamics with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder.
Result: Extensive experiments on robotic simulation benchmarks show CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency.
Conclusion: CoWVLA preserves world-model benefits of temporal reasoning and world knowledge while retaining compactness and interpretability of latent actions, enabling efficient visuomotor learning as a more effective VLA pretraining paradigm.
Abstract: Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new “Chain of World” paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment’s terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
[204] ProSMA-UNet: Decoder Conditioning for Proximal-Sparse Skip Feature Selection
Chun-Wun Cheng, Yanqi Cheng, Peiyuan Jing, Guang Yang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
Main category: cs.CV
TL;DR: ProSMA-UNet improves medical image segmentation by replacing traditional skip connections with sparse multi-scale attention gates that explicitly filter irrelevant features using learnable soft-thresholding and decoder-conditioned channel gating.
Details
Motivation: Traditional U-Net skip connections propagate both useful spatial details and harmful low-level noise/textures, especially problematic in low-contrast medical imaging. Existing attention gates only softly reweight features rather than explicitly removing irrelevant activations.
Method: ProSMA-UNet reformulates skip gating as decoder-conditioned sparse feature selection: 1) Uses lightweight depthwise dilated convolutions to create multi-scale compatibility fields capturing local/contextual relevance, 2) Enforces explicit sparsity via an ℓ1 proximal operator with learnable per-channel thresholds (a closed-form soft-thresholding gate), 3) Adds decoder-conditioned channel gating using global decoder context to suppress irrelevant channels.
Result: Extensive experiments on challenging 2D and 3D benchmarks show state-of-the-art performance, with particularly large gains (~20%) on difficult 3D segmentation tasks.
Conclusion: ProSMA-UNet effectively addresses skip connection limitations in medical image segmentation by introducing explicit sparse feature selection mechanisms that filter noise while preserving relevant spatial details, significantly improving performance especially on challenging 3D tasks.
Abstract: Medical image segmentation commonly relies on U-shaped encoder-decoder architectures such as U-Net, where skip connections preserve fine spatial detail by injecting high-resolution encoder features into the decoder. However, these skip pathways also propagate low-level textures, background clutter, and acquisition noise, allowing irrelevant information to bypass deeper semantic filtering – an issue that is particularly detrimental in low-contrast clinical imaging. Although attention gates have been introduced to address this limitation, they typically produce dense sigmoid masks that softly reweight features rather than explicitly removing irrelevant activations. We propose ProSMA-UNet (Proximal-Sparse Multi-Scale Attention U-Net), which reformulates skip gating as a decoder-conditioned sparse feature selection problem. ProSMA constructs a multi-scale compatibility field using lightweight depthwise dilated convolutions to capture relevance across local and contextual scales, then enforces explicit sparsity via an $\ell_1$ proximal operator with learnable per-channel thresholds, yielding a closed-form soft-thresholding gate that can remove noisy responses. To further suppress semantically irrelevant channels, ProSMA incorporates decoder-conditioned channel gating driven by global decoder context. Extensive experiments on challenging 2D and 3D benchmarks demonstrate state-of-the-art performance, with particularly large gains ($\approx20$%) on difficult 3D segmentation tasks. Project page: https://math-ml-x.github.io/ProSMA-UNet/
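The closed-form gate follows from the standard ℓ1 proximal operator, prox_{λ‖·‖₁}(x) = sign(x) · max(|x| − λ, 0). A minimal per-channel sketch (shapes and threshold values are illustrative, not learned as in the paper):

```python
import numpy as np

def soft_threshold_gate(features, thresholds):
    """Closed-form proximal operator of the l1 norm, applied per channel:
    prox_{lam*||.||_1}(x) = sign(x) * max(|x| - lam, 0).
    features: (C, H, W); thresholds: (C,), one learnable value per channel."""
    lam = thresholds[:, None, None]
    return np.sign(features) * np.maximum(np.abs(features) - lam, 0.0)

feats = np.array([[[0.05, -0.3], [1.2, -0.02]],   # channel 0: mostly noise
                  [[0.8, -0.9], [0.4, 0.0]]])     # channel 1: strong signal
lam = np.array([0.1, 0.2])
gated = soft_threshold_gate(feats, lam)
print(gated)  # sub-threshold activations are exactly zero, strong ones shrink
```

Unlike a sigmoid mask, this gate outputs exact zeros for sub-threshold responses, which is what "explicitly removing" irrelevant activations means here.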
[205] Specificity-aware reinforcement learning for fine-grained open-world classification
Samuele Angheben, Davide Berasi, Alessandro Conti, Elisa Ricci, Yiming Wang
Main category: cs.CV
TL;DR: SpeciaRL: A specificity-aware reinforcement learning framework for fine-tuning reasoning LMMs to produce both correct and specific predictions in open-world fine-grained image classification.
Details
Motivation: Reasoning Large Multimodal Models (LMMs) show strong visual understanding but produce overly generic predictions in fine-grained image classification under open-world settings. While models possess intrinsic fine-grained knowledge, balancing specificity (detailed predictions) with correctness remains a challenging, understudied problem.
Method: Proposes SpeciaRL, a specificity-aware reinforcement learning framework that fine-tunes reasoning LMMs using dynamic, verifier-based reward signals anchored to the best predictions within online rollouts. The approach promotes specificity while respecting model capabilities to prevent incorrect predictions.
Result: Out-of-domain experiments show SpeciaRL achieves the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification.
Conclusion: SpeciaRL effectively steers reasoning LMMs toward predictions that are both correct and specific in open-world fine-grained image classification, addressing the specificity-correctness trade-off challenge.
Abstract: Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model’s capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
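A toy illustration of rewarding specificity anchored to the best rollout: correctness here is membership in a ground-truth taxonomy path and specificity is depth along it. The taxonomy, scoring, and penalty value are invented for illustration and are not the paper's verifier.

```python
# Toy reward (not SpeciaRL's actual verifier): the ground truth is a
# coarse-to-fine taxonomy path, and deeper correct answers score higher.
TAXONOMY_PATH = ["bird", "finch", "house finch"]   # coarse -> fine

def specificity_reward(prediction, rollouts):
    """Reward correct predictions by depth, normalised against the most
    specific correct prediction seen in this group of rollouts."""
    if prediction not in TAXONOMY_PATH:
        return -1.0                                # incorrect: penalised
    best_depth = max((TAXONOMY_PATH.index(p) for p in rollouts
                      if p in TAXONOMY_PATH), default=0)
    depth = TAXONOMY_PATH.index(prediction)
    return depth / (best_depth + 1)                # relative to group's best

rollouts = ["bird", "finch", "sparrow"]
print(specificity_reward("finch", rollouts))   # specific and correct
print(specificity_reward("bird", rollouts))    # correct but generic
print(specificity_reward("sparrow", rollouts)) # incorrect
```

Anchoring to the group's best prediction is the key idea: the model is pushed to be as specific as its own rollouts show it can be, not beyond.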
[206] COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data – Generation Stochastic by Design
Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski
Main category: cs.CV
TL;DR: COP-GEN is a multimodal latent diffusion transformer for Earth observation data that models joint distributions of heterogeneous modalities (optical, radar, elevation) to enable flexible any-to-any conditional generation with uncertainty representation.
Details
Motivation: Earth observation applications need to integrate data from multiple sensors, but relationships between modalities are non-injective (identical conditioning can correspond to multiple plausible observations). Deterministic models collapse to conditional means and fail to represent the uncertainty and variability needed for tasks like data completion and cross-sensor translation.
Method: A multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at native spatial resolutions. Parameterizes cross-modal mappings as conditional distributions to enable any-to-any conditional generation without task-specific retraining.
Result: COP-GEN generates diverse yet physically consistent realizations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Captures meaningful cross-modal structure and systematically adapts output uncertainty as conditioning information increases.
Conclusion: Stochastic generative modeling is practically important for Earth observation, and evaluation should move beyond single-reference, pointwise metrics. The model enables zero-shot modality translation, spectral band infilling, and generation under partial/missing inputs.
Abstract: Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen
[207] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
Main category: cs.CV
TL;DR: UniG2U-Bench systematically evaluates when generation improves understanding in multimodal models, finding generation often degrades performance except in spatial intelligence, visual illusions, and multi-round reasoning tasks.
Details
Motivation: While unified multimodal models show strong generative capabilities, it's unclear when and whether generation actually improves understanding. Existing benchmarks lack systematic exploration of the specific tasks where generation facilitates understanding.
Method: Introduces UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding evaluation into 7 regimes and 30 subtasks requiring varying degrees of visual transformations. Evaluates over 30 models to analyze generation's impact on understanding.
Result: Three key findings: 1) Unified models generally underperform base VLMs, and Generate-then-Answer inference typically degrades performance; 2) Consistent enhancements emerge in spatial intelligence, visual illusions, and multi-round reasoning tasks; 3) Tasks with similar reasoning structures and models sharing architectures show correlated behaviors.
Conclusion: Generation-understanding coupling induces class-consistent inductive biases. More diverse training data and novel paradigms are needed to fully unlock unified multimodal modeling’s potential.
Abstract: Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
[208] DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer
Main category: cs.CV
TL;DR: DuoMo: A two-stage diffusion model approach for reconstructing globally consistent human motion in world-space coordinates from unconstrained videos with noisy/incomplete observations.
Details
Motivation: Reconstructing human motion from unconstrained videos requires balancing generalization from diverse/noisy inputs with maintaining global motion consistency. Existing methods struggle with this trade-off when dealing with incomplete or noisy observations.
Method: Factorizes motion learning into two diffusion models: 1) a camera-space model estimates motion from videos in camera coordinates, 2) a world-space model lifts this initial estimate into world coordinates and refines it for global consistency. Generates motion of mesh vertices directly, bypassing parametric models.
Result: State-of-the-art performance: 16% reduction in world-space reconstruction error on EMDB while maintaining low foot skating, and 30% reduction in world-space error on RICH dataset.
Conclusion: DuoMo effectively addresses the generalization-consistency trade-off in motion reconstruction from unconstrained videos, achieving superior performance through its two-stage diffusion approach that directly generates mesh vertex motions.
Abstract: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
[209] LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun
Main category: cs.CV
TL;DR: LoGeR is a novel architecture for long-context 3D reconstruction from video that processes streams in chunks with a hybrid memory system to maintain coherence across boundaries, enabling training on 128 frames and generalization to thousands.
Details
Motivation: Existing feedforward geometric models struggle with long videos due to quadratic attention complexity or limited memory in recurrent designs, creating a bottleneck for scaling dense 3D reconstruction to minutes-long sequences.
Method: Processes video streams in chunks with bidirectional priors for intra-chunk reasoning. Uses a hybrid memory module with a parametric Test-Time Training memory for global coordinate frame anchoring and non-parametric Sliding Window Attention for preserving context for precise alignment.
Result: Outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust globally consistent reconstruction on sequences up to 19k frames.
Conclusion: LoGeR successfully scales dense 3D reconstruction to extremely long sequences without post-optimization through its chunk-based processing and hybrid memory architecture.
Abstract: Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods–reducing ATE on KITTI by over 74%–and achieves robust, globally consistent reconstruction over unprecedented horizons.
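The non-parametric half of the hybrid memory restricts exact attention to a recent window, which is what keeps cost linear in sequence length. A generic sliding-window mask can be constructed as below (a standard SWA sketch, not LoGeR's implementation):

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """Causal sliding-window attention mask: token i may attend to
    tokens j with i - window < j <= i. True means 'attend'."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
print(mask.astype(int))
# each row has at most `window` True entries, so attention cost grows
# linearly with sequence length instead of quadratically
```

Older context falls outside the window, which is exactly why LoGeR pairs SWA with a parametric memory: the TTT component carries the global coordinate frame that the window has forgotten.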
[210] Beyond Language Modeling: An Exploration of Multimodal Pretraining
Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
Main category: cs.CV
TL;DR: The paper investigates native multimodal foundation models using the Transfusion framework (next-token prediction for language, diffusion for vision) trained on diverse data, finding RAE optimal for unified visual representation, synergy between visual and language data, emergent world modeling, and MoE enabling efficient multimodal scaling while addressing vision's greater data hunger.
Details
Motivation: To provide empirical clarity on native multimodal model design by isolating factors governing multimodal pretraining without language pretraining interference, advancing beyond language-only foundation models.
Method: Uses the Transfusion framework with next-token prediction for language and diffusion for vision, trained from scratch on diverse multimodal data (text, video, image-text pairs, action-conditioned video). Conducts controlled pretraining experiments with IsoFLOP analysis to compute scaling laws.
Result: Four key insights: 1) RAE provides optimal unified visual representation; 2) visual and language data are complementary; 3) unified pretraining leads to emergent world modeling; 4) MoE enables efficient multimodal scaling with modality specialization. Found vision is significantly more data-hungry than language.
Conclusion: MoE architecture harmonizes scaling asymmetry between vision (data-hungry) and language (capacity-hungry), paving way for truly unified multimodal models through empirical understanding of multimodal pretraining dynamics.
Abstract: The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
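The Transfusion-style objective the abstract adopts, next-token cross-entropy on text combined with a diffusion (noise-prediction) loss on image latents, can be sketched as below. The balancing weight `lam=5.0` and the plain MSE form are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def transfusion_step(text_logits, text_targets, noise_pred, noise_true, lam=5.0):
    """Combined objective in the spirit of Transfusion: language modeling
    loss on text tokens plus a diffusion loss on image latents."""
    # next-token prediction: softmax cross-entropy over the vocabulary
    probs = np.exp(text_logits - text_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    lm_loss = -np.log(probs[np.arange(len(text_targets)), text_targets]).mean()
    # diffusion loss: MSE between predicted and true noise on latents
    diff_loss = ((noise_pred - noise_true) ** 2).mean()
    return lm_loss + lam * diff_loss
```

A single backbone minimizes both terms on mixed batches, which is what makes the IsoFLOP comparison of the two modalities (and the observed data-hunger asymmetry) possible within one model.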
[211] CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan
Main category: cs.CV
TL;DR: SMC-CFG introduces sliding mode control to improve stability and semantic alignment in flow-based diffusion models, outperforming standard CFG across multiple text-to-image models.
Details
Motivation: Existing CFG methods in flow-based diffusion models suffer from instability, overshooting, and degraded semantic fidelity at large guidance scales due to reliance on linear control. The authors aim to develop a more stable nonlinear control approach.
Method: Proposes the CFG-Ctrl framework, reinterpreting CFG as control of a continuous-time generative flow. Introduces Sliding Mode Control CFG (SMC-CFG) with an exponential sliding mode surface over the semantic prediction error and a switching control term for nonlinear feedback-guided correction. Provides a Lyapunov stability analysis for theoretical finite-time convergence guarantees.
Result: Experiments on Stable Diffusion 3.5, Flux, and Qwen-Image show SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across wide range of guidance scales.
Conclusion: SMC-CFG provides a stable nonlinear control approach for CFG that improves semantic alignment and robustness in text-to-image generation, with theoretical convergence guarantees.
Abstract: Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang-21.github.io/CFG-Ctrl
[212] MIBURI: Towards Expressive Interactive Gesture Synthesis
M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
Main category: cs.CV
TL;DR: MIBURI is an online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue using LLM-based speech-text embeddings and hierarchical motion encoding.
Details
Motivation: Current LLM-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing ECA solutions produce rigid, low-diversity motions, while generative co-speech gesture methods depend on future context and have long run-times.
Method: Uses body-part aware gesture codecs to encode hierarchical motion details into multi-level discrete tokens. These tokens are autoregressively generated by a 2D causal framework conditioned on LLM-based speech-text embeddings, modeling temporal dynamics and part-level motion hierarchy in real time, with auxiliary objectives for expressiveness.
Result: Comparative evaluations demonstrate the causal and real-time approach produces natural and contextually aligned gestures against recent baselines, enabling online generation of expressive full-body gestures synchronized with speech.
Conclusion: MIBURI bridges the gap between LLM-based conversational agents and embodied interaction by enabling real-time generation of expressive gestures and facial expressions synchronized with spoken dialogue.
Abstract: Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
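The two-dimensional causal generation described above (causal over time, hierarchical over body parts) can be sketched as a nested loop. Everything here is invented for illustration: the level count, vocabulary size, and the placeholder "decoder" bear no relation to MIBURI's actual codecs or model; only the dependency structure is the point.

```python
import numpy as np

def generate_gestures(speech_emb, levels=3, vocab=16):
    """Toy causal generator over a part-level token hierarchy. Each
    timestep's token at level l is conditioned only on past timesteps and
    on coarser levels at the current step, so generation stays online and
    causal with respect to the speech stream."""
    T = len(speech_emb)
    tokens = np.zeros((T, levels), dtype=int)
    for t in range(T):                     # causal over time
        for l in range(levels):            # coarse-to-fine over body parts
            context = speech_emb[t] + tokens[:t, :].sum() + tokens[t, :l].sum()
            tokens[t, l] = int(context) % vocab   # placeholder "decoder"
    return tokens

tokens = generate_gestures(np.arange(4.0))
```

Because no future speech frames are read, such a generator can run synchronously with a live dialogue stream, which is the property the abstract emphasizes over offline co-speech methods.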
[213] Utonia: Toward One Encoder for All Point Clouds
Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao
Main category: cs.CV
TL;DR: Utonia is a unified self-supervised point transformer encoder trained across diverse 3D point cloud domains that learns consistent representations transferable across domains and beneficial for multimodal reasoning tasks.
Details
Motivation: The paper aims to create a single foundation model for 3D point clouds that can handle diverse domains (remote sensing, LiDAR, RGB-D, CAD models, video-derived point clouds) despite their different sensing geometries, densities, and priors, moving toward unified 3D understanding.
Method: Utonia uses a self-supervised point transformer encoder trained jointly across multiple 3D point cloud domains. The unified training approach allows the model to learn consistent representation spaces that transfer across different sensing modalities and domains.
Result: The unified model improves perception capabilities and reveals emergent behaviors only observable when domains are trained jointly. Utonia representations benefit embodied and multimodal reasoning: they improve robotic manipulation when conditioning vision-language-action policies, and enhance spatial reasoning when integrated into vision-language models.
Conclusion: Utonia represents a step toward foundation models for sparse 3D data, with potential applications in AR/VR, robotics, and autonomous driving. The work demonstrates that unified training across diverse 3D domains yields transferable representations and emergent multimodal reasoning capabilities.
Abstract: We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
[214] D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Suhwan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.05684 returned HTTP 429 (rate limited).
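The HTTP 429 responses here and in the entries below are arXiv's rate limit on the export API. A minimal, illustrative retry wrapper with exponential backoff is sketched below; `fetch` is any callable that raises on a 429 response, and the 3-second base delay reflects arXiv's request that clients pause between export API calls.

```python
import time

def fetch_with_backoff(fetch, retries=5, base_delay=3.0, sleep=time.sleep):
    """Retry a fetch with exponential backoff: wait 3s, 6s, 12s, ...
    between attempts, re-raising only after the final attempt fails."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` as a parameter keeps the wrapper testable without actually waiting.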
[215] Weight Space Representation Learning on Diverse NeRF Architectures
Francesco Ballerini, Pierluigi Zama Ramirez, Luigi Di Stefano, Samuele Salti
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2502.09623 returned HTTP 429 (rate limited).
[216] Cycle-Consistent Multi-Graph Matching for Self-Supervised Annotation of C.Elegans
Christoph Karg, Sebastian Stricker, Lisa Hutschenreiter, Bogdan Savchynskyy, Dagmar Kainmueller
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.07348 returned HTTP 429 (rate limited).
[217] GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch
Abyad Enan, Mashrur Chowdhury
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.12567 returned HTTP 429 (rate limited).
[218] Language-guided Open-world Video Anomaly Detection under Weak Supervision
Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.13160 returned HTTP 429 (rate limited).
[219] Scale-wise Distillation of Diffusion Models
Nikita Starodubcev, Ilya Drobyshevskiy, Denis Kuznedelev, Artem Babenko, Dmitry Baranchuk
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.16397 returned HTTP 429 (rate limited).
[220] Differentially Private 2D Human Pose Estimation
Kaushik Bhargav Sivangi, Paul Henderson, Fani Deligianni
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2504.10190 returned HTTP 429 (rate limited).
[221] Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
Kwanyoung Kim, Sanghyun Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2505.17561 returned HTTP 429 (rate limited).
[222] SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
Aixuan Li, Mochu Xiang, Bosen Hou, Zhexiong Wan, Jing Zhang, Yuchao Dai
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2505.22499 returned HTTP 429 (rate limited).
[223] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.07177 returned HTTP 429 (rate limited).
[224] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.08862 returned HTTP 429 (rate limited).
[225] Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.02123 returned HTTP 429 (rate limited).
[226] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model
Anirud Aggarwal, Abhinav Shrivastava, Matthew Gwilliam
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.15682 returned HTTP 429 (rate limited).
[227] Partial Weakly-Supervised Oriented Object Detection
Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2507.02751 returned HTTP 429 (rate limited).
[228] CoBELa: Steering Transparent Generation via Concept Bottlenecks on Energy Landscapes
Sangwon Kim, Kyoungoh Lee, Jeyoun Dong, Kwang-Ju Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2507.08334 returned HTTP 429 (rate limited).
[229] DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter
Weihong Li, Shaohua Dong, Haonan Lu, Yanhao Zhang, Heng Fan, Libo Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2508.01592 returned HTTP 429 (rate limited).
[230] MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2508.18264 returned HTTP 429 (rate limited).
[231] Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?
Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.16654 returned HTTP 429 (rate limited).
[232] SiNGER: A Clearer Voice Distills Vision Transformers Further
Geunhyeok Yu, Sunjae Jeong, Yoonyoung Choi, Jaeseung Kim, Hyoseok Hwang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.20986 returned HTTP 429 (rate limited).
[233] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents
Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, Weijia Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.23141 returned HTTP 429 (rate limited).
[234] Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
Yuanyuan Gao, Yuning Gong, Yifei Liu, Li Jingfeng, Dingwen Zhang, Yanci Zhang, Dan Xu, Xiao Sun, Zhihang Zhong
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.24421 returned HTTP 429 (rate limited).
[235] EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
Ruixiao Dong, Zhendong Wang, Keli Liu, Li Li, Ying Chen, Kai Li, Daowen Li, Houqiang Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.26127 returned HTTP 429 (rate limited).
[236] TTT3R: 3D Reconstruction as Test-Time Training
Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.26645 returned HTTP 429 (rate limited).
[237] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, Zehuan Yuan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.00438 returned HTTP 429 (rate limited).
[238] Arbitrary Generative Video Interpolation
Guozhen Zhang, Haiguang Wang, Chunyu Wang, Yuan Zhou, Qinglin Lu, Limin Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.00578 returned HTTP 429 (rate limited).
[239] Interaction Field Matching: Overcoming Limitations of Electrostatic Models
Stepan I. Manukhov, Alexander Kolesov, Vladimir V. Palyulin, Alexander Korotin
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.02950 returned HTTP 429 (rate limited).
[240] Human3R: Everyone Everywhere All at Once
Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.06219 returned HTTP 429 (rate limited).
[241] MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition
Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Chenchen Liu, Xiang Chen
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2510.08976 returned HTTP 429, rate limited).
[242] Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Tong Xiao, Xin Xu, Zhenya Huang, Hongyu Gao, Quan Liu, Qi Liu, Enhong Chen
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2506.07218 returned HTTP 429, rate limited).
[243] Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, Jian Zhang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2510.11369 returned HTTP 429, rate limited).
[244] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Minji Kim, Taekyung Kim, Bohyung Han
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2510.13251 returned HTTP 429, rate limited).
[245] Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models
Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2510.13315 returned HTTP 429, rate limited).
[246] Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality
Giuseppe Lorenzo Catalano, Agata Marta Soccini
Main category: cs.CV
TL;DR: Unconditional diffusion model for reconstructing Martian terrain heightmaps with missing data, outperforming traditional interpolation methods.
Details
Motivation: Space exploration uses VR for mission planning and training, requiring accurate 3D planetary terrain representations. Martian heightmaps often have missing values, and current interpolation methods fail to preserve geometric coherence. Conditional deep learning methods used on Earth cannot be applied to Mars due to limited data.
Method: An unconditional diffusion model trained on an augmented dataset of 12,000 Martian heightmaps from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 resolution.
Result: Outperforms established methods (Inverse Distance Weighting, kriging, Navier-Stokes) on 1000 evaluation samples: 4-15% better on RMSE and 29-81% better on LPIPS perceptual similarity.
Conclusion: Unconditional diffusion models effectively reconstruct Martian terrain with missing data, providing more accurate and perceptually similar results than traditional interpolation techniques.
Abstract: Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enable the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12,000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and the Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.
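As context for the comparison, the Inverse Distance Weighting (IDW) baseline the diffusion model is evaluated against can be sketched in a few lines of NumPy. This is a generic illustration of IDW void-filling, not the paper's implementation; the toy heightmap and the `power` and `max_neighbors` values are assumptions.

```python
import numpy as np

def idw_fill(heightmap, mask, power=2.0, max_neighbors=32):
    """Fill missing cells (mask == True) of a 2D heightmap using
    Inverse Distance Weighting over the nearest known cells."""
    known = np.argwhere(~mask)            # (N, 2) row/col indices of valid cells
    values = heightmap[~mask]             # elevations at those cells
    filled = heightmap.copy()
    for r, c in np.argwhere(mask):
        d2 = ((known[:, 0] - r) ** 2 + (known[:, 1] - c) ** 2).astype(float)
        k = min(max_neighbors, len(d2) - 1)
        idx = np.argpartition(d2, k)[:k]  # indices of the k nearest known cells
        w = d2[idx] ** (-power / 2.0)     # weights = 1 / distance^power
        filled[r, c] = np.sum(w * values[idx]) / np.sum(w)
    return filled

# Toy 8x8 "terrain" (a tilted plane) with a 2x2 hole of missing values
hm = np.fromfunction(lambda r, c: r + 0.5 * c, (8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[3:5, 3:5] = True
out = idw_fill(hm, mask)
```

On a smooth surface like this, IDW roughly recovers the missing elevations; on real Martian relief it tends to smear sharp features, which is the geometric-coherence failure the paper's diffusion model targets.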
[247] CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram
Alvee Hassan, Rusab Sarmun, Muhammad E. H. Chowdhury, M Murugappan, Abdulrahman Alqahtani, Balamurugan Balusamy, Sohaib Bassam Zoghoul
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2510.27315 returned HTTP 429, rate limited).
[248] MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2511.01266 returned HTTP 429, rate limited).
[249] ConEQsA: Concurrent and Asynchronous Embodied Questions Scheduling and Answering
Haisheng Wang, Dong Liu, Weiming Zhi
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2509.11663 returned HTTP 429, rate limited).
[250] Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
Zitang Sun, Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, Takeshi Ohashi
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2511.14197 returned HTTP 429, rate limited).
[251] Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Yu Zhang, Jingyi Liu, Yiwei Shi, Qi Zhang, Duoqian Miao, Changwei Wang, Longbing Cao
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2511.23334 returned HTTP 429, rate limited).
[252] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
Sidan Zhu, Hongteng Xu, Dixin Luo
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2512.04426 returned HTTP 429, rate limited).
[253] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2512.10571 returned HTTP 429, rate limited).
[254] CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images
Vidit Agrawal, John Peters, Tyler N. Thompson, Mohammad Vali Sanian, Chau Pham, Nikita Moshkov, Arshad Kazi, Aditya Pillai, Jack Freeman, Byunguk Kang, Samouil L. Farhi, Ernest Fraenkel, Ron Stewart, Lassi Paavolainen, Bryan A. Plummer, Juan C. Caicedo
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2512.20833 returned HTTP 429, rate limited).
[255] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2601.04453 returned HTTP 429, rate limited).
[256] Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling
Shuyang Xiang, Hao Guan
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2601.09566 returned HTTP 429, rate limited).
[257] Graph Recognition via Subgraph Prediction
André Eberhard, Gerhard Neumann, Pascal Friederich
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2601.15133 returned HTTP 429, rate limited).
[258] MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos
Yangyi Cao, Yuanhang Li, Lan Chen, Qi Mao
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.02123 returned HTTP 429, rate limited).
[259] VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.07801 returned HTTP 429, rate limited).
[260] WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning
Mert Sonmezer, Serge Vasylechko, Duygu Atasoy, Seyda Ertekin, Sila Kurugol
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.07872 returned HTTP 429, rate limited).
[261] The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation
Suman Kunwar
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.10500 returned HTTP 429, rate limited).
[262] EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Nils Lehmann, Yi Wang, Zhitong Xiong, Xiaoxiang Zhu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.12177 returned HTTP 429, rate limited).
[263] CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
Yu Li, Yujun Cai, Chi Zhang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.18936 returned HTTP 429, rate limited).
[264] From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, Kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.20630 returned HTTP 429, rate limited).
[265] Training-Free Multi-Concept Image Editing
Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.20839 returned HTTP 429, rate limited).
[266] Uni-Animator: Towards Unified Visual Colorization
Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.23191 returned HTTP 429, rate limited).
[267] 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
Haowen Zhu, Ning Yin, Xiaogen Zhou
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.23652 returned HTTP 429, rate limited).
[268] APPO: Attention-guided Perception Policy Optimization for Video Reasoning
Henghui Du, Chang Zhou, Xi Chen, Di Hu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2602.23823 returned HTTP 429, rate limited).
[269] A Novel Evolutionary Method for Automated Skull-Face Overlay in Computer-Aided Craniofacial Superimposition
Práxedes Martínez-Moreno, Andrea Valsecchi, Pablo Mesejo, Pilar Navarro-Ramírez, Valentino Lugli, Sergio Damas
Main category: cs.CV
TL;DR: Lilium is an automated evolutionary method that improves skull-face overlay accuracy in forensic identification by explicitly modeling soft-tissue variability using a 3D cone-based representation optimized via Differential Evolution.
Details
Motivation: Current skull-face overlay methods in forensic craniofacial superimposition suffer from accuracy issues due to individual variability in soft-tissue thickness, which introduces significant uncertainty into the alignment process.
Method: Lilium uses a 3D cone-based representation to model soft-tissue variability, optimized via a Differential Evolution algorithm. It enforces anatomical plausibility through constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism.
Result: Lilium outperforms the state-of-the-art method in terms of both accuracy and robustness for skull-face overlay in forensic identification.
Conclusion: The proposed evolutionary approach successfully addresses soft-tissue variability challenges in craniofacial superimposition, providing a more accurate and robust automated solution for forensic identification.
Abstract: Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by cranial and facial landmarks’ correspondence. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. This emulation of the usual forensic practitioners’ approach leads Lilium to outperform the state-of-the-art method in terms of both accuracy and robustness.
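The Differential Evolution machinery behind Lilium can be illustrated with a toy 2D stand-in for skull-face overlay: searching for the similarity transform (scale, rotation, translation) that minimizes a landmark-matching cost via SciPy's `differential_evolution`. The landmarks, bounds, and cost function below are invented for illustration; the actual method optimizes a 3D cone-based soft-tissue model under several additional constraints.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Hypothetical "skull" landmarks; the real method aligns a 3D skull to a 2D photo
skull = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

def transform(pts, s, th, tx, ty):
    # Similarity transform: scale s, rotation th (radians), translation (tx, ty)
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    return s * pts @ R.T + np.array([tx, ty])

# Synthetic "facial" landmarks generated by a known overlay (s=1.5, th=0.3)
face = transform(skull, 1.5, 0.3, 2.0, -1.0)

def overlay_error(p):
    # Landmark-matching cost: mean distance between projected skull
    # landmarks and the corresponding facial landmarks
    proj = transform(skull, *p)
    return np.mean(np.linalg.norm(proj - face, axis=1))

bounds = [(0.1, 3.0), (-np.pi, np.pi), (-5.0, 5.0), (-5.0, 5.0)]
res = differential_evolution(overlay_error, bounds, seed=0, tol=1e-8)
s_hat, th_hat, tx_hat, ty_hat = res.x
```

Because the population-based search needs only cost evaluations, the same loop accepts non-differentiable penalty terms (skull containment, region parallelism), which is what makes Differential Evolution a natural fit for this kind of constrained overlay problem.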
[270] ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Xiaolong Zeng, Yitong Yu, Shiyao Xiong, Jinhua Hao, Ming Sun, Chao Zhou, Bin Wang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.00906 returned HTTP 429, rate limited).
[271] Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications
Md. Adnanul Islam, Wasimul Karim, Md Mahbub Alam, Subhey Sadi Rahman, Md. Abdur Rahman, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Kheng Cher Yeo, Deepika Mathur, Sami Azam
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.00931 returned HTTP 429, rate limited).
[272] Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.00947 returned HTTP 429, rate limited).
[273] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
Jiangshan Wang, Kang Zhao, Jiayi Guo, Jiayu Wang, Hang Guo, Chenyang Zhu, Xiu Li, Xiangyu Yue
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.00976 returned HTTP 429, rate limited).
[274] Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
Yunguan Fu, Wenjia Bai, Wen Yan, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.01073 returned HTTP 429, rate limited).
[275] IDER: IDempotent Experience Replay for Reliable Continual Learning
Zhanwang Liu, Yuting Li, Haoyuan Gao, Yexin Li, Linghe Kong, Lichao Sun, Weiran Huang
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.00624 returned HTTP 429, rate limited).
[276] HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
Jiashu Li, Xumeng Han, Zhaoyang Wei, Zipeng Wang, Kuiran Wang, Guorong Li, Zhenjun Han, Jianbin Jiao
Main category: cs.CV
TL;DR: Summary unavailable (arXiv API request for 2603.01099 returned HTTP 429, rate limited).
[277] Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Junwei Zeng, Dong Liang, Sheng-Jun Huang, Kun Zhan, Songcan Chen
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.01398 returned HTTP 429 (rate limited).
[278] UETrack: A Unified and Efficient Framework for Single Object Tracking
Ben Kang, Jie Zhao, Xin Chen, Wanting Geng, Bin Zhang, Lu Zhang, Dong Wang, Huchuan Lu
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.01412 returned HTTP 429 (rate limited).
[279] Value Gradient Guidance for Flow Matching Alignment
Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, Dinghuai Zhang
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2512.05116 returned HTTP 429 (rate limited).
[280] FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Hanxiao Wang, Yuan-Chen Guo, Ying-Tian Liu, Zi-Xin Zou, Biao Zhang, Weize Quan, Ding Liang, Yan-Pei Cao, Dong-Ming Yan
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.01515 returned HTTP 429 (rate limited).
[281] InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.01586 returned HTTP 429 (rate limited).
[282] PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Xianqi Wang, Hao Yang, Hangtian Wang, Junda Cheng, Gangwei Xu, Min Lin, Xin Yang
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.01650 returned HTTP 429 (rate limited).
[283] Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
Minseok Seo, Wonjun Lee, Jaehyuk Jang, Changick Kim
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.01765 returned HTTP 429 (rate limited).
[284] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.02133 returned HTTP 429 (rate limited).
[285] OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
Chong Xia, Fangfu Liu, Yule Wang, Yize Pang, Yueqi Duan
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.02134 returned HTTP 429 (rate limited).
[286] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Yichen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu, Jiatong Li, Quanwei Zhang, Qiang Lyu, Lanqing Guo, Shilei Wen, Weiqiang Wang, Pheng-Ann Heng
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2603.02210 returned HTTP 429 (rate limited).
[287] SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction
Zhenghao Peng, Yuxin Liu, Bolei Zhou
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2506.23316 returned HTTP 429 (rate limited).
[288] InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2507.17520 returned HTTP 429 (rate limited).
[289] PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization
Siyan Dong, Zijun Wang, Lulu Cai, Yi Ma, Yanchao Yang
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2509.24236 returned HTTP 429 (rate limited).
[290] Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu, Lan Xu
Main category: cs.CV
TL;DR: Details unavailable; the arXiv API request for 2511.01294 returned HTTP 429 (rate limited).
cs.AI
[291] Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
Jungwon Seo, Ferhat Ozgur Catak, Chunming Rong, Jaeyeon Jang
Main category: cs.AI
TL;DR: Federated Inference (FI) is a collaborative paradigm in which independently trained models cooperate at inference time without sharing data or parameters; its feasibility hinges on inference-time privacy preservation and meaningful collaborative performance gains.
Details
Motivation: Current federated learning focuses on training-time collaboration, but there's a need for a unified understanding of inference-time collaboration where models can work together without sharing private data or parameters.
Method: Formalizes FI as protected collaborative computation, analyzes core design dimensions, examines trade-offs between privacy constraints, non-IID data, and limited observability, and provides concrete instantiation with empirical analysis.
Result: Identifies recurring friction points in privacy-preserving inference, ensemble-based collaboration, and incentive alignment, showing FI exhibits system-level behaviors distinct from training-time federation or classical ensemble methods.
Conclusion: FI is a distinct collaborative paradigm complementary to federated learning, requiring new approaches for practical, scalable, and privacy-preserving collaborative inference systems.
Abstract: Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectives, a unified abstraction and system-level understanding of FI remain lacking. This paper positions FI as a distinct collaborative paradigm, complementary to federated learning, and identifies two fundamental requirements that govern its feasibility: inference-time privacy preservation and meaningful performance gains through collaboration. We formalize FI as a protected collaborative computation, analyze its core design dimensions, and examine the structural trade-offs that arise when privacy constraints, non-IID data, and limited observability are jointly imposed at inference time. Through a concrete instantiation and empirical analysis, we highlight recurring friction points in privacy-preserving inference, ensemble-based collaboration, and incentive alignment. Our findings suggest that FI exhibits system-level behaviors that cannot be directly inherited from training-time federation or classical ensemble methods. Overall, this work provides a unifying perspective on FI and outlines open challenges that must be addressed to enable practical, scalable, and privacy-preserving collaborative inference systems.
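The ensemble-based collaboration the abstract discusses can be sketched as a soft vote over per-party predictions, where only class probabilities (never data or parameters) cross the trust boundary. The weighting scheme and the numbers below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def federated_inference(prob_outputs, weights=None):
    """Aggregate class-probability vectors from independently owned models.
    Only predictions cross the trust boundary: no data or model
    parameters are exchanged between parties."""
    probs = np.asarray(prob_outputs, dtype=float)   # (n_models, n_classes)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()               # normalize contributions
    combined = weights @ probs                      # weighted soft vote
    return int(np.argmax(combined)), combined

# Three parties' local predictions for one query (illustrative numbers)
parties = [[0.7, 0.2, 0.1],
           [0.1, 0.6, 0.3],
           [0.6, 0.3, 0.1]]
label, dist = federated_inference(parties)
```

The paper's point is precisely that such classical ensembling is not sufficient under non-IID data and incentive constraints; this sketch only shows the baseline interaction pattern.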
[292] Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
MZ Naser, Ahmad Bani Awwad, Zoie McCreery, Radwa Eissa, Ahmad Naser, Gianluca Cusatis, Andrew Metcalf, Kapil Madathil, Jamal Abdalla, Venkatesh Kodur, Mohammad Reza Saeb
Main category: cs.AI
TL;DR: ERI is a comprehensive engineering instruction dataset spanning 9 fields, 55 subdomains, 7 intent types, and 3 difficulty tiers for training/evaluating engineering-capable LLMs and agents.
Details
Motivation: To address the lack of specialized benchmarks for evaluating engineering reasoning capabilities in LLMs, particularly for instruction tuning, routing, and agentic workflows in engineering applications.
Method: Created a taxonomy-driven dataset with 57,750 records across engineering domains, developed convergent validation protocol with cross-provider independence and multi-judge averaging, and evaluated 7 LLMs with statistical analysis.
Result: Frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieved mean scores >4.30/5, with statistically significant three-tier performance structure and hallucination risk bounded to 1.7% via validation protocol.
Conclusion: ERI enables reproducible evaluation of engineering reasoning in LLMs, reveals clear performance stratification, and provides validation methodology to address benchmark circularity concerns for engineering applications.
Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, and is crossed with seven intent types (i.e., definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We examined ERI via seven LLMs and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibited progressively higher failure rates and steeper performance degradation on graduate-level questions. To address circularity concerns inherent in LLM benchmarks, we developed a convergent validation protocol that leverages cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to empirically bound hallucination risk to 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness to enable reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.
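The multi-judge averaging step of the validation protocol amounts to averaging independent judges' ratings per item; a minimal sketch, with hypothetical judge names and scores (the protocol's actual judge set and scale handling are described only at a high level in the summary):

```python
from statistics import mean

def multi_judge_score(scores_by_judge):
    """Average independent judges' five-point ratings for one answer,
    the 'multi-judge averaging' step of the validation protocol."""
    return mean(scores_by_judge.values())

# Hypothetical ratings from three independent judge models
item_scores = {"judge_a": 4.5, "judge_b": 4.0, "judge_c": 4.4}
avg = multi_judge_score(item_scores)
```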
[293] SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
Varun Pratap Bhardwaj
Main category: cs.AI
TL;DR: SuperLocalMemory is a local-first memory system for multi-agent AI that defends against memory poisoning attacks through architectural isolation and Bayesian trust scoring, with adaptive learning-to-rank for personalized retrieval without cloud dependencies.
Details
Motivation: As AI agents increasingly rely on persistent memory, cloud-based memory systems create centralized attack surfaces where poisoned memories can propagate across sessions and users, posing significant security risks demonstrated in documented attacks against production systems.
Method: Combines SQLite-backed storage with FTS5 full-text search, Leiden-based knowledge graph clustering, event-driven coordination with per-agent provenance, and adaptive re-ranking framework that learns user preferences through three-layer behavioral analysis (cross-project technology preferences, project context detection, workflow pattern mining).
Result: Evaluation shows 10.6ms median search latency, zero concurrency errors under 10 simultaneous agents, trust separation (gap=0.90) with 72% trust degradation for sleeper attacks, and 104% improvement in NDCG@5 when adaptive re-ranking is enabled.
Conclusion: SuperLocalMemory provides a secure, local-first memory system for multi-agent AI that defends against memory poisoning while offering personalized retrieval, with open-source availability and integration with 17+ development tools via Model Context Protocol.
Abstract: We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank – all without cloud dependencies or LLM inference calls. As AI agents increasingly rely on persistent memory, cloud-based memory systems create centralized attack surfaces where poisoned memories propagate across sessions and users – a threat demonstrated in documented attacks against production systems. Our architecture combines SQLite-backed storage with FTS5 full-text search, Leiden-based knowledge graph clustering, an event-driven coordination layer with per-agent provenance, and an adaptive re-ranking framework that learns user preferences through three-layer behavioral analysis (cross-project technology preferences, project context detection, and workflow pattern mining). Evaluation across seven benchmark dimensions demonstrates 10.6ms median search latency, zero concurrency errors under 10 simultaneous agents, trust separation (gap = 0.90) with 72% trust degradation for sleeper attacks, and 104% improvement in NDCG@5 when adaptive re-ranking is enabled. Behavioral data is isolated in a separate database with GDPR Article 17 erasure support. SuperLocalMemory is open-source (MIT) and integrates with 17+ development tools via Model Context Protocol.
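The storage layer described above pairs SQLite's FTS5 full-text search with Bayesian trust scoring. A minimal sketch under stated assumptions: the table schema is invented, the Beta-posterior mean is a generic Bayesian trust estimate (not necessarily the paper's exact formula), and it requires an SQLite build with FTS5 enabled, which CPython's bundled sqlite3 normally is.

```python
import sqlite3

# Minimal memory store with an FTS5 full-text index (schema hypothetical).
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(agent, content)")
db.executemany(
    "INSERT INTO memories VALUES (?, ?)",
    [("agent-a", "user prefers pytest over unittest"),
     ("agent-b", "project uses sqlite for local storage")],
)
hits = db.execute(
    "SELECT agent, content FROM memories WHERE memories MATCH ?",
    ("sqlite",),
).fetchall()

def trust_score(successes, failures, alpha=1.0, beta=1.0):
    """Beta-posterior mean as a simple Bayesian trust estimate per agent;
    memories from an agent whose contributions keep failing validation
    see its score degrade, the intuition behind the poisoning defense."""
    return (successes + alpha) / (successes + failures + alpha + beta)
```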
[294] Estimating Visual Attribute Effects in Advertising from Observational Data: A Deepfake-Informed Double Machine Learning Approach
Yizhi Liu, Balaji Padmanabhan, Siva Viswanathan
Main category: cs.AI
TL;DR: DICE-DML framework uses generative AI to disentangle visual attributes from confounders for causal inference in images, achieving 73-97% RMSE reduction over standard methods.
Details
Motivation: Marketers lack rigorous methods for understanding how visual attributes causally affect consumer engagement, especially when treatments (like skin tone) are embedded within images alongside confounding variables.
Method: DICE-DML combines: (1) deepfake-generated image pairs to isolate treatment variation, (2) DICE-Diff adversarial learning on paired difference vectors where background signals cancel, and (3) orthogonal projection to geometrically remove treatment-axis components.
Result: In simulations, DICE-DML reduces RMSE by 73-97% vs standard DML, with 97.5% improvement at null effect point. Applied to 232,089 Instagram posts, it achieves valid confounding control (R²=0.63) vs invalid standard DML results, estimating marginally significant negative effect of darker skin tone (-522 likes; p=0.062).
Conclusion: DICE-DML provides a principled approach for causal inference with visual data when treatments and confounders coexist within images, addressing fundamental methodological challenges in visual attribute analysis.
Abstract: Digital advertising increasingly relies on visual content, yet marketers lack rigorous methods for understanding how specific visual attributes causally affect consumer engagement. This paper addresses a fundamental methodological challenge: estimating causal effects when the treatment, such as a model’s skin tone, is an attribute embedded within the image itself. Standard approaches like Double Machine Learning (DML) fail in this setting because vision encoders entangle treatment information with confounding variables, producing severely biased estimates. We develop DICE-DML (Deepfake-Informed Control Encoder for Double Machine Learning), a framework that leverages generative AI to disentangle treatment from confounders. The approach combines three mechanisms: (1) deepfake-generated image pairs that isolate treatment variation; (2) DICE-Diff adversarial learning on paired difference vectors, where background signals cancel to reveal pure treatment fingerprints; and (3) orthogonal projection that geometrically removes treatment-axis components. In simulations with known ground truth, DICE-DML reduces root mean squared error by 73-97% compared to standard DML, with the strongest improvement (97.5%) at the null effect point, demonstrating robust Type I error control. Applying DICE-DML to 232,089 Instagram influencer posts, we estimate the causal effect of skin tone on engagement. Standard DML produces diagnostically invalid results (negative outcome R^2), while DICE-DML achieves valid confounding control (R^2 = 0.63) and estimates a marginally significant negative effect of darker skin tone (-522 likes; p = 0.062), substantially smaller than the biased standard estimate. Our framework provides a principled approach for causal inference with visual data when treatments and confounders coexist within images.
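Mechanism (3), the orthogonal projection, amounts to removing from each image embedding the component lying along an estimated treatment direction, so nuisance models fit on the result can no longer "see" the treatment. A sketch with synthetic vectors; the treatment axis here is an assumption, not the paper's learned direction:

```python
import numpy as np

def remove_treatment_axis(embeddings, treatment_axis):
    """Project each embedding onto the orthogonal complement of the
    treatment direction: subtract the component along the (unit-norm)
    treatment axis from every row."""
    t = treatment_axis / np.linalg.norm(treatment_axis)
    return embeddings - np.outer(embeddings @ t, t)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # synthetic image embeddings
axis = np.array([1.0, 0.0, 0.0])     # assumed treatment direction
X_perp = remove_treatment_axis(X, axis)
```

After projection, every embedding is exactly orthogonal to the treatment axis while the remaining directions are untouched.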
[295] Can machines be uncertain?
Luis Rosa
Main category: cs.AI
TL;DR: The paper examines how AI systems can represent and realize states of uncertainty, distinguishing between epistemic and subjective uncertainty, and proposing that some uncertainty states are interrogative attitudes with question-based content.
Details
Motivation: To understand whether and how AI systems can genuinely experience or represent states of uncertainty, moving beyond mere probabilistic representations to examine the philosophical and architectural foundations of uncertainty in AI.
Method: Adopts a functionalist and behavioral perspective to analyze how different AI architectures (symbolic, connectionist, hybrid) accommodate uncertainty, distinguishing between epistemic uncertainty (in data/information) and subjective uncertainty (system’s attitude).
Result: Identifies that some states of uncertainty are interrogative attitudes whose content is a question rather than a proposition, and distinguishes between distributed and discrete realizations of subjective uncertainty across different AI architectures.
Conclusion: AI systems can realize states of uncertainty in meaningful ways, with different architectures offering distinct approaches to representing both epistemic and subjective uncertainty, including question-based representations of uncertainty states.
Abstract: The paper investigates whether and how AI systems can realize states of uncertainty. By adopting a functionalist and behavioral perspective, it examines how symbolic, connectionist and hybrid architectures make room for uncertainty. The paper distinguishes between epistemic uncertainty, or uncertainty inherent in the data or information, and subjective uncertainty, or the system’s own attitude of being uncertain. It further distinguishes between distributed and discrete realizations of subjective uncertainty. A key contribution is the idea that some states of uncertainty are interrogative attitudes whose content is a question rather than a proposition.
[296] COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management
Dennis Gross
Main category: cs.AI
TL;DR: COOL-MC combines reinforcement learning with probabilistic model checking to verify and explain platelet inventory management policies, achieving low stockout (2.9%) and wastage (1.1%) probabilities while revealing the policy focuses on inventory age distribution.
Details
Motivation: Blood banks face critical inventory management challenges with perishable platelets (5-day shelf life), needing to balance costly wastage from overstocking against life-threatening shortages from understocking. While RL can learn effective policies, neural policies remain black boxes that hinder trust and adoption in safety-critical healthcare domains.
Method: Apply COOL-MC tool that combines RL with probabilistic model checking and explainable RL. Construct policy-induced discrete-time Markov chain (including only reachable states under trained policy to reduce memory usage), verify PCTL properties, and provide feature-level explanations including action reachability and counterfactual analysis.
Result: Trained policy achieves 2.9% stockout probability and 1.1% inventory-full (wastage) probability within 200-step horizon. Policy primarily attends to age distribution of inventory rather than other features like day of week or pending orders. Action reachability shows diverse replenishment strategy with most order quantities reached quickly, while several never selected. Counterfactual analysis shows replacing medium-large orders with smaller ones leaves safety probabilities nearly unchanged.
Conclusion: First formal verification and explanation of RL platelet inventory management policy demonstrates COOL-MC’s value for transparent, auditable decision-making in safety-critical healthcare supply chain domains, addressing trust barriers for neural policy adoption.
Abstract: Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC’s value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.
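The bounded-reachability probabilities reported above (e.g. stockout within a 200-step horizon) are the kind of PCTL query a model checker evaluates on the policy-induced DTMC. A toy three-state chain with invented transition probabilities, checked by forward distribution iteration rather than a real model checker:

```python
import numpy as np

# Toy policy-induced DTMC over inventory states (numbers illustrative):
# 0 = healthy, 1 = low, 2 = stockout (made absorbing for the query).
P = np.array([[0.90, 0.10, 0.00],
              [0.70, 0.25, 0.05],
              [0.00, 0.00, 1.00]])

def bounded_reach_prob(P, target, horizon, start=0):
    """P(reach `target` within `horizon` steps), i.e. the bounded
    PCTL reachability query P=? [F<=horizon target], computed by
    iterating the state distribution forward through the chain."""
    dist = np.zeros(len(P))
    dist[start] = 1.0
    for _ in range(horizon):
        dist = dist @ P
    return float(dist[target])
```

Because the target state is absorbing, the probability mass sitting there after `horizon` steps equals the within-horizon reachability probability, and it grows monotonically with the horizon.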
[297] EvoSkill: Automated Skill Discovery for Multi-Agent Systems
Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, Tu Vu
Main category: cs.AI
TL;DR: EvoSkill is a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis, improving agent performance on specialized tasks without model fine-tuning.
Details
Motivation: Current coding agents lack domain expertise for specialized tasks, and existing skill development approaches are either hand-crafted or optimize low-level artifacts that are tightly coupled to specific models and tasks.
Method: EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. It uses a Pareto frontier of agent programs for selection, retaining only skills that improve held-out validation performance while keeping the underlying model frozen.
Result: EvoSkill improved exact-match accuracy by 7.3% on OfficeQA (60.6% → 67.9%) and by 12.1% on SealQA (26.6% → 38.7%). Skills evolved on one task transferred zero-shot to another, improving accuracy by 5.3% without modification.
Conclusion: Skill-level optimization produces transferable capabilities beyond the training task, demonstrating that automated skill evolution can significantly enhance agent performance on specialized domains without model fine-tuning.
Abstract: Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through \textit{agent skills}: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g. prompts & code) that are tightly coupled to specific models and tasks. We introduce \textbf{EvoSkill}, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S.\ Treasury data, where it improves exact-match accuracy by \textbf{7.3%} (60.6% $\to$ 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a \textbf{12.1%} gain (26.6% $\to$ 38.7%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved on SealQA transfer zero-shot to BrowseComp, improving accuracy by \textbf{5.3%} without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
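The Pareto-frontier selection the abstract describes can be sketched as non-dominated filtering over candidate agent programs; the two objectives here, (validation accuracy, cost), and all metric values are hypothetical stand-ins for whatever EvoSkill actually tracks:

```python
def pareto_frontier(candidates):
    """Return the (accuracy, cost) pairs not dominated by any other
    candidate: a rival dominates if it is at least as accurate AND
    no more costly, without being identical."""
    frontier = []
    for acc, cost in candidates:
        dominated = any(
            a >= acc and c <= cost and (a, c) != (acc, cost)
            for a, c in candidates
        )
        if not dominated:
            frontier.append((acc, cost))
    return frontier

# Hypothetical (validation accuracy, cost) per agent program
programs = [(0.60, 10), (0.68, 12), (0.68, 15), (0.55, 8), (0.50, 20)]
kept = pareto_frontier(programs)
```

Under this rule, a skill edit survives only if no existing program is simultaneously at least as accurate and no more costly, which is how selection can retain improvements while the base model stays frozen.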
[298] VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings
Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Main category: cs.AI
TL;DR: VL-KGE integrates Vision-Language Models with knowledge graph embeddings to create unified multimodal representations for heterogeneous knowledge graphs, improving link prediction performance.
Details
Motivation: Real-world multimodal knowledge graphs contain entities with diverse modalities, but existing KGE methods are either unimodal or process modalities in isolation with weak cross-modal alignment. Vision-Language Models offer powerful cross-modal alignment capabilities that could enhance multimodal KGE.
Method: Proposes Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from Vision-Language Models with structured relational modeling to learn unified multimodal representations of knowledge graphs.
Result: VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks on WN9-IMG and two novel fine art MKGs (WikiArt-MKG-v1 and WikiArt-MKG-v2).
Conclusion: Vision-Language Models are valuable for multimodal knowledge graph embedding, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs by providing better cross-modal alignment.
Abstract: Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.
[299] Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory
Boqin Yuan, Yue Su, Kun Yao
Main category: cs.AI
TL;DR: Memory-augmented LLM agents: Retrieval method matters more than write strategy for performance, with raw chunk storage matching or beating sophisticated lossy alternatives.
Details
Motivation: To understand the relative importance of memory writing vs. retrieval strategies in memory-augmented LLM agents, as current memory pipelines may discard useful context that retrieval mechanisms fail to compensate for.
Method: Diagnostic framework analyzing performance across write strategies (raw chunks, Mem0-style fact extraction, MemGPT-style summarization) and retrieval methods (cosine, BM25, hybrid reranking) in a 3x3 study on the LoCoMo benchmark.
Result: Retrieval method is dominant factor: 20-point accuracy difference across retrieval methods vs. only 3-8 points across write strategies. Raw chunked storage (zero LLM calls) matches or outperforms expensive lossy alternatives. Performance breakdowns most often occur at retrieval stage rather than utilization.
Conclusion: Under current retrieval practices, improving retrieval quality yields larger gains than increasing write-time sophistication. Raw chunk storage is surprisingly effective and computationally cheaper.
Abstract: Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0-style fact extraction, MemGPT-style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write-time sophistication. Code is publicly available at https://github.com/boqiny/memory-probe.
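The two lexical/embedding retrieval methods the study crosses with write strategies can be illustrated over raw stored chunks; this is a toy sketch (bag-of-words cosine vs. a minimal BM25), not the paper's code, and the chunks and query are invented:

```python
# Toy comparison of cosine vs. BM25 retrieval over raw-chunk memory.
import math
from collections import Counter

def cosine_score(query, doc):
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for t in query.lower().split():
            df = sum(1 for d in tokenized if t in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

chunks = [  # "raw chunk" write strategy: conversation turns stored verbatim
    "Alice said she adopted a cat named Miso last spring",
    "Bob talked about his marathon training schedule",
]
query = "what pet did Alice adopt"
scores = bm25_scores(query, chunks)
best = max(range(len(chunks)), key=lambda i: scores[i])
print(chunks[best])
```

The study's finding is that swapping the scoring function in this retrieval step (cosine vs. BM25 vs. hybrid reranking) moves accuracy far more than changing how the chunks were written in the first place.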
[300] PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference
Rituraj Sharma, Weiyuan Chen, Noah Provenzano, Tu Vu
Main category: cs.AI
TL;DR: PRISM introduces a Process Reward Model-guided inference algorithm that uses step-level verification to improve reasoning in DEEPTHINK systems by guiding population refinement and solution aggregation.
Details
Motivation: Existing DEEPTHINK frameworks lack reliable correctness signals during inference, creating a bottleneck where deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns to additional compute.
Method: PRISM uses Process Reward Models (PRMs) for step-level verification to guide both population refinement and solution aggregation. It treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement.
Result: PRISM achieves 90.0% on AIME25, 75.4% on HMMT25, and 71.4% on GPQA Diamond with gpt-oss-20b, matching or exceeding gpt-oss-120b performance. It produces consistent net-directional correction during refinement and remains reliable with few initial correct candidates.
Conclusion: PRISM effectively addresses the population-enhancement bottleneck in DEEPTHINK systems through PRM-guided inference, enabling better reasoning performance while often lying on the compute-accuracy Pareto frontier.
Abstract: DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, which creates a population-enhancement bottleneck where deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns to additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
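The score-guided resampling step can be sketched as softmax-weighted resampling of candidate solutions by their PRM scores; the PRM here is a stub lookup table and the temperature value is an assumption, so this is a hedged illustration of the particle view rather than PRISM's implementation:

```python
# Sketch of score-guided resampling: candidates are particles weighted by a
# process-reward score; resampling concentrates mass on higher-scoring ones
# while replacement sampling preserves some diversity.
import math
import random

def resample(population, prm_score, temperature=0.5, rng=None):
    rng = rng or random.Random(0)
    scores = [prm_score(p) for p in population]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # draw a new population of the same size, with replacement
    return rng.choices(population, weights=probs, k=len(population))

# Stub PRM: fixed toy scores per candidate solution (entirely illustrative).
toy_scores = {"sol_A": 0.2, "sol_B": 0.9, "sol_C": 0.4, "sol_D": 0.85}
population = ["sol_A", "sol_B", "sol_C", "sol_D"]
print(resample(population, toy_scores.get))
```

A stochastic refinement pass (re-generating each surviving particle's weakest step) would then follow, which is where the paper reports consistent net-directional correction.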
[301] Revealing Positive and Negative Role Models to Help People Make Good Decisions
Avrim Blum, Keziah Naggita, Matthew R. Walter, Jingyan Wang
Main category: cs.AI
TL;DR: A social network intervention framework where a planner with limited budget reveals positive/negative role model labels to maximize agents emulating positive role models, with algorithmic solutions and fairness considerations.
Details
Motivation: In social networks, agents emulate role models but may not know whether these models are positive or negative. A social planner with limited disclosure budget wants to strategically reveal labels to maximize social welfare (agents emulating positive role models).
Method: The paper proposes a welfare maximization framework with limited disclosure budget, introduces a proxy welfare function to handle submodularity issues when revealing negative labels, develops approximation algorithms, and extends to fairness across groups, intervention models, and coverage radius models.
Result: Theoretical results include constant-factor approximation algorithms when agents have constant negative neighbors, fairness guarantees across groups, and extensive experiments on four real-world datasets supporting the effectiveness of proposed algorithms.
Conclusion: The paper provides a comprehensive framework for strategic label disclosure in social networks with theoretical guarantees and practical algorithms for welfare maximization while addressing fairness concerns across different agent groups.
Abstract: We consider a setting where agents take action by following their role models in a social network, and study strategies for a social planner to help agents by revealing whether the role models are positive or negative. Specifically, agents observe a local neighborhood of possible role models they can emulate, but do not know their true labels. Revealing a positive label encourages emulation, while revealing a negative one redirects agents toward alternative options. The social planner observes all labels, but operates under a limited disclosure budget that it selectively allocates to maximize social welfare (the expected number of agents who emulate adjacent positive role models). We consider both algorithms and hardness results for welfare maximization, and provide a sample-complexity guarantee when the planner observes a sampled subset of agents. We also consider fairness guarantees when agents belong to different groups. It is a technical challenge that the ability to reveal negative role models breaks submodularity. We thus introduce a proxy welfare function that remains submodular even when revealed targets include negative ones. When each agent has at most a constant number of negative target neighbors, we use this proxy to achieve a constant-factor approximation to the true optimal welfare gain. When agents belong to different groups, we also show that each group’s welfare gain is within a constant factor of the optimum achievable if the full budget were allocated to that group. Beyond this basic model, we also propose an intervention model that directly connects high-risk agents to positive role models, and a coverage radius model that expands the visibility of selected positive role models. Lastly, we conduct extensive experiments on four real-world datasets to support our theoretical results and assess the effectiveness of the proposed algorithms.
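When only positive labels are revealed, the objective reduces to budgeted coverage, which a greedy rule maximizes to within a constant factor. The toy sketch below illustrates that coverage-style core (the paper's actual proxy welfare function, which also handles negative revelations, is more involved), with invented agents and role models:

```python
# Toy budgeted disclosure: greedily reveal positive role-model labels to
# maximize the number of agents with at least one revealed positive neighbor.

def greedy_reveal(agents_neighbors, positives, budget):
    """agents_neighbors: {agent: set of role-model neighbors};
    positives: role models whose true label is positive."""
    revealed, covered = set(), set()
    for _ in range(budget):
        best, best_gain = None, 0
        for rm in positives - revealed:
            gain = sum(1 for a, nbrs in agents_neighbors.items()
                       if a not in covered and rm in nbrs)
            if gain > best_gain:
                best, best_gain = rm, gain
        if best is None:
            break  # no remaining reveal helps any uncovered agent
        revealed.add(best)
        covered |= {a for a, nbrs in agents_neighbors.items() if best in nbrs}
    return revealed, covered

agents = {"a1": {"r1", "r2"}, "a2": {"r2"}, "a3": {"r3"}, "a4": {"r2", "r3"}}
revealed, covered = greedy_reveal(agents, positives={"r1", "r2", "r3"}, budget=1)
print(revealed, covered)  # reveals r2, covering a1, a2, a4
```

The technical difficulty the paper highlights is exactly what this sketch omits: once negative labels can also be revealed (redirecting agents away from bad options), the objective stops being submodular, motivating their proxy function.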
[302] NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect
Pratibha Zunjare, Michael Hsiao
Main category: cs.AI
TL;DR: NeuroProlog is a neurosymbolic framework that improves mathematical reasoning in LLMs by compiling word problems into verifiable Prolog programs with multi-task training and execution-guided decoding.
Details
Motivation: LLMs perform well on natural language tasks but struggle with mathematical reasoning, often producing fluent but logically inconsistent solutions. There's a need for frameworks that ensure verifiable reasoning with formal guarantees.
Method: NeuroProlog compiles math word problems into executable Prolog programs. It uses a multi-task Cocktail training strategy with three objectives: mathematical formula-to-rule translation, natural language-to-program synthesis, and program-answer alignment. At inference, it employs execution-guided decoding with error taxonomy for iterative program repair.
Result: Significant accuracy improvements on GSM8K: +5.23% (Qwen-32B), +3.43% (GPT-OSS-20B), and +5.54% (Llama-3B) over single-task baselines. Error analysis shows scale-dependent dynamics: at 32B scale, transforms unfixable type errors (12% repair) to correctable domain errors (96% repair), achieving 92.7% overall correction.
Conclusion: NeuroProlog demonstrates that neurosymbolic approaches with multi-task training and execution-guided decoding can significantly improve mathematical reasoning in LLMs, with scale-dependent effects revealing critical capacity thresholds for symbolic reasoning.
Abstract: Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B–32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23% (Qwen-32B, $p < 0.01$), +3.43% (GPT-OSS-20B, $p < 0.01$), and +5.54% (Llama-3B, $p < 0.05$) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12% repair rate) into correctable domain errors (96% repair rate), achieving 92.7% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
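The execution-guided repair loop has a simple shape: run the generated program, classify the failure, and hand the program plus error class back to the model. The sketch below illustrates that loop only; it uses Python `exec` as a stand-in for Prolog execution, and the error classes and repair function are stubs, not the paper's taxonomy:

```python
# Illustrative execution-guided decoding loop with error-class-driven repair.

def classify_error(exc):
    """Stub taxonomy: map an exception to a coarse error class."""
    if isinstance(exc, SyntaxError):
        return "syntax"
    if isinstance(exc, (TypeError, ZeroDivisionError)):
        return "type/domain"
    return "other"

def execute_with_repair(program, repair_fn, max_rounds=3):
    """Run the program; on failure, ask repair_fn for a fixed version."""
    for round_ in range(max_rounds):
        try:
            scope = {}
            exec(program, scope)          # stand-in for Prolog execution
            return scope.get("answer"), round_
        except Exception as exc:
            program = repair_fn(program, classify_error(exc))
    return None, max_rounds

buggy = "answer = 10 / 0"                 # domain error: division by zero
fixed = "answer = 10 / 2"
# Stub "model": always returns the fixed program regardless of error class.
result, rounds = execute_with_repair(buggy, lambda prog, kind: fixed)
print(result, rounds)  # 5.0 1
```

In the paper's analysis, which error class a failure lands in matters a great deal: at 32B scale most failures fall into highly repairable classes, while smaller models shift failures into classes the loop cannot fix.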
[303] LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model
Xiangyu Li, Tianyi Wang, Xi Cheng, Rakesh Chowdary Machineni, Zhaomiao Guo, Sikai Chen, Junfeng Jiao, Christian Claudel
Main category: cs.AI
TL;DR: LLM-MLFFN: A large language model-enhanced multi-level feature fusion network for autonomous vehicle driving behavior classification, combining numerical features with LLM-derived semantic features for improved accuracy and interpretability.
Details
Motivation: Existing AV driving behavior classification approaches rely primarily on numerical time-series modeling and lack semantic abstraction, limiting interpretability and robustness in complex traffic environments.
Method: Three-component framework: (1) multi-level feature extraction module for statistical, behavioral, and dynamic features; (2) semantic description module using LLMs to transform raw data into high-level semantic features; (3) dual-channel multi-level feature fusion network combining numerical and semantic features with weighted attention mechanisms.
Result: Achieved over 94% classification accuracy on Waymo open trajectory dataset, surpassing existing machine learning models. Ablation studies validated contributions of multi-level fusion, feature extraction strategies, and LLM-derived semantic reasoning.
Conclusion: Integrating structured feature modeling with language-driven semantic abstraction provides a principled and interpretable pathway for robust autonomous driving behavior classification.
Abstract: Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling and often lack semantic abstraction, limiting interpretability and robustness in complex traffic environments. This paper presents LLM-MLFFN, a novel large language model (LLM)-enhanced multi-level feature fusion network designed to address the complexities of multi-dimensional driving data. The proposed LLM-MLFFN framework integrates priors from large-scale pre-trained models and employs a multi-level approach to enhance classification accuracy. LLM-MLFFN comprises three core components: (1) a multi-level feature extraction module that extracts statistical, behavioral, and dynamic features to capture the quantitative aspects of driving behaviors; (2) a semantic description module that leverages LLMs to transform raw data into high-level semantic features; and (3) a dual-channel multi-level feature fusion network that combines numerical and semantic features using weighted attention mechanisms to improve robustness and prediction accuracy. Evaluation on the Waymo open trajectory dataset demonstrates the superior performance of the proposed LLM-MLFFN, achieving a classification accuracy of over 94%, surpassing existing machine learning models. Ablation studies further validate the critical contributions of multi-level fusion, feature extraction strategies, and LLM-derived semantic reasoning. These results suggest that integrating structured feature modeling with language-driven semantic abstraction provides a principled and interpretable pathway for robust autonomous driving behavior classification.
[304] A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
Faiz Ghifari Haznitrama, Faeyza Rishad Ardi, Alice Oh
Main category: cs.AI
TL;DR: The paper introduces NeuroCognition, a benchmark based on neuropsychological tests to evaluate LLMs’ foundational cognitive abilities beyond task completion, revealing gaps in image processing and complex reasoning despite strong text performance.
Details
Motivation: Current LLM benchmarks focus on task completion but fail to probe foundational cognitive abilities, explaining why models struggle with simple human tasks despite strong performance on complex benchmarks. There's a need for benchmarks that measure core adaptive cognition.
Method: Introduced NeuroCognition benchmark with three adapted neuropsychological tests: Raven’s Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance/systematic search), and Wisconsin Card Sorting Test (cognitive flexibility). Evaluated 156 models across text and image modalities.
Result: Models perform strongly on text but degrade on images and with increased complexity. Complex reasoning isn’t universally beneficial, while simple human-like strategies yield partial gains. NeuroCognition correlates positively with standard benchmarks while measuring distinct cognitive abilities.
Conclusion: NeuroCognition reveals where LLMs align with human-like intelligence and where they lack core adaptive cognition, serving as a verifiable, scalable source for improving LLMs beyond current task-focused benchmarks.
Abstract: Large language models (LLMs) exhibit a unified “general factor” of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple, trivial tasks for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that highlight these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven’s Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks, while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition emphasizes where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing the potential to serve as a verifiable, scalable source for improving LLMs.
[305] AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation
Zhulin Jiang, Zetao Li, Cheng Wang, Ziwen Wang, Chen Xiong
Main category: cs.AI
TL;DR: AnchorDrive: A two-stage framework using LLMs for controllable generation and diffusion models for realistic trajectory synthesis to create safety-critical driving scenarios for autonomous vehicle evaluation.
Details
Motivation: Autonomous driving systems need comprehensive safety evaluation in critical scenarios, but such scenarios are rare in real-world data. Existing simulation methods lack both controllability (following specific instructions) and realism (matching real driving distributions).
Method: Two-stage approach: 1) LLM as driver agent in closed-loop simulation with plan assessor feedback for semantically controllable scenario generation; 2) Diffusion model guided by anchor points extracted from LLM trajectories to regenerate realistic trajectories while preserving user intent.
Result: Experiments on highD dataset show AnchorDrive achieves superior performance in criticality, realism, and controllability compared to existing methods, validating its effectiveness for generating safety-critical scenarios.
Conclusion: AnchorDrive successfully combines LLMs’ controllability with diffusion models’ realism to generate high-quality safety-critical driving scenarios, addressing limitations of existing simulation methods for autonomous vehicle evaluation.
Abstract: Autonomous driving systems require comprehensive evaluation in safety-critical scenarios to ensure safety and robustness. However, such scenarios are rare and difficult to collect from real-world driving data, necessitating simulation-based synthesis. Yet, existing methods often exhibit limitations in both controllability and realism. From a capability perspective, LLMs excel at controllable generation guided by natural language instructions, while diffusion models are better suited for producing trajectories consistent with realistic driving distributions. Leveraging their complementary strengths, we propose AnchorDrive, a two-stage safety-critical scenario generation framework. In the first stage, we deploy an LLM as a driver agent within a closed-loop simulation, which reasons and iteratively outputs control commands under natural language constraints; a plan assessor reviews these commands and provides corrective feedback, enabling semantically controllable scenario generation. In the second stage, the LLM extracts key anchor points from the first-stage trajectories as guidance objectives, which jointly with other guidance terms steer the diffusion model to regenerate complete trajectories with improved realism while preserving user-specified intent. Experiments on the highD dataset demonstrate that AnchorDrive achieves superior overall performance in criticality, realism, and controllability, validating its effectiveness for generating controllable and realistic safety-critical scenarios.
[306] LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
Hao Li, Huan Wang, Jinjie Gu, Wenjie Wang, Chenyi Zhuang, Sikang Bian
Main category: cs.AI
TL;DR: LiveAgentBench is a comprehensive benchmark with 104 real-world scenarios for evaluating general AI agents, featuring a novel Social Perception-Driven Data Generation method to ensure task relevance and complexity.
Details
Motivation: Existing benchmarks for general AI agents fail to accurately represent real-world user tasks, creating a gap in evaluating practical performance of increasingly capable language models in real applications.
Method: Developed LiveAgentBench with 104 scenarios from publicly sourced questions on social media and real-world products, using Social Perception-Driven Data Generation (SPDG) method to ensure real-world relevance, task complexity, and result verifiability.
Result: Created benchmark with 374 tasks (125 validation, 249 testing), evaluated various models/frameworks/commercial products, revealed practical performance and identified improvement areas, with SPDG enabling continuous updates from real-world interactions.
Conclusion: LiveAgentBench addresses limitations of existing benchmarks by providing realistic evaluation of general AI agents, with SPDG method ensuring ongoing relevance through continuous updates from real-world user queries.
Abstract: As large language models grow more capable, general AI agents have become increasingly prevalent in practical applications. However, existing benchmarks face significant limitations, failing to represent real-world user tasks accurately. To address this gap, we present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real user requirements. It is constructed from publicly sourced questions on social media and real-world products. Central to our approach is the Social Perception-Driven Data Generation (SPDG) method, a novel process we developed to ensure each question’s real-world relevance, task complexity, and result verifiability. We evaluate various models, frameworks, and commercial products using LiveAgentBench, revealing their practical performance and identifying areas for improvement. This release includes 374 tasks, with 125 for validation and 249 for testing. The SPDG process enables continuous updates with fresh queries from real-world interactions.
[307] SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
Sunghyeon Woo, Ahreum Seo, Jaegwang Lee, Jaeeun Kil, Hanbae Seo, Joonghoon Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee
Main category: cs.AI
TL;DR: SUN enables cross-model sharing of decode execution in multi-LLM serving by decomposing Transformers into task-specific prefill modules and shared decode modules, improving GPU utilization and throughput.
Details
Motivation: Current multi-model LLM serving suffers from inefficient decode execution due to model-specific resource partitioning, preventing cross-model batching and causing severe GPU underutilization, especially under skewed workloads.
Method: Decomposes decoder-only Transformers into prefill and decode modules, fine-tunes only task-specific prefill modules while keeping decode modules frozen and shared across models, enabling model-agnostic decode routing policies to balance requests across shared workers.
Result: Achieves accuracy comparable to full fine-tuning while improving throughput per GPU by up to 2.0x over conventional disaggregation, with time-per-output-token within 5%. Quantized SUN (QSUN) achieves 45% speedup with comparable accuracy.
Conclusion: SUN enables efficient cross-model sharing of decode execution in multi-LLM serving, significantly improving GPU utilization and throughput while maintaining accuracy, with quantization further enhancing performance.
Abstract: In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.
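Because the decode module is shared, requests from any model can go to any decode worker, so a simple least-loaded policy already balances the pool. The sketch below shows that routing idea only; the load metric (outstanding decode tokens) and worker bookkeeping are assumptions, not SUN's actual policy:

```python
# Least-loaded routing of decode requests across a shared worker pool.
import heapq

def route(requests, n_workers):
    """Assign each (request id, expected decode tokens) to the least-loaded
    worker, tracking load as outstanding decode tokens per worker."""
    heap = [(0, w) for w in range(n_workers)]  # (load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for req_id, decode_tokens in requests:
        load, w = heapq.heappop(heap)
        assignment[req_id] = w
        heapq.heappush(heap, (load + decode_tokens, w))
    return assignment

# Requests from two different models share the same decode worker pool.
reqs = [("modelA-0", 100), ("modelB-0", 300), ("modelA-1", 50), ("modelB-1", 80)]
print(route(reqs, n_workers=2))
```

Under model-partitioned serving, a skewed burst toward one model would idle the other model's workers; sharing the decode module removes that constraint, which is where the reported throughput-per-GPU gains come from.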
[308] AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
Varun Pratap Bhardwaj
Main category: cs.AI
TL;DR: AgentAssay: A token-efficient framework for regression testing non-deterministic AI agent workflows with statistical guarantees, achieving 78-100% cost reduction.
Details
Motivation: No principled methodology exists for verifying that AI agents haven't regressed after changes to prompts, tools, models, or orchestration logic, despite their unprecedented deployment scale.
Method: Eight key contributions: stochastic three-valued verdicts, five-dimensional agent coverage metrics, agent-specific mutation testing operators, metamorphic relations, CI/CD deployment gates, behavioral fingerprinting mapping execution traces to compact vectors, adaptive budget optimization, and trace-first offline analysis.
Result: Experiments across 5 models, 3 scenarios, and 7,605 trials show behavioral fingerprinting achieves 86% detection power where binary testing has 0%, SPRT reduces trials by 78%, and full pipeline achieves 100% cost savings through trace-first analysis.
Conclusion: AgentAssay provides the first comprehensive framework for regression testing AI agents with statistical guarantees and significant cost reductions, enabling reliable verification of agent performance after changes.
Abstract: Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology exists for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. We present AgentAssay, the first token-efficient framework for regression testing non-deterministic AI agent workflows, achieving 78-100% cost reduction while maintaining rigorous statistical guarantees. Our contributions include: (1) stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE) grounded in hypothesis testing; (2) five-dimensional agent coverage metrics; (3) agent-specific mutation testing operators; (4) metamorphic relations for agent workflows; (5) CI/CD deployment gates as statistical decision procedures; (6) behavioral fingerprinting that maps execution traces to compact vectors, enabling multivariate regression detection; (7) adaptive budget optimization calibrating trial counts to behavioral variance; and (8) trace-first offline analysis enabling zero-cost testing on production traces. Experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3, Llama-4-Maverick, Phi-4), 3 scenarios, and 7,605 trials demonstrate that behavioral fingerprinting achieves 86% detection power where binary testing has 0%, SPRT reduces trials by 78%, and the full pipeline achieves 100% cost savings through trace-first analysis. Implementation: 20,000+ lines of Python, 751 tests, 10 framework adapters.
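A three-valued verdict from sequential trials can be sketched with Wald's classic SPRT bounds: accumulate a log-likelihood ratio over pass/fail trials and stop early when it crosses either threshold. The pass-rate hypotheses and error rates below are illustrative defaults, not AgentAssay's settings:

```python
# SPRT-style PASS/FAIL/INCONCLUSIVE verdict over Bernoulli trial outcomes.
import math

def sprt_verdict(outcomes, p0=0.6, p1=0.9, alpha=0.05, beta=0.05):
    """Test H1 (pass rate >= p1) against H0 (pass rate <= p0).

    Returns (verdict, trials used). Stops as soon as the log-likelihood
    ratio crosses a Wald bound; otherwise the budget runs out INCONCLUSIVE.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 -> PASS
    lower = math.log(beta / (1 - alpha))   # accept H0 -> FAIL
    llr = 0.0
    for i, ok in enumerate(outcomes, 1):
        llr += math.log(p1 / p0) if ok else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS", i
        if llr <= lower:
            return "FAIL", i
    return "INCONCLUSIVE", len(outcomes)

print(sprt_verdict([True] * 10))   # ('PASS', 8): stops before the budget ends
print(sprt_verdict([False] * 5))   # ('FAIL', 3)
```

Early stopping is where the trial savings come from: a clearly passing or clearly failing agent resolves in a handful of trials instead of a fixed large sample.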
[309] See and Remember: A Multimodal Agent for Web Traversal
Xinjun Wang, Shengyao Wang, Aimin Zhou, Hao Hao
Main category: cs.AI
TL;DR: V-GEMS is a multimodal agent architecture for web navigation that integrates visual grounding and explicit memory to prevent spatial disorientation and navigation loops, achieving 28.7% performance gain over baseline.
Details
Motivation: Current LLM-based agents struggle with spatial disorientation and navigation loops in autonomous web navigation tasks, needing better visual perception and long-term context maintenance.
Method: Proposes V-GEMS with visual grounding to resolve ambiguous interactive elements and explicit memory stack with state tracking for structured path mapping, enabling valid backtracking and preventing cyclical failures.
Result: Significantly outperforms the WebWalker baseline with a 28.7% performance gain; also introduces an updatable dynamic benchmark for rigorous adaptability evaluation.
Conclusion: V-GEMS provides robust multimodal architecture for precise web traversal through visual grounding and explicit memory mechanisms.
Abstract: Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose generally applicable V-GEMS(Visual Grounding and Explicit Memory System), a robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show V-GEMS significantly dominates the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at https://github.com/Vaultttttttttttt/V-GEMS.
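One way to picture the explicit memory stack with cycle prevention is the sketch below. The class and method names are hypothetical, since the paper does not spell out an interface; the key behavior is that revisiting a stacked state unwinds the stack instead of looping:

```python
class TraversalMemory:
    """Sketch of an explicit memory stack for web traversal (hypothetical API).

    Visiting a state already on the stack is treated as a cycle: the stack
    unwinds back to the prior occurrence, giving the agent a valid
    backtracking point and preventing navigation loops.
    """

    def __init__(self):
        self._stack = []

    def visit(self, state_id):
        if state_id in self._stack:
            # cycle detected: pop until the prior occurrence is on top
            while self._stack[-1] != state_id:
                self._stack.pop()
            return "backtracked"
        self._stack.append(state_id)
        return "pushed"

    def path(self):
        return list(self._stack)
```

The stack thus doubles as a structured map of the traversal path, which is what enables valid backtracking in deep navigation tasks.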
[310] SorryDB: Can AI Provers Complete Real-World Lean Theorems?
Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman
Main category: cs.AI
TL;DR: SorryDB is a dynamic benchmark of open Lean tasks from real GitHub projects that continuously updates to align with community needs and prevent test-set contamination, with evaluation showing complementary performance across different approaches.
Details
Motivation: Existing benchmarks for formal mathematics are static and often composed of competition problems, which don't align well with real-world community needs. There's a need for benchmarks that reflect actual formalization projects and can measure an agent's ability to contribute to novel mathematics.
Method: Created SorryDB by collecting open Lean tasks from 78 real-world formalization projects on GitHub. The benchmark dynamically updates to provide a continuous stream of tasks. Evaluated various approaches including generalist LLMs, agentic approaches, specialized symbolic provers, and curated Lean tactics on a snapshot of 1000 tasks.
Result: Current approaches show complementary performance: the agentic approach based on Gemini Flash is the most performant but not strictly better than other methods. Different approaches have different strengths, suggesting a hybrid approach might be optimal.
Conclusion: SorryDB provides a robust, continuously-updating benchmark that better aligns with community needs and mitigates test-set contamination. The complementary nature of different approaches suggests future work should explore combining methods for better formal mathematics assistance.
Abstract: We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent’s ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.
[311] LLMs for High-Frequency Decision-Making: Normalized Action Reward-Guided Consistency Policy Optimization
Yang Zhao, Zihao Li, Zhiyu Jiang, Dandan Ma, Ganchao Liu, Wenzhe Zhao
Main category: cs.AI
TL;DR: NAR-CP improves LLM-based agents for high-frequency decision tasks using normalized action rewards and consistency policy optimization to address policy misalignment in composite tasks.
Details
Motivation: LLMs have limitations in high-frequency decision tasks where numerical state information updates frequently with minimal fluctuations. Existing methods focus on low-frequency discrete embodied scenarios and suffer from policy misalignment between learned sub-tasks and composite tasks.
Method: 1) Normalized Action Reward: Acquire dense rewards from environmental feedback of candidate actions via reward functions, then complete reward shaping through normalization. 2) Consistency Policy Optimization: Use LLMs to infer sub-observation candidate actions and generate joint policies, with consistency loss ensuring alignment between global semantic policies and sub-semantic policies.
Result: Experiments on UAV pursuit (a typical high-frequency task) show superior performance on independent and composite tasks with excellent generalization to unseen tasks.
Conclusion: NAR-CP effectively addresses LLM limitations in high-frequency decision-making by combining normalized action rewards with consistency policy optimization, demonstrating strong performance and generalization in complex tasks.
Abstract: While Large Language Models (LLMs) form the cornerstone of sequential decision-making agent development, they have inherent limitations in high-frequency decision tasks. Existing research mainly focuses on discrete embodied decision scenarios with low-frequency and significant semantic differences in state space (e.g., household planning). These methods suffer from limited performance in high-frequency decision-making tasks, since high-precision numerical state information in such tasks undergoes frequent updates with minimal fluctuations, and exhibiting policy misalignment between the learned sub-tasks and composite tasks. To address these issues, this paper proposes Normalized Action Reward guided Consistency Policy Optimization (NAR-CP). 1) Our method first acquires predefined dense rewards from environmental feedback of candidate actions via reward functions, then completes reward shaping through normalization, and theoretically verifies action reward normalization does not impair optimal policy. 2) To reduce policy misalignment in composite tasks, we use LLMs to infer sub-observation candidate actions and generate joint policies, with consistency loss ensuring precise alignment between global semantic policies and sub-semantic policies. Experiments on UAV pursuit, a typical high-frequency task, show our method delivers superior performance on independent and composite tasks with excellent generalization to unseen tasks.
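The reward-normalization step can be sketched as a monotone rescaling over candidate-action rewards. The min-max variant below is an illustrative assumption, not necessarily the paper's exact shaping function; because the rescaling preserves the ordering of rewards, the greedy action is unchanged, consistent with the paper's claim that normalization does not impair the optimal policy:

```python
def normalized_action_rewards(raw_rewards):
    """Min-max normalize dense rewards over candidate actions (sketch).

    Order-preserving rescaling to [0, 1]: the best candidate stays best,
    so a greedy policy over normalized rewards matches the unnormalized one.
    """
    lo, hi = min(raw_rewards), max(raw_rewards)
    if hi == lo:
        # all candidates equal: return uniform mid-scale rewards
        return [0.5] * len(raw_rewards)
    return [(r - lo) / (hi - lo) for r in raw_rewards]
```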
[312] Retrieval-Augmented Robots via Retrieve-Reason-Act
Izat Temiraliev, Diji Yang, Yi Zhang
Main category: cs.AI
TL;DR: Robots need active information retrieval capabilities to access external procedural knowledge for complex tasks like furniture assembly, moving beyond just internal memory or text-based constraints.
Details
Motivation: Current robots lack the ability to actively seek external procedural knowledge from unstructured documentation when facing novel tasks with no prior demonstrations, creating a critical information gap that prevents them from being truly general-purpose.
Method: Proposes Retrieval-Augmented Robotics (RAR) with an iterative Retrieve-Reason-Act loop: robots actively retrieve visual procedural manuals from an unstructured corpus, ground 2D diagrams to 3D physical parts via cross-modal alignment, and synthesize executable plans.
Result: The approach significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval on a challenging long-horizon assembly benchmark, demonstrating the value of grounding robotic planning in retrieved visual documents.
Conclusion: RAR establishes a new paradigm that extends Information Retrieval from answering user queries to driving embodied physical actions, bridging the gap between visual documentation and physical actuation.
Abstract: To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define the paradigm as Retrieval-Augmented Robotics (RAR), empowering the robot with the information-seeking capability that bridges the gap between visual documentation and physical actuation. We formulate the task execution as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.
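The iterative Retrieve-Reason-Act loop can be sketched as a simple control skeleton. The callback signatures and state fields are assumptions for illustration, not the paper's API:

```python
def retrieve_reason_act(task, retrieve, reason, act, max_steps=10):
    """Skeleton of a Retrieve-Reason-Act loop (hypothetical signatures).

    retrieve(task, state) -> document; reason(task, doc, state) -> plan,
    or None when the task is judged complete; act(plan, state) -> new state.
    """
    state = {"done": False, "history": []}
    for _ in range(max_steps):
        doc = retrieve(task, state)      # fetch a procedural manual page
        plan = reason(task, doc, state)  # ground the doc, draft a step
        if plan is None:
            state["done"] = True
            break
        state = act(plan, state)         # execute, observe new state
        state["history"].append(plan)
    return state
```

In the paper's setting, `retrieve` would query the unstructured manual corpus and `reason` would perform the 2D-to-3D cross-modal grounding; both are stubbed out here.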
[313] FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
Jaehoon Lee, Suhwan Park, Tae Yoon Lim, Seunghan Lee, Jun Seo, Dongwan Kang, Hwanil Choi, Minjae Kim, Sungdong Yoo, SoonYoung Lee, Yongjae Lee, Wonbin Ahn
Main category: cs.AI
TL;DR: A semantic-based multi-level framework for pairing financial news with stock price time-series data, addressing complex market interdependencies through LLM-based classification and embedding matching.
Details
Motivation: Financial time-series analysis needs to capture complex market interdependencies where stock prices are influenced by company-specific events, related companies, sectors, and macroeconomic factors. Existing keyword-based text-time-series pairing methods fail to capture these semantic relationships.
Method: Proposes a semantic-based multi-level pairing framework: 1) Extract company context from SEC filings, 2) Use embedding-based matching to retrieve relevant news articles, 3) Classify news into four levels (macro, sector, related company, target company) using LLMs, 4) Construct FinTexTS dataset with this approach.
Result: Created FinTexTS, a large-scale text-paired stock price dataset. Experimental results show the semantic-based multi-level pairing strategy improves stock price forecasting. When applied to proprietary curated news sources, yields higher-quality paired data and better forecasting performance.
Conclusion: The semantic-based multi-level pairing framework effectively captures complex financial market relationships, enabling better text-time-series integration for financial forecasting applications.
Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company’s stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct \textbf{FinTexTS}, a new large-scale text-paired stock price dataset. Experimental results on \textbf{FinTexTS} demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying \textbf{FinTexTS}, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
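The embedding-based matching step can be illustrated with plain cosine similarity over precomputed vectors. The encoder that would produce the embeddings is left abstract, and the function names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_news(context_vec, news, top_k=2):
    """Rank news by embedding similarity to the company context (sketch).

    `news` is a list of (article_id, embedding) pairs; embeddings would come
    from any sentence encoder (unspecified here). Returns top_k article ids.
    """
    ranked = sorted(news, key=lambda item: cosine(context_vec, item[1]),
                    reverse=True)
    return [aid for aid, _ in ranked[:top_k]]
```

The retrieved articles would then be routed through the LLM-based four-level classifier (macro, sector, related company, target company) described above.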
[314] A Natural Language Agentic Approach to Study Affective Polarization
Stephanie Anneris Malvicini, Ewelina Gajewska, Arda Derbent, Katarzyna Budzynska, Jarosław A. Chudziak, Maria Vanina Martinez
Main category: cs.AI
TL;DR: Multi-agent LLM platform for studying affective polarization in social media through simulated virtual communities
Details
Motivation: Existing studies of affective polarization in social media have limited scope (real-world) or insufficient training data (simulated), and the lack of tools to formalize different definitions across studies hinders comparison and the development of interoperable frameworks.
Method: Developed a multi-agent platform leveraging LLMs to construct virtual communities where agents engage in discussions, enabling analysis of affective polarization questions from social science literature and introducing scenarios for observation at different granularity levels.
Result: The platform serves as a flexible tool for computational studies of complex social dynamics like affective polarization, leveraging advanced agent models to simulate rich, context-sensitive interactions and systematically explore research questions.
Conclusion: The LLM-based multi-agent platform provides a comprehensive approach to studying affective polarization in social media, offering fresh perspectives and enabling measurement at various levels of granularity
Abstract: Affective polarization has been central to political and social studies, with growing focus on social media, where partisan divisions are often exacerbated. Real-world studies tend to have limited scope, while simulated studies suffer from insufficient high-quality training data, as manually labeling posts is labor-intensive and prone to subjective biases. The lack of adequate tools to formalize different definitions of affective polarization across studies complicates result comparison and hinders interoperable frameworks. We present a multi-agent model providing a comprehensive approach to studying affective polarization in social media. To operationalize our framework, we develop a platform leveraging large language models (LLMs) to construct virtual communities where agents engage in discussions. We showcase the potential of our platform by (1) analyzing questions related to affective polarization, as explored in social science literature, providing a fresh perspective on this phenomenon, and (2) introducing scenarios that allow observation and measurement of polarization at different levels of granularity and abstraction. Experiments show that our platform is a flexible tool for computational studies of complex social dynamics such as affective polarization. It leverages advanced agent models to simulate rich, context-sensitive interactions and systematically explore research questions traditionally addressed through human-subject studies.
[315] Rethinking Code Similarity for Automated Algorithm Design with LLMs
Rui Zhang, Zhichao Lu
Main category: cs.AI
TL;DR: BehaveSim measures algorithmic similarity through problem-solving behavior trajectories, using dynamic time warping to distinguish true algorithmic innovation from syntactic variation in LLM-generated code.
Details
Motivation: With LLM-based Automated Algorithm Design (LLM-AAD), algorithmic principles are implicitly embedded in generated code, making it essential to assess algorithmic similarity directly from code and distinguish genuine innovation from syntactic variation, which existing code similarity metrics fail to do.
Method: BehaveSim measures algorithmic similarity through problem-solving behavior as sequences of intermediate solutions (problem-solving trajectories or PSTrajs) produced during execution, quantifying alignment between PSTrajs using dynamic time warping (DTW).
Result: BehaveSim effectively distinguishes algorithms with divergent logic despite syntactic or output-level similarities, enhances LLM-AAD frameworks by promoting behavioral diversity (improving performance on three AAD tasks), and enables systematic analysis of problem-solving strategies through behavioral clustering.
Conclusion: BehaveSim provides a novel approach to measure algorithmic similarity through behavior analysis, offering crucial tools for assessing AI-generated algorithms and improving LLM-AAD frameworks by promoting behavioral diversity.
Abstract: The rise of Large Language Model-based Automated Algorithm Design (LLM-AAD) has transformed algorithm development by autonomously generating code implementations of expert-level algorithms. Unlike traditional expert-driven algorithm development, in the LLM-AAD paradigm, the main design principle behind an algorithm is often implicitly embedded in the generated code. Therefore, assessing algorithmic similarity directly from code, distinguishing genuine algorithmic innovation from mere syntactic variation, becomes essential. While various code similarity metrics exist, they fail to capture algorithmic similarity, as they focus on surface-level syntax or output equivalence rather than the underlying algorithmic logic. We propose BehaveSim, a novel method to measure algorithmic similarity through the lens of problem-solving behavior as a sequence of intermediate solutions produced during execution, dubbed as problem-solving trajectories (PSTrajs). By quantifying the alignment between PSTrajs using dynamic time warping (DTW), BehaveSim distinguishes algorithms with divergent logic despite syntactic or output-level similarities. We demonstrate its utility in two key applications: (i) Enhancing LLM-AAD: Integrating BehaveSim into existing LLM-AAD frameworks (e.g., FunSearch, EoH) promotes behavioral diversity, significantly improving performance on three AAD tasks. (ii) Algorithm analysis: BehaveSim clusters generated algorithms by behavior, enabling systematic analysis of problem-solving strategies–a crucial tool for the growing ecosystem of AI-generated algorithms. Data and code of this work are open-sourced at https://github.com/RayZhhh/behavesim.
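The DTW alignment at the heart of BehaveSim is the classic dynamic-programming recurrence. The sketch below assumes scalar objective values per intermediate solution and an absolute-difference step cost; the paper's exact trajectory representation and cost function may differ:

```python
def dtw_distance(traj_a, traj_b):
    """DTW distance between two problem-solving trajectories (sketch).

    Each trajectory is a sequence of intermediate objective values.
    Classic O(n*m) dynamic program: dp[i][j] is the minimal cumulative
    alignment cost of the first i and j elements.
    """
    n, m = len(traj_a), len(traj_b)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(traj_a[i - 1] - traj_b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch traj_a
                                  dp[i][j - 1],      # stretch traj_b
                                  dp[i - 1][j - 1])  # step both
    return dp[n][m]
```

Because DTW warps the time axis, two algorithms that pass through the same intermediate solutions at different speeds score as behaviorally similar, while syntactically similar code with divergent trajectories scores as distant.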
[316] Agentified Assessment of Logical Reasoning Agents
Zhiyu Ni, Yifeng Xiao, Zheng Liang
Main category: cs.AI
TL;DR: Framework for reproducible benchmarking of logical reasoning agents using assessor agents, demonstrated with auto-formalization agent for first-order logic reasoning on FOLIO dataset.
Details
Motivation: Need for reproducible, auditable, and robust evaluation of logical reasoning agents, especially when assessment must handle execution failures and maintain structured evaluation protocols.
Method: Agentified assessment framework with assessor agent that issues tasks, enforces budgets, parses outputs, and records failures. Case study: auto-formalization agent translates natural language FOL problems to Z3Py programs and uses SMT solving for logical entailment.
Result: Auto-formalization agent achieves 86.70% accuracy on cleaned FOLIO validation set, outperforming chain-of-thought baseline (73.89%) under the assessor protocol.
Conclusion: The framework enables reproducible benchmarking of reasoning agents, and the auto-formalization approach shows strong performance on logical reasoning tasks compared to baseline methods.
Abstract: We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
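The assessor protocol (issue a task, parse the output, record a structured failure type) can be sketched as follows. The verdict and failure labels are illustrative, not the paper's schema:

```python
def assess(agent_fn, task, expected):
    """Toy assessor-agent step (hypothetical interface).

    Issues `task` to the agent under test, parses its raw output, and
    records a structured verdict with a failure type when applicable.
    """
    try:
        raw = agent_fn(task)
    except Exception as exc:
        # execution failure: record the exception class as the failure type
        return {"verdict": "error", "failure": type(exc).__name__}
    if not isinstance(raw, str) or not raw.strip():
        return {"verdict": "error", "failure": "unparseable_output"}
    if raw.strip().lower() == expected:
        return {"verdict": "correct", "failure": None}
    return {"verdict": "incorrect", "failure": "wrong_answer"}
```

Structured failure records are what make the assessment auditable: crashes, unparseable output, and wrong answers are distinguished rather than lumped into a single error rate.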
[317] Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van de Schaar
Main category: cs.AI
TL;DR: GLEAN: A framework for verifying LLM-powered agents in clinical diagnosis by accumulating evidence from expert guidelines and calibrating correctness probabilities with Bayesian logistic regression.
Details
Motivation: As LLM-powered agents are increasingly used for high-stakes clinical decision-making, there's a critical need for reliable verification of their decisions to enable trustworthy deployment. Existing verifiers underperform due to lack of domain knowledge and poor calibration.
Method: GLEAN (Guideline-grounded Evidence Accumulation) compiles expert-curated protocols into trajectory-informed correctness signals. It evaluates step-wise alignment with domain guidelines, aggregates multi-guideline ratings into surrogate features, accumulates evidence along the trajectory, and calibrates into correctness probabilities using Bayesian logistic regression. For uncertain cases, it triggers active verification by expanding guideline coverage and performing differential checks.
Result: Empirical validation on agentic clinical diagnosis across three diseases from the MIMIC-IV dataset shows GLEAN surpasses the best baseline by 12% in AUROC and 50% in Brier score reduction, demonstrating effectiveness in both discrimination and calibration. An expert study with clinicians confirms GLEAN’s practical utility.
Conclusion: GLEAN provides an effective framework for verifying LLM-powered agents in high-stakes domains like clinical diagnosis by leveraging expert guidelines and calibrated evidence accumulation, addressing limitations of existing verification methods.
Abstract: As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet, existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establish GLEAN, an agent verification framework with Guideline-grounded Evidence Accumulation that compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. GLEAN evaluates the step-wise alignment with domain guidelines and aggregates multi-guideline ratings into surrogate features, which are accumulated along the trajectory and calibrated into correctness probabilities using Bayesian logistic regression. Moreover, the estimated uncertainty triggers active verification, which selectively collects additional evidence for uncertain cases via expanding guideline coverage and performing differential checks. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both discrimination and calibration. In addition, the expert study with clinicians recognizes GLEAN’s utility in practice.
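The accumulate-then-calibrate step can be sketched in miniature: pool per-step guideline-alignment ratings into a trajectory-level feature and pass it through a logistic link. The fixed weights and the mean-rating feature are illustrative stand-ins for the coefficients and surrogate features GLEAN would fit with Bayesian logistic regression:

```python
import math

def calibrated_correctness(ratings, w=2.0, b=-1.0):
    """Map per-step guideline-alignment ratings in [0, 1] to a
    correctness probability (sketch with illustrative weights).

    Evidence is accumulated as the mean rating along the trajectory, then
    calibrated through a logistic (sigmoid) link.
    """
    evidence = sum(ratings) / len(ratings)
    return 1.0 / (1.0 + math.exp(-(w * evidence + b)))
```

A Bayesian fit would additionally yield uncertainty over the probability itself, which is what GLEAN uses to trigger active verification for borderline cases.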
[318] LLM-based Argument Mining meets Argumentation and Description Logics: a Unified Framework for Reasoning about Debates
Gianvincenzo Alfano, Sergio Greco, Lucio La Cava, Stefano Francesco Monea, Irina Trubitsyna
Main category: cs.AI
TL;DR: A framework combining argument mining, quantitative reasoning, and ontology-based querying to analyze debates with structured, verifiable reasoning instead of purely statistical LLM approaches.
Details
Motivation: LLMs struggle with explicit, transparent, and verifiable reasoning over complex texts like debates, lacking structured representations of argument relationships and strengths that determine overall acceptability.
Method: Integrates learning-based argument mining with quantitative reasoning and ontology-based querying. Extracts fuzzy argumentative knowledge base from raw debate text, applies quantitative argumentation semantics to compute final argument strengths, and embeds results into fuzzy description logic for expressive query answering.
Result: Provides a transparent, explainable, and formally grounded method for analyzing debates that overcomes purely statistical LLM-based analyses by enabling structured reasoning about argument relationships and strengths.
Conclusion: The proposed framework offers a more rigorous approach to debate analysis compared to LLMs, combining argument mining, quantitative reasoning, and formal logic for verifiable, structured reasoning.
Abstract: Large Language Models (LLMs) achieve strong performance in analyzing and generating text, yet they struggle with explicit, transparent, and verifiable reasoning over complex texts such as those containing debates. In particular, they lack structured representations that capture how arguments support or attack each other and how their relative strengths determine overall acceptability. We encompass these limitations by proposing a framework that integrates learning-based argument mining with quantitative reasoning and ontology-based querying. Starting from a raw debate text, the framework extracts a fuzzy argumentative knowledge base, where arguments are explicitly represented as entities, linked by attack and support relations, and annotated with initial fuzzy strengths reflecting plausibility w.r.t. the debate’s context. Quantitative argumentation semantics are then applied to compute final argument strengths by propagating the effects of supports and attacks. These results are then embedded into a fuzzy description logic setting, enabling expressive query answering through efficient rewriting techniques. The proposed approach provides a transparent, explainable, and formally grounded method for analyzing debates, overcoming purely statistical LLM-based analyses.
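A simple fixpoint iteration illustrates how a quantitative argumentation semantics propagates supports and attacks into final strengths. This additive, clipped scheme is illustrative only; the paper does not commit to this specific semantics:

```python
def propagate_strengths(base, attacks, supports, iters=50, rate=0.5):
    """Fixpoint iteration for a toy quantitative argumentation semantics.

    base: {arg: initial fuzzy strength in [0, 1]};
    attacks / supports: lists of (source, target) pairs.
    Each argument is pulled up by its supporters' current strengths and
    down by its attackers', then clipped to [0, 1].
    """
    s = dict(base)
    for _ in range(iters):
        nxt = {}
        for a in s:
            boost = sum(s[src] for src, tgt in supports if tgt == a)
            drag = sum(s[src] for src, tgt in attacks if tgt == a)
            nxt[a] = min(1.0, max(0.0, base[a] + rate * (boost - drag)))
        s = nxt
    return s
```

The resulting final strengths are exactly the kind of values the framework would then embed as fuzzy assertions for description-logic query answering.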
[319] Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures
Georgios Pantazopoulos, Malvina Nikandrou, Ioannis Konstas, Alessandro Suglia
Main category: cs.AI
TL;DR: Hybrid Transformer-SSM architectures outperform SSMs and match/exceed Transformers for information-dense n-gram retrieval, but Transformers remain superior for position retrieval tasks requiring precise positional associations.
Details
Motivation: Transformers have strong in-context retrieval capabilities but suffer from quadratic complexity, while SSMs offer linear-time efficiency but limited retrieval. The paper investigates whether hybrid architectures can combine the strengths of both approaches.
Method: The study evaluates Transformers, SSMs, and hybrid architectures on two synthetic in-context retrieval tasks: n-gram retrieval (identifying and reproducing n-grams following queries) and position retrieval (two-hop associative lookup to output positional indices). Experiments assess data efficiency, length generalization, robustness, and learned representations under controlled conditions.
Result: Hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense n-gram retrieval. However, Transformers maintain superiority in position retrieval tasks. Representation analysis reveals SSM-based models develop locality-aware embeddings where adjacent position tokens become neighbors in embedding space, forming interpretable structures not found in Transformers.
Conclusion: Hybrid architectures can achieve the best of both worlds for certain retrieval tasks but have limitations. The emergent locality-aware embeddings in SSM-based models explain their strengths and limitations. The findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how different models learn positional associations.
Abstract: Transformers excel at in-context retrieval but suffer from quadratic complexity with sequence length, while State Space Models (SSMs) offer efficient linear-time processing but have limited retrieval capabilities. We investigate whether hybrid architectures combining Transformers and SSMs can achieve the best of both worlds on two synthetic in-context retrieval tasks. The first task, n-gram retrieval, requires the model to identify and reproduce an n-gram that succeeds the query within the input sequence. The second task, position retrieval, presents the model with a single query token and requires it to perform a two-hop associative lookup: first locating the corresponding element in the sequence, and then outputting its positional index. Under controlled experimental conditions, we assess data efficiency, length generalization, robustness to out of domain training examples, and learned representations across Transformers, SSMs, and hybrid architectures. We find that hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval. However, Transformers maintain superiority in position retrieval tasks. Through representation analysis, we discover that SSM-based models develop locality-aware embeddings where tokens representing adjacent positions become neighbors in embedding space, forming interpretable structures. This emergent property, absent in Transformers, explains both the strengths and limitations of SSMs and hybrids for different retrieval tasks. Our findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how Transformers and SSMs, and hybrid models learn positional associations.
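The position-retrieval task described above can be generated synthetically in a few lines; the encoding details below are illustrative, not the paper's exact format:

```python
import random

def make_position_retrieval_example(vocab, length, rng):
    """Generate one synthetic position-retrieval example (sketch).

    Draws `length` distinct tokens, picks one as the query, and sets the
    target to the query's index: the two-hop lookup (locate the token,
    then output its position) the task is built to test.
    """
    seq = rng.sample(vocab, length)
    query = rng.choice(seq)
    target = seq.index(query)
    return seq, query, target
```

Seeding the generator makes the benchmark reproducible across architectures, which matters for the controlled comparisons the paper reports.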
[320] SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
Qi Zhang, Yifei Wang, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
Main category: cs.AI
TL;DR: SAE-based Transferability Score (STS) uses sparse autoencoders to predict how well supervised fine-tuning will transfer to different domains before actually fine-tuning, achieving strong correlation with actual performance changes.
Details
Motivation: Post-training processes like supervised fine-tuning introduce model shifts that affect performance across domains, but it's unclear how these shifts transfer. Current methods lack interpretability and require actual fine-tuning to assess transferability.
Method: Proposes STS metric using sparse autoencoders (SAEs) to identify shifted dimensions in representations and calculate their correlations with downstream domains. This enables transferability estimation before fine-tuning by analyzing SAE representations.
Result: Extensive experiments show STS accurately predicts supervised fine-tuning transferability with Pearson correlation coefficients above 0.7 with actual performance changes. Also takes initial steps toward extending STS to reinforcement learning.
Conclusion: STS serves as an interpretable tool for guiding post-training strategies in LLMs by forecasting transferability before committing to fine-tuning, potentially saving computational resources.
Abstract: In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability before fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.
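The core idea, shifted SAE dimensions correlated against downstream domains, can be sketched numerically. This is an illustrative toy only (the function name, top-k selection, and correlation target are our assumptions; the real metric lives in the paper's repo):

```python
import numpy as np

def sts_sketch(acts_base, acts_tuned, domain_acts, top_k=10):
    """Toy STS-style score. Inputs are (num_samples, num_sae_features)
    SAE activation matrices: before fine-tuning, after fine-tuning, and
    on data from a candidate downstream domain. We (1) rank SAE features
    by how much fine-tuning shifted their mean activation, then
    (2) correlate those shift magnitudes with how strongly the same
    features fire on the domain's data."""
    shift = np.abs(acts_tuned.mean(axis=0) - acts_base.mean(axis=0))
    shifted_dims = np.argsort(shift)[-top_k:]      # most-shifted SAE features
    domain_strength = domain_acts.mean(axis=0)
    return np.corrcoef(shift[shifted_dims], domain_strength[shifted_dims])[0, 1]
```

A high score suggests fine-tuning moved exactly the features the target domain relies on, i.e. the shift should transfer, without ever running the fine-tuning on that domain.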
[321] ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization
Yang Zhan, Yunhao Li, Zhang Chao, Yuxu Lu, Yan Li
Main category: cs.AI
TL;DR: ShipTraj-R1: An LLM-based framework using reinforcement fine-tuning (GRPO) for ship trajectory prediction, reformulating it as text-to-text generation with dynamic prompts and rule-based rewards.
Details
Motivation: While reinforcement fine-tuning has improved LLM reasoning, applying LLMs to ship trajectory prediction remains unexplored. The paper aims to bridge this gap by creating an LLM-based framework for maritime trajectory prediction.
Method: 1) Reformulates ship trajectory prediction as text-to-text generation. 2) Uses dynamic prompts with conflicting ship trajectory info for adaptive CoT reasoning. 3) Implements comprehensive rule-based rewards for reasoning format and prediction accuracy. 4) Reinforces model through GRPO mechanism using Qwen3 backbone with domain-specific prompts.
Result: Extensive experiments on two complex real-world maritime datasets show ShipTraj-R1 achieves the least error compared to state-of-the-art deep learning and LLM-based baselines.
Conclusion: The proposed ShipTraj-R1 framework successfully applies LLMs to ship trajectory prediction through reinforcement fine-tuning, demonstrating superior performance over existing methods.
Abstract: Recent advancements in reinforcement fine-tuning have significantly improved the reasoning ability of large language models (LLMs). In particular, methods such as group relative policy optimization (GRPO) have demonstrated strong capabilities across various fields. However, applying LLMs to ship trajectory prediction remains largely unexplored. In this paper, we propose ShipTraj-R1, a novel LLM-based framework that reformulates ship trajectory prediction as a text-to-text generation problem. (1) We design a dynamic prompt containing trajectory information about conflicting ships to guide the model to achieve adaptive chain-of-thought (CoT) reasoning. (2) We introduce a comprehensive rule-based reward mechanism to incentivize the reasoning format and prediction accuracy of the model. (3) Our ShipTraj-R1 is reinforced through the GRPO mechanism guided by domain-specific prompts and rewards, and utilizes Qwen3 as the model backbone. Extensive experimental results on two complex and real-world maritime datasets show that the proposed ShipTraj-R1 achieves the least error compared with state-of-the-art deep learning and LLM-based baselines.
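GRPO's central trick, group-relative advantages, is standard in the literature and easy to state; the reward function below is a hypothetical stand-in for ShipTraj-R1's rule-based rewards (the weights and error term are our illustration, not the paper's formula):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Core GRPO idea: normalize each rollout's reward against the other
    rollouts sampled for the same prompt, so no learned critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def rule_based_reward(pred_error_km, format_ok, w_format=0.2, w_acc=0.8):
    """Hypothetical reward: a reasoning-format term plus an accuracy term
    that decays with the trajectory prediction error."""
    return w_format * float(format_ok) + w_acc / (1.0 + pred_error_km)
```

Rollouts scoring above their group's mean get positive advantage and are reinforced; the format term nudges the model to keep its chain-of-thought well-structured even when predictions are off.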
[322] Architecting Trust in Artificial Epistemic Agents
Nahema Marchal, Stephanie Chan, Matija Franklin, Manon Revel, Geoff Keeling, Roberta Fischli, Bilva Chandra, Iason Gabriel
Main category: cs.AI
TL;DR: The paper argues that LLMs are becoming epistemic agents that shape knowledge environments, requiring new evaluation frameworks focused on trustworthiness, alignment with human epistemic goals, and robust socio-epistemic infrastructure.
Details
Motivation: Large language models are increasingly functioning as autonomous epistemic agents that curate information and generate advice, creating new informational dependencies that require fundamental shifts in AI evaluation and governance to ensure reliable, well-calibrated knowledge ecosystems.
Method: The paper proposes a normative framework centered on three pillars: building trustworthiness of epistemic AI agents (demonstrating competence, falsifiability, virtuous behaviors), aligning AI with human epistemic goals, and reinforcing socio-epistemic infrastructure through technical provenance systems and “knowledge sanctuaries.”
Result: The framework provides a roadmap for ensuring AI systems act as reliable partners in robust, inclusive knowledge ecosystems, addressing risks of cognitive deskilling and epistemic drift while potentially augmenting human judgment and collective decision-making.
Conclusion: The calibration of epistemic AI agents to human norms is a high-stakes necessity, requiring a fundamental shift in evaluation and governance to ensure beneficial human-AI knowledge ecosystems through trustworthy, aligned agents and reinforced socio-epistemic infrastructure.
Abstract: Large language models increasingly function as epistemic agents – entities that can 1) autonomously pursue epistemic goals and 2) actively shape our shared knowledge environment. They curate the information we receive, often supplanting traditional search-based methods, and are frequently used to generate both personal and deeply specialized advice. How they perform these functions, including whether they are reliable and properly calibrated to both individual and collective epistemic norms, is therefore highly consequential for the choices we make. We argue that the potential impact of epistemic AI agents on practices of knowledge creation, curation and synthesis, particularly in the context of complex multi-agent interactions, creates new informational interdependencies that necessitate a fundamental shift in evaluation and governance of AI. While a well-calibrated ecosystem could augment human judgment and collective decision-making, poorly aligned agents risk causing cognitive deskilling and epistemic drift, making the calibration of these models to human norms a high-stakes necessity. To ensure a beneficial human-AI knowledge ecosystem, we propose a framework centered on building and cultivating the trustworthiness of epistemic AI agents; aligning these agents with human epistemic goals; and reinforcing the surrounding socio-epistemic infrastructure. In this context, trustworthy AI agents must demonstrate epistemic competence, robust falsifiability, and epistemically virtuous behaviors, supported by technical provenance systems and “knowledge sanctuaries” designed to protect human resilience. This normative roadmap provides a path toward ensuring that future AI systems act as reliable partners in a robust and inclusive knowledge ecosystem.
[323] SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models
Peiyao Jiang, Zequn Qin, Xi Li
Main category: cs.AI
TL;DR: SpatialText is a diagnostic framework that isolates text-based spatial reasoning to test whether models construct genuine mental spatial models versus relying on linguistic heuristics.
Details
Motivation: Existing benchmarks fail to isolate intrinsic spatial cognition from statistical language heuristics, and multimodal evaluations often conflate spatial reasoning with visual perception. The authors aim to systematically investigate whether models construct flexible spatial mental models.
Method: Introduces SpatialText, a theory-driven diagnostic framework with dual-source methodology: human-annotated descriptions of real 3D indoor environments (capturing natural ambiguities, perspective shifts, functional relations) and code-generated, logically precise scenes (probing formal spatial deduction and epistemic boundaries).
Result: Systematic evaluation reveals fundamental representational limitations: models show proficiency in retrieving explicit spatial facts and operating within global coordinate systems, but exhibit critical failures in egocentric perspective transformation and local reference frame reasoning.
Conclusion: Current models rely heavily on linguistic co-occurrence heuristics rather than constructing coherent, verifiable internal spatial representations. SpatialText serves as a rigorous instrument for diagnosing cognitive boundaries of artificial spatial intelligence.
Abstract: Genuine spatial reasoning relies on the capacity to construct and manipulate coherent internal spatial representations, often conceptualized as mental models, rather than merely processing surface linguistic associations. While large language models exhibit advanced capabilities across various domains, existing benchmarks fail to isolate this intrinsic spatial cognition from statistical language heuristics. Furthermore, multimodal evaluations frequently conflate genuine spatial reasoning with visual perception. To systematically investigate whether models construct flexible spatial mental models, we introduce SpatialText, a theory-driven diagnostic framework. Rather than functioning simply as a dataset, SpatialText isolates text-based spatial reasoning through a dual-source methodology. It integrates human-annotated descriptions of real 3D indoor environments, which capture natural ambiguities, perspective shifts, and functional relations, with code-generated, logically precise scenes designed to probe formal spatial deduction and epistemic boundaries. Systematic evaluation across state-of-the-art models reveals fundamental representational limitations. Although models demonstrate proficiency in retrieving explicit spatial facts and operating within global, allocentric coordinate systems, they exhibit critical failures in egocentric perspective transformation and local reference frame reasoning. These systematic errors provide strong evidence that current models rely heavily on linguistic co-occurrence heuristics rather than constructing coherent, verifiable internal spatial representations. SpatialText thus serves as a rigorous instrument for diagnosing the cognitive boundaries of artificial spatial intelligence.
[324] OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents
Yichao Feng, Haoran Luo, Zhenghong Lin, Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh, Anh Tuan Luu
Main category: cs.AI
TL;DR: Scientific multi-agent LLM framework with dynamic orchestration for complex reasoning tasks
Details
Motivation: Existing multi-agent LLM systems are weak for scientific domains due to static prompts, rigid workflows, homogeneous models, poor domain adaptation, limited reasoning flexibility, and inability to revise decisions.
Method: Two-tier multi-model orchestration framework: orchestration model analyzes tasks and dynamically constructs domain-aware reasoning pipelines with specialized expert agents; execution model performs steps under generated specifications; iterative updates enable dynamic replanning
Result: Consistent improvements over existing multi-agent systems and strong baselines across diverse reasoning and scientific benchmarks
Conclusion: Proposed framework enables robust scientific reasoning through structured heterogeneous model collaboration, dynamic adaptation, and model-agnostic flexible deployment
Abstract: Multi-agent large language model frameworks are promising for complex multi-step reasoning, yet existing systems remain weak for scientific and knowledge-intensive domains due to static prompts and agent roles, rigid workflows, and homogeneous model reliance, leading to poor domain adaptation, limited reasoning flexibility, and high latency on heterogeneous or long-horizon scientific tasks. They also struggle to revise earlier decisions when intermediate reasoning diverges, reducing reliability in structured and calculation-heavy settings. To address these limitations, we propose a scientific-domain-oriented interactive two-tier multi-model orchestration framework. A dedicated orchestration model analyzes each task, dynamically constructs a domain-aware reasoning pipeline, and instantiates specialized expert agents with tailored prompts, while an execution model performs each step under generated role and instruction specifications. The orchestrator iteratively updates the pipeline based on intermediate feedback, enabling dynamic replanning, role reallocation, and prompt refinement across multi-turn interactions, strengthening robustness and specialization for scientific reasoning through structured heterogeneous model collaboration. The framework is model-agnostic and supports heterogeneous LLM integration with different capacities or costs, enabling flexible performance-efficiency trade-offs in practical scientific deployments. Experiments show consistent improvements over existing multi-agent systems and strong baselines across diverse reasoning and scientific-style benchmarks.
[325] REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry
Yuvraj Agrawal
Main category: cs.AI
TL;DR: REGAL: A registry-driven architecture for deterministic grounding of agentic AI systems in enterprise telemetry, addressing challenges of limited context, local semantics, and evolving interfaces.
Details
Motivation: Enterprise engineering organizations generate diverse telemetry data, and while LLMs enable agentic automation, grounding these agents on private telemetry faces practical challenges: limited model context, locally defined semantic concepts, and evolving metric interfaces.
Method: REGAL uses a registry-driven architecture with two main components: (1) a Medallion ELT pipeline producing replayable, semantically compressed Gold artifacts, and (2) a registry-driven compilation layer that synthesizes Model Context Protocol (MCP) tools from declarative metric definitions. The registry serves as an “interface-as-code” layer ensuring alignment between tool specification and execution.
Result: A prototype implementation and case study validate the feasibility of deterministic grounding and demonstrate implications for latency, token efficiency, and operational governance.
Conclusion: The work systematizes an architectural pattern for enterprise LLM grounding by elevating deterministic computation and semantic compilation to first-class design primitives for agentic systems, rather than proposing new learning algorithms.
Abstract: Enterprise engineering organizations produce high-volume, heterogeneous telemetry from version control systems, CI/CD pipelines, issue trackers, and observability platforms. Large Language Models (LLMs) enable new forms of agentic automation, but grounding such agents on private telemetry raises three practical challenges: limited model context, locally defined semantic concepts, and evolving metric interfaces. We present REGAL, a registry-driven architecture for deterministic grounding of agentic AI systems in enterprise telemetry. REGAL adopts an explicitly architectural approach: deterministic telemetry computation is treated as a first-class primitive, and LLMs operate over a bounded, version-controlled action space rather than raw event streams. The architecture combines (1) a Medallion ELT pipeline that produces replayable, semantically compressed Gold artifacts, and (2) a registry-driven compilation layer that synthesizes Model Context Protocol (MCP) tools from declarative metric definitions. The registry functions as an “interface-as-code” layer, ensuring alignment between tool specification and execution, mitigating tool drift, and embedding governance policies directly at the semantic boundary. A prototype implementation and case study validate the feasibility of deterministic grounding and illustrate its implications for latency, token efficiency, and operational governance. This work systematizes an architectural pattern for enterprise LLM grounding; it does not propose new learning algorithms, but rather elevates deterministic computation and semantic compilation to first-class design primitives for agentic systems.
[326] TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger
Main category: cs.AI
TL;DR: TikZilla: A two-stage pipeline (SFT + RL) using inverse graphics reward signals to generate high-quality TikZ figures from text descriptions, outperforming GPT-4o and matching GPT-5 with smaller models.
Details
Motivation: Existing Text-to-TikZ datasets are too small and noisy, causing text-figure mismatches. Current approaches rely only on supervised fine-tuning without considering rendered figure semantics, leading to errors like looping, irrelevant content, and incorrect spatial relations.
Method: Constructed DaTikZ-V4 dataset (4x larger and higher quality than previous). Trained TikZilla models (3B and 8B Qwen) using two-stage pipeline: supervised fine-tuning followed by reinforcement learning with reward signals from image encoder trained via inverse graphics.
Result: TikZilla improves by 1.5-2 points over base models on 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in image-based evaluation while operating at much smaller model sizes.
Conclusion: The two-stage pipeline with inverse graphics reward signals enables high-quality scientific figure generation from text, demonstrating that smaller models can achieve state-of-the-art performance through better training approaches.
Abstract: Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
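The RL stage's reward, a semantic comparison between the prompt and the rendered figure, can be sketched as below. This is our reading, not the paper's exact formula: the compile gate and the cosine form are assumptions, and the embeddings would come from their inverse-graphics-trained image encoder.

```python
import math

def tikz_rl_reward(text_emb, fig_emb, compiled_ok):
    """Schematic reward: a compile gate times the cosine similarity
    between the description's text embedding and the rendered figure's
    embedding. Unrenderable TikZ earns nothing, so the model is pushed
    toward code that both compiles and matches the description."""
    if not compiled_ok:
        return 0.0
    dot = sum(a * b for a, b in zip(text_emb, fig_emb))
    norm = math.sqrt(sum(a * a for a in text_emb)) * math.sqrt(sum(b * b for b in fig_emb))
    return dot / norm
```

Because the signal flows through the rendered image rather than token overlap, it can penalize exactly the SFT-only failure modes the abstract lists: looping, irrelevant content, and wrong spatial relations.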
[327] RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization
Siwei Zhang, Yun Xiong, Xi Chen, Zi’an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang
Main category: cs.AI
TL;DR: RAPO introduces retrieval-augmented policy optimization for LLM agents, using hybrid-policy rollout with retrieved step-level traces to expand exploration beyond pure on-policy methods.
Details
Motivation: Existing Agentic RL methods for LLM agents rely on pure on-policy exploration, limiting discovery of new reasoning perspectives. While some incorporate off-policy signals, they use full trajectories rather than fine-grained step-level exploration needed for agentic reasoning.
Method: Two-phase approach: 1) Hybrid-policy Agentic Rollout - agents reason over retrieved off-policy step-level traces to expand reasoning receptive field; 2) Retrieval-aware Policy Optimization - calibrates policy gradient with retrieval reward and importance shaping to prioritize retrieval-illuminating exploration.
Result: Achieves +5.0% average gain on fourteen datasets across three agentic reasoning tasks, with 1.2x faster training efficiency compared to existing methods.
Conclusion: RAPO successfully addresses exploration limitations in Agentic RL by integrating retrieval mechanisms for step-level exploration, leading to improved performance and efficiency in LLM agent reasoning tasks.
Abstract: Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent’s self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the necessity for the fine-grained, step-level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.
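The hybrid-policy rollout idea, interleaving retrieved off-policy steps with the agent's own, can be sketched as follows. This is a toy reading of the mechanism; the mixing rule, probability, and environment transition here are all hypothetical:

```python
import random

def hybrid_rollout(policy_step, retrieve_steps, state, max_steps=6, mix_prob=0.3, seed=0):
    """Toy hybrid-policy rollout: at each step the agent either generates
    its own action (pure on-policy) or conditions on a retrieved
    off-policy step-level trace, widening exploration beyond purely
    self-generated outputs."""
    rng = random.Random(seed)
    trace = []
    for _ in range(max_steps):
        retrieved = retrieve_steps(state)
        if retrieved and rng.random() < mix_prob:
            action, source = retrieved[0], "retrieved"
        else:
            action, source = policy_step(state), "on-policy"
        trace.append((state, action, source))
        state += 1  # toy environment transition
    return trace
```

Tagging each step's source is what makes the second phase possible: retrieval-aware optimization can then reweight retrieved steps (importance shaping) instead of treating the whole trajectory as on-policy.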
[328] Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation
Chengkai Wang, Baisong Liu
Main category: cs.AI
TL;DR: PURE: A preference-aware reasoning framework for LLM-based explainable recommendation that selects multi-hop reasoning paths aligned with user preferences to generate convincing explanations.
Details
Motivation: Current LLM-based explainable recommenders produce factually correct but preference-inconsistent explanations that justify items using attributes conflicting with users' historical preferences, leading to unconvincing reasoning missed by standard hallucination metrics.
Method: PURE follows a select-then-generate paradigm: 1) Selects compact multi-hop item-centric reasoning paths that are factually grounded and aligned with user preference structure, guided by user intent, specificity, and diversity; 2) Injects selected evidence into LLM generation via structure-aware prompting preserving relational constraints.
Result: Experiments on three real-world datasets show PURE consistently reduces preference-inconsistent explanations and factual hallucinations while maintaining competitive recommendation accuracy, explanation quality, and inference efficiency.
Conclusion: Trustworthy explanations require not only factual correctness but also justification aligned with user preferences; PURE’s preference-aware reasoning framework effectively addresses this need.
Abstract: LLM-based explainable recommenders can produce fluent explanations that are factually correct, yet still justify items using attributes that conflict with a user’s historical preferences. Such preference-inconsistent explanations yield logically valid but unconvincing reasoning and are largely missed by standard hallucination or faithfulness metrics. We formalize this failure mode and propose PURE, a preference-aware reasoning framework following a select-then-generate paradigm. Instead of only improving generation, PURE intervenes in evidence selection: it selects a compact set of multi-hop item-centric reasoning paths that are both factually grounded and aligned with user preference structure, guided by user intent, specificity, and diversity to suppress generic, weakly personalized evidence. The selected evidence is then injected into LLM generation via structure-aware prompting that preserves relational constraints. To measure preference inconsistency, we introduce a feature-level, user-centric evaluation metric that reveals misalignment overlooked by factuality-based measures. Experiments on three real-world datasets show that PURE consistently reduces preference-inconsistent explanations and factual hallucinations while maintaining competitive recommendation accuracy, explanation quality, and inference efficiency. These results highlight that trustworthy explanations require not only factual correctness but also justification aligned with user preferences.
[329] Odin: Multi-Signal Graph Intelligence for Autonomous Discovery in Knowledge Graphs
Muyukani Kizito, Elizabeth Nyambere
Main category: cs.AI
TL;DR: Odin is a production-deployed graph intelligence engine for autonomous discovery of meaningful patterns in knowledge graphs without predefined queries, using multi-signal integration including structural, semantic, temporal, and community-aware guidance.
Details
Motivation: Current retrieval-based systems require predefined queries and suffer from the "echo chamber" problem where exploration gets trapped in dense local communities. There's a need for autonomous discovery systems that can find meaningful patterns in knowledge graphs without prior specification, especially for regulated industries requiring provenance traceability.
Method: Uses COMPASS score combining: (1) structural importance via Personalized PageRank, (2) semantic plausibility through Neural Probabilistic Logic Learning as discriminative filter, (3) temporal relevance with configurable decay, and (4) community-aware guidance through GNN-identified bridge entities and inter-community affinity scores. Employs beam search with multi-signal guidance.
Result: First autonomous discovery system deployed in regulated production environments (healthcare and insurance). Achieves O(b·h) complexity while maintaining high recall compared to exhaustive exploration. Demonstrates significant improvements in pattern discovery quality and analyst efficiency with complete provenance traceability.
Conclusion: Odin represents a breakthrough in autonomous graph pattern discovery, addressing the echo chamber problem through multi-signal integration and bridge scoring mechanisms, making it suitable for regulated industries where hallucination is unacceptable.
Abstract: We present Odin, the first production-deployed graph intelligence engine for autonomous discovery of meaningful patterns in knowledge graphs without prior specification. Unlike retrieval-based systems that answer predefined queries, Odin guides exploration through the COMPASS (Composite Oriented Multi-signal Path Assessment) score, a novel metric that combines (1) structural importance via Personalized PageRank, (2) semantic plausibility through Neural Probabilistic Logic Learning (NPLL) used as a discriminative filter rather than generative model, (3) temporal relevance with configurable decay, and (4) community-aware guidance through GNN-identified bridge entities and inter-community affinity scores. This multi-signal integration, particularly the bridge scoring mechanism, addresses the “echo chamber” problem where graph exploration becomes trapped in dense local communities. We formalize the autonomous discovery problem, prove theoretical properties of our scoring function, and demonstrate that beam search with multi-signal guidance achieves $O(b \cdot h)$ complexity while maintaining high recall compared to exhaustive exploration. To our knowledge, Odin represents the first autonomous discovery system deployed in regulated production environments (healthcare and insurance), demonstrating significant improvements in pattern discovery quality and analyst efficiency. Our approach maintains complete provenance traceability – a critical requirement for regulated industries where hallucination is unacceptable.
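The shape of COMPASS-guided exploration can be sketched in a few lines: a composite score over signal families, driving a beam search whose expansion count matches the paper's O(b·h) claim. The weighted-sum form, signal functions, and names below are illustrative assumptions, not Odin's actual scoring function:

```python
import heapq

def compass_score(node, signals, weights):
    """Toy COMPASS-style composite: a weighted combination over the four
    signal families (structural, semantic, temporal, community/bridge)."""
    return sum(weights[name] * fn(node) for name, fn in signals.items())

def beam_search(start, neighbors, score, beam_width, horizon):
    """Beam search that keeps only the top-`beam_width` paths per hop, so
    total expansions stay O(beam_width * horizon) rather than exhaustive."""
    frontier = [(score(start), [start])]
    for _ in range(horizon):
        candidates = [(score(n), path + [n])
                      for _, path in frontier
                      for n in neighbors(path[-1])]
        if not candidates:
            break
        frontier = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [path for _, path in frontier]
```

In this picture, the bridge signal counteracts the echo chamber: nodes connecting communities score higher, so the beam is pulled out of dense local neighborhoods it would otherwise never leave.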
[330] Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
Hongliu Cao, Ilias Driouich, Eoin Thomas
Main category: cs.AI
TL;DR: PAE framework evaluates LLM agents on procedural awareness, exposing corrupt successes and failure patterns across models.
Details
Motivation: Current benchmarks evaluate whether tasks are completed but not how, missing procedural violations and corrupt successes in LLM agents used in high-stakes settings.
Method: Introduces Procedure-Aware Evaluation (PAE) framework that formalizes agent procedures as structured observations, exposes consistency relationships, evaluates on multiple axes (Utility, Efficiency, Interaction Quality, Procedural Integrity), and applies multi-dimensional gating to disqualify corrupt outcomes.
Result: Reveals 27-78% of benchmark-reported successes are corrupt successes concealing violations; gating substantially collapses Pass^4 rate and affects model rankings; exposes distinctive per-model failure signatures and structural flaws in benchmark design.
Conclusion: PAE provides more comprehensive evaluation of LLM agents by focusing on procedural awareness, revealing hidden failures and benchmark design flaws that current evaluation methods miss.
Abstract: Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non-redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27-78% of benchmark reported successes are corrupt successes concealing violations across interaction and integrity. Furthermore, gating substantially collapses Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
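The multi-dimensional gate is the framework's key move and is simple to state: a benchmark success only counts if every procedural axis also passes. A schematic sketch (axis names follow the paper; the data layout is our own):

```python
def gated_success(reward_ok, axes):
    """PAE-style gate: a run counts as a real success only if the
    benchmark reward fired AND every procedural axis (utility,
    efficiency, interaction quality, procedural integrity) passed.
    A reward-passing run that fails an axis is a 'corrupt success'."""
    return reward_ok and all(axes.values())

def corrupt_success_rate(runs):
    """Fraction of benchmark-reported successes that conceal a violation,
    i.e. reward-passing runs disqualified by the gate. `runs` is a list
    of (reward_ok, axes) pairs."""
    successes = [axes for reward_ok, axes in runs if reward_ok]
    corrupt = [axes for axes in successes if not all(axes.values())]
    return len(corrupt) / max(len(successes), 1)
```

This is also why gating collapses Pass^4: all four trials must now clear every axis, so a model whose per-trial corrupt-success rate is high loses compounded credit across the repeated trials.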
[331] AI Space Physics: Constitutive boundary semantics for open AI institutions
Oleg Romanchuk, Roman Bondar
Main category: cs.AI
TL;DR: AI Space Physics: A formal semantics framework for governing persistent AI institutions with self-expanding authority boundaries, treating expansion as first-class boundary events requiring witness obligations.
Details
Motivation: Current AI governance focuses on decision-layer constraints but lacks formal semantics for boundary-crossing mechanics in persistent AI institutions that accumulate state, invoke tools, and expand their authority over time.
Method: Introduces AI Space Physics with minimal state model, typed boundary channels, horizon-limited reach semantics, membrane-witness discipline, and core law family (P-1 series) requiring witness completeness, non-bypass mediation, atomic adjudication-to-effect transitions, and replayable reconstruction.
Result: Provides constitutive semantics that explicitly separates second-order effects into structural expansion and policy broadening, treating expansion transitions as governance-relevant even when immediate external effects are zero.
Conclusion: Reclassifies authority-surface expansion as first-class boundary events with constitutive witness obligations, making expansion without immediate commit adjudication-relevant within a formal governance framework.
Abstract: Agentic AI deployments increasingly behave as persistent institutions rather than one-shot inference endpoints: they accumulate state, invoke external tools, coordinate multiple runtimes, and modify their future authority surface over time. Existing governance language typically specifies decision-layer constraints but leaves the causal mechanics of boundary crossing underdefined, particularly for transitions that do not immediately change the external world yet expand what the institution can later do. This paper introduces AI Space Physics as a constitutive semantics for open, self-expanding AI institutions. We define a minimal state model with typed boundary channels, horizon-limited reach semantics, and a membrane-witness discipline. The core law family (P-1, P-1a, P-1b, P-1c) requires witness completeness, non-bypass mediation, atomic adjudication-to-effect transitions, and replayable reconstruction of adjudication class. We explicitly separate second-order effects into structural expansion and policy broadening, and treat expansion transitions as governance-relevant even when immediate external deltas are zero. The novelty claim is precise rather than expansive: this work does not introduce mediation as a concept; it reclassifies authority-surface expansion as a first-class boundary event with constitutive witness obligations. In this semantics, expansion without immediate commit remains adjudication-relevant.
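To make the membrane-witness discipline tangible, here is a toy mediator in Python: every external effect must pass through a registered channel and deposit a witness record before it commits, and structural expansion (registering a new channel) is itself a witnessed boundary event even though it produces no immediate external delta. This is an illustrative sketch of the idea only; the class and method names are invented, and the paper's formal semantics is far richer:

```python
from dataclasses import dataclass, field

@dataclass
class Institution:
    channels: set = field(default_factory=set)     # typed boundary channels
    witnesses: list = field(default_factory=list)  # replayable witness log

    def _witness(self, kind, payload):
        # Witness completeness: record before any effect commits.
        self.witnesses.append((kind, payload))

    def expand(self, channel):
        # Structural expansion is a first-class boundary event,
        # adjudication-relevant even though nothing external changes yet.
        self._witness("expansion", channel)
        self.channels.add(channel)

    def act(self, channel, effect):
        # Non-bypass mediation: effects flow only through registered channels.
        if channel not in self.channels:
            raise PermissionError(f"unmediated channel: {channel}")
        self._witness("effect", (channel, effect))
        return f"committed:{effect}"
```

The witness log is what makes the adjudication class replayable: expansions appear in it even when no commit ever follows.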
[332] Agentic AI-based Coverage Closure for Formal Verification
Sivaram Pothireddypalli, Ashish Raman, Deepak Narayan Gadde, Aman Kumar
Main category: cs.AI
TL;DR: An agentic AI-driven workflow using LLM-enabled Generative AI automates coverage analysis for formal verification, identifies coverage gaps, and generates formal properties to accelerate verification efficiency and improve coverage closure in IC development.
Details
Motivation: Traditional exhaustive approaches for coverage closure in Integrated Chip (IC) development often fail to achieve full coverage within project timelines, creating a need for more efficient verification methods.
Method: The study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI to automate coverage analysis for formal verification, identify coverage gaps, and generate required formal properties.
Result: Benchmarking on open-source and internal designs shows measurable increases in coverage metrics, with improvements correlated to design complexity. Comparative analysis validates the effectiveness of this approach.
Conclusion: Agentic AI-based techniques have significant potential to improve formal verification productivity and support comprehensive coverage closure in IC development.
Abstract: Coverage closure is a critical requirement in Integrated Chip (IC) development process and key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI (GenAI) to automate coverage analysis for formal verification, identify coverage gaps, and generate the required formal properties. The framework accelerates verification efficiency by systematically addressing coverage holes. Benchmarking open-source and internal designs reveals a measurable increase in coverage metrics, with improvements correlated to the complexity of the design. Comparative analysis validates the effectiveness of this approach. These results highlight the potential of agentic AI-based techniques to improve formal verification productivity and support comprehensive coverage closure.
[333] Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification
Aman Kumar, Deepak Narayan Gadde, Luu Danh Minh, Vaisakh Naduvodi Viswambharan, Keerthan Kopparam Radhakrishna, Sivaram Pothireddypalli
Main category: cs.AI
TL;DR: Saarthi is an AI framework for formal verification using multi-agent collaboration, enhanced with structured rulebooks and GraphRAG to improve SystemVerilog Assertion generation accuracy and reduce iteration cycles.
Details
Motivation: Current LLM-based agents for formal verification suffer from hallucinations and errors, especially with complex tasks. The paper aims to enhance Saarthi's robustness for domain-specific general intelligence in formal verification, particularly for STSC (Short Term, Short Context) problems.
Method: Two key enhancements: (1) structured rulebook and specification grammar for better accuracy and controllability of SVA generation, (2) integration of advanced RAG techniques like GraphRAG to provide agents with technical knowledge for iterative refinement.
Result: Benchmarked on NVIDIA’s CVDP benchmark, showing 70% improvement in generated assertion accuracy and 50% reduction in iterations needed for coverage closure.
Conclusion: Saarthi with these enhancements represents significant progress toward domain-specific general intelligence for formal verification, particularly effective for STSC problems.
Abstract: Saarthi is an agentic AI framework that uses multi-agent collaboration to perform end-to-end formal verification. Even though the framework provides a complete flow from specification to coverage closure, with around 40% efficacy, there are several challenges that need to be addressed to make it more robust and reliable. Artificial General Intelligence (AGI) is still a distant goal, and current Large Language Model (LLM)-based agents are prone to hallucinations and making mistakes, especially when dealing with complex tasks such as formal verification. However, with the right enhancements and improvements, we believe that Saarthi can be a significant step towards achieving domain-specific general intelligence for formal verification. Especially for problems that require Short Term, Short Context (STSC) capabilities, such as formal verification, Saarthi can be a powerful tool to assist verification engineers in their work. In this paper, we present two key enhancements to the Saarthi framework: (1) a structured rulebook and specification grammar to improve the accuracy and controllability of SystemVerilog Assertion (SVA) generation, and (2) integration of advanced Retrieval Augmented Generation (RAG) techniques, such as GraphRAG, to provide agents with access to technical knowledge and best practices for iterative refinement and improvement of outputs. We also benchmark these enhancements for the overall Saarthi framework using challenging test cases from NVIDIA’s CVDP benchmark targeting formal verification. Our benchmark results stand out with a 70% improvement in the accuracy of generated assertions, and a 50% reduction in the number of iterations required to achieve coverage closure.
[334] FEAST: Retrieval-Augmented Multi-Hierarchical Food Classification for the FoodEx2 System
Lorenzo Molfetta, Alessio Cocchieri, Stefano Fantazzini, Giacomo Frisoni, Luca Ragazzi, Gianluca Moro
Main category: cs.AI
TL;DR: FEAST is a retrieval-augmented framework for hierarchical text classification in the FoodEx2 system, using a three-stage approach with deep metric learning to handle complex label interdependencies and data sparsity.
Details
Motivation: The FoodEx2 food classification system faces implementation barriers due to complex hierarchical structure, label interdependencies, data sparsity, and extreme output dimensions. Existing models work on balanced hierarchies but fail on practical constraints of real-world systems like FoodEx2.
Method: FEAST decomposes FoodEx2 classification into three stages: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. It leverages hierarchical structure to guide training and uses deep metric learning to learn discriminative embeddings that mitigate data sparsity.
Result: FEAST outperforms the prior European CNN baseline's F1 scores by 12-38% on rare classes in the multilingual FoodEx2 benchmark, showing significant improvement in handling rare and fine-grained labels.
Conclusion: The proposed FEAST framework effectively addresses the challenges of hierarchical text classification in real-world systems like FoodEx2 by decomposing the problem and using metric learning to handle data sparsity and rare classes.
Abstract: Hierarchical text classification (HTC) and extreme multi-label classification (XML) tasks face compounded challenges from complex label interdependencies, data sparsity, and extreme output dimensions. These challenges are exemplified in the European Food Safety Authority’s FoodEx2 system, a standardized food classification framework essential for food consumption monitoring and contaminant exposure assessment across Europe. FoodEx2 coding transforms natural language food descriptions into a set of codes from multiple standardized hierarchies, but faces implementation barriers due to its complex structure. Given a food description (e.g., “organic yogurt’’), the system identifies its base term (“yogurt’’), all the applicable facet categories (e.g., “production method’’), and then every facet descriptor relevant to each category (e.g., “organic production’’). While existing models perform adequately on well-balanced and semantically dense hierarchies, no prior work has addressed the practical constraints imposed by the FoodEx2 system. The limited literature addressing such real-world scenarios further compounds these challenges. We propose FEAST (Food Embedding And Semantic Taxonomy), a novel retrieval-augmented framework that decomposes FoodEx2 classification into a three-stage approach: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. By leveraging the system’s hierarchical structure to guide training and performing deep metric learning, FEAST learns discriminative embeddings that mitigate data sparsity and improve generalization on rare and fine-grained labels. Evaluated on the multilingual FoodEx2 benchmark, FEAST outperforms the prior European CNN baseline’s F1 scores by 12-38% on rare classes.
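The three-stage decomposition reads naturally as retrieval over a learned embedding space: nearest-neighbour lookup for the base term, thresholded similarity for the multi-label facet step, then per-category descriptor retrieval. A toy sketch with hand-crafted vectors (FEAST itself uses deep-metric-learned multilingual embeddings; every name, vector, and threshold below is assumed for illustration):

```python
import math

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den

def nearest(query, candidates):
    """Retrieve the label whose embedding is most similar to the query."""
    return max(candidates, key=lambda name: cos(query, candidates[name]))

def classify(query, base_terms, facets, descriptors, facet_threshold=0.5):
    base = nearest(query, base_terms)                  # stage 1: base term
    active = [f for f, v in facets.items()             # stage 2: multi-label facets
              if cos(query, v) > facet_threshold]
    return base, {f: nearest(query, descriptors[f])    # stage 3: descriptors
                  for f in active}
```

Stage 2 is deliberately thresholded rather than argmax'd, since a description may activate several facet categories at once.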
[335] Neuro-Symbolic Artificial Intelligence: A Task-Directed Survey in the Black-Box Models Era
Giovanni Pio Delvecchio, Lorenzo Molfetta, Gianluca Moro
Main category: cs.AI
TL;DR: Survey paper examining task-specific advancements in Neuro-Symbolic (NeSy) AI, exploring how symbolic systems can enhance explainability and reasoning capabilities in real-world applications.
Details
Motivation: The integration of symbolic computing with neural networks has been a long-standing interest in AI research, with NeSy methods considered as potential proxies for human-level intelligence. However, limited semantic generalizability and challenges in complex domains hinder practical implementation. Recent connectionist successes raise questions about NeSy competitiveness, particularly in NLP and Computer Vision fields.
Method: This is a survey paper that examines task-specific advancements in the NeSy domain. The authors analyze how incorporating symbolic systems can enhance explainability and reasoning capabilities across various applications and methodologies.
Result: The survey provides a comprehensive resource for researchers exploring explainable NeSy methodologies for real-life tasks and applications. It includes reproducibility details and in-depth comments on each surveyed research work, available through a GitHub repository.
Conclusion: The survey serves as a valuable resource for understanding current NeSy advancements and their potential to enhance explainability and reasoning in AI systems, addressing the gap between symbolic and connectionist approaches.
Abstract: The integration of symbolic computing with neural networks has intrigued researchers since the first theorizations of Artificial intelligence (AI). The ability of Neuro-Symbolic (NeSy) methods to infer or exploit behavioral schema has been widely considered as one of the possible proxies for human-level intelligence. However, the limited semantic generalizability and the challenges in declining complex domains with pre-defined patterns and rules hinder their practical implementation in real-world scenarios. The unprecedented results achieved by connectionist systems since the last AI breakthrough in 2017 have raised questions about the competitiveness of NeSy solutions, with particular emphasis on the Natural Language Processing and Computer Vision fields. This survey examines task-specific advancements in the NeSy domain to explore how incorporating symbolic systems can enhance explainability and reasoning capabilities. Our findings are meant to serve as a resource for researchers exploring explainable NeSy methodologies for real-life tasks and applications. Reproducibility details and in-depth comments on each surveyed research work are made available at https://github.com/disi-unibo-nlp/task-oriented-neuro-symbolic.git.
[336] Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Shogo Noguchi, Taketo Akama, Tai Nakamura, Shun Minamikawa, Natalia Polouliakh
Main category: cs.AI
TL;DR: Using ANN representations of acoustic and expectation-related information as teacher targets improves EEG-based music identification, with combined representations outperforming individual ones and random initialization ensembles.
Details
Motivation: To improve EEG-based music identification by leveraging ANN representations that resemble cortical activity patterns during music listening, specifically distinguishing between acoustic and expectation-related information.
Method: Using ANN representations as teacher targets for EEG recognition models, pretraining models to predict either acoustic or expectation-related representations, and combining both representations to achieve complementary gains.
Result: Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations.
Conclusion: Teacher representation type shapes downstream performance, and representation learning can be guided by neural encoding principles, pointing toward advances in predictive music cognition and neural decoding with potential for general-purpose EEG models.
Abstract: During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
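The dual-teacher pretraining objective can be written down directly: the EEG model regresses onto both the acoustic and the expectation teacher representations, with a mixing weight controlling their balance. A minimal sketch (the MSE choice and the scalar weighting are assumptions; the paper does not specify its exact loss here):

```python
def dual_teacher_loss(pred_acoustic, pred_expect,
                      teacher_acoustic, teacher_expect, alpha=0.5):
    """Combined pretraining objective: the EEG encoder's two projection
    heads are regressed onto the acoustic and expectation teacher
    embeddings; alpha weights the two terms."""
    def mse(p, t):
        return sum((x - y) ** 2 for x, y in zip(p, t)) / len(p)
    return (alpha * mse(pred_acoustic, teacher_acoustic)
            + (1 - alpha) * mse(pred_expect, teacher_expect))
```

Setting alpha to 0 or 1 recovers the single-teacher baselines the paper compares against; intermediate values are where the complementary gains would show up.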
[337] No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models
Omer Sela
Main category: cs.AI
TL;DR: CDD detects data contamination by measuring peakedness of model outputs, but only works when fine-tuning causes verbatim memorization; parameter-efficient fine-tuning can evade detection.
Details
Motivation: To understand when contamination detection methods based on output distribution analysis succeed or fail, particularly in the context of different fine-tuning approaches and model capacities.
Method: Controlled contamination experiments on GSM8K, HumanEval, and MATH datasets using small language models (70M-410M parameters), comparing different fine-tuning approaches (full fine-tuning vs. low-rank adaptation) and measuring CDD’s detection accuracy.
Result: CDD only detects contamination when fine-tuning produces verbatim memorization. With low-rank adaptation (parameter-efficient fine-tuning), models learn from contaminated data without memorizing it, and CDD performs at chance level even with verifiable contamination.
Conclusion: There’s a memorization threshold governing detectability; parameter-efficient fine-tuning can produce contamination that output-distribution methods cannot detect, highlighting practical limitations of current contamination detection approaches.
Abstract: CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model’s sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD’s effectiveness depends critically on whether fine-tuning produces verbatim memorization. With low-rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine-tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect. Our code is available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM
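The peakedness measurement at the heart of CDD can be sketched as follows: sample several outputs, compare each to the greedy decoding by normalized edit distance, and report the fraction that fall within a small tolerance. A memorizing (contaminated) model produces near-identical samples and a score near 1; a model that learned without memorizing, as under low-rank adaptation, does not. The tolerance value below is an illustrative assumption, not the paper's setting:

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def peakedness(samples, greedy, tolerance=0.05):
    """Fraction of sampled outputs within a small normalized edit
    distance of the greedy output; values near 1 suggest the verbatim
    memorization that output-distribution detectors rely on."""
    close = 0
    for s in samples:
        d = edit_distance(s, greedy) / max(len(s), len(greedy), 1)
        close += d <= tolerance
    return close / len(samples)
```

The memorization-threshold finding then reads naturally: when fine-tuning never pushes this score above chance-level spread, CDD has nothing to detect.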
[338] NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind
Nataliya Kosmyna, Eugene Hauptmann
Main category: cs.AI
TL;DR: NeuroSkill system: Real-time proactive agentic system that models Human State of Mind using EXG foundation model and text embeddings, running offline on edge devices via BCI brain signal inputs.
Details
Motivation: To create a real-time proactive agentic system that can understand and engage with human cognitive and affective states directly from brain-computer interface data, enabling more natural human-AI interaction.
Method: Uses foundation EXG model and text embeddings to process brain signals from BCI devices, with NeuroLoop harness running agentic flows that engage with human state of mind via SKILL.md descriptions and API/CLI interfaces.
Result: A fully offline edge system capable of real-time proactive engagement with humans on cognitive and affective levels, providing actionable tool calls and protocol execution based on brain signal analysis.
Conclusion: The NeuroSkill system represents a novel approach to human-AI interaction by directly modeling human state of mind from brain signals, enabling more empathetic and responsive agentic systems.
Abstract: Real-time proactive agentic system, capable of modeling Human State of Mind, using foundation EXG model and text embeddings model, running fully offline on the edge. Unlike all previously known systems, the NeuroSkill(tm) system leverages SKILL.md description of Human’s State of Mind via API and CLI provided by the system, directly from the Brain-Computer Interface (BCI) devices, which records Human biophysical and brain signals. Our custom harness - NeuroLoop(tm) - utilizes all of the above to run agentic flow that manages to engage with the Human on multiple cognitive and affective levels of their State of Mind (e.g., empathy), by providing actionable tool calls and protocol execution with explicit or implicit requests from the Human. GPLv3 open-source software with ethically aligned AI100 licensing for the skill markdown.
[339] Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals
Patrick Gerard, Svitlana Volkova
Main category: cs.AI
TL;DR: DGRO aligns language models to community norms using implicit preference signals from acceptance behavior, without explicit preference labels.
Details
Motivation: Current alignment methods require explicit preference supervision or predefined principles, which exclude most online communities lacking annotation infrastructure or dealing with sensitive topics where preference elicitation is costly or ethically fraught.
Method: Density-guided response optimization (DGRO) uses geometric structure in representation space: accepted responses occupy coherent, high-density regions reflecting community norms, while rejected content falls in sparser areas. This implicit preference signal guides alignment without explicit labels.
Result: Local density recovers pairwise community judgments, showing geometric structure encodes meaningful preference signals. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines across diverse communities.
Conclusion: DGRO provides a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned, though learning from emergent acceptance behavior carries implications and risks.
Abstract: Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities – particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics – where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior.
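The density signal DGRO builds on can be approximated with a simple k-nearest-neighbour estimate: score each candidate response embedding by the inverse mean distance to its k closest accepted-content embeddings, and prefer high-density candidates. A minimal sketch (DGRO's actual estimator and how the score enters optimization are not specified here; this is an assumed k-NN stand-in):

```python
import math

def knn_density(candidate, accepted, k=3):
    """Local density: inverse mean distance to the k nearest accepted-
    response embeddings. High density means the candidate sits in a
    coherent region of community-accepted content."""
    dists = sorted(math.dist(candidate, a) for a in accepted)
    return 1.0 / (sum(dists[:k]) / k + 1e-8)

def pick_response(candidates, accepted, k=3):
    """Density-guided selection: index of the candidate embedding with
    the highest local density among accepted content."""
    scores = [knn_density(c, accepted, k) for c in candidates]
    return scores.index(max(scores))
```

This is also how the paper's validation reads: if local density recovers pairwise community judgments, a score like this carries real preference signal.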
[340] AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
Zihang Zeng, Jiaquan Zhang, Pengze Li, Yuan Qi, Xi Chen
Main category: cs.AI
TL;DR: A Bayesian adversarial multi-agent framework for scientific code generation that coordinates three LLM agents to improve reliability and reduce error propagation in AI for Science tasks.
Details
Motivation: LLMs show potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics.
Method: Three LLM-based agents coordinated under Bayesian framework: Task Manager structures inputs into plans/test cases, Code Generator produces solutions, and Evaluator provides feedback. Uses adversarial loop where Task Manager refines test cases to challenge Code Generator, with prompt distributions dynamically updated using Bayesian principles integrating code quality metrics.
Result: Benchmark evaluations demonstrate LCP’s effectiveness in generating robust code while minimizing error propagation. Tested on an Earth Science cross-disciplinary task, the platform shows strong reliability and outperforms competing models.
Conclusion: The proposed Low-code Platform (LCP) reduces dependence on LLM reliability, addresses evaluation uncertainty in scientific tasks, and streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements.
Abstract: Large Language Models (LLMs) demonstrate potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP’s effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.
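One concrete way to realize "prompt distributions dynamically updated using Bayesian principles" is Thompson sampling over prompt variants, with a Beta posterior per variant updated by a scalar code-quality score in [0, 1] (e.g. an average of correctness, structural alignment, and static-analysis metrics). This is an assumed instantiation for illustration, not the paper's exact mechanism:

```python
import random

class BayesianPromptSelector:
    """Maintains a Beta(a, b) posterior over each prompt variant's
    quality and picks a variant by Thompson sampling: draw from each
    posterior, use the variant with the highest draw."""
    def __init__(self, prompts, seed=None):
        self.prompts = prompts
        self.params = {p: [1.0, 1.0] for p in prompts}  # uniform priors
        self.rng = random.Random(seed)

    def select(self):
        draws = {p: self.rng.betavariate(a, b)
                 for p, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, prompt, quality):
        # Fractional quality scores update the posterior as partial
        # successes/failures.
        a, b = self.params[prompt]
        self.params[prompt] = [a + quality, b + (1.0 - quality)]
```

Exploration falls away automatically: variants that keep producing low-quality code are sampled less and less often.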
[341] Valet: A Standardized Testbed of Traditional Imperfect-Information Card Games
Mark Goadrich, Achille Morenville, Éric Piette
Main category: cs.AI
TL;DR: Valet is a diverse testbed of 21 traditional imperfect-information card games designed to benchmark AI algorithms across varied game mechanics and information structures.
Details
Motivation: Current AI evaluation for imperfect-information games relies on individual game metrics, making it difficult to assess algorithm robustness across different game types. There's a need for comprehensive benchmarking suites that span diverse game characteristics.
Method: Created Valet testbed with 21 traditional card games spanning multiple genres, cultures, player counts, deck structures, mechanics, and information hiding methods. Games are encoded in RECYCLE card game description language for standardization. Characterized games through random simulations to measure branching factors and duration, with baseline performance using Monte Carlo Tree Search against random opponents.
Result: Successfully developed a diverse benchmarking suite covering wide range of imperfect-information game characteristics. Empirical characterization provides quantitative metrics for each game’s complexity. Baseline results demonstrate the testbed’s suitability for comparative algorithm evaluation.
Conclusion: Valet provides a comprehensive testbed for evaluating imperfect-information game-playing algorithms across diverse game types, enabling more robust assessment of AI capabilities beyond single-game metrics.
Abstract: AI algorithms for imperfect-information games are typically compared using performance metrics on individual games, making it difficult to assess robustness across game choices. Card games are a natural domain for imperfect information due to hidden hands and stochastic draws. To facilitate comparative research on imperfect-information game-playing algorithms and game systems, we introduce Valet, a diverse and comprehensive testbed of 21 traditional imperfect-information card games. These games span multiple genres, cultures, player counts, deck structures, mechanics, winning conditions, and methods of hiding and revealing information. To standardize implementations across systems, we encode the rules of each game in RECYCLE, a card game description language. We empirically characterize each game’s branching factor and duration using random simulations, reporting baseline score distributions for a Monte Carlo Tree Search player against random opponents to demonstrate the suitability of Valet as a benchmarking suite.
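The random-simulation characterization is straightforward to sketch against a generic game interface: play uniformly random games, record the number of legal moves at every decision point and the length of each game, and report the means. A minimal sketch (the interface functions are assumptions; Valet games are defined in RECYCLE, not Python):

```python
import random

def characterize(initial_state, legal_moves, apply_move, is_terminal,
                 n_sims=100, seed=0):
    """Estimate a game's mean branching factor and mean duration by
    playing uniformly random games."""
    rng = random.Random(seed)
    branchings, lengths = [], []
    for _ in range(n_sims):
        state, length = initial_state(), 0
        while not is_terminal(state):
            moves = legal_moves(state)
            branchings.append(len(moves))     # branching at this decision
            state = apply_move(state, rng.choice(moves))
            length += 1
        lengths.append(length)
    return (sum(branchings) / len(branchings),
            sum(lengths) / len(lengths))
```

The same interface is enough to drop in an MCTS player against random opponents for the baseline score distributions.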
[342] Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
Achyutha Menon, Magnus Saebo, Tyler Crosse, Spencer Gibson, Eyon Jang, Diogo Cruz
Main category: cs.AI
TL;DR: Modern language model agents resist goal drift under adversarial pressure but inherit drift when conditioned on trajectories from weaker agents; only GPT-5.1 maintains consistent resilience among the models tested.
Details
Motivation: As language models are increasingly deployed as agents in long-context tasks, understanding their susceptibility to goal drift (deviation from original objectives) is crucial, especially for newer models where drift characteristics remain unclear.
Method: Investigates drift in state-of-the-art models using a simulated stock-trading environment, testing robustness under adversarial pressure and conditioning on prefilled trajectories from weaker agents. Also validates findings in a new emergency room triage environment.
Result: Models are robust to adversarial pressure, but this resilience is brittle: they inherit drift when conditioned on trajectories from weaker agents. Drift behavior varies significantly by model family, with only GPT-5.1 maintaining consistent resilience. Drift correlates poorly with instruction-hierarchy following.
Conclusion: Modern LM agents remain vulnerable to contextual pressures despite apparent robustness, highlighting the need for refined post-training techniques to mitigate goal drift across different deployment settings.
Abstract: The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents’ tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025). These models are largely shown to be robust even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among tested models. We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior, with strong hierarchy following failing to reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment to show preliminary evidence for the transferability of our results across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this.
[343] Robust Counterfactual Inference in Markov Decision Processes
Jessica Lally, Milad Kazemi, Nicola Paoletti
Main category: cs.AI
TL;DR: Novel non-parametric approach for counterfactual inference in MDPs that computes tight bounds on counterfactual probabilities across all compatible causal models, enabling robust policy optimization.
Details
Motivation: Existing counterfactual inference methods for MDPs assume specific causal models, limiting validity and usefulness since many causal models align with observational/interventional distributions, each yielding different counterfactual distributions.
Method: Proposes non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models using closed-form expressions, avoiding exponential optimization problems. Constructs interval counterfactual MDP and identifies robust policies optimizing worst-case reward.
Result: Method provides highly efficient and scalable computation for non-trivial MDPs, demonstrates improved robustness over existing methods in various case studies.
Conclusion: The approach addresses limitations of existing counterfactual inference methods by providing robust bounds across all compatible causal models, enabling more valid and useful counterfactual analysis in MDPs.
Abstract: This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
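The worst-case optimization over an interval MDP can be made concrete with a small sketch. This is not the paper's closed-form bound computation; it is a generic robust value iteration over interval transition probabilities (assuming valid intervals with sum(p_lo) ≤ 1 ≤ sum(p_hi)), where the adversary's worst-case distribution is obtained by greedily shifting probability mass toward low-value successors.

```python
import numpy as np

def worst_case_dist(p_lo, p_hi, V):
    # Adversary puts as much mass as possible on low-value successors,
    # subject to the per-successor interval bounds summing to 1.
    p = p_lo.astype(float).copy()
    budget = 1.0 - p.sum()
    for s in np.argsort(V):                  # lowest-value states first
        add = min(p_hi[s] - p_lo[s], budget)
        p[s] += add
        budget -= add
        if budget <= 1e-12:
            break
    return p

def robust_value_iteration(P_lo, P_hi, R, gamma=0.9, iters=200):
    # P_lo, P_hi: [S, A, S] interval bounds on transition probabilities.
    # R: [S, A] rewards. Returns worst-case values and a robust policy.
    S, A, _ = P_lo.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        for s in range(S):
            for a in range(A):
                p = worst_case_dist(P_lo[s, a], P_hi[s, a], V)
                Q[s, a] = R[s, a] + gamma * p @ V
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

With degenerate intervals (p_lo equal to p_hi) this reduces to standard value iteration, which gives a quick sanity check.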
[344] ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
Matteo Merler, Nicola Dainese, Minttu Alakuijala, Giovanni Bonetta, Pietro Ferrazzi, Yu Tian, Bernardo Magnini, Pekka Marttinen
Main category: cs.AI
TL;DR: ViPlan benchmark compares VLM-grounded symbolic planning vs direct VLM planning in visual domains, showing each excels in different scenarios based on grounding accuracy and linguistic knowledge needs.
Details
Motivation: There's a lack of rigorous comparison between VLM-grounded symbolic planning approaches and direct VLM planning methods due to missing visual benchmarks that support symbolic planning.
Method: Created ViPlan benchmark with increasingly challenging tasks in two visual domains: visual Blocksworld and simulated household robotics. Compares VLM-as-grounder (symbolic planning with VLM grounding) vs VLM-as-planner (direct VLM planning) approaches.
Result: VLM-as-grounder outperforms in Blocksworld (46% vs 9% success) where image grounding is crucial and accurate. VLM-as-planner excels in household robotics (34% vs 5% success) where linguistic knowledge helps and partial observability hinders grounding approaches. Chain-of-Thought prompting shows no consistent benefits.
Conclusion: ViPlan reveals fundamental trade-offs between symbolic and neural planning approaches in visual domains, with each method having distinct strengths based on grounding accuracy and knowledge requirements.
Abstract: Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans, with recent work extending this idea to visual domains using Vision-Language Models (VLMs). However, a rigorous comparison with methods that plan directly with VLMs is missing, due to a lack of visual benchmarks that support symbolic planning. We present ViPlan, the first open-source benchmark for comparing VLM-grounded symbolic approaches (VLM-as-grounder) with direct VLM planning methods (VLM-as-planner). ViPlan introduces a series of increasingly challenging tasks in two visual domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We find VLM-as-grounder methods to outperform direct VLM planning in Blocksworld (solving 46% of the tasks against 9%), where image grounding is both crucial and accurate. However, in the household robotics tasks, where linguistic knowledge helps, VLM-as-planner methods are greatly superior to VLM-as-grounder approaches (solving 34% of the tasks against 5%), which are hindered by partial observability. Thus, ViPlan domains capture fundamental shortcomings of both planning approaches, which we further diagnose with a qualitative failure analysis. Finally, across methods, we observe no consistent benefit from Chain-of-Thought prompting, suggesting persistent limitations in current VLMs’ visual reasoning abilities.
[345] Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu
Main category: cs.AI
TL;DR: PC Agent-E is an efficient agent training framework for computer use that reduces reliance on large-scale human demonstrations by using AI-synthesized alternative action decisions to augment a small set of human trajectories.
Details
Motivation: Scaling high-quality trajectory data is a critical bottleneck for developing human-like computer use agents. The paper aims to address this by reducing dependence on large-scale human demonstrations.
Method: Start with 312 human-annotated computer use trajectories, augment them by synthesizing diverse alternative action decisions using Claude 3.7 Sonnet, then train the PC Agent-E model on these enriched trajectories.
Result: Achieved 141% relative improvement over training on human trajectories alone, surpassed Claude 3.7 Sonnet by 10% on WindowsAgentArena-V2 benchmark, and released an improved benchmark.
Conclusion: Integrating human computer use skills with automated AI data synthesis enables efficient agent training with minimal human demonstrations while achieving superior performance over both human-only training and direct AI distillation.
Abstract: Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further augment them by synthesizing diverse alternative action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark we also released. By integrating robust human computer use skills with automated AI data synthesis capabilities, our method not only brought substantial improvements over training on human trajectories alone, but also significantly surpassed direct distillation from Claude 3.7 Sonnet. Code, data and models are available at https://github.com/GAIR-NLP/PC-Agent-E
[346] OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao
Main category: cs.AI
TL;DR: Introduces a model merging benchmark for Multimodal LLMs covering vision, audio, and video tasks, proposes a novel noise-removing merging method, and shows merging improves MLLMs without training data.
Details
Motivation: Foundation models update slowly while domain-specific models evolve rapidly. Model merging can combine expert models into more capable unified models, but lacks benchmarks for Multimodal LLMs covering diverse modalities.
Method: 1) Creates MLLM merging benchmark with VQA, Geometry, Chart, OCR, Grounding tasks; 2) Implements 10 merging algorithms; 3) Proposes novel method removing noise from task vectors and optimizing merged vector based on task vector interaction loss.
Result: Proposed method achieves average 2.48% performance gain. Model merging enables building improved MLLMs without training data. Complementarity among multiple modalities outperforms individual modalities.
Conclusion: Model merging offers promising approach for building enhanced MLLMs, with proposed benchmark and method advancing the field toward Omni-language models combining vision, audio, and video capabilities.
Abstract: Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, there lacks a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, $\textbf{(i)}$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. $\textbf{(ii)}$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. $\textbf{(iii)}$ We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.
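Task-vector merging, the substrate these algorithms operate on, can be sketched in a few lines. The trimming step below is a simple magnitude-based denoising heuristic in the spirit of TIES-Merging, used here only for illustration; it is not OptMerge's interaction-loss optimization.

```python
import numpy as np

def task_vector(base, finetuned):
    # A task vector is the parameter delta introduced by fine-tuning.
    return {k: finetuned[k] - base[k] for k in base}

def trim(tv, keep=0.2):
    # Zero out all but the largest-magnitude `keep` fraction of entries
    # per tensor -- a crude proxy for removing task-vector noise.
    out = {}
    for k, v in tv.items():
        thresh = np.quantile(np.abs(v).ravel(), 1.0 - keep)
        out[k] = np.where(np.abs(v) >= thresh, v, 0.0)
    return out

def merge(base, task_vectors, lam=1.0):
    # Task arithmetic: base weights plus a scaled average of task vectors.
    merged = {k: base[k].astype(float).copy() for k in base}
    for tv in task_vectors:
        for k in merged:
            merged[k] += (lam / len(task_vectors)) * tv[k]
    return merged
```

In practice `lam` is tuned on held-out data; the benchmark's 10 algorithms differ mainly in how they combine and rescale these deltas.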
[347] Toward a Dynamic Stackelberg Game-Theoretic Framework for Agentic AI Defense Against LLM Jailbreaking
Zhengye Han, Quanyan Zhu
Main category: cs.AI
TL;DR: Game theoretic framework models prompt engineer vs LLM interaction as extensive form game with RRT search over prompt space for jailbreak analysis and defense.
Details
Motivation: To provide a principled foundation for understanding and defending against jailbreak attacks on LLMs through game theory, capturing both the discovery of attack strategies and strategic model responses.
Method: Two-player extensive form game coupled with Rapidly exploring Random Trees (RRT) search over prompt space; attacker samples, extends, and tests prompts while LLM chooses accept/reject/redirect; defender behavior analyzed through local Stackelberg equilibrium.
Result: Framework captures jailbreak strategy discovery and model responses; shows defender behavior interpretable through Stackelberg equilibrium; provides theoretical foundation for Purple Agent defense and LLM guardrail evaluation.
Conclusion: Game tree offers principled foundation for evaluating, interpreting, and hardening LLM guardrails against jailbreak attacks through game theoretic analysis.
Abstract: This paper proposes a game theoretic framework that models the interaction between prompt engineers and large language models (LLMs) as a two player extensive form game coupled with a Rapidly exploring Random Trees (RRT) search over prompt space. The attacker incrementally samples, extends, and tests prompts, while the LLM chooses to accept, reject, or redirect, leading to terminal outcomes of Safe Interaction, Blocked, or Jailbreak. Embedding RRT exploration inside the extensive form game captures both the discovery phase of jailbreak strategies and the strategic responses of the model. Furthermore, we show that the defender behavior can be interpreted through a local Stackelberg equilibrium condition, which explains when the attacker can no longer obtain profitable prompt deviations and provides a theoretical lens for understanding the effectiveness of our Purple Agent defense. The resulting game tree thus offers a principled foundation for evaluating, interpreting, and hardening LLM guardrails.
[348] Higher Gauge Flow Models
Alexander Strunk, Roland Assam
Main category: cs.AI
TL;DR: Higher Gauge Flow Models extend Generative Flow Models using L∞-algebra to incorporate higher geometry and symmetries, showing improved performance on Gaussian Mixture Model datasets.
Details
Motivation: To extend the capabilities of Generative Flow Models by incorporating higher geometry and higher symmetries through mathematical structures beyond ordinary Lie algebras, enabling more expressive generative modeling.
Method: Builds upon ordinary Gauge Flow Models by leveraging an L∞-algebra (effectively extending the Lie algebra) to integrate higher geometry and higher symmetries associated with higher groups into the generative flow framework.
Result: Experimental evaluation on Gaussian Mixture Model datasets revealed substantial performance improvements compared to traditional Flow Models.
Conclusion: Higher Gauge Flow Models represent a novel class of generative models that successfully incorporate advanced mathematical structures for improved generative performance.
Abstract: This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L$_{\infty}$-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.
[349] See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
Main category: cs.AI
TL;DR: Proposes State-aware Reasoning (StaR) to improve multimodal agents’ reliability in executing toggle control instructions on GUIs by enabling state perception and reasoning.
Details
Motivation: Multimodal agents struggle with reliably executing toggle control instructions in GUI environments, especially when the current state already matches the desired state, creating a key bottleneck for practical applications.
Method: Constructs a state control benchmark with binary toggle instructions from public datasets, then proposes StaR, a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly.
Result: StaR improves toggle instruction execution accuracy by over 30% on four multimodal agents, enhances general agentic task performance on three public benchmarks, and shows potential for real-world applications in dynamic environments.
Conclusion: State-aware reasoning is crucial for reliable multimodal GUI agents, and StaR effectively addresses toggle control challenges while improving general agentic capabilities.
Abstract: The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions derived from public datasets. Evaluation results of existing agents demonstrate their notable unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code and benchmark: https://github.com/ZrW00/StaR.
[350] Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen
Main category: cs.AI
TL;DR: Theoretical analysis shows RL methods (policy gradient and Q-learning) enhance LLM planning capabilities, with Q-learning offering advantages in off-policy learning and diversity preservation compared to policy gradient’s diversity collapse.
Details
Motivation: While recent RL methods have improved LLM planning, there's limited theoretical understanding of why they work and their limitations. The paper aims to provide theoretical insights into RL's benefits and drawbacks for LLM planning.
Method: Uses a tractable graph-based abstraction to analyze policy gradient and Q-learning methods theoretically, examining exploration, generalization, and diversity preservation. Validates findings on the Blocksworld planning benchmark.
Result: SFT introduces co-occurrence-based spurious solutions, while RL achieves correct planning through exploration. Policy gradient suffers from diversity collapse, but Q-learning preserves diversity through off-policy learning. Reward design is crucial to prevent Q-value bias.
Conclusion: RL enhances LLM planning with Q-learning offering key advantages over policy gradient, but careful reward design is essential. Theoretical insights explain empirical successes and guide practical RL applications for LLMs.
Abstract: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL’s benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration’s role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent Q-value bias in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
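The graph abstraction can be illustrated with tabular Q-learning on a toy planning graph. This is a sketch of the setting, not the paper's analysis: the agent must discover by exploration which branch actually reaches the goal, rather than following a co-occurrence shortcut.

```python
import random

def q_learning(edges, start, goal, episodes=2000, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    # Tabular off-policy Q-learning on a toy graph planning task:
    # reward 1 for the transition into the goal node, 0 elsewhere.
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s, nbrs in edges.items() for a in nbrs}
    for _ in range(episodes):
        s = start
        for _ in range(20):                          # step cap per episode
            if s == goal or not edges.get(s):        # goal or dead end
                break
            nbrs = edges[s]
            # epsilon-greedy behavior policy
            a = rng.choice(nbrs) if rng.random() < eps else max(nbrs, key=lambda n: Q[(s, n)])
            r = 1.0 if a == goal else 0.0
            nxt_best = max((Q[(a, n)] for n in edges.get(a, [])), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * nxt_best - Q[(s, a)])
            s = a
    return Q

def greedy_path(Q, edges, start, goal, max_len=10):
    path, s = [start], start
    while s != goal and len(path) < max_len:
        s = max(edges[s], key=lambda n: Q[(s, n)])
        path.append(s)
    return path
```

On a graph where the first-listed branch dead-ends, the learned greedy policy still recovers the correct path, which is the exploration benefit the paper formalizes.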
[351] MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models
Siqi Ma, Jiajie Huang, Fan Zhang, Yue Shen, Jinlin Wu, Guohui Fan, Zhu Zhang, Zelin Zang
Main category: cs.AI
TL;DR: MedLA: A logic-driven multi-agent framework using LLMs for medical reasoning with explicit logical trees and multi-round consensus building
Details
Motivation: Existing multi-agent approaches for medical QA have limitations: fixed roles or shallow interaction prompts prevent detection and resolution of fine-grained logical inconsistencies in complex medical reasoning.
Method: Logic-driven multi-agent framework where each agent organizes reasoning into explicit logical trees based on syllogistic triads (major premise, minor premise, conclusion). Agents engage in multi-round, graph-guided discussions to compare and refine logic trees through error correction and contradiction resolution.
Result: Consistently outperforms static role-based systems and single-agent baselines on MedDDx and standard medical QA benchmarks. Scales effectively across both open-source and commercial LLM backbones, achieving state-of-the-art performance
Conclusion: MedLA offers a generalizable paradigm for trustworthy medical reasoning through transparent inference and premise-level alignment in multi-agent systems
Abstract: Answering complex medical questions requires not only domain expertise and patient-specific information, but also structured and multi-perspective reasoning. Existing multi-agent approaches often rely on fixed roles or shallow interaction prompts, limiting their ability to detect and resolve fine-grained logical inconsistencies. To address this, we propose \textsc{MedLA}, a logic-driven multi-agent framework built on large language models. Each agent organizes its reasoning process into an explicit logical tree based on syllogistic triads (major premise, minor premise, and conclusion), enabling transparent inference and premise-level alignment. Agents engage in a multi-round, graph-guided discussion to compare and iteratively refine their logic trees, achieving consensus through error correction and contradiction resolution. We demonstrate that \textsc{MedLA} consistently outperforms both static role-based systems and single-agent baselines on challenging benchmarks such as MedDDx and standard medical QA tasks. Furthermore, \textsc{MedLA} scales effectively across both open-source and commercial LLM backbones, achieving state-of-the-art performance and offering a generalizable paradigm for trustworthy medical reasoning.
[352] Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?
Aochong Oliver Li, Tanya Goyal
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.06410 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[353] Reducing Belief Deviation in Reinforcement Learning for Active Reasoning
Deyu Zou, Yongqiang Chen, Jianxiang Wang, Haochen Yang, Mufei Li, James Cheng, Pan Li, Yu Gong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.12264 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[354] Echoing: Identity Failures when LLM Agents Talk to Each Other
Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.09710 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[355] Spilled Energy in Large Language Models
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
Main category: cs.AI
TL;DR: Paper reinterprets LLM softmax classifier as Energy-Based Model to detect hallucinations via training-free energy spill metrics from output logits.
Details
Motivation: Current hallucination detection methods often require trained probe classifiers or activation ablations, which adds complexity and training overhead. The paper aims to develop a principled, training-free approach to detect factual errors, biases, and failures in LLMs by analyzing energy dynamics during decoding.
Method: Reinterpret the final LLM softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs. Introduce two training-free metrics: spilled energy (discrepancy between energy values across consecutive generation steps) and marginalized energy (measurable at a single step). These metrics track “energy spills” that correlate with errors.
Result: Evaluated on nine benchmarks across state-of-the-art LLMs (LLaMA, Mistral, Gemma) and synthetic algebraic operations (Qwen3). Approach demonstrates robust, competitive hallucination detection and cross-task generalization. Results hold for both pretrained and instruction-tuned variants without training overhead.
Conclusion: The EBM reinterpretation provides a principled framework for hallucination detection using training-free energy metrics. Method effectively localizes answer tokens and detects hallucinations without requiring trained probes or activation ablations, offering practical advantages for deployment.
Abstract: We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track “energy spills” during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead. Code available at: github.com/OmnAI-Lab/spilled-energy
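The EBM reading of a softmax head is standard: for logits z, the free energy is the negative log-partition, -log Σ_j exp(z_j). A minimal sketch of per-step energies follows; the spill computation here is an illustrative proxy (the jump in free energy between consecutive decoding steps), not the paper's exact metric.

```python
import numpy as np

def free_energy(logits):
    # Negative log-partition of the softmax, computed stably
    # with the max-shift trick to avoid overflow.
    m = logits.max()
    return -(m + np.log(np.exp(logits - m).sum()))

def spill_trace(step_logits):
    # Illustrative proxy for "spilled energy": absolute change in
    # free energy between consecutive generation steps.
    energies = np.array([free_energy(z) for z in step_logits])
    return np.abs(np.diff(energies))
```

Because both quantities are functions of the output logits alone, such metrics require no probe training and can be logged during ordinary decoding.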
[356] Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.09882 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[357] CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.18857 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[358] MultiSessionCollab: Learning User Preferences with Memory to Improve Long-Term Collaboration
Shuhaib Mehri, Priyanka Kargupta, Tal August, Dilek Hakkani-Tür
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.02702 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[359] Minimal Computational Preconditions for Subjective Perspective in Artificial Agents
Hongju Pae
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.02902 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[360] OpenClaw, Moltbook, and ClawdLab: From Agent-Only Social Networks to Autonomous Scientific Research
Lukas Weidener, Marko Brkić, Mihailo Jovanović, Ritvik Singh, Emre Ulgac, Aakaash Meduri
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.19810 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[361] CIRCLE: A Framework for Evaluating AI from a Real-World Lens
Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi, Matthew Holmes, Fariza Rashid, Maya Carlyle, Afaf Taïk, Kyra Wilson, Peter Douglas, Theodora Skeadas, Gabriella Waters, Rumman Chowdhury, Thiago Lacerda
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.24055 was rate-limited (HTTP 429).
Abstract: Not fetched (HTTP 429 from the arXiv export API).
[362] NeuroHex: Highly-Efficient Hex Coordinate System for Creating World Models to Enable Adaptive AI
Quinn Jacobson, Joe Luo, Jingfei Xu, Shanmuga Venkatachalam, Kevin Wang, Dingchao Rong, John Paul Shen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2603.00376: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00376&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[363] RubricBench: Aligning Model-Generated Rubrics with Human Standards
Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma
Main category: cs.AI
TL;DR: RubricBench: A benchmark for assessing rubric-based evaluation in LLM alignment, featuring 1,147 pairwise comparisons with expert-annotated rubrics to measure evaluation reliability.
Details
Motivation: As LLM alignment evolves toward complex generation, reward models are shifting to rubric-guided evaluation to mitigate surface-level biases, but there is no unified benchmark for assessing this paradigm: existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis.
Method: Constructed RubricBench with 1,147 pairwise comparisons, using a multi-dimensional filtration pipeline to target hard samples with nuanced input complexity and misleading surface bias, each augmented with expert-annotated atomic rubrics derived strictly from the instructions.
Result: Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics: even state-of-the-art models struggle to autonomously specify valid evaluation criteria and lag considerably behind human-guided performance.
Conclusion: RubricBench provides a needed benchmark for assessing rubric-based evaluation in LLM alignment, highlighting the limitations of current models in generating reliable evaluation rubrics compared to human expertise.
Abstract: As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
[364] Loss Barcode: A Topological Measure of Escapability in Loss Landscapes
Serguei Barannikov, Daria Voronkova, Alexander Mironenko, Ilya Trofimov, Alexander Korotin, Grigorii Sotnikov, Evgeny Burnaev
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2012.15834: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2012.15834&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[365] Network Topology Optimization via Deep Reinforcement Learning
Zhuoran Li, Xing Wang, Ling Pan, Lin Zhu, Zhendong Wang, Junlan Feng, Chao Deng, Longbo Huang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2204.14133: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2204.14133&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[366] Diffusion-EXR: Controllable Review Generation for Explainable Recommendation via Diffusion Models
Ling Li, Shaohua Li, June Tay, Huijing Zhan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2312.15490: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.15490&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[367] Leverage Knowledge Graph and Large Language Model for Law Article Recommendation: A Case Study of Chinese Criminal Law
Yongming Chen, Miner Chen, Ye Zhu, Juan Pei, Siyu Chen, Yu Zhou, Yi Wang, Yifan Zhou, Hao Li, Songan Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2410.04949: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.04949&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[368] Covering Numbers for Deep ReLU Networks with Applications to Function Approximation and Nonparametric Regression
Weigutian Ou, Helmut Bölcskei
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2410.06378: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.06378&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[369] AttackSeqBench: Benchmarking the Capabilities of LLMs for Attack Sequences Understanding
Haokai Ma, Javier Yong, Yunshan Ma, Kuei Chen, Anis Yusof, Zhenkai Liang, Ee-Chien Chang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2503.03170: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.03170&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[370] Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation
Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2503.14572: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.14572&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[371] Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models
Weidi Luo, Tianyu Lu, Qiming Zhang, Xiaogeng Liu, Bin Hu, Yue Zhao, Jieyu Zhao, Song Gao, Patrick McDaniel, Zhen Xiang, Chaowei Xiao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2504.19373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.19373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[372] The Gen AI Generation: Student Views of Awareness, Preparedness, and Concern
Micaela Siraj, Jon Duke, Thomas Plötz
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2505.02230: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.02230&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[373] Unsupervised Representation Learning – an Invariant Risk Minimization Perspective
Yotam Norman, Ron Meir
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2505.12506: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12506&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[374] Know When to Abstain: Optimal Selective Classification with Likelihood Ratios
Alvin Heng, Harold Soh
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2505.15008: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.15008&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[375] Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme
Mikhail Persiianov, Jiawei Chen, Petr Mokrov, Alexander Tyurin, Evgeny Burnaev, Alexander Korotin
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2506.01502: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01502&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[376] Self-Improving Loops for Visual Robotic Planning
Calvin Luo, Zilai Zeng, Mingxi Jia, Yilun Du, Chen Sun
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2506.06658: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06658&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[377] Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact
Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji, Molei Tao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2507.08965: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.08965&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[378] Gauge Flow Models
Alexander Strunk, Roland Assam
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2507.13414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.13414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[379] The Lattice Geometry of Neural Network Quantization – A Short Equivalence Proof of GPTQ and Babai’s Algorithm
Johann Birnick
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2508.01077: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.01077&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[380] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2509.07430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.07430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[381] ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2509.12610: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12610&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[382] Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Jinghao Chen, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2509.15927: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.15927&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[383] Best-of-$\infty$ – Asymptotic Performance of Test-Time Compute
Junpei Komiyama, Daisuke Oba, Masafumi Oyamada
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2509.21091: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21091&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[384] Fine-Tuning Diffusion Models via Intermediate Distribution Shaping
Gautham Govind Anil, Shaan Ul Haque, Nithish Kannen, Dheeraj Nagaraj, Sanjay Shakkottai, Karthikeyan Shanmugam
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2510.02692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[385] Every Language Model Has a Forgery-Resistant Signature
Matthew Finlayson, Xiang Ren, Swabha Swayamdipta
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2510.14086: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14086&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[386] xLLM Technical Report
Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2510.14686: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14686&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[387] WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, Han Hu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2510.18560: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18560&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[388] VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus
Chuyue Sun, Yican Sun, Daneshvar Amrollahi, Ethan Zhang, Shuvendu Lahiri, Shan Lu, David Dill, Clark Barrett
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2510.25015: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25015&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[389] FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
Jin Cui, Boran Zhao, Jiajun Xu, Jiaqi Guo, Shuo Guan, Pengju Ren
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2511.19476: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19476&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[390] WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols
Mohammad M Maheri, Xavier Cadet, Peter Chin, Hamed Haddadi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2512.00272: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00272&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[391] Dual Randomized Smoothing: Beyond Global Noise Variance
Chenhao Sun, Yuhao Mao, Martin Vechev
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2512.01782: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01782&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[392] Q-BERT4Rec: Quantized Semantic-ID Representation Learning for Multimodal Recommendation
Haofeng Huang, Ling Gai
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2512.02474: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02474&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[393] Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2512.22420: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22420&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[394] Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?
Yi Qian, Kunwei Qian, Xingbang He, Ligeng Chen, Jikang Zhang, Tiantai Zhang, Haiyang Wei, Linzhang Wang, Hao Wu, Bing Mao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2601.12349: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12349&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[395] Multimodal Multi-Agent Ransomware Analysis Using AutoGen
Asifullah Khan, Aimen Wadood, Mubashar Iqbal, Umme Zahoora
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2601.20346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[396] Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
Alejandro Luque-Cerpa, Mengyuan Wang, Emil Carlsson, Sanjit A. Seshia, Devdatt Dubhashi, Hazem Torfah
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2601.20666: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20666&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[397] Sustainable Materials Discovery in the Era of Artificial Intelligence
Sajid Mannan, Rupert J. Myers, Rohit Batra, Rocio Mercado, Lothar Wondraczek, N. M. Anoop Krishnan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2601.21527: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21527&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[398] Semantic-level Backdoor Attack against Text-to-Image Diffusion Models
Tianxin Chen, Wenbo Jiang, Hongqiao Chen, Zhirun Zheng, Cheng Huang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Details
Motivation: Not available; the abstract could not be retrieved.
Method: Not available; the abstract could not be retrieved.
Result: Not available; the abstract could not be retrieved.
Conclusion: Not available; the abstract could not be retrieved.
Abstract: Failed to fetch summary for 2602.04898: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04898&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[399] Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.17686 returned HTTP 429 (rate limited).
[400] FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
Zhihao Ding, Jinming Li, Ze Lu, Jieming Shi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.23636 returned HTTP 429 (rate limited).
[401] Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parameteric Policies
Xiang Li, Yuheng Zhang, Nan Jiang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.23811 returned HTTP 429 (rate limited).
[402] What Is the Alignment Tax?
Robin Young
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.00047 returned HTTP 429 (rate limited).
[403] Theory of Code Space: Do Code Agents Understand Software Architecture?
Grigory Sapunov
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.00601 returned HTTP 429 (rate limited).
[404] FastCode: Fast and Cost-Efficient Code Understanding and Reasoning
Zhonghang Li, Zongwei Li, Yuxuan Chen, Han Shi, Jiawei Li, Jierun Chen, Haoli Bai, Chao Huang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.01012 returned HTTP 429 (rate limited).
[405] Solving Inverse PDE Problems using Minimization Methods and AI
Noura Al Helwani, Sophie Moufawad, Georges Sakr
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.01731 returned HTTP 429 (rate limited).
[406] Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
Naoki Shitanda, Motoki Omura, Tatsuya Harada, Takayuki Osa
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.01741 returned HTTP 429 (rate limited).
[407] Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Harry Amad, Mihaela van der Schaar
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.01771 returned HTTP 429 (rate limited).
cs.SD
[408] SGPA: Spectrogram-Guided Phonetic Alignment for Feasible Shapley Value Explanations in Multimodal Large Language Models
Paweł Pozorski, Jakub Muszyński, Maria Ganzha
Main category: cs.SD
TL;DR: SGPA enables efficient Shapley value attribution for audio language models by aligning audio segments with words using phonetic alignment and spectral boundary refinement.
Details
Motivation: Explaining audio language models via Shapley value attribution is intractable under native tokenization: a typical utterance yields over 150 encoder frames, individual frames lack standalone meaning, and token boundaries that bisect phonetic transitions introduce masking artifacts.
Method: Introduces Spectrogram-Guided Phonetic Alignment (SGPA), a four-stage pipeline combining Connectionist Temporal Classification forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments.
Result: SGPA yields a 43× reduction in model evaluations on LFM2-Audio-1.5B with VoiceBench and significantly alters attribution concentration while preserving the global cumulative profile, enabling audio explainability.
Conclusion: SGPA establishes a feasibility-enabling layer for audio explainability by making Shapley value attribution tractable for audio language models through proper audio-word alignment.
Abstract: Explaining the behavior of end-to-end audio language models via Shapley value attribution is intractable under native tokenization: a typical utterance yields over $150$ encoder frames, inflating the coalition space by roughly $10^{42}$ relative to text; individual audio frames lack standalone meaning; and token boundaries that bisect phonetic transitions introduce masking artifacts. We introduce Spectrogram-Guided Phonetic Alignment (SGPA), a four-stage pipeline that combines Connectionist Temporal Classification forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments. Controlled diagnostics on LFM2-Audio-1.5B with VoiceBench show that SGPA yields a 43$\times$ reduction in model evaluations. Statistical testing confirms that SGPA significantly alters attribution concentration while preserving the global cumulative profile, establishing it as a feasibility-enabling layer for audio explainability.
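The feasibility argument behind SGPA can be made concrete with a toy sketch (not the paper's implementation): exact Shapley attribution enumerates all coalitions, so it only becomes tractable once the 150+ frames are collapsed into a handful of word-aligned segments. The `value_fn` and its additive weights below are hypothetical stand-ins for model confidence given a subset of unmasked segments.

```python
import itertools
import math

def shapley_exact(value_fn, n_segments):
    """Exact Shapley values via permutation enumeration; cost is
    n! model calls, so it is only viable for a small number of
    word-aligned segments, not raw encoder frames."""
    phi = [0.0] * n_segments
    for perm in itertools.permutations(range(n_segments)):
        included = set()
        prev = value_fn(frozenset(included))
        for seg in perm:
            included.add(seg)
            cur = value_fn(frozenset(included))
            phi[seg] += cur - prev  # marginal contribution of seg
            prev = cur
    n_perms = math.factorial(n_segments)
    return [p / n_perms for p in phi]

# Hypothetical value function: model confidence when only the
# segments in `coalition` are unmasked; a toy additive game so
# the exact attribution is known in advance.
weights = [0.5, 0.3, 0.2]
def value_fn(coalition):
    return sum(weights[i] for i in coalition)

phi = shapley_exact(value_fn, 3)
print(phi)  # for an additive game, Shapley values equal the weights
```

For an additive game each segment's marginal contribution is constant, so the attributions recover the weights exactly; with a real model, the word-level segments keep the coalition space small enough for sampling-based estimators.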
[409] MEBM-Phoneme: Multi-scale Enhanced BrainMagic for End-to-End MEG Phoneme Classification
Liang Jinghua, Zhang Zifeng, Li Songyi, Zheng Linze
Main category: cs.SD
TL;DR: MEBM-Phoneme is a neural decoder for phoneme classification from MEG signals using multi-scale convolutional modules and attention mechanisms to improve temporal modeling and address data challenges.
Details
Motivation: The paper aims to advance MEG-based speech perception analysis by improving phoneme classification from non-invasive brain signals, addressing challenges like class imbalance and session-specific distributional shifts in neural data.
Method: Built on the BrainMagic backbone, integrates a short-term multi-scale convolutional module with the mid-term encoder, uses depthwise separable convolution for cross-scale integration, convolutional attention for temporal dependencies, and employs weighted cross-entropy loss with random temporal augmentation and stacking-based validation.
Result: Achieves competitive phoneme decoding accuracy on LibriBrain Competition 2025 Track2 validation and test leaderboards, demonstrating robust generalization and effective hierarchical temporal modeling.
Conclusion: Hierarchical temporal modeling and training stabilization techniques are valuable for advancing MEG-based speech perception analysis, with MEBM-Phoneme showing promising results for phoneme classification from neural signals.
Abstract: We propose MEBM-Phoneme, a multi-scale enhanced neural decoder for phoneme classification from non-invasive magnetoencephalography (MEG) signals. Built upon the BrainMagic backbone, MEBM-Phoneme integrates a short-term multi-scale convolutional module to augment the native mid-term encoder, with fused representations via depthwise separable convolution for efficient cross-scale integration. A convolutional attention layer dynamically weights temporal dependencies to refine feature aggregation. To address class imbalance and session-specific distributional shifts, we introduce a stacking-based local validation set alongside weighted cross-entropy loss and random temporal augmentation. Comprehensive evaluations on LibriBrain Competition 2025 Track2 demonstrate robust generalization, achieving competitive phoneme decoding accuracy on the validation and official test leaderboard. These results underscore the value of hierarchical temporal modeling and training stabilization for advancing MEG-based speech perception analysis.
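The weighted cross-entropy loss used against class imbalance can be sketched as follows. The summary does not specify the exact weighting scheme, so inverse-frequency weights are assumed here as one common recipe; `class_weights` and `weighted_cross_entropy` are illustrative names, not the authors' code.

```python
import math
from collections import Counter

def class_weights(labels, n_classes):
    """Inverse-frequency weights: rare phoneme classes get larger
    weights so they contribute more to the loss (an assumed recipe;
    the paper does not state its exact scheme)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (n_classes * counts.get(c, 1)) for c in range(n_classes)]

def weighted_cross_entropy(probs, label, weights):
    """-w_y * log p_y for a single example with predicted class
    probabilities `probs` and ground-truth class `label`."""
    return -weights[label] * math.log(probs[label])

labels = [0, 0, 0, 1]            # class 1 is rare in this toy batch
w = class_weights(labels, 2)
print(w)                          # class 1 weighted above class 0
loss_rare = weighted_cross_entropy([0.2, 0.8], 1, w)
loss_common = weighted_cross_entropy([0.8, 0.2], 0, w)
```

Even with identical predictive confidence (0.8 on the true class), the rare-class example incurs the larger loss, which is the intended rebalancing effect.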
[410] MEBM-Speech: Multi-scale Enhanced BrainMagic for Robust MEG Speech Detection
Li Songyi, Zheng Linze, Liang Jinghua, Zhang Zifeng
Main category: cs.SD
TL;DR: MEBM-Speech is a multi-scale neural decoder for speech activity detection from MEG signals, combining convolutional modules, BiLSTM, and depthwise separable convolutions for robust temporal modeling.
Details
Motivation: The paper aims to develop robust speech activity detection from non-invasive MEG signals, which is crucial for cognitive neuroscience and clinical applications requiring fine-grained detection of speech versus silence states.
Method: Built on the BrainMagic backbone, integrates three temporal modeling mechanisms: a multi-scale convolutional module for short-term patterns, a BiLSTM for long-range context, and a depthwise separable convolutional layer for cross-scale feature fusion. Includes lightweight temporal jittering and average pooling for onset robustness.
Result: Achieved average F1 macro of 89.3% on validation set and comparable results on official test leaderboard for LibriBrain Competition 2025 Track1 benchmark.
Conclusion: Multi-scale temporal representation learning is effective for robust MEG-based speech decoding, demonstrating strong performance in speech activity detection from neural signals.
Abstract: We propose MEBM-Speech, a multi-scale enhanced neural decoder for speech activity detection from non-invasive magnetoencephalography (MEG) signals. Built upon the BrainMagic backbone, MEBM-Speech integrates three complementary temporal modeling mechanisms: a multi-scale convolutional module for short-term pattern extraction, a bidirectional LSTM (BiLSTM) for long-range context modeling, and a depthwise separable convolutional layer for efficient cross-scale feature fusion. A lightweight temporal jittering strategy and average pooling further improve onset robustness and boundary stability. The model performs continuous probabilistic decoding of MEG signals, enabling fine-grained detection of speech versus silence states - an ability crucial for both cognitive neuroscience and clinical applications. Comprehensive evaluations on the LibriBrain Competition 2025 Track1 benchmark demonstrate strong performance, achieving an average F1 macro of 89.3% on the validation set and comparable results on the official test leaderboard. These findings highlight the effectiveness of multi-scale temporal representation learning for robust MEG-based speech decoding.
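The reported metric, macro F1 over speech-vs-silence labels, averages per-class F1 without class weighting, so the rare class counts as much as the common one. A minimal sketch of the metric (standard definition, not the authors' evaluation code):

```python
def f1_macro(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores; here 0 = silence
    and 1 = speech, matching the binary detection task."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy frame-level labels: one false alarm and one missed speech frame.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(f1_macro(y_true, y_pred))  # 0.666... (both classes at F1 = 2/3)
```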
[411] When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning
Ruixiang Mao, Xiangnan Ma, Dan Chen, Ziming Zhu, Yuan Ge, Aokai Hao, Haishu Zhao, Yifu Huo, Qing Yang, Kaiyan Chang, Xiaoqian Liu, Chenglong Wang, Qiaozhi He, Tong Xiao, Jingbo Zhu
Main category: cs.SD
TL;DR: MPAR² improves audio reasoning in Large Audio-Language Models by addressing perception decay through dynamic perceptual reasoning and reinforcement learning, achieving significant accuracy gains on audio reasoning benchmarks.
Details
Motivation: Large Audio-Language Models show counterintuitive behavior where post-training for structured reasoning trajectories yields marginal or negative gains compared to direct answering. The authors investigate this phenomenon and identify that LALMs struggle with audio perception during reasoning, with performance decaying as reasoning length increases.
Method: 1) Introduce the CAFE evaluation framework to quantify audio reasoning errors. 2) Propose the MPAR² paradigm, which encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. 3) Use reinforcement learning to train the model to attend to audio input and adapt its reasoning budget to task complexity.
Result: MPAR² improves perception performance on CAFE from 31.74% to 63.51%, effectively mitigates perception decay, and achieves 74.59% accuracy on MMAU benchmark. The method reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.
Conclusion: The paper addresses a critical bottleneck in audio reasoning where perception decays with reasoning length. MPAR² provides an effective solution through dynamic perceptual reasoning and reinforcement learning, significantly improving both perception and reasoning capabilities in Large Audio-Language Models.
Abstract: Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), an unintuitive phenomenon exists: post-training models for structured reasoning trajectories results in marginal or even negative gains compared to post-training for direct answering. To investigate it, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length extends. To address it, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.
[412] Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study
Zijian Yang, Jörg Barkoczi, Ralf Schlüter, Hermann Ney
Main category: cs.SD
TL;DR: Theoretical framework for unsupervised speech recognition with error bounds and proposed single-stage sequence-level cross-entropy loss
Details
Motivation: To understand when and how unsupervised speech recognition can succeed with unpaired data, and to establish theoretical foundations for this task.
Method: Developed a theoretical framework with classification error bounds, introduced two conditions under which unsupervised speech recognition is feasible, validated the bounds in simulations, and proposed a single-stage sequence-level cross-entropy loss.
Result: Established theoretical conditions for unsupervised speech recognition success, derived classification error bounds, and validated them through simulations
Conclusion: Unsupervised speech recognition is theoretically possible under certain conditions, and the proposed sequence-level cross-entropy loss is motivated by the derived error bounds
Abstract: Unsupervised speech recognition is a task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.
[413] When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus
Kirill Borodin, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Grach Mkrtchian
Main category: cs.SD
TL;DR: LRLspoof is a large-scale multilingual synthetic speech corpus for cross-lingual spoof detection, containing 2,732 hours of audio from 24 TTS systems across 66 languages (45 low-resource). It enables evaluation of spoof detection robustness without requiring target-domain bonafide speech.
Details
Motivation: There's a need for comprehensive multilingual synthetic speech datasets to study cross-lingual spoof detection, particularly for low-resource languages. Current benchmarks lack diversity across languages, making it difficult to evaluate how spoof detection models generalize across different linguistic domains.
Method: Created the LRLspoof corpus using 24 open-source TTS systems across 66 languages. Evaluated 11 publicly available countermeasures using threshold transfer: calibrated EER operating points on pooled external benchmarks, then applied the resulting thresholds to report spoof rejection rate (SRR).
Result: Results show model-dependent cross-lingual disparity, with spoof rejection varying significantly across languages even under controlled conditions. This highlights language as an independent source of domain shift in spoof detection.
Conclusion: Language itself introduces domain shift in spoof detection, and cross-lingual robustness varies by model. The publicly available LRLspoof dataset enables better evaluation of spoof detection systems across diverse languages.
Abstract: We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To evaluate robustness without requiring target-domain bonafide speech, we benchmark 11 publicly available countermeasures using threshold transfer: for each model we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR). Results show model-dependent cross-lingual disparity, with spoof rejection varying markedly across languages even under controlled conditions, highlighting language as an independent source of domain shift in spoof detection. The dataset is publicly available at \href{https://huggingface.co/datasets/MTUCI/LRLspoof}{\textbf{\underline{\textit{HuggingFace}}}} and \href{https://modelscope.cn/datasets/lab260/LRLspoof}{\textbf{\underline{\textit{ModelScope}}}}
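The threshold-transfer protocol described above can be sketched in a few lines: calibrate the EER operating point on external calibration scores, then carry that fixed threshold to a new language and report the spoof rejection rate. The scores below are hypothetical, and this assumes higher scores mean "more bonafide-like".

```python
def eer_threshold(bonafide_scores, spoof_scores):
    """Sweep candidate thresholds and return the one where the
    false-accept rate (spoofs accepted) and false-reject rate
    (bonafide rejected) are closest: the EER operating point."""
    candidates = sorted(bonafide_scores + spoof_scores)
    best_t, best_gap = None, float("inf")
    for t in candidates:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def spoof_rejection_rate(spoof_scores, threshold):
    """Fraction of spoofed utterances scored below the transferred
    threshold, i.e. correctly rejected (SRR)."""
    return sum(s < threshold for s in spoof_scores) / len(spoof_scores)

# Hypothetical calibration scores from a pooled external benchmark.
bona = [0.9, 0.8, 0.7, 0.6]
spoof = [0.5, 0.4, 0.3, 0.2]
t = eer_threshold(bona, spoof)
# Apply the same fixed threshold to a new language's spoofed audio.
srr = spoof_rejection_rate([0.65, 0.45, 0.1, 0.2], t)
print(t, srr)
```

The point of the protocol is that no target-language bonafide speech is needed at evaluation time: only the pre-calibrated threshold crosses domains.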
[414] Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement
Szu-Wei Fu, Rong Chao, Xuesong Yang, Sung-Feng Huang, Ryandhimas E. Zezario, Rauf Nasretdinov, Ante Jukić, Yu Tsao, Yu-Chiang Frank Wang
Main category: cs.SD
TL;DR: Universal Speech Enhancement (USE) addresses three key challenges: improving dereverberation targets, optimizing distortion-perception tradeoff, and balancing training data quality vs. quantity, achieving SOTA results with strong generalization.
Details
Motivation: The paper addresses three overlooked problems in Universal Speech Enhancement: suboptimal training targets, the distortion-perception tradeoff, and the quality-quantity tradeoff in training data curation.
Method: 1) Replaces early-reflected speech targets with time-shifted anechoic clean speech; 2) Proposes a two-stage framework guided by distortion-perception tradeoff theory; 3) Analyzes training data scale vs. quality tradeoffs.
Result: Achieves state-of-the-art performance on URGENT 2025 non-blind test set, exhibits strong language-agnostic generalization, and effectively improves TTS training data quality.
Conclusion: Systematically addressing training target selection, distortion-perception optimization, and data curation leads to superior universal speech enhancement with practical applications for TTS systems.
Abstract: Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion–perception tradeoff, and data curation remain unresolved. In this work, we systematically address these three overlooked problems. First, we revisit the conventional practice of using early-reflected speech as the dereverberation target and show that it can degrade perceptual quality and downstream ASR performance. We instead demonstrate that time-shifted anechoic clean speech provides a superior learning target. Second, guided by the distortion–perception tradeoff theory, we propose a simple two-stage framework that achieves minimal distortion under a given level of perceptual quality. Third, we analyze the trade-off between training data scale and quality for USE, revealing that training on large uncurated corpora imposes a performance ceiling, as models struggle to remove subtle artifacts. Our method achieves state-of-the-art performance on the URGENT 2025 non-blind test set and exhibits strong language-agnostic generalization, making it effective for improving TTS training data. Code and models will be released upon acceptance.
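Using time-shifted anechoic speech as the target presupposes finding the shift that aligns the clean reference with the reverberant input. A minimal sketch, assuming integer-sample delays and a simple cross-correlation search (the paper does not describe its alignment procedure; `best_shift` is an illustrative stand-in):

```python
def best_shift(reference, delayed, max_shift=10):
    """Find the integer delay that best aligns an anechoic clean
    signal with its reverberant counterpart, so the time-shifted
    clean speech can serve as the enhancement training target."""
    def corr(shift):
        # Cross-correlation at a given lag (truncated overlap).
        return sum(r * d for r, d in zip(reference, delayed[shift:]))
    return max(range(max_shift + 1), key=corr)

clean = [0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0, 0.0]
# Reverberant input: the clean signal delayed by 2 samples
# plus a weak synthetic reverberation tail.
revb = [0.0, 0.0, 0.0, 1.0, 0.5, -0.3, 0.1, 0.05]
print(best_shift(clean, revb, max_shift=4))  # recovers the 2-sample delay
```

Shifting the anechoic target by the recovered lag keeps it sample-aligned with the input, avoiding the early reflections that the paper argues degrade perceptual quality and ASR.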
[415] Single Microphone Own Voice Detection based on Simulated Transfer Functions for Hearing Aids
Mathuranathan Mayuravaani, W. Bastiaan Kleijn, Andrew Lensen, Charlotte Sørensen
Main category: cs.SD
TL;DR: A simulation-based approach for own voice detection in hearing aids using single microphone, with transformer classifier trained on simulated acoustic transfer functions.
Details
Motivation: Existing own voice detection solutions often require multiple microphones or additional sensors, increasing device complexity and cost. There is a need for ML-based OVD without costly transfer-function measurements.
Method: Data augmentation strategy using simulated acoustic transfer functions (ATFs) to expose the model to varied spatial conditions. A transformer-based classifier is trained on analytically generated ATFs, then progressively fine-tuned using numerically simulated ATFs, from a rigid-sphere model to a detailed head-and-torso model.
Result: 95.52% accuracy on simulated head-and-torso test data, 90.02% accuracy with one-second utterances, and 80% accuracy on real hearing aid recordings without fine-tuning using lightweight test-time feature compensation.
Conclusion: The model demonstrates ability to generalize from simulated to real-world conditions, showing practical viability and pointing toward promising direction for future hearing aid design with single-microphone solutions.
Abstract: This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increasing device complexity and cost. To enable ML-based OVD without requiring costly transfer-function measurements, we propose a data augmentation strategy based on simulated acoustic transfer functions (ATFs) that expose the model to a wide range of spatial propagation conditions. A transformer-based classifier is first trained on analytically generated ATFs and then progressively fine-tuned using numerically simulated ATFs, transitioning from a rigid-sphere model to a detailed head-and-torso representation. This hierarchical adaptation enabled the model to refine its spatial understanding while maintaining generalization. Experimental results show 95.52% accuracy on simulated head-and-torso test data. Under short-duration conditions, the model maintained 90.02% accuracy with one-second utterances. On real hearing aid recordings, the model achieved 80% accuracy without fine-tuning, aided by lightweight test-time feature compensation. This highlights the model’s ability to generalize from simulated to real-world conditions, demonstrating practical viability and pointing toward a promising direction for future hearing aid design.
[416] Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising
Riccardo Rota, Kiril Ratmanski, Jozef Coldenhoff, Milos Cernak
Main category: cs.SD
TL;DR: TVF is a low-latency speech enhancement model combining DSP interpretability with deep learning adaptability, using a neural network to predict coefficients for a 35-band IIR filter cascade in real-time.
Details
Motivation: To bridge the gap between traditional DSP filtering and modern neural speech modeling by creating an interpretable, low-latency speech enhancement model that combines the strengths of both approaches while avoiding "black-box" deep learning solutions.
Method: Uses a lightweight neural network backbone to predict coefficients for a differentiable 35-band IIR filter cascade in real-time, allowing dynamic adaptation to non-stationary noise while maintaining complete interpretability of the processing chain.
Result: Demonstrated efficacy on speech denoising using Valentini-Botinhao dataset, showing effective adaptation to changing noise conditions compared to static DDSP and fully deep-learning-based solutions.
Conclusion: TVF successfully combines DSP interpretability with deep learning adaptability for low-latency speech enhancement, offering explicit and adjustable spectral modifications while maintaining effective noise adaptation.
Abstract: We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. Combining the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning, TVF bridges the gap between traditional filtering and modern neural speech modeling. The model utilizes a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time, allowing it to adapt dynamically to non-stationary noise. Unlike ``black-box’’ deep learning approaches, TVF offers a completely interpretable processing chain, where spectral modifications are explicit and adjustable. We demonstrate the efficacy of this approach on a speech denoising task using the Valentini-Botinhao dataset and compare the results to a static DDSP approach and a fully deep-learning-based solution, showing that TVF achieves effective adaptation to changing noise conditions.
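The core mechanism, an IIR filter whose coefficients change per frame while the filter state carries across frame boundaries, can be sketched with a single biquad band (TVF cascades 35 such bands and predicts the coefficients with a network; both the coefficient values and the function name here are illustrative):

```python
def biquad_tv(x, coeff_per_frame, frame_len):
    """Run one time-varying biquad in direct form: each frame uses
    a new (b0, b1, b2, a1, a2) set, as a network would predict,
    while the filter state persists across frames."""
    y = []
    x1 = x2 = y1 = y2 = 0.0   # delay-line state, carried over frames
    for n, xn in enumerate(x):
        b0, b1, b2, a1, a2 = coeff_per_frame[n // frame_len]
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

# Two frames: an identity filter, then a simple 6 dB attenuation.
coeffs = [(1.0, 0.0, 0.0, 0.0, 0.0), (0.5, 0.0, 0.0, 0.0, 0.0)]
x = [1.0, 1.0, 1.0, 1.0]
print(biquad_tv(x, coeffs, frame_len=2))  # [1.0, 1.0, 0.5, 0.5]
```

Because every operation is differentiable in the coefficients, gradients can flow from a denoising loss back into the coefficient-predicting network, which is what makes the chain trainable yet fully inspectable.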
[417] An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
Epshita Jahan, Khandoker Md Tanjinul Islam, Pritom Biswas, Tafsir Al Nafin
Main category: cs.SD
TL;DR: Fine-tuned Whisper Medium for Bengali long-form ASR and integrated pyannote diarization with custom segmentation to achieve competitive WER (0.38) and DER (0.27) on noisy hour-long recordings.
Details
Motivation: Bengali is a low-resource language in speech technology, particularly for complex tasks like long-form transcription and speaker diarization. The paper addresses the challenge of "who spoke when/what" in hour-long recordings to improve AI inclusivity for South Asian languages.Method: Multistage approach: 1) Fine-tuned Whisper Medium on Bengali data for transcription, 2) Integrated pyannote/speaker-diarization-community-1 with custom-trained segmentation model, 3) Used two-pass method with hyperparameter tuning, chunking, background noise cleaning, and algorithmic post-processing.
Result: Achieved DER of 0.27 on private leaderboard (0.19 on public) for diarization, and WER of 0.38 on private leaderboard for transcription. Results show significant improvement through targeted tuning and strategic data utilization.
Conclusion: Targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. The approach demonstrates effective handling of diverse and noisy acoustic environments in low-resource language settings.
Abstract: Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the “DL Sprint 4.0 - Bengali Long-Form Speech Recognition” and “DL Sprint 4.0 - Bengali Speaker Diarization” competitions on Kaggle, addressing the challenge of “who spoke when/what” in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi bengaliai-asr whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic post-processing yielded a WER of 0.38 on the private leaderboard. These results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. All relevant code is available at: https://github.com/Short-Potatoes/Bengali-long-form-transcription-and-diarization.git Index Terms: Bengali speech recognition, speaker diarization, Whisper, ASR, low-resource languages, pyannote, voice activity detection
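The transcription metric reported above, word error rate, is the word-level Levenshtein distance (substitutions, insertions, deletions) normalized by reference length. A minimal reference implementation (standard definition, not the competition's scoring code):

```python
def wer(ref, hyp):
    """Word error rate: edit distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i               # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j               # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# Two deleted words out of six: WER = 2/6.
print(wer("the cat sat on the mat", "the cat sat mat"))
```

DER, the diarization metric, is analogous but time-based: the fraction of audio attributed to the wrong speaker, missed, or falsely detected.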
[418] On Adversarial Attacks In Acoustic Drone Localization
Tamir Shor, Chaim Baskin, Alex Bronstein
Main category: cs.SD
TL;DR: Analysis of PGD adversarial attacks on acoustic-based drone localization systems and development of a perturbation recovery algorithm
Details
Motivation: Drones rely increasingly on acoustic sensing for navigation due to visual limitations, but adversarial attacks on acoustic localization systems remain unexplored, posing security risks for mission-critical applications.
Method: Comprehensive analysis of Projected Gradient Descent (PGD) adversarial attacks on acoustic drone localization systems, plus development of an adversarial perturbation recovery algorithm to mitigate attack effects.
Result: Demonstrates vulnerability of acoustic-based drone localization to PGD attacks and shows that the proposed recovery algorithm can significantly reduce the impact of such adversarial perturbations.
Conclusion: Acoustic drone localization systems are vulnerable to adversarial attacks, requiring robust defense mechanisms like the proposed perturbation recovery algorithm for secure deployment in real-world applications.
Abstract: Multi-rotor aerial autonomous vehicles (MAVs, more widely known as “drones”) have been generating increased interest in recent years due to their growing applicability in a vast and diverse range of fields (e.g., agriculture, commercial delivery, search and rescue). The sensitivity of visual-based methods to lighting conditions and occlusions has prompted growing study of navigation reliant on other modalities, such as acoustic sensing. A major concern in using drones at scale for tasks in non-controlled environments is the potential threat of adversarial attacks on their navigational systems, exposing users to mission-critical failures, security breaches, and compromised safety outcomes that can endanger operators and bystanders. While previous work shows impressive progress in acoustic-based drone localization, prior research on adversarial attacks against drone navigation only addresses visual sensing-based systems. In this work, we aim to close this gap by supplying a comprehensive analysis of the effect of PGD adversarial attacks on acoustic drone localization. We furthermore develop an algorithm for adversarial perturbation recovery, capable of markedly diminishing the effect of such attacks in our setting.
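The attack analyzed here follows the standard projected-gradient recipe: repeated signed-gradient ascent on the localization loss, projected back into an L-infinity ball. A toy numpy sketch against a hypothetical linear localizer (the localizer, loss, and hyperparameters are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def pgd_attack(x, w, target_loc, eps=0.1, alpha=0.02, steps=20, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)   # random start
    for _ in range(steps):
        # loss = (w @ x_adv - target_loc)**2; analytic gradient wrt x_adv
        grad = 2.0 * (w @ x_adv - target_loc) * w
        x_adv = x_adv + alpha * np.sign(grad)          # gradient *ascent* step
        x_adv = np.clip(x_adv, x - eps, x + eps)       # project to L-inf ball
    return x_adv

rng = np.random.default_rng(0)
x, w = rng.normal(size=8), rng.normal(size=8)
x_adv = pgd_attack(x, w, target_loc=w @ x)
print((w @ x_adv - w @ x) ** 2)   # localization error induced by the attack
```

The perturbation stays within the eps budget by construction, which is what makes such attacks hard to notice in the raw audio.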
[419] PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun, Rongjie Huang, Xiangang Li, Jieping Ye, Wei Xue
Main category: cs.SD
TL;DR: PrismAudio is a novel Video-to-Audio generation framework that uses specialized Chain-of-Thought modules with targeted RL rewards to address four perceptual dimensions, achieving SOTA performance with efficient training via Fast-GRPO.
Details
Motivation: Existing V2A methods suffer from objective entanglement (conflating competing goals in single loss functions) and lack human preference alignment, failing to properly balance the four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy.
Method: Introduces PrismAudio framework with four specialized Chain-of-Thought modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions for multidimensional RL optimization. Proposes Fast-GRPO with hybrid ODE-SDE sampling for efficient training, and AudioCanvas benchmark for evaluation.
Result: Achieves state-of-the-art performance across all four perceptual dimensions on both in-domain VGGSound test set and out-of-domain AudioCanvas benchmark, demonstrating superior audio generation quality and alignment.
Conclusion: PrismAudio successfully addresses objective entanglement in V2A generation through specialized CoT planning with targeted RL rewards, enabling balanced optimization across multiple perceptual dimensions while maintaining computational efficiency.
Abstract: Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio.github.io.
[420] Diffusion-based Symbolic Music Generation with Structured State Space Models
Shenghua Yuan, Xing Tang, Jiatao Chen, Tianming Xie, Jing Wang, Bing Shi
Main category: cs.SD
TL;DR: SMDIM is a diffusion-based architecture for symbolic music generation that integrates Structured State Space Models (Mamba) for efficient global context modeling with attention mechanisms for local detail preservation, achieving near-linear complexity for long sequences.
Details
Motivation: Current transformer-based approaches for symbolic music generation suffer from quadratic computational complexity, limiting scalability for long sequences. There's a need for more efficient architectures that can handle long musical sequences while maintaining generation quality.
Method: Proposes Symbolic Music Diffusion with Mamba (SMDIM), which combines Structured State Space Models (SSMs/Mamba) for efficient global context modeling with a Mamba-FeedForward-Attention Block (MFA) that integrates linear-complexity Mamba layers, non-linear FeedForward layers, and self-attention mechanisms for local detail preservation.
Result: SMDIM achieves near-linear complexity, outperforms state-of-the-art models in both generation quality and computational efficiency on diverse datasets including FolkDB (traditional Chinese folk music), and demonstrates adaptability to broad long-sequence generation tasks.
Conclusion: SMDIM offers a scalable and efficient solution for coherent sequence modeling in symbolic music generation, with potential applications to other long-sequence generation tasks beyond music.
Abstract: Recent advancements in diffusion models have significantly improved symbolic music generation. However, most approaches rely on transformer-based architectures with self-attention mechanisms, which are constrained by quadratic computational complexity, limiting scalability for long sequences. To address this, we propose Symbolic Music Diffusion with Mamba (SMDIM), a novel diffusion-based architecture integrating Structured State Space Models (SSMs) for efficient global context modeling and the Mamba-FeedForward-Attention Block (MFA) for precise local detail preservation. The MFA Block combines the linear complexity of Mamba layers, the non-linear refinement of FeedForward layers, and the fine-grained precision of self-attention mechanisms, achieving a balance between scalability and musical expressiveness. SMDIM achieves near-linear complexity, making it highly efficient for long-sequence tasks. Evaluated on diverse datasets, including FolkDB, a collection of traditional Chinese folk music that represents an underexplored domain in symbolic music generation, SMDIM outperforms state-of-the-art models in both generation quality and computational efficiency. Beyond symbolic music, SMDIM’s architectural design demonstrates adaptability to a broad range of long-sequence generation tasks, offering a scalable and efficient solution for coherent sequence modeling.
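The near-linear complexity claim rests on SSM layers computing a linear recurrence in one pass over the sequence, instead of attention's all-pairs comparison. A minimal sketch of such a scan (illustrative of the general SSM recurrence, not SMDIM's Mamba implementation):

```python
import numpy as np

# Linear state-space scan: h_t = A h_{t-1} + B x_t,  y_t = C h_t.
# One pass over the sequence, so cost grows linearly with length L.

def ssm_scan(A, B, C, xs):
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in xs:            # O(L) loop; no L x L attention matrix
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

A = np.array([[0.9, 0.0], [0.1, 0.8]])   # assumed stable transition matrix
B = np.array([1.0, 0.0])
C = np.array([0.0, 1.0])
y = ssm_scan(A, B, C, [1.0, 0.0, 0.0, 0.0])  # impulse response
print(y)
```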
[421] AI-Generated Music Detection in Broadcast Monitoring
David López-Ayala, Asier Cabello, Pablo Zinemanas, Emilio Molina, Martín Rocamora
Main category: cs.SD
TL;DR: AI-OpenBMAT dataset for broadcast-style AI music detection, addressing challenges of short excerpts and speech masking in broadcast audio.
Details
Motivation: Existing AI music detection methods work well for clean, full-length tracks in streaming contexts but fail in broadcast audio where music appears as short excerpts often masked by dominant speech.
Method: Created AI-OpenBMAT dataset with 3,294 one-minute audio excerpts following real TV audio patterns, combining human production music with AI-generated continuations. Benchmarked CNN baseline and SpectTTTra models for SNR and duration robustness.
Result: Models that perform well in streaming scenarios suffer substantial degradation in broadcast settings, with F1-scores dropping below 60% when music is in background or has short duration.
Conclusion: Speech masking and short music length are critical challenges for AI music detection, and AI-OpenBMAT serves as a benchmark for developing detectors that meet industrial broadcast requirements.
Abstract: AI music generators have advanced to the point where their outputs are often indistinguishable from human compositions. While detection methods have emerged, they are typically designed and validated in music streaming contexts with clean, full-length tracks. Broadcast audio, however, poses a different challenge: music appears as short excerpts, often masked by dominant speech, conditions under which existing detectors fail. In this work, we introduce AI-OpenBMAT, the first dataset tailored to broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. We benchmark a CNN baseline and state-of-the-art SpectTTTra models to assess SNR and duration robustness, and evaluate on a full broadcast scenario. Across all settings, models that excel in streaming scenarios suffer substantial degradation, with F1-scores dropping below 60% when music is in the background or has a short duration. These results highlight speech masking and short music length as critical open challenges for AI music detection, and position AI-OpenBMAT as a benchmark for developing detectors capable of meeting industrial broadcast requirements.
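The SNR robustness conditions the benchmark varies amount to mixing music under speech at a controlled level. A small sketch of SNR-controlled mixing (signal names and the 10 dB value are illustrative, not the dataset's construction code):

```python
import numpy as np

def mix_at_snr(speech, music, snr_db):
    """Scale `music` so the speech-to-music power ratio equals snr_db."""
    p_speech = np.mean(speech ** 2)
    p_music = np.mean(music ** 2)
    # gain chosen so that p_speech / (gain^2 * p_music) = 10^(snr_db/10)
    gain = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
    return speech + gain * music

rng = np.random.default_rng(0)
speech, music = rng.normal(size=16000), rng.normal(size=16000)
mixed = mix_at_snr(speech, music, snr_db=10.0)  # music sits 10 dB below speech
```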
[422] UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
Qiangong Zhou, Nagasaka Tomohiro
Main category: cs.SD
TL;DR: A unified model merging TTS and audio-to-face (A2F) systems to enable internal feature transfer for improved audio-facial expression consistency from text input.
Details
Motivation: To improve consistency between generated audio and facial expressions by creating a unified model that allows internal feature transfer between TTS and A2F components, rather than treating them as separate systems.
Method: Merges independent TTS and A2F models into a unified architecture to enable internal feature transfer, and extends emotion control mechanisms from TTS to the joint model.
Result: Validates feasibility of reusing TTS intermediate representations for joint speech and facial expression modeling, providing engineering practice references for speech expression co-design.
Conclusion: Demonstrates that unified modeling with internal feature transfer is feasible for improving audio-facial consistency, offering practical insights for multimodal generation system design.
Abstract: This work considers merging two independent models, TTS and A2F, into a unified model to enable internal feature transfer, thereby improving the consistency between audio and facial expressions generated from text. We also discuss the extension of the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and provides engineering practice references for subsequent speech-expression co-design. The project code has been open-sourced at: https://github.com/GoldenFishes/UniTAF
[423] CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
Bowen Zhang, Junchuan Zhao, Ian McLoughlin, Ye Wang, A S Madhukumar
Main category: cs.SD
TL;DR: CodecFlow: A neural codec-based speech bandwidth extension framework using voicing-aware conditional flow conversion and structure-constrained residual vector quantization for efficient high-fidelity speech reconstruction.
Details
Motivation: Existing speech bandwidth extension methods using spectrogram or waveform modeling have high computational costs and limited high-frequency fidelity. Neural audio codecs offer compact latent representations but face challenges in accurately recovering high-resolution latent information due to representation mismatch.
Method: CodecFlow uses a neural codec-based BWE framework with: 1) voicing-aware conditional flow converter on continuous codec embeddings, and 2) structure-constrained residual vector quantizer to improve latent alignment stability. The system is optimized end-to-end.
Result: CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks, demonstrating efficient speech reconstruction in compact latent space.
Conclusion: CodecFlow provides an effective framework for speech bandwidth extension using neural codec representations, addressing computational efficiency and high-frequency fidelity limitations of existing methods.
Abstract: Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. Neural audio codecs offer compact latent representations that better preserve acoustic detail, yet accurately recovering high-resolution latent information remains challenging due to representation mismatch. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space. CodecFlow employs a voicing-aware conditional flow converter on continuous codec embeddings and a structure-constrained residual vector quantizer to improve latent alignment stability. Optimized end-to-end, CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks.
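Conditional flow matching, the general technique behind CodecFlow's converter, trains a network to regress the velocity of a straight-line path between source and target latents. A minimal sketch of the training target (the general recipe, not CodecFlow's exact parameterization; the latent values are illustrative):

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Flow-matching training pair on the straight-line path x0 -> x1."""
    x_t = (1.0 - t) * x0 + t * x1   # interpolated point at time t
    v_target = x1 - x0              # constant velocity along that path
    return x_t, v_target

x0 = np.array([0.0, 0.0])           # e.g. narrowband codec latent
x1 = np.array([2.0, -2.0])          # e.g. wideband codec latent
x_t, v = cfm_pair(x0, x1, t=0.25)
print(x_t, v)
```

At inference one integrates the learned velocity field from the narrowband latent toward a wideband one, which is what lets the method stay in the compact codec space.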
[424] VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures
Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Main category: cs.SD
TL;DR: VoiceAgentRAG is an open-source dual-agent system that separates retrieval from response generation using a slow thinker agent for background topic prediction and document pre-fetching, and a fast talker agent that reads from a semantic cache for sub-millisecond responses.
Details
Motivation: Voice-based conversational AI systems need lower response latency; decoupling the slow retrieval process from fast response generation enables real-time voice interactions without delays from vector database queries.
Method: The method uses a dual-agent architecture: 1) A background “Slow Thinker” agent continuously monitors the conversation, predicts follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. 2) A foreground “Fast Talker” agent reads only from this cache, bypassing the vector database entirely on cache hits for sub-millisecond response times.
Result: The system achieves sub-millisecond response times by eliminating vector database queries during cache hits, enabling real-time voice interactions with reduced latency compared to traditional RAG systems.
Conclusion: VoiceAgentRAG demonstrates that decoupling retrieval from generation through dual-agent architecture with predictive caching can significantly reduce latency in voice-based conversational AI systems, making real-time voice interactions more feasible.
Abstract: We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
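The cache-hit path above can be sketched as a small semantic cache: the Slow Thinker pre-fetches chunks keyed by embeddings, and the Fast Talker answers from the nearest cached entry. The class, method names, and similarity threshold below are illustrative, not the released API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []           # list of (embedding, chunk)
        self.threshold = threshold

    def prefetch(self, embedding, chunk):      # called by the Slow Thinker
        self.entries.append((embedding, chunk))

    def lookup(self, query_embedding):         # called by the Fast Talker
        best = max(self.entries,
                   key=lambda e: cosine(e[0], query_embedding),
                   default=None)
        if best and cosine(best[0], query_embedding) >= self.threshold:
            return best[1]                     # cache hit: no DB round-trip
        return None                            # miss: fall back to retrieval

cache = SemanticCache()
cache.prefetch([1.0, 0.0], "refund policy chunk")
print(cache.lookup([0.9, 0.1]))   # near-duplicate query -> cache hit
```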
cs.LG
[425] RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning
Ran Li, Shimin Di, Haowei LI, Luanshi Bu, Jiachuan Wang, Wangze Ni, Lei Chen
Main category: cs.LG
TL;DR: RxnNano is a compact 0.5B-parameter chemical reaction prediction model that prioritizes chemical understanding over scale, achieving state-of-the-art performance through latent chemical consistency, hierarchical cognitive curriculum, and atom-map permutation invariance.
Details
Motivation: Current chemical reaction prediction models overemphasize parameter and dataset scaling while failing to capture fundamental chemical intuition like reaction common sense and topological atom mapping logic. The core challenge is instilling deep chemical knowledge into models rather than just scaling them up.
Method: Proposes a unified framework with four key innovations: (1) Latent Chemical Consistency objective modeling reactions as movements on continuous chemical manifold; (2) Hierarchical Cognitive Curriculum training through progressive stages from syntax to semantic reasoning; (3) Atom-Map Permutation Invariance (AMPI) for learning invariant relational topology; (4) Structured plan-based reasoning to improve LLM performance.
Result: The compact 0.5B-parameter RxnNano significantly outperforms fine-tuned LLMs ten times larger (>7B) and all domain baselines, achieving 23.5% Top-1 accuracy improvement on rigorous benchmarks without test-time augmentation.
Conclusion: Chemical understanding should be prioritized over scale in reaction prediction models. The proposed framework successfully instills chemical intuition into compact models, achieving superior performance with significantly fewer parameters.
Abstract: Chemical reaction prediction is pivotal for accelerating drug discovery and synthesis planning. Despite advances in data-driven models, current approaches are hindered by an overemphasis on parameter and dataset scaling, coupled with evaluation techniques that bypass fundamental challenges in reaction representation, and they fail to capture deep chemical intuition such as reaction common sense and topological atom-mapping logic. We argue that the core challenge lies in instilling this knowledge into the models. To this end, we propose a unified framework that prioritizes chemical understanding over scale through four key innovations: (1) a Latent Chemical Consistency objective that models reactions as movements on a continuous chemical manifold, ensuring reversible and physically plausible transformations; (2) a Hierarchical Cognitive Curriculum that trains the model through progressive stages, from syntax mastery to semantic reasoning, building robust chemical intuition; (3) Atom-Map Permutation Invariance (AMPI), which forces the model to learn invariant relational topology and balance multi-task learning; and (4) structured plan-based reasoning to improve the performance of the LLMs. Our compact 0.5B-parameter model, RxnNano, significantly outperforms fine-tuned LLMs ten times larger (>7B) and all the domain baselines, achieving a 23.5% Top-1 accuracy improvement on rigorous benchmarks without test-time augmentation. https://github.com/rlisml/RxnNano.
[426] ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
Ruike Cao, Shaojie Bai, Fugen Yao, Liang Dong, Jian Xu, Li Xiao
Main category: cs.LG
TL;DR: ATPO: Uncertainty-aware Adaptive Tree Policy Optimization for multi-turn medical dialogue LLMs, using hierarchical MDP formulation with adaptive rollout budget allocation to high-uncertainty states for better value estimation and exploration.
Details
Motivation: Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis with incomplete information. Aligning LLMs for interactive scenarios is challenging due to uncertainty in user-agent interactions, formulated as a Hierarchical Markov Decision Process (H-MDP). Conventional RL methods struggle with long-horizon credit assignment and unstable value estimation.
Method: Proposes ATPO algorithm that adaptively allocates rollout budget to states with high uncertainty, quantified by composite metric of Bellman error and action-value variance. Includes uncertainty-guided pruning to minimize rollouts and asynchronous search architecture with KV cache reuse for computational efficiency.
Result: Extensive experiments on three public medical dialogue benchmarks show significant outperformance over strong baselines, with Qwen3-8B model surpassing much larger GPT-4o (+0.92% accuracy).
Conclusion: ATPO enables more accurate value estimation and efficient diverse exploration for medical dialogue LLMs, achieving state-of-the-art performance with computational optimizations for tree-based RL.
Abstract: Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation, while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in the Qwen3-8B model surpassing the much larger GPT-4o (+0.92% accuracy).
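The adaptive-budget idea reduces to giving each state a share of rollouts proportional to its uncertainty score. A tiny sketch (the paper's score combines Bellman error and action-value variance; the scores, budget, and floor below are illustrative):

```python
def allocate_rollouts(uncertainties, total_budget, min_per_state=1):
    """Split a rollout budget proportionally to per-state uncertainty."""
    total = sum(uncertainties) or 1.0  # guard against an all-zero score list
    # proportional share, with a floor so every state is still explored
    return [max(min_per_state, round(total_budget * u / total))
            for u in uncertainties]

# Three dialogue states: one clearly ambiguous, two near-settled.
print(allocate_rollouts([0.8, 0.1, 0.1], total_budget=10))  # [8, 1, 1]
```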
[427] Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
Sieun Hyeon, Jaeyoung Do
Main category: cs.LG
TL;DR: Router Knowledge Distillation (Router KD) addresses post-compression degradation in Mixture-of-Experts models by calibrating the router to match changed experts, without retraining expert parameters.
Details
Motivation: MoE models have deployment memory bottlenecks due to massive parameters. Existing retraining-free compression methods suffer from router-expert mismatch when experts are changed but routers remain unchanged, causing persistent performance degradation.
Method: Proposes Router Knowledge Distillation (Router KD) that updates only the router parameters by distilling the original model’s next-token distribution on unlabeled calibration data, avoiding expert parameter updates.
Result: Router KD consistently recovers performance across Expert Pruning, Expert Editing, and Expert Merging paradigms, with larger gains in fine-grained MoEs due to their more complex routing decision boundaries.
Conclusion: Effective retraining-free MoE compression should avoid updating expert parameters while allowing lightweight router calibration via knowledge distillation to address router-expert mismatch.
Abstract: Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Expert Merging - and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model’s next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained MoEs due to their more complex routing decision boundaries.
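Router KD can be sketched as a plain KL distillation loss over an output distribution, with only the router parameters receiving gradients. A toy numpy version under simplified assumptions (the logit values are illustrative, and real Router KD distills the full next-token distribution, not a 3-way toy):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def router_kd_loss(teacher_probs, student_logits):
    """KL(teacher || student): zero iff the student matches the teacher."""
    student_probs = softmax(student_logits)
    return float(np.sum(teacher_probs * np.log(teacher_probs / student_probs)))

teacher = softmax(np.array([2.0, 0.5, -1.0]))       # frozen original model
matched = router_kd_loss(teacher, np.array([2.0, 0.5, -1.0]))
shifted = router_kd_loss(teacher, np.array([-1.0, 0.5, 2.0]))
print(matched, shifted)   # matching logits give ~0 loss; shifted ones don't
```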
[428] Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
Wei Liu, Siya Qi, Yali Du, Yulan He
Main category: cs.LG
TL;DR: The paper proposes a framework for sustainable self-evolving LLMs using triadic roles (Proposer, Solver, Verifier) and three system designs to ensure increasing learnable information across iterations.
Details
Motivation: Current self-evolving LLM systems often plateau quickly because they synthesize more data without increasing learnable information. The paper aims to address this failure mode by creating sustainable self-evolution loops.
Method: Proposes a triadic role framework: Proposer generates tasks, Solver attempts solutions, Verifier provides training signals. Introduces three system designs: 1) Asymmetric co-evolution (weak-to-strong-to-weak loop), 2) Capacity growth (expanding parameters to match information), 3) Proactive information seeking (external context to prevent saturation).
Result: Through experiments on self-play coding tasks, the framework demonstrates measurable progress from brittle self-play dynamics to sustained self-evolution by ensuring increasing learnable information across iterations.
Conclusion: Sustainable self-evolution requires systematic designs that ensure learnable information increases across iterations, moving beyond simple self-play to create truly self-improving systems.
Abstract: Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
[429] NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
Junfeng Fang, Nachuan Chen, Houcheng Jiang, Dan Zhang, Fei Shen, Xiang Wang, Xiangnan He, Tat-Seng Chua
Main category: cs.LG
TL;DR: NExT-Guard is a training-free framework for real-time streaming safety in LLMs that uses Sparse Autoencoders to monitor latent risk signals without token-level supervision.
Details
Motivation: Conventional post-hoc safety measures fail in streaming scenarios where content needs to be intercepted in real-time. Existing streaming safeguards require expensive token-level annotations and suffer from overfitting, creating a need for a more efficient approach.
Method: The framework leverages pretrained Sparse Autoencoders (SAEs) from base LLMs to monitor interpretable latent features. It extracts token-level risk signals from hidden representations without requiring additional training or token-level supervision, enabling real-time safety monitoring during streaming.
Result: NExT-Guard outperforms both post-hoc safeguards and supervised streaming safeguards across different models, SAE variants, and risk scenarios. It demonstrates superior robustness while being training-free and low-cost.
Conclusion: The work shows that streaming safety is an inherent capability of well-trained post-hoc safeguards, and NExT-Guard provides a universal, scalable paradigm for real-time safety that accelerates practical deployment of streaming safeguards without token-level supervision.
Abstract: Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective as they fail to interdict unsafe content in real-time. While streaming safeguards based on token-level supervised training could address this, they necessitate expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, it is an inherent capability of well-trained post-hoc safeguards, as they already encode token-level risk signals in hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.
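Monitoring SAE features for risk signals can be sketched as encoding a hidden state into sparse activations and thresholding a designated feature. The weights, risk index, and threshold below are placeholders, not real SAE parameters:

```python
import numpy as np

def sae_features(h, W_e, b_e):
    """Encode a hidden state into sparse features: ReLU(W_e h + b_e)."""
    return np.maximum(0.0, W_e @ h + b_e)

def is_risky(h, W_e, b_e, risk_idx, threshold=0.5):
    # flag the token when the designated "risk" feature fires strongly
    return sae_features(h, W_e, b_e)[risk_idx] > threshold

W_e = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])  # toy encoder weights
b_e = np.zeros(3)
safe_h, risky_h = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(is_risky(safe_h, W_e, b_e, risk_idx=2))   # feature 2 stays silent
print(is_risky(risky_h, W_e, b_e, risk_idx=2))  # feature 2 fires
```

Because the check is a single matrix multiply and threshold per token, it can run during generation without adding a training stage, which is the point of the training-free design.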
[430] Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting
Yixin Wang, Yifan Hu, Peiyuan Liu, Naiqi Li, Dai Tao, Shu-Tao Xia
Main category: cs.LG
TL;DR: TimeGS is a novel time series forecasting framework that shifts from regression to 2D generative rendering using adaptive Gaussian kernels to model complex temporal patterns while maintaining chronological continuity.
Details
Motivation: Current time series forecasting methods that reshape 1D sequences into 2D representations suffer from topological mismatches (spatial operators sever chronological continuity) and inefficient modeling capacity (uniform fixed-size representations fail to adapt to compressible, non-stationary patterns).
Method: TimeGS reconceptualizes future sequences as continuous latent surfaces and uses anisotropic Gaussian kernels to adaptively model variations. It introduces two key blocks: Multi-Basis Gaussian Kernel Generation (MB-GKG) synthesizes kernels from a fixed dictionary for stable optimization, and Multi-Period Chronologically Continuous Rasterization (MP-CCR) enforces strict temporal continuity across periodic boundaries.
Result: Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art performance in time series forecasting.
Conclusion: TimeGS successfully addresses limitations of existing 2D representation methods by shifting to a generative rendering paradigm that maintains chronological continuity and provides adaptive resolution for complex temporal patterns.
Abstract: Time series forecasting (TSF) remains a challenging problem due to the intricate entanglement of intraperiod fluctuations and interperiod trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a continuous latent surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art performance.
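The rendering idea, summing anisotropic 2D Gaussian kernels over a period-phase grid and then unrolling row-major to restore chronological order, can be sketched in a few lines. The kernel parameters below are illustrative, not outputs of the MB-GKG block:

```python
import numpy as np

def gaussian_2d(grid_y, grid_x, mu, cov_inv, amp):
    """Evaluate one anisotropic 2D Gaussian kernel on a period-phase grid."""
    d = np.stack([grid_y - mu[0], grid_x - mu[1]], axis=-1)
    m = np.einsum('...i,ij,...j->...', d, cov_inv, d)  # Mahalanobis distance
    return amp * np.exp(-0.5 * m)

P, L = 4, 8  # periods x phases; rows render consecutive periods
ys, xs = np.mgrid[0:P, 0:L].astype(float)

# Illustrative kernels: (center, inverse covariance, amplitude). The tilted
# covariance in the first kernel is the anisotropy the paper exploits.
kernels = [
    (np.array([1.0, 2.0]), np.linalg.inv(np.array([[1.0, 0.4], [0.4, 2.0]])), 1.5),
    (np.array([2.5, 6.0]), np.linalg.inv(np.array([[0.5, 0.0], [0.0, 0.5]])), -0.8),
]

surface = sum(gaussian_2d(ys, xs, mu, ci, a) for mu, ci, a in kernels)
forecast = surface.reshape(-1)  # row-major unroll restores chronological order
```

The MP-CCR block additionally enforces continuity across the period boundaries that this naive unroll would leave unconstrained.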
[431] MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction
Zizheng Zhang, Yiming Li, Justin Xu, Jinyu Wang, Rui Wang, Lei Song, Jiang Bian, David W Eyre, Jingjing Fu
Main category: cs.LG
TL;DR: MedFeat: A feedback-driven, model-aware feature engineering framework that uses LLMs with domain knowledge and SHAP explanations to improve healthcare tabular predictions by discovering clinically meaningful features.
Details
Motivation: Classical models with feature engineering often outperform neural approaches in healthcare tabular predictions. While LLMs can integrate domain knowledge into feature engineering, existing approaches use broad search over predefined transformations without considering downstream model characteristics and feature importance signals.
Method: MedFeat is a feedback-driven and model-aware feature engineering framework that leverages LLM reasoning with domain knowledge. It provides feature explanations based on SHAP values while tracking successful and failed proposals to guide feature discovery. By incorporating model awareness, it prioritizes informative signals that are difficult for the downstream model to learn directly.
Result: Across a broad range of clinical prediction tasks, MedFeat achieves stable improvements over various baselines and discovers clinically meaningful features that generalize under distribution shift, demonstrating robustness across years and from ICU cohorts to general hospitalized patients.
Conclusion: MedFeat offers insights into real-world deployment of feature engineering in healthcare tabular predictions, showing that model-aware LLM-driven feature engineering can improve performance and discover meaningful features that generalize well.
Abstract: In healthcare tabular predictions, classical models with feature engineering often outperform neural approaches. Recent advances in Large Language Models enable the integration of domain knowledge into feature engineering, offering a promising direction. However, existing approaches typically rely on a broad search over predefined transformations, overlooking downstream model characteristics and feature importance signals. We present MedFeat, a feedback-driven and model-aware feature engineering framework that leverages LLM reasoning with domain knowledge and provides feature explanations based on SHAP values while tracking successful and failed proposals to guide feature discovery. By incorporating model awareness, MedFeat prioritizes informative signals that are difficult for the downstream model to learn directly due to its characteristics. Across a broad range of clinical prediction tasks, MedFeat achieves stable improvements over various baselines and discovers clinically meaningful features that generalize under distribution shift, demonstrating robustness across years and from ICU cohorts to general hospitalized patients, thereby offering insights into real-world deployment. Code required to reproduce our experiments will be released, subject to dataset agreements and institutional policies.
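The feedback loop is roughly: propose a feature, refit the downstream model, keep the feature only if validation performance improves, and log accepted and rejected proposals to steer the next round. A toy sketch with a fixed, hypothetical proposal list (MedFeat itself generates proposals via LLM reasoning and uses SHAP explanations, which are omitted here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 4))  # toy tabular cohort
y = (X[:, 0] * X[:, 1] + 0.3 * rng.standard_normal(400) > 0).astype(int)

# Hypothetical LLM proposals: named transformations over existing columns.
proposals = {
    "interaction_0x1": lambda X: X[:, 0] * X[:, 1],
    "ratio_2_over_3": lambda X: X[:, 2] / (np.abs(X[:, 3]) + 1.0),
}

def val_auc(Xf, y):
    Xtr, Xva, ytr, yva = train_test_split(Xf, y, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
    return roc_auc_score(yva, clf.predict_proba(Xva)[:, 1])

accepted, rejected, Xcur = [], [], X
best = val_auc(Xcur, y)
for name, fn in proposals.items():
    Xtry = np.column_stack([Xcur, fn(X)])
    score = val_auc(Xtry, y)
    if score > best:              # feedback: keep only helpful features
        accepted.append(name)
        Xcur, best = Xtry, score
    else:
        rejected.append(name)     # failed proposals inform the next round
```

The "model-aware" aspect is that proposals are judged against the actual downstream model's validation performance, not against a generic feature-quality heuristic.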
[432] MedCalc-Bench Doesn’t Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
Artus Krohn-Grimberghe
Main category: cs.LG
TL;DR: MedCalc-Bench audit reveals calculator implementation errors, shows “open-book” prompting with calculator specs achieves 81-85% accuracy surpassing RL methods, suggesting benchmark measures tool-use rather than clinical reasoning.
Details
Motivation: To challenge the current framing of MedCalc-Bench as a clinical reasoning benchmark by auditing its calculator implementations and demonstrating that performance improvements come from better tool-use rather than enhanced clinical reasoning capabilities.
Method: 1) Systematic audit of calculator implementations identifying and fixing over 20 errors; 2) Testing “open-book” prompting where models receive calculator specifications at inference time; 3) Establishing upper bounds using advanced models like GPT-5.2-Thinking.
Result: Open-book prompting raised accuracy from ~52% to 81-85% on GLM models, surpassing all published RL-trained systems (74%) without fine-tuning. Upper bound established at 95-97% with residual errors due to ground-truth issues and dataset ambiguities.
Conclusion: MedCalc-Bench primarily measures formula memorization and arithmetic precision rather than clinical reasoning, and would be better framed as a tool-use evaluation benchmark.
Abstract: MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach-RL with verifiable rewards-reaching 74%. We present three contributions that challenge the benchmark’s current framing. First, we conduct a systematic audit of the benchmark’s calculator implementations, identifying and fixing over 20 errors ranging from critical formula inaccuracies to runtime bugs in a NeurIPS-published dataset. Second, we show that a simple intervention-providing the model with the calculator specification at inference time (“open-book” prompting)-raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we establish an upper bound of 95-97% using GPT-5.2-Thinking, with residual errors attributable primarily to ground-truth issues and dataset ambiguities. Our findings suggest that MedCalc-Bench predominantly measures formula memorization and arithmetic precision rather than clinical reasoning, and would be better framed as a tool-use evaluation.
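The “open-book” intervention is simple to reproduce in spirit: attach the calculator specification to the prompt so the model computes from the spec rather than recalling the formula. A sketch using the standard Cockcroft-Gault creatinine clearance formula as the reference calculator; the prompt wording is illustrative, not the paper's exact template:

```python
def cockcroft_gault(age, weight_kg, scr_mg_dl, female):
    """Reference Cockcroft-Gault creatinine clearance (mL/min)."""
    crcl = (140 - age) * weight_kg / (72 * scr_mg_dl)
    return crcl * 0.85 if female else crcl

SPEC = (
    "Cockcroft-Gault CrCl = (140 - age) * weight_kg / (72 * Scr[mg/dL]), "
    "multiplied by 0.85 if female."
)

def open_book_prompt(patient_note, spec=SPEC):
    # "Open book": the calculator specification rides along with the case,
    # so the model applies it step by step instead of recalling it.
    return (f"Calculator specification:\n{spec}\n\n"
            f"Patient note:\n{patient_note}\n\n"
            "Apply the specification step by step and report the value.")

note = "58-year-old man, 80 kg, serum creatinine 1.2 mg/dL."
prompt = open_book_prompt(note)
reference = round(cockcroft_gault(58, 80, 1.2, female=False), 1)  # 75.9 mL/min
```

A deterministic reference implementation like `cockcroft_gault` is also what an audit needs to detect ground-truth errors in the benchmark's own calculators.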
[433] Characterizing and Predicting Wildfire Evacuation Behavior: A Dual-Stage ML Approach
Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das
Main category: cs.LG
TL;DR: Machine learning analysis of wildfire evacuation behavior using survey data reveals distinct behavioral typologies and shows transportation mode can be predicted from household characteristics, while evacuation timing remains difficult to predict due to dynamic fire conditions.
Details
Motivation: To better understand the complex, variable nature of wildfire evacuation behavior and develop data-driven approaches for emergency planning and resource allocation by identifying patterns in household decision-making.
Method: Used large-scale MTurk survey data from California, Colorado, and Oregon residents, applying unsupervised methods (Multiple Correspondence Analysis, K-Modes clustering, Latent Class Analysis) to identify behavioral typologies and supervised models to predict evacuation outcomes.
Result: Identified consistent behavioral subgroups differentiated by vehicle access, disaster planning, technological resources, pet ownership, and residential stability. Transportation mode can be predicted reliably from household characteristics, but evacuation timing remains difficult to classify due to dependence on dynamic fire conditions.
Conclusion: Machine learning provides valuable insights into wildfire evacuation behavior patterns, enabling targeted preparedness strategies and equitable emergency planning, though real-time evacuation timing remains challenging to predict.
Abstract: Wildfire evacuation behavior is highly variable and influenced by complex interactions among household resources, preparedness, and situational cues. Using a large-scale MTurk survey of residents in California, Colorado, and Oregon, this study integrates unsupervised and supervised machine learning methods to uncover latent behavioral typologies and predict key evacuation outcomes. Multiple Correspondence Analysis, K-Modes clustering, and Latent Class Analysis reveal consistent subgroups differentiated by vehicle access, disaster planning, technological resources, pet ownership, and residential stability. Complementary supervised models show that transportation mode can be predicted with high reliability from household characteristics, whereas evacuation timing remains difficult to classify due to its dependence on dynamic, real-time fire conditions. These findings advance data-driven understanding of wildfire evacuation behavior and demonstrate how machine learning can support targeted preparedness strategies, resource allocation, and equitable emergency planning.
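K-Modes, the categorical analogue of k-means used here, assigns rows by Hamming (mismatch) distance and updates each cluster centroid to the column-wise mode. A minimal sketch on a toy survey; the deterministic initialization is chosen for reproducibility, and real analyses would use a dedicated library such as the kmodes package:

```python
import numpy as np

def k_modes(X, k, iters=10):
    """Minimal K-Modes: Hamming-distance assignment, per-cluster mode update.
    Deterministic init (first k distinct rows) keeps this sketch reproducible."""
    modes = np.unique(X, axis=0)[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dist = (X[:, None, :] != modes[None, :, :]).sum(-1)  # n x k mismatches
        labels = dist.argmin(1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

# Toy survey rows: has_vehicle, has_disaster_plan, owns_pets (binary categories)
X = np.array([[1, 1, 0], [1, 1, 1], [1, 1, 0],
              [0, 0, 1], [0, 0, 0], [0, 0, 1]])
labels, modes = k_modes(X, k=2)
```

The recovered modes act as behavioral typologies: each cluster's mode row is the prototypical household profile for that subgroup.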
[434] Subspace Geometry Governs Catastrophic Forgetting in Low-Rank Adaptation
Brady Steele
Main category: cs.LG
TL;DR: Geometric theory shows catastrophic forgetting in LoRA follows a simple law based on minimum principal angle between task gradient subspaces, revealing rank-invariance at high angles and reconciling contradictory findings in literature.
Details
Motivation: While LoRA has become popular for parameter-efficient adaptation of large models, its behavior in continual learning settings with catastrophic forgetting remains poorly understood, with contradictory findings in literature about the role of adapter rank.
Method: Developed a geometric theory characterizing catastrophic forgetting through gradient subspace interactions, derived a forgetting law based on minimum principal angle between task gradient subspaces, and validated on synthetic tasks, Split-CIFAR100 with ViT-LoRA, and sequential GLUE with RoBERTa-LoRA.
Result: Found strong correlation (r=0.994) on synthetic tasks, discovered approximate rank-invariance property where forgetting becomes largely independent of adapter rank at high subspace angles (CV ≈0.8% synthetic, 10-19% real benchmarks), and showed rank only affects forgetting when task subspaces are similar.
Conclusion: The geometric theory provides principled guidance for continual learning with parameter-efficient fine-tuning, reconciles contradictory literature findings, and shows orthogonal methods like O-LoRA offer minimal benefit when natural orthogonality is already high.
Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for adapting large pre-trained models, yet its behavior under continual learning remains poorly understood. We present a geometric theory characterizing catastrophic forgetting in LoRA through the lens of gradient subspace interactions. Our central finding is that forgetting is governed by a simple geometric law: $\mathcal{F} = α(1 - \cos^2θ_{\min}) + β$, where $θ_{\min}$ is the minimum principal angle between task gradient subspaces. This formulation reveals an approximate rank-invariance property: at high subspace angles, forgetting becomes largely independent of the adapter rank (coefficient of variation $\approx 0.8\%$ in controlled synthetic settings; CV $\approx 10$-$19\%$ on real benchmarks, suggesting this is regime-dependent rather than absolute). We validate our theory on synthetic tasks ($r=0.994$ correlation), Split-CIFAR100 with ViT-LoRA, and sequential GLUE with RoBERTa-LoRA. Our analysis reconciles seemingly contradictory findings in the literature: we show that rank affects forgetting only when task subspaces are similar (low angle), while orthogonal methods like O-LoRA provide minimal benefit when natural orthogonality is already high. These insights provide principled guidance for continual learning with parameter-efficient fine-tuning.
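The law's key quantity, the minimum principal angle θ_min, is computable from orthonormal bases of the two gradient subspaces: the largest singular value of Q₁ᵀQ₂ is cos θ_min. A sketch with illustrative α and β (the paper fits these per setup):

```python
import numpy as np

def min_principal_angle(G1, G2):
    """Smallest principal angle between column spaces of two gradient matrices."""
    Q1, _ = np.linalg.qr(G1)
    Q2, _ = np.linalg.qr(G2)
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)  # cosines of principal angles
    return float(np.arccos(np.clip(s.max(), -1.0, 1.0)))

def predicted_forgetting(theta_min, alpha, beta):
    # F = alpha * (1 - cos^2 theta_min) + beta, with alpha, beta fit per setup
    return alpha * (1.0 - np.cos(theta_min) ** 2) + beta

rng = np.random.default_rng(0)
G_a = rng.standard_normal((100, 4))         # task-A gradient directions (d x r)
theta_same = min_principal_angle(G_a, G_a)  # identical subspaces: angle ~ 0
theta_rand = min_principal_angle(G_a, rng.standard_normal((100, 4)))
f_orthogonal = predicted_forgetting(np.pi / 2, alpha=0.5, beta=0.1)
```

At θ_min = π/2 the prediction saturates at α + β, matching the claim that forgetting plateaus (and rank stops mattering) once subspaces are near-orthogonal.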
[435] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen
Main category: cs.LG
TL;DR: MUSE is a multimodal safety evaluation platform that tests whether LLM alignment generalizes to audio, image, and video inputs using cross-modal payload generation, multi-turn attack algorithms, and modality switching.
Details
Motivation: Current safety evaluation of LLMs is mostly text-centric, lacking infrastructure to systematically test if alignment generalizes to multimodal inputs (audio, image, video). There's a need for comprehensive cross-modal safety testing.
Method: Developed MUSE platform with automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, LLM judge with 5-level safety taxonomy. Introduced Inter-Turn Modality Switching (ITMS) to augment attacks with per-turn modality rotation. Uses dual-metric framework distinguishing hard vs soft Attack Success Rate.
Result: Multi-turn strategies achieved 90-100% ASR against models with near-perfect single-turn refusal. ITMS accelerated convergence by destabilizing early-turn defenses, though didn’t uniformly raise final ASR. Modality effects were model-family-specific rather than universal.
Conclusion: Multimodal safety evaluation reveals vulnerabilities not captured by text-only testing. Cross-modal attacks can bypass defenses, and modality effects vary by model family, highlighting need for provider-aware cross-modal safety testing.
Abstract: Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
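The ITMS augmentation amounts to rotating the payload modality on every turn of a multi-turn attack. A minimal scheduling sketch; the turn descriptions are illustrative, not the attack algorithms' actual turns:

```python
from itertools import cycle

def itms_schedule(turns, modalities=("text", "image", "audio", "video")):
    """Inter-Turn Modality Switching: rotate the payload modality every turn."""
    rot = cycle(modalities)
    return [(turn, next(rot)) for turn in turns]

turns = ["establish rapport", "introduce topic", "escalate", "extract"]
schedule = itms_schedule(turns)
```

Per the results, this rotation does not necessarily raise the final attack success rate, but it destabilizes early-turn defenses and so accelerates convergence.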
[436] Scaling Reward Modeling without Human Supervision
Jingxuan Fan, Yueying Li, Zhenting Qi, Dinghuai Zhang, Kianté Brantley, Sham M. Kakade, Hanlin Zhang
Main category: cs.LG
TL;DR: Scaling reward models through unsupervised preference learning on web data without human annotations, showing strong improvements on RewardBench and downstream tasks.
Details
Motivation: Learning from feedback is crucial for advancing model capabilities and safety, but current approaches are constrained by cost and scalability of human annotations. The paper explores whether reward models can be effectively scaled through unsupervised approaches.
Method: Proposes reward-based scaling (RBS) as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Trains on 11M tokens of math-focused web data without human annotations, testing across diverse model families and scales.
Result: Achieves up to +7.7 point average improvement on RewardBench v2 accuracy, with gains up to +16.1 on in-domain math subsets. Shows consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed supervised baselines.
Conclusion: Demonstrates the feasibility and promise of training reward models without costly human annotations, showing that unsupervised approaches can effectively scale reward modeling and transfer across diverse model families and tasks.
Abstract: Learning from feedback is an instrumental process for advancing the capabilities and safety of frontier models, yet its effectiveness is often constrained by cost and scalability. We present a pilot study that explores scaling reward models through unsupervised approaches. We operationalize reward-based scaling (RBS), in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Its advantage is demonstrated in various aspects: despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, and these improvements consistently transfer across diverse initialization backbones spanning model families and scales. Across models, our method improves RewardBench v2 accuracy by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets and consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed strong supervised reward model baselines of similar size. Overall, we demonstrate the feasibility and promise of training reward models without costly and potentially unreliable human annotations.
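One way to read the RBS construction is as preference pairs in which a document's true suffix is preferred over a mismatched one, given the prefix as prompt. The pairing rule below is an illustrative interpretation of "preference learning over document prefixes and suffixes", not the paper's exact pipeline:

```python
def prefix_suffix_pairs(docs, split_frac=0.5):
    """Build preference pairs: true continuation preferred over a mismatched one."""
    pairs = []
    for i, doc in enumerate(docs):
        toks = doc.split()
        cut = max(1, int(len(toks) * split_frac))
        prefix, chosen = " ".join(toks[:cut]), " ".join(toks[cut:])
        other = docs[(i + 1) % len(docs)].split()   # suffix from a different doc
        rejected = " ".join(other[max(1, int(len(other) * split_frac)):])
        pairs.append({"prompt": prefix, "chosen": chosen, "rejected": rejected})
    return pairs

docs = ["the integral of x squared is x cubed over three plus C",
        "solving 2x + 3 = 11 gives x = 4"]
pairs = prefix_suffix_pairs(docs)
```

Each dict then feeds a standard preference-learning objective, with no human annotator in the loop: the web corpus itself supplies the chosen continuations.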
[437] Efficient Sparse Selective-Update RNNs for Long-Range Sequence Modeling
Bojian Yin, Shurong Wang, Haoyu Tan, Sander Bohte, Federico Corradi, Guoqi Li
Main category: cs.LG
TL;DR: Selective-Update RNNs (suRNNs) introduce neuron-level binary switches that only update memory for informative events, overcoming memory decay in traditional RNNs by preserving memory during redundant input periods.
Details
Motivation: Traditional RNNs suffer from memory decay due to rigid update schedules that force memory overwriting at every time step, even during static or redundant input periods, making it difficult to retain information from distant past events.
Method: suRNNs use neuron-level binary switches that selectively open only for informative events, decoupling recurrent updates from raw sequence length and allowing memory preservation during low-information intervals.
Result: suRNNs match or exceed Transformer accuracy on Long Range Arena, WikiText, and synthetic benchmarks while remaining significantly more efficient for long-term storage.
Conclusion: The selective update mechanism resolves the mismatch between sequence length and information content, establishing a new direction for achieving Transformer-level performance within efficient recurrent modeling frameworks.
Abstract: Real-world sequential signals, such as audio or video, contain critical information that is often embedded within long periods of silence or noise. While recurrent neural networks (RNNs) are designed to process such data efficiently, they often suffer from “memory decay” due to a rigid update schedule: they typically update their internal state at every time step, even when the input is static. This constant activity forces the model to overwrite its own memory and makes it hard for the learning signal to reach back to distant past events. Here we show that we can overcome this limitation using Selective-Update RNNs (suRNNs), a non-linear architecture that learns to preserve its memory when the input is redundant. By using a neuron-level binary switch that only opens for informative events, suRNNs decouple the recurrent updates from the raw sequence length. This mechanism allows the model to maintain an exact, unchanged memory of the past during low-information intervals, creating a direct path for gradients to flow across time. Our experiments on the Long Range Arena, WikiText, and other synthetic benchmarks show that suRNNs match or exceed the accuracy of much more complex models such as Transformers, while remaining significantly more efficient for long-term storage. By allowing each neuron to learn its own update timescale, our approach resolves the mismatch between how long a sequence is and how much information it actually contains. By providing a principled approach to managing temporal information density, this work establishes a new direction for achieving Transformer-level performance within the highly efficient framework of recurrent modeling.
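The core mechanism, a neuron-level binary switch that copies memory exactly when closed, can be sketched in one update step. The weights and gate parameterization below are illustrative, with the gate biases set so all gates stay shut on a silent input:

```python
import numpy as np

def surnn_step(x, h, Wx, Wh, Wg, bg, tau=0.0):
    """One selective-update step: each neuron updates only when its gate opens."""
    g = ((x @ Wg + bg) > tau).astype(float)  # binary, neuron-level switch
    h_new = np.tanh(x @ Wx + h @ Wh)
    return g * h_new + (1.0 - g) * h         # closed gates copy memory exactly

rng = np.random.default_rng(0)
din, dh = 3, 5
Wx = rng.standard_normal((din, dh))
Wh = rng.standard_normal((dh, dh)) * 0.1
Wg = rng.standard_normal((din, dh))
bg = -10.0 * np.ones(dh)   # illustrative: gates stay shut unless input is strong

h = np.zeros(dh)
x_silent = np.zeros(din)   # redundant input: no gate opens
h1 = surnn_step(x_silent, h, Wx, Wh, Wg, bg)
```

Because the closed-gate branch is an identity, the state (and the gradient path through it) is preserved unchanged across arbitrarily long low-information stretches.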
[438] Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
Keston Aquino-Michaels
Main category: cs.LG
TL;DR: Transformers struggle to learn meaningful attention sparsity patterns during end-to-end training due to “routing absorption” where Q/K/V projections adapt to whatever mask is imposed, making learned gates perform little better than random gates.
Details
Motivation: To understand why transformers fail to learn meaningful sparse attention patterns during training despite attention distributions being highly concentrated and post-hoc gating being highly accurate.
Method: Four lines of evidence using a controlled 31M-parameter transformer: (1) comparing learned vs random soft gating perplexity, (2) analyzing gradient flow through hard top-k gates, (3) distilling gates onto co-adapted Q/K/V, and (4) testing stochastic mask randomization during training.
Result: Learned gates perform nearly identically to random gates (48.73 vs 49.83 perplexity), hard gates receive zero gradient, distilled gates achieve high F1 but catastrophic perplexity when deployed, and mask randomization fails to prevent co-adaptation.
Conclusion: End-to-end sparse attention methods face “routing absorption” where Q/K/V projections adapt to masks, making learned gates ineffective. Post-hoc sparsification approaches that decouple representation learning from sparsification avoid this issue.
Abstract: Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model’s Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.
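Point (2), that hard top-k gating passes exactly zero gradient through the mask, can be checked directly with finite differences: away from ties, a small perturbation of any score never changes which entries are in the top-k, so the mask's derivative is zero almost everywhere.

```python
import numpy as np

def topk_mask(scores, k):
    """Hard top-k gate: 1 for the k largest scores, 0 elsewhere."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

scores = np.array([0.9, 0.1, 0.5, 0.3])
base = topk_mask(scores, k=2)
eps = 1e-4
# Finite-difference "gradient" of the mask w.r.t. each score: the mask is
# piecewise constant, so every difference quotient is exactly zero here.
grads = [(topk_mask(scores + eps * np.eye(4)[i], k=2) - base) / eps
         for i in range(4)]
all_zero = all(np.allclose(g, 0.0) for g in grads)
```

With no gradient flowing through the mask itself, the only learning signal reaches the Q/K/V projections, which is precisely the co-adaptation pathway the paper calls routing absorption.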
[439] Neural Paging: Learning Context Management Policies for Turing-Complete Agents
Liang Chen, Qi Liu
Main category: cs.LG
TL;DR: Neural Paging introduces a hierarchical architecture with a differentiable Page Controller to manage context window limitations in LLMs, reducing long-horizon reasoning complexity from O(N²) to O(N·K²).
Details
Motivation: Current LLMs with external memory face a critical bottleneck: finite and costly context windows that function as scarce semantic caches rather than infinite memory, limiting long-horizon reasoning capabilities.
Method: Proposes Neural Paging, a hierarchical architecture that decouples symbolic reasoning from information management. Formulates the Context Paging Problem and introduces a lightweight, differentiable Page Controller designed to approximate “Semantic Belady's Optimality” by retaining tokens with high future utility.
Result: Theoretical analysis shows Neural Paging reduces asymptotic complexity of long-horizon reasoning from O(N²) to O(N·K²) under bounded context window size K. Provides robustness bound (Theorem 4) quantifying competitive-ratio degradation under policy-dependent access with bounded sensitivity. Validation on synthetic paging traces confirms theoretical guarantees hold.
Conclusion: Neural Paging addresses the context window bottleneck in LLMs through a principled paging architecture with theoretical guarantees, enabling more efficient long-horizon reasoning while identifying opportunities for learned policies.
Abstract: The proof that Large Language Models (LLMs) augmented with external read-write memory constitute a computationally universal system has established the theoretical foundation for general-purpose agents. However, existing implementations face a critical bottleneck: the finite and costly Context Window, which functions not as infinite memory but as a scarce semantic cache. In this work, we introduce Neural Paging, a hierarchical architecture that decouples symbolic reasoning from information resource management. We formulate the Context Paging Problem (CPP) and propose a lightweight, differentiable Page Controller designed to approximate “Semantic Belady’s Optimality” – retaining tokens with high future utility under explicit assumptions on access patterns. We provide theoretical analysis showing that, under bounded context window size $K$, Neural Paging reduces the asymptotic complexity of long-horizon reasoning from quadratic $O(N^2)$ to $O(N \cdot K^2)$, and we derive a robustness bound (Theorem 4) that quantifies competitive-ratio degradation under policy-dependent access with bounded sensitivity. We validate these bounds on synthetic paging traces, confirming that the theoretical guarantees hold and identifying significant slack that motivates learned policies.
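The controller's role, admitting each new token and evicting the entry with lowest predicted future utility once the context exceeds K, can be sketched with a hand-written utility table standing in for the learned Page Controller:

```python
def page_step(context, new_token, utility, K):
    """Admit a token; if the context exceeds K, evict the lowest-utility entry.
    This greedy rule is the stand-in for approximating Belady's policy:
    keep whatever has the highest predicted future utility."""
    context = context + [new_token]
    if len(context) > K:
        context.remove(min(context, key=utility))
    return context

# Illustrative utilities a learned controller might predict for these tokens.
scores = {"goal": 0.9, "plan": 0.8, "chit-chat": 0.1, "step-1": 0.7, "aside": 0.2}
ctx = []
for tok in ["goal", "chit-chat", "plan", "aside", "step-1"]:
    ctx = page_step(ctx, tok, scores.get, K=3)
```

Because attention cost is quadratic in the window, holding the window at K while the trace grows to N is what yields the O(N·K²) bound in place of O(N²).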
[440] Safety Training Persists Through Helpfulness Optimization in LLM Agents
Benjamin Plaut
Main category: cs.LG
TL;DR: In agentic (multi-step, tool-use) LLM settings, safety training persists through subsequent helpfulness optimization, with all DPO configurations ending near a linear safety-helpfulness Pareto frontier rather than finding a best-of-both-worlds strategy.
Details
Motivation: The paper addresses safety in agentic (multi-step, tool-use) LLM settings where safety refers to harmful actions taken by the model, extending beyond single-step chat refusal scenarios. It investigates how safety training interacts with helpfulness training in post-training optimization.
Method: The study compares effects of running Direct Preference Optimization (DPO) on safety alone, helpfulness alone, and both metrics sequentially. It examines training dynamics and measures performance along safety-helpfulness trade-offs, analyzing whether combined training finds optimal strategies.
Result: Safety training persists through subsequent helpfulness training. All training configurations end near a linear Pareto frontier (R² = 0.77). Even simultaneous training on both metrics results in points on the frontier rather than finding “best of both worlds” strategies, despite such strategies existing in the DPO dataset.
Conclusion: The findings reveal limitations in current post-training approaches for balancing safety and helpfulness in agentic LLMs, highlighting the need for better understanding of post-training dynamics and more sophisticated optimization methods.
Abstract: Safety post-training has been studied extensively in single-step “chat” settings where safety typically refers to refusing harmful requests. We study an “agentic” (i.e., multi-step, tool-use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along this frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with $R^2 = 0.77$. Even post-training on both metrics simultaneously simply results in another point on the frontier rather than finding a “best of both worlds” strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post-training dynamics.
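The linear Pareto-frontier finding is an ordinary least-squares line through the (safety, helpfulness) outcomes of the different training configurations, plus its R². A sketch on illustrative numbers (not the paper's measurements, which give R² = 0.77):

```python
import numpy as np

def pareto_fit(safety, helpfulness):
    """Least-squares line through (safety, helpfulness) outcomes and its R^2."""
    slope, intercept = np.polyfit(safety, helpfulness, 1)
    pred = slope * np.asarray(safety) + intercept
    ss_res = ((np.asarray(helpfulness) - pred) ** 2).sum()
    ss_tot = ((helpfulness - np.mean(helpfulness)) ** 2).sum()
    return slope, intercept, 1.0 - ss_res / ss_tot

# Illustrative outcomes of different training orders (made-up points that
# trade safety against helpfulness almost linearly).
safety      = np.array([0.95, 0.80, 0.60, 0.40, 0.20])
helpfulness = np.array([0.30, 0.45, 0.55, 0.75, 0.90])
slope, intercept, r2 = pareto_fit(safety, helpfulness)
```

A high R² on such a fit is exactly the signature the paper reports: every configuration lands near one trade-off line instead of pushing the frontier outward.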
[441] Generalized Discrete Diffusion with Self-Correction
Linxuan Wang, Ziyi Wang, Yikun Bai, Wei Deng, Guang Lin, Qifan Song
Main category: cs.LG
TL;DR: SCDD is a self-correcting discrete diffusion model that reformulates pretrained self-correction with explicit state transitions in discrete time, enabling more efficient parallel decoding while preserving generation quality.
Details
Motivation: Existing self-correction methods for discrete diffusion models have limitations: inference-time or post-training approaches suffer from limited generalization and may impair reasoning performance, while GIDD's continuous interpolation-based pipeline has opaque interactions between uniform transitions and absorbing masks, complicating hyperparameter tuning and hindering practical performance.
Method: Proposes Self-Correcting Discrete Diffusion (SCDD) model that reformulates pretrained self-correction with explicit state transitions and learns directly in discrete time. The framework simplifies training noise schedule, eliminates redundant remasking step, and relies exclusively on uniform transitions to learn self-correction.
Result: Experiments at GPT-2 scale demonstrate that the method enables more efficient parallel decoding while preserving generation quality.
Conclusion: SCDD provides an improved approach to self-correction in discrete diffusion models with clearer state transitions and better practical performance.
Abstract: Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.
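For intuition, the uniform transitions that SCDD relies on can be written as an explicit one-step transition matrix in discrete time. This is a generic uniform-noising sketch, not SCDD's exact schedule or parameterization:

```python
import numpy as np

def uniform_transition_matrix(beta, vocab_size):
    """One-step forward transition for uniform discrete diffusion:
    with prob. 1-beta keep the token, with prob. beta resample
    uniformly over the vocabulary."""
    V = vocab_size
    return (1.0 - beta) * np.eye(V) + beta * np.ones((V, V)) / V

def corrupt(tokens, Q, rng):
    """Sample x_t ~ Cat(Q[x_{t-1}]) independently at each position
    via inverse-CDF sampling."""
    probs = Q[tokens]                      # (seq_len, V)
    cum = probs.cumsum(axis=1)
    u = rng.random((len(tokens), 1))
    return (u < cum).argmax(axis=1)

rng = np.random.default_rng(0)
Q = uniform_transition_matrix(beta=0.3, vocab_size=5)
x0 = np.array([0, 1, 2, 3, 4])
x1 = corrupt(x0, Q, rng)
```

Because uniform noise can replace a token with any other token (not just a mask), a model trained to reverse it can revise already-decoded tokens, which is the basis of self-correction.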
[442] Physics-Informed Neural Networks with Architectural Physics Embedding for Large-Scale Wave Field Reconstruction
Huiwen Zhang, Feng Ye, Chu Ma
Main category: cs.LG
TL;DR: PE-PINN integrates physical guidance directly into neural network architecture via envelope transformation layers to accelerate convergence and reduce memory usage for large-scale wave field reconstruction.
Details
Motivation: Standard physics-informed neural networks (PINNs) are limited for large-scale wave field reconstruction by slow convergence, optimization instability, and spectral bias, because physics is embedded only in the loss function.
Method: Proposes architecture physics embedded (PE)-PINN with envelope transformation layers parameterized by source properties, material interfaces, and wave physics to integrate physical guidance directly into the network architecture.
Result: PE-PINN achieves >10x speedup in convergence compared to standard PINNs and several orders of magnitude reduction in memory usage compared to FEM, enabling large-scale 2D/3D electromagnetic wave reconstruction.
Conclusion: PE-PINN enables high-fidelity modeling for large-scale wave field analysis in applications like wireless communications, sensing, and room acoustics by overcoming limitations of both traditional numerical methods and standard PINNs.
Abstract: Large-scale wave field reconstruction requires precise solutions but faces challenges with computational efficiency and accuracy. The physics-based numerical methods like Finite Element Method (FEM) provide high accuracy but struggle with large-scale or high-frequency problems due to prohibitive computational costs. Pure data-driven approaches excel in speed but often lack sufficient labeled data for complex scenarios. Physics-informed neural networks (PINNs) integrate physical principles into machine learning models, offering a promising solution by bridging these gaps. However, standard PINNs embed physical principles only in loss functions, leading to slow convergence, optimization instability, and spectral bias, limiting their ability for large-scale wave field reconstruction. This work introduces architecture physics embedded (PE)-PINN, which integrates additional physical guidance directly into the neural network architecture beyond Helmholtz equations and boundary conditions in loss functions. Specifically, a new envelope transformation layer is designed to mitigate spectral bias with kernels parameterized by source properties, material interfaces, and wave physics. Experiments demonstrate that PE-PINN achieves more than 10 times speedup in convergence compared to standard PINNs and several orders of magnitude reduction in memory usage compared to FEM. This breakthrough enables high-fidelity modeling for large-scale 2D/3D electromagnetic wave reconstruction involving reflections, refractions, and diffractions in room-scale domains, readily applicable to wireless communications, sensing, room acoustics, and other fields requiring large-scale wave field analysis.
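The spectral-bias motivation behind the envelope idea can be seen in one dimension: a PINN fitting the raw field must resolve the oscillatory carrier sin(kx), whereas an envelope formulation u(x) = a(x)·sin(kx) leaves only the smooth envelope a(x) to learn. A minimal numpy check (finite differences; this is an illustration of the physics, not the paper's layer) that the carrier already satisfies the 1-D Helmholtz equation u'' + k²u = 0:

```python
import numpy as np

# 1-D Helmholtz equation u'' + k^2 u = 0. The plane-wave carrier
# sin(kx) solves it exactly, so with an envelope parameterization
# u(x) = a(x) * sin(kx) a network only has to fit the smooth a(x).
k = 40.0
x = np.linspace(0.0, 1.0, 4001)
h = x[1] - x[0]

u = np.sin(k * x)                                # envelope a(x) = 1
u_xx = (u[2:] - 2 * u[1:-1] + u[:-2]) / h**2     # central difference
residual = u_xx + k**2 * u[1:-1]
max_res = np.abs(residual).max()
```

The residual is near zero (up to finite-difference error), so all of the hard, high-frequency structure is absorbed by the carrier rather than the learned part.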
[443] Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, Mohammad Ghavamzadeh
Main category: cs.LG
TL;DR: A principled ordinal regression framework for reward modeling that learns threshold parameters from Likert scale preferences, outperforming heuristic methods.
Details
Motivation: Current reward modeling approaches lack a principled mathematical framework for leveraging ordinal preference data (Likert scale ratings), relying instead on ad-hoc heuristics applied to binary preference models.
Method: Formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem, deriving two loss functions: negative log-likelihood loss and all-threshold loss, which learn threshold parameters capturing ordinal structure.
Result: Experimental results on multiple benchmarks show the ordinal regression approach achieves competitive or superior performance compared to existing heuristic methods across diverse evaluation categories including chat, reasoning, and safety tasks.
Conclusion: Provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, enabling more effective utilization of fine-grained human feedback beyond ad-hoc modifications of binary preference models.
Abstract: Reward modeling is crucial for aligning large language models with human preferences, yet current approaches lack a principled mathematical framework for leveraging ordinal preference data. When human annotators provide graded preferences on a Likert scale (e.g., significantly better, better, slightly better, negligibly better), existing methods typically apply ad-hoc heuristics, such as margin terms or scaling factors, to loss functions derived from binary preference models like Bradley-Terry. These approaches lack an underlying mathematical model for how ordinal preference data is generated. We present a theoretically grounded framework that formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem. We derive two loss functions from this formulation: a negative log-likelihood loss and an all-threshold loss, both of which learn threshold parameters that naturally capture the ordinal structure of preferences. Unlike existing heuristic methods that manually specify fixed margins or scaling weights, our approach learns these parameters directly from data within a coherent probabilistic framework. Experimental results on multiple benchmarks demonstrate that our ordinal regression approach consistently achieves competitive or superior performance compared to existing heuristic methods across diverse evaluation categories including chat, reasoning, and safety tasks. Our work provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, moving beyond ad-hoc modifications of binary preference models to enable more effective utilization of fine-grained human feedback.
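The two losses can be sketched in a few lines for a scalar reward. The cumulative-link parameterization P(Y ≤ j) = σ(θ_j − r) is the standard ordinal-regression form; the thresholds here are fixed for illustration, whereas the paper learns them from data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_nll(reward, label, thresholds):
    """Negative log-likelihood under a cumulative-link model:
    P(Y <= j) = sigmoid(theta_j - r), with thresholds sorted
    ascending and label in {0, ..., K-1}, K = len(thresholds)+1."""
    cdf = np.concatenate(([0.0], sigmoid(thresholds - reward), [1.0]))
    return -np.log(cdf[label + 1] - cdf[label])

def all_threshold_loss(reward, label, thresholds):
    """Sum of binary logistic losses, one per threshold: the reward
    should land above every threshold below its label and below
    every threshold at or above it."""
    signs = np.where(np.arange(len(thresholds)) < label, 1.0, -1.0)
    return np.sum(np.log1p(np.exp(-signs * (reward - thresholds))))

theta = np.array([-1.0, 0.0, 1.0])        # 4 ordinal levels
nll_mid = ordinal_nll(reward=0.5, label=2, thresholds=theta)
at_mid = all_threshold_loss(reward=0.5, label=2, thresholds=theta)
```

Because the class probabilities are differences of adjacent CDF values, they sum to one automatically, which is what makes this a coherent probabilistic model rather than a margin heuristic.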
[444] Adaptive Personalized Federated Learning via Multi-task Averaging of Kernel Mean Embeddings
Jean-Baptiste Fermanian, Batiste Le Bars, Aurélien Bellet
Main category: cs.LG
TL;DR: A personalized federated learning method where agents learn collaborative weights from data using kernel mean embeddings, enabling automatic adaptation between global and local learning without prior knowledge of data heterogeneity.
Details
Motivation: To develop a personalized federated learning approach that can automatically adapt to data heterogeneity across agents without requiring prior knowledge about the relationships between different agents' data distributions.
Method: Formulates collaborative weight estimation as a kernel mean embedding problem with multiple data sources, uses multi-task averaging to capture statistical relationships between agents, and proposes a practical implementation using random Fourier features for communication efficiency.
Result: Derives finite-sample guarantees on local excess risks for broad distribution classes, explicitly quantifies statistical gains of collaboration, and validates theoretical results through numerical experiments.
Conclusion: The proposed method provides a fully adaptive PFL approach that automatically transitions between global and local learning regimes, with theoretical guarantees and practical communication-efficient implementation.
Abstract: Personalized Federated Learning (PFL) enables a collection of agents to collaboratively learn individual models without sharing raw data. We propose a new PFL approach in which each agent optimizes a weighted combination of all agents’ empirical risks, with the weights learned from data rather than specified a priori. The novelty of our method lies in formulating the estimation of these collaborative weights as a kernel mean embedding estimation problem with multiple data sources, leveraging tools from multi-task averaging to capture statistical relationships between agents. This perspective yields a fully adaptive procedure that requires no prior knowledge of data heterogeneity and can automatically transition between global and local learning regimes. By recasting the objective as a high-dimensional mean estimation problem, we derive finite-sample guarantees on local excess risks for a broad class of distributions, explicitly quantifying the statistical gains of collaboration. To address communication constraints inherent to federated settings, we also propose a practical implementation based on random Fourier features, which allows one to trade communication cost for statistical efficiency. Numerical experiments validate our theoretical results.
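The random-Fourier-feature implementation can be sketched as follows: each agent communicates only a D-dimensional mean feature vector (its empirical kernel mean embedding), and inner products of these vectors approximate mean RBF-kernel values between datasets. The feature count and bandwidth below are illustrative choices, not the paper's settings:

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier features approximating an RBF kernel:
    phi(x) = sqrt(2/D) * cos(W x + b)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, gamma = 3, 4096, 0.5                 # k(x,y) = exp(-gamma ||x-y||^2)
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

# Each agent sends only its mean feature vector, never raw data.
X_a = rng.normal(size=(200, d))
X_b = rng.normal(loc=0.2, size=(200, d))
mu_a = rff_features(X_a, W, b).mean(axis=0)
mu_b = rff_features(X_b, W, b).mean(axis=0)

# Inner product of embeddings approximates the exact mean kernel.
approx = mu_a @ mu_b
sq = ((X_a[:, None, :] - X_b[None, :, :]) ** 2).sum(-1)
exact = np.exp(-gamma * sq).mean()
```

Increasing D trades communication cost for statistical accuracy, which is exactly the trade-off the paper formalizes.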
[445] Structured vs. Unstructured Pruning: An Exponential Gap
Davide Ferré, Frédéric Giroire, Emanuele Natale, Frederik Mallmann-Trenn
Main category: cs.LG
TL;DR: Theoretical analysis shows exponential separation between neuron pruning and weight pruning for approximating ReLU neurons, with neuron pruning requiring Ω(d/ε) neurons vs O(d log(1/ε)) for weight pruning.
Details
Motivation: To theoretically analyze the Strong Lottery Ticket Hypothesis (SLTH) for structured neuron pruning vs unstructured weight pruning, specifically examining the approximation of single ReLU neurons using two-layer networks.
Method: Theoretical analysis comparing neuron pruning (structured) and weight pruning (unstructured) for approximating a bias-free ReLU neuron using randomly initialized bias-free two-layer ReLU networks. Derives lower bounds for neuron pruning and upper bounds for weight pruning.
Result: Neuron pruning requires Ω(d/ε) hidden neurons to ε-approximate a target ReLU neuron, while weight pruning achieves ε-approximation with only O(d log(1/ε)) neurons, establishing an exponential separation between the two pruning paradigms.
Conclusion: Structured neuron pruning is fundamentally less efficient than unstructured weight pruning for approximating target functions, providing theoretical justification for the practical observation that weight pruning often outperforms neuron pruning.
Abstract: The Strong Lottery Ticket Hypothesis (SLTH) posits that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention. In this work, we consider the problem of approximating a single bias-free ReLU neuron using a randomly initialized bias-free two-layer ReLU network, thereby isolating the intrinsic limitations of neuron pruning. We show that neuron pruning requires a starting network with $\Omega(d/\varepsilon)$ hidden neurons to $\varepsilon$-approximate a target ReLU neuron. In contrast, weight pruning achieves $\varepsilon$-approximation with only $O(d\log(1/\varepsilon))$ neurons, establishing an exponential separation between the two pruning paradigms.
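The two pruning modes being compared are easy to state as masks on a two-layer network. The masks below are arbitrary illustrations of the search space (the SLTH results concern the existence of good masks, not random ones):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 8))   # hidden x input weights, 2-layer net
W2 = rng.normal(size=(1, 64))   # output weights

# Unstructured (weight) pruning: zero individual entries anywhere.
weight_mask = rng.random(W1.shape) > 0.5

# Structured (neuron) pruning: keep or drop entire hidden units,
# i.e. whole rows of W1 (and the matching columns of W2).
keep_neurons = rng.random(64) > 0.5
neuron_mask = np.repeat(keep_neurons[:, None], W1.shape[1], axis=1)

W1_weight_pruned = W1 * weight_mask
W1_neuron_pruned = W1 * neuron_mask
```

Weight pruning searches over all 2^(64·8) entry masks, while neuron pruning only over 2^64 row masks; the paper shows this restriction costs exponentially in the approximation accuracy ε.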
[446] Talking with Verifiers: Automatic Specification Generation for Neural Network Verification
Yizhak Y. Elboher, Reuven Peleg, Zhouxing Shi, Guy Katz, Jan Křetínský
Main category: cs.LG
TL;DR: A framework that translates natural language specifications into formal verification queries for neural networks, enabling verification of high-level semantic requirements rather than just low-level constraints.
Details
Motivation: Current neural network verification tools only support low-level constraints over raw inputs/outputs, limiting practical adoption. Real-world correctness requirements are naturally expressed at higher semantic levels, but DNNs lack explicit mapping to human-understandable features.
Method: Introduces a novel component to the verification pipeline that analyzes natural language specifications and automatically translates them into formal verification queries compatible with existing neural network verifiers.
Result: Successfully verifies complex semantic specifications on both structured and unstructured datasets that were previously inaccessible. The translation maintains high fidelity to user intent with low computational overhead.
Conclusion: Substantially extends the applicability of formal DNN verification to real-world, high-level requirements by bridging the gap between natural language specifications and formal verification tools.
Abstract: Neural network verification tools currently support only a narrow class of specifications, typically expressed as low-level constraints over raw inputs and outputs. This limitation significantly hinders their adoption and practical applicability across diverse application domains where correctness requirements are naturally expressed at a higher semantic level. This challenge is rooted in the inherent nature of deep neural networks, which learn internal representations that lack an explicit mapping to human-understandable features. To address this, we bridge this gap by introducing a novel component to the verification pipeline, making existing verification tools applicable to a broader range of domains and specification styles. Our framework enables users to formulate specifications in natural language, which are then automatically analyzed and translated into formal verification queries compatible with state-of-the-art neural network verifiers. We evaluate our approach on both structured and unstructured datasets, demonstrating that it successfully verifies complex semantic specifications that were previously inaccessible. Our results show that this translation process maintains high fidelity to user intent while incurring low computational overhead, thereby substantially extending the applicability of formal DNN verification to real-world, high-level requirements.
[447] CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, An Zou
Main category: cs.LG
TL;DR: CUDABench is a comprehensive benchmark for evaluating LLMs’ text-to-CUDA generation capabilities, covering diverse domains with compilation, functional, and performance metrics.
Details
Motivation: Current benchmarks focus only on high-level language to CUDA translation, overlooking the more challenging text-to-CUDA generation task. There's also a need for better performance assessment of LLM-generated GPU programs given hardware-specific requirements.
Method: 1) Construct CUDABench-Set covering Breadth-Depth-Difficulty evaluation space across AI, scientific computing, and data analytics domains. 2) Develop CUDABench-Score and Generative Verification Pipeline assessing compilation correctness, functional consistency via execution-based verification, and Performance-Score using roofline-based metrics.
Result: Benchmarking state-of-the-art LLMs reveals: 1) notable mismatch between high compilation success rates and low functional correctness, 2) lack of domain-specific algorithmic knowledge, and 3) suboptimal utilization of GPU hardware resources.
Conclusion: CUDABench provides a comprehensive framework for evaluating text-to-CUDA capabilities, revealing key challenges in LLM-generated GPU programming that need addressing for practical deployment.
Abstract: Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical features of GPU programming, accurately assessing the performance of LLM-generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of LLMs. First, we construct CUDABench-Set, which covers Breadth-Depth-Difficulty evaluation space in diverse application domains, including artificial intelligence, scientific computing, and data analytics, etc. Furthermore, we propose CUDABench-Score and Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution-based verification, and (3) a novel roofline-based metric, Performance-Score. Benchmarking state-of-the-art LLMs reveals insightful findings and challenges of text-to-CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at https://github.com/CUDA-Bench/CUDABench.
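A hedged sketch of what a roofline-based score can look like: achieved throughput relative to the roofline bound min(peak FLOP/s, arithmetic intensity × peak bandwidth). The peak numbers are hypothetical and this is a generic roofline metric, not necessarily CUDABench's exact Performance-Score:

```python
def roofline_score(flops, bytes_moved, runtime_s,
                   peak_flops=1.0e13, peak_bw=1.0e12):
    """Fraction of roofline-attainable performance reached by a
    kernel: attainable = min(peak_flops, intensity * peak_bw)."""
    intensity = flops / bytes_moved          # FLOPs per byte
    attainable = min(peak_flops, intensity * peak_bw)
    achieved = flops / runtime_s
    return achieved / attainable

# Memory-bound kernel (intensity 0.5 FLOP/byte): the score is
# measured against the bandwidth roof, not peak compute.
score = roofline_score(flops=2.0e9, bytes_moved=4.0e9, runtime_s=4.0e-3)
```

Normalizing by the attainable bound rather than raw runtime is what lets such a metric compare memory-bound and compute-bound kernels on the same scale.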
[448] Concept Heterogeneity-aware Representation Steering
Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen
Main category: cs.LG
TL;DR: CHaRS introduces a concept heterogeneity-aware representation steering method using optimal transport between Gaussian mixture models for more effective LLM control than global steering.
Details
Motivation: Existing representation steering methods use single global directions assuming homogeneous concept representation, but LLM representations are actually non-homogeneous with clustered, context-dependent structure, making global steering brittle.
Method: Models source and target representations as Gaussian mixture models, formulates steering as discrete optimal transport between semantic latent clusters, derives input-dependent steering map via barycentric projection producing kernel-weighted cluster-level shifts.
Result: CHaRS yields more effective behavioral control than global steering across numerous experimental settings.
Conclusion: Concept heterogeneity-aware representation steering via optimal transport provides superior control over LLM behavior compared to traditional global steering methods.
Abstract: Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
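Under equal-weight clusters, the discrete OT step reduces to a minimum-cost matching, and the barycentric projection becomes a kernel-weighted sum of cluster-level shifts. A small numpy sketch (brute-force matching for small K; the Gaussian kernel weights and bandwidth are assumptions, not CHaRS's exact construction):

```python
import itertools
import numpy as np

def cluster_transport(src_means, tgt_means):
    """Discrete OT between equal-weight clusters reduces to a
    minimum-cost matching; brute-force it for small K."""
    K = len(src_means)
    cost = ((src_means[:, None, :] - tgt_means[None, :, :]) ** 2).sum(-1)
    best = min(itertools.permutations(range(K)),
               key=lambda p: sum(cost[i, p[i]] for i in range(K)))
    return np.array(best)

def steer(x, src_means, tgt_means, match, bandwidth=1.0):
    """Input-dependent steering map: kernel-weighted combination of
    the matched cluster-level shifts (barycentric projection)."""
    d2 = ((x - src_means) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth**2))
    w = w / w.sum()
    shifts = tgt_means[match] - src_means
    return x + w @ shifts

src = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
tgt = src + np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
match = cluster_transport(src, tgt)
x_steered = steer(np.array([0.1, -0.1]), src, tgt, match)
```

An input near cluster 0 receives (almost exclusively) cluster 0's shift, whereas a single global direction would apply the same translation everywhere, which is exactly the brittleness the paper targets.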
[449] Length Generalization Bounds for Transformers
Andy Yang, Pascal Bergsträßer, Georg Zetzsche, David Chiang, Anthony W. Lin
Main category: cs.LG
TL;DR: This paper proves that computable length generalization bounds don’t exist for CRASP (and thus transformers) with two or more layers, but provides computable bounds for positive CRASP (equivalent to fixed-precision transformers) with exponential length complexity.
Details
Motivation: The paper addresses the open problem of the computability of length generalization bounds for CRASP, a class of languages closely linked to transformers. Length generalization is crucial for a learning algorithm to make correct predictions on inputs of any length given finite training data, but whether such bounds can be computed for transformer-like architectures was previously unknown.
Method: The authors use tools from theoretical computer science and formal language theory to analyze CRASP. They prove non-existence results for general CRASP through impossibility arguments, and provide constructive bounds for positive CRASP by establishing its equivalence to fixed-precision transformers and analyzing their computational properties.
Result: Main result: Non-existence of computable length generalization bounds for CRASP with two or more layers (and hence for transformers). Complementary result: Computable bounds exist for positive CRASP (equivalent to fixed-precision transformers), but with exponential length complexity. The bounds are proven optimal.
Conclusion: Transformers lack computable length generalization bounds in general, but fixed-precision transformers (equivalent to positive CRASP) admit such bounds with exponential complexity. This provides fundamental theoretical limits on length generalization for transformer architectures.
Abstract: Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for CRASP, a class of languages which is closely linked to transformers. A positive partial result was recently shown by Chen et al. for CRASP with only one layer and, under some restrictions, also with two layers. We provide complete answers to the above open problem. Our main result is the non-existence of computable length generalization bounds for CRASP (already with two layers) and hence for transformers. To complement this, we provide a computable bound for the positive fragment of CRASP, which we show equivalent to fixed-precision transformers. For both positive CRASP and fixed-precision transformers, we show that the length complexity is exponential, and prove optimality of the bounds.
[450] High-order Knowledge Based Network Controllability Robustness Prediction: A Hypergraph Neural Network Approach
Shibing Mo, Jiarui Zhang, Jiayu Xie, Xiangyi Teng, Jing Liu
Main category: cs.LG
TL;DR: NCR-HoK: A dual hypergraph attention neural network that predicts network controllability robustness using high-order structural information, outperforming existing methods with lower computational cost.
Details
Motivation: Traditional methods for evaluating network controllability robustness rely on computationally expensive attack simulations, limiting scalability. Existing ML approaches focus only on pairwise interactions, ignoring high-order structural relationships that could better predict robustness.
Method: Proposes NCR-HoK with three components: 1) node feature encoder, 2) hypergraph construction capturing high-order relations, and 3) dual hypergraph attention module that simultaneously learns explicit graph structure, local high-order connections, and embedding space features.
Result: The method achieves superior performance compared to state-of-the-art network robustness learning methods on both synthetic and real-world networks while maintaining low computational overhead.
Conclusion: First exploration of high-order knowledge’s impact on network controllability robustness, demonstrating that capturing high-order structural information significantly improves prediction accuracy and computational efficiency.
Abstract: In order to evaluate the invulnerability of networks against various types of attacks and provide guidance for potential performance enhancement as well as controllability maintenance, network controllability robustness (NCR) has attracted increasing attention in recent years. Traditionally, controllability robustness is determined by attack simulations, which are computationally time-consuming and only applicable to small-scale networks. Although some machine learning-based methods for predicting network controllability robustness have been proposed, they mainly focus on pairwise interactions in complex networks, and the underlying relationships between high-order structural information and controllability robustness have not been explored. In this paper, a dual hypergraph attention neural network model based on high-order knowledge (NCR-HoK) is proposed to accomplish robustness learning and controllability robustness curve prediction. Through a node feature encoder, hypergraph construction with high-order relations, and a dedicated dual hypergraph attention module, the proposed method can effectively learn three types of network information simultaneously: explicit structural information in the original graph, high-order connection information in local neighborhoods, and hidden features in the embedding space. Notably, we explore for the first time the impact of high-order knowledge on network controllability robustness. Compared with state-of-the-art methods for network robustness learning, the proposed method achieves superior performance on both synthetic and real-world networks with low computational overhead.
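For intuition about how high-order structure is consumed, here is a single degree-normalized hypergraph convolution over an incidence matrix. This is a simplified stand-in for NCR-HoK's dual hypergraph attention module, which additionally learns attention weights over nodes and hyperedges:

```python
import numpy as np

def hypergraph_conv(X, H):
    """One degree-normalized hypergraph convolution:
    X' = D_v^{-1} H D_e^{-1} H^T X, where H is the (nodes x
    hyperedges) incidence matrix."""
    d_v = H.sum(axis=1)                    # node degrees
    d_e = H.sum(axis=0)                    # hyperedge sizes
    msg = H @ ((H.T @ X) / d_e[:, None])   # average within hyperedges
    return msg / d_v[:, None]              # average over a node's edges

# 4 nodes, 2 hyperedges: {0, 1, 2} and {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
X_out = hypergraph_conv(X, H)
```

Unlike a pairwise graph convolution, each message aggregates over a whole hyperedge, so groups of nodes (e.g. a local neighborhood) are treated as a single high-order relation.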
[451] Boosting Meta-Learning for Few-Shot Text Classification via Label-guided Distance Scaling
Yunlong Gao, Xinyue Liu, Yingbo Wang, Linlin Zong, Bo Xu
Main category: cs.LG
TL;DR: LDS improves few-shot text classification by using label semantics as supervision signals during both training and testing stages to pull sample representations closer to their class centers.
Details
Motivation: Existing few-shot text classification methods focus on complex training algorithms but ignore that randomly selected labeled samples during testing may not provide effective supervision signals, leading to misclassification.
Method: Proposes Label-guided Distance Scaling (LDS) strategy: 1) Training stage: label-guided loss injects label semantic information to pull sample representations closer to corresponding label representations; 2) Testing stage: Label-guided Scaler scales sample representations with label semantics to provide additional supervision signals.
Result: Extensive experiments show the approach significantly outperforms state-of-the-art models when combined with common meta-learners.
Conclusion: LDS effectively mitigates misclassification in few-shot text classification by leveraging label semantics as supervision signals throughout both training and testing stages.
Abstract: Few-shot text classification aims to recognize unseen classes with limited labeled text samples. Existing approaches focus on boosting meta-learners by developing complex algorithms in the training stage. However, the labeled samples are randomly selected during the testing stage, so they may not provide effective supervision signals, leading to misclassification. To address this issue, we propose a Label-guided Distance Scaling (LDS) strategy. The core of our method is exploiting label semantics as supervision signals in both the training and testing stages. Specifically, in the training stage, we design a label-guided loss to inject label semantic information, pulling closer the sample representations and corresponding label representations. In the testing stage, we propose a Label-guided Scaler which scales sample representations with label semantics to provide additional supervision signals. Thus, even if labeled sample representations are far from class centers, our Label-guided Scaler pulls them closer to their class centers, thereby mitigating the misclassification. We combine two common meta-learners to verify the effectiveness of the method. Extensive experimental results demonstrate that our approach significantly outperforms state-of-the-art models. All datasets and codes are available at https://anonymous.4open.science/r/Label-guided-Text-Classification.
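A minimal sketch of the testing-stage idea: pull a sample representation toward the most similar label embedding, so the label semantics act as an extra supervision signal. The convex-shift form and the `alpha` parameter are assumptions for illustration, not the paper's exact Label-guided Scaler:

```python
import numpy as np

def label_guided_pull(sample, label_embs, alpha=0.5):
    """Shift a sample representation toward its most cosine-similar
    label representation (hypothetical scaler, alpha assumed)."""
    s = sample / np.linalg.norm(sample)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    nearest = int((L @ s).argmax())        # most similar label
    target = label_embs[nearest]
    # Convex shift toward the selected label center.
    return sample + alpha * (target - sample), nearest

labels = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy label embeddings
z = np.array([0.9, 0.2])                      # toy sample representation
z_scaled, cls = label_guided_pull(z, labels)
```

The effect the paper describes follows directly: even a sample that starts far from its class center ends up closer to it after the shift.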
[452] PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis
Jeet Bandhu Lahiri, Parshva Runwal, Arvasu Kulkarni, Mahir Jain, Aditya Ray Mishra, Siddharth Panwar, Sandeep Singh
Main category: cs.LG
TL;DR: PRISM is a masked autoencoder EEG foundation model that shows diverse pretraining populations produce more adaptable representations than narrow-source pretraining, with significant performance gains on challenging clinical tasks like epilepsy diagnosis.
Details
Motivation: Current EEG foundation models are pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, making it unclear whether representations encode neural physiology or recording-distribution artifacts. The authors aim to understand the impact of pretraining population diversity on model adaptability and performance.
Method: PRISM is a masked autoencoder ablated along two axes: pretraining population and downstream adaptation, with fixed architecture and preprocessing. The authors compare narrow-source EU/US corpora against geographically diverse pools augmented with multi-center South Asian clinical recordings across multiple EEG systems.
Result: Three key findings: 1) Narrow-source pretraining yields stronger linear probes on distribution-matched benchmarks, while diverse pretraining produces more adaptable representations under fine-tuning. 2) On distinguishing epilepsy from diagnostic mimickers via interictal EEG, the diverse checkpoint outperforms narrow-source by +12.3 pp balanced accuracy. 3) Systematic inconsistencies between EEG-Bench and EEG-FM-Bench reverse model rankings on identical datasets by up to 24 pp.
Conclusion: Targeted diversity can substitute for indiscriminate scale in EEG foundation models, and dataset count is a confounding variable in model comparison. Diverse pretraining populations lead to more adaptable representations with superior performance on challenging clinical tasks.
Abstract: EEG foundation models are typically pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, leaving unclear whether representations encode neural physiology or recording-distribution artifacts. We introduce PRISM (Population Representative Invariant Signal Model), a masked autoencoder ablated along two axes – pretraining population and downstream adaptation – with architecture and preprocessing fixed. We compare a narrow-source EU/US corpus (TUH + PhysioNet) against a geographically diverse pool augmented with multi-center South Asian clinical recordings across multiple EEG systems. Three findings emerge. First, narrow-source pretraining yields stronger linear probes on distribution-matched benchmarks, while diverse pretraining produces more adaptable representations under fine-tuning – a trade-off invisible under single-protocol evaluation. Trained on three source corpora, PRISM matches or outperforms REVE (92 datasets, 60,000+ hours) on the majority of tasks, demonstrating that targeted diversity can substitute for indiscriminate scale and that dataset count is a confounding variable in model comparison. Second, on a clinically challenging and previously untested task – distinguishing epilepsy from diagnostic mimickers via interictal EEG – the diverse checkpoint outperforms the narrow-source checkpoint by +12.3 pp balanced accuracy, the largest gap across all evaluations. Third, systematic inconsistencies between EEG-Bench and EEG-FM-Bench reverse model rankings on identical datasets by up to 24 pp; we identify six concrete sources including split construction, checkpoint selection, segment length, and normalization, showing these factors compound non-additively.
[453] Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer’s Network
Binon Teji, Subhajit Bandyopadhyay, Swarup Roy
Main category: cs.LG
TL;DR: NETRA is a multimodal graph transformer framework that uses attention mechanisms to prioritize disease-associated genes, outperforming traditional centrality measures for Alzheimer’s disease analysis.
Details
Motivation: Traditional network-based approaches for prioritizing disease-associated genes rely on static centrality measures and fail to capture cross-modal biological heterogeneity, limiting their effectiveness for complex disorders like Alzheimer's disease.
Method: Constructs gene regulatory networks from multiple data types (microarray, single-cell RNA-seq, single-nucleus RNA-seq), trains BERT-based models on random-walk sequences for gene embeddings, compresses expression profiles with variational autoencoders, integrates with auxiliary biological networks, and uses a graph transformer with attention mechanisms to assign relevance scores.
Result: Achieves normalized enrichment score of ~3.9 for Alzheimer’s disease pathway, substantially outperforming classical centrality measures and diffusion models. Top-ranked genes enrich neurodegenerative pathways, recover known AD susceptibility loci, and reveal conserved cross-disease gene modules.
Conclusion: NETRA provides a powerful, extensible framework for disease gene prioritization that captures multimodal biological heterogeneity and outperforms traditional approaches, with applications beyond Alzheimer’s disease to other complex disorders.
Abstract: Prioritizing disease-associated genes is central to understanding the molecular mechanisms of complex disorders such as Alzheimer’s disease (AD). Traditional network-based approaches rely on static centrality measures and often fail to capture cross-modal biological heterogeneity. We propose NETRA (Node Evaluation through Transformer-based Representation and Attention), a multimodal graph transformer framework that replaces heuristic centrality metrics with attention-driven relevance scoring. Using AD as a case study, gene regulatory networks are independently constructed from microarray, single-cell RNA-seq, and single-nucleus RNA-seq data. Random-walk sequences derived from these networks are used to train a BERT-based model for learning global gene embeddings, while modality-specific gene expression profiles are compressed using variational autoencoders. These representations are integrated with auxiliary biological networks, including protein-protein interactions, Gene Ontology semantic similarity, and diffusion-based gene similarity, into a unified multimodal graph. A graph transformer assigns NETRA scores that quantify gene relevance in a disease-specific and context-aware manner. Gene set enrichment analysis shows that NETRA achieves a normalized enrichment score of about 3.9 for the Alzheimer’s disease pathway, substantially outperforming classical centrality measures and diffusion models. Top-ranked genes enrich multiple neurodegenerative pathways, recover a known late-onset AD susceptibility locus at chr12q13, and reveal conserved cross-disease gene modules. The framework preserves biologically realistic heavy-tailed network topology and is readily extensible to other complex disorders.
[454] A Comparative Study of UMAP and Other Dimensionality Reduction Methods
Guanzhe Zhang, Shanshan Ding, Zhezhen Jin
Main category: cs.LG
TL;DR: Supervised UMAP performs well for classification but has limitations for regression tasks in incorporating response information effectively.
Details
Motivation: To systematically evaluate supervised UMAP for both regression and classification tasks, as supervised extensions of UMAP for regression settings remain underexplored despite UMAP's popularity for dimensionality reduction.
Method: Comprehensive comparative analysis of UMAP, supervised UMAP, and competing dimensionality reduction methods (PCA, Kernel PCA, SIR, Kernel SIR, t-SNE) using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings.
Result: Supervised UMAP performs well for classification tasks but exhibits limitations in effectively incorporating response information for regression problems.
Conclusion: Supervised UMAP shows promise for classification but needs further development for regression applications, highlighting an important direction for future research.
Abstract: Uniform Manifold Approximation and Projection (UMAP) is a widely used manifold learning technique for dimensionality reduction. This paper studies UMAP, supervised UMAP, and several competing dimensionality reduction methods, including Principal Component Analysis (PCA), Kernel PCA, Sliced Inverse Regression (SIR), Kernel SIR, and t-distributed Stochastic Neighbor Embedding, through a comprehensive comparative analysis. Although UMAP has attracted substantial attention for preserving local and global structures, its supervised extensions, particularly for regression settings, remain rather underexplored. We provide a systematic evaluation of supervised UMAP for both regression and classification using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings. Our results show that supervised UMAP performs well for classification but exhibits limitations in effectively incorporating response information for regression, highlighting an important direction for future development.
[455] Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
Jinge Ma, Fengqing Zhu
Main category: cs.LG
TL;DR: Temporal-Adjusted Loss (TAL) addresses catastrophic forgetting in Class-Incremental Learning by tackling temporal imbalance in supervision, using a decay kernel to reweight negative supervision dynamically.
Details
Motivation: The paper identifies that existing CIL methods overlook temporal imbalance as a key cause of prediction bias toward new classes, where earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall.
Method: Proposes Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct supervision strength vectors and dynamically reweights negative supervision in cross-entropy loss, theoretically degenerating to standard cross-entropy under balanced conditions.
Result: Extensive experiments show TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, demonstrating the importance of temporal modeling for stable long-term learning.
Conclusion: Temporal imbalance is a crucial factor in CIL’s catastrophic forgetting, and TAL effectively addresses this by temporal modeling of supervision strength, offering a principled approach to stable incremental learning.
Abstract: With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor – temporal imbalance – as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.
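The abstract does not spell out TAL's exact kernel or where the reweighting enters, so the idea can only be sketched. Below is one illustrative weighted-softmax form of a temporally reweighted cross-entropy; the exponential kernel, the `age` bookkeeping, and the function name are assumptions for exposition, not the authors' implementation.

```python
import math

def temporal_adjusted_ce(logits, target, age, lam=0.1):
    """Sketch of a temporally reweighted cross-entropy (hypothetical form).

    age[c] is how long ago class c was introduced (in tasks or steps).
    Each negative class is down-weighted inside the softmax denominator
    by a decay kernel w_c = exp(-lam * age[c]), so long-past classes
    receive weaker negative supervision. With lam = 0 every w_c = 1 and
    the loss reduces to standard cross-entropy, mirroring TAL's stated
    degeneration under balanced conditions.
    """
    m = max(logits)  # subtract max for numerical stability
    num = math.exp(logits[target] - m)
    den = num + sum(
        math.exp(-lam * age[c]) * math.exp(logits[c] - m)
        for c in range(len(logits)) if c != target
    )
    return -math.log(num / den)
```

With `lam=0.0` the value matches plain cross-entropy exactly; increasing `lam` shrinks the negative terms for older classes and hence the loss.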
[456] Quantum-Inspired Fine-Tuning for Few-Shot AIGC Detection via Phase-Structured Reparameterization
Kaiyang Xing, Han Fang, Zhaoyun Chen, Zhonghui Li, Yang Yang, Weiming Zhang, Guoping Guo
Main category: cs.LG
TL;DR: Q-LoRA integrates quantum neural networks into LoRA adapters for improved few-shot AIGC detection, with a classical variant H-LoRA achieving similar performance at lower cost.
Details
Motivation: To extend quantum neural networks' few-shot generalization advantages to large-scale tasks by integrating them into parameter-efficient fine-tuning frameworks like LoRA, specifically for AI-generated content detection.
Method: Proposes Q-LoRA, which integrates lightweight QNNs into LoRA adapters, and H-LoRA, a classical variant that applies the Hilbert transform within LoRA to retain similar phase structure and constraints without quantum simulation overhead.
Result: Both Q-LoRA and H-LoRA outperform standard LoRA by over 5% accuracy in few-shot AIGC detection, with H-LoRA achieving comparable accuracy at significantly lower computational cost.
Conclusion: Quantum-inspired inductive biases (phase-aware representations and norm-constrained transformations) can enhance few-shot learning in parameter-efficient fine-tuning, with classical approximations providing cost-effective alternatives.
Abstract: Recent studies show that quantum neural networks (QNNs) generalize well in few-shot regimes. To extend this advantage to large-scale tasks, we propose Q-LoRA, a quantum-enhanced fine-tuning scheme that integrates lightweight QNNs into the low-rank adaptation (LoRA) adapter. Applied to AI-generated content (AIGC) detection, Q-LoRA consistently outperforms standard LoRA under few-shot settings. We analyze the source of this improvement and identify two possible structural inductive biases from QNNs: (i) phase-aware representations, which encode richer information across orthogonal amplitude-phase components, and (ii) norm-constrained transformations, which stabilize optimization via inherent orthogonality. However, Q-LoRA incurs non-trivial overhead due to quantum simulation. Motivated by our analysis, we further introduce H-LoRA, a fully classical variant that applies the Hilbert transform within the LoRA adapter to retain similar phase structure and constraints. Experiments on few-shot AIGC detection show that both Q-LoRA and H-LoRA outperform standard LoRA by over 5% accuracy, with H-LoRA achieving comparable accuracy at significantly lower cost in this task.
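For reference, the phase structure H-LoRA taps comes from the standard analytic-signal construction (textbook signal processing, not the paper's adapter equations):

```latex
\mathcal{H}[x](t) \;=\; \frac{1}{\pi}\,\mathrm{p.v.}\!\int_{-\infty}^{\infty} \frac{x(\tau)}{t-\tau}\,d\tau,
\qquad
x_a(t) \;=\; x(t) + i\,\mathcal{H}[x](t) \;=\; A(t)\,e^{i\phi(t)},
```

where $A(t)=|x_a(t)|$ and $\phi(t)=\arg x_a(t)$ give the orthogonal amplitude-phase components the abstract refers to; the Hilbert transform is also an $\ell_2$-isometry, which connects to the norm-constrained inductive bias.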
[457] The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks
Zice Wang
Main category: cs.LG
TL;DR: Networks under label noise develop a “Malignant Tail” where signal and noise become geometrically segregated: signal in low-rank subspaces and stochastic noise in high-frequency orthogonal components, requiring explicit spectral truncation for robust generalization.
Details
Motivation: To understand the geometric mechanism behind the phase transition from benign to harmful overfitting under increasing noise-to-signal ratios, particularly how networks handle stochastic label noise versus systematic noise.
Method: Using Spectral Linear Probe to analyze training dynamics, showing SGD fails to suppress noise but biases it to high-frequency orthogonal subspaces. Proposes Explicit Spectral Truncation (d ≪ D) to surgically prune noise-dominated subspaces post-hoc.
Result: Demonstrates that excess spectral capacity enables noise memorization, and geometric truncation can recover optimal generalization capability latent in converged models, providing stable post-hoc intervention unlike temporal early stopping.
Conclusion: Under label noise, excess spectral capacity is a structural liability that necessitates explicit rank constraints to filter stochastic corruptions for robust generalization, with geometric truncation offering a stable solution.
Abstract: While implicit regularization facilitates benign overfitting in low-noise regimes, recent theoretical work predicts a sharp phase transition to harmful overfitting as the noise-to-signal ratio increases. We experimentally isolate the geometric mechanism of this transition: the Malignant Tail, a failure mode where networks functionally segregate signal and noise, reducing coherent semantic features into low-rank subspaces while pushing stochastic label noise into high-frequency orthogonal components, distinct from systematic or corruption-aligned noise. Through a Spectral Linear Probe of training dynamics, we demonstrate that Stochastic Gradient Descent (SGD) fails to suppress this noise, instead implicitly biasing it toward high-frequency orthogonal subspaces, effectively preserving signal-noise separability. We show that this geometric separation is distinct from simple variance reduction in untrained models. In trained networks, SGD actively segregates noise, allowing post-hoc Explicit Spectral Truncation (d ≪ D) to surgically prune the noise-dominated subspace. This approach recovers the optimal generalization capability latent in the converged model. Unlike unstable temporal early stopping, Geometric Truncation provides a stable post-hoc intervention. Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability that allows for noise memorization, necessitating explicit rank constraints to filter stochastic corruptions for robust generalization.
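A generic way to write a rank-$d$ spectral truncation (whether it is applied to features or weights is left to the paper; this is just the standard SVD form):

```latex
W \;=\; \sum_{i=1}^{D} \sigma_i\, u_i v_i^{\top}
\;\approx\;
W_d \;=\; \sum_{i=1}^{d} \sigma_i\, u_i v_i^{\top},
\qquad d \ll D,
```

keeping the leading singular directions, where the low-rank signal concentrates, and discarding the high-frequency tail where, per the abstract, the segregated label noise lives.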
[458] Preconditioned Score and Flow Matching
Shadab Ahamed, Eshed Gal, Simon Ghyselincks, Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber
Main category: cs.LG
TL;DR: The paper analyzes how covariance structure of intermediate distributions affects optimization bias in flow matching and diffusion models, proposing preconditioning maps to improve conditioning and avoid suboptimal plateaus.
Details
Motivation: The geometry of intermediate distributions in flow matching and score-based diffusion models strongly affects optimization. When the covariance of these distributions is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning plateaus at suboptimal weights.
Method: The authors formalize this optimization bias effect in analytically tractable settings and propose reversible, label-conditional preconditioning maps that reshape the geometry of intermediate distributions by improving the conditioning of their covariance matrices without altering the underlying generative model.
Result: Across MNIST latent flow matching and additional high-resolution datasets, preconditioning consistently yields better-trained models by avoiding suboptimal plateaus. The method primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions rather than just accelerating early convergence.
Conclusion: The conditioning of intermediate distribution covariances governs optimization bias in flow matching and diffusion models, and reversible preconditioning maps provide an effective solution to avoid suboptimal learning plateaus by improving geometric conditioning.
Abstract: Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $\Sigma_t$ of $p_t$ governs optimization bias: when $\Sigma_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional \emph{preconditioning} maps that reshape the geometry of $p_t$ by improving the conditioning of $\Sigma_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
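The conditioning argument can be summarized with the standard transformation rule for covariances (a generic whitening view; the paper's maps are additionally reversible and label-conditional):

```latex
y = A x \;\Rightarrow\; \Sigma_y = A\,\Sigma_t\,A^{\top},
\qquad
A \approx \Sigma_t^{-1/2} \;\Rightarrow\; \kappa(\Sigma_y) \approx 1,
```

so a well-chosen map equalizes variance across directions, and gradient descent no longer starves the low-variance modes that cause the plateaus.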
[459] Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris
Haochuan Kevin Wang
Main category: cs.LG
TL;DR: Diffusion-MPC for Tetris using discrete denoising with feasibility masking and reranking strategies, analyzing sampling constraints, critic integration, and compute scaling in combinatorial domains.
Details
Motivation: To study diffusion-based model predictive control in discrete combinatorial domains, using Tetris as a case study, to understand the structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
Method: Uses a MaskGIT-style discrete denoiser to sample candidate placement sequences, with feasibility-constrained sampling via logit masking over valid placements. Implements reranking strategies using heuristic scores, pretrained DQN critic, and hybrid combinations. Analyzes compute scaling in candidate count (K) and planning horizon (H).
Result: Feasibility masking is necessary in discrete domains, removing 46% invalid action mass and yielding 6.8% score improvement and 5.6% survival improvement. DQN reranking shows systematic misalignment with rollout quality (mean regret 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse/delayed rewards. Compute choices determine failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch.
Conclusion: The study highlights structural challenges of diffusion planners in discrete environments and provides practical diagnostics for critic integration, showing that feasibility constraints, critic alignment, and compute scaling are critical factors in discrete combinatorial domains.
Abstract: We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
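Logit masking over valid placements is the standard constrained-sampling trick; a minimal generic sketch (the function name and interface are illustrative, not the paper's code):

```python
import math

def masked_softmax(logits, valid):
    """Feasibility-constrained sampling distribution.

    Invalid actions get logit -inf before normalization, so they carry
    exactly zero probability mass and the remaining probabilities are
    renormalized over the valid set. Assumes at least one valid action.
    """
    masked = [z if ok else float("-inf") for z, ok in zip(logits, valid)]
    m = max(masked)                      # finite if any action is valid
    exps = [math.exp(z - m) for z in masked]  # exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]
```

Sampling from this distribution (rather than the raw softmax) is what removes the invalid action mass the abstract quantifies at 46%.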
[460] Learning graph topology from metapopulation epidemic encoder-decoder
Xin Li, Jonathan Cohen, Shai Pilosof, Rami Puzis
Main category: cs.LG
TL;DR: Deep learning encoder-decoder architectures for joint inference of epidemic parameters and mobility networks from time-series data, outperforming state-of-the-art topology inference methods.
Details
Motivation: Metapopulation epidemic models require both epidemic parameters and mobility networks, but joint inference is challenging with limited epidemic tracing data. Existing methods typically estimate one while assuming the other, creating a persistent gap in disease propagation modeling.
Method: Two encoder-decoder deep learning architectures that infer metapopulation mobility graphs from time-series data, with and without the assumption of epidemic model parameters. The approach is evaluated across diverse random and empirical mobility networks.
Result: The proposed approach outperforms state-of-the-art topology inference methods. Topology inference improves dramatically with data on additional pathogens, establishing a robust framework for simultaneous inference of epidemic parameters and topology.
Conclusion: The study provides a solution to the previously unsolved problem of joint inference of epidemic parameters and mobility networks, addressing a critical gap in modeling disease propagation through metapopulation frameworks.
Abstract: Metapopulation epidemic models are a valuable tool for studying large-scale outbreaks. With the limited availability of epidemic tracing data, it is challenging to infer the essential constituents of these models, namely, the epidemic parameters and the relevant mobility network between subpopulations. Either one of these constituents can be estimated while assuming the other; however, the problem of their joint inference has not yet been solved. Here, we propose two encoder-decoder deep learning architectures that infer metapopulation mobility graphs from time-series data, with and without the assumption of epidemic model parameters. Evaluation across diverse random and empirical mobility networks shows that the proposed approach outperforms state-of-the-art topology inference methods. Further, we show that topology inference improves dramatically with data on additional pathogens. Our study establishes a robust framework for simultaneously inferring epidemic parameters and topology, addressing a persistent gap in modeling disease propagation.
[461] Learning Optimal Search Strategies
Stefan Ankirchner, Maximilian Philipp Thiel
Main category: cs.LG
TL;DR: The paper proposes an algorithm for learning optimal stopping thresholds in parking problems with unknown inhomogeneous Poisson processes, achieving logarithmic regret with minimax optimality.
Details
Motivation: The paper addresses the challenge of learning optimal search strategies in sequential decision problems where opportunities arrive according to an unknown inhomogeneous Poisson process, using parking as a motivating example.
Method: The authors propose an algorithm that learns the optimal threshold-type stopping rule by estimating the integrated jump intensity, avoiding the need to estimate the full intensity function.
Result: The algorithm achieves logarithmic regret growth uniformly over a broad class of environments, and the authors prove a logarithmic minimax regret lower bound, establishing the growth optimality of their approach.
Conclusion: The proposed method provides an efficient way to learn optimal stopping thresholds in problems with unknown arrival processes, with provable optimal regret guarantees.
Abstract: We explore the question of how to learn an optimal search strategy within the example of a parking problem where parking opportunities arrive according to an unknown inhomogeneous Poisson process. The optimal policy is a threshold-type stopping rule characterized by an indifference position. We propose an algorithm that learns this threshold by estimating the integrated jump intensity rather than the intensity function itself. We show that our algorithm achieves a logarithmic regret growth, uniformly over a broad class of environments. Moreover, we prove a logarithmic minimax regret lower bound, establishing the growth optimality of the proposed approach.
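A stylized illustration of why the integrated intensity suffices: the expected number of remaining opportunities after position $t$ is $\Lambda(t)=\int_t^T \lambda(s)\,ds$, which can be estimated by simple counting across observed episodes. The function names, the count-averaging estimator, and the threshold value below are illustrative assumptions, not the paper's algorithm.

```python
def estimate_integrated_intensity(episodes, t, horizon):
    """Estimate Lambda(t) = E[# arrivals in (t, horizon]] by averaging
    arrival counts over observed episodes (each a list of arrival times).
    For a Poisson process this count is an unbiased estimator of the
    integrated intensity, with no need to recover lambda(s) itself."""
    counts = [sum(1 for s in ep if t < s <= horizon) for ep in episodes]
    return sum(counts) / len(episodes)

def accept_opportunity(episodes, t, horizon, threshold=1.0):
    """Stylized threshold-type stopping rule: take the current
    opportunity once the expected number of remaining ones drops
    below `threshold` (an indifference point)."""
    return estimate_integrated_intensity(episodes, t, horizon) < threshold
```

Early in the horizon many opportunities remain and the rule keeps searching; near the end the estimate falls below the threshold and the rule accepts.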
[462] Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu
Main category: cs.LG
TL;DR: RigidSSL is a geometric pretraining framework for protein design that learns rigidity-aware representations from structural perturbations and molecular dynamics to improve generative modeling of protein structures.
Details
Motivation: Current protein design methods have three limitations: inability to jointly learn geometry and design tasks, reliance on local non-rigid representations limiting global geometric understanding, and ineffective modeling of dynamic conformational information.
Method: Two-phase geometric pretraining: Phase I (RigidSSL-Perturb) learns geometric priors from 432K AlphaFold structures with simulated perturbations; Phase II (RigidSSL-MD) refines representations on 1.3K molecular dynamics trajectories. Uses bi-directional rigidity-aware flow matching to optimize translational and rotational dynamics.
Result: RigidSSL variants improve designability by up to 43%, enhance novelty and diversity in unconditional generation, improve zero-shot motif scaffolding success rate by 5.8%, and capture more biophysically realistic conformational ensembles in GPCR modeling.
Conclusion: RigidSSL effectively addresses geometric learning limitations in protein design through rigidity-aware self-supervised pretraining, enabling better generative modeling of protein structures and dynamics.
Abstract: Generative models have recently advanced $\textit{de novo}$ protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce $\textbf{RigidSSL}$ ($\textit{Rigidity-Aware Self-Supervised Learning}$), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.
[463] Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation
Leo Wang, Pengkun Yang, Lili Su
Main category: cs.LG
TL;DR: Personalized multi-agent TD learning with shared linear representation where agents’ optimal weights lie in unknown subspace, using cooperative single-timescale approach to filter conflicting signals and achieve linear speedup.
Details
Motivation: To address personalized multi-agent reinforcement learning where agents interact with different environments but share common structure, inspired by personalized federated learning to leverage shared representations while handling heterogeneity.
Method: Cooperative single-timescale TD learning where agents jointly estimate the common subspace and local heads, decomposing learning to filter out conflicting signals and mitigate negative impacts of misaligned information.
Result: The approach achieves linear speedup by effectively handling heterogeneity and Markovian sampling, with experiments showing benefits of learning via shared structure for general control problems.
Conclusion: Learning shared representations in multi-agent TD learning can filter conflicting signals and improve convergence, with analytical techniques providing insights for leveraging common structures in heterogeneous settings.
Abstract: We study personalized multi-agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents’ optimal weights collectively lie in an unknown linear subspace. Inspired by the recent success of personalized federated learning (PFL), we study the convergence of cooperative single-timescale TD learning in which agents iteratively estimate the common subspace and local heads. We show that this decomposition can filter out conflicting signals, effectively mitigating the negative impacts of “misaligned” signals, and achieving linear speedup. The main technical challenges lie in the heterogeneity, the Markovian sampling, and their intricate interplay in shaping error evolutions. Specifically, not only are the error dynamics of multiple variables closely interconnected, but there is also no direct contraction for the principal angle distance between the optimal subspace and the estimated subspace. We hope our analytical techniques can inspire deeper exploration into leveraging common structures. Experiments are provided to show the benefits of learning via a shared structure in the more general control problem.
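The shared-representation structure can be written compactly (a generic low-rank personalization form; the symbols are illustrative, not the paper's notation):

```latex
V_i(s) \;\approx\; \phi(s)^{\top} w_i,
\qquad
w_i \;=\; B\,\theta_i,\quad
B \in \mathbb{R}^{D \times k},\;\; \theta_i \in \mathbb{R}^{k},\;\; k \ll D,
```

with the columns of $B$ spanning the unknown common subspace, estimated jointly by all agents, and $\theta_i$ the local head each agent fits to its own environment.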
[464] Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence
Shiyuan Zhang, Qiwei Di, Xuheng Li, Quanquan Gu
Main category: cs.LG
TL;DR: First dimension-free KL divergence bounds for discretized underdamped Langevin dynamics, improving convergence analysis for high-dimensional sampling.
Details
Motivation: Existing convergence guarantees for discretized underdamped Langevin dynamics (ULD) scale polynomially with ambient dimension d, leading to vacuous bounds in high dimensions. While dimension-free results exist for Wasserstein-2 distance, dimension-independent guarantees for ULD discretizations in KL divergence remained open.
Method: Refines the KL local error framework to a dimension-free setting, yielding bounds that depend on tr(H) where H upper bounds the Hessian of V, rather than on dimension d.
Result: Proves first dimension-free KL divergence bounds for discretized ULD, obtaining improved iteration complexity for underdamped Langevin Monte Carlo relative to overdamped methods when tr(H) ≪ d.
Conclusion: Closes the gap in dimension-independent guarantees for ULD discretizations in KL divergence, providing improved theoretical understanding of underdamped Langevin dynamics in high-dimensional settings.
Abstract: Underdamped Langevin dynamics (ULD) is a widely-used sampler for Gibbs distributions $\pi \propto e^{-V}$, and is often empirically effective in high dimensions. However, existing non-asymptotic convergence guarantees for discretized ULD typically scale polynomially with the ambient dimension $d$, leading to vacuous bounds when $d$ is large. The main known dimension-free result concerns the randomized midpoint discretization in Wasserstein-2 distance (Liu et al., 2023), while dimension-independent guarantees for ULD discretizations in KL divergence have remained open. We close this gap by proving the first dimension-free KL divergence bounds for discretized ULD. Our analysis refines the KL local error framework (Altschuler et al., 2025) to a dimension-free setting and yields bounds that depend on $\mathrm{tr}(\mathbf{H})$, where $\mathbf{H}$ upper bounds the Hessian of $V$, rather than on $d$.
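For orientation, underdamped Langevin dynamics augments the state with a velocity variable (standard form, up to friction and mass conventions):

```latex
dX_t = V_t\,dt,
\qquad
dV_t = -\gamma V_t\,dt - \nabla V(X_t)\,dt + \sqrt{2\gamma}\,dB_t,
```

whose stationary law has $X$-marginal $\pi \propto e^{-V}$; discretizing this SDE gives underdamped Langevin Monte Carlo, and it is the discretization error that the dimension-free $\mathrm{tr}(\mathbf{H})$ bound controls.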
[465] A Unified Revisit of Temperature in Classification-Based Knowledge Distillation
Logan Frank, Jim Davis
Main category: cs.LG
TL;DR: Systematic study of temperature selection in knowledge distillation reveals its dependencies on training components like optimizer and teacher pretraining, providing practical guidance for practitioners.
Details
Motivation: Temperature selection in knowledge distillation is poorly understood and typically done via grid search or copying from prior work, which is inefficient and may lead to suboptimal performance when training setups differ. There's a need to understand how temperature interacts with other training components.
Method: The authors conduct a unified systematic study examining interactions between temperature and various training components (optimizer, teacher pretraining/finetuning, etc.). They analyze cross-connections to identify situations that significantly impact temperature selection.
Result: The study identifies common situations that have a pronounced impact on temperature selection, providing valuable practical guidance for practitioners using knowledge distillation.
Conclusion: Temperature in knowledge distillation is closely linked to training components, and systematic understanding of these interactions can improve temperature selection and overall distillation performance.
Abstract: A central idea of knowledge distillation is to expose relational structure embedded in the teacher’s weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding on how to select an appropriate temperature value, or how this value depends on other training elements such as optimizer, teacher pretraining/finetuning, etc. In practice, temperature is commonly chosen via grid search or by adopting values from prior work, which can be time-consuming or may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. From analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing valuable guidance for practitioners employing knowledge distillation in their work.
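The temperature mechanism at the heart of this study is compact enough to sketch. Below is a minimal, self-contained illustration of the standard temperature-scaled distillation loss (our sketch, not the authors' code); the T^2 scaling factor is one concrete reason temperature interacts with optimizer settings such as the learning rate:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits, T=4.0):
    """Hinton-style distillation loss: KL(teacher_T || student_T) * T^2.

    The T^2 factor keeps gradient magnitudes roughly comparable across
    temperature settings, coupling the choice of T to the learning rate.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T**2)

# Higher T flattens the teacher distribution, exposing relational
# structure in the relative ordering of non-target classes.
teacher = [5.0, 2.0, 1.0]
sharp = softmax(teacher, T=1.0)
soft = softmax(teacher, T=4.0)
```

Raising T flattens the teacher's distribution, so more of the loss signal comes from the relative ordering of non-target classes rather than the top-1 label.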
[466] Using the SEKF to Transfer NN Models of Dynamical Systems with Limited Data
Joshua E. Hammond, Tyler A. Soderstrom, Brian A. Korgel, Michael Baldea
Main category: cs.LG
TL;DR: SEKF enables efficient adaptation of pre-trained neural network models to new dynamical systems with minimal data (as little as 1% of original training data) by fine-tuning only a subset of parameters.
Details
Motivation: Data-driven models for dynamical systems require extensive training data, but gathering sufficient data is often infeasible due to cost or safety concerns in practical applications.
Method: Uses Subset Extended Kalman Filter (SEKF) to adapt pre-trained neural network models to new, similar systems by fine-tuning only a subset of parameters with limited available data.
Result: Experimental validation across damped spring and continuous stirred-tank reactor systems shows that small parameter perturbations capture target system dynamics with as little as 1% of original training data, reducing computational cost and generalization error.
Conclusion: SEKF provides an efficient approach for adapting pre-trained models to new systems with limited data, addressing practical constraints in data acquisition for dynamical system modeling.
Abstract: Data-driven models of dynamical systems require extensive amounts of training data. For many practical applications, gathering sufficient data is not feasible due to cost or safety concerns. This work uses the Subset Extended Kalman Filter (SEKF) to adapt pre-trained neural network models to new, similar systems with limited data available. Experimental validation across damped spring and continuous stirred-tank reactor systems demonstrates that small parameter perturbations to the initial model capture target system dynamics while requiring as little as 1% of original training data. In addition, finetuning requires less computational cost and reduces generalization error.
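As a rough illustration of the idea (a toy linear model with a hand-picked parameter subset; the real method applies this to neural network weights), an EKF measurement update restricted to a subset behaves like recursive least squares on just those parameters:

```python
import numpy as np

def sekf_update(theta, P, H, y_obs, y_pred, R):
    """One EKF measurement update restricted to a parameter subset.

    theta : current values of the fine-tuned parameter subset
    P     : covariance over that subset
    H     : Jacobian of the model output w.r.t. the subset (1 x n here)
    """
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    theta = theta + (K * (y_obs - y_pred)).ravel()
    P = (np.eye(len(theta)) - K @ H) @ P
    return theta, P

# Toy "pre-trained" model y = w . x; adapt only w[0:2] of 4 weights to a
# shifted target system, mimicking limited-data transfer.
rng = np.random.default_rng(0)
w = np.array([1.0, -1.0, 0.5, 2.0])                 # pre-trained weights
w_target = w + np.array([0.3, -0.2, 0.0, 0.0])      # target differs only in the subset
subset = [0, 1]
theta = w[subset].copy()
P = np.eye(2)
R = np.array([[1e-4]])
for _ in range(50):
    x = rng.normal(size=4)
    y_obs = w_target @ x                # noiseless target measurement
    w_cur = w.copy(); w_cur[subset] = theta
    y_pred = w_cur @ x
    H = x[subset].reshape(1, 2)         # Jacobian w.r.t. the subset
    theta, P = sekf_update(theta, P, H, y_obs, y_pred, R)
```

Because only the subset is updated, each step is a small perturbation of the pre-trained model, matching the paper's observation that small parameter changes suffice to capture the target dynamics.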
[467] Spectral Regularization for Diffusion Models
Satish Chandran, Nicolas Roque dos Santos, Yunshu Wu, Greg Ver Steeg, Evangelos Papalexakis
Main category: cs.LG
TL;DR: Spectral regularization framework for diffusion models using Fourier- and wavelet-domain losses to improve frequency balance and multi-scale structure in generated samples.
Details
Motivation: Standard diffusion models use pointwise reconstruction objectives that ignore the spectral and multi-scale structure of natural signals, leading to suboptimal frequency balance and coherence in generated samples.
Method: Augments standard diffusion training with differentiable Fourier- and wavelet-domain losses as soft inductive biases, without modifying diffusion process, architecture, or sampling. Compatible with DDPM, DDIM, and EDM formulations.
Result: Consistent improvements in sample quality across image and audio generation tasks, with largest gains on higher-resolution unconditional datasets where fine-scale structure is most challenging.
Conclusion: Spectral regularization provides an effective way to improve diffusion model performance by incorporating frequency-domain priors, with minimal computational overhead and broad compatibility.
Abstract: Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi-scale structure of natural signals. We propose a loss-level spectral regularization framework that augments standard diffusion training with differentiable Fourier- and wavelet-domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi-scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher-resolution, unconditional datasets where fine-scale structure is most challenging to model.
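A minimal sketch of what a Fourier-domain regularizer of this kind can look like; the high-frequency weighting and the mixing weight `lam` are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

def fourier_loss(x, y, hf_weight=2.0):
    """Frequency-domain penalty between a generated signal x and target y.

    Magnitude-spectrum L2 distance, with high frequencies up-weighted so
    fine-scale structure is not washed out by the pointwise loss.
    hf_weight is an illustrative choice.
    """
    X, Y = np.fft.rfft(x), np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(x))              # 0 .. 0.5 cycles/sample
    w = 1.0 + hf_weight * freqs / freqs.max()    # emphasize high bands
    return float(np.mean(w * np.abs(X - Y) ** 2))

def diffusion_loss(x, y, lam=0.1):
    """Standard pointwise reconstruction plus the spectral term as a
    soft inductive bias; lam is a tunable mixing weight."""
    mse = float(np.mean((np.asarray(x) - np.asarray(y)) ** 2))
    return mse + lam * fourier_loss(x, y)

t = np.linspace(0, 1, 128, endpoint=False)
clean = np.sin(2 * np.pi * 3 * t)
blurred = np.convolve(clean, np.ones(5) / 5, mode="same")  # lost detail
```

Because the regularizer only touches the loss, the same pattern drops into DDPM-, DDIM-, or EDM-style training loops without changing the sampler.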
[468] Incremental Graph Construction Enables Robust Spectral Clustering of Texts
Marko Pranjić, Boshko Koloski, Nada Lavrač, Senja Pollak, Marko Robnik-Šikonja
Main category: cs.LG
TL;DR: Incremental k-NN graph construction method for spectral clustering that guarantees connectivity by linking each new node to its k nearest previously inserted nodes, addressing fragmentation issues in standard k-NN graphs.
Details
Motivation: Standard k-NN graphs in spectral clustering often contain disconnected components at practical sparsity levels (small k), making clustering degenerate and sensitive to hyperparameters, especially with text embeddings.
Method: Proposes incremental k-NN graph construction where each new node is linked to its k nearest previously inserted nodes, guaranteeing connected graphs for any k. Provides inductive proof of connectedness and supports incremental updates for new documents.
Result: Validated on SentenceTransformer embeddings across six clustering datasets from Massive Text Embedding Benchmark. Outperforms standard k-NN in low-k regime where disconnected components are prevalent, and matches standard k-NN at larger k.
Conclusion: Simple incremental k-NN graph construction method effectively addresses connectivity issues in spectral clustering of text embeddings, improving robustness in low-sparsity regimes while maintaining performance at higher sparsity levels.
Abstract: Neighborhood graphs are a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method outperforms in the low-$k$ regime where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.
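The construction is simple enough to state in a few lines. The sketch below (our illustration, not the authors' implementation) builds the incremental graph and verifies the connectivity guarantee on two well-separated clusters, exactly the regime where a standard k-NN graph with small k fragments:

```python
import numpy as np

def incremental_knn_graph(X, k):
    """Build a k-NN graph by inserting points one at a time.

    Node i (for i >= 1) is linked to its min(i, k) nearest *previously
    inserted* nodes, so every node has a path back to node 0 and the
    graph is connected for any k >= 1.
    """
    X = np.asarray(X, dtype=float)
    edges = set()
    for i in range(1, len(X)):
        d = np.linalg.norm(X[:i] - X[i], axis=1)
        for j in np.argsort(d)[: min(i, k)]:
            edges.add((int(j), i))
    return edges

def is_connected(n, edges):
    """Check connectivity with a union-find over the edge set."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)}) == 1

# Two well-separated clusters: a standard k-NN graph with small k would
# split them, but the incremental construction stays connected.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(10, 0.1, (20, 2))])
G = incremental_knn_graph(X, k=2)
```

The inductive argument is visible in the loop: node i always gains at least one edge to an earlier node, so connectivity holds after every insertion, which is also what makes incremental updates for newly arriving documents cheap.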
[469] Manifold Aware Denoising Score Matching (MAD)
Alona Levy-Jurgenson, Alvaro Prat, James Cuin, Yee Whye Teh
Main category: cs.LG
TL;DR: Proposes a modified denoising score-matching approach that decomposes the score function into known and learnable components to implicitly account for manifold structure while maintaining computational efficiency.
Details
Motivation: Existing methods for learning distributions on manifolds often require implicitly learning the manifold structure, which can be computationally intensive. The authors aim to reduce this burden while maintaining efficiency by leveraging known analytical components.
Method: Proposes a decomposition of the score function into a known component $s^{base}$ (which implicitly includes manifold information) and a remainder component $s-s^{base}$ (the learning target). Derives analytical forms for $s^{base}$ for specific cases like rotation matrices and discrete distributions.
Result: Demonstrates the utility of the approach for distributions over rotation matrices and discrete distributions, showing that it can effectively handle manifold-structured data while maintaining computational efficiency.
Conclusion: The proposed decomposition provides a computationally efficient way to learn distributions on manifolds by leveraging known analytical components to implicitly account for manifold structure, reducing the burden on the learning algorithm.
Abstract: A major focus in designing methods for learning distributions defined on manifolds is to alleviate the need to implicitly learn the manifold so that learning can concentrate on the data distribution within the manifold. However, accomplishing this often leads to compute-intensive solutions. In this work, we propose a simple modification to denoising score-matching in the ambient space to implicitly account for the manifold, thereby reducing the burden of learning the manifold while maintaining computational efficiency. Specifically, we propose a simple decomposition of the score function into a known component $s^{base}$ and a remainder component $s-s^{base}$ (the learning target), with the former implicitly including information on where the data manifold resides. We derive known components $s^{base}$ in analytical form for several important cases, including distributions over rotation matrices and discrete distributions, and use them to demonstrate the utility of this approach in those cases.
[470] Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?
Semih Cantürk, Thomas Sabourin, Frederik Wenkel, Michael Perlmutter, Guy Wolf
Main category: cs.LG
TL;DR: A neural combinatorial optimization framework using expressive message passing (GCON) with energy-based loss achieves SOTA performance on individual tasks and shows effective transfer learning between graph CO problems through pretraining informed by computational reducibility theory.
Details
Motivation: The paper addresses the challenge of efficient generalization in neural combinatorial optimization solvers, aiming to develop unified models that can transfer knowledge between different CO tasks and handle new tasks not seen during training.
Method: Proposes a GCON module for expressive message passing with energy-based unsupervised loss functions. Uses computational reducibility theory to design pretraining and fine-tuning strategies for transfer learning between graph CO problems (MVC, MIS, MaxClique, MaxCut, MDS, graph coloring).
Result: Achieves high performance comparable to SOTA on individual CO tasks. Shows effective transfer learning between related problems (MVC, MIS, MaxClique) and in multi-task settings. Pretraining on all but one task leads to faster convergence and avoids negative transfer.
Conclusion: Learning common representations across multiple graph CO problems is viable through expressive message passing and pretraining strategies informed by polynomial reduction literature, representing progress toward foundational models for neural combinatorial optimization.
Abstract: A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models between a given set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energy-based unsupervised loss functions. This model achieves high performance (often comparable with state-of-the-art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine-tuning strategies that transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi-task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave-one-out, multi-task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine-tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO. We provide an open-source implementation of our work at https://github.com/semihcanturk/COPT-MT .
[471] What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty
Aran Nayebi
Main category: cs.LG
TL;DR: The paper proves quantitative theorems showing that low average-case regret on structured action-conditioned prediction tasks forces agents to implement predictive, structured internal states like belief states or world models.
Details
Motivation: To determine what internal structure is necessary for artificial agents to act competently under uncertainty, addressing whether belief states or world models are required (not just implementable) for optimal control.
Method: Proves “selection theorems” that reduce predictive modeling to binary betting decisions, showing regret bounds limit probability mass on suboptimal bets, enforcing predictive distinctions needed for separating high-margin outcomes.
Result: In fully observed settings, yields approximate recovery of interventional transition kernel; under partial observability, implies necessity of belief-like memory and predictive state, addressing open questions in prior world-model recovery work.
Conclusion: Low average-case regret on structured action-conditioned prediction tasks forces agents to implement predictive, structured internal states, establishing necessity (not just sufficiency) of belief states or world models for competent action under uncertainty.
Abstract: As artificial agents become increasingly capable, what internal structure is necessary for an agent to act competently under uncertainty? Classical results show that optimal control can be implemented using belief states or world models, but not that such representations are required. We prove quantitative “selection theorems” showing that low average-case regret on structured families of action-conditioned prediction tasks forces an agent to implement a predictive, structured internal state. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary “betting” decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of belief-like memory and predictive state, addressing an open question in prior world-model recovery work.
[472] ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution
Liu Yang, Zeyu Nie, Andrew Liu, Felix Zou, Deniz Altinbüken, Amir Yazdanbakhsh, Quanquan C. Liu
Main category: cs.LG
TL;DR: ParEVO is a framework that synthesizes high-performance parallel algorithms for irregular data structures using evolutionary coding agents and specialized LLM fine-tuning.
Details
Motivation: The steep learning curve of concurrent programming for irregular data structures (sparse graphs, unbalanced trees, non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current LLMs often fail catastrophically on these tasks, generating code with race conditions, deadlocks, and sub-optimal scaling.
Method: Three main contributions: (1) Parlay-Instruct Corpus - 13,820 tasks synthesized via “Critic-Refine” pipeline filtering for empirically performant algorithms using Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align with ParlayLib library semantics; (3) Evolutionary Coding Agent (ECA) that iteratively repairs code using feedback from compilers, dynamic race detectors, and performance profilers.
Result: Achieves average 106x speedup (max 1103x) across ParEval benchmark, and 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Evolutionary approach matches expert human baselines with up to 4.1x speedup on specific highly-irregular kernels.
Conclusion: ParEVO successfully bridges the gap in parallel algorithm synthesis for irregular data structures, demonstrating significant performance improvements over existing LLM approaches and competitive results with human experts.
Abstract: The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling. We bridge this gap with ParEVO, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a “Critic-Refine” pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the “last mile” of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers. On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Furthermore, our evolutionary approach matches state-of-the-art expert human baselines, achieving up to a 4.1x speedup on specific highly-irregular kernels. Source code and datasets are available at https://github.com/WildAlg/ParEVO.
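The ECA's "last mile" repair loop follows a generic pattern: check a candidate with external tools, feed the diagnostics back, and mutate. A stripped-down sketch with hypothetical `check`/`mutate` stand-ins (in the real system these are the compiler, dynamic race detector, and profiler):

```python
def evolutionary_repair(candidate, check, mutate, max_iters=20):
    """Generic repair loop: keep mutating a candidate using checker
    feedback until it passes, or give up after max_iters attempts."""
    for _ in range(max_iters):
        ok, feedback = check(candidate)
        if ok:
            return candidate
        candidate = mutate(candidate, feedback)
    return None

# Toy instance: "repair" an integer parameter until a validator passes.
def check(n):
    if n == 7:
        return True, None
    return False, "too low" if n < 7 else "too high"

def mutate(n, feedback):
    # Use the checker's diagnostic to move toward a passing candidate.
    return n + 1 if feedback == "too low" else n - 1

fixed = evolutionary_repair(0, check, mutate)
```

The value of the loop comes entirely from the richness of the feedback channel: a race-detector trace pinpoints what to mutate in a way a pass/fail signal cannot.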
[473] Understanding and Mitigating Dataset Corruption in LLM Steering
Cullen Anderson, Narmeen Oozeer, Foad Namjoo, Remy Ogasawara, Amirali Abdullah, Jeff M. Phillips
Main category: cs.LG
TL;DR: Contrastive steering for LLM behavior adjustment is robust to moderate data corruption but vulnerable to malicious adversarial corruption; robust mean estimators can mitigate unwanted effects.
Details
Motivation: Contrastive steering is widely used for AI safety applications to adjust LLM behavior, but its robustness to noisy or adversarial data corruption in the steering examples dataset is poorly understood, raising safety concerns.
Method: The study analyzes contrastive steering robustness by corrupting the dataset of examples used to train steering directions. It examines geometry of various corruption types and tests replacing the standard mean computation step with a robust mean estimator to mitigate malicious effects.
Result: Contrastive steering shows robustness to moderate corruption but exhibits clear malicious side effects when non-trivial fractions of training data are altered. Robust mean estimators effectively mitigate most unwanted effects from malicious corruption.
Conclusion: While contrastive steering is generally robust, it has vulnerabilities to adversarial data corruption that can be addressed through robust statistical methods like robust mean estimation, improving safety for AI applications.
Abstract: Contrastive steering has been shown as a simple and effective method to adjust the generative behavior of LLMs at inference time. It uses examples of prompt responses with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations in this 1-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to train the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously manifested when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption, and identify some safeguards. Notably, a key step in learning the steering direction involves high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.
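The vulnerability and the safeguard are easy to see in miniature. The sketch below uses a coordinate-wise median as a simple stand-in for the robust mean estimator (the paper's estimator may differ): a 10% adversarial corruption swings the naive steering direction, but barely moves the robust one:

```python
import numpy as np

def steering_direction(pos_acts, neg_acts, robust=False):
    """Contrastive steering direction: mean(positive) - mean(negative).

    With robust=True the mean is replaced by a coordinate-wise median,
    an illustrative stand-in for a robust mean estimator.
    """
    agg = np.median if robust else np.mean
    return agg(np.asarray(pos_acts), axis=0) - agg(np.asarray(neg_acts), axis=0)

rng = np.random.default_rng(0)
d = 8
true_dir = np.zeros(d); true_dir[0] = 1.0
pos = rng.normal(0, 0.05, (100, d)) + true_dir      # trait-positive activations
neg = rng.normal(0, 0.05, (100, d))                 # trait-negative activations
# An adversary corrupts 10% of the positive examples with a large
# payload along a different coordinate.
pos_corrupt = pos.copy()
pos_corrupt[:10] += 50.0 * np.eye(d)[1]

v_clean = steering_direction(pos, neg)
v_naive = steering_direction(pos_corrupt, neg)
v_robust = steering_direction(pos_corrupt, neg, robust=True)
```

Because the mean is linear, a small corrupted fraction with large magnitude shifts the direction arbitrarily; the median's bounded influence is what restores the clean direction here.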
[474] Thermodynamic Regulation of Finite-Time Gibbs Training in Energy-Based Models: A Restricted Boltzmann Machine Study
Görkem Can Süleymanoğlu
Main category: cs.LG
TL;DR: Introduces a thermodynamic regulation framework for RBMs where temperature evolves as a dynamical state variable to prevent training instabilities caused by fixed-temperature Gibbs sampling during finite-time training.
Details
Motivation: Fixed-temperature Gibbs sampling in RBMs assumes stochastic regime validity during learning, but this can become structurally fragile under finite-time training dynamics, leading to issues like Gibbs sampler freezing, negative phase localization, and parameter drift.
Method: Endogenous thermodynamic regulation framework where temperature evolves as a dynamical state variable coupled to measurable sampling statistics. Uses two-time-scale separation regime with standard local Lipschitz conditions and strictly positive L2 regularization.
Result: Proves global parameter boundedness, local exponential stability of thermodynamic subsystem, and mitigation of inverse-temperature blow-up and freezing-induced degeneracy. Experiments on MNIST show improved normalization stability and effective sample size while preserving reconstruction performance.
Conclusion: Reinterprets RBM training as a controlled non-equilibrium dynamical process rather than a static equilibrium approximation, addressing fundamental training instabilities through thermodynamic regulation.
Abstract: Restricted Boltzmann Machines (RBMs) are typically trained using finite-length Gibbs chains under a fixed sampling temperature. This practice implicitly assumes that the stochastic regime remains valid as the energy landscape evolves during learning. We argue that this assumption can become structurally fragile under finite-time training dynamics. This fragility arises because, in nonconvex energy-based models, fixed-temperature finite-time training can generate admissible trajectories with effective-field amplification and conductance collapse. As a result, the Gibbs sampler may asymptotically freeze, the negative phase may localize, and, without sufficiently strong regularization, parameters may exhibit deterministic linear drift. To address this instability, we introduce an endogenous thermodynamic regulation framework in which temperature evolves as a dynamical state variable coupled to measurable sampling statistics. Under standard local Lipschitz conditions and a two-time-scale separation regime, we establish global parameter boundedness under strictly positive L2 regularization. We further prove local exponential stability of the thermodynamic subsystem and show that the regulated regime mitigates inverse-temperature blow-up and freezing-induced degeneracy within a forward-invariant neighborhood. Experiments on MNIST demonstrate that the proposed self-regulated RBM substantially improves normalization stability and effective sample size relative to fixed-temperature baselines, while preserving reconstruction performance. Overall, the results reinterpret RBM training as a controlled non-equilibrium dynamical process rather than a static equilibrium approximation.
[475] Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics
Kwanyoung Kim
Main category: cs.LG
TL;DR: The paper proposes Geometry Aware Attention Guidance (GAG), a plug-and-play method that improves diffusion model generation quality by modeling attention dynamics as fixed-point iterations in Modern Hopfield Networks and applying Anderson Acceleration with geometric decomposition.
Details
Motivation: Classifier-Free Guidance (CFG) improves diffusion model quality but has high inference costs and limited applicability to distilled/single-step models. Attention-space extrapolation methods are computationally efficient but lack theoretical foundations.
Method: Model attention dynamics as fixed-point iterations within Modern Hopfield Networks, show attention-space extrapolation is a special case of Anderson Acceleration, and propose GAG that decomposes attention updates into parallel/orthogonal components relative to guidance direction for stabilization.
Result: GAG significantly improves generation quality, seamlessly integrates with existing frameworks as a plug-and-play method, and provides theoretical grounding for attention-space guidance methods.
Conclusion: The paper establishes a theoretical foundation for attention-space extrapolation in diffusion models and proposes GAG as an efficient, stable guidance method that enhances generation quality while maintaining computational efficiency.
Abstract: Classifier-Free Guidance (CFG) has significantly enhanced the generative quality of diffusion models by extrapolating between conditional and unconditional outputs. However, its high inference cost and limited applicability to distilled or single-step models have shifted research focus toward attention-space extrapolation. While these methods offer computational efficiency, their theoretical underpinnings remain elusive. In this work, we establish a foundational framework for attention-space extrapolation by modeling attention dynamics as fixed-point iterations within Modern Hopfield Networks. We demonstrate that the extrapolation effect in attention space constitutes a special case of Anderson Acceleration applied to these dynamics. Building on this insight and the weak contraction property, we propose Geometry Aware Attention Guidance (GAG). By decomposing attention updates into parallel and orthogonal components relative to the guidance direction, GAG stabilizes the acceleration process and maximizes guidance efficiency. Our plug-and-play method seamlessly integrates with existing frameworks while significantly improving generation quality.
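The geometric decomposition itself is elementary. A minimal sketch (the gain values are illustrative, not from the paper): the update is projected onto the guidance direction, and the parallel and orthogonal parts are rescaled separately:

```python
import numpy as np

def geometry_aware_update(v, g, alpha_par=1.5, alpha_orth=0.5):
    """Split an update v into components parallel and orthogonal to a
    guidance direction g, then rescale each component separately.

    Separate gains let the parallel part be amplified for guidance
    while the orthogonal part is damped for stability, in the spirit
    of the GAG decomposition (gain values here are illustrative).
    """
    v, g = np.asarray(v, float), np.asarray(g, float)
    par = (v @ g) / (g @ g) * g   # projection onto the guidance direction
    orth = v - par                # remainder, orthogonal to g
    return alpha_par * par + alpha_orth * orth

g = np.array([1.0, 0.0])          # guidance direction
v = np.array([2.0, 3.0])          # raw attention-space update
out = geometry_aware_update(v, g)
```

With both gains at 1 the update is unchanged, so the decomposition is a strict generalization of plain extrapolation; the asymmetry between the two gains is what stabilizes the accelerated iteration.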
[476] EdgeFLow: Serverless Federated Learning via Sequential Model Migration in Edge Networks
Yuchen Shi, Qijun Hou, Pingyi Fan, Khaled B. Letaief
Main category: cs.LG
TL;DR: EdgeFLow: A federated learning framework that replaces cloud servers with sequential model migration between edge base stations to reduce communication bottlenecks in IoT systems.
Details
Motivation: Federated Learning (FL) faces significant communication bottlenecks due to client-server data exchanges and long-distance transmissions in IoT environments, which limits scalability and efficiency.
Method: EdgeFLow redesigns FL system topology by replacing traditional cloud servers with sequential model migration between edge base stations, conducting model aggregation and propagation exclusively at edge clusters to eliminate cloud-based transmissions.
Result: EdgeFLow achieves comparable accuracy improvements while significantly reducing communication costs, with rigorous convergence analysis provided for non-convex objectives and non-IID data distributions.
Conclusion: EdgeFLow establishes a foundational framework for communication-efficient FL in IoT and edge-network learning systems through systemic architectural innovation.
Abstract: Federated Learning (FL) has emerged as a transformative distributed learning paradigm in the era of Internet of Things (IoT), reconceptualizing data processing methodologies. However, FL systems face significant communication bottlenecks due to inevitable client-server data exchanges and long-distance transmissions. This work presents EdgeFLow, an innovative FL framework that redesigns the system topology by replacing traditional cloud servers with sequential model migration between edge base stations. By conducting model aggregation and propagation exclusively at edge clusters, EdgeFLow eliminates cloud-based transmissions and substantially reduces global communication overhead. We provide rigorous convergence analysis for EdgeFLow under non-convex objectives and non-IID data distributions, extending classical FL convergence theory. Experimental results across various configurations validate the theoretical analysis, demonstrating that EdgeFLow achieves comparable accuracy improvements while significantly reducing communication costs. As a systemic architectural innovation for communication-efficient FL, EdgeFLow establishes a foundational framework for future developments in IoT and edge-network learning systems.
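A toy simulation of sequential model migration (our sketch with a linear model; the paper's aggregation rule may differ): the model visits each edge cluster in turn and trains on local non-IID data, with no cloud aggregation step:

```python
import numpy as np

def edge_round(w, clusters, lr=0.01, local_steps=5):
    """One migration round: the model visits each edge cluster in turn.

    Each cluster runs a few local gradient steps on its own (non-IID)
    data, then hands the model to the next cluster; no cloud server
    ever sees the updates.
    """
    for X, y in clusters:
        for _ in range(local_steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clusters = []
for shift in (0.0, 3.0, -3.0):        # non-IID feature distributions
    X = rng.normal(shift, 1.0, (50, 2))
    clusters.append((X, X @ w_true))  # noiseless labels from a shared model

w = np.zeros(2)
for _ in range(60):
    w = edge_round(w, clusters)
```

Only the model (two floats here) crosses cluster boundaries each round, which is the source of the communication savings relative to repeated client-to-cloud uploads.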
[477] Wasserstein Proximal Policy Gradient
Zhaoyu Zhu, Shuhan Zhang, Rui Gao, Shuang Li
Main category: cs.LG
TL;DR: Wasserstein Proximal Policy Gradient (WPPG) for continuous-action RL using Wasserstein geometry, avoiding policy density evaluation and applicable to implicit stochastic policies.
Details
Motivation: To develop policy gradient methods for continuous-action reinforcement learning that avoid evaluating policy log densities or gradients, making them applicable to expressive implicit stochastic policies specified as pushforward maps.
Method: Derived from Wasserstein proximal update via operator-splitting scheme alternating optimal transport updates with Gaussian convolution heat steps. Uses Wasserstein geometry to handle continuous actions with entropy regularization.
Result: Established global linear convergence rate for both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically simple to implement and achieves competitive performance on standard continuous-control benchmarks.
Conclusion: WPPG provides a theoretically grounded, practical method for continuous-action RL that avoids policy density evaluation and works with implicit stochastic policies while maintaining competitive performance.
Abstract: We study policy gradient methods for continuous-action, entropy-regularized reinforcement learning through the lens of Wasserstein geometry. Starting from a Wasserstein proximal update, we derive Wasserstein Proximal Policy Gradient (WPPG) via an operator-splitting scheme that alternates an optimal transport update with a heat step implemented by Gaussian convolution. This formulation avoids evaluating the policy’s log density or its gradient, making the method directly applicable to expressive implicit stochastic policies specified as pushforward maps. We establish a global linear convergence rate for WPPG, covering both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, WPPG is simple to implement and attains competitive performance on standard continuous-control benchmarks.
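The operator splitting can be illustrated with a particle analogue: for a small step size, the Wasserstein proximal (optimal transport) update moves each action sample along the critic gradient, and the heat step is literally a Gaussian convolution, i.e. adding Gaussian noise. The toy critic below and the step sizes are our own assumptions, not the paper's setup; the point is that no policy density is ever evaluated.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_Q(a):
    # Gradient of a toy concave critic Q(a) = -(a - 2)^2, peaked at a = 2.
    return -2.0 * (a - 2.0)

def wppg_step(particles, step=0.05, temp=0.1):
    # Operator splitting on a particle cloud:
    #  (1) transport step: move each action particle along the critic gradient
    #      (the small-step limit of the Wasserstein proximal update);
    #  (2) heat step: entropy regularization via Gaussian convolution,
    #      implemented as additive Gaussian noise.
    particles = particles + step * grad_Q(particles)
    particles = particles + np.sqrt(2 * step * temp) * rng.normal(size=particles.shape)
    return particles

actions = rng.normal(size=5000)   # implicit policy: just a cloud of samples
for _ in range(300):
    actions = wppg_step(actions)
```

After many iterations the particle cloud concentrates around the critic's maximizer, with spread controlled by the temperature of the heat step.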
[478] Towards Parameter-Free Temporal Difference Learning
Yunxiang Li, Mark Schmidt, Reza Babanezhad, Sharan Vaswani
Main category: cs.LG
TL;DR: TD learning with exponential step-size schedule achieves optimal convergence without requiring problem-dependent parameters like minimum eigenvalue of feature covariance or mixing time.
Details
Motivation: Existing finite-time analyses of TD learning require setting algorithm parameters using problem-dependent quantities that are difficult to estimate in practice, creating a gap between theory and practice.
Method: Uses exponential step-size schedule with standard TD(0) algorithm, analyzed under i.i.d. sampling and Markovian sampling. For Markovian setting, proposes regularized TD(0) with exponential step-size schedule.
Result: In i.i.d. setting, achieves optimal bias-variance trade-off for last iterate without knowledge of problem-dependent quantities. In Markovian setting, achieves comparable convergence rate without requiring projections, iterate averaging, or knowledge of mixing time or minimum eigenvalue.
Conclusion: Exponential step-size schedule enables practical TD learning algorithms that achieve strong theoretical guarantees without requiring impractical parameter knowledge or algorithm modifications.
Abstract: Temporal difference (TD) learning is a fundamental algorithm for estimating value functions in reinforcement learning. Recent finite-time analyses of TD with linear function approximation quantify its theoretical convergence rate. However, they often require setting the algorithm parameters using problem-dependent quantities that are difficult to estimate in practice – such as the minimum eigenvalue of the feature covariance ($\omega$) or the mixing time of the underlying Markov chain ($\tau_{\text{mix}}$). In addition, some analyses rely on nonstandard and impractical modifications, exacerbating the gap between theory and practice. To address these limitations, we use an exponential step-size schedule with the standard TD(0) algorithm. We analyze the resulting method under two sampling regimes: independent and identically distributed (i.i.d.) sampling from the stationary distribution, and the more practical Markovian sampling along a single trajectory. In the i.i.d. setting, the proposed algorithm does not require knowledge of problem-dependent quantities such as $\omega$, and attains the optimal bias-variance trade-off for the last iterate. In the Markovian setting, we propose a regularized TD(0) algorithm with an exponential step-size schedule. The resulting algorithm achieves a comparable convergence rate to prior works, without requiring projections, iterate averaging, or knowledge of $\tau_{\text{mix}}$ or $\omega$.
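The exponential step-size schedule is easy to state in code: $\alpha_t = \alpha_0 \rho^t$, with no mixing time or eigenvalue needed up front. Below is a minimal tabular TD(0) sketch on a hypothetical two-state Markov reward process of our own construction (the paper works with linear function approximation; the schedule is the point here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov reward process (a toy of our own, not from the paper).
P = np.array([[0.9, 0.1], [0.1, 0.9]])
r = np.array([1.0, 0.0])
gamma = 0.9
V_true = np.linalg.solve(np.eye(2) - gamma * P, r)  # exact Bellman solution

# Tabular TD(0) with an exponential step-size schedule alpha_t = alpha0 * rho**t.
# The appeal of the schedule: no problem-dependent constants (mixing time,
# minimum eigenvalue) need to be known to set it.
alpha0, rho, T = 0.5, 0.9995, 20000
V = np.zeros(2)
s = 0
for t in range(T):
    s_next = rng.choice(2, p=P[s])                # single Markovian trajectory
    td_error = r[s] + gamma * V[s_next] - V[s]
    V[s] += alpha0 * rho**t * td_error
    s = s_next
```

Early large steps shrink the bias quickly; the geometrically decaying tail damps the variance, so the last iterate lands near the true values.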
[479] Joint Optimization of Model Partitioning and Resource Allocation for Anti-Jamming Collaborative Inference Systems
Mengru Wu, Jiawei Li, Jiaqi Wei, Bin Lyu, Kai-Kit Wong, Hyundong Shin
Main category: cs.LG
TL;DR: Anti-jamming collaborative inference system for DNN partitioning between devices and edge servers with joint optimization of resource allocation, power control, and partitioning to maximize delay-accuracy revenue.
Details
Motivation: DNN inference on resource-constrained devices requires device-edge collaboration, but intermediate feature transmission is vulnerable to malicious jamming that degrades inference performance. Need to counter jamming threats in collaborative inference systems.
Method: Analyze jamming and DNN partitioning effects via data regression, then formulate optimization problem to maximize revenue of delay and accuracy (RDA) under constraints. Propose alternating optimization algorithm decomposing into three subproblems solved via KKT conditions, convex optimization, and quantum genetic algorithm.
Result: Extensive simulations demonstrate proposed scheme outperforms baselines in terms of RDA (revenue of delay and accuracy).
Conclusion: The proposed anti-jamming collaborative inference system with joint optimization effectively counters jamming threats and improves system performance in device-edge DNN inference scenarios.
Abstract: With the increasing computational demands of deep neural network (DNN) inference on resource-constrained devices, DNN partitioning-based device-edge collaborative inference has emerged as a promising paradigm. However, the transmission of intermediate feature data is vulnerable to malicious jamming, which significantly degrades the overall inference performance. To counter this threat, this letter focuses on an anti-jamming collaborative inference system in the presence of a malicious jammer. In this system, a DNN model is partitioned into two distinct segments, which are executed by wireless devices and edge servers, respectively. We first analyze the effects of jamming and DNN partitioning on inference accuracy via data regression. Based on this, our objective is to maximize the system’s revenue of delay and accuracy (RDA) under inference accuracy and computing resource constraints by jointly optimizing computation resource allocation, devices’ transmit power, and DNN partitioning. To address the mixed-integer nonlinear programming problem, we propose an efficient alternating optimization-based algorithm, which decomposes the problem into three subproblems that are solved via Karush-Kuhn-Tucker conditions, convex optimization methods, and a quantum genetic algorithm, respectively. Extensive simulations demonstrate that our proposed scheme outperforms baselines in terms of RDA.
[480] Heterogeneous Agent Collaborative Reinforcement Learning
Zhixia Zhang, Zixuan Huang, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban
Main category: cs.LG
TL;DR: HACRL is a new collaborative reinforcement learning paradigm where heterogeneous agents share verified rollouts during training to improve each other while executing independently at inference, with HACPO algorithm providing principled rollout sharing and theoretical guarantees.
Details
Motivation: Addresses inefficiencies of isolated on-policy optimization by enabling collaborative learning among heterogeneous agents without requiring coordinated deployment at inference time, overcoming limitations of existing approaches like LLM-based MARL and one-directional distillation methods.
Method: Proposes HACRL paradigm with HACPO algorithm featuring four tailored mechanisms: 1) rollout verification and sharing, 2) capability discrepancy mitigation, 3) policy distribution shift handling, and 4) theoretical guarantees for unbiased advantage estimation and optimization correctness.
Result: HACPO consistently improves all participating agents across diverse heterogeneous model combinations and reasoning benchmarks, outperforming GSPO by average 3.3% while using only half the rollout cost.
Conclusion: HACRL enables efficient collaborative optimization for heterogeneous agents through principled rollout sharing, achieving better performance with reduced computational cost while maintaining independent execution capability.
Abstract: We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
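The core mechanics of rollout sharing under distribution shift can be sketched with an importance-weighted surrogate. This is our own simplified reading, not HACPO's actual objective: we assume verified rollouts are reused with a clipped ratio $\pi_{\text{self}}/\pi_{\text{other}}$, which is the standard way to keep cross-agent estimates unbiased.

```python
import numpy as np

rng = np.random.default_rng(8)

def shared_surrogate_loss(logp_self, logp_other, advantages, verified, clip=5.0):
    # Reuse rollouts generated by a *different* agent: keep only those that
    # passed verification, and importance-weight each sample by
    # pi_self(a|s) / pi_other(a|s) to correct for the policy-distribution
    # shift between heterogeneous agents. Clipping the ratio is our own
    # stabilizing assumption.
    ratio = np.clip(np.exp(logp_self - logp_other), 0.0, clip)
    weights = verified.astype(float) * ratio
    # REINFORCE-style surrogate: differentiating through logp_self yields the
    # importance-corrected policy gradient.
    return -(weights * advantages * logp_self).mean()

logp_self = rng.normal(size=6) - 1.0    # log pi_self(a|s) on shared samples
logp_other = rng.normal(size=6) - 1.0   # log pi_other(a|s), the generating agent
adv = rng.normal(size=6)
verified = np.array([1, 1, 0, 1, 0, 1]) # rollout verification outcomes
loss = shared_surrogate_loss(logp_self, logp_other, adv, verified)
```

Unverified rollouts are masked out entirely, so they contribute nothing to the update.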
[481] Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving
Tianze Zhu, Yinuo Wang, Wenjun Zou, Tianyi Zhang, Likun Wang, Letian Tao, Feihong Zhang, Yao Lyu, Shengbo Eben Li
Main category: cs.LG
TL;DR: DACER-F introduces flow matching into online RL for autonomous driving, enabling single-step inference diffusion policies with ultra-low latency while maintaining performance.
Details
Motivation: Generative policies in RL for autonomous driving show promise but suffer from high inference latency that prevents real-time deployment. Current diffusion-based methods require multiple inference steps, making them impractical for time-sensitive applications.
Method: Proposes DACER-F (Diffusion Actor-Critic with Entropy Regulator via Flow Matching) that introduces flow matching into online RL. Uses Langevin dynamics and Q-function gradients to dynamically optimize actions from experience replay toward a target distribution balancing high Q-value with exploration. The flow policy learns to map from a simple prior to this dynamic target in a single inference step.
Result: Outperforms baselines DACER and DSAC in complex multi-lane and intersection simulations while maintaining ultra-low inference latency. Achieves score of 775.8 in humanoid-stand task on DeepMind Control Suite, surpassing prior methods.
Conclusion: DACER-F establishes a high-performance, computationally efficient RL algorithm that addresses the inference latency bottleneck of diffusion policies while maintaining competitive performance in complex environments.
Abstract: Reinforcement learning (RL) is a fundamental methodology in autonomous driving systems, where generative policies exhibit considerable potential by leveraging their ability to model complex distributions to enhance exploration. However, their inherent high inference latency severely impedes their deployment in real-time decision-making and control. To address this issue, we propose diffusion actor-critic with entropy regulator via flow matching (DACER-F) by introducing flow matching into online RL, enabling the generation of competitive actions in a single inference step. By leveraging Langevin dynamics and gradients of the Q-function, DACER-F dynamically optimizes actions from experience replay toward a target distribution that balances high Q-value information with exploratory behavior. The flow policy is then trained to efficiently learn a mapping from a simple prior distribution to this dynamic target. In complex multi-lane and intersection simulations, DACER-F outperforms baselines diffusion actor-critic with entropy regulator (DACER) and distributional soft actor-critic (DSAC), while maintaining an ultra-low inference latency. DACER-F further demonstrates its scalability on standard RL benchmark DeepMind Control Suite (DMC), achieving a score of 775.8 in the humanoid-stand task and surpassing prior methods. Collectively, these results establish DACER-F as a high-performance and computationally efficient RL algorithm.
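The Langevin refinement at the heart of the target construction can be sketched as follows. The quadratic critic, step sizes, and temperature are our own toy assumptions; in DACER-F the refined actions would then serve as regression targets for the one-step flow policy.

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_Q(a):
    # Gradient of a stand-in critic Q(a) = -(a - 0.5)^2 (toy, not the paper's).
    return -2.0 * (a - 0.5)

def langevin_refine(actions, n_steps=50, eta=0.05, temp=0.05):
    # Langevin dynamics on the critic: replayed actions drift toward high-Q
    # regions while injected noise keeps the target distribution exploratory.
    for _ in range(n_steps):
        noise = rng.normal(size=actions.shape)
        actions = actions + eta * grad_Q(actions) + np.sqrt(2 * eta * temp) * noise
    return actions

replayed = rng.uniform(-1, 1, size=4000)   # actions drawn from a replay buffer
targets = langevin_refine(replayed)
```

The refined cloud concentrates near the critic's maximizer with temperature-controlled spread; training a flow model to map prior noise onto these targets is what enables single-step inference.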
[482] Same Error, Different Function: The Optimizer as an Implicit Prior in Financial Time Series
Federico Vittorio Cortesi, Giuseppe Iannone, Giulia Crippa, Tomaso Poggio, Pierfrancesco Beneventano
Main category: cs.LG
TL;DR: Different neural network architectures and optimizers achieve similar test loss on financial volatility forecasting but learn qualitatively different functions with material consequences for portfolio decisions.
Details
Motivation: Financial time series models operate in underspecified regimes where different models achieve indistinguishable out-of-sample error, raising questions about whether they learn the same underlying functions and whether this matters for real-world decisions.
Method: Large-scale volatility forecasting for S&P 500 stocks using various neural network architectures and training pipelines, analyzing how different optimizers reshape non-linear response profiles and temporal dependence despite similar test loss.
Result: Predictive accuracy remains unchanged across architectures, but optimizer choice significantly reshapes learned functions, creating a near-vertical Sharpe-turnover frontier with nearly 3× turnover dispersion at comparable Sharpe ratios.
Conclusion: In underspecified settings, optimization acts as a consequential source of inductive bias, so model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.
Abstract: Neural networks applied to financial time series operate in a regime of underspecification, where model predictors achieve indistinguishable out-of-sample error. Using large-scale volatility forecasting for S&P 500 stocks, we show that different model-training-pipeline pairs with identical test loss learn qualitatively different functions. Across architectures, predictive accuracy remains unchanged, yet optimizer choice reshapes non-linear response profiles and temporal dependence differently. These divergences have material consequences for decisions: volatility-ranked portfolios trace a near-vertical Sharpe-turnover frontier, with nearly $3\times$ turnover dispersion at comparable Sharpe ratios. We conclude that in underspecified settings, optimization acts as a consequential source of inductive bias; thus, model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.
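The "same error, different function" phenomenon is easy to reproduce in a minimal underdetermined regression, which we offer as an illustrative analogue (not the paper's experiment): with more parameters than samples, plain gradient descent and a hand-rolled Adam both drive training loss to near zero, yet land on visibly different interpolating solutions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 100                 # fewer samples than parameters: underspecified
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
loss = lambda w: np.mean((X @ w - y) ** 2)

def train(update, steps, lr):
    w, state = np.zeros(d), {}
    for t in range(steps):
        g = X.T @ (X @ w - y) / n
        w = update(w, g, lr, t, state)
    return w

def gd(w, g, lr, t, state):
    return w - lr * g          # plain gradient descent

def adam(w, g, lr, t, state, b1=0.9, b2=0.999, eps=1e-8):
    m = state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    v = state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
    mhat, vhat = m / (1 - b1 ** (t + 1)), v / (1 - b2 ** (t + 1))
    return w - lr * mhat / (np.sqrt(vhat) + eps)

w_gd = train(gd, steps=5000, lr=0.01)     # converges toward the min-norm interpolator
w_adam = train(adam, steps=20000, lr=0.001)
```

Both optimizers fit the training data, but the implicit prior of the update rule selects a different point from the solution set, which is exactly the mechanism the paper argues matters at the decision level.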
[483] Implicit Bias in Deep Linear Discriminant Analysis
Jiawen Li
Main category: cs.LG
TL;DR: Theoretical analysis of implicit regularization in Deep LDA objective for metric learning, showing how network architecture transforms additive gradient updates into multiplicative updates that conserve quasi-norms.
Details
Motivation: While implicit bias of standard loss functions has been studied, the optimization geometry induced by discriminative metric-learning objectives remains largely unexplored. This paper aims to provide initial theoretical analysis of implicit regularization in Deep LDA.
Method: Analyze gradient flow of Deep LDA loss on L-layer diagonal linear network. Study how network architecture transforms standard additive gradient updates into multiplicative weight updates under balanced initialization.
Result: Proves that under balanced initialization, the network architecture transforms additive gradient updates into multiplicative weight updates, demonstrating automatic conservation of the (2/L) quasi-norm.
Conclusion: Provides first theoretical analysis of implicit regularization in metric-learning objectives, specifically Deep LDA, revealing how network architecture induces specific optimization geometry and regularization properties.
Abstract: While the Implicit Bias (or Implicit Regularization) of standard loss functions has been studied, the optimization geometry induced by discriminative metric-learning objectives remains largely unexplored. To the best of our knowledge, this paper presents an initial theoretical analysis of the implicit regularization induced by Deep LDA, a scale-invariant objective designed to minimize intraclass variance and maximize interclass distance. By analyzing the gradient flow of the loss on an L-layer diagonal linear network, we prove that under balanced initialization, the network architecture transforms standard additive gradient updates into multiplicative weight updates, which demonstrates an automatic conservation of the $2/L$ quasi-norm.
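The additive-to-multiplicative mechanism can be seen numerically in a depth-2 diagonal linear network. We use a quadratic loss as a stand-in for the Deep LDA objective (our assumption; the paper's analysis is for the LDA loss itself): with balanced initialization the two factors stay equal, so the induced update on the effective weight $w = u \odot v$ is scaled by $w$ itself, and a coordinate can shrink toward zero but never cross it.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
X = rng.normal(size=(50, d))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0])   # toy regression target

# Depth-2 diagonal linear network: effective weight w = u * v (elementwise).
u = np.full(d, 0.1)          # balanced initialization: u == v
v = np.full(d, 0.1)
lr = 0.01
for _ in range(400):
    w = u * v
    g = X.T @ (X @ w - y) / len(y)          # gradient w.r.t. effective weight
    u, v = u - lr * g * v, v - lr * g * u   # chain rule through the two factors
```

Since the per-layer gradients are symmetric under balancedness, the update on $w$ is approximately $\Delta w \approx -2\,\mathrm{lr}\, g\, w$: multiplicative in $w$, which is the geometry behind the quasi-norm conservation the paper proves.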
[484] Post Hoc Extraction of Pareto Fronts for Continuous Control
Raghav Thakar, Gaurav Dixit, Kagan Tumer
Main category: cs.LG
TL;DR: MAPEX enables efficient Pareto frontier extraction from pre-trained specialist policies for multi-objective reinforcement learning without retraining costs.
Details
Motivation: Real-world agents need to balance multiple objectives, but existing MORL methods require full multi-objective training from scratch and cannot leverage pre-trained specialist policies, incurring high sample costs.
Method: MAPEX is an offline MORL method that constructs Pareto frontiers by reusing pre-trained specialist policies, critics, and replay buffers. It combines specialist critic evaluations into a mixed advantage signal and uses it to weight a behavior cloning loss for training new multi-objective policies.
Result: MAPEX produces comparable Pareto fronts at 0.001% the sample cost of established baselines on five multi-objective MuJoCo environments.
Conclusion: MAPEX provides an efficient post hoc Pareto front extraction method that preserves single-objective RL simplicity while enabling multi-objective trade-offs without retraining costs.
Abstract: Agents in the real world must often balance multiple objectives, such as speed, stability, and energy efficiency in continuous control. To account for changing conditions and preferences, an agent must ideally learn a Pareto frontier of policies representing multiple optimal trade-offs. Recent advances in multi-policy multi-objective reinforcement learning (MORL) enable learning a Pareto front directly, but require full multi-objective consideration from the start of training. In practice, multi-objective preferences often arise after a policy has already been trained on a single specialised objective. Existing MORL methods cannot leverage these pre-trained ‘specialists’ to learn Pareto fronts and avoid incurring the sample costs of retraining. We introduce Mixed Advantage Pareto Extraction (MAPEX), an offline MORL method that constructs a frontier of policies by reusing pre-trained specialist policies, critics, and replay buffers. MAPEX combines evaluations from specialist critics into a mixed advantage signal, and weights a behaviour cloning loss with it to train new policies that balance multiple objectives. MAPEX’s post hoc Pareto front extraction preserves the simplicity of single-objective off-policy RL, and avoids retrofitting these algorithms into complex MORL frameworks. We formally describe the MAPEX procedure and evaluate MAPEX on five multi-objective MuJoCo environments. Given the same starting policies, MAPEX produces comparable fronts at $0.001\%$ the sample cost of established baselines.
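The mixed-advantage weighting of the behaviour-cloning loss can be sketched in a few lines. The exponential-advantage weighting below (in the spirit of advantage-weighted regression) and the toy numbers are our own assumptions, not necessarily MAPEX's exact form.

```python
import numpy as np

rng = np.random.default_rng(5)

def mapex_weights(specialist_advantages, prefs, beta=1.0):
    # Mix per-objective advantages from the pre-trained specialists' critics
    # under a preference vector, then turn the mixed signal into a positive
    # behaviour-cloning weight (stabilized exponential weighting).
    mixed = specialist_advantages @ prefs         # (batch, K) @ (K,) -> (batch,)
    w = np.exp(beta * (mixed - mixed.max()))
    return w / w.mean()

# Replay transitions scored by K = 2 specialist critics (toy numbers).
A = rng.normal(size=(8, 2))
prefs = np.array([0.7, 0.3])                      # desired trade-off
w = mapex_weights(A, prefs)

# Weighted behaviour-cloning loss for the new multi-objective policy:
logp = rng.normal(size=8)                         # stand-in for log pi(a|s)
bc_loss = -(w * logp).mean()
```

Sweeping `prefs` over the simplex and retraining the weighted cloning objective is what traces out the extracted Pareto front, all from pre-collected buffers.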
[485] MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
Zhi Hong, Qian Zhang, Jiahang Sun, Zhiwei Shang, Mingze Kong, Xiangyi Wang, Yao Shu, Zhongxiang Dai
Main category: cs.LG
TL;DR: MASPOB is a bandit-based framework for optimizing prompts in Multi-Agent Systems, addressing sample efficiency, topology coupling, and combinatorial search challenges through UCB, GNNs, and coordinate ascent.
Details
Motivation: Real-world Multi-Agent Systems using LLMs as cognitive backbones face challenges in prompt optimization due to prohibitive evaluation costs, topology-induced coupling among prompts, and combinatorial explosion of the search space, requiring sample-efficient solutions.
Method: MASPOB uses bandit framework with Upper Confidence Bound for exploration-exploitation balance, integrates Graph Neural Networks to capture structural priors from agent topology, and employs coordinate ascent to decompose optimization into univariate sub-problems.
Result: Extensive experiments across diverse benchmarks show MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines in prompt optimization for Multi-Agent Systems.
Conclusion: MASPOB provides an effective, sample-efficient framework for prompt optimization in Multi-Agent Systems, successfully addressing key challenges through bandit algorithms, GNNs, and coordinate ascent decomposition.
Abstract: Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backbone of Multi-Agent Systems (MAS) to orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and its performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural approach to improve its performance. However, real-world prompt optimization for MAS is impeded by three key challenges: (1) the need of sample efficiency due to prohibitive evaluation costs, (2) topology-induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduce MASPOB (Multi-Agent System Prompt Optimization via Bandits), a novel sample-efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub-problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines.
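The UCB backbone of the search is standard and worth seeing concretely. Below is a minimal sketch for a single agent's prompt slot (one coordinate of the coordinate-ascent decomposition); the candidate prompts, scores, and exploration constant are hypothetical, and MASPOB additionally conditions on GNN-learned topology features, which we omit.

```python
import math, random

def ucb_select(stats, t, c=0.3):
    # stats: {prompt: (pulls, mean_score)}. Pick the candidate maximizing
    # mean + c * sqrt(log t / pulls); unevaluated prompts are tried first.
    best, best_val = None, -float("inf")
    for prompt, (n, mean) in stats.items():
        val = float("inf") if n == 0 else mean + c * math.sqrt(math.log(t) / n)
        if val > best_val:
            best, best_val = prompt, val
    return best

random.seed(0)
true_quality = {"prompt_A": 0.3, "prompt_B": 0.7, "prompt_C": 0.5}  # hidden
stats = {p: (0, 0.0) for p in true_quality}
for t in range(1, 301):
    p = ucb_select(stats, t)
    reward = true_quality[p] + random.gauss(0, 0.1)  # one costly MAS evaluation
    n, mean = stats[p]
    stats[p] = (n + 1, mean + (reward - mean) / (n + 1))
```

Within a fixed evaluation budget, the bonus term shifts pulls away from clearly inferior prompts toward the best one, which is exactly the sample-efficiency argument in the abstract.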
[486] Robust Heterogeneous Analog-Digital Computing for Mixture-of-Experts Models with Theoretical Generalization Guarantees
Mohammed Nowaz Rabbani Chowdhury, Hsinyu Tsai, Geoffrey W. Burr, Kaoutar El Maghraoui, Liu Liu, Meng Wang
Main category: cs.LG
TL;DR: A retraining-free heterogeneous computation framework that identifies noise-sensitive experts in MoE models via maximum neuron norm and computes them digitally while running majority of experts on analog in-memory computing hardware to maintain accuracy under hardware nonidealities.
Details
Motivation: Sparse Mixture-of-Experts models have massive parameter counts leading to substantial memory and energy inefficiency during inference. Analog in-memory computing offers a solution by eliminating data movement, but requires noise-aware retraining which is infeasible for large MoE models.
Method: Proposes a retraining-free heterogeneous computation framework: 1) Identifies noise-sensitive experts using maximum neuron norm (provably identifiable), 2) Computes these sensitive experts digitally, 3) Executes majority of experts on AIMC hardware, 4) Assigns densely activated modules (like attention layers) to digital computation due to high noise sensitivity.
Result: Extensive experiments on large MoE language models (DeepSeekMoE and OLMoE) across multiple benchmark tasks validate the robustness of the approach in maintaining accuracy under analog nonidealities.
Conclusion: The proposed heterogeneous computation framework enables efficient deployment of large MoE models on analog hardware without retraining, maintaining accuracy by strategically allocating noise-sensitive components to digital computation.
Abstract: Sparse Mixture-of-Experts (MoE) models enable efficient scalability by activating only a small subset of experts per input, yet their massive parameter counts lead to substantial memory and energy inefficiency during inference. Analog in-memory computing (AIMC) offers a promising solution by eliminating frequent data movement between memory and compute units. However, mitigating hardware nonidealities of AIMC typically requires noise-aware retraining, which is infeasible for large MoE models. In this paper, we propose a retraining-free heterogeneous computation framework in which noise-sensitive experts, which are provably identifiable by their maximum neuron norm, are computed digitally while the majority of the experts are executed on AIMC hardware. We further assign densely activated modules, such as attention layers, to digital computation due to their high noise sensitivity despite comprising a small fraction of parameters. Extensive experiments on large MoE language models, including DeepSeekMoE and OLMoE, across multiple benchmark tasks validate the robustness of our approach in maintaining accuracy under analog nonidealities.
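The maximum-neuron-norm criterion and the analog/digital split can be sketched compactly. The multiplicative-Gaussian noise model for analog nonidealities and the 25% digital budget below are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def max_neuron_norm(W):
    # Noise-sensitivity score of one expert: the largest row (neuron) norm of
    # its weight matrix, following the paper's identifiability criterion.
    return np.linalg.norm(W, axis=1).max()

def split_experts(experts, digital_fraction=0.25):
    # Route the most noise-sensitive experts to exact digital compute and the
    # rest to (noisy) analog in-memory compute -- no retraining involved.
    scores = [max_neuron_norm(W) for W in experts]
    k = max(1, int(len(experts) * digital_fraction))
    return set(np.argsort(scores)[-k:])

def run_expert(W, x, on_digital, noise=0.05):
    if on_digital:
        return W @ x
    # Analog nonideality modeled (as an assumption) as multiplicative
    # Gaussian weight noise.
    W_noisy = W * (1 + noise * rng.normal(size=W.shape))
    return W_noisy @ x

experts = [rng.normal(size=(16, 8)) * s for s in (0.5, 0.5, 0.5, 3.0)]
digital = split_experts(experts)
x = rng.normal(size=8)
outs = [run_expert(W, x, i in digital) for i, W in enumerate(experts)]
```

Only the large-norm (hence noise-amplifying) expert pays the digital cost; the rest run on the efficient analog path.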
[487] SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng
Main category: cs.LG
TL;DR: SaFeR-ToolKit improves vision-language model safety by formalizing safety decision-making as a checkable protocol with tool-based reasoning, training models to follow structured safety protocols before generating final responses.
Details
Motivation: Vision-language models are vulnerable to multimodal jailbreaks and over-refusal because safety depends on both visual evidence and user intent, but current alignment methods only supervise final responses rather than the reasoning process.
Method: Formalizes safety decision-making as a checkable protocol where a planner specifies a persona, Perception→Reasoning→Decision tool set, and constrained transition graph. A responder outputs typed key-value tool traces before final answers. Trains a single policy with three-stage curriculum: SFT → DPO → GRPO, where GRPO directly supervises tool usage beyond answer-level feedback.
Result: Significantly improves safety, helpfulness, and reasoning rigor on Qwen2.5-VL models (3B: 29.39/45.04/4.98 → 84.40/71.13/78.87; 7B: 53.21/52.92/19.26 → 86.34/80.79/85.34) while preserving general capabilities (3B: 58.67 → 59.21; 7B: 66.39 → 66.81).
Conclusion: Tool-based safety reasoning with structured protocols and direct supervision of tool usage effectively addresses multimodal jailbreaks and over-refusal in vision-language models while maintaining model capabilities.
Abstract: Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Codes are available at https://github.com/Duebassx/SaFeR_ToolKit.
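What makes the protocol "checkable" is that a tool trace can be validated against the constrained transition graph before the final answer is accepted. The tiny checker below is our own simplification: the transition graph and the typed key-value format are assumptions, and the paper's actual protocol is richer.

```python
# Allowed transitions of a Perception -> Reasoning -> Decision tool trace
# (simplified assumption; self-loops permit repeated tool calls of a stage).
ALLOWED = {
    "START": {"perception"},
    "perception": {"perception", "reasoning"},
    "reasoning": {"reasoning", "decision"},
    "decision": set(),            # the decision tool terminates the trace
}

def trace_ok(trace):
    state = "START"
    for step in trace:
        tool = step.get("tool")
        if tool not in ALLOWED.get(state, set()):
            return False
        state = tool
    return state == "decision"    # a valid trace must end with a decision

good = [
    {"tool": "perception", "observation": "image shows a kitchen knife"},
    {"tool": "reasoning", "thought": "user asks about cooking; benign intent"},
    {"tool": "decision", "verdict": "safe", "action": "answer"},
]
bad = [{"tool": "decision", "verdict": "safe"}]   # skips perception/reasoning
```

A reward that requires `trace_ok` to pass is one concrete way GRPO can supervise tool usage beyond answer-level feedback.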
[488] HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization
Feihu Huang, Guanyi Zhang, Songcan Chen
Main category: cs.LG
TL;DR: Theoretical analysis of Adam/AdamW generalization via algorithmic stability, proposing HomeAdam(W) with improved generalization error O(1/N) and faster convergence than standard Adam variants.
Details
Motivation: Adam and AdamW are default optimizers that converge fast but generalize worse than SGD. While some variants exist, their theoretical generalization properties remain unexplored. The paper aims to analyze Adam/AdamW generalization theoretically and propose improved algorithms.
Method: 1) Restudy generalization of Adam/AdamW via algorithmic stability theory, proving generalization error bounds. 2) Propose HomeAdam(W) algorithms that sometimes return to momentum-based SGD to improve generalization. 3) Provide theoretical analysis showing improved generalization error and convergence rates.
Result: 1) Proved Adam/AdamW without square-root (Adam(W)-srf) has generalization error O(ρ̂^{-2T}/N). 2) HomeAdam(W) achieves smaller generalization error O(1/N) than both Adam(W)-srf and standard Adam(W). 3) HomeAdam(W) has faster convergence rate O(1/T^{1/4}) than Adam(W)-srf. 4) Extensive experiments demonstrate efficiency.
Conclusion: The paper provides theoretical generalization analysis for Adam variants and proposes HomeAdam(W) algorithms with provably better generalization error (O(1/N)) and faster convergence than existing Adam variants, bridging the gap between theory and practice.
Abstract: Adam and AdamW are a class of default optimizers for training deep learning models in machine learning. These adaptive algorithms converge faster but generalize worse compared to SGD. In fact, their proved generalization error $O(\frac{1}{\sqrt{N}})$ is also larger than the $O(\frac{1}{N})$ of SGD, where $N$ denotes the training sample size. Recently, although some variants of Adam have been proposed to improve its generalization, their improved generalizations are still unexplored in theory. To fill this gap, in this paper, we restudy the generalization of Adam and AdamW via algorithmic stability, and first prove that Adam and AdamW without square-root (i.e., Adam(W)-srf) have a generalization error $O(\frac{\hat{\rho}^{-2T}}{N})$, where $T$ denotes the iteration number and $\hat{\rho}>0$ denotes the smallest element of the second-order momentum plus a small positive number. To improve generalization, we propose a class of efficient Adam algorithms (HomeAdam(W)) that sometimes return to momentum-based SGD. Moreover, we prove that our HomeAdam(W) algorithms have a smaller generalization error $O(\frac{1}{N})$ than the $O(\frac{\hat{\rho}^{-2T}}{N})$ of Adam(W)-srf, since $\hat{\rho}$ is generally very small. In particular, it is also smaller than the existing $O(\frac{1}{\sqrt{N}})$ of Adam(W). Meanwhile, we prove that our HomeAdam(W) algorithms have a faster convergence rate of $O(\frac{1}{T^{1/4}})$ than the $O(\frac{\breve{\rho}^{-1}}{T^{1/4}})$ of Adam(W)-srf, where $\breve{\rho}\leq\hat{\rho}$ is also very small. Extensive numerical experiments demonstrate the efficiency of our HomeAdam(W) algorithms.
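The "go home" idea — an Adam-style update that periodically falls back to a momentum-SGD step — can be sketched as follows. The periodic trigger (every `home_every` iterations) is our illustrative assumption; the paper's actual switching rule may differ.

```python
import numpy as np

def home_adam_step(w, g, state, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                   home_every=10):
    # Adam-style step that sometimes "goes home" to a momentum-SGD step.
    m = state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    v = state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
    if t % home_every == 0:
        return w - lr * m                         # momentum-SGD ("home") step
    mhat = m / (1 - b1 ** (t + 1))
    vhat = v / (1 - b2 ** (t + 1))
    return w - lr * mhat / (np.sqrt(vhat) + eps)  # standard Adam step

# Minimize a toy quadratic f(w) = ||w - 3||^2 / 2 (our own sanity check).
w, state = np.zeros(2), {}
for t in range(5000):
    g = w - 3.0
    w = home_adam_step(w, g, state, t, lr=0.01)
```

Both moment buffers keep updating on every step, so the adaptive and SGD-like updates share a single state; only the applied step direction changes on "home" iterations.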
[489] Improving Diffusion Planners by Self-Supervised Action Gating with Energies
Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li
Main category: cs.LG
TL;DR: SAGE is an inference-time re-ranking method for diffusion planners in offline RL that uses latent consistency signals to penalize dynamically inconsistent plans, improving performance without environment rollouts or policy retraining.
Details
Motivation: Diffusion planners in offline RL can fail when value-guided selection favors trajectories that score well but are locally inconsistent with environment dynamics, leading to brittle execution. There's a need to improve plan feasibility without additional environment interactions.
Method: SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, it assigns each sampled candidate an energy based on latent prediction error and combines this feasibility score with value estimates to select actions.
Result: Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners without requiring environment rollouts or policy re-training.
Conclusion: SAGE provides an effective inference-time re-ranking method that enhances diffusion planners by incorporating dynamic consistency signals, making them more robust while maintaining the benefits of offline training.
Abstract: Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
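The selection rule reduces to a simple score combination. A sketch, assuming the value/energy trade-off is a single scalar weight (`alpha` is a hypothetical simplification of how SAGE combines the two scores):

```python
import numpy as np

def sage_select(candidates, values, energies, alpha=1.0):
    """Re-rank sampled plans by value minus a feasibility energy, where
    `energies` are the latent prediction errors of each candidate plan
    under the JEPA predictor. Higher score = high value AND dynamically
    consistent."""
    scores = np.asarray(values) - alpha * np.asarray(energies)
    return candidates[int(np.argmax(scores))]
```

A plan with the best raw value but a large latent prediction error (i.e., one the dynamics model cannot reproduce) loses to a slightly lower-value but feasible plan.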
[490] From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma
Main category: cs.LG
TL;DR: TSC-GRPO framework addresses LLM vulnerability to adversarial prefix attacks by using causal intent probing and group policy optimization to maintain malicious intent detection throughout generation.
Details
Motivation: Large Language Models remain vulnerable to adversarial prefix attacks despite safety alignment, due to "semantic representation decay" where malicious intent signals fade as models generate compliant prefixes.
Method: Two-Stage Causal-GRPO: 1) Train causal intent probe using causal identifiability theory to disentangle invariant intent from stylistic perturbations, 2) Internalize causal awareness via Group Relative Policy Optimization with cumulative causal penalty in "fork-in-the-road" training scenarios.
Result: TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
Conclusion: The framework successfully addresses shallow safety alignment by enabling robust late-stage refusals through intent pinning and causal awareness.
Abstract: Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety alignment. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
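The cumulative causal penalty can be pictured as reward shaping in which harm accumulates and never resets. A minimal sketch (the per-token harm scores from the intent probe and the weight `lam` are illustrative assumptions, not the paper's exact reward):

```python
def cumulative_causal_penalty(harm_probs, lam=0.5):
    """Shape per-step rewards so each additional harmful token lowers the
    return and the penalty never resets; the shaped reward is therefore
    monotonically non-increasing in accumulated harm."""
    penalty, shaped = 0.0, []
    for p in harm_probs:
        penalty += lam * p      # penalties accumulate across the generation
        shaped.append(-penalty) # per-step shaped reward contribution
    return shaped
```

Since the penalty only grows, refusing late is always at least as good as continuing to emit harmful tokens, which is the incentive for late-stage refusals.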
[491] Causal Learning Should Embrace the Wisdom of the Crowd
Ryan Feng Lin, Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang
Main category: cs.LG
TL;DR: A new paradigm for causal structure learning that integrates human expertise and LLMs through distributed decision-making to overcome limitations of purely data-driven approaches.
Details
Motivation: Traditional causal structure learning from observational data faces combinatorial explosion and observational ambiguities. The paper argues that current technologies enable a new paradigm that leverages human causal knowledge to overcome these limitations.
Method: Proposes a distributed decision-making framework where human experts and LLM agents contribute fragmented knowledge about subsets of variables. Integrates scalable crowdsourcing, interactive knowledge elicitation, expert opinion modeling, robust aggregation techniques, and LLM-based simulation for AI-driven information acquisition.
Result: Presents a systematic framework for synthesizing distributed causal knowledge to recover global causal structures unachievable by individual agents alone. Advocates for a new research frontier in human-AI collaborative causal discovery.
Conclusion: Causal learning is ready for a paradigm shift that integrates human expertise with AI technologies, enabling more robust and comprehensive causal discovery through distributed knowledge synthesis.
Abstract: Learning causal structures typically represented by directed acyclic graphs (DAGs) from observational data is notoriously challenging due to the combinatorial explosion of possible graphs and inherent ambiguities in observations. This paper argues that causal learning is now ready for the emergence of a new paradigm supported by rapidly advancing technologies, fulfilling the long-standing vision of leveraging human causal knowledge. This paradigm integrates scalable crowdsourcing platforms for data collection, interactive knowledge elicitation for expert opinion modeling, robust aggregation techniques for expert reconciliation, and large language model (LLM)-based simulation for augmenting AI-driven information acquisition. In this paper, we focus on DAG learning for causal discovery and frame the problem as a distributed decision-making task, recognizing that each participant (human expert or LLM agent) possesses fragmented and imperfect knowledge about different subsets of the variables of interest in the causal graph. By proposing a systematic framework to synthesize these insights, we aim to enable the recovery of a global causal structure unachievable by any individual agent alone. We advocate for a new research frontier and outline a comprehensive framework for new research thrusts that range from eliciting, modeling, aggregating, and optimizing human causal knowledge contributions.
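As a toy illustration of the distributed setting, here is a majority vote over edge opinions from experts who each saw only a subset of the variables (a naive stand-in for the robust aggregation machinery the paper calls for, not its proposal):

```python
from collections import Counter

def aggregate_edges(expert_graphs, threshold=0.5):
    """Keep a directed edge u -> v if more than `threshold` of the
    participants who observed both u and v asserted it. Each expert
    contributes (variables_seen, edges_asserted); experts are never
    penalized for variables they did not see."""
    votes, seen = Counter(), Counter()
    for nodes, edges in expert_graphs:
        for u in nodes:
            for v in nodes:
                if u != v:
                    seen[(u, v)] += 1   # this expert could have voted on (u, v)
        votes.update(edges)             # edges this expert actually asserted
    return {e for e in votes if votes[e] / seen[e] > threshold}
```

Normalizing by `seen` rather than by the total expert count is what lets fragmented, partial views combine into a global structure.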
[492] Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
Sijie Mai, Shiqin Han, Haifeng Hu
Main category: cs.LG
TL;DR: UMQ framework jointly addresses noisy and missing modalities in multimodal affective computing by treating them as unified low-quality modality problems, using quality estimation, enhancement, and quality-aware mixture-of-experts.
Details
Motivation: Real-world multimodal data often suffers from low quality including noisy and missing modalities, which degrade model performance. Previous approaches handle these issues separately, but they need to be addressed jointly for better robustness in practical scenarios.
Method: Proposes Unified Modality-Quality (UMQ) framework with three components: 1) Quality estimator trained with rank-guided strategy using relative quality comparisons, 2) Quality enhancer for each modality using cross-modal information and modality baselines, 3) Quality-aware mixture-of-experts with specialized routing for different modality-quality problems.
Result: UMQ consistently outperforms state-of-the-art baselines on multiple datasets under complete, missing, and noisy modality settings, demonstrating improved robustness in low-quality multimodal scenarios.
Conclusion: The unified approach to handling both noisy and missing modalities as low-quality problems is effective for multimodal affective computing, providing better robustness than separate handling methods.
Abstract: Multimodal data encountered in real-world scenarios are typically of low quality, with noisy modalities and missing modalities being typical forms that severely hinder model performance and robustness. However, prior works often handle noisy and missing modalities separately. In contrast, we jointly address missing and noisy modalities to enhance model robustness in low-quality data scenarios. We regard both noisy and missing modalities as a unified low-quality modality problem, and propose a unified modality-quality (UMQ) framework to enhance low-quality representations for multimodal affective computing. Firstly, we train a quality estimator with explicit supervised signals via a rank-guided training strategy that compares the relative quality of different representations by adding a ranking constraint, avoiding training noise caused by inaccurate absolute quality labels. Then, a quality enhancer for each modality is constructed, which uses the sample-specific information provided by other modalities and the modality-specific information provided by the defined modality baseline representation to enhance the quality of unimodal representations. Finally, we propose a quality-aware mixture-of-experts module with a particular routing mechanism to enable multiple modality-quality problems to be addressed more specifically. UMQ consistently outperforms state-of-the-art baselines on multiple datasets under the settings of complete, missing, and noisy modalities.
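The rank-guided training signal resembles a standard margin ranking loss: the estimator is only told which of two representations is relatively better, never an absolute quality label. A NumPy sketch (the margin value is an assumption):

```python
import numpy as np

def rank_quality_loss(q_hi, q_lo, margin=0.1):
    """Hinge-style ranking constraint: the predicted quality of the
    representation known to be *relatively* better (q_hi) should exceed
    that of the worse one (q_lo) by at least `margin`. Zero loss once
    the ordering is satisfied with the required gap."""
    gap = np.asarray(q_hi) - np.asarray(q_lo)
    return float(np.maximum(margin - gap, 0.0).mean())
```

This avoids the "training noise caused by inaccurate absolute quality labels" that the abstract mentions, since only pairwise orderings are supervised.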
[493] An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner
Main category: cs.LG
TL;DR: Selective prediction fails in multimodal ICU clinical condition classification due to class-dependent miscalibration, where models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented conditions.
Details
Motivation: As AI systems move toward clinical deployment, ensuring reliable prediction behavior is crucial for safety-critical decision-making. Selective prediction (deferring uncertain predictions to human experts) is proposed as a safeguard, but its reliability needs empirical evaluation in multimodal clinical settings.
Method: Empirical evaluation of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Tested across state-of-the-art unimodal and multimodal models, analyzing performance degradation and calibration issues.
Result: Selective prediction substantially degrades performance despite strong standard evaluation metrics. Failure is driven by severe class-dependent miscalibration - models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, especially for underrepresented clinical conditions. Aggregate metrics obscure these effects.
Conclusion: The study reveals a task-specific failure mode of selective prediction in multimodal clinical condition classification, highlighting the need for calibration-aware evaluation to ensure safety and robustness in clinical AI deployment.
Abstract: As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
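The reported failure is counterintuitive but easy to reproduce in miniature: if uncertainty anti-correlates with correctness, answering only the "most confident" cases lowers accuracy. A toy sketch of uncertainty-based selective prediction (the data in the usage example is synthetic):

```python
import numpy as np

def selective_accuracy(correct, uncertainty, coverage=0.8):
    """Answer only the most-confident `coverage` fraction of cases and
    report accuracy on those. Under class-dependent miscalibration,
    where wrong answers look certain, this can fall *below* the
    full-coverage accuracy, which is the failure mode the paper reports."""
    order = np.argsort(uncertainty)              # most certain first
    k = int(len(order) * coverage)
    return float(np.mean(np.asarray(correct)[order[:k]]))
```

With well-calibrated uncertainty the same routine improves accuracy as coverage shrinks, which is why aggregate metrics alone can hide the miscalibrated case.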
[494] The power of small initialization in noisy low-tubal-rank tensor recovery
Zhiyu Liu, Haobo Geng, Xudong Wang, Yandong Tang, Zhi Han, Yao Wang
Main category: cs.LG
TL;DR: Small initialization in factorized gradient descent enables near-minimax optimal tensor recovery despite tubal-rank overestimation, overcoming noise sensitivity issues of spectral initialization.
Details
Motivation: Existing tensor recovery methods using factorized gradient descent with spectral initialization suffer from recovery error that scales linearly with overestimated tubal-rank when measurements are corrupted by dense noise like Gaussian noise.
Method: Proposes using small initialization instead of spectral initialization for factorized gradient descent in tensor recovery under t-product framework, analyzed using a four-stage analytic framework with early stopping strategy.
Result: Achieves nearly minimax optimal recovery error independent of overestimated tubal-rank R, with theoretical guarantees validated through simulations and real-data experiments.
Conclusion: Small initialization enables robust tensor recovery despite tubal-rank overestimation, providing practical solution with easy-to-use early stopping strategy.
Abstract: We study the problem of recovering a low-tubal-rank tensor $\mathcal{X}_\star\in \mathbb{R}^{n \times n \times k}$ from noisy linear measurements under the t-product framework. A widely adopted strategy involves factorizing the optimization variable as $\mathcal{U} * \mathcal{U}^\top$, where $\mathcal{U} \in \mathbb{R}^{n \times R \times k}$, followed by applying factorized gradient descent (FGD) to solve the resulting optimization problem. Since the tubal-rank $r$ of the underlying tensor $\mathcal{X}_\star$ is typically unknown, this method often assumes $r < R \le n$, a regime known as over-parameterization. However, when the measurements are corrupted by some dense noise (e.g., Gaussian noise), FGD with the commonly used spectral initialization yields a recovery error that grows linearly with the over-estimated tubal-rank $R$. To address this issue, we show that using a small initialization enables FGD to achieve a nearly minimax optimal recovery error, even when the tubal-rank $R$ is significantly overestimated. Using a four-stage analytic framework, we analyze this phenomenon and establish the sharpest known error bound to date, which is independent of the overestimated tubal-rank $R$. Furthermore, we provide a theoretical guarantee showing that an easy-to-use early stopping strategy can achieve the best known result in practice. All these theoretical findings are validated through a series of simulations and real-data experiments.
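A matrix analogue illustrates the phenomenon. The paper works in the t-product tensor setting; this simplification to plain matrix factorization, and the step size and iteration budget below, are all assumptions for illustration:

```python
import numpy as np

def fgd_small_init(Y, R, lr=0.05, steps=300, init_scale=1e-3, seed=0):
    """Fit Y ~ U U^T by gradient descent on the factor U (n x R), with R
    overestimating the true rank and a *small* random initialization in
    place of spectral initialization. With small init, the redundant
    factor directions stay tiny while the signal directions grow."""
    rng = np.random.default_rng(seed)
    U = init_scale * rng.standard_normal((Y.shape[0], R))
    for _ in range(steps):
        resid = U @ U.T - Y
        U = U - lr * (resid + resid.T) @ U  # gradient of 0.5 * ||U U^T - Y||_F^2
    return U
```

Even with the rank overestimated to the full dimension, the recovery error stays small, which is the behavior the paper proves carries over (near-minimax optimally) to the noisy tensor setting.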
[495] Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
Wuyue Zhang, Chongdong Huang, Chunbo You, Cheng Gu, Fengjuan Wang, Mou Sun
Main category: cs.LG
TL;DR: Enables MXFP4 efficiency for MoE models on Hopper GPUs without native 4-bit support through direct FP8-to-FP4 quantization and optimized data layout conversion.
Details
Motivation: Training large-scale Mixture-of-Experts models is bottlenecked by activation memory and expert-parallel communication, but FP4 training remains impractical on Hopper-class GPUs without native MXFP4/NVFP4 support.
Method: Introduces direct FP8-to-FP4 quantization and de-quantization with scaling-aware FP4 row-wise to column-wise conversion. Core MoE computations run in FP8 while activations and expert-parallel communication use MXFP4 compression.
Result: At 671B parameter scale, achieves comparable performance to FP8 baselines while reducing peak activation memory by 14.8% (11.8 GB) and improving training throughput by 12.5% (1157 to 1302 tokens/GPU/sec).
Conclusion: FP4 efficiency can be practically realized for large-scale MoE training through careful software-hardware co-design, even without native FP4 Tensor Core support.
Abstract: Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 $\leftrightarrow$ BF16 $\leftrightarrow$ FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel communication are compressed using MXFP4, achieving substantial memory and bandwidth savings without degrading convergence. At the 671B parameter scale, our method achieves end-to-end training performance comparable to strong FP8 baselines, while reducing peak activation memory by 14.8% (11.8 GB) and improving training throughput by 12.5%, from 1157 to 1302 tokens per GPU per second. These results show that FP4 efficiency can be practically realized for large-scale MoE training through careful software-hardware co-design, even without native FP4 Tensor Core support.
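For intuition about what MXFP4 compression does to activations, here is a toy block-quantizer: each block shares one power-of-two scale and each value snaps to the nearest FP4 (E2M1) representable number. This is a numerical sketch only, not the paper's fused FP8-to-FP4 kernel, and the block size follows the MX convention of 32:

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1): 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x, block=32):
    """Quantize-dequantize round trip: pick a shared power-of-two scale
    per block so the block maximum fits under the FP4 maximum (6), then
    snap each scaled value to the nearest grid point, keeping signs."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        amax = np.max(np.abs(b))
        if amax == 0.0:
            out[i:i + block] = 0.0
            continue
        scale = 2.0 ** np.ceil(np.log2(amax / 6.0))  # shared power-of-two scale
        q = b / scale
        mags = FP4_GRID[np.argmin(np.abs(np.abs(q)[:, None] - FP4_GRID), axis=1)]
        out[i:i + block] = np.sign(q) * mags * scale
    return out
```

Storing 4 bits per value plus one shared scale per block is where the activation-memory and communication savings come from.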
[496] Deep learning-guided evolutionary optimization for protein design
Erik Hartman, Di Tang, Johan Malmström
Main category: cs.LG
TL;DR: BoGA combines Bayesian optimization with genetic algorithms for efficient protein sequence design, demonstrated by designing peptide binders against pneumolysin.
Details
Motivation: Protein design is challenging due to vast sequence space and complex sequence-function relationships; efficient exploration methods are needed for therapeutic and biotech applications.
Method: BoGA integrates genetic algorithm as stochastic proposal generator within Bayesian optimization surrogate modeling loop, prioritizing candidates based on prior evaluations and model predictions.
Result: BoGA accelerates discovery of high-confidence peptide binders against pneumolysin, demonstrating efficient protein design across diverse objectives; implemented in BoPep suite.
Conclusion: BoGA enables data-efficient optimization for protein design, with potential applications in therapeutics and biotechnology; available as open-source software.
Abstract: Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence-function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of \textit{Streptococcus pneumoniae}. BoGA accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license at \href{https://github.com/ErikHartman/bopep}{GitHub}.
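The loop structure, a genetic algorithm proposing candidates and a surrogate screening them before any expensive evaluation, can be sketched in a few lines. Function names and the single-point mutation operator are illustrative assumptions, not the BoPep implementation:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def boga_step(population, fitness_cache, surrogate, evaluate, k=10):
    """One BoGA-style iteration: mutate random parents to get k proposals,
    rank them with the cheap surrogate model, and spend the single
    expensive oracle call only on the most promising candidate."""
    def mutate(seq):
        i = random.randrange(len(seq))
        return seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]
    proposals = [mutate(random.choice(population)) for _ in range(k)]
    best = max(proposals, key=surrogate)   # cheap surrogate screening
    fitness_cache[best] = evaluate(best)   # one expensive oracle call
    population.append(best)
    return best
```

The data efficiency comes from the ratio: k surrogate queries per true evaluation, with the surrogate refit on `fitness_cache` between iterations (refitting is omitted here).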
[497] Rethinking Time Series Domain Generalization via Structure-Stratified Calibration
Jinyang Li, Shuhao Mei, Xiaoyu Xiao, Shuhang Li, Ruoxi Yun, Jinbo Sun
Main category: cs.LG
TL;DR: A structurally stratified calibration framework for cross-domain time series generalization that addresses structural heterogeneity in latent dynamical systems by performing amplitude calibration only within structurally compatible sample clusters.
Details
Motivation: Existing cross-domain generalization methods assume comparable samples in shared representation space, but real-world datasets often come from structurally heterogeneous dynamical systems, leading to spurious correspondences and negative transfer when performing global alignment.
Method: Proposes a structurally stratified calibration framework (SSCF) that explicitly distinguishes structurally consistent samples and performs amplitude calibration exclusively within structurally compatible sample clusters to alleviate generalization failures caused by structural incompatibility.
Result: Evaluations on 19 public datasets (100.3k samples) show SSCF significantly outperforms strong baselines under zero-shot setting, achieving substantial performance improvements through concise and computationally efficient calibration strategy.
Conclusion: Establishing structural consistency prior to alignment constitutes a more reliable and effective pathway for improving cross-domain generalization of time series governed by latent dynamical systems.
Abstract: For time series arising from latent dynamical systems, existing cross-domain generalization methods commonly assume that samples are comparably meaningful within a shared representation space. In real-world settings, however, different datasets often originate from structurally heterogeneous families of dynamical systems, leading to fundamentally distinct feature distributions. Under such circumstances, performing global alignment while neglecting structural differences is highly prone to establishing spurious correspondences and inducing negative transfer. From the new perspective of cross-domain structural correspondence failure, we revisit this problem and propose a structurally stratified calibration framework (SSCF). This approach explicitly distinguishes structurally consistent samples and performs amplitude calibration exclusively within structurally compatible sample clusters, thereby effectively alleviating generalization failures caused by structural incompatibility. Notably, the proposed framework achieves substantial performance improvements through a concise and computationally efficient calibration strategy. Evaluations on 19 public datasets (100.3k samples) demonstrate that SSCF significantly outperforms strong baselines under the zero-shot setting. These results confirm that establishing structural consistency prior to alignment constitutes a more reliable and effective pathway for improving cross-domain generalization of time series governed by latent dynamical systems.
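The core calibration step is simple once strata are known. A sketch assuming cluster labels are given (obtaining them is the hard part the paper addresses, and z-normalization stands in for whatever amplitude calibration is actually used):

```python
import numpy as np

def stratified_amplitude_calibration(series, labels):
    """Z-normalize amplitudes only *within* each structural cluster,
    never across clusters, so structurally incompatible samples are
    not forced into a single global alignment."""
    series = np.asarray(series, dtype=float)
    labels = np.asarray(labels)
    out = np.empty_like(series)
    for c in np.unique(labels):
        block = series[labels == c]
        out[labels == c] = (block - block.mean()) / (block.std() + 1e-8)
    return out
```

Samples with the same dynamics but very different raw scales become directly comparable after per-cluster calibration, while samples from different strata never influence each other's statistics.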
[498] Next Embedding Prediction Makes World Models Stronger
George Bredis, Nikita Balagansky, Daniil Gavrilov, Ruslan Rakhimov
Main category: cs.LG
TL;DR: NE-Dreamer is a decoder-free model-based RL agent that uses temporal transformers to predict next-step encoder embeddings from latent state sequences, achieving strong performance on complex partially observable environments.
Details
Motivation: The paper addresses the challenge of capturing temporal dependencies in model-based reinforcement learning for partially observable, high-dimensional domains. Current approaches often rely on reconstruction losses or auxiliary supervision, which may not be optimal for learning predictive state representations.
Method: NE-Dreamer introduces a decoder-free approach using temporal transformers to predict next-step encoder embeddings directly from latent state sequences. This optimizes temporal predictive alignment in representation space without requiring reconstruction losses or auxiliary supervision.
Result: On DeepMind Control Suite, NE-Dreamer matches or exceeds DreamerV3 and leading decoder-free agents. On challenging DMLab tasks involving memory and spatial reasoning, it achieves substantial performance gains.
Conclusion: Next-embedding prediction with temporal transformers provides an effective, scalable framework for MBRL in complex, partially observable environments, demonstrating the value of direct temporal predictive alignment in representation space.
Abstract: Capturing temporal dependencies is critical for model-based reinforcement learning (MBRL) in partially observable, high-dimensional domains. We introduce NE-Dreamer, a decoder-free MBRL agent that leverages a temporal transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. This approach enables NE-Dreamer to learn coherent, predictive state representations without reconstruction losses or auxiliary supervision. On the DeepMind Control Suite, NE-Dreamer matches or exceeds the performance of DreamerV3 and leading decoder-free agents. On a challenging subset of DMLab tasks involving memory and spatial reasoning, NE-Dreamer achieves substantial gains. These results establish next-embedding prediction with temporal transformers as an effective, scalable framework for MBRL in complex, partially observable environments.
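The decoder-free objective is a regression in representation space rather than pixel space. A minimal sketch (the squared-error form is an assumption; in practice the targets would be stop-gradient encoder outputs):

```python
import numpy as np

def next_embedding_loss(pred_embeds, target_embeds):
    """Regress the temporal transformer's prediction for step t+1 onto
    the encoder embedding of the actual next observation. There is no
    pixel-reconstruction term anywhere in the objective."""
    diff = np.asarray(pred_embeds, dtype=float) - np.asarray(target_embeds, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Because the loss lives entirely in embedding space, the world model never has to spend capacity reconstructing task-irrelevant visual detail.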
[499] From Heuristic Selection to Automated Algorithm Design: LLMs Benefit from Strong Priors
Qi Huang, Furong Ye, Ananta Shahane, Thomas Bäck, Niki van Stein
Main category: cs.LG
TL;DR: LLM-driven black-box optimization enhanced by using high-quality algorithmic code examples from benchmark studies.
Details
Motivation: Existing LLM-driven algorithm design focuses on specific problems with adaptive prompts, but lacks systematic guidance from benchmarking studies to improve optimization performance.
Method: Analyze token-wise attribution of prompts to LLM-generated code, then leverage prior benchmark algorithms to guide LLM-driven optimization through high-quality examples.
Result: Demonstrated superior performance on two black-box optimization benchmarks: pseudo-Boolean optimization suite (pbo) and black-box optimization benchmark (bbob).
Conclusion: Integrating benchmarking studies enhances both efficiency and robustness of LLM-driven black-box optimization methods.
Abstract: Large Language Models (LLMs) have already been widely adopted for automated algorithm design, demonstrating strong abilities in generating and evolving algorithms across various fields. Existing work has largely focused on examining their effectiveness in solving specific problems, with search strategies primarily guided by adaptive prompt designs. In this paper, through investigating the token-wise attribution of the prompts to LLM-generated algorithmic codes, we show that providing high-quality algorithmic code examples can substantially improve the performance of the LLM-driven optimization. Building upon this insight, we propose leveraging prior benchmark algorithms to guide LLM-driven optimization and demonstrate superior performance on two black-box optimization benchmarks: the pseudo-Boolean optimization suite (pbo) and the black-box optimization suite (bbob). Our findings highlight the value of integrating benchmarking studies to enhance both efficiency and robustness of the LLM-driven black-box optimization methods.
[500] The Price of Robustness: Stable Classifiers Need Overparameterization
Jonas von Berg, Adalbert Fono, Massimiliano Datres, Sohir Maskey, Gitta Kutyniok
Main category: cs.LG
TL;DR: The paper establishes generalization bounds for discontinuous classifiers based on class stability (margin distance to decision boundary), showing that overparameterization is necessary for high stability and good generalization.
Details
Motivation: To better understand the relationship between overparameterization, stability, and generalization in discontinuous classifiers, where current understanding is incomplete despite the empirical success of overparameterized models.
Method: Develops generalization bounds for finite function classes that improve inversely with class stability (expected margin distance). Extends to infinite function classes using normalized co-stability (margin in codomain). Provides theoretical analysis connecting parameter count to stability requirements.
Result: Shows that interpolating models with p≈n parameters must be unstable, requiring substantial overparameterization for high stability. Experiments confirm stability increases with model size and correlates with test performance, unlike traditional norm-based measures.
Conclusion: Class stability serves as a quantifiable robustness measure for generalization in discontinuous classifiers, explaining why overparameterization is necessary for achieving both interpolation and stability.
Abstract: The relationship between overparameterization, stability, and generalization remains incompletely understood in the setting of discontinuous classifiers. We address this gap by establishing a generalization bound for finite function classes that improves inversely with class stability, defined as the expected distance to the decision boundary in the input domain (margin). Interpreting class stability as a quantifiable notion of robustness, we derive as a corollary a law of robustness for classification that extends the results of Bubeck and Sellke beyond smoothness assumptions to discontinuous functions. In particular, any interpolating model with $p \approx n$ parameters on $n$ data points must be unstable, implying that substantial overparameterization is necessary to achieve high stability. We obtain analogous results for parameterized infinite function classes by analyzing a stronger robustness measure derived from the margin in the codomain, which we refer to as the normalized co-stability. Experiments support our theory: stability increases with model size and correlates with test performance, while traditional norm-based measures remain largely uninformative.
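The central quantity, class stability as expected input-space distance to the decision boundary, can be estimated empirically. A toy Monte-Carlo estimator for intuition (bisecting along random rays is an assumed approximation, not the paper's formal margin):

```python
import numpy as np

def class_stability(classify, X, n_dirs=64, max_r=5.0, tol=1e-3, seed=0):
    """Estimate the expected distance to the decision boundary: for each
    point, bisect along random unit directions until the predicted label
    flips, and keep the smallest flip distance found."""
    rng = np.random.default_rng(seed)
    margins = []
    for x in X:
        base, best = classify(x), max_r
        for _ in range(n_dirs):
            d = rng.standard_normal(x.shape)
            d /= np.linalg.norm(d)
            if classify(x + max_r * d) == base:
                continue                  # no label flip along this ray
            lo, hi = 0.0, max_r           # bisect to the flip point
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if classify(x + mid * d) == base:
                    lo = mid
                else:
                    hi = mid
            best = min(best, hi)
        margins.append(best)
    return float(np.mean(margins))
```

Note this works for discontinuous classifiers too, since it only queries labels, never gradients, which matches the paper's setting beyond smoothness assumptions.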
[501] Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling
Jiaqi Wang, Zhiguang Cao, Peng Zhao, Rui Cao, Yubin Xiao, Yuan Jiang, You Zhou
Main category: cs.LG
TL;DR: MIStar: A memory-enhanced improvement search framework using heterogeneous graph representation for flexible job-shop scheduling problems, outperforming traditional heuristics and DRL-based constructive methods.
Details
Motivation: Current DRL-based approaches for flexible job-shop scheduling (FJSP) use constructive methods that often fail to reach near-optimal solutions. Improvement-based methods are more effective but face challenges with flexible machine allocation, requiring better state representation, policy learning, and search strategies.
Method: Proposes MIStar framework with: 1) novel heterogeneous disjunctive graph to model operation sequences on machines, 2) memory-enhanced heterogeneous graph neural network (MHGNN) for feature extraction using historical trajectories, and 3) parallel greedy search strategy for efficient solution space exploration.
Result: Extensive experiments on synthetic data and public benchmarks show MIStar significantly outperforms both traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.
Conclusion: MIStar effectively addresses the challenges of improvement-based methods for FJSP through innovative graph representation, memory-enhanced learning, and efficient search, achieving superior scheduling solutions.
Abstract: The rise of smart manufacturing under Industry 4.0 introduces mass customization and dynamic production, demanding more advanced and flexible scheduling techniques. The flexible job-shop scheduling problem (FJSP) has attracted significant attention due to its complex constraints and strong alignment with real-world production scenarios. Current deep reinforcement learning (DRL)-based approaches to FJSP predominantly employ constructive methods. While effective, they often fall short of reaching (near-)optimal solutions. In contrast, improvement-based methods iteratively explore the neighborhood of initial solutions and are more effective in approaching optimality. However, the flexible machine allocation in FJSP poses significant challenges to the application of this framework, including accurate state representation, effective policy learning, and efficient search strategies. To address these challenges, this paper proposes a Memory-enhanced Improvement Search framework with heterogeneous graph representation (MIStar). It employs a novel heterogeneous disjunctive graph that explicitly models the operation sequences on machines to accurately represent scheduling solutions. Moreover, a memory-enhanced heterogeneous graph neural network (MHGNN) is designed for feature extraction, leveraging historical trajectories to enhance the decision-making capability of the policy network. Finally, a parallel greedy search strategy is adopted to explore the solution space, enabling superior solutions with fewer iterations. Extensive experiments on synthetic data and public benchmarks demonstrate that MIStar significantly outperforms both traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.
[502] Lattice-based Deep Neural Networks: Regularity and Tailored Regularization
Alexander Keller, Frances Y. Kuo, Dirk Nuyens, Ian H. Sloan
Main category: cs.LG
TL;DR: Survey on applying lattice rules (quasi-Monte Carlo methods) to train deep neural networks, showing theoretical generalization bounds and better performance than standard regularization.
Details
Motivation: Lattice rules are effective for high-dimensional integration and function approximation, and recent research shows potential for improving DNN training. The authors aim to apply these methods to DNNs to achieve better theoretical generalization bounds with dimension-independent constants.
Method: Using lattice rules as training points for DNNs with smooth activation functions. Imposing restrictions on network parameters to match target function regularity. Tailoring lattice training points and regularization to achieve theoretical error bounds.
Result: Proved that DNNs with tailored lattice training points achieve good theoretical generalization error bounds with constants independent of input dimension. Numerical demonstrations show DNNs trained with tailored regularization perform significantly better than with standard ℓ₂ regularization.
Conclusion: Lattice rules offer promising approach for DNN training with theoretical guarantees and practical improvements over standard methods, particularly for high-dimensional problems.
Abstract: This survey article is concerned with the application of lattice rules to Deep Neural Networks (DNNs), lattice rules being a family of quasi-Monte Carlo methods. They have demonstrated effectiveness in various contexts for high-dimensional integration and function approximation. They are extremely easy to implement thanks to their very simple formulation – all that is required is a good integer generating vector of length matching the dimensionality of the problem. In recent years there has been a burst of research activities on the application and theory of DNNs. We review our recent article on using lattice rules as training points for DNNs with a smooth activation function, where we obtained explicit regularity bounds of the DNNs. By imposing restrictions on the network parameters to match the regularity features of the target function, we prove that DNNs with tailored lattice training points can achieve good theoretical generalization error bounds, with implied constants independent of the input dimension. We also demonstrate numerically that DNNs trained with our tailored regularization perform significantly better than with standard $\ell_2$ regularization.
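The rank-1 lattice construction the abstract alludes to ("all that is required is a good integer generating vector") is simple to sketch. The generating vector below is illustrative only, not one from the paper (good vectors are normally found by a component-by-component search):

```python
import numpy as np

def rank1_lattice(n, z):
    """Rank-1 lattice rule: n points in [0,1)^d from generating vector z."""
    i = np.arange(n).reshape(-1, 1)          # point indices 0..n-1
    return (i * np.asarray(z) / n) % 1.0     # componentwise fractional part

# Illustrative generating vector for d=3.
pts = rank1_lattice(64, [1, 19, 27])
print(pts.shape)  # (64, 3)
```

The appeal for DNN training is that such point sets cover the input cube far more evenly than i.i.d. random samples at the same budget.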
[503] Adapting Time Series Foundation Models through Data Mixtures
Thomas L. Lee, Edoardo M. Ponti, Amos Storkey
Main category: cs.LG
TL;DR: MixFT improves zero-shot forecasting for time series foundation models by using Bayesian mixtures to identify sub-domains within datasets and fine-tuning separate modules on homogeneous data partitions.
Details
Motivation: Time series foundation models struggle with new domains not covered in pretraining. Traditional fine-tuning approaches (single module on all data or per-dataset modules) are suboptimal because datasets can contain multiple sub-domains with different distributions.
Method: Proposes MixFT which uses Bayesian mixtures to re-divide data into homogeneous sets representing sub-domains, then fine-tunes separate modules on each partition to create specialized modules for different data distributions.
Result: MixFT outperforms both per-dataset fine-tuning methods and single-module fine-tuning on all data, demonstrating better specialization for zero-shot forecasting.
Conclusion: Re-partitioning data to identify and target sub-domains enables better specialization of time series foundation models, improving zero-shot forecasting performance for new domains.
Abstract: Time series foundation models (TSFMs) have become increasingly popular for zero-shot forecasting. However, for a new time series domain not fully covered by the pretraining set, performance can suffer. Therefore, when a practitioner cares about a new domain and has access to a set of related datasets, the question arises: how best to fine-tune a TSFM to improve zero-shot forecasting? A typical approach to this type of problem is to fine-tune a LoRA module on all datasets or separately on each dataset. Tuning a separate module on each dataset allows for the specialisation of the TSFM to different types of data distribution, by selecting differing combinations of per-dataset modules for different time series contexts. However, we find that, using per-dataset modules might not be optimal, since a time series dataset can contain data from several types of distributions, i.e. sub-domains. This can be due to the distribution shifting or having differing distributions for different dimensions of the time series. Hence, we propose MixFT which re-divides the data using Bayesian mixtures into sets that best represent the sub-domains present in the data, and fine-tunes separately on each of these sets. This re-division of the data ensures that each set is more homogeneous, leading to fine-tuned modules focused on specific sub-domains. Our experiments show that MixFT performs better than per-dataset methods and when fine-tuning a single module on all the data. This suggests that by re-partitioning the data to represent sub-domains we can better specialise TSFMs to improve zero-shot forecasting.
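A minimal sketch of the re-division idea, using a plain EM-fitted Gaussian mixture on synthetic window features; the paper's Bayesian mixture and the actual TSFM fine-tuning are not reproduced here, and the two-cluster data is an assumption for illustration:

```python
import numpy as np

def em_gmm(X, k, iters=50):
    """Minimal diagonal-covariance Gaussian-mixture EM; returns hard labels."""
    n, d = X.shape
    order = np.argsort(X[:, 0])                       # spread-out initial means
    mu = X[order[np.linspace(0, n - 1, k).astype(int)]].copy()
    var = np.ones((k, d))
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities under independent Gaussians per dimension
        logp = -0.5 * (((X[:, None, :] - mu) ** 2) / var + np.log(var)).sum(-1)
        logp += np.log(pi)
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(0) + 1e-9
        pi, mu = nk / n, (r.T @ X) / nk[:, None]
        var = (r.T @ X**2) / nk[:, None] - mu**2 + 1e-6
    return r.argmax(1)

# Two synthetic "sub-domains" hidden inside one dataset of window features;
# each recovered partition would then get its own fine-tuned module.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])
labels = em_gmm(X, k=2)
```

The point of the re-division is that each partition is more homogeneous than the original dataset, so the per-partition modules specialize cleanly.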
[504] Eliciting Numerical Predictive Distributions of LLMs Without Autoregression
Julianna Piskorz, Katarzyna Kobalczyk, Mihaela van der Schaar
Main category: cs.LG
TL;DR: LLMs can perform regression tasks via in-context learning, but autoregressive decoding is inefficient for continuous outputs requiring predictive distributions. The paper investigates whether distributional properties can be recovered directly from LLM embeddings without sampling.
Details
Motivation: Autoregressive decoding for continuous-valued outputs in LLMs requires repeated sampling to obtain predictive distributions, leading to high computational cost and inference time. The authors want to explore more efficient alternatives.
Method: Train regression probes to predict statistical functionals (mean, median, quantiles) of LLM’s numerical output distribution directly from its internal representations/embeddings, bypassing explicit autoregressive generation.
Result: LLM embeddings carry informative signals about summary statistics of their predictive distributions, including numerical uncertainty. This suggests distributional properties can be recovered without sampling.
Conclusion: The investigation opens new questions about how LLMs internally encode uncertainty in numerical tasks and suggests feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.
Abstract: Large Language Models (LLMs) have recently been successfully applied to regression tasks – such as time series forecasting and tabular prediction – by leveraging their in-context learning abilities. However, their autoregressive decoding process may be ill-suited to continuous-valued outputs, where obtaining predictive distributions over numerical targets requires repeated sampling, leading to high computational cost and inference time. In this work, we investigate whether distributional properties of LLM predictions can be recovered without explicit autoregressive generation. To this end, we study a set of regression probes trained to predict statistical functionals (e.g., mean, median, quantiles) of the LLM’s numerical output distribution directly from its internal representations. Our results suggest that LLM embeddings carry informative signals about summary statistics of their predictive distributions, including the numerical uncertainty. This investigation opens up new questions about how LLMs internally encode uncertainty in numerical tasks, and about the feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.
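The probing setup can be sketched with a closed-form ridge probe on synthetic "embeddings". The linear signal placed in the first coordinate is an assumption for illustration, not a claim about real LLM representations:

```python
import numpy as np

def fit_probe(E, y, lam=1e-2):
    """Closed-form ridge probe: predict a scalar functional from embeddings."""
    d = E.shape[1]
    return np.linalg.solve(E.T @ E + lam * np.eye(d), E.T @ y)

# Synthetic stand-in: embeddings whose first coordinate linearly encodes
# the mean of the model's numerical output distribution (assumed signal).
rng = np.random.default_rng(0)
E = rng.normal(size=(500, 16))
mean_target = 2.0 * E[:, 0] + 0.1 * rng.normal(size=500)
w = fit_probe(E, mean_target)
pred = E @ w
```

In the paper's setting one probe per functional (mean, median, each quantile) replaces repeated autoregressive sampling, which is where the efficiency gain comes from.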
[505] Learning in Markov Decision Processes with Exogenous Dynamics
Davide Maran, Davide Salaorni, Marcello Restelli
Main category: cs.LG
TL;DR: This paper studies structured MDPs with exogenous state components whose transitions are independent of agent actions, showing this structure enables significantly improved learning guarantees with only exogenous state space size in leading regret terms.
Details
Motivation: Standard RL algorithms are designed for generic MDPs, but many practical systems have structured dynamics where only a subset of state variables are directly influenced by agent actions while others evolve exogenously. This structure is common in real-world systems but not exploited by standard methods.
Method: The authors study a structured class of MDPs with exogenous state components whose transitions are independent of agent actions. They develop RL algorithms that exploit this structure, providing theoretical analysis showing improved learning guarantees. They establish matching lower bounds to prove optimality.
Result: Theoretical results show significantly improved regret bounds where only the size of the exogenous state space appears in leading terms, compared to standard RL methods that scale with full state space. Empirical validation demonstrates substantial gains in sample efficiency across toy settings and real-world-inspired environments.
Conclusion: Exploiting exogenous structure in MDPs yields information-theoretically optimal learning guarantees with substantial practical benefits for sample efficiency, making this approach valuable for real-world systems with such structure.
Abstract: Reinforcement learning algorithms are typically designed for generic Markov Decision Processes (MDPs), where any state-action pair can lead to an arbitrary transition distribution. In many practical systems, however, only a subset of the state variables is directly influenced by the agent’s actions, while the remaining components evolve according to exogenous dynamics and account for most of the stochasticity. In this work, we study a structured class of MDPs characterized by exogenous state components whose transitions are independent of the agent’s actions. We show that exploiting this structure yields significantly improved learning guarantees, with only the size of the exogenous state space appearing in the leading terms of the regret bounds. We further establish a matching lower bound, showing that this dependence is information-theoretically optimal. Finally, we empirically validate our approach across classical toy settings and real-world-inspired environments, demonstrating substantial gains in sample efficiency compared to standard reinforcement learning methods.
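The structural point, that exogenous transitions can be estimated from every transition in the data regardless of which action was taken, can be illustrated with a toy tabular chain (this is the intuition only, not the authors' algorithm):

```python
import numpy as np

def estimate_exo(transitions, n_x):
    """Count-based estimate of the exogenous transition matrix P(x' | x)."""
    counts = np.zeros((n_x, n_x))
    for x, a, x_next in transitions:      # action a is deliberately unused
        counts[x, x_next] += 1
    counts += 1e-9                        # avoid division by zero
    return counts / counts.sum(1, keepdims=True)

# Toy exogenous chain with two states; actions are random and irrelevant.
P_true = np.array([[0.9, 0.1], [0.2, 0.8]])
rng = np.random.default_rng(0)
x, data = 0, []
for _ in range(20000):
    x_next = rng.choice(2, p=P_true[x])
    data.append((x, rng.integers(3), x_next))
    x = x_next
P_hat = estimate_exo(data, 2)
```

Because no stratification by action is needed, the sample cost of learning this part of the dynamics scales with the exogenous state space alone, which mirrors why only that size appears in the leading regret terms.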
[506] Embedding interpretable $\ell_1$-regression into neural networks for uncovering temporal structure in cell imaging
Fabian Kabus, Maren Hackenberg, Julia Hindel, Thibault Cholvin, Antje Kilias, Thomas Brox, Abhinav Valada, Marlene Bartos, Harald Binder
Main category: cs.LG
TL;DR: Combines neural networks with interpretable sparse regression by embedding a vector autoregressive model into a convolutional autoencoder for calcium imaging data analysis.
Details
Motivation: Neural networks excel at learning non-sparse structure but lack interpretability, while classical statistical regression with ℓ₁ regularization offers better interpretability through sparsity. The paper aims to optimally combine these approaches for analyzing two-photon calcium imaging data where sparse autoregressive dynamics need to be extracted.
Method: Embeds a vector autoregressive (VAR) model as an interpretable regression technique into a convolutional autoencoder. Uses skip connections to separate non-sparse static spatial information, selectively channeling sparse structure into the ℓ₁-regularized VAR. Enables ℓ₁-estimation by differentiating through the piecewise linear solution path. Contrasts with approaches where the autoencoder doesn’t adapt to the VAR model.
Result: The integrated approach provides both dimension reduction for tractable temporal modeling and interpretable sparse regression. Enables testing for comparing temporal sequences from the same observational unit and generates contribution maps visualizing which spatial regions drive the learned dynamics.
Conclusion: Successfully combines neural network capabilities with interpretable sparse regression, offering a hybrid approach that maintains both modeling power and interpretability for analyzing complex neural imaging data with sparse temporal dynamics.
Abstract: While artificial neural networks excel in unsupervised learning of non-sparse structure, classical statistical regression techniques offer better interpretability, in particular when sparseness is enforced by $\ell_1$ regularization, enabling identification of which factors drive observed dynamics. We investigate how these two types of approaches can be optimally combined, exemplarily considering two-photon calcium imaging data where sparse autoregressive dynamics are to be extracted. We propose embedding a vector autoregressive (VAR) model as an interpretable regression technique into a convolutional autoencoder, which provides dimension reduction for tractable temporal modeling. A skip connection separately addresses non-sparse static spatial information, selectively channeling sparse structure into the $\ell_1$-regularized VAR. $\ell_1$-estimation of regression parameters is enabled by differentiating through the piecewise linear solution path. This is contrasted with approaches where the autoencoder does not adapt to the VAR model. Having an embedded statistical model also enables a testing approach for comparing temporal sequences from the same observational unit. Additionally, contribution maps visualize which spatial regions drive the learned dynamics.
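A sketch of the sparse VAR piece in isolation: an ℓ₁-penalized VAR(1) fitted by proximal gradient (ISTA) on a toy latent trajectory. The paper instead differentiates through the piecewise linear solution path inside the autoencoder; this standalone version is for intuition only:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_var1(Z, lam, iters=500):
    """Sparse VAR(1): min_A 0.5*||Z[1:] - Z[:-1] A^T||^2 + lam*||A||_1 (ISTA)."""
    X, Y = Z[:-1], Z[1:]
    L = np.linalg.norm(X.T @ X, 2)          # Lipschitz constant of the gradient
    A = np.zeros((Z.shape[1], Z.shape[1]))
    for _ in range(iters):
        grad = (X @ A.T - Y).T @ X          # gradient of the squared loss in A
        A = soft(A - grad / L, lam / L)     # gradient step + l1 proximal step
    return A

# Toy latent trajectory driven by a sparse true transition matrix.
rng = np.random.default_rng(0)
A_true = np.array([[0.8, 0.0], [0.5, 0.3]])
Z = np.zeros((500, 2))
for t in range(499):
    Z[t + 1] = A_true @ Z[t] + 0.3 * rng.normal(size=2)
A_hat = l1_var1(Z, lam=5.0)   # shrinks small spurious entries toward zero
```

The solution path of this lasso-type problem is piecewise linear in lam, which is what makes differentiating through it tractable in the paper's end-to-end setup.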
[507] On the Structural Limitations of Weight-Based Neural Adaptation and the Role of Reversible Behavioral Learning
Pardhu Sri Rushi Varma Konduru
Main category: cs.LG
TL;DR: Introduces structural irreversibility in neural model adaptation and proposes reversible behavioral learning to allow deterministic unloading of learned behaviors while preserving model identity.
Details
Motivation: Current neural model adaptation methods (fine-tuning, alignment training, RL) cause long-term alterations to base model behavior due to shared parameter changes, making it impossible to deterministically revert to original behavior without parameter snapshots.
Method: Introduces reversible behavioral learning where model behaviors are structurally dissociated from identity parameters, allowing deterministic unloading via explicit unload process. Also proposes Recoverability Factor as a normalized measure of behavioral recoverability and diagnostics based on model divergence.
Result: Experiments show reversible model adaptation achieves rollback within numerical precision, while shared-parameter mutation exhibits persistent post-reset divergence.
Conclusion: Structural irreversibility is a fundamental characteristic of shared-parameter model adaptation, and reversible behavioral learning provides a solution for deterministic behavior unloading while preserving model identity.
Abstract: Neural models are usually adapted through changes in parameters shared among model components via fine-tuning, alignment-based training, and reinforcement learning. These changes have been found effective in short-term optimization. However, they result in long-term alterations in the model’s base behavior. In this study, we introduce the concept of structural irreversibility as a characteristic of shared-parameter model adaptation. This concept refers to the intertwining of task-specific objectives with the representational identity of the model. We show that when parameters are directly mutated, the resulting model behaves divergently from the original model. This divergence cannot be reversed deterministically without an explicit parameter snapshot. We introduce reversible behavioral learning, in which model behaviors are structurally dissociated from identity parameters and can be deterministically unloaded through an explicit unload process. We also introduce the Recoverability Factor as a normalized measure of behavioral recoverability and provide additional diagnostics based on model divergence. Experiments show that reversible model adaptation achieves rollback within numerical precision, whereas shared-parameter mutation exhibits persistent post-reset divergence.
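The summary does not specify the paper's mechanism, but the contrast with shared-parameter mutation can be illustrated by a toy adapter that keeps all learning in a separate delta, so unloading is exact without any snapshot of the full model:

```python
import numpy as np

class ReversibleAdapter:
    """Behavior stored as a separate delta; unloading restores the base
    weights bit-for-bit (hypothetical sketch, not the paper's design)."""
    def __init__(self, W_base):
        self.W_base = W_base                 # identity parameters, never mutated
        self.delta = np.zeros_like(W_base)
    def adapt(self, grad, lr=0.1):
        self.delta -= lr * grad              # all learning goes into the delta
    def effective(self):
        return self.W_base + self.delta
    def unload(self):
        self.delta = np.zeros_like(self.W_base)   # deterministic rollback

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
m = ReversibleAdapter(W.copy())
for _ in range(10):
    m.adapt(rng.normal(size=(4, 4)))
drift = m.effective().copy()                 # behavior has diverged
m.unload()
```

Mutating `W` in place, by contrast, would leave no deterministic path back to the original behavior, which is the structural irreversibility the paper names.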
[508] Distributed Dynamic Invariant Causal Prediction in Environmental Time Series
Ziruo Hao, Tao Yang, Xiaofeng Wu, Bo Hu
Main category: cs.LG
TL;DR: DisDy-ICPT: A distributed framework for learning dynamic invariant causal relationships in time-series data with environmental attributes, enabling stable causal prediction without data communication.
Details
Motivation: Existing methods fail to combine dynamic causal analysis with environmental contexts in distributed temporal settings, creating a gap for robust decision-making in domains like climate science and environmental monitoring.
Method: Proposes DisDy-ICPT framework that learns dynamic causal relationships over time while mitigating spatial confounding variables without requiring data communication.
Result: Theoretically proven to recover stable causal predictors within bounded communication rounds; empirically shows superior predictive stability and accuracy on synthetic benchmarks and environment-segmented real-world datasets compared to baselines.
Conclusion: DisDy-ICPT offers promising applications in carbon monitoring and weather forecasting, with future work extending to online learning scenarios.
Abstract: The extraction of invariant causal relationships from time series data with environmental attributes is critical for robust decision-making in domains such as climate science and environmental monitoring. However, existing methods either emphasize dynamic causal analysis without leveraging environmental contexts or focus on static invariant causal inference, leaving a gap in distributed temporal settings. In this paper, we propose Distributed Dynamic Invariant Causal Prediction in Time-series (DisDy-ICPT), a novel framework that learns dynamic causal relationships over time while mitigating spatial confounding variables without requiring data communication. We theoretically prove that DisDy-ICPT recovers stable causal predictors within a bounded number of communication rounds under standard sampling assumptions. Empirical evaluations on synthetic benchmarks and environment-segmented real-world datasets show that DisDy-ICPT achieves superior predictive stability and accuracy compared to baseline methods A and B. Our approach offers promising applications in carbon monitoring and weather forecasting. Future work will extend DisDy-ICPT to online learning scenarios.
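DisDy-ICPT builds on invariant causal prediction, whose core intuition is that a truly causal predictor leaves residuals that look the same in every environment. That idea can be sketched in its simplest static form; this is far cruder than the paper's distributed, dynamic method and uses only a tolerance check rather than a proper statistical test:

```python
import numpy as np

def invariant_residuals(X, y, env, tol=0.2):
    """Crude invariance check: fit pooled least squares, then compare
    per-environment residual means against a tolerance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    means = [resid[env == e].mean() for e in np.unique(env)]
    return max(abs(m) for m in means) < tol

rng = np.random.default_rng(0)
env = np.repeat([0, 1], 200)
x_causal = rng.normal(size=400)
y = 2.0 * x_causal + 0.1 * rng.normal(size=400)
x_spurious = y + 1.5 * env            # relation to y shifts with environment
ok_causal = invariant_residuals(x_causal[:, None], y, env)
ok_spurious = invariant_residuals(x_spurious[:, None], y, env)
print(ok_causal, ok_spurious)  # causal predictor passes, spurious one fails
```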
[509] Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models
Fengzhi Li, Liang Zhang, Yuan Zuo, Ruiqing Zhao, YanSong Liu, Yunfei Ma, Fanyu Meng, Junlan Feng
Main category: cs.LG
TL;DR: GraphSSR introduces adaptive subgraph extraction and denoising for zero-shot LLM-based graph reasoning, addressing structural noise issues in previous methods through Sample-Select-Reason pipeline and reinforcement learning.
Details
Motivation: Traditional GNNs struggle with zero-shot generalization on graphs, and while LLM-based approaches like Graph-R1 show promise, they suffer from structural noise due to task-agnostic subgraph extraction that includes irrelevant neighbors and edges, distorting LLMs' reasoning capabilities.
Method: Proposes GraphSSR with SSR pipeline (Sample-Select-Reason) for adaptive subgraph extraction, SSR-SFT for supervised fine-tuning using synthesized reasoning traces, and SSR-RL reinforcement learning framework with Authenticity-Reinforced and Denoising-Reinforced RL to optimize sampling and selection for denoised subgraphs.
Result: The method achieves improved zero-shot graph reasoning by enabling LLMs to work with parsimonious, denoised subgraphs, overcoming the one-size-fits-all limitation of previous approaches and reducing structural noise.
Conclusion: GraphSSR provides an effective framework for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning, enhancing generalization capabilities by filtering task-irrelevant structural information.
Abstract: Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise–irrelevant neighbors and edges–that distorts the LLMs’ receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a “Sample-Select-Reason” process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.
[510] Towards Accurate and Interpretable Time-series Forecasting: A Polynomial Learning Approach
Bo Liu, Shao-Bo Lin, Changmiao Wang, Xiaotong Liu
Main category: cs.LG
TL;DR: IPL is an interpretable polynomial learning method for time series forecasting that models original features and their interactions through polynomial representations, balancing accuracy and interpretability.
Details
Motivation: Current time series forecasting methods lack interpretability, undermining user trust and complicating debugging. Existing interpretable methods have limitations including insufficient temporal dependency modeling, lack of feature-level interpretability for early warning, and difficulty balancing accuracy with interpretability.
Method: Proposes Interpretable Polynomial Learning (IPL) which integrates interpretability into model structure by explicitly modeling original features and their interactions of arbitrary order through polynomial representations. This preserves temporal dependencies and allows flexible trade-off between accuracy and interpretability by adjusting polynomial degree.
Result: IPL achieves high prediction accuracy with superior interpretability compared to widely used explainability methods on simulated and Bitcoin price data. Experiments on field-collected antenna data show IPL yields simpler and more efficient early warning mechanisms.
Conclusion: IPL provides an effective approach for interpretable time series forecasting that balances accuracy and interpretability through polynomial representations, offering practical value for early warning applications.
Abstract: Time series forecasting enables early warning and has driven asset performance management from traditional planned maintenance to predictive maintenance. However, the lack of interpretability in forecasting methods undermines users’ trust and complicates debugging for developers. Consequently, interpretable time-series forecasting has attracted increasing research attention. Nevertheless, existing methods suffer from several limitations, including insufficient modeling of temporal dependencies, lack of feature-level interpretability to support early warning, and difficulty in simultaneously achieving the accuracy and interpretability. This paper proposes the interpretable polynomial learning (IPL) method, which integrates interpretability into the model structure by explicitly modeling original features and their interactions of arbitrary order through polynomial representations. This design preserves temporal dependencies, provides feature-level interpretability, and offers a flexible trade-off between prediction accuracy and interpretability by adjusting the polynomial degree. We evaluate IPL on simulated and Bitcoin price data, showing that it achieves high prediction accuracy with superior interpretability compared with widely used explainability methods. Experiments on field-collected antenna data further demonstrate that IPL yields simpler and more efficient early warning mechanisms.
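The polynomial representation can be sketched as lagged values plus their pairwise products, fitted by least squares so every coefficient remains directly readable. This toy AR example illustrates the idea only and is not the IPL estimator itself:

```python
import numpy as np

def poly_lag_features(y, p=2):
    """Degree-2 polynomial features of the last p lags: an intercept, the
    lags themselves, and all their pairwise products."""
    rows = []
    for t in range(p, len(y)):
        lags = y[t - p:t][::-1]                 # y_{t-1}, ..., y_{t-p}
        inter = [lags[i] * lags[j] for i in range(p) for j in range(i, p)]
        rows.append(np.concatenate([[1.0], lags, inter]))
    return np.array(rows), y[p:]

# Toy AR(2) series with a small quadratic interaction.
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t-1] - 0.3 * y[t-2] + 0.1 * y[t-1] * y[t-2] \
           + 0.1 * rng.normal()
X, target = poly_lag_features(y, p=2)
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
# Columns: [const, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1}y_{t-2}, y_{t-2}^2]
```

Raising the polynomial degree adds higher-order interaction columns, which is the accuracy-versus-interpretability dial the paper describes.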
[511] Enhancing Physics-Informed Neural Networks with Domain-aware Fourier Features: Towards Improved Performance and Interpretable Results
Alberto Miño Calero, Luis Salamanca, Konstantinos E. Tatsis
Main category: cs.LG
TL;DR: PINNs with Domain-aware Fourier Features (DaFFs) improve accuracy, training efficiency, and interpretability compared to vanilla PINNs and Random Fourier Features (RFFs)-based PINNs.
Details
Motivation: Physics-Informed Neural Networks (PINNs) incorporate physics via PDEs in loss functions but suffer from difficult training, poor interpretability, and computational costs. The authors aim to address these limitations by developing domain-aware positional encodings that eliminate the need for explicit boundary condition losses and improve model interpretability.
Method: Proposes Domain-aware Fourier Features (DaFFs) for positional encoding that encapsulate domain-specific characteristics (geometry, boundary conditions). Unlike Random Fourier Features (RFFs), DaFFs eliminate explicit boundary condition loss terms and loss balancing. Also develops an LRP-based explainability framework tailored to PINNs for extracting relevance attribution scores from input space.
Result: PINN-DaFFs achieve orders-of-magnitude lower errors and faster convergence compared to vanilla PINNs and RFFs-based PINNs. LRP analysis shows DaFFs lead to more physically consistent feature attributions, while PINN-RFFs and vanilla PINNs display scattered, less physics-relevant patterns.
Conclusion: DaFFs enhance PINNs’ accuracy, efficiency, and interpretability, enabling more robust and informative physics-informed learning by incorporating domain knowledge directly into positional encoding and providing better explainability through tailored LRP analysis.
Abstract: Physics-Informed Neural Networks (PINNs) incorporate physics into neural networks by embedding partial differential equations (PDEs) into their loss function. Despite their success in learning the underlying physics, PINN models remain difficult to train and interpret. In this work, a novel modeling approach is proposed, which relies on the use of Domain-aware Fourier Features (DaFFs) for the positional encoding of the input space. These features encapsulate all the domain-specific characteristics, such as the geometry and boundary conditions, and unlike Random Fourier Features (RFFs), eliminate the need for explicit boundary condition loss terms and loss balancing schemes, while simplifying the optimization process and reducing the computational cost associated with training. We further develop an LRP-based explainability framework tailored to PINNs, enabling the extraction of relevance attribution scores for the input space. It is demonstrated that PINN-DaFFs achieve orders-of-magnitude lower errors and allow faster convergence compared to vanilla PINNs and RFFs-based PINNs. Furthermore, LRP analysis reveals that the proposed approach leads to more physically consistent feature attributions, while PINN-RFFs and vanilla PINNs display more scattered and less physics-relevant patterns. These results demonstrate that DaFFs not only enhance PINNs’ accuracy and efficiency but also improve interpretability, laying the groundwork for more robust and informative physics-informed learning.
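The boundary-condition point is easy to see in one dimension: sine features chosen for the domain vanish at the boundary by construction, so any network built on them satisfies homogeneous Dirichlet conditions with no boundary loss term. A hypothetical 1-D sketch (the paper's DaFFs handle general geometries and boundary conditions):

```python
import numpy as np

def daff_encoding(x, L=1.0, n_modes=8):
    """Sine features sin(k*pi*x/L) for k = 1..n_modes: each one vanishes at
    x = 0 and x = L, so the encoding bakes in homogeneous Dirichlet BCs."""
    k = np.arange(1, n_modes + 1)
    return np.sin(np.outer(x, k) * np.pi / L)

x = np.linspace(0.0, 1.0, 5)
phi = daff_encoding(x)
print(np.abs(phi[0]).max(), np.abs(phi[-1]).max())  # ~0 at both boundaries
```

With RFFs the frequencies are random, so the boundary values are arbitrary and a separate boundary loss (plus loss balancing) is needed, which is exactly the overhead DaFFs remove.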
[512] Contextual Latent World Models for Offline Meta Reinforcement Learning
Mohammadreza Nakheai, Aidan Scannell, Kevin Luck, Joni Pajarinen
Main category: cs.LG
TL;DR: Offline meta-RL method that combines context-based task inference with latent world models to learn better task representations through task-conditioned temporal consistency
Details
Motivation: Learning effective task representations without supervision in offline meta-reinforcement learning is challenging. Context-based methods infer task representations from transition histories but struggle to learn meaningful representations beyond task discrimination.
Method: Introduces contextual latent world models that condition latent world models on inferred task representations and train them jointly with context encoders. This enforces task-conditioned temporal consistency, capturing task-dependent dynamics rather than just task discrimination.
Result: Method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control, and Meta-World benchmarks.
Conclusion: Combining context-based task inference with latent world models through task-conditioned temporal consistency yields superior task representations and generalization in offline meta-RL.
Abstract: Offline meta-reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context-based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self-supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task-conditioned temporal consistency, yielding task representations that capture task-dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control, and Meta-World benchmarks.
[513] CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo
Main category: cs.LG
TL;DR: A continual GUI learning framework (CGL) that balances adaptation efficiency and skill retention by synergizing supervised fine-tuning and reinforcement learning, with dynamic proportion adjustment and gradient surgery to prevent knowledge overwriting.
Details
Motivation: GUI agents face challenges in continual learning due to frequent application updates, where adapting to new tasks often causes forgetting of old tasks. Current methods like SFT enable fast adaptation but trigger knowledge overwriting, while RL shows resilience but may be less efficient.
Method: Proposes CGL framework with: 1) SFT proportion adjustment mechanism guided by policy entropy to dynamically balance SFT and RL training phases; 2) Gradient surgery strategy that projects exploratory SFT gradients onto GRPO-based anchor gradients, clipping conflicting components; 3) AndroidControl-CL benchmark dividing GUI applications into task groups for continual learning evaluation.
Result: Experimental results demonstrate the effectiveness of the CGL framework across continual learning scenarios. The proposed approach successfully balances adaptation efficiency and skill retention in GUI agent learning.
Conclusion: The CGL framework effectively addresses continual GUI learning challenges by synergizing SFT and RL with dynamic balancing mechanisms and gradient surgery, preventing knowledge overwriting while maintaining adaptation efficiency.
Abstract: Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
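The gradient-surgery step described in this abstract (clipping the components of SFT gradients that conflict with the GRPO anchor) can be sketched as a PCGrad-style projection. The function name and the exact projection formula are assumptions; the paper's precise formulation may differ.

```python
import numpy as np

def gradient_surgery(g_sft, g_anchor):
    """Clip the component of the SFT gradient that conflicts with the
    GRPO-based anchor gradient (a PCGrad-style projection sketch; the
    paper's exact formulation may differ)."""
    dot = float(np.dot(g_sft, g_anchor))
    if dot < 0:
        # Conflicting case: remove the component along the anchor direction.
        g_sft = g_sft - dot / (float(np.dot(g_anchor, g_anchor)) + 1e-12) * g_anchor
    return g_sft
```

After the projection, the surviving SFT gradient has a non-negative inner product with the anchor, so an SFT update can no longer directly undo the RL-preserved behavior.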
[514] Leveraging Label Proportion Prior for Class-Imbalanced Semi-Supervised Learning
Kohki Akiba, Shinnosuke Matsuo, Shota Harada, Ryoma Bise
Main category: cs.LG
TL;DR: Proposes a lightweight SSL framework using Proportion Loss from LLP to address class imbalance, improving performance on minority classes in semi-supervised learning.
Details
Motivation: Semi-supervised learning suffers from class imbalance where pseudo-labeling amplifies majority bias and suppresses minority class performance. Existing methods struggle with this issue, especially under scarce-label conditions.
Method: Introduces Proportion Loss from learning from label proportions (LLP) into SSL as a regularization term. This aligns model predictions with the global class distribution to mitigate bias. Also formulates a stochastic variant to account for mini-batch composition fluctuations.
Result: Experiments on Long-tailed CIFAR-10 show consistent improvements over FixMatch and ReMixMatch baselines across imbalance severities and label ratios. Achieves competitive or superior results compared to existing CISSL methods, particularly under scarce-label conditions.
Conclusion: Proportion Loss effectively addresses class imbalance in SSL by regularizing predictions to match global class distributions, offering a lightweight solution that improves minority class performance without complex architectural changes.
Abstract: Semi-supervised learning (SSL) often suffers under class imbalance, where pseudo-labeling amplifies majority bias and suppresses minority performance. We address this issue with a lightweight framework that, to our knowledge, is the first to introduce Proportion Loss from learning from label proportions (LLP) into SSL as a regularization term. Proportion Loss aligns model predictions with the global class distribution, mitigating bias across both majority and minority classes. To further stabilize training, we formulate a stochastic variant that accounts for fluctuations in mini-batch composition. Experiments on the Long-tailed CIFAR-10 benchmark show that integrating Proportion Loss into FixMatch and ReMixMatch consistently improves performance over the baselines across imbalance severities and label ratios, and achieves competitive or superior results compared to existing CISSL methods, particularly under scarce-label conditions.
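The regularization idea here (align batch-level predictions with the global class distribution) can be sketched with a cross-entropy form of the LLP proportion loss. The cross-entropy choice and the function name are assumptions; the paper's exact loss, and its stochastic variant, may differ.

```python
import numpy as np

def proportion_loss(probs, prior):
    """Penalize deviation of the batch-mean prediction from the global
    class proportions (one common LLP formulation; a sketch, not the
    paper's exact loss).

    probs: (batch, num_classes) softmax outputs
    prior: (num_classes,) global class proportions
    """
    mean_pred = probs.mean(axis=0)
    # Cross-entropy between the prior and the aggregated prediction.
    return float(-np.sum(prior * np.log(mean_pred + 1e-12)))
```

Added as a regularizer on top of a pseudo-labeling objective such as FixMatch, this term pushes the model away from collapsing onto majority classes.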
[515] Integrating Homomorphic Encryption and Synthetic Data in FL for Privacy and Learning Quality
Yenan Wang, Carla Fabiana Chiasserini, Elad Michael Schiller
Main category: cs.LG
TL;DR: Alt-FL: Federated learning with alternating authentic/synthetic data rounds and selective homomorphic encryption to balance privacy, accuracy, and computational costs.
Details
Motivation: FL needs to ensure learning quality and privacy protection while keeping resource consumption low, especially with computationally expensive homomorphic encryption.
Method: Alternating Federated Learning (Alt-FL) interleaves local training with authentic data (authentic rounds) and synthetic data (synthetic rounds), transferring encrypted parameters on authentic rounds and plaintext on synthetic rounds.
Result: Improves model accuracy by 13.4%, reduces HE-related costs by up to 48% compared to Selective HE, and demonstrates robust privacy protection against data leakage attacks like DLG.
Conclusion: Alt-FL effectively balances privacy, accuracy, and computational efficiency in federated learning through synthetic data augmentation and strategic encryption interleaving.
Abstract: Federated learning (FL) enables collaborative training of machine learning models without sharing sensitive client data, making it a cornerstone for privacy-critical applications. However, FL faces the dual challenge of ensuring learning quality and robust privacy protection while keeping resource consumption low, particularly when using computationally expensive techniques such as homomorphic encryption (HE). In this work, we enhance an FL process that preserves privacy using HE by integrating it with synthetic data generation and an interleaving strategy. Specifically, our solution, named Alternating Federated Learning (Alt-FL), consists of alternating between local training with authentic data (authentic rounds) and local training with synthetic data (synthetic rounds) and transferring the encrypted and plaintext model parameters on authentic and synthetic rounds (resp.). Our approach improves learning quality (e.g., model accuracy) through datasets enhanced with synthetic data, preserves client data privacy via HE, and keeps manageable encryption and decryption costs through our interleaving strategy. We evaluate our solution against data leakage attacks, such as the DLG attack, demonstrating robust privacy protection. Also, Alt-FL provides 13.4% higher model accuracy and decreases HE-related costs by up to 48% with respect to Selective HE.
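The interleaving strategy in this abstract reduces to a simple round schedule: authentic rounds upload HE-encrypted parameters, synthetic rounds upload plaintext. A minimal sketch follows; starting with an authentic round and strict alternation are assumptions of this sketch.

```python
def alt_fl_schedule(num_rounds):
    """Sketch of the Alt-FL round plan: authentic rounds (local training
    on real client data, encrypted parameter upload) alternate with
    synthetic rounds (training on synthetic data, plaintext upload)."""
    return [("authentic", "encrypted") if r % 2 == 0
            else ("synthetic", "plaintext")
            for r in range(num_rounds)]
```

Since only synthetic-data updates ever travel in plaintext, roughly half the rounds skip the HE encrypt/decrypt cost while real client data stays protected.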
[516] LAGO: A Local-Global Optimization Framework Combining Trust Region Methods and Bayesian Optimization
Eliott Van Dieren, Tommaso Vanzan, Fabio Nobile
Main category: cs.LG
TL;DR: LAGO is a hybrid optimization algorithm that combines gradient-enhanced Bayesian Optimization for global exploration with trust region methods for local refinement through an adaptive competition mechanism.
Details
Motivation: To address the limitations of pure global optimization methods (slow convergence) and pure local optimization methods (getting stuck in local optima) by creating a hybrid approach that efficiently balances exploration and exploitation.
Method: Combines gradient-enhanced Bayesian Optimization with gradient-based trust region local refinement. At each iteration, both strategies propose candidate points independently, and selection is based on predicted improvement. The method separates global exploration from local refinement by optimizing BO acquisition functions outside active trust regions, while incorporating local gradient evaluations into the global Gaussian process only when they meet lengthscale-based distance criteria.
Result: Achieves improved exploration of the full design space compared to standard non-linear local optimization algorithms for smooth functions, while maintaining fast local convergence in promising regions.
Conclusion: LAGO provides an effective hybrid optimization approach that balances global exploration and local exploitation, overcoming limitations of pure global or local methods through adaptive competition between strategies.
Abstract: We introduce LAGO, a LocAl-Global Optimization algorithm that combines gradient-enhanced Bayesian Optimization (BO) with gradient-based trust region local refinement through an adaptive competition mechanism. At each iteration, global and local optimization strategies independently propose candidate points, and the next evaluation is selected based on predicted improvement. LAGO separates global exploration from local refinement at the proposal level: the BO acquisition function is optimized outside the active trust region, while local function and gradient evaluations are incorporated into the global gradient-enhanced Gaussian process only when they satisfy a lengthscale-based minimum-distance criterion, reducing the risk of numerical instability during the local exploitation. This enables efficient local refinement when reaching promising regions, without sacrificing a global search of the design space. As a result, the method achieves an improved exploration of the full design space compared to standard non-linear local optimization algorithms for smooth functions, while maintaining fast local convergence in regions of interest.
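The adaptive competition mechanism described above can be sketched as a one-line selection rule: both proposers submit a candidate, and the one with the higher surrogate-predicted improvement wins. The tie-breaking toward the local proposal and the function names are assumptions of this sketch.

```python
def select_next_point(x_global, x_local, predicted_improvement):
    """Adaptive competition (sketch): the BO proposer and the trust-region
    proposer each submit a candidate; evaluate the surrogate model's
    predicted improvement at both and keep the better one."""
    if predicted_improvement(x_local) >= predicted_improvement(x_global):
        return x_local
    return x_global
```

Early on, the global BO candidate typically wins (large predicted improvement far from sampled points); once a promising basin is found, the trust-region candidate dominates and drives fast local convergence.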
[517] Why Does RLAIF Work At All?
Robin Young
Main category: cs.LG
TL;DR: The paper proposes the latent value hypothesis to explain why Reinforcement Learning from AI Feedback (RLAIF) works for value learning, suggesting pretraining encodes human values as directions in representation space that can be elicited via constitutional prompts.
Details
Motivation: To provide a theoretical explanation for why RLAIF enables language models to improve through self-generated preference judgments, despite lacking a clear theoretical foundation for how this self-improvement works for value learning.
Method: Proposes the latent value hypothesis and formalizes it under a linear model where constitutional prompts act as projection operators selecting value-relevant directions in representation space. Analyzes RLAIF through this theoretical framework.
Result: The analysis explains several empirical phenomena: the generation-judgment gap, how RLAIF quality scales with model capacity, the existence of adversarial constitutions that can activate anti-social values, and unifies findings like refusal directions and low-rank safety subspaces.
Conclusion: The latent value hypothesis provides a unified theoretical account for RLAIF’s effectiveness in value alignment, explaining why constitutional prompting works and how pretrained representations encode human values that can be elicited for self-improvement.
Abstract: Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model’s default generation direction thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
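Under the paper's linear model, a constitutional prompt selects a value-relevant direction in representation space, and preference follows from projections onto it. A toy sketch of that judgment rule follows; the variable names and the argmax-of-projection scoring are illustrative assumptions, not the paper's formal construction.

```python
import numpy as np

def constitutional_preference(h_a, h_b, value_direction):
    """Toy version of the latent value hypothesis: the constitution picks
    a direction v, and the response whose hidden representation projects
    further onto v is preferred (illustrative sketch only)."""
    v = value_direction / np.linalg.norm(value_direction)
    return "a" if float(np.dot(h_a, v)) >= float(np.dot(h_b, v)) else "b"
```

In this picture, an adversarial constitution is simply one whose selected direction correlates with an anti-social value axis encoded during pretraining.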
[518] On the Topology of Neural Network Superlevel Sets
Bahman Gharesifard
Main category: cs.LG
TL;DR: Neural networks with Riccati-type activation functions produce Pfaffian outputs with architecture-controlled complexity bounds on topological features like superlevel sets and Lie bracket rank drop loci.
Details
Motivation: The paper aims to establish theoretical bounds on the topological complexity of neural network outputs, particularly for networks with Riccati-type activation functions that have been studied in universal approximation theory. The motivation is to understand how architectural choices affect the geometric and topological properties of neural network outputs.
Method: The authors analyze neural networks whose activation functions satisfy a Riccati-type ordinary differential equation condition. They prove that such networks produce Pfaffian outputs on analytic domains, with format controlled only by the architecture. This allows them to derive bounds on topological complexity measures like total Betti numbers for superlevel sets and Lie bracket rank drop loci.
Result: The main results show that for neural networks with Riccati-type activations, the topological complexity of superlevel sets and Lie bracket rank drop loci can be bounded uniformly over all possible weights, with bounds depending only on the network architecture. This provides architecture-only control over topological features.
Conclusion: The work establishes fundamental connections between neural network architecture and the topological complexity of their outputs, providing theoretical guarantees for networks with Riccati-type activation functions. These results contribute to understanding the geometric properties of neural networks from a mathematical perspective.
Abstract: We show that neural networks with activations satisfying a Riccati-type ordinary differential equation condition, an assumption arising in recent universal approximation results in the uniform topology, produce Pfaffian outputs on analytic domains with format controlled only by the architecture. Consequently, superlevel sets, as well as Lie bracket rank drop loci for neural network parameterized vector fields, admit architecture-only bounds on topological complexity, in particular on total Betti numbers, uniformly over all weights.
[519] Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients
Tian-Shuang Wu, Shen-Huan Lyu, Ning Chen, Yi-Xiao He, Bing Tang, Baoliu Ye, Qingfu Zhang
Main category: cs.LG
TL;DR: CAFedCL breaks prototype bias loop in federated contrastive learning via confidence-aware aggregation and geometric consistency regularization
Details
Motivation: Local class imbalance and data heterogeneity in federated learning create a prototype bias loop where biased local prototypes accumulate errors through repeated aggregation and reuse as contrastive anchors.
Method: Confidence-aware aggregation using predictive uncertainty to downweight high-variance prototypes, generative augmentation for minority classes, and geometric consistency regularization to stabilize class structure.
Result: Outperforms federated baselines under varying class imbalance and data heterogeneity in both accuracy and client fairness
Conclusion: CAFedCL effectively breaks the prototype bias loop through uncertainty-aware aggregation and structural stabilization techniques
Abstract: Local class imbalance and data heterogeneity across clients often trap prototype-based federated contrastive learning in a prototype bias loop: biased local prototypes induced by imbalanced data are aggregated into biased global prototypes, which are repeatedly reused as contrastive anchors, accumulating errors across communication rounds. To break this loop, we propose Confidence-Aware Federated Contrastive Learning (CAFedCL), a novel framework that improves the prototype aggregation mechanism and strengthens the contrastive alignment guided by prototypes. CAFedCL employs a confidence-aware aggregation mechanism that leverages predictive uncertainty to downweight high-variance local prototypes. In addition, generative augmentation for minority classes and geometric consistency regularization are integrated to stabilize the structure between classes. From a theoretical perspective, we provide an expectation-based analysis showing that our aggregation reduces estimation variance, thereby bounding global prototype drift and ensuring convergence. Extensive experiments under varying levels of class imbalance and data heterogeneity demonstrate that CAFedCL consistently outperforms representative federated baselines in both accuracy and client fairness.
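The confidence-aware aggregation step can be sketched as a variance-weighted average of local prototypes, so that high-variance (low-confidence) clients contribute less. The inverse-variance weighting is one natural instantiation and an assumption of this sketch; the paper's exact weighting may differ.

```python
import numpy as np

def confidence_aware_aggregate(local_protos, variances):
    """Aggregate per-class local prototypes with inverse-variance weights,
    downweighting high-variance clients (a sketch of confidence-aware
    aggregation; the paper's exact scheme may differ).

    local_protos: (num_clients, dim) prototype vectors for one class
    variances:    (num_clients,) predictive uncertainty per client
    """
    w = 1.0 / (np.asarray(variances, dtype=float) + 1e-12)
    w = w / w.sum()
    return np.average(np.asarray(local_protos, dtype=float), axis=0, weights=w)
```

Intuitively, this is what bounds global prototype drift: a client whose prototype estimate is noisy (e.g. from a heavily imbalanced local dataset) cannot dominate the global anchor.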
[520] cPNN: Continuous Progressive Neural Networks for Evolving Streaming Time Series
Federico Giannini, Giacomo Ziffer, Emanuele Della Valle
Main category: cs.LG
TL;DR: cPNN is a continuous version of Progressive Neural Networks that handles concept drift, temporal dependencies, and catastrophic forgetting in data streams using RNNs and SGD.
Details
Motivation: Data streams often violate the i.i.d. assumption by having temporal dependencies and concept drift, while neural networks suffer from catastrophic forgetting when learning new concepts. Existing solutions address these problems separately, lacking a joint solution.
Method: Continuous Progressive Neural Networks (cPNN) extend Progressive Neural Networks to handle continuous data streams. The method uses Recurrent Neural Networks to capture temporal dependencies and applies Stochastic Gradient Descent to streams with temporal dependencies, enabling knowledge transfer while preventing forgetting.
Result: Ablation study shows cPNN quickly adapts to new concepts and demonstrates robustness to concept drifts in data streams.
Conclusion: cPNN provides a unified solution for handling concept drift, temporal dependencies, and catastrophic forgetting in continuous data streams, outperforming separate approaches to these problems.
Abstract: Dealing with an unbounded data stream involves overcoming the assumption that data is identically distributed and independent. A data stream can, in fact, exhibit temporal dependencies (i.e., be a time series), and data can change distribution over time (concept drift). The two problems are deeply discussed, and existing solutions address them separately: a joint solution is absent. In addition, learning multiple concepts implies remembering the past (a.k.a. avoiding catastrophic forgetting in Neural Networks’ terminology). This work proposes Continuous Progressive Neural Networks (cPNN), a solution that tames concept drifts, handles temporal dependencies, and bypasses catastrophic forgetting. cPNN is a continuous version of Progressive Neural Networks, a methodology for remembering old concepts and transferring past knowledge to fit the new concepts quickly. We base our method on Recurrent Neural Networks and exploit the Stochastic Gradient Descent applied to data streams with temporal dependencies. Results of an ablation study show a quick adaptation of cPNN to new concepts and robustness to drifts.
[521] SEHFS: Structural Entropy-Guided High-Order Correlation Learning for Multi-View Multi-Label Feature Selection
Cheng Peng, Yonghao Li, Wanfu Gao, Jie Wen, Weiping Ding
Main category: cs.LG
TL;DR: SEHFS is a multi-view multi-label feature selection method that uses structural entropy to learn high-order feature correlations beyond pairwise dependencies, addressing limitations of existing information-theoretic approaches.
Details
Motivation: Existing information-theoretic methods for multi-view multi-label learning struggle with learning high-order structural correlations in real-world data and are prone to local optima due to heuristic optimization approaches.
Method: Converts feature graph into structural-entropy-minimizing encoding tree to quantify high-order dependencies, groups redundant features into clusters, and uses information theory-matrix fusion framework with shared semantic matrix and view-specific contribution matrices.
Result: Demonstrates superior performance in feature selection on eight datasets from various domains, with theoretical establishment of structural entropy’s ability to learn high-order correlations.
Conclusion: SEHFS effectively addresses high-order correlation learning and optimization issues in multi-view multi-label feature selection through structural entropy and matrix fusion framework.
Abstract: In recent years, multi-view multi-label learning (MVML) has attracted extensive attention due to its close alignment to real-world scenarios. Information-theoretic methods have gained prominence for learning nonlinear correlations. However, two key challenges persist: first, features in real-world data commonly exhibit high-order structural correlations, but existing information-theoretic methods struggle to learn such correlations; second, commonly relying on heuristic optimization, information-theoretic methods are prone to converging to local optima. To address these two challenges, we propose a novel method called Structural Entropy Guided High-Order Correlation Learning for Multi-View Multi-Label Feature Selection (SEHFS). The core idea of SEHFS is to convert the feature graph into a structural-entropy-minimizing encoding tree, quantifying the information cost of high-order dependencies and thus learning high-order feature correlations beyond pairwise correlations. Specifically, features exhibiting strong high-order redundancy are grouped into a single cluster within the encoding tree, while inter-cluster feature correlations are minimized, thereby eliminating redundancy both within and across clusters. Furthermore, a new framework based on the fusion of information theory and matrix methods is adopted, which learns a shared semantic matrix and view-specific contribution matrices to reconstruct a global view matrix, thereby enhancing the information-theoretic method and balancing the global and local optimization. The ability of structural entropy to learn high-order correlations is theoretically established, and both experiments on eight datasets from various domains and ablation studies demonstrate that SEHFS achieves superior performance in feature selection.
[522] IoUCert: Robustness Verification for Anchor-based Object Detectors
Benedikt Brückner, Alejandro J. Mercado, Yanghao Zhang, Panagiotis Kouvaros, Alessio Lomuscio
Main category: cs.LG
TL;DR: IoUCert is a formal verification framework for anchor-based object detection models that provides robustness guarantees against input perturbations by optimizing IoU bounds through novel coordinate transformations.
Details
Motivation: Formal robustness verification has been successful for image classification but remains challenging for object detection due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. Existing methods struggle with anchor-based architectures like SSD, YOLOv2, and YOLOv3.
Method: Proposes a coordinate transformation that circumvents precision-degrading relaxations of non-linear box prediction functions. Enables optimization of bounds directly with respect to anchor box offsets, leading to a novel Interval Bound Propagation method that derives optimal IoU bounds for single-object localization.
Result: Enables, for the first time, robustness verification of realistic anchor-based models (SSD, YOLOv2, YOLOv3 variants) against various input perturbations. Demonstrates practical verification capabilities for object detection systems.
Conclusion: IoUCert successfully addresses the bottleneck in formal verification for object detection by providing a specialized framework that handles the unique challenges of anchor-based architectures and IoU metrics.
Abstract: While formal robustness verification has seen significant success in image classification, scaling these guarantees to object detection remains notoriously difficult due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. We introduce IoUCert, a novel formal verification framework designed specifically to overcome these bottlenecks in foundational anchor-based object detection architectures. Focusing on the object localisation component in single-object settings, we propose a coordinate transformation that enables our algorithm to circumvent precision-degrading relaxations of non-linear box prediction functions. This allows us to optimise bounds directly with respect to the anchor box offsets which enables a novel Interval Bound Propagation method that derives optimal IoU bounds. We demonstrate that our method enables, for the first time, the robustness verification of realistic, anchor-based models including SSD, YOLOv2, and YOLOv3 variants against various input perturbations.
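As background for the Interval Bound Propagation machinery this paper builds on, the standard IBP step through an affine layer can be sketched in centre/radius form. This is the generic building block only; IoUCert's IoU-specific coordinate transformation and optimal bounding are more involved.

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Standard Interval Bound Propagation through y = Wx + b:
    propagate an elementwise input interval [lo, hi] using the
    centre/radius form, which is exact for a single affine layer."""
    c = (lo + hi) / 2.0          # interval centre
    r = (hi - lo) / 2.0          # interval radius (elementwise)
    c_out = W @ c + b
    r_out = np.abs(W) @ r        # worst-case growth of the radius
    return c_out - r_out, c_out + r_out
```

Chaining such steps through a detector's box-regression head yields intervals on predicted coordinates, from which IoU bounds against a ground-truth box can then be derived.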
[523] Step-Level Sparse Autoencoder for Reasoning Process Interpretation
Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao
Main category: cs.LG
TL;DR: SSAE is a step-level sparse autoencoder that disentangles LLM reasoning steps into sparse features for interpretability, enabling analysis of reasoning patterns and properties like correctness.
Details
Motivation: Current LLM interpretability tools like Sparse Autoencoders operate at token level, creating a granularity mismatch for analyzing step-level reasoning patterns, directions, and semantic transitions in Chain-of-Thought reasoning.
Method: Proposes Step-level Sparse Autoencoder (SSAE) that controls sparsity of step features conditioned on context, forming an information bottleneck to split incremental information from background information into sparsely activated dimensions.
Result: Experiments on multiple base models and reasoning tasks show SSAE effectively extracts features that can predict surface-level information (generation length, token distribution) and complex properties (correctness, logicality) through linear probing.
Conclusion: LLMs already partly know about reasoning properties during generation, providing foundation for self-verification ability; SSAE offers effective analytical tool for step-level reasoning interpretability.
Abstract: Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs’ reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs. The code is available at https://github.com/Miaow-Lab/SSAE
[524] Reinforcement Learning with Symbolic Reward Machines
Thomas Krug, Daniel Neider
Main category: cs.LG
TL;DR: Symbolic Reward Machines (SRMs) extend Reward Machines to work directly with environment observations using symbolic formulas, eliminating the need for manual labeling functions in reinforcement learning.
Details
Motivation: Reward Machines require manual creation of labeling functions for each environment and task, which limits their applicability in standard RL frameworks. The authors aim to create a more practical approach that works with standard environment outputs.
Method: Proposes Symbolic Reward Machines (SRMs) that consume standard environment observations directly through symbolic formula guards. Introduces two learning algorithms: QSRM and LSRM for learning SRMs.
Result: SRM methods outperform baseline RL approaches and achieve the same results as existing RM methods while adhering to standard environment definitions and providing interpretable task representations.
Conclusion: SRMs successfully overcome the limitations of traditional RMs by eliminating the need for manual labeling functions while maintaining performance and interpretability.
Abstract: Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) to represent and learn sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels that are emitted by the environment alongside the observation. However, this concept requires manual user input for each environment and task. The user has to create a suitable labeling function that computes the labels. These limitations lead to poor applicability in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs) together with the learning algorithms QSRM and LSRM to overcome the limitations of RMs. SRMs consume only the standard output of the environment and process the observation directly through guards that are represented by symbolic formulas. In our evaluation, our SRM methods outperform the baseline RL approaches and generate the same results as the existing RM methods. At the same time, our methods adhere to the widely used environment definition and provide interpretable representations of the task to the user.
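The key difference from classical Reward Machines can be sketched in a few lines: an SRM transition fires when its guard, a predicate evaluated directly on the raw observation, holds, so no hand-written labeling function is needed. The tuple encoding and function name below are illustrative assumptions.

```python
def srm_step(state, obs, transitions):
    """One Symbolic Reward Machine transition (sketch): guards are
    symbolic predicates over the raw observation rather than labels
    emitted by a user-supplied labeling function."""
    for src, guard, dst, reward in transitions:
        if src == state and guard(obs):
            return dst, reward
    return state, 0.0  # no guard fired: stay in place, zero reward
```

In the paper, such guards are symbolic formulas learned by QSRM/LSRM rather than hand-coded lambdas as in this toy example.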
[525] On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions
Linyan Gu, Lihua Yang, Feng Zhou
Main category: cs.LG
TL;DR: Transformers can approximate maxout networks and inherit universal approximation capabilities of ReLU networks, with expressivity growing exponentially with depth via piecewise linear functions.
Details
Motivation: Despite Transformers' empirical success, their theoretical expressive power remains insufficiently understood. The paper aims to bridge this gap by analyzing Transformers' approximation capabilities and connecting them to established neural network theory.
Method: Establishes explicit approximation of maxout networks by Transformers while preserving comparable model complexity. Develops framework to analyze approximation of continuous piecewise linear functions by Transformers, quantifying expressivity via number of linear regions.
Result: Transformers inherit universal approximation capability of ReLU networks under similar complexity constraints. Their expressivity grows exponentially with depth as measured by number of linear regions. Self-attention layers implement max-type operations while feedforward layers realize token-wise affine transformations.
Conclusion: The analysis establishes theoretical bridge between approximation theory for standard feedforward neural networks and Transformer architectures, providing structural insights into Transformers’ internal mechanisms.
Abstract: Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework to analyze the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity via the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between approximation theory for standard feedforward neural networks and Transformer architectures. It also yields structural insights into Transformers: self-attention layers implement max-type operations, while feedforward layers realize token-wise affine transformations.
[526] Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Ruinan Jin, Yingbin Liang, Shaofeng Zou
Main category: cs.LG
TL;DR: Theoretical analysis showing Adam’s superior high-probability convergence with δ^{-1/2} dependence vs SGD’s δ^{-1} dependence under bounded variance
Details
Motivation: Despite Adam's empirical superiority over SGD, existing theory fails to explain the performance gap, providing similar guarantees for both methods.
Method: Uncovers Adam’s key second-moment normalization and develops a stopping-time/martingale analysis under the classical bounded variance model.
Result: Establishes first theoretical separation: Adam achieves δ^{-1/2} dependence on confidence parameter δ, while SGD requires at least δ^{-1} dependence
Conclusion: Adam provably outperforms SGD in high-probability convergence under bounded variance, explaining the empirical performance gap
Abstract: Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas the corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
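The second-moment normalization at the center of the analysis is the division by the square root of the running second moment in Adam's update. A side-by-side sketch of the two updates in their standard textbook forms (bias correction omitted for brevity):

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """Plain SGD: step size scales directly with the gradient."""
    return w - lr * g

def adam_step(w, g, m, v, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (bias correction omitted). The division by
    sqrt(v) is the second-moment normalization the paper isolates:
    it rescales the step by an estimate of the gradient's magnitude."""
    m = b1 * m + (1 - b1) * g          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
    return w - lr * m / (np.sqrt(v) + eps), m, v
```

Because the normalization makes the step magnitude largely insensitive to the raw gradient scale, it is plausible (and consistent with the paper's separation result) that its tuning behaves differently from SGD's.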
[527] Multi-Scale Adaptive Neighborhood Awareness Transformer For Graph Fraud Detection
Jiaqi Lv, Qingfeng Du, Yu Zhang, Yongqi Han, Sheng Li
Main category: cs.LG
TL;DR: MANDATE: A transformer-based approach for graph fraud detection that addresses GNN limitations through multi-scale positional encoding, homophilic/heterophilic embedding strategies, and relation-aware fusion.
Details
Motivation: Existing GNN-based fraud detection methods suffer from inherent inductive biases including homogeneity assumptions and limited global modeling ability, which hinder their effectiveness in detecting fraudulent behavior in graphs.
Method: Proposes Multi-scale Neighborhood Awareness Transformer (MANDATE) with: 1) multi-scale positional encoding for various distances from central nodes, 2) different embedding strategies for homophilic vs. heterophilic connections, 3) embedding fusion strategy for multi-relation graphs to handle relationship distribution bias.
Result: Experiments on three fraud detection datasets demonstrate the superiority of MANDATE over existing methods, showing improved performance in detecting fraudulent behavior in graph-structured data.
Conclusion: MANDATE effectively addresses GNN limitations for fraud detection by enhancing global modeling capabilities and mitigating homophily distribution differences through transformer-based architecture with specialized embedding strategies.
Abstract: Graph fraud detection (GFD) is crucial for identifying fraudulent behavior within graphs, benefiting various domains such as financial networks and social media. Existing methods based on graph neural networks (GNNs) have succeeded considerably due to their effective expressive capacity for graph-structured data. However, the inherent inductive bias of GNNs, including the homogeneity assumption and the limited global modeling ability, hinder the effectiveness of these models. To address these challenges, we propose Multi-scale Neighborhood Awareness Transformer (MANDATE), which alleviates the inherent inductive bias of GNNs. Specifically, we design a multi-scale positional encoding strategy to encode the positional information of various distances from the central node. By incorporating it with the self-attention mechanism, the global modeling ability can be enhanced significantly. Meanwhile, we design different embedding strategies for homophilic and heterophilic connections. This mitigates the homophily distribution differences between benign and fraudulent nodes. Moreover, an embedding fusion strategy is designed for multi-relation graphs, which alleviates the distribution bias caused by different relationships. Experiments on three fraud detection datasets demonstrate the superiority of MANDATE.
[528] From Complex Dynamics to DynFormer: Rethinking Transformers for PDEs
Pengyu Lai, Yixiao Chen, Dewu Yang, Rui Wang, Feng Wang, Hui Xu
Main category: cs.LG
TL;DR: DynFormer is a dynamics-informed neural operator that uses specialized modules for different physical scales, combining spectral embedding for large-scale interactions with local-global mixing for small-scale turbulence, achieving superior accuracy and efficiency in PDE solving.
Details
Motivation: Classical PDE solvers are computationally expensive for high-dimensional problems, while existing Transformer-based neural operators treat all spatial points uniformly, ignoring the intrinsic scale separation in physical fields and applying inefficient global attention across all scales.
Method: DynFormer uses spectral embedding to isolate low-frequency modes with Kronecker-structured attention for efficient large-scale interactions, and a Local-Global-Mixing transformation with nonlinear multiplicative frequency mixing to reconstruct small-scale turbulent cascades without global attention costs.
Result: Achieves up to 95% reduction in relative error compared to state-of-the-art baselines across four PDE benchmarks while significantly reducing GPU memory consumption, demonstrating superior accuracy and efficiency.
Conclusion: Embedding first-principles physical dynamics into Transformer architectures creates a scalable, theoretically grounded blueprint for PDE surrogate modeling that respects the multi-scale nature of physical systems.
Abstract: Partial differential equations (PDEs) are fundamental for modeling complex physical systems, yet classical numerical solvers face prohibitive computational costs in high-dimensional and multi-scale regimes. While Transformer-based neural operators have emerged as powerful data-driven alternatives, they conventionally treat all discretized spatial points as uniform, independent tokens. This monolithic approach ignores the intrinsic scale separation of physical fields, applying computationally prohibitive global attention that redundantly mixes smooth large-scale dynamics with high-frequency fluctuations. Rethinking Transformers through the lens of complex dynamics, we propose DynFormer, a novel dynamics-informed neural operator. Rather than applying a uniform attention mechanism across all scales, DynFormer explicitly assigns specialized network modules to distinct physical scales. It leverages a Spectral Embedding to isolate low-frequency modes, enabling a Kronecker-structured attention mechanism to efficiently capture large-scale global interactions with reduced complexity. Concurrently, we introduce a Local-Global-Mixing transformation. This module utilizes nonlinear multiplicative frequency mixing to implicitly reconstruct the small-scale, fast-varying turbulent cascades that are slaved to the macroscopic state, without incurring the cost of global attention. Integrating these modules into a hybrid evolutionary architecture ensures robust long-term temporal stability. Extensive memory-aligned evaluations across four PDE benchmarks demonstrate that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines, while significantly reducing GPU memory consumption. Our results establish that embedding first-principles physical dynamics into Transformer architectures yields a highly scalable, theoretically grounded blueprint for PDE surrogate modeling.
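The job of the spectral embedding, isolating smooth low-frequency modes from fast small-scale fluctuations, is easiest to see on a 1-D periodic field. A toy scale separation via a hard FFT cutoff (illustrative assumption; DynFormer's embedding is learned and its attention operates on the modes, not a fixed mask):

```python
import numpy as np

def split_scales(u, k_cut):
    """Split a periodic 1-D field into low- and high-frequency parts."""
    U = np.fft.fft(u)
    k = np.fft.fftfreq(u.size, d=1.0 / u.size)  # integer wavenumbers
    low = U * (np.abs(k) <= k_cut)              # keep large-scale modes
    u_low = np.fft.ifft(low).real
    return u_low, u - u_low                     # (large-scale, small-scale)
```

In DynFormer's terms, the first component would feed the Kronecker-structured global attention and the second the Local-Global-Mixing module.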
[529] Joint Training Across Multiple Activation Sparsity Regimes
Haotian Wang
Main category: cs.LG
TL;DR: Training neural networks with global top-k activation constraints across multiple sparsity regimes improves generalization on CIFAR-10 without data augmentation.
Details
Motivation: Inspired by biological systems' strong generalization, the paper hypothesizes that robust internal representations should remain effective across both dense and sparse activation regimes. The goal is to explore whether joint training across multiple activation sparsity regimes can improve generalization in deep neural networks.
Method: Introduces a training strategy that applies global top-k constraints to hidden activations and cycles a single model through multiple activation budgets via progressive compression and periodic reset. Uses adaptive keep-ratio control strategies on CIFAR-10 without data augmentation with a WRN-28-4 backbone.
Result: In single-run experiments, two adaptive keep-ratio control strategies both outperform dense baseline training on CIFAR-10 without data augmentation, suggesting improved generalization.
Conclusion: Joint training across multiple activation sparsity regimes may provide a simple and effective route to improved generalization in deep neural networks, inspired by biological systems’ robustness.
Abstract: Generalization in deep neural networks remains only partially understood. Inspired by the stronger generalization tendency of biological systems, we explore the hypothesis that robust internal representations should remain effective across both dense and sparse activation regimes. To test this idea, we introduce a simple training strategy that applies global top-k constraints to hidden activations and repeatedly cycles a single model through multiple activation budgets via progressive compression and periodic reset. Using CIFAR-10 without data augmentation and a WRN-28-4 backbone, we find in single-run experiments that two adaptive keep-ratio control strategies both outperform dense baseline training. These preliminary results suggest that joint training across multiple activation sparsity regimes may provide a simple and effective route to improved generalization.
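The global top-k constraint at the heart of the strategy is a small operation on a hidden activation tensor: keep the k largest-magnitude activations across the whole layer and zero the rest. A hedged sketch (hypothetical helper; the paper's budget cycling and reset schedule sit on top of this):

```python
import numpy as np

def global_topk(h, keep_ratio):
    """Keep only the top-k activations by magnitude across the whole
    tensor (flattened); zero out the rest. Ties at the threshold may
    keep slightly more than k entries."""
    flat = h.ravel()
    k = max(1, int(keep_ratio * flat.size))
    thresh = np.sort(np.abs(flat))[-k]          # k-th largest magnitude
    return np.where(np.abs(h) >= thresh, h, 0.0)
```

Cycling through budgets then amounts to calling this with different `keep_ratio` values (including 1.0 for the dense regime) during training.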
[530] Torus embeddings
Dan Stowell
Main category: cs.LG
TL;DR: Paper introduces toroidal embeddings as an alternative to Euclidean/hyperspherical representations, showing comparable performance while enabling efficient integer-based TinyML implementations.
Details
Motivation: Current deep learning embeddings use Euclidean or hyperspherical spaces, but computers fundamentally use integers with overflow (torus topology). This mismatch wastes representation capacity and complicates efficient embedded implementations.
Method: Adapts deep learning frameworks to create toroidal representations using two strategies, with normalization-based approach showing best stability and performance comparable to L2-normalized hyperspheres.
Result: Torus embeddings maintain desirable quantization properties and comparable performance to hypersphere embeddings, while providing simple pathway to efficient TinyML embedded implementation.
Conclusion: Toroidal embeddings offer viable alternative to hyperspherical representations with similar performance but better alignment with computer hardware, enabling more efficient embedded ML implementations.
Abstract: Many data representations are vectors of continuous values. In particular, deep learning embeddings are data-driven representations, typically either unconstrained in Euclidean space, or constrained to a hypersphere. These may also be translated into integer representations (quantised) for efficient large-scale use. However, the fundamental (and most efficient) numeric representation in the overwhelming majority of existing computers is integers with overflow – and vectors of these integers do not correspond to either of these spaces, but instead to the topology of a (hyper)torus. This mismatch can lead to wasted representation capacity. Here we show that common deep learning frameworks can be adapted, quite simply, to create representations with inherent toroidal topology. We investigate two alternative strategies, demonstrating that a normalisation-based strategy leads to training with desirable stability and performance properties, comparable to a standard hyperspherical L2 normalisation. We also demonstrate that a torus embedding maintains desirable quantisation properties. The torus embedding does not outperform hypersphere embeddings in general, but is comparable, and opens the possibility to train deep embeddings which have an extremely simple pathway to efficient 'TinyML' embedded implementation.
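The topological point, that a vector of wrap-around integers lives on a torus rather than in Euclidean space or on a sphere, can be made concrete with an angular embedding. A sketch under assumed conventions (not necessarily either of the paper's two strategies):

```python
import numpy as np

def torus_embed(z):
    """Interpret each coordinate as an angle; (cos, sin) pairs place
    the vector on a product of circles (a torus), so values differing
    by 2*pi coincide -- mirroring integer overflow wraparound."""
    return np.stack([np.cos(z), np.sin(z)], axis=-1)

def torus_distance(a, b):
    """Per-coordinate geodesic distance on the torus."""
    d = np.angle(np.exp(1j * (a - b)))  # wrap difference into (-pi, pi]
    return np.linalg.norm(d)
```

Under this view, quantising each angle to n levels maps directly onto an n-state integer with overflow, which is the efficient-embedded-implementation pathway the abstract refers to.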
[531] Information Routing in Atomistic Foundation Models: How Equivariance Creates Linearly Disentangled Representations
Joshua Steier
Main category: cs.LG
TL;DR: CPD method reveals atomistic foundation models encode composition and geometry information differently across architectures, with equivariant models showing better linear disentanglement.
Details
Motivation: To understand what information atomistic foundation models encode in their intermediate representations and how this information is organized, particularly the relationship between composition and geometric signals.
Method: Introduced Composition Projection Decomposition (CPD) using QR projection to linearly remove composition signal from learned representations and probe the geometric residual. Analyzed eight models from five architectural families on QM9 molecules and Materials Project crystals.
Result: Found disentanglement gradient: tensor product equivariant architectures (MACE) produce representations where geometry is almost fully linearly accessible after composition removal, while handcrafted descriptors (ANI-2x) entangle information nonlinearly. MACE routes target-specific signal through irreducible representation channels. Gradient boosted tree probes on projected residuals are systematically inflated.
Conclusion: Linearly disentangled representations are more sample-efficient under linear probing, suggesting practical advantages for equivariant architectures beyond raw prediction accuracy. Recommend linear probes as primary metric.
Abstract: What do atomistic foundation models encode in their intermediate representations, and how is that information organized? We introduce Composition Projection Decomposition (CPD), which uses QR projection to linearly remove composition signal from learned representations and probes the geometric residual. Across eight models from five architectural families on QM9 molecules and Materials Project crystals, we find a disentanglement gradient: tensor product equivariant architectures (MACE) produce representations where geometry is almost fully linearly accessible after composition removal ($R^2_{\text{geom}} = 0.782$ for HOMO-LUMO gap), while handcrafted descriptors (ANI-2x) entangle the same information nonlinearly ($R^2_{\text{geom}} = -0.792$ under Ridge; $R^2 = +0.784$ under MLP). MACE routes target-specific signal through irreducible representation channels – dipole to $L = 1$, HOMO-LUMO gap to $L = 0$ – a pattern not observed in ViSNet’s vector-scalar architecture under the same probe. We show that gradient boosted tree probes on projected residuals are systematically inflated, recovering $R^2 = 0.68$–$0.95$ on a purely compositional target, and recommend linear probes as the primary metric. Linearly disentangled representations are more sample-efficient under linear probing, suggesting a practical advantage for equivariant architectures beyond raw prediction accuracy.
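CPD's core operation is a linear projection: orthonormalize the composition features with QR, then subtract their span from the representations, leaving the geometric residual that the probes are fit on. A minimal sketch with hypothetical shapes (n samples, d representation dims, c composition features):

```python
import numpy as np

def composition_residual(H, C):
    """Project representations H (n x d) onto the orthogonal
    complement of the column space of composition features C (n x c),
    leaving the 'geometric residual'."""
    Q, _ = np.linalg.qr(C)        # orthonormal basis for col(C)
    return H - Q @ (Q.T @ H)      # remove the linearly-explainable composition part
```

A linear probe on this residual then measures how much geometric information is linearly accessible once composition is removed, the quantity the paper reports as R²_geom.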
[532] Less Noise, Same Certificate: Retain Sensitivity for Unlearning
Carolin Heinzler, Kasra Malihi, Amartya Sanyal
Main category: cs.LG
TL;DR: Certified machine unlearning methods can use retain sensitivity instead of differential privacy’s global sensitivity to add less noise while achieving the same unlearning guarantees.
Details
Motivation: Existing certified unlearning methods use DP techniques with noise calibrated to global sensitivity, which is overly conservative because unlearning doesn't require protecting retained data privacy.
Method: Define retain sensitivity as worst-case output change over deletions while keeping retained data fixed. Use this instead of global sensitivity to calibrate noise for certified unlearning algorithms.
Result: Retain sensitivity allows same unlearning certificates with less noise. Validated theoretically and empirically on problems including minimum spanning trees, PCA, and ERM.
Conclusion: Retain sensitivity provides more efficient noise calibration for certified unlearning than DP-style global sensitivity, improving utility while maintaining provable guarantees.
Abstract: Certified machine unlearning aims to provably remove the influence of a deletion set $U$ from a model trained on a dataset $S$, by producing an unlearned output that is statistically indistinguishable from retraining on the retain set $R:=S\setminus U$. Many existing certified unlearning methods adapt techniques from Differential Privacy (DP) and add noise calibrated to global sensitivity, i.e., the worst-case output change over all adjacent datasets. We show that this DP-style calibration is often overly conservative for unlearning, based on a key observation: certified unlearning, by definition, does not require protecting the privacy of the retained data $R$. Motivated by this distinction, we define retain sensitivity as the worst-case output change over deletions $U$ while keeping $R$ fixed. While insufficient for DP, retain sensitivity is exactly sufficient for unlearning, allowing for the same certificates with less noise. We validate these reductions in noise theoretically and empirically across several problems, including the weight of minimum spanning trees, PCA, and ERM. Finally, we refine the analysis of two widely used certified unlearning algorithms through the lens of retain sensitivity, leveraging the regularity induced by $R$ to further reduce noise and improve utility.
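The practical payoff is direct: under the standard Gaussian-mechanism calibration, the injected noise scale is proportional to whichever sensitivity the certificate uses, so a smaller retain sensitivity buys proportionally less noise at the same (ε, δ). A toy calibration (the 0.3 retain-sensitivity value is purely illustrative, and the paper's certificates may use different constants):

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (valid for eps < 1):
    sigma is proportional to the sensitivity used in the certificate."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps

sigma_global = gaussian_sigma(1.0, eps=0.5, delta=1e-5)  # DP-style calibration
sigma_retain = gaussian_sigma(0.3, eps=0.5, delta=1e-5)  # smaller sensitivity
# identical (eps, delta) certificate, strictly less noise
```

This is the "same certificate, less noise" trade the title refers to: the certificate's statistical guarantee is unchanged, only the calibration constant shrinks.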
[533] I-CAM-UV: Integrating Causal Graphs over Non-Identical Variable Sets Using Causal Additive Models with Unobserved Variables
Hirofumi Suzuki, Kentaro Kanamori, Takuya Takagi, Thong Pham, Takashi Nicholas Maeda, Shohei Shimizu
Main category: cs.LG
TL;DR: I-CAM-UV integrates causal graphs from multiple datasets with different variable sets by leveraging Causal Additive Models with Unobserved Variables to handle missing variables and confounders.
Details
Motivation: Real-world causal discovery often involves multiple datasets with non-identical variable sets, but existing methods are designed for single datasets. Simple overlapping of causal graphs from individual datasets fails to capture relationships involving variables unobserved in some datasets or confounded by unobserved variables.
Method: Proposes I-CAM-UV approach that uses Causal Additive Models with Unobserved Variables (CAM-UV) to extract causal graphs with information about unobserved variables. The method integrates CAM-UV results by enumerating all consistent causal graphs across datasets and provides an efficient combinatorial search algorithm.
Result: Demonstrates that I-CAM-UV outperforms existing methods by identifying more causal relationships that would be missed by simple overlapping approaches, especially when dealing with datasets having different variable sets and unobserved confounders.
Conclusion: I-CAM-UV provides an effective solution for causal discovery from multiple datasets with different variable sets by properly handling unobserved variables and confounders through integration of CAM-UV results.
Abstract: Causal discovery from observational data is a fundamental tool in various fields of science. While existing approaches are typically designed for a single dataset, we often need to handle multiple datasets with non-identical variable sets in practice. One straightforward approach is to estimate a causal graph from each dataset and construct a single causal graph by overlapping. However, this approach identifies limited causal relationships because unobserved variables in each dataset can be confounders, and some variable pairs may be unobserved in any dataset. To address this issue, we leverage Causal Additive Models with Unobserved Variables (CAM-UV) that provide causal graphs having information related to unobserved variables. We show that the ground truth causal graph has structural consistency with the information of CAM-UV on each dataset. As a result, we propose an approach named I-CAM-UV to integrate CAM-UV results by enumerating all consistent causal graphs. We also provide an efficient combinatorial search algorithm and demonstrate the usefulness of I-CAM-UV against existing methods.
[534] Stabilized Adaptive Loss and Residual-Based Collocation for Physics-Informed Neural Networks
Divyavardhan Singh, Shubham Kamble, Dimple Sonone, Kishor Upla
Main category: cs.LG
TL;DR: Improved PINNs with adaptive loss balancing and residual-based collocation for stiff/shock problems like Burgers’ and Allen-Cahn equations
Details
Motivation: Traditional Physics-Informed Neural Networks (PINNs) have limitations when dealing with problems characterized by high stiffness or shock-dominated dynamics, including unbalanced training and solution inaccuracy even with small physics residuals.
Method: Developed two key improvements: 1) New adaptive loss balancing scheme using smoothed gradient norms to ensure satisfaction of initial and boundary conditions, and 2) Adaptive residual-based collocation scheme to improve accuracy in regions with high physics residuals.
Result: Significant improvement in solution accuracy: For Burgers’ equation, relative L2 error reduced by ~44% compared to traditional PINNs; for Allen-Cahn equation, relative L2 error reduced by ~70%. The method shows consistent satisfaction of physics residuals.
Conclusion: The proposed approach successfully addresses limitations of traditional PINNs for stiff/shock problems through adaptive loss balancing and residual-based collocation, achieving substantial accuracy improvements while maintaining physics consistency.
Abstract: Physics-Informed Neural Networks (PINNs) have been recognized as a mesh-free alternative to solve partial differential equations where physics information is incorporated. However, in dealing with problems characterized by high stiffness or shock-dominated dynamics, traditional PINNs have been found to have limitations, including unbalanced training and inaccuracy in solution, even with small physics residuals. In this research, we seek to address these limitations using the viscous Burgers’ equation with low viscosity and the Allen-Cahn equation as test problems. In addressing unbalanced training, we have developed a new adaptive loss balancing scheme using smoothed gradient norms to ensure satisfaction of initial and boundary conditions. Further, to address inaccuracy in the solution, we have developed an adaptive residual-based collocation scheme to improve the accuracy of solutions in the regions with high physics residuals. The proposed new approach significantly improves solution accuracy with consistent satisfaction of physics residuals. For instance, in the case of Burgers’ equation, the relative L2 error is reduced by about 44 percent compared to traditional PINNs, while for the Allen-Cahn equation, the relative L2 error is reduced by approximately 70 percent. Additionally, we show the trustworthy solution comparison of the proposed method using a robust finite difference solver.
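The smoothed-gradient-norm balancing idea can be sketched in a few lines: track an exponential moving average of each loss term's gradient norm and up-weight terms whose gradients are comparatively weak, so boundary and initial conditions are not drowned out by the PDE residual. This is a common PINN rebalancing pattern; the paper's exact scheme may differ:

```python
import numpy as np

def update_weights(grad_norms, ema, alpha=0.9):
    """Rebalance loss weights from smoothed per-term gradient norms:
    terms with weak (smoothed) gradients receive larger weights, so
    every term contributes comparably to the total update."""
    ema = alpha * ema + (1 - alpha) * np.asarray(grad_norms)
    weights = ema.sum() / (len(ema) * ema)   # inverse-norm weighting, mean 1 when balanced
    return weights, ema
```

The EMA smoothing is what keeps the weights from oscillating with the batch-to-batch noise in the gradient norms.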
[535] Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
Enea Monzio Compagnoni, Alessandro Stanghellini, Rustem Islamov, Aurelien Lucchi, Anastasiia Koloskova
Main category: cs.LG
TL;DR: The paper analyzes differential privacy in optimization, comparing DP-SGD and DP-SignSGD through stochastic differential equations, showing DP-SignSGD has better privacy-utility trade-off in high-privacy regimes and more practical hyperparameter transfer across privacy levels.
Details
Motivation: Differential Privacy is becoming essential for large-scale training due to privacy regulations, but there's a need to understand how DP noise interacts with adaptive optimization methods and their practical deployment across different privacy levels.
Method: The authors use stochastic differential equations (SDE) to analyze private optimizers, specifically DP-SGD and DP-SignSGD under per-example clipping. They examine convergence behavior under fixed hyperparameters versus optimal learning rates, and extend empirical analysis to DP-Adam.
Result: DP-SGD converges at O(1/ε²) privacy-utility trade-off with speed independent of ε, while DP-SignSGD converges at speed linear in ε with O(1/ε) trade-off, dominating in high-privacy or large batch noise regimes. Under optimal learning rates, both achieve comparable asymptotic performance, but DP-SGD’s optimal learning rate scales linearly with ε while DP-SignSGD’s is essentially ε-independent.
Conclusion: Adaptive methods like DP-SignSGD are more practical as their hyperparameters transfer across privacy levels with little re-tuning, making them more suitable for real-world deployment of differentially private machine learning.
Abstract: Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a Privacy-Utility Trade-Off of $\mathcal{O}(1/\varepsilon^2)$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $\mathcal{O}(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory across training and test metrics, and empirically extend from DP-SignSGD to DP-Adam.
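A sketch of one DP-SignSGD step as described: per-example clipping, Gaussian noise, then the sign nonlinearity, the normalization step whose interaction with the DP noise drives the sharper-tail result (the noise-scaling convention here is an assumption for illustration):

```python
import numpy as np

def dp_signsgd_step(w, per_example_grads, clip, sigma, lr, rng):
    """One DP-SignSGD update: clip each example's gradient to norm
    <= clip, average, add Gaussian noise, then take the elementwise
    sign (a limiting case of Adam-style second-moment normalization)."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip / norms)
    noise = rng.normal(0.0, sigma * clip / len(per_example_grads), size=w.shape)
    return w - lr * np.sign(clipped.mean(axis=0) + noise)
```

Dropping the `np.sign` recovers DP-SGD from the same ingredients, which is exactly the pair the SDE analysis contrasts.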
[536] SynthCharge: An Electric Vehicle Routing Instance Generator with Feasibility Screening to Enable Learning-Based Optimization and Benchmarking
Mertcan Daysalilar, Fuat Uyguroglu, Gabriel Nicolosi, Adam Meyers
Main category: cs.LG
TL;DR: SynthCharge is a parametric generator for creating diverse, feasibility-screened Electric Vehicle Routing Problem with Time Windows (EVRPTW) instances to address limitations of static benchmarks for evaluating learning-based routing models.
Details
Motivation: Existing EVRPTW benchmark datasets are often static and lack verifiable feasibility, which restricts reproducible evaluation of learning-based routing models. There's a need for dynamic benchmarking infrastructure to systematically evaluate neural routing approaches.
Method: SynthCharge generates diverse EVRPTW instances across varying spatiotemporal configurations with scalable customer counts (5-500 customers). It integrates instance geometry with adaptive energy capacity scaling and range-aware charging station placement, and uses systematic filtering through a fast feasibility screening process to guarantee structural validity.
Result: The generator can produce large-scale instances up to 500 customers, though experiments focus on 5-100 customers. It provides the dynamic benchmarking infrastructure needed for systematic evaluation of neural routing approaches.
Conclusion: SynthCharge addresses the limitations of static benchmarks by providing a parametric generator that produces diverse, feasibility-screened EVRPTW instances, enabling better evaluation of learning-based routing models.
Abstract: The electric vehicle routing problem with time windows (EVRPTW) extends the classical VRPTW by introducing battery capacity constraints and charging station decisions. Existing benchmark datasets are often static and lack verifiable feasibility, which restricts reproducible evaluation of learning-based routing models. We introduce SynthCharge, a parametric generator that produces diverse, feasibility-screened EVRPTW instances across varying spatiotemporal configurations and scalable customer counts. While SynthCharge can currently generate large-scale instances of up to 500 customers, we focus our experiments on sizes ranging from 5 to 100 customers. Unlike static benchmark suites, SynthCharge integrates instance geometry with adaptive energy capacity scaling and range-aware charging station placement. To guarantee structural validity, the generator systematically filters out unsolvable instances through a fast feasibility screening process. Ultimately, SynthCharge provides the dynamic benchmarking infrastructure needed to systematically evaluate the robustness of emerging neural routing and data-driven approaches.
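Feasibility screening of this kind is typically a stack of cheap necessary conditions. One plausible range-aware check, shown here as a hypothetical stand-in for the paper's screening rules: reject any instance with a customer out of battery range of every facility a vehicle could recharge at.

```python
import numpy as np

def passes_range_screen(customers, depot, stations, max_range):
    """Fast necessary-condition check: reject an instance if any
    customer is farther than max_range from every facility (depot or
    charging station). Passing does not guarantee solvability."""
    facilities = np.vstack([depot[None, :], stations])
    d = np.linalg.norm(customers[:, None, :] - facilities[None, :, :], axis=2)
    return bool((d.min(axis=1) <= max_range).all())
```

Because this is only a necessary condition, a generator would combine several such tests (time-window ordering, capacity bounds) before accepting an instance.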
[537] Coalgebras for categorical deep learning: Representability and universal approximation
Dragan Mašulović
Main category: cs.LG
TL;DR: A categorical framework using coalgebraic foundations to generalize equivariant representations in deep learning, connecting abstract invariant behavior specifications to neural architecture realizations.
Details
Motivation: To develop a more abstract, domain-independent framework for understanding equivariant representations in deep learning beyond the specific group action context of geometric deep learning, using category theory to bridge abstract invariant behavior specifications with concrete neural architectures.
Method: Develops a coalgebraic foundation for equivariant representation, showing how embeddings of datasets (functors from SET to VECT) and invariant behaviors (endofunctors on SET) can be lifted to compatible endofunctors on VECT. Establishes a universal approximation theorem for equivariant maps in this generalized setting.
Result: 1) Demonstrated that given dataset embeddings and invariant behaviors, there exists corresponding endofunctors on VECT that recover analogous invariant behavior on embedded data. 2) Proved a universal approximation theorem showing continuous equivariant functions can be approximated within this coalgebraic framework for broad classes of symmetries.
Conclusion: Provides a categorical bridge between abstract specification of invariant behavior and its concrete realization in neural architectures, offering a more general framework than geometric deep learning for reasoning about equivariant representations.
Abstract: Categorical deep learning (CDL) has recently emerged as a framework that leverages category theory to unify diverse neural architectures. While geometric deep learning (GDL) is grounded in the specific context of invariants of group actions, CDL aims to provide domain-independent abstractions for reasoning about models and their properties. In this paper, we contribute to this program by developing a coalgebraic foundation for equivariant representation in deep learning, as classical notions of group actions and equivariant maps are naturally generalized by the coalgebraic formalism. Our first main result demonstrates that, given an embedding of data sets formalized as a functor from SET to VECT, and given a notion of invariant behavior on data sets modeled by an endofunctor on SET, there is a corresponding endofunctor on VECT that is compatible with the embedding in the sense that this lifted functor recovers the analogous notion of invariant behavior on the embedded data. Building on this foundation, we then establish a universal approximation theorem for equivariant maps in this generalized setting. We show that continuous equivariant functions can be approximated within our coalgebraic framework for a broad class of symmetries. This work thus provides a categorical bridge between the abstract specification of invariant behavior and its concrete realization in neural architectures.
[538] Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning
Adam Watts, Andrew Jeon, Destry Newton, Ryan Bowering
Main category: cs.LG
TL;DR: A conditional variational autoencoder (CVAE) learns to generate acceleration time series from shock response spectra, providing fast, high-fidelity inverse reconstruction without iterative optimization.
Details
Motivation: Reconstructing time-domain acceleration signals from shock response spectra (SRS) is ill-posed due to nonlinear, many-to-one mapping. Traditional iterative optimization methods are computationally expensive and limited by predefined basis functions.
Method: Proposes a conditional variational autoencoder (CVAE) that learns a data-driven inverse mapping from SRS to acceleration time series. The model generates signals consistent with target spectra without iterative optimization once trained.
Result: The CVAE achieves improved spectral fidelity compared to classical techniques, strong generalization to unseen spectra, and inference speeds three to six orders of magnitude faster than conventional methods.
Conclusion: Deep generative modeling provides a scalable and efficient approach for inverse SRS reconstruction, overcoming limitations of traditional iterative optimization methods.
Abstract: The shock response spectrum (SRS) is widely used to characterize the response of single-degree-of-freedom (SDOF) systems to transient accelerations. Because the mapping from acceleration time history to SRS is nonlinear and many-to-one, reconstructing time-domain signals from a target spectrum is inherently ill-posed. Conventional approaches address this problem through iterative optimization, typically representing signals as sums of exponentially decayed sinusoids, but these methods are computationally expensive and constrained by predefined basis functions. We propose a conditional variational autoencoder (CVAE) that learns a data-driven inverse mapping from SRS to acceleration time series. Once trained, the model generates signals consistent with prescribed target spectra without requiring iterative optimization. Experiments demonstrate improved spectral fidelity relative to classical techniques, strong generalization to unseen spectra, and inference speeds three to six orders of magnitude faster. These results establish deep generative modeling as a scalable and efficient approach for inverse SRS reconstruction.
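The inverse problem is hard precisely because the forward mapping (acceleration history to SRS) is nonlinear and many-to-one. A minimal NumPy sketch of that forward mapping, using simple semi-implicit Euler integration of damped SDOF oscillators (an illustration only, not the paper's CVAE or a production SRS routine; the frequencies, damping ratio, and half-sine pulse below are arbitrary choices):

```python
import numpy as np

def srs(accel, dt, freqs, zeta=0.05):
    """Approximate maximax SRS: peak absolute acceleration of damped SDOF
    oscillators driven by a base-acceleration history (semi-implicit Euler)."""
    peaks = []
    for fn in freqs:
        wn = 2.0 * np.pi * fn
        x = v = 0.0          # relative displacement / velocity
        peak = 0.0
        for a in accel:
            # x'' + 2*zeta*wn*x' + wn^2 * x = -a   (relative coordinates)
            acc_rel = -a - 2.0 * zeta * wn * v - wn ** 2 * x
            v += acc_rel * dt
            x += v * dt
            # absolute mass acceleration = x'' + base acceleration
            peak = max(peak, abs(acc_rel + a))
        peaks.append(peak)
    return np.array(peaks)

# a half-sine shock pulse, a classic many-to-one input
dt = 1e-4
t = np.arange(0.0, 0.1, dt)
pulse = np.where(t < 0.01, 100.0 * np.sin(np.pi * t / 0.01), 0.0)
spectrum = srs(pulse, dt, freqs=[50, 100, 500, 1000])
print(spectrum.shape)  # (4,)
```

Many distinct time histories map to (nearly) the same spectrum under this function, which is why the CVAE's one-shot inverse sampling avoids the iterative search that classical methods need.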
[539] Guiding Sparse Neural Networks with Neurobiological Principles to Elicit Biologically Plausible Representations
Patrick Inoue, Florian Röhrbein, Andreas Knoblauch
Main category: cs.LG
TL;DR: A biologically inspired learning rule under which neurobiological principles (sparsity, lognormal weight distributions, Dale’s law) emerge naturally, improving robustness, generalization, and few-shot learning without explicit enforcement.
Details
Motivation: Deep neural networks struggle with generalization, few-shot learning, and continuous adaptation compared to biological neural systems. The authors aim to address these limitations by incorporating neurobiologically inspired assumptions into neural network learning.
Method: Introduces a biologically inspired learning rule that naturally integrates neurobiological principles including sparsity, lognormal weight distributions, and adherence to Dale’s law without requiring explicit enforcement. The model aligns with these core neurobiological principles to enhance learning capabilities.
Result: The approach enhances robustness against adversarial attacks and demonstrates superior generalization, particularly in few-shot learning scenarios. The integration of constraints leads to the emergence of biologically plausible neural representations.
Conclusion: Incorporating neurobiological assumptions into neural network design is effective for improving learning capabilities. The approach could extend from feature-specific to task-specific encoding and offer insights into neural resource allocation for complex tasks.
Abstract: While deep neural networks (DNNs) have achieved remarkable performance in tasks such as image recognition, they often struggle with generalization, learning from few examples, and continuous adaptation - abilities inherent in biological neural systems. These challenges arise due to DNNs’ failure to emulate the efficient, adaptive learning mechanisms of biological networks. To address these issues, we explore the integration of neurobiologically inspired assumptions in neural network learning. This study introduces a biologically inspired learning rule that naturally integrates neurobiological principles, including sparsity, lognormal weight distributions, and adherence to Dale’s law, without requiring explicit enforcement. By aligning with these core neurobiological principles, our model enhances robustness against adversarial attacks and demonstrates superior generalization, particularly in few-shot learning scenarios. Notably, integrating these constraints leads to the emergence of biologically plausible neural representations, underscoring the efficacy of incorporating neurobiological assumptions into neural network design. Preliminary results suggest that this approach could extend from feature-specific to task-specific encoding, potentially offering insights into neural resource allocation for complex tasks.
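The properties the paper reports as emergent can be stated concretely. The sketch below enforces them explicitly, just to show what lognormal weight magnitudes and Dale's law (one fixed sign per presynaptic unit) look like on a weight matrix; the paper's point is that its learning rule produces these properties without such enforcement:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 6, 4
# Dale's law: each presynaptic unit has one fixed sign for all its weights.
signs = rng.choice([1.0, -1.0], size=n_in)
# Lognormal magnitudes (strictly positive), one per synapse.
mag = rng.lognormal(mean=-1.0, sigma=1.0, size=(n_out, n_in))
W = mag * signs  # broadcast: column j inherits the sign of input unit j
# verify: every column is all-excitatory or all-inhibitory
col_ok = all((W[:, j] > 0).all() or (W[:, j] < 0).all() for j in range(n_in))
print(col_ok)  # True
```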
[540] On Geometry Regularization in Autoencoder Reduced-Order Models with Latent Neural ODE Dynamics
Mikhail Osipov
Main category: cs.LG
TL;DR: Regularization methods for latent representations in reduced-order models show that Stiefel projection consistently improves latent dynamics training, while other smoothness-focused methods often hinder long-horizon performance.
Details
Motivation: The paper investigates geometric regularization strategies for learned latent representations in encoder-decoder reduced-order models, aiming to improve the quality and stability of latent dynamics modeling for physical systems like the advection-diffusion-reaction equation.
Method: The study evaluates four regularization approaches during autoencoder pre-training: (a) near-isometry regularization of decoder Jacobian, (b) stochastic decoder gain penalty, (c) second-order directional curvature penalty, and (d) Stiefel projection of the first decoder layer. Latent dynamics are modeled using neural ODE in a fixed experimental setting for the ADR equation.
Result: Methods (a)-(c) often produce latent representations that make subsequent latent-dynamics training more difficult, especially for long-horizon rollouts, despite improving local decoder smoothness. In contrast, Stiefel projection (d) consistently improves conditioning-related diagnostics and yields better rollout performance.
Conclusion: The downstream impact of latent-geometry mismatch outweighs the benefits of improved decoder smoothness in this setting, suggesting that Stiefel projection is more effective for stable latent dynamics modeling than smoothness-focused regularization methods.
Abstract: We investigate geometric regularization strategies for learned latent representations in encoder–decoder reduced-order models. In a fixed experimental setting for the advection–diffusion–reaction (ADR) equation, we model latent dynamics using a neural ODE and evaluate four regularization approaches applied during autoencoder pre-training: (a) near-isometry regularization of the decoder Jacobian, (b) a stochastic decoder gain penalty based on random directional gains, (c) a second-order directional curvature penalty, and (d) Stiefel projection of the first decoder layer. Across multiple seeds, we find that (a)–(c) often produce latent representations that make subsequent latent-dynamics training with a frozen autoencoder more difficult, especially for long-horizon rollouts, even when they improve local decoder smoothness or related sensitivity proxies. In contrast, (d) consistently improves conditioning-related diagnostics of the learned latent dynamics and tends to yield better rollout performance. We discuss the hypothesis that, in this setting, the downstream impact of latent-geometry mismatch outweighs the benefits of improved decoder smoothness.
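For readers unfamiliar with the term, one common way to realize a Stiefel projection is the polar factor of the SVD, which maps a matrix to the nearest matrix with orthonormal columns. A sketch under that assumption (the paper's exact projection scheme for the first decoder layer may differ):

```python
import numpy as np

def stiefel_project(W):
    """Nearest matrix with orthonormal columns: the polar factor U @ Vt
    from the thin SVD of W."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))   # a tall weight matrix
Q = stiefel_project(W)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True
```

Orthonormal columns keep the layer well-conditioned, which is consistent with the improved conditioning diagnostics the paper reports for approach (d).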
[541] Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment
Maria R. Lima, Alexander Capstick, Fatemeh Geranmayeh, Ramin Nilforooshan, Maja Matarić, Ravi Vaidyanathan, Payam Barnaghi
Main category: cs.LG
TL;DR: Explainable machine learning on linguistic features from speech screens for Alzheimer’s disease and predicts cognitive severity, validated on real-world in-residence data.
Details
Motivation: Address the unmet need for timely cognitive impairment assessment by developing scalable, interpretable speech biomarkers using machine learning that can generalize to real-world datasets.
Method: Used Random Forest models trained on linguistic features from DementiaBank speech data (N=291) for ADRD detection and MMSE score prediction, validated on in-residence pilot data (N=22), with risk stratification and feature importance analysis.
Result: ADRD detection achieved 69.4% sensitivity and 83.3% specificity on benchmark data, with 70.0% sensitivity and 52.5% specificity on pilot data. MMSE prediction had mean absolute error of 3.7 on benchmark and 3.3 on pilot data. Risk stratification improved specificity by 13%.
Conclusion: Explainable ML with linguistic speech features shows promise for cognitive health monitoring and triage, with potential for integration with conversational technology for early screening and intervention.
Abstract: Timely and accurate assessment of cognitive impairment remains a major unmet need. Speech biomarkers offer a scalable, non-invasive, cost-effective solution for automated screening. However, the clinical utility of machine learning (ML) remains limited by interpretability and generalisability to real-world speech datasets. We evaluate explainable ML for screening of Alzheimer’s disease and related dementias (ADRD) and severity prediction using benchmark DementiaBank speech (N = 291, 64% female, 69.8 (SD = 8.6) years). We validate generalisability on pilot data collected in-residence (N = 22, 59% female, 76.2 (SD = 8.0) years). To enhance clinical utility, we stratify risk for actionable triage and assess linguistic feature importance. We show that a Random Forest trained on linguistic features for ADRD detection achieves a mean sensitivity of 69.4% (95% confidence interval (CI) = 66.4-72.5) and specificity of 83.3% (78.0-88.7). On pilot data, this model yields a mean sensitivity of 70.0% (58.0-82.0) and specificity of 52.5% (39.3-65.7). For prediction of Mini-Mental State Examination (MMSE) scores, a Random Forest Regressor achieves a mean absolute MMSE error of 3.7 (3.7-3.8), with comparable performance of 3.3 (3.1-3.5) on pilot data. Risk stratification improves specificity by 13% on the test set, offering a pathway for clinical triage. Linguistic features associated with ADRD include increased use of pronouns and adverbs, greater disfluency, reduced analytical thinking, lower lexical diversity, and fewer words that reflect a psychological state of completion. Our predictive modelling shows promise for integration with conversational technology at home to monitor cognitive health and triage higher-risk individuals, enabling early screening and intervention.
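The reported sensitivity/specificity and the risk-stratification step can be made concrete with two small helpers. Everything below (thresholds, labels, example data) is hypothetical, purely to show the metrics and the three-way triage pattern, not the paper's pipeline:

```python
def sens_spec(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = ADRD-positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def stratify(prob, lo=0.3, hi=0.7):
    """Three-way triage on a predicted risk score (thresholds hypothetical)."""
    return "high" if prob >= hi else "low" if prob <= lo else "review"

sens, spec = sens_spec([1, 1, 0, 0], [1, 0, 0, 1])
print(sens, spec)                               # 0.5 0.5
print([stratify(p) for p in (0.1, 0.5, 0.9)])   # ['low', 'review', 'high']
```

Routing the uncertain middle band to human review is one way a stratification scheme can raise specificity on the confidently-scored remainder, as in the 13% improvement the paper reports.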
[542] Speculative Speculative Decoding
Tanishq Kumar, Tri Dao, Avner May
Main category: cs.LG
TL;DR: SSD (speculative speculative decoding) parallelizes speculation and verification in speculative decoding to eliminate drafting overhead, achieving 2x speedup over optimized speculative decoding.
Details
Motivation: Autoregressive decoding is slow due to sequential nature. Speculative decoding accelerates inference but still has sequential dependence between speculation and verification phases.
Method: Introduces speculative speculative decoding (SSD) where draft model predicts verification outcomes and prepares speculations pre-emptively during ongoing verification. Saguaro algorithm solves three key challenges: predicting verification outcomes, preparing multiple speculations, and managing speculation cache.
Result: The Saguaro implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open-source inference engines.
Conclusion: SSD effectively parallelizes speculation and verification, eliminating drafting overhead and significantly accelerating inference beyond existing speculative decoding methods.
Abstract: Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
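The control flow can be illustrated with a toy: while "verification" runs, pre-prepare one speculation per predicted outcome (keyed by accepted-prefix length), so a correct outcome prediction means the next speculation is ready with zero drafting latency. Everything below, including the string-matching "models" and the outcome predictor, is hypothetical and exists only to show the caching pattern, not the Saguaro algorithm:

```python
# Toy "models": the target checks guesses against a fixed string;
# the draft copies it (a perfect draft, for illustration).
TEXT = "hello world"

def target_verify(pos, guess):
    """Slow 'verification': length of the accepted prefix of guess."""
    n = 0
    for g, t in zip(guess, TEXT[pos:]):
        if g != t:
            break
        n += 1
    return n

def draft_outcomes(guess):
    """Predict likely verification outcomes (accepted-prefix lengths)."""
    return [len(guess), len(guess) - 1]

def draft_speculate(pos):
    """Cheap draft of the next few tokens."""
    return TEXT[pos:pos + 3]

pos, out, hits = 0, "", 0
while pos < len(TEXT):
    guess = draft_speculate(pos)
    # SSD: prepare speculations for each predicted outcome *during*
    # verification, keyed by how many tokens end up accepted.
    cache = {k: draft_speculate(pos + k) for k in draft_outcomes(guess)}
    accepted = target_verify(pos, guess)
    if accepted in cache:
        hits += 1  # next speculation already prepared: no drafting stall
    out += guess[:accepted] or TEXT[pos]  # fall back on full mismatch
    pos += max(accepted, 1)
print(out, hits)  # hello world 4
```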
[543] Learning Demographic-Conditioned Mobility Trajectories with Aggregate Supervision
Jessie Z. Li, Zhiqing Hong, Toru Shirakawa, Serina Chang
Main category: cs.LG
TL;DR: ATLAS is a weakly supervised approach for generating demographic-conditioned human mobility trajectories using unlabeled individual trajectories, region-level aggregated mobility features, and census demographic compositions.
Details
Motivation: Existing trajectory generation models fail to capture demographic heterogeneity because most trajectory datasets lack demographic labels, creating a gap in data availability for studying how different demographic groups exhibit significantly different mobility patterns.
Method: ATLAS trains a trajectory generator using only unlabeled individual trajectories, region-level aggregated mobility features, and region-level demographic compositions from census data. It fine-tunes the generator so that simulated mobility matches observed regional aggregates while conditioning on demographics.
Result: Experiments on real trajectory data with demographic labels show ATLAS substantially improves demographic realism over baselines (JSD ↓ 12%–69%) and closes much of the gap to strongly supervised training. Theoretical analyses identify key factors including demographic diversity across regions and aggregate feature informativeness.
Conclusion: ATLAS successfully addresses the data gap for demographic-conditioned trajectory generation using weak supervision, enabling better modeling of heterogeneous mobility patterns across demographic groups without requiring labeled individual data.
Abstract: Human mobility trajectories are widely studied in public health and social science, where different demographic groups exhibit significantly different mobility patterns. However, existing trajectory generation models rarely capture this heterogeneity because most trajectory datasets lack demographic labels. To address this gap in data, we propose ATLAS, a weakly supervised approach for demographic-conditioned trajectory generation using only (i) individual trajectories without demographic labels, (ii) region-level aggregated mobility features, and (iii) region-level demographic compositions from census data. ATLAS trains a trajectory generator and fine-tunes it so that simulated mobility matches observed regional aggregates while conditioning on demographics. Experiments on real trajectory data with demographic labels show that ATLAS substantially improves demographic realism over baselines (JSD $\downarrow$ 12%–69%) and closes much of the gap to strongly supervised training. We further develop theoretical analyses for when and why ATLAS works, identifying key factors including demographic diversity across regions and the informativeness of the aggregate feature, paired with experiments demonstrating the practical implications of our theory. We release our code at https://github.com/schang-lab/ATLAS.
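The reported "JSD ↓ 12%–69%" refers to the Jensen-Shannon divergence between generated and ground-truth distributions. A sketch of the metric itself (the demographic-conditioned evaluation protocol around it is the paper's and is not reproduced here):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between discrete distributions;
    symmetric, and bounded in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # 0 * log(0) contributes nothing
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([1, 0], [1, 0]))  # 0.0  (identical distributions)
print(jsd([1, 0], [0, 1]))  # 1.0  (disjoint supports)
```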
[544] ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification
Congjing Zhang, Feng Lin, Xinyi Zhao, Pei Guo, Wei Li, Lin Chen, Chaoyue Zhao, Shuai Huang
Main category: cs.LG
TL;DR: ALARM is a UQ-supported MLLM-based visual anomaly detection framework that integrates uncertainty quantification with quality-assurance techniques for robust performance in complex environments.
Details
Motivation: In complex environments, visual anomalies are often highly contextual and ambiguous, making uncertainty quantification crucial for reliable MLLM-based visual anomaly detection systems.
Method: ALARM integrates uncertainty quantification with quality-assurance techniques including reasoning chain, self-reflection, and MLLM ensemble, designed based on a rigorous probabilistic inference pipeline and computational process.
Result: Extensive empirical evaluations using real-world smart-home benchmark data and wound image classification data show ALARM’s superior performance and generic applicability across different domains for reliable decision-making.
Conclusion: ALARM provides a robust UQ-supported MLLM framework for visual anomaly detection that addresses uncertainty in complex environments and demonstrates cross-domain applicability.
Abstract: The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, the anomalies are sometimes highly contextual and also ambiguous, and thereby, uncertainty quantification (UQ) is a crucial capacity for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques like reasoning chain, self-reflection, and MLLM ensemble for robust and accurate performance and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using the real-world smart-home benchmark data and wound image classification data, which shows ALARM’s superior performance and its generic applicability across different domains for reliable decision-making.
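One standard UQ proxy for an ensemble, much simpler than the probabilistic pipeline described above but consistent with its spirit, is the predictive entropy of the averaged class distribution: models that disagree produce a flatter mean and hence higher entropy. A sketch (the aggregation rule is an assumption, not the paper's exact computation):

```python
import numpy as np

def ensemble_uncertainty(probs):
    """Predictive entropy (bits) of an ensemble's mean class distribution.
    probs: (n_models, n_classes), each row a per-model distribution."""
    mean = np.asarray(probs, dtype=float).mean(axis=0)
    nz = mean[mean > 0]  # skip zero-probability classes
    return float(-(nz * np.log2(nz)).sum())

# three MLLMs that agree vs. three that disagree on "anomaly / normal"
agree = ensemble_uncertainty([[0.9, 0.1], [0.95, 0.05], [0.9, 0.1]])
disagree = ensemble_uncertainty([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(agree < disagree)  # True
```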
[545] DiaBlo: Diagonal Blocks Are Sufficient For Finetuning
Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Main category: cs.LG
TL;DR: DiaBlo is a parameter-efficient fine-tuning method that updates only diagonal blocks of selected weight matrices, achieving competitive performance with better convergence stability than LoRA while maintaining memory efficiency.
Details
Motivation: To address performance gaps between parameter-efficient fine-tuning (PEFT) methods and full-model fine-tuning, while avoiding the convergence issues and complex initialization schemes required by low-rank adaptation methods like LoRA.
Method: DiaBlo updates only the diagonal blocks of selected model weight matrices instead of using low-rank matrix products. This eliminates the need for auxiliary initialization schemes or customized optimization strategies, leading to more stable convergence.
Result: DiaBlo achieves competitive accuracy across commonsense reasoning, arithmetic reasoning, code generation, and safety alignment tasks while preserving high memory efficiency and fast fine-tuning speed comparable to LoRA.
Conclusion: Fine-tuning only diagonal blocks is sufficient for strong and consistent performance, offering a simpler yet effective alternative to low-rank adaptation methods with theoretical guarantees and practical advantages.
Abstract: Fine-tuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. Moreover, we provide theoretical guarantees showing that, under mild low-rank conditions, DiaBlo is more expressive than LoRA in the linear problem and converges to a stationary point of the general nonlinear full fine-tuning. Through extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, we show that fine-tuning only diagonal blocks is sufficient for strong and consistent performance. DiaBlo not only achieves competitive accuracy but also preserves high memory efficiency and fast fine-tuning speed. Codes are available at https://github.com/ziyangjoy/DiaBlo.
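The core idea, restricting updates to diagonal blocks, can be sketched with a boolean mask. The block size and learning rate below are arbitrary; the paper applies this to selected weight matrices of an LLM rather than a toy matrix:

```python
import numpy as np

def diag_block_mask(n, block):
    """Boolean mask selecting the diagonal blocks of an n x n matrix."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(0, n, block):
        j = min(i + block, n)
        mask[i:j, i:j] = True
    return mask

n, block = 8, 2
W = np.zeros((n, n))
grad = np.ones((n, n))          # stand-in for a gradient
mask = diag_block_mask(n, block)
W -= 0.1 * grad * mask          # only diagonal-block entries are updated
print(int(mask.sum()))          # 16 trainable entries out of 64
```

Unlike LoRA, there is no low-rank product and no auxiliary matrices to initialize; the trainable set is just the masked entries.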
[546] Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
Main category: cs.LG
TL;DR: Theoretical framework for applying Minimum Description Length principle to Transformers via asymptotically optimal description length objectives with Kolmogorov complexity foundations, showing Transformers can achieve optimal compression up to additive constants.
Details
Motivation: The MDL principle provides a formal framework for Occam's razor in ML, but applying it to neural networks like Transformers is challenging due to lack of principled measures for model complexity. Need theoretical foundations for compression and generalization in neural networks.
Method: Introduces asymptotically optimal description length objectives based on Kolmogorov complexity theory. Proves such objectives exist for Transformers by demonstrating their computational universality. Constructs tractable variational objective using adaptive Gaussian mixture prior for empirical analysis.
Result: Proves minimizers achieve optimal compression up to additive constant as model resources increase. Shows variational objective selects low-complexity solutions with strong generalization on algorithmic tasks, but standard optimizers fail to find such solutions from random initialization.
Conclusion: Provides theoretical framework for description length objectives with strong asymptotic guarantees, outlining path toward training neural networks for greater compression and generalization, though optimization challenges remain.
Abstract: The Minimum Description Length (MDL) principle offers a formal framework for applying Occam’s razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
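The flavor of a description-length objective can be shown with a crude two-part code: a fixed per-parameter cost plus the data's negative log-likelihood in bits under a unit-variance Gaussian noise model. This is far simpler than the paper's variational objective with an adaptive Gaussian mixture prior; it is purely an illustration of the trade-off MDL formalizes:

```python
import numpy as np

def description_length_bits(residuals, n_params, bits_per_param=32):
    """Crude two-part code: parameter cost plus data cost (negative
    log-likelihood in bits) under unit-variance Gaussian noise."""
    nll_nats = 0.5 * np.sum(residuals ** 2) \
        + 0.5 * len(residuals) * np.log(2 * np.pi)
    return n_params * bits_per_param + nll_nats / np.log(2)

x = np.linspace(0, 1, 100)
y = 2 * x
simple = description_length_bits(y - 2 * x, n_params=1)       # 1-param fit
complex_ = description_length_bits(y - 2 * x, n_params=1000)  # same fit, more params
print(simple < complex_)  # True: equal data cost, lower model cost wins
```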
[547] A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design
Rui Ai, Boxiang Lyu, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2210.10278 returned HTTP 429 (rate limited).
[548] SPARLING: Learning Latent Representations with Extremely Sparse Activations
Kavi Gupta, Osbert Bastani, Armando Solar-Lezama
Main category: cs.LG
Summary unavailable: the arXiv API request for 2302.01976 returned HTTP 429 (rate limited).
[549] Dynamic Deep-Reinforcement-Learning Algorithm in Partially Observable Markov Decision Processes
Saki Omi, Hyo-Sang Shin, Namhoon Cho, Antonios Tsourdos
Main category: cs.LG
Summary unavailable: the arXiv API request for 2307.15931 returned HTTP 429 (rate limited).
[550] Making informed decisions in cutting tool maintenance in milling: A KNN-based model agnostic approach
Revati M. Wahul, Aditya M. Rahalkar, Om M. Khare, Abhishek D. Patange, Rohan N. Soman
Main category: cs.LG
Summary unavailable: the arXiv API request for 2310.14629 returned HTTP 429 (rate limited).
[551] Absolute abstraction: a renormalisation group approach
Carlo Orientale Caputo, Elias Seiffert, Enrico Frausin, Matteo Marsili
Main category: cs.LG
TL;DR: The paper argues that data breadth, not just network depth, is crucial for developing truly abstract representations in neural networks, using renormalization group theory and Hierarchical Feature Model as theoretical framework.
Details
Motivation: While depth in neural networks is known to capture abstract features, the authors argue that depth alone is insufficient for truly abstract representations. They propose that the breadth of training data is equally crucial for developing abstract representations.
Method: The authors use a renormalization group approach where representations expand to encompass broader data sets. They identify the Hierarchical Feature Model as the unique fixed point of this transformation, representing absolutely abstract representations. They test this theory using Deep Belief Networks and auto-encoders trained on data of varying breadth.
Result: Numerical experiments show that neural network representations approach the Hierarchical Feature Model as both data breadth increases and network depth increases, confirming theoretical predictions about the dual importance of data breadth and network architecture.
Conclusion: Truly abstract representations in neural networks require both sufficient network depth and broad training data. The Hierarchical Feature Model provides a theoretical framework for understanding how representations become abstract through the interplay of architecture and data diversity.
Abstract: Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. It is well known that abstraction emerges with depth in neural networks, where deep layers capture abstract characteristics of data by combining lower level features encoded in shallow layers (e.g. edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation – the Hierarchical Feature Model – as a candidate for a representation which is absolutely abstract. This theoretical picture is tested in numerical experiments based on Deep Belief Networks and auto-encoders trained on data of different breadth. These show that representations in neural networks approach the Hierarchical Feature Model as the data get broader and as depth increases, in agreement with theoretical predictions.
[552] Learning Lagrangian Interaction Dynamics with Sampling-Based Model Order Reduction
Hrishikesh Viswanath, Yue Chang, Aleksey Panas, Julius Berner, Peter Yichen Chen, Aniket Bera
Main category: cs.LG
Summary unavailable: the arXiv API request for 2407.03925 returned HTTP 429 (rate limited).
[553] Few-shot Model Extraction Attacks against Sequential Recommender Systems
Hui Zhang, Fu Liu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2411.11677 returned HTTP 429 (rate limited).
[554] Combinatorial Rising Bandits
Seockbean Song, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2412.00798 returned HTTP 429 (rate limited).
[555] Adversarial Attacks in Weight-Space Classifiers
Tamir Shor, Ethan Fetaya, Chaim Baskin, Alex Bronstein
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.20314 returned HTTP 429 (rate limited).
[556] Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.22165 returned HTTP 429 (rate limited).
[557] StablePCA: Distributionally Robust Learning of Representations from Multi-Source Data
Zhenyu Wang, Molei Liu, Jing Lei, Francis Bach, Zijian Guo
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.00940 returned HTTP 429 (rate limited).
[558] Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds
Ke Sun
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.13614 returned HTTP 429 (rate limited).
[559] Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling
Matthieu Blanke, Yongquan Qu, Sara Shamekh, Pierre Gentine
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.18017 returned HTTP 429 (rate limited).
[560] Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs
Bob Junyi Zou, Lu Tian
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.18996 returned HTTP 429 (rate limited).
[561] NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion
Max Collins, Jordan Vice, Tim French, Ajmal Mian
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.20934 returned HTTP 429 (rate limited).
[562] Optimizing Data Augmentation through Bayesian Model Selection
Madi Matymov, Ba-Hien Tran, Michael Kampffmeyer, Markus Heinonen, Maurizio Filippone
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.21813 returned HTTP 429 (rate limited).
[563] Weight-Space Linear Recurrent Neural Networks
Roussel Desmond Nzoyem, Nawid Keshtmand, Enrique Crespo Fernandez, Idriss Tsayem, Raul Santos-Rodriguez, David A.W. Barton, Tom Deakin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.01153 returned HTTP 429 (rate limited).
[564] Dynamic Manifold Hopfield Networks for Context-Dependent Associative Memory
Chong Li, Taiping Zeng, Xiangyang Xue, Jianfeng Feng
Main category: cs.LG
TL;DR: Dynamic Manifold Hopfield Networks (DMHN) extend classical Hopfield networks by allowing contextual modulation to dynamically reshape attractor geometry, enabling flexible reorganization of neural representations without explicit context-specific parameterization.
Details
Motivation: Neural circuits can be flexibly reorganized by context, suggesting that cognition relies on dynamic manifolds rather than static representations; how such dynamic organization is realized mechanistically within a unified dynamical system remains unclear. Classical Hopfield networks have fixed energy landscapes that constrain retrieval within static attractor manifolds.
Method: Introduce Dynamic Manifold Hopfield Networks (DMHN), continuous dynamical models in which contextual modulation dynamically reshapes attractor geometry. Network interactions are learned in a data-driven manner so that the attractor manifold's geometry deforms intrinsically across cues without explicit context-specific parameterization, transforming a static attractor manifold into a context-dependent family of neural manifolds.
Result: DMHN achieve substantially higher capacity and robustness than classical and modern Hopfield networks. When storing 2N patterns in a network of N neurons, DMHN attain reliable retrieval with average accuracy of 64%, compared with 1% for classical and 13% for modern variants.
Conclusion: Dynamic reorganization of attractor manifold geometry serves as a principled mechanism for context-dependent remapping in neural associative memory, providing a framework for understanding flexible neural representations in cortical and hippocampal circuits.
Abstract: Neural population activity in cortical and hippocampal circuits can be flexibly reorganized by context, suggesting that cognition relies on dynamic manifolds rather than static representations. However, how such dynamic organization can be realized mechanistically within a unified dynamical system remains unclear. Continuous Hopfield networks provide a classical attractor framework in which neural dynamics follow gradient descent on a fixed energy landscape, constraining retrieval within a static attractor manifold geometry. Extending this approach, we introduce Dynamic Manifold Hopfield Networks (DMHN), continuous dynamical models in which contextual modulation dynamically reshapes attractor geometry, transforming a static attractor manifold into a context-dependent family of neural manifolds. In DMHN, network interactions are learned in a data-driven manner, to intrinsically deform the geometry of its attractor manifold across cues without explicit context-specific parameterization. As a result, in associative retrieval, DMHN achieve substantially higher capacity and robustness than classical and modern Hopfield networks: when storing $2N$ patterns in a network of $N$ neurons, DMHN attain reliable retrieval with an average accuracy of 64%, compared with 1% and 13% for classical and modern variants, respectively. Together, these results establish dynamic reorganization of attractor manifold geometry as a principled mechanism for context-dependent remapping in neural associative memory.
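The classical baseline against which DMHN is measured can be illustrated with a minimal Hebbian Hopfield network: a fixed-energy-landscape associative memory whose capacity limits (reliable retrieval only well below ~0.14N random patterns) motivate the paper's context-modulated extension. This sketch is NOT the paper's DMHN model; it is a standard textbook construction, with all sizes and names chosen for illustration.

```python
import numpy as np

# Classical Hopfield associative memory (illustrative baseline, not DMHN).
rng = np.random.default_rng(0)
N = 100  # neurons
P = 5    # stored patterns, well below the classical ~0.14*N capacity limit

patterns = rng.choice([-1, 1], size=(P, N))

# Hebbian outer-product weights; zero diagonal removes self-coupling.
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0)

def retrieve(x, steps=20):
    """Iterate sign-updates: descend the fixed energy landscape to an attractor."""
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1  # break ties deterministically
    return x

# Corrupt a stored pattern by flipping 10% of its bits, then retrieve.
probe = patterns[0].copy()
flip = rng.choice(N, size=N // 10, replace=False)
probe[flip] *= -1
recalled = retrieve(probe)
overlap = np.mean(recalled == patterns[0])  # typically near 1.0 at this low load
```

At loads approaching or exceeding capacity (e.g. the paper's 2N-patterns-in-N-neurons regime), crosstalk between stored patterns corrupts the attractors and retrieval in this classical model collapses, which is the failure mode the DMHN results quantify.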
[565] RNE: plug-and-play diffusion inference-time control and energy-based training
Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.05668 returned HTTP 429 (rate limited).
[566] Tailored Behavior-Change Messaging for Physical Activity: Integrating Contextual Bandits and Large Language Models
Haochen Song, Dominik Hofer, Rania Islambouli, Laura Hawkins, Ananya Bhattacharjee, Zahra Hassanzadeh, Jan Smeddinck, Meredith Franklin, Joseph Jay Williams
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.07275 returned HTTP 429 (rate limited).
[567] Federated ADMM from Bayesian Duality
Thomas Möllenhoff, Siddharth Swaroop, Finale Doshi-Velez, Mohammad Emtiyaz Khan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.13150 returned HTTP 429 (rate limited).
[568] An Explainable and Interpretable Composite Indicator Based on Decision Rules
Salvatore Corrente, Salvatore Greco, Roman Słowiński, Silvano Zappalà
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.13259 returned HTTP 429 (rate limited).
[569] Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape
Haoran Niu, K. Suzanne Barber
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.04542 returned HTTP 429 (rate limited).
[570] The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagram
Lénaïc Chizat
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.10167 returned HTTP 429 (rate limited).
[571] Towards a more realistic evaluation of machine learning models for bearing fault diagnosis
João Paulo Vieira, Victor Afonso Bauler, Rodrigo Kobashikawa Rosa, Danilo Silva
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.22267 returned HTTP 429 (rate limited).
[572] Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.23202 returned HTTP 429 (rate limited).
[573] CREPE: Controlling Diffusion with Replica Exchange
Jiajun He, Paul Jeha, Peter Potaptchik, Leo Zhang, José Miguel Hernández-Lobato, Yuanqi Du, Saifuddin Syed, Francisco Vargas
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.23265 returned HTTP 429 (rate limited).
[574] Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport
Xavier Aramayo Carrasco, Grigoriy Ksenofontov, Aleksei Leonov, Iaroslav Sergeevich Koshelev, Alexander Korotin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.23348 returned HTTP 429 (rate limited).
[575] Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling
Junyi Yao, Parham Eftekhar, Gene Cheung, Xujin Chris Liu, Yao Wang, Wei Hu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.03027 returned HTTP 429 (rate limited).
[576] AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
Irene Tenison, Soumyajit Chatterjee, Fahim Kawsar, Mohammad Malekzadeh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.03101 returned HTTP 429 (rate limited).
[577] Post-hoc Stochastic Concept Bottleneck Models
Wiktor Jan Hoffmann, Sonia Laguna, Moritz Vandenhirtz, Emanuele Palumbo, Julia E. Vogt
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.08219 returned HTTP 429 (rate limited).
[578] Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
Jacob Trauger, Tyson Trauger, Ambuj Tewari
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.08382 returned HTTP 429 (rate limited).
[579] Efficient Resource-Constrained Training of Transformers via Subspace Optimization
Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.09160 returned HTTP 429 (rate limited).
[580] Auditing Information Disclosure During LLM-Scale Gradient Descent Using Gradient Uniqueness
Sleem Abdelghafar, Maryam Aliakbarpour, Chris Jermaine
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.10902 returned HTTP 429 (rate limited).
[581] Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach
Xin Guo, Zijiu Lyu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.15165 returned HTTP 429 (rate limited).
[582] Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices
Nina Herrmann, Jan Stenkamp, Benjamin Karic, Stefan Oehmcke, Fabian Gieseke
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.26557 returned HTTP 429 (rate limited).
[583] Graph Homomorphism Distortion: A Metric to Distinguish Them All and in the Latent Space Bind Them
Martin Carrasco, Olga Zaghen, Kavir Sumaraj, Erik Bekkers, Bastian Rieck
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.03068 returned HTTP 429 (rate limited).
[584] Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective
Justin Lee, Zheda Mai, Jinsu Yoo, Chongyu Fan, Cheng Zhang, Wei-Lun Chao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.07970 returned HTTP 429 (rate limited).
[585] SURFACEBENCH: A Geometry-Aware Benchmark for Symbolic Surface Discovery
Sanchit Kabra, Shobhnik Kriplani, Parshin Shojaee, Chandan K. Reddy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.10833 returned HTTP 429 (rate limited).
[586] QiMeng-CRUX: Narrowing the Gap Between Natural Language and Verilog via Core Refined Understanding eXpression for Circuit Design
Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Yunji Chen, Qi Guo
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.20099 returned HTTP 429 (rate limited).
[587] Multi-Scenario Highway Lane-Change Intention Prediction: A Temporal Physics-Informed Multi-Modal Framework
Jiazhao Shi, Ziyu Wang, Yichen Lin, Shoufeng Lu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.24075 returned HTTP 429 (rate limited).
[588] Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction
Haoyu Zhou, Ping Xue, Hao Zhang, Tianfan Fu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.02213 returned HTTP 429 (rate limited).
[589] Discrete Solution Operator Learning for Geometry-Dependent PDEs
Jinshuai Bai, Haolin Li, Zahra Sharif Khodaei, M. H. Aliabadi, YuanTong Gu, Xi-Qiao Feng
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.09143 returned HTTP 429 (rate limited).
[590] Data-Driven Conditional Flexibility Index
Moritz Wedemeyer, Eike Cramer, Alexander Mitsos, Manuel Dahmen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.16028 returned HTTP 429 (rate limited).
[591] Distributional value gradients for stochastic environments
Baptiste Debes, Tinne Tuytelaars
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2601.20071.
[592] Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2601.20088.
[593] On the Relationship Between Representation Geometry and Generalization in Deep Neural Networks
Sumit Yadav
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.00130.
[594] SwiftRepertoire: Few-Shot Immune-Signature Synthesis via Dynamic Kernel Codes
Rong Fu, Muge Qi, Yang Li, Yabin Jin, Jiekai Wu, Jiaxuan Lu, Chunlei Meng, Youjin Wang, Zeli Su, Juntao Gao, Li Bao, Qi Zhao, Wei Luo, Simon Fong
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.01051.
[595] Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins
Qian Zuo, Zhiyong Wang, Fengxiang He
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.10917.
[596] MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation
Jialin Liu, Zhaorui Zhang, Ray C.C. Cheung
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.11062.
[597] Function-Space Decoupled Diffusion for Forward and Inverse Modeling in Carbon Capture and Storage
Xin Ju, Jiachen Yao, Anima Anandkumar, Sally M. Benson, Gege Wen
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.12274.
[598] Out-of-Support Generalisation via Weight-Space Sequence Modelling
Roussel Desmond Nzoyem
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.13550.
[599] A Penalty Approach for Differentiation Through Black-Box Quadratic Programming Solvers
Yuxuan Linghu, Zhiyuan Liu, Qi Deng
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.14154.
[600] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
Eitan Gronich, Gal Vardi
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.16340.
[601] SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference
Qunyou Liu, Pengbo Yu, Marina Zapater, David Atienza
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.22136.
[602] Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language
Max S. Bennett, Thomas P. Zollo, Richard Zemel
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.23201.
[603] A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients
Joakim Edin, Sedrah Butt Balaganeshan, Annike Kjølby Kristensen, Lars Maaløe, Ioannis Louloudis, Søren Brunak
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2603.00221.
[604] CoPeP: Benchmarking Continual Pretraining for Protein Language Models
Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2603.00253.
[605] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, Chengchun Shi
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2603.01162.
[606] Operator Learning Using Weak Supervision from Walk-on-Spheres
Hrishikesh Viswanath, Hong Chul Nam, Xi Deng, Julius Berner, Anima Anandkumar, Aniket Bera
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2603.01193.
[607] Importance Weighting Correction of Regularized Least-Squares for Target Shift
Davit Gogolashvili
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2210.09709.
[608] A Global Optimization Algorithm for K-Center Clustering of One Billion Samples
Jiayang Ren, Ningning You, Kaixun Hua, Chaojie Ji, Yankai Cao
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2301.00061.
[609] (Un)fair devices: Moving beyond AI accuracy in personal sensing
Sofia Yfantidou, Marios Constantinides, Dimitris Spathis, Athena Vakali, Daniele Quercia, Fahim Kawsar
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2303.15585.
[610] A Normal Map-Based Proximal Stochastic Gradient Method: Convergence and Identification Properties
Junwen Qiu, Li Jiang, Andre Milzarek
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2305.05828.
[611] Proper losses regret at least 1/2-order
Han Bao, Asuka Takatsu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2407.10417.
[612] Quantifying User Coherence: A Unified Framework for Analyzing Recommender Systems Across Domains
Michaël Soumm, Alexandre Fournier-Montgieux, Adrian Popescu, Bertrand Delezoide
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2410.02453.
[613] Prediction of Multiscale Features Using Deep Learning-based Preconditioner-Solver Architecture for Darcy Equation in High-Contrast Media
Jie Chen, Peiqi Li, Zhengkang He, Simon Hands
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2411.02431.
[614] LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
Kalyan Nakka, Jimmy Dani, Ausmit Mondal, Nitesh Saxena
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2505.05619.
[615] CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk
Ilia Azizi, Juraj Bodik, Jakob Heiss, Bin Yu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2507.08150.
[616] EP-GAT: Energy-based Parallel Graph Attention Neural Network for Stock Trend Classification
Zhuodong Jiang, Pengju Zhang, Peter Martin
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2507.08184.
[617] Learning Acrobatic Flight from Preferences
Colin Merk, Ismail Geles, Jiaxu Xing, Angel Romero, Giorgia Ramponi, Davide Scaramuzza
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2508.18817.
[618] Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances
Khai Nguyen, Hai Nguyen, Nhat Ho
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2509.20508.
[619] Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning
Marc Damie, Florian Hahn, Andreas Peter, Jan Ramon
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2510.14894.
[620] Stochastic Control Methods for Optimization
Jinniao Qiu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2601.01248.
[621] Linear Model Extraction via Factual and Counterfactual Queries
Daan Otto, Jannis Kurtz, Dick den Hertog, Ilker Birbil
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.09748.
[622] DRESS: A Continuous Framework for Structural Graph Refinement
Eduar Castrillo Velilla
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.20833.
[623] A Researcher’s Guide to Empirical Risk Minimization
Lars van der Laan
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.21501.
[624] PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
Yunpeng Hong, Chenyang Bu, Jie Zhang, Yi He, Di Wu, Xindong Wu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.22903.
[625] A Boundary Integral-based Neural Operator for Mesh Deformation
Zhengyu Wu, Jun Liu, Wei Wang
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2602.23703.
[626] Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values
Shengbo Wang, Nian Si
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2603.00945.
[627] Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach
Ben Cullen, Sergio Estan-Ruiz, Riya Danait, Jiayi Li
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv export API returned HTTP 429 (rate limited) for 2603.01192.
cs.MA
[628] The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety
Elias Malomgré, Pieter Simoens
Main category: cs.MA
TL;DR: A hybrid multi-agent system architecture called Alignment Flywheel that separates decision generation from safety governance, enabling patch-based safety updates without retraining core models.
Details
Motivation: As learned and generative models become more powerful in autonomous systems, their safety behavior is often opaque and difficult to audit or update after deployment. There's a need for architectures that can govern safety separately from decision-making.
Method: Proposes the Alignment Flywheel architecture with four components: 1) Proposer (autonomous decision component) generates candidate trajectories, 2) Safety Oracle returns safety signals, 3) enforcement layer applies risk policies at runtime, and 4) governance MAS supervises the Oracle through auditing, verification, and refinement.
Result: Creates a framework for integrating capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight, enabling safety patches without retraining core decision components.
Conclusion: The Alignment Flywheel provides a governance-centric hybrid MAS architecture that decouples safety governance from decision generation, enabling safer deployment of powerful autonomous systems through patch-based safety updates and explicit oversight.
Abstract: Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.
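The gating flow described above can be sketched in a few lines. This is an illustrative reading of the architecture, not the authors' implementation: every name, the scoring function, and the 0.5 risk threshold are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    actions: List[str]

@dataclass
class SafetyOracle:
    # The governed, versioned oracle artifact; swapping it is a "patch".
    version: str
    score: Callable[[Trajectory], float]  # raw safety signal in [0, 1]

def apply_patch(oracle: SafetyOracle, new_version: str,
                new_score: Callable[[Trajectory], float]) -> SafetyOracle:
    # Patch locality: mitigate a newly observed failure by updating the
    # oracle artifact, not by retraining the decision component.
    return SafetyOracle(version=new_version, score=new_score)

def enforce(candidates: List[Trajectory], oracle: SafetyOracle,
            risk_threshold: float) -> List[Trajectory]:
    # Runtime gating: the enforcement layer applies an explicit risk policy
    # to the Oracle's raw signals; the Proposer itself is never modified.
    return [t for t in candidates if oracle.score(t) >= risk_threshold]

# Usage: a trivial Proposer emits candidates; the gate filters them.
proposer = lambda: [Trajectory(["move"]), Trajectory(["override_limits"])]
oracle = SafetyOracle("v1", lambda t: 0.0 if "override_limits" in t.actions else 1.0)
safe = enforce(proposer(), oracle, risk_threshold=0.5)
print([t.actions for t in safe])  # only the safe trajectory passes
```

The point of the sketch is the interface boundary: the Proposer can be any model, and safety behavior changes only through `apply_patch` on the versioned oracle.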
[629] StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding
Main category: cs.MA
TL;DR: StitchCUDA is a multi-agent framework for end-to-end GPU program generation using specialized agents (Planner, Coder, Verifier) with rubric-based reinforcement learning to improve CUDA programming skills.
Details
Motivation: Modern ML workloads heavily rely on GPUs, but achieving high end-to-end performance is challenging due to dependencies on both GPU kernel efficiency and host-side settings. While LLM-based methods show promise for GPU kernel generation, prior works focus only on single-kernel optimization and don't extend to end-to-end programs, limiting practical deployment.
Method: StitchCUDA uses a multi-agent framework with three specialized agents: Planner (orchestrates system design), Coder (implements step-by-step), and Verifier (correctness check and performance profiling). The Coder is improved through rubric-based agentic reinforcement learning over two atomic skills: task-to-code generation and feedback-driven code optimization, using a combined rubric reward and rule-based reward from real executions.
Result: On KernelBench, StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over multi-agent baselines and 2.73x better than RL model baselines.
Conclusion: StitchCUDA effectively addresses end-to-end GPU program generation challenges through a multi-agent framework with specialized agents and reinforcement learning, preventing reward hacking and enabling practical deployment of GPU-optimized ML workloads.
Abstract: Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise for automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To address this challenge, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate the whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness checks and performance profiling using Nsys/NCU. To fundamentally improve the Coder’s ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with a combined rubric reward and rule-based reward from real executions. As a result, the Coder learns to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cuBLAS epilogues), and reward hacking by the Coder (e.g., simply copying PyTorch code or hardcoding outputs) is effectively prevented during benchmarking. Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than the multi-agent baseline and 2.73x better than the RL model baselines.
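The combined reward described above can be illustrated with a minimal sketch. All names, weights, and the hardcoding check below are hypothetical stand-ins, not the paper's actual formulation: a rubric score from a judge is blended with a rule-based reward built from real execution checks, and detected reward hacking zeroes out the signal.

```python
# Hypothetical sketch of rubric + rule-based reward mixing for a CUDA Coder
# agent. Weights and thresholds are illustrative, not from the paper.
def combined_reward(rubric_score, compiled, outputs_match, speedup,
                    is_hardcoded, w_rubric=0.3, w_rule=0.7):
    """rubric_score in [0, 1]; the rule reward is built from execution checks."""
    if is_hardcoded:              # reward hacking detected -> no credit at all
        return 0.0
    rule = 0.0
    if compiled:
        rule += 0.3               # partial credit for a compiling kernel
    if outputs_match:             # correctness against the reference
        rule += 0.4
        rule += 0.3 * min(speedup, 2.0) / 2.0   # capped performance bonus
    return w_rubric * rubric_score + w_rule * rule
```

Capping the speedup term keeps a single fast-but-lucky kernel from dominating the learning signal.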
[630] Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization
Seongmin Kim, Giseung Park, Woojun Kim, Jiwon Jeon, Seungyeol Han, Youngchul Sung
Main category: cs.MA
TL;DR: A multi-agent reinforcement learning framework with Generalized Per-Agent Advantage Estimator (GPAE) that improves sample efficiency and coordination through precise per-agent advantage estimation and double-truncated importance sampling.
Details
Motivation: Current multi-agent reinforcement learning methods struggle with sample efficiency and coordination in complex scenarios, particularly due to inaccurate per-agent advantage estimation and challenges with off-policy learning in non-stationary environments.
Method: Proposes Generalized Per-Agent Advantage Estimator (GPAE) using per-agent value iteration operators to compute precise advantages without direct Q-function estimation. Introduces a double-truncated importance sampling ratio scheme to improve credit assignment for off-policy trajectories by balancing sensitivity to policy changes with robustness to non-stationarity.
Result: The approach outperforms existing methods on benchmarks, demonstrating superior coordination and sample efficiency in complex multi-agent scenarios.
Conclusion: The GPAE framework provides an effective solution for multi-agent reinforcement learning by enabling accurate per-agent advantage estimation and stable off-policy learning, leading to improved coordination and sample efficiency.
Abstract: In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent’s own policy changes with robustness to non-stationarity from other agents. Experiments on benchmarks demonstrate that our approach outperforms existing approaches, excelling in coordination and sample efficiency for complex scenarios.
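The double-truncated ratio idea can be sketched in a few lines. This is an illustrative reading, not the paper's exact formulation: the per-agent importance ratio is clipped from both sides, so updates stay sensitive to the agent's own policy change while remaining robust to drift caused by other agents. The bounds below are arbitrary examples.

```python
# Illustrative sketch of a per-agent importance ratio truncated on both
# sides; the clipping bounds (0.8, 1.2) are example values, not the paper's.
def double_truncated_ratio(pi_new, pi_old, low=0.8, high=1.2):
    """Clip the per-agent ratio pi_new/pi_old into [low, high]."""
    ratio = pi_new / pi_old
    return max(low, min(ratio, high))

def weighted_advantage(pi_new, pi_old, advantage):
    """Off-policy advantage reweighted by the truncated per-agent ratio."""
    return double_truncated_ratio(pi_new, pi_old) * advantage
```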
[631] Strategic Concealment of Environment Representations in Competitive Games
Yue Guan, Dipankar Maity, Panagiotis Tsiotras
Main category: cs.MA
TL;DR: Strategic concealment of environment representations in competitive games where a Defender tries to infer and exploit an Attacker’s representation while the Attacker obfuscates its representation type.
Details
Motivation: To understand how players strategically conceal their environment representations in competitive scenarios, particularly in defense situations where one player needs to infer the other's representation to effectively place barriers.
Method: Model the interaction as a Bayesian game where the Defender infers the Attacker’s representation from its trajectory and places barriers, while the Attacker obfuscates its representation type. Solve for Perfect Bayesian Nash Equilibrium via a bilinear program integrating Bayesian inference, strategic planning, and belief manipulation.
Result: Simulations show that purposeful concealment naturally emerges: the Attacker randomizes its trajectory to manipulate the Defender’s belief, inducing suboptimal barrier selections and thereby gaining a strategic advantage.
Conclusion: Strategic concealment of environment representations is an effective tactic in competitive games, where obfuscation can lead to suboptimal defensive responses and provide attackers with strategic advantages.
Abstract: This paper investigates the strategic concealment of environment representations used by players in competitive games. We consider a defense scenario in which one player (the Defender) seeks to infer and exploit the representation used by the other player (the Attacker). The interaction between the two players is modeled as a Bayesian game: the Defender infers the Attacker’s representation from its trajectory and places barriers to obstruct the Attacker’s path towards its goal, while the Attacker obfuscates its representation type to mislead the Defender. We solve for the Perfect Bayesian Nash Equilibrium via a bilinear program that integrates Bayesian inference, strategic planning, and belief manipulation. Simulations show that purposeful concealment naturally emerges: the Attacker randomizes its trajectory to manipulate the Defender’s belief, inducing suboptimal barrier selections and thereby gaining a strategic advantage.
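The belief-manipulation mechanic rests on a standard Bayesian update, which a minimal sketch makes concrete. The representation types and probabilities below are invented for illustration: the Defender updates its belief over the Attacker's type from the likelihood of an observed trajectory, and the Attacker can randomize trajectories precisely to keep this posterior uninformative.

```python
# Minimal Bayesian update over the Attacker's representation type.
# Type names and numbers are hypothetical, for illustration only.
def belief_update(prior, likelihood):
    """prior, likelihood: dicts type -> probability; returns the posterior."""
    unnorm = {t: prior[t] * likelihood[t] for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

# A trajectory much more likely under a "grid" representation sharpens the
# Defender's belief toward that type.
posterior = belief_update({"grid": 0.5, "graph": 0.5},
                          {"grid": 0.9, "graph": 0.1})
```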
[632] Reuse, Don’t Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration
Daivik Patel, Shrenik Patel
Main category: cs.MA
TL;DR: ENGRAM-R is an inference-time memory layer that enables large reasoning models to reuse structured memory instead of recomputing derivations, achieving significant token reduction while maintaining accuracy.
Details
Motivation: Current large reasoning models achieve accuracy through test-time scaling (longer chains of thought, multiple solution sampling), but this comes with high token costs and latency. The authors argue that memory should be a core component for efficient reasoning - when evidence exists, models should reuse structured memory rather than recompute derivations.
Method: ENGRAM-R is an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. It enables models to retrieve and reuse previously computed evidence instead of processing full context repeatedly.
Result: On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains.
Conclusion: Memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets. Structured memory reuse enables significant efficiency improvements without sacrificing accuracy.
Abstract: Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.
cs.MM
[633] Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling
Wei Jiang, Tong Chen, Wei Yuan, Quoc Viet Hung Nguyen, Hongzhi Yin
Main category: cs.MM
TL;DR: AgentM3D: A multi-agent framework for zero-shot mixed-source multi-modal misinformation detection using adaptive test-time scaling with Best-of-N reasoning and critic agents.
Details
Motivation: Single vision-language models (VLMs) are insufficient for complex mixed-source multi-modal misinformation detection (M3D) where false information can come from text, images, or modality mismatches. Existing agentic solutions have limited reasoning capacity due to single forward passes and lack exploration of alternative reasoning paths.
Method: Proposes AgentM3D with modality-specific VLM agents using a Best-of-N mechanism with critic agents for scoring. Features a cascading decision chain to reduce computation/error propagation, a planning agent for dynamic reasoning path allocation, and an adaptive stopping mechanism.
Result: Achieves state-of-the-art zero-shot detection performance on two M3D benchmarks compared to various VLM-based and agentic baselines.
Conclusion: AgentM3D effectively addresses limitations of single VLMs and existing agentic systems for M3D through adaptive test-time scaling and multi-agent reasoning architecture.
Abstract: Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM’s capacity falls short in the more complex mixed-source multi-modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero-shot M3D by connecting modality-specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi-agent framework for zero-shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test-time scaling paradigm in which each modality-specific VLM agent applies a Best-of-N mechanism, coupled with a critic agent for task-aligned scoring. The agents are organized in a cascading, modality-specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state-of-the-art zero-shot detection performance compared with various VLM-based and agentic baselines.
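The Best-of-N mechanism with adaptive stopping can be sketched schematically. The generator and critic below are plain callables standing in for the paper's VLM and critic agents, and the threshold is an invented example: sample up to N reasoning paths, score each with the critic, and stop early once a score clears a confidence bar.

```python
# Schematic Best-of-N with adaptive stopping. `generate` and `critic` are
# hypothetical stand-ins for the modality-specific VLM agent and critic agent.
def best_of_n(generate, critic, n_max, stop_threshold=0.9):
    best_answer, best_score = None, float("-inf")
    for i in range(n_max):
        answer = generate(i)          # one forward reasoning path
        score = critic(answer)        # task-aligned critic score
        if score > best_score:
            best_answer, best_score = answer, score
        if best_score >= stop_threshold:   # adaptive stopping: good enough
            break
    return best_answer, best_score
```

A planning agent, as described above, would set `n_max` per sample based on estimated difficulty.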
[634] Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?
Yuesheng Huang, Peng Zhang, Xiaoxin Wu, Riliang Liu, Jiaqi Liang
Main category: cs.MM
TL;DR: Using text-to-image models to generate synthetic images from text can enhance text classification by providing visual priors, though effectiveness depends on model quality, prompt engineering, and task visual groundability.
Details
Motivation: There's a modality gap between abundant text-only data and powerful multimodal models. The paper investigates whether synthetic images generated by T2I models can unlock visual priors for text-centric reasoning tasks.
Method: Comprehensive evaluation framework on text classification analyzing T2I model quality (Flux.1, SDXL), prompt engineering strategies, and multimodal fusion architectures. Uses synthetic perception to project text into visual semantic space.
Result: Synthetic perception yields significant performance gains by effectively projecting text into visual semantic space, even when augmenting strong LLM baselines like Llama-3 and Qwen-2.5. Serves as cross-modal probing to mitigate sensory deprivation in pure text training.
Conclusion: Effectiveness is highly conditional on semantic alignment between text and generated image, task’s visual groundability, and T2I model’s generative fidelity. Establishes rigorous benchmark for using synthetic images to enrich language understanding in unimodal scenarios.
Abstract: A significant "modality gap" exists between the abundance of text-only data and the increasing power of multimodal models. This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as a mechanism to unlock latent visual priors for text-centric reasoning. Through a comprehensive evaluation framework on text classification, we analyze the impact of critical variables, including T2I model quality (e.g., Flux.1, SDXL), prompt engineering strategies, and multimodal fusion architectures. Our findings demonstrate that this "synthetic perception" can yield significant performance gains by effectively projecting text into a visual semantic space, even when augmenting strong large language model baselines like Llama-3 and Qwen-2.5. We show that this approach serves as a form of cross-modal probing, mitigating the sensory deprivation inherent in pure text training. However, the effectiveness is highly conditional, depending on the semantic alignment between text and the generated image, the task’s visual groundability, and the generative fidelity of the T2I model. Our work establishes a rigorous benchmark for this paradigm, demonstrating its viability as a pathway to enrich language understanding in traditionally unimodal scenarios.
eess.AS
[635] LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification
Niloofar Jazaeri, Hilmi R. Dajani, Marco Janeczek, Martin Bouchard
Main category: eess.AS
TL;DR: A compact acoustic framework for infant cry cause classification using multi-branch CNN with enhanced Legendre Memory Unit for temporal modeling and calibrated ensemble fusion for cross-dataset generalization.
Details
Motivation: Infant cry cause decoding is challenging due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets, requiring robust solutions for healthcare monitoring applications.
Method: Proposes a compact acoustic framework that fuses MFCC, STFT, and pitch features using a multi-branch CNN encoder, models temporal dynamics with an enhanced Legendre Memory Unit (LMU) for efficient sequence modeling, and uses calibrated posterior ensemble fusion with entropy-gated weighting for cross-dataset generalization.
Result: Experiments on Baby2020 and Baby Crying datasets show improved macro-F1 under cross-domain evaluation, with leakage-aware splits and demonstrated real-time feasibility for on-device monitoring.
Conclusion: The proposed framework effectively addresses challenges in infant cry cause classification through efficient temporal modeling and robust cross-dataset generalization techniques suitable for practical healthcare monitoring applications.
Abstract: Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses MFCC, STFT, and pitch features within a multi-branch CNN encoder and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage-aware splits and real-time feasibility for on-device monitoring.
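The entropy-gated fusion idea can be sketched as follows. The specific weighting rule below is an illustrative choice, not the paper's exact calibrated scheme: each domain expert's posterior is weighted by its confidence (low entropy yields a high weight) before the class distributions are averaged.

```python
# Hedged sketch of entropy-gated posterior fusion across domain experts.
# The 1/(1 + entropy) gate is an illustrative weighting, not the paper's.
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def entropy_gated_fusion(posteriors):
    """posteriors: list of per-model class distributions of equal length."""
    weights = [1.0 / (1.0 + entropy(p)) for p in posteriors]
    z = sum(weights)
    n = len(posteriors[0])
    return [sum(w * p[k] for w, p in zip(weights, posteriors)) / z
            for k in range(n)]
```

A confident expert (near one-hot posterior) thus dominates an uncertain one while both still contribute.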
[636] On the Parameter Estimation of Sinusoidal Models for Speech and Audio Signals
George P. Kafentzis
Main category: eess.AS
TL;DR: Comparison of three sinusoidal models (SM, EDSM, eaQHM) for audio parameter estimation, showing eaQHM excels with medium-to-large windows while EDSM performs better with small windows.
Details
Motivation: To evaluate and compare the parameter estimation performance of three sinusoidal models for speech and audio signals, identifying their strengths and weaknesses for different analysis conditions.
Method: Comparative analysis of three models: the standard Sinusoidal Model (FFT-based), the Exponentially Damped Sinusoidal Model (subspace method), and the extended adaptive Quasi-Harmonic Model (adaptive basis functions with Least Squares). Performance is evaluated via signal reconstruction accuracy vs. window size on synthetic signals and vs. number of sinusoids on real signals including singing voices and guitar solos.
Result: eaQHM outperforms EDSM in medium-to-large window size analysis, while EDSM yields higher reconstruction values for smaller analysis window sizes. Each model has distinct advantages depending on analysis conditions.
Conclusion: Future research should merge the adaptivity of eaQHM with the parameter estimation robustness of EDSM for high-quality analysis and resynthesis of general audio signals.
Abstract: In this paper, we examine the parameter estimation performance of three well-known sinusoidal models for speech and audio. The first one is the standard Sinusoidal Model (SM), which is based on the Fast Fourier Transform (FFT). The second is the Exponentially Damped Sinusoidal Model (EDSM), which has been proposed in the last decade and utilizes a subspace method for parameter estimation. Finally, the extended adaptive Quasi-Harmonic Model (eaQHM), which has been recently proposed for AM-FM decomposition, estimates the signal parameters using Least Squares on a set of basis functions that are adaptive to the local characteristics of the signal. The parameter estimation of each model is briefly described and its performance is compared to the others in terms of signal reconstruction accuracy versus window size on a variety of synthetic signals and versus the number of sinusoids on real signals. The latter include highly non-stationary signals, such as singing voices and guitar solos. The advantages and disadvantages of each model are presented via synthetic signals and then the application on real signals is discussed. Conclusively, eaQHM outperforms EDSM in medium-to-large window size analysis, whereas EDSM yields higher reconstruction values for smaller analysis window sizes. Thus, a future research direction appears to be merging the adaptivity of the eaQHM with the parameter estimation robustness of the EDSM in a new paradigm for high-quality analysis and resynthesis of general audio signals.
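The signal reconstruction accuracy used for the comparison is typically reported as a signal-to-reconstruction-error ratio in dB. A generic sketch (the paper may define its metric slightly differently):

```python
# Generic signal-to-reconstruction-error ratio (SRER) in dB: signal energy
# over residual energy. This is a common definition, shown for illustration.
import math

def srer_db(signal, reconstruction):
    num = sum(x * x for x in signal)
    err = sum((x - y) ** 2 for x, y in zip(signal, reconstruction))
    return 10.0 * math.log10(num / err) if err > 0 else float("inf")
```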
[637] Quality of Automatic Speech Recognition – Polish Language case study – from Wav2Vec to Scribe ElevenLabs
Marcin Pietroń, Szymon Piórkowski, Kamil Faber, Dominik Żurek, Michał Karwatowski, Jerzy Duda, Hubert Zieliński, Piotr Lipnicki, Mikołaj Leszczuk
Main category: eess.AS
TL;DR: Comparative study of ASR models for Polish medical interviews, showing Whisper with LLM integration performs best among open-source models, while ElevenLabs Scribe performs best overall.
Details
Motivation: To evaluate and compare modern ASR models for Polish medical interview applications, particularly comparing end-to-end architectures with hybrid ASR+LLM approaches for improved accuracy.
Method: Two-stage pipeline approach using Whisper ASR followed by an LLM for correction/improvement, compared against various end-to-end models (QuartzNet, FastConformer, Wav2Vec 2.0 XLSR, ESPnet) and ElevenLabs Scribe on Polish benchmarks including medical interviews.
Result: Whisper with LLM integration performs best among open-source models, while ElevenLabs Scribe achieves best overall performance on both general benchmarks and medical data.
Conclusion: Hybrid ASR+LLM approaches show promise for specialized domains like medical interviews, with Whisper+LLM being the best open-source solution and ElevenLabs Scribe offering superior commercial performance.
Abstract: This article presents comparative studies of Automatic Speech Recognition (ASR) models combined with a Large Language Model (LLM) for medical interviews. The proposed solution is tested on Polish-language benchmarks and a dataset of medical interviews. The latest ASR technologies are based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, and most of them work as end-to-end solutions. For the Whisper model, the presented approach is a two-stage solution with end-to-end ASR and an LLM working together in a pipeline: the ASR output is fed to the LLM, which corrects and improves it. Comparative studies of Polish speech recognition between modern end-to-end deep learning architectures and the hybrid ASR model were performed. The medical interview tests were performed with two state-of-the-art ASR models: OpenAI Whisper combined with an LLM, and ElevenLabs Scribe. Additionally, the results were compared with further end-to-end models (QuartzNet, FastConformer, Wav2Vec 2.0 XLSR and ESPnet Model Zoo) on the Mozilla Common Voice and VoxPopuli databases. Tests were conducted for clean, bandwidth-limited, and degraded audio signals. The tested models were evaluated on the basis of Word Error Rate (WER) and Character Error Rate (CER). The results show that the Whisper model performs by far the best among the open-source models, while the ElevenLabs Scribe model performs best for Polish on both the general benchmarks and the medical data.
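WER, the headline metric in this comparison, is word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference length. A minimal reference implementation:

```python
# Word Error Rate via word-level edit distance (standard definition).
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)
```

CER is computed the same way over characters rather than words.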
[638] OnDA: On-device Channel Pruning for Efficient Personalized Keyword Spotting
Matteo Risso, Alessio Burrello, Daniele Jahier Pagliari
Main category: eess.AS
TL;DR: On-device keyword spotting adaptation combining weight training with online structured channel pruning for personalized models, achieving significant compression and efficiency gains.
Details
Motivation: Always-on keyword spotting requires on-device adaptation to handle user/environment distribution shifts under strict latency and energy constraints, but existing approaches focus only on weight adaptation without architectural optimization.
Method: Proposes coupling weight adaptation (on-device training) with architectural adaptation via online structured channel pruning. Compares data-agnostic and data-aware pruning criteria applied to in-field pseudo-labelled user data within a self-learning personalized KWS pipeline.
Result: Achieves up to 9.63x model-size compression at iso-task performance on HeySnips and HeySnapdragon datasets. On Jetson Orin Nano GPU, achieves 1.52x/1.57x latency and 1.64x/1.77x energy-consumption improvements during online training/inference compared to weights-only adaptation.
Conclusion: Combining weight adaptation with architectural adaptation through online pruning significantly improves efficiency for personalized on-device keyword spotting, enabling better resource utilization while maintaining performance.
Abstract: Always-on keyword spotting (KWS) demands on-device adaptation to cope with user- and environment-specific distribution shifts under tight latency and energy budgets. This paper proposes, for the first time, coupling weight adaptation (i.e., on-device training) with architectural adaptation, in the form of online structured channel pruning, for personalized on-device KWS. Starting from a state-of-the-art self-learning personalized KWS pipeline, we compare data-agnostic and data-aware pruning criteria applied on in-field pseudo-labelled user data. On the HeySnips and HeySnapdragon datasets, we achieve up to 9.63x model-size compression with respect to unpruned baselines at iso-task performance, measured as the accuracy at 0.5 false alarms per hour. When deploying our adaptation pipeline on a Jetson Orin Nano embedded GPU, we achieve up to 1.52x/1.57x and 1.64x/1.77x latency and energy-consumption improvements during online training/inference compared to weights-only adaptation.
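A common data-agnostic criterion of the kind compared above is L1-norm channel ranking (the paper's exact criteria may differ; this is an illustrative sketch): rank a layer's output channels by the L1 norm of their filter weights and keep only the strongest fraction.

```python
# Illustrative L1-norm structured channel pruning. Channels are represented
# as flat lists of filter weights; real pipelines operate on 4-D tensors.
def l1_channel_ranking(weights):
    """Return channel indices sorted by descending L1 norm."""
    norms = [sum(abs(w) for w in ch) for ch in weights]
    return sorted(range(len(weights)), key=lambda i: -norms[i])

def prune_channels(weights, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of channels, preserving order."""
    keep = max(1, int(len(weights) * keep_ratio))
    kept = sorted(l1_channel_ranking(weights)[:keep])
    return [weights[i] for i in kept]
```

Data-aware criteria would instead score channels using activations on the pseudo-labelled user data.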
[639] Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
Mandip Goswami
Main category: eess.AS
TL;DR: Whisper-RIR-Mega is a benchmark dataset pairing clean LibriSpeech utterances with their reverberant versions using real room impulse responses, used to evaluate Whisper ASR models’ robustness to room acoustics.
Details
Motivation: To create a standardized benchmark for evaluating automatic speech recognition (ASR) robustness to room acoustics, specifically reverberation effects, which degrade ASR performance in real-world environments.
Method: Created the dataset by pairing clean LibriSpeech utterances with the same utterances convolved with real room impulse responses from the RIR-Mega corpus. Used stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). Evaluated five Whisper model variants (tiny through large-v3) on 1600 test samples.
Result: Reverberation consistently degraded ASR performance across all Whisper model sizes. The reverb penalty in word error rate (WER) ranged from 0.12 to 1.07 percentage points depending on the model. Larger models generally showed better robustness but still suffered degradation.
Conclusion: The benchmark provides a reproducible framework for evaluating ASR robustness to room acoustics. Reverberation remains a significant challenge for ASR systems, and the released dataset and code support further research on robust speech recognition.
Abstract: We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus, with stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 0.12 to 1.07 percentage points depending on the model. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
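The pairing construction is a direct convolution: the reverberant version of a clean utterance is the clean signal convolved with a room impulse response. A minimal pure-Python sketch (the benchmark uses real measured RIRs and audio-rate signals):

```python
# Full linear convolution of a clean signal with a room impulse response,
# producing the paired reverberant signal (toy-scale sketch).
def convolve_rir(clean, rir):
    out = [0.0] * (len(clean) + len(rir) - 1)
    for i, x in enumerate(clean):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out
```

In practice an FFT-based convolution is used for audio-length signals; the result is identical.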
[640] Decomposing the Influence of Physical Acoustic Modeling on Neural Personal Sound Zone Rendering: An Ablation Study
Hao Jiang, Edgar Choueiri
Main category: eess.AS
TL;DR: Ablation study on simulated acoustic transfer functions for binaural personal sound zones, evaluating contributions of loudspeaker frequency responses, directivity modeling, and rigid-sphere HRTFs to sound separation performance.
Details
Motivation: Deep learning-based Personal Sound Zones rely on simulated acoustic transfer functions for training, but idealized point-source models have significant sim-to-real gaps. While physically informed components improve generalization, their individual contributions remain unclear, making it difficult to prioritize measurement and modeling efforts under limited budgets.
Method: Controlled ablation study using the Binaural Spatial Audio Neural Network (BSANN) for head-pose-conditioned binaural PSZ rendering. Progressively enriched simulated ATFs with three components: (1) anechoically measured loudspeaker frequency responses, (2) analytic circular-piston directivity modeling, and (3) rigid-sphere head-related transfer functions. Evaluated four configurations via in-situ measurements with two dummy heads.
Result: Frequency responses provided spectral calibration with modest XTC improvements and reduced inter-listener IPI imbalance. Directivity delivered most consistent sound-zone separation gains (10.05 dB average IZI/IPI). Rigid-sphere HRTFs dominated binaural separation, boosting XTC by +2.38/+2.89 dB (average 4.51 to 7.91 dB), primarily above 2 kHz, while introducing mild listener-dependent IZI/IPI shifts.
Conclusion: The findings provide guidance for prioritizing measurements and models when constructing training acoustic transfer functions under limited budgets, showing which components contribute most to different aspects of sound zone performance.
Abstract: Deep learning-based Personal Sound Zones (PSZs) rely on simulated acoustic transfer functions (ATFs) for training, yet idealized point-source models exhibit large sim-to-real gaps. While physically informed components improve generalization, individual contributions remain unclear. This paper presents a controlled ablation study on a head-pose-conditioned binaural PSZ renderer using the Binaural Spatial Audio Neural Network (BSANN). We progressively enrich simulated ATFs with three components: (i) anechoically measured frequency responses of the particular loudspeakers (FR), (ii) analytic circular-piston directivity (DIR), and (iii) rigid-sphere head-related transfer functions (RS-HRTF). Four configurations are evaluated via in-situ measurements with two dummy heads. Performance metrics include inter-zone isolation (IZI), inter-program interference (IPI), and crosstalk cancellation (XTC) over 100-20000 Hz. Results show FR provides spectral calibration, yielding modest XTC improvements and reduced inter-listener IPI imbalance. DIR delivers the most consistent sound-zone separation gains (10.05 dB average IZI/IPI). RS-HRTF dominates binaural separation, boosting XTC by +2.38/+2.89 dB (average 4.51 to 7.91 dB), primarily above 2 kHz, while introducing mild listener-dependent IZI/IPI shifts. These findings guide prioritization of measurements and models when constructing training ATFs under limited budgets.
[641] Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge
Dhanya E, Ankita Meena, Manas Nanivadekar, Noumida A, Victor Azad, Ashwini Nagaraj Shenoy, Pratik Roy Chowdhuri, Shobhit Banga, Vanshika Chhabra, Chitralekha Bhat, Shareef babu Kalluri, Srikanth Raj Chetupalli, Deepu Vijayasenan, Sriram Ganapathy
Main category: eess.AS
TL;DR: DISPLACE-M challenge introduces a medical conversational AI benchmark with multi-speaker Indian language dialogues, featuring 4 tasks: speaker diarization, ASR, topic identification, and dialogue summarization.
Details
Motivation: To create a benchmark for understanding goal-oriented, real-world medical dialogues in noisy, multi-speaker environments with overlapping speech across Indian languages and dialects, addressing the gap in healthcare conversational AI systems.
Method: Released a medical conversational dataset (25h dev + 10h eval), provided baseline systems in a unified end-to-end pipeline across 4 tasks, and organized a global challenge with 12 participating teams evaluated using metrics like DER, tcpWER, and ROUGE-L.
Result: 12 teams participated globally, pushing baseline system performance, but even with 6-8 weeks of dedicated effort, the tasks proved substantially challenging and existing systems are far from healthcare deployment readiness.
Conclusion: Medical conversational AI in noisy, multi-speaker environments remains a significant challenge requiring further research, with DISPLACE-M providing a valuable benchmark for advancing healthcare dialogue understanding systems.
Abstract: The DIarization and Speech Processing for LAnguage understanding in Conversational Environments - Medical (DISPLACE-M) challenge introduces a conversational AI benchmark focused on understanding goal-oriented, real-world medical dialogues collected in the field. The challenge addresses multi-speaker interactions between healthcare workers and seekers characterized by spontaneous, noisy and overlapping speech across Indian languages and dialects. As part of the challenge, a medical conversational dataset comprising 25 hours of development data and 10 hours of blind evaluation recordings was released. We provided baseline systems within a unified end-to-end pipeline across 4 tasks - speaker diarization, automatic speech recognition, topic identification and dialogue summarization - to enable consistent benchmarking. System performance is evaluated using established metrics such as diarization error rate (DER), time-constrained minimum-permutation word error rate (tcpWER), and ROUGE-L. During this evaluation (Phase-I), 12 teams from across the globe actively participated, pushing the baseline systems on these metrics. However, even with a 6-8 week dedicated effort from various participants, the tasks proved substantially challenging, and the existing systems fall significantly short of healthcare deployment readiness.
[642] DBMIF: a deep balanced multimodal iterative fusion framework for air- and bone-conduction speech enhancement
Yilei Wu, Changyan Zheng, Xingyu Zhang, Yakun Zhang, Chengshi Zheng, Shuang Yang, Ye Yan, Erwei Yin
Main category: eess.AS
TL;DR: DBMIF: A three-branch multimodal fusion framework that combines air-conduction and bone-conduction speech signals using iterative attention and cross-branch gating to achieve robust speech enhancement in extremely low SNR environments.
Details
Motivation: Conventional speech enhancement systems fail in extremely low SNR environments where AC microphones are overwhelmed by noise. While BC sensors offer noise-tolerant information, existing fusion methods struggle with consistent performance across varying SNR conditions.
Method: Proposes Deep Balanced Multimodal Iterative Fusion Framework (DBMIF) with three-branch architecture: multi-scale interactive encoder-decoder backbone, iterative attention module for adaptive weighting, cross-branch gated module for bidirectional exchange, and balanced-interaction bottleneck for compact fused representation.
Result: Achieves competitive performance in speech quality and intelligibility across diverse noise types. Reduces character error rate by at least 2.5% in downstream ASR tasks compared to competing approaches.
Conclusion: DBMIF effectively harnesses BC speech robustness while preserving AC speech naturalness, ensuring reliability in real-world scenarios. The framework demonstrates superior multimodal fusion for speech enhancement in challenging noise conditions.
Abstract: The performance of conventional speech enhancement systems degrades sharply in extremely low signal-to-noise ratio (SNR) environments where air-conduction (AC) microphones are overwhelmed by ambient noise. Although bone-conduction (BC) sensors offer complementary, noise-tolerant information, existing fusion approaches struggle to maintain consistent performance across a wide range of SNR conditions. To address this limitation, we propose the Deep Balanced Multimodal Iterative Fusion Framework (DBMIF), a three-branch architecture designed to reconstruct high-fidelity speech through rigorous cross-modal interaction. Specifically, grounded in a multi-scale interactive encoder-decoder backbone, the framework orchestrates an iterative attention module and a cross-branch gated module to facilitate adaptive weighting and bidirectional exchange. To complement this dynamic interaction, a balanced-interaction bottleneck is further integrated to learn a compact, stable fused representation. Extensive experiments demonstrate that DBMIF achieves competitive performance compared with recent unimodal and multimodal baselines in both speech quality and intelligibility across diverse noise types. In downstream ASR tasks, the proposed method reduces the character error rate by at least 2.5 percent compared to competing approaches. These results confirm that DBMIF effectively harnesses the robustness of BC speech while preserving the naturalness of AC speech, ensuring reliability in real-world scenarios. The source code is publicly available at github.com/wyl516w/dbmif.
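The cross-branch gated exchange described above can be illustrated with a minimal sigmoid-gate fusion of AC and BC feature vectors. This is a generic sketch of gated multimodal fusion, not the paper's architecture; the function and weight names are hypothetical.

```python
import numpy as np

def gated_fusion(ac_feat, bc_feat, w_gate, b_gate):
    """Blend air-conduction (AC) and bone-conduction (BC) feature vectors
    with a per-dimension sigmoid gate: g near 1 trusts AC, g near 0 trusts BC."""
    gate_input = np.concatenate([ac_feat, bc_feat])
    g = 1.0 / (1.0 + np.exp(-(w_gate @ gate_input + b_gate)))
    return g * ac_feat + (1.0 - g) * bc_feat

rng = np.random.default_rng(0)
d = 4
ac = rng.standard_normal(d)                # stand-in AC encoder features
bc = rng.standard_normal(d)                # stand-in BC encoder features
w = rng.standard_normal((d, 2 * d)) * 0.1  # hypothetical gate weights
fused = gated_fusion(ac, bc, w, b_gate=np.zeros(d))
# Each fused dimension is a convex combination of the two modalities.
```

In a trained system the gate would learn to lean on the noise-tolerant BC branch at low SNR and the more natural AC branch at high SNR.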
[643] Does Fine-tuning by Reinforcement Learning Improve Generalization in Binary Speech Deepfake Detection?
Xin Wang, Ge Wanying, Junichi Yamagishi
Main category: eess.AS
TL;DR: Using reinforcement learning (GRPO) instead of supervised fine-tuning improves speech deepfake detection generalization to unseen attacks while maintaining target-domain performance.
Details
Motivation: Speech deepfake detection models struggle to generalize to unseen attacks. While pre-training with speech foundation models is common, most approaches rely solely on supervised fine-tuning. Inspired by RL's success in large language model fine-tuning, researchers investigate whether RL can improve generalization in speech deepfake detection.
Method: The paper investigates using Group Relative Policy Optimization (GRPO) for fine-tuning speech deepfake detection models instead of supervised fine-tuning. Experiments are conducted with multiple detectors and test sets, comparing pure GRPO-based fine-tuning against SFT-only and hybrid setups. Ablation studies examine the role of negative rewards in GRPO.
Result: Pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Ablation studies suggest that negative rewards in GRPO may be a key factor in the improvement.
Conclusion: Reinforcement learning, specifically GRPO, offers a promising alternative to supervised fine-tuning for speech deepfake detection, improving generalization to unseen attacks while preserving performance on known attacks.
Abstract: Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (SFT). Inspired by the field of large language models, wherein reinforcement learning (RL) is used for model fine-tuning, we investigate the impact of RL, specifically Group Relative Policy Optimization (GRPO). The results from experiments using multiple detectors and test sets indicate that pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Our ablation studies further suggest that the negative reward in GRPO may be a key factor in this improvement.
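The group-relative step at the heart of GRPO can be sketched independently of the detection task: each sampled response's reward is standardized against its group's mean and standard deviation, so below-average responses receive a negative advantage (the "negative reward" the ablation points to). A minimal illustration, not the paper's reward design:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each reward within its group
    of sampled responses, so above-average responses get positive signal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# e.g. binary rewards for 4 sampled answers to one input (1 correct, 0 wrong)
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get +1, wrong answers get -1: both directions carry signal.
```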
[644] Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection
Kashaf Gulzar, Korbinian Riedhammer, Elmar Nöth, Andreas K. Maier, Paula Andrea Pérez-Toro
Main category: eess.AS
TL;DR: Analysis of bias in speech-based cognitive impairment detection using acoustic features and Wav2Vec2 embeddings, revealing performance disparities across demographic subgroups and limited cross-task generalization between CI and depression classification.
Details
Motivation: Speech-based detection of cognitive impairment offers non-invasive early diagnosis, but performance disparities across demographic and clinical subgroups raise fairness and generalizability concerns that need systematic investigation.
Method: Systematic bias analysis using DementiaBank Pitt Corpus comparing traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0, evaluating classification performance across gender, age, and depression-status subgroups for CI and depression detection.
Result: Higher-layer Wav2Vec2 embeddings outperform baselines (UAR up to 80.6%) but show significant performance disparities: females and younger participants have lower discriminative power (AUC: 0.769 and 0.746) and substantial specificity disparities (Δspec up to 18% and 15%). Depression detection within CI subjects yields lower performance with mild improvements from low/mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited.
Conclusion: Findings emphasize need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, highlighting demographic and clinical heterogeneity challenges in real-world deployment.
Abstract: Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6%), but exhibit performance disparities; specifically, females and younger participants demonstrate lower discriminative power (AUC: 0.769 and 0.746, respectively) and substantial specificity disparities (Δspec up to 18% and 15%, respectively), leading to a higher risk of misclassifications than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly in light of demographic and clinical heterogeneity in real-world applications.
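A subgroup specificity gap like the reported Δspec reduces to comparing true-negative rates across demographic groups. A minimal sketch with toy labels (not the paper's data):

```python
def specificity(y_true, y_pred):
    """True-negative rate: fraction of actual negatives predicted negative."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp)

def specificity_gap(y_true, y_pred, groups, a, b):
    """Delta_spec between two subgroups, as in fairness audits."""
    def subgroup(g):
        yt = [t for t, gr in zip(y_true, groups) if gr == g]
        yp = [p for p, gr in zip(y_pred, groups) if gr == g]
        return specificity(yt, yp)
    return abs(subgroup(a) - subgroup(b))

y_true = [0, 0, 1, 0, 0, 1, 0, 0]   # toy ground truth (1 = impaired)
y_pred = [0, 1, 1, 0, 0, 1, 1, 1]   # toy model decisions
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]
gap = specificity_gap(y_true, y_pred, groups, "f", "m")
# Here spec_f = 2/3 and spec_m = 1/3, so the gap is 1/3.
```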
[645] Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper
Main category: eess.AS
TL;DR: Analysis of how self-supervised speech models structure representations, finding speaker information encoded in principal dimensions of SSL features, with applications for voice synthesis control.
Details
Motivation: To understand how speech models trained through self-supervised learning structure their representations, specifically examining whether speech characteristics are captured within individual dimensions of SSL features, which has been understudied compared to layer-wise analysis.
Method: Used PCA on utterance-averaged representations from WavLM to analyze principal dimensions. Examined correlations between individual principal dimensions and speech characteristics like pitch, intensity, noise levels, formants. Conducted synthesis experiments to test controllability.
Result: Found that the principal dimension explaining most variance encodes pitch and associated characteristics like gender. Other individual dimensions correlate with intensity, noise levels, second formant, and higher frequency characteristics. Synthesis experiments showed most characteristics can be controlled by changing corresponding dimensions.
Conclusion: Self-supervised speech models structure representations such that individual dimensions encode specific speech characteristics, providing a simple method to control voice characteristics in synthesis applications through dimension manipulation.
Abstract: How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. Using WavLM, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. Finally, in synthesis experiments we show that most characteristics can be controlled by changing the corresponding dimensions. This provides a simple method to control characteristics of the output voice in synthesis applications.
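The analysis pattern, running PCA on utterance-averaged features and correlating principal dimensions with characteristics such as pitch, can be sketched on synthetic data (the study itself uses WavLM features; the sizes and the pitch proxy below are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for utterance-averaged SSL features: 200 utterances x 16 dims,
# where one latent factor (a pitch proxy) drives most of the variance.
pitch = rng.normal(150.0, 40.0, size=200)           # synthetic F0 in Hz
loading = rng.standard_normal(16)                   # hypothetical loading
feats = np.outer(pitch - pitch.mean(), loading) + rng.standard_normal((200, 16))

# PCA via SVD of the mean-centered feature matrix
centered = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1_scores = centered @ vt[0]                       # projection onto PC1

# The top principal dimension recovers the dominant characteristic.
corr = abs(np.corrcoef(pc1_scores, pitch)[0, 1])
```

The synthesis experiments in the paper then run this logic in reverse: shifting an utterance along such a dimension changes the corresponding characteristic of the output voice.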
[646] Using Songs to Improve Kazakh Automatic Speech Recognition
Rustem Yeshpanov
Main category: eess.AS
TL;DR: Using songs as unconventional training data for Kazakh ASR, showing modest but meaningful improvements over zero-shot baselines when combined with other small datasets.
Details
Motivation: Low-resource languages like Kazakh lack sufficient transcribed speech data for effective ASR development. This study explores songs as an alternative data source to address data scarcity.
Method: Curated dataset of 3,013 audio-text pairs (4.5 hours) from Kazakh songs, segmented at lyric-line level. Fine-tuned Whisper models under 7 training scenarios combining Songs, Common Voice Corpus, and FLEURS datasets. Evaluated on CVC, FLEURS, and Kazakh Speech Corpus 2 benchmarks.
Result: Song-based fine-tuning improves performance over zero-shot baselines. Best model (Whisper Large-V3 Turbo trained on Songs+CVC+FLEURS) achieved 27.6% WER on CVC, 11.8% on FLEURS, and halved error on KSC2 (39.3% vs 81.2%). Gains remain below models trained on 1,100-hour KSC2 corpus.
Conclusion: Even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR, demonstrating songs as a viable unconventional data source. Dataset released for research under non-commercial license.
Abstract: Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, while halving the error on KSC2 (39.3% vs. 81.2%) relative to the zero-shot model. Although these gains remain below those of models trained on the 1,100-hour KSC2 corpus, they demonstrate that even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR. The dataset is released on Hugging Face for research purposes under a gated, non-commercial licence.
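The WER figures above come from the standard word-level edit distance; a minimal implementation (text normalization such as lowercasing and punctuation removal is assumed to have been applied beforehand):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over reference words,
    counting substitutions, insertions and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

score = wer("the cat sat", "the bat sat down")
# 1 substitution + 1 insertion over 3 reference words
```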
[647] TCG CREST System Description for the DISPLACE-M Challenge
Nikhil Raghav, Md Sahidullah
Main category: eess.AS
TL;DR: TCG CREST system for speaker diarization in noisy medical conversations, comparing modular SpeechBrain pipeline with end-to-end Diarizen system using WavLM, achieving ~39% DER improvement.
Details
Motivation: Address speaker diarization challenges in noisy rural healthcare scenarios with naturalistic medical conversations, evaluating different VAD methods and clustering algorithms for improved performance.
Method: Two frameworks: 1) Modular pipeline using SpeechBrain with ECAPA-TDNN embeddings, 2) Hybrid end-to-end Diarizen system built on pre-trained WavLM. Explored various clustering techniques including AHC and novel spectral clustering variants (SC-adapt, SC-PNA, SC-MK).
Result: Diarizen system provided ~39% relative improvement in DER compared to SpeechBrain baseline. Best system achieved DER of 10.37% on development and 9.21% on evaluation sets. Team ranked 6th out of 11 participants.
Conclusion: End-to-end neural diarization systems like Diarizen with advanced clustering techniques significantly outperform traditional modular approaches for speaker diarization in challenging noisy medical conversation scenarios.
Abstract: This report presents the TCG CREST system description for Track 1 (Speaker Diarization) of the DISPLACE-M challenge, focusing on naturalistic medical conversations in noisy rural-healthcare scenarios. Our study evaluates the impact of various voice activity detection (VAD) methods and advanced clustering algorithms on overall speaker diarization (SD) performance. We compare and analyze two SD frameworks: a modular pipeline utilizing SpeechBrain with ECAPA-TDNN embeddings, and a state-of-the-art (SOTA) hybrid end-to-end neural diarization system, Diarizen, built on top of a pre-trained WavLM. With these frameworks, we explore diverse clustering techniques, including agglomerative hierarchical clustering (AHC), and multiple novel variants of spectral clustering, such as SC-adapt, SC-PNA, and SC-MK. Experimental results demonstrate that the Diarizen system provides an approximately 39% relative improvement in the diarization error rate (DER) in the post-evaluation analysis of Phase-I compared to the SpeechBrain baseline. Our best-performing submitted system, the Diarizen baseline with AHC and median filtering over a larger context window of 29, achieved a DER of 10.37% on the development set and 9.21% on the evaluation set. Our team ranked sixth out of the 11 participating teams after the Phase-I evaluation.
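DER, the metric both frameworks are ranked on, can be sketched at the frame level. Real scoring additionally finds the optimal reference-to-hypothesis speaker mapping and applies a forgiveness collar around boundaries; this toy version assumes the mapping is already done and omits the collar.

```python
def der(ref_frames, hyp_frames):
    """Frame-level diarization error rate: (miss + false alarm + confusion)
    over total reference speech frames. None marks non-speech."""
    miss = fa = conf = speech = 0
    for r, h in zip(ref_frames, hyp_frames):
        if r is not None:
            speech += 1
            if h is None:
                miss += 1      # speech labelled as silence
            elif h != r:
                conf += 1      # speech assigned to the wrong speaker
        elif h is not None:
            fa += 1            # silence labelled as speech
    return (miss + fa + conf) / speech

ref = ["A", "A", "B", "B", None, "B"]   # toy reference labels per frame
hyp = ["A", "B", "B", None, "A", "B"]   # toy system output per frame
error = der(ref, hyp)
# 1 confusion + 1 miss + 1 false alarm over 5 reference speech frames
```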
[648] Predicting Tuberculosis from Real-World Cough Audio Recordings and Metadata
George P. Kafentzis, Stephane Tetsing, Joe Brew, Lola Jover, Mindaugas Galvosas, Carlos Chaccour, Peter M. Small
Main category: eess.AS
TL;DR: Using cough audio recordings from TB and non-TB patients collected via mobile app, statistical classifiers achieve ~0.70 AUC with cough sounds alone, improving to ~0.81 AUC when combined with clinical metadata.
Details
Motivation: TB diagnosis typically requires clinical exams and specialized tests, but cough sounds from different respiratory diseases can be distinguished. Automated cough analysis via mobile apps could help improve TB case-finding while reducing costs.
Method: Used large dataset of TB/non-TB cough recordings collected via Hyfe mobile app from Africa, India, and Asia. Applied statistical classifiers based on spectral and time domain features, with/without clinical metadata. Used stratified grouped cross-validation at cough-level and participant-level.
Result: Achieved average AUC of ~0.70 ± 0.05 using cough sounds alone, and ~0.81 ± 0.05 when adding demographic and clinical factors. Both cough-level and participant-level classification performed similarly.
Conclusion: Mobile phone-based applications integrating clinical symptoms and cough sound analysis could help community health workers and health programs improve TB case-finding efforts while reducing costs, potentially improving public health.
Abstract: Tuberculosis (TB) is an infectious disease caused by the bacterium Mycobacterium tuberculosis and primarily affects the lungs, as well as other body parts. TB is spread through the air when an infected person coughs, sneezes, or talks. Medical doctors diagnose TB in patients via clinical examinations and specialized tests. However, coughing is a common symptom of respiratory diseases such as TB. Literature suggests that cough sounds coming from different respiratory diseases can be distinguished by both medical doctors and computer algorithms. Therefore, cough recordings associated with patients with and without TB seem to be a reasonable avenue of investigation. In this work, we utilize a very large dataset of TB and non-TB cough audio recordings obtained from the south-east of Africa, India, and the south-east of Asia using a fully automated phone-based application (Hyfe), without manual annotation. We fit statistical classifiers based on spectral and time domain features with and without clinical metadata. A stratified grouped cross-validation approach shows that an average Area Under the Curve (AUC) of approximately 0.70 $\pm$ 0.05 for both cough-level and participant-level classification can be achieved using cough sounds alone. The addition of demographic and clinical factors increases performance, resulting in an average AUC of approximately 0.81 $\pm$ 0.05. Our results suggest mobile phone-based applications that integrate clinical symptoms and cough sound analysis could help community health workers and, most importantly, health service programs to improve TB case-finding efforts while reducing costs, which could substantially improve public health.
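Two ingredients of the evaluation, the rank-based AUC and participant-level grouping that keeps all of one person's coughs in a single fold (preventing leakage between train and test), can be sketched as follows. Toy data, hypothetical helper names:

```python
def auc(scores_pos, scores_neg):
    """Probability a random positive outranks a random negative
    (ties count half) -- the AUC reported per fold."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def participant_level_folds(participants):
    """Group-aware folds: every cough from one participant stays in the
    same fold, so no individual leaks between train and test."""
    unique = sorted(set(participants))
    fold_of = {p: i % 2 for i, p in enumerate(unique)}  # 2 folds for brevity
    return [fold_of[p] for p in participants]

folds = participant_level_folds(["p1", "p1", "p2", "p3"])
# Both of p1's coughs land in the same fold.
```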
eess.IV
[649] Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification
Nikhileswara Rao Sulake
Main category: eess.IV
TL;DR: Systematic evaluation of loss functions, architectures, and post-training strategies for long-tailed multi-label chest X-ray classification, with LDAM-DRW outperforming standard approaches and ConvNeXt-Large achieving best performance.
Details
Motivation: Long-tailed class distributions in multi-label chest X-ray classification pose challenges for recognizing rare but clinically important findings that are severely underrepresented, requiring specialized approaches to handle class imbalance.
Method: Empirical evaluation of loss functions (BCE, asymmetric losses, LDAM-DRW), CNN backbone architectures, and post-training strategies (classifier re-training, test-time augmentation) on CXR-LT 2026 benchmark with 143K images and 30 disease labels.
Result: LDAM-DRW consistently outperformed standard BCE and asymmetric losses for rare class recognition; ConvNeXt-Large achieved best single-model performance (0.5220 mAP, 0.3765 F1); submission ranked 5th among 68 teams with 0.3950 mAP on official test leaderboard.
Conclusion: LDAM-DRW is effective for long-tailed medical image classification, with architecture selection and post-training strategies providing additional improvements, though development-to-test performance gaps highlight challenges in clinical imaging settings.
Abstract: Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. In this work, we present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest. Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst all 68 participating teams with a total of 1,528 submissions. We provide a candid analysis of the development-to-test performance gap and discuss practical insights for handling class imbalance in clinical imaging settings. Code is available at https://github.com/Nikhil-Rao20/Long_Tail.
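LDAM's core idea, class margins proportional to n_j^(-1/4) so that rare classes are pushed further from the decision boundary, is easy to sketch. Note this shows the standard multi-class form, whereas the paper applies it in a multi-label setting, and DRW additionally re-weights classes later in training:

```python
import math

def ldam_margins(class_counts, max_margin=0.5):
    """LDAM margins scale as n_j^(-1/4): rarer classes get larger margins."""
    raw = [c ** -0.25 for c in class_counts]
    scale = max_margin / max(raw)
    return [scale * r for r in raw]

def ldam_loss(logits, label, margins):
    """Cross-entropy after subtracting the class margin from the true logit,
    forcing a larger decision margin for underrepresented classes."""
    z = list(logits)
    z[label] -= margins[label]
    log_norm = math.log(sum(math.exp(v) for v in z))
    return log_norm - z[label]

margins = ldam_margins([10000, 100])  # head vs. tail class counts
# Mirrored logits with the same confidence: the rare-class case is
# penalized more because its margin is larger.
loss_head = ldam_loss([2.0, 1.0], 0, margins)
loss_tail = ldam_loss([1.0, 2.0], 1, margins)
```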
[650] Biomechanically Accurate Gait Analysis: A 3d Human Reconstruction Framework for Markerless Estimation of Gait Parameters
Akila Pemasiri, Ethan Goan, Glen Lichtwark, Robert Schuster, Luke Kelly, Clinton Fookes
Main category: eess.IV
TL;DR: A markerless gait analysis framework using 3D human reconstruction from video that extracts biomechanically interpretable markers and integrates with OpenSim for joint kinematics, showing strong agreement with marker-based motion capture.
Details
Motivation: To develop a scalable, markerless, and interpretable approach for gait analysis that bridges the gap between conventional keypoint-based pose estimation and biomechanically meaningful motion capture systems, enabling broader clinical and real-world deployment of vision-based biomechanics.
Method: Uses 3D human reconstruction from video data to extract biomechanically meaningful markers analogous to motion capture systems, then integrates these markers within OpenSim for joint kinematic estimation. Compares both spatiotemporal and kinematic gait parameters against reference marker-based data.
Result: The framework shows strong agreement with marker-based measurements, with considerable improvements compared to pose-estimation methods alone. It provides accurate gait assessment while being markerless and scalable.
Conclusion: The proposed framework offers a scalable, markerless, and interpretable approach for accurate gait assessment that supports broader clinical and real-world deployment of vision-based biomechanics, bridging the gap between computer vision and biomechanical analysis.
Abstract: This paper presents a biomechanically interpretable framework for gait analysis using 3D human reconstruction from video data. Unlike conventional keypoint-based approaches, the proposed method extracts biomechanically meaningful markers analogous to motion capture systems and integrates them within OpenSim for joint kinematic estimation. To evaluate performance, both spatiotemporal and kinematic gait parameters were analysed against reference marker-based data. Results indicate strong agreement with marker-based measurements, with considerable improvements when compared with pose-estimation methods alone. The proposed framework offers a scalable, markerless, and interpretable approach for accurate gait assessment, supporting broader clinical and real-world deployment of vision-based biomechanics.
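Spatiotemporal parameters like those compared against marker-based references reduce to simple computations once gait events are extracted from the reconstructed markers; a sketch with hypothetical heel-strike timestamps (event detection itself is not shown):

```python
def stride_times(heel_strikes):
    """Stride time = interval between successive heel strikes of one foot."""
    return [b - a for a, b in zip(heel_strikes, heel_strikes[1:])]

def cadence_steps_per_min(left_strikes, right_strikes):
    """Cadence from all step events across both feet."""
    events = sorted(left_strikes + right_strikes)
    duration_s = events[-1] - events[0]
    return (len(events) - 1) * 60.0 / duration_s

left = [0.00, 1.10, 2.20, 3.30]   # hypothetical heel-strike times (s)
right = [0.55, 1.65, 2.75]
strides = stride_times(left)      # ~1.1 s per stride in this toy gait
```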
[651] DLIOS: An LLM-Augmented Real-Time Multi-Modal Interactive Enhancement Overlay System for Douyin Live Streaming
Shuide Wen, Sungil Seok, Beier Ku, Richee Li, Yubin He, Bowen Qu, Yang Yang, Ping Su, Can Jiao
Main category: eess.IV
TL;DR: DLIOS is an LLM-augmented real-time multimodal interactive enhancement system for TikTok live streaming with transparent overlay rendering, automated broadcast commentary, and AI persona capabilities.
Details
Motivation: To enhance live streaming experiences by creating an automated, interactive system that generates emotionally coherent commentary, manages real-time interactions, and supports virtual AI personas for more engaging broadcasts.
Method: Three-layer transparent window architecture with WebView2 capture pipeline and thread-safe event bus, plus LLM broadcast automation framework with four-segment prompt scheduling, multi-persona support, real-time danmaku reaction engine, and AI singer-songwriter persona case study.
Result: 36-hour stress test showed zero danmaku overlap, zero deadlock crashes, gift effect P95 latency ≤180ms, LLM-to-TTS segment P95 latency ≤2.1s, and TTS integrated loudness gain of 9.5 LUFS.
Conclusion: DLIOS successfully demonstrates a robust real-time multimodal interactive system for live streaming enhancement using LLM automation, with practical applications for content creators and virtual personas.
Abstract: We present DLIOS, a Large Language Model (LLM)-augmented real-time multi-modal interactive enhancement overlay system for Douyin (TikTok) live streaming. DLIOS employs a three-layer transparent window architecture for independent rendering of danmaku (scrolling text), gift and like particle effects, and VIP entrance animations, built around an event-driven WebView2 capture pipeline and a thread-safe event bus. On top of this foundation we contribute an LLM broadcast automation framework comprising: (1) a per-song four-segment prompt scheduling system (T1 opening/transition, T2 empathy, T3 era story/production notes, T4 closing) that generates emotionally coherent radio-style commentary from lyric metadata; (2) a JSON-serializable RadioPersonaConfig schema supporting hot-swap multi-persona broadcasting; (3) a real-time danmaku quick-reaction engine with keyword routing to static urgent speech or LLM-generated empathetic responses; and (4) the Suwan Li AI singer-songwriter persona case study, with over 100 AI-generated songs produced with Suno. A 36-hour stress test demonstrates: zero danmaku overlap, zero deadlock crashes, gift effect P95 latency <= 180 ms, LLM-to-TTS segment P95 latency <= 2.1 s, and TTS integrated loudness gain of 9.5 LUFS.
Keywords: live streaming; danmaku; large language model; prompt engineering; virtual persona; WebView2; WINMM; TTS; Suno; loudness normalization; real-time scheduling
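The P95 figures in the stress test are plain 95th percentiles over per-event timings; a nearest-rank sketch (the sample latencies below are made up, not measurements from the paper):

```python
import math

def p95(samples_ms):
    """95th-percentile latency via the nearest-rank method, as used when
    reporting figures like 'gift effect P95 latency <= 180 ms'."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

latencies = [120, 95, 180, 110, 130, 105, 140, 115, 125, 100,
             135, 150, 98, 112, 160, 108, 122, 118, 145, 170]
worst_typical = p95(latencies)  # 19th of 20 sorted samples
```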
[652] Context Adaptive Extended Chain Coding for Semantic Map Compression
Runyu Yang, Junqi Liao, Hyomin Choi, Fabien Racapé, Ivan V. Bajić
Main category: eess.IV
TL;DR: A novel chain-coding-based framework for lossless compression of semantic maps that exploits contour topology and shared boundaries between semantic regions, achieving 18% bitrate reduction and significant runtime improvements.
Details
Motivation: Semantic maps are increasingly used in robotics, autonomous systems, and extended reality, creating a need for efficient compression methods that preserve structured semantic information while reducing storage and transmission costs.
Method: Proposes an extended chain code (ECC) for compact contour representation with legacy 3OT fallback, context-adaptive entropy coding based on Markov modeling, and skip-coding mechanism to eliminate redundant shared contour representations via run-length signaling.
Result: Achieves average 18% bitrate reduction compared to state-of-the-art benchmarks on semantic map datasets, with up to 98% encoder and 50% decoder runtime reduction relative to modern generic lossless codecs. Consistent gains also shown on occupancy maps.
Conclusion: The proposed chain-coding framework effectively compresses semantic maps by exploiting structural properties like contour topology and shared boundaries, offering both compression efficiency and computational performance improvements for practical applications.
Abstract: Semantic maps are increasingly utilized in areas such as robotics, autonomous systems, and extended reality, motivating the investigation of efficient compression methods that preserve structured semantic information. This paper studies lossless compression of semantic maps through a novel chain-coding-based framework that explicitly exploits contour topology and shared boundaries between adjacent semantic regions. We propose an extended chain code (ECC) to represent long-range contour transitions more compactly, while retaining a legacy three-orthogonal chain code (3OT) as a fallback mode for further efficiency. To efficiently encode sequences of ECC symbols, a context-adaptive entropy coding scheme based on Markov modeling is employed. Furthermore, a skip-coding mechanism is introduced to eliminate redundant representations of shared contours between adjacent semantic regions, supporting both complete and partial skips via run-length signaling. Experimental results demonstrate that the proposed method achieves an average bitrate reduction of 18% compared with a state-of-the-art benchmark on semantic map datasets. In addition, the proposed encoder and decoder achieve up to 98% and 50% runtime reduction, respectively, relative to a modern generic lossless codec. Extended evaluations on occupancy maps further confirm consistent compression gains across the majority of tested scenarios.
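The method builds on chain coding of region contours with context-adaptive entropy coding under a Markov model. As a minimal illustrative sketch (not the paper's ECC/3OT codec), here is a plain 4-direction chain code plus a first-order Markov entropy estimate of the resulting symbol stream, a rough proxy for the context-adaptive coding cost:

```python
# Hedged sketch (not the paper's ECC/3OT codec): a basic 4-direction
# chain code for a closed contour, plus a first-order Markov entropy
# estimate of the symbol stream.
import math
from collections import Counter

MOVES = {(1, 0): 0, (0, 1): 1, (-1, 0): 2, (0, -1): 3}  # E, S, W, N

def chain_code(points):
    """Chain-code a closed contour given as an ordered point list."""
    syms = []
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        syms.append(MOVES[(x1 - x0, y1 - y0)])
    return syms

def markov_entropy_bits(syms):
    """Average bits/symbol under a first-order Markov model."""
    pair_counts = Counter(zip(syms, syms[1:]))
    ctx_counts = Counter(syms[:-1])
    total = len(syms) - 1
    h = 0.0
    for (a, b), n in pair_counts.items():
        h -= (n / total) * math.log2(n / ctx_counts[a])
    return h

# 3x2 rectangle contour: long straight runs make the next symbol
# highly predictable from the previous one, so few bits are needed.
rect = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2),
        (2, 2), (1, 2), (0, 2), (0, 1)]
codes = chain_code(rect)
print(codes)  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
print(markov_entropy_bits(codes))
```

The first-order conditioning is what the context-adaptive entropy coder exploits: on contours dominated by straight runs, the conditional entropy falls well below the 2 bits/symbol of a memoryless 4-symbol code.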
[653] RealOSR: Latent Guidance Boosts Diffusion-based Real-world Omnidirectional Image Super-Resolution
Xuhan Sheng, Runyi Li, Bin Chen, Weiqi Li, Xu Jiang, Jian Zhang
Main category: eess.IV
TL;DR: RealOSR: A diffusion-based framework for real-world omnidirectional image super-resolution with efficient latent-based condition guidance and one-step denoising, achieving 200× faster inference than previous methods.
Details
Motivation: Existing omnidirectional image super-resolution methods are limited by simplified degradation assumptions and cannot model real-world degradation. Recent diffusion approaches suffer from slow inference due to hundreds of updating steps and frequent VAE usage.
Method: Proposes RealOSR framework with Latent Gradient Alignment Routing (LaGAR) module that enables efficient pixel-latent space interactions and simulates gradient descent directly in latent space, using a one-step denoising paradigm.
Result: Significant improvements in visual quality and over 200× inference acceleration compared to recent diffusion-based ODISR method OmniSSR.
Conclusion: RealOSR effectively addresses real-world ODISR challenges with efficient latent-based guidance and fast inference, making it practical for high-quality omnidirectional image super-resolution.
Abstract: Omnidirectional image super-resolution (ODISR) aims to upscale low-resolution (LR) omnidirectional images (ODIs) to high-resolution (HR), catering to the growing demand for detailed visual content across a $ 180^{\circ}\times360^{\circ}$ viewport. Existing ODISR methods are limited by simplified degradation assumptions (e.g., bicubic downsampling), failing to model and exploit the real-world degradation information. Recent latent-based diffusion approaches using condition guidance suffer from slow inference due to their hundreds of updating steps and frequent use of VAE. To tackle these challenges, we propose \textbf{RealOSR}, a diffusion-based framework tailored for real-world ODISR, featuring efficient latent-based condition guidance within a one-step denoising paradigm. Central to efficient latent-based condition guidance is the proposed \textbf{Latent Gradient Alignment Routing (LaGAR)}, a lightweight module that enables effective pixel-latent space interactions and simulates gradient descent directly in the latent space, thereby leveraging the semantic richness and multi-scale features captured by the denoising UNet. Compared to the recent diffusion-based ODISR method, OmniSSR, RealOSR achieves significant improvements in visual quality and over \textbf{200$\times$} inference acceleration. Our code and models will be released upon acceptance.
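The key idea of LaGAR is to simulate gradient descent directly in the latent space. A hedged toy sketch of that pattern (a linear "decoder" standing in for the learned one; the paper's module learns this interaction rather than computing it analytically):

```python
# Hedged toy sketch of "simulating gradient descent in the latent
# space": one guidance step moves a latent z so that a decoder's
# output better matches an observation y. The decoder here is a toy
# linear map; all names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 4))   # toy linear "decoder": latent -> pixels
z = rng.standard_normal(4)        # current latent
y = rng.standard_normal(8)        # pixel-space observation to match

def guidance_step(z, step=0.01):
    residual = D @ z - y          # pixel-space error
    grad = D.T @ residual         # gradient of 0.5 * ||D @ z - y||^2
    return z - step * grad        # one descent step taken in latent space

before = np.linalg.norm(D @ z - y)
after = np.linalg.norm(D @ guidance_step(z) - y)
print(after < before)             # True: the step reduces the error
```

Because the step happens in the latent space, no decode-update-encode round trip through a VAE is needed, which is the efficiency argument the abstract makes for one-step guidance.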
[654] Slot-BERT: Self-supervised Object Discovery in Surgical Video
Guiqiu Liao, Matjaz Jogan, Marcel Hussing, Kenta Nakahashi, Kazuhiro Yasufuku, Amin Madani, Eric Eaton, Daniel A. Hashimoto
Main category: eess.IV
TL;DR: Slot-BERT: A bidirectional long-range model for object-centric representation learning in surgical videos that maintains temporal coherence while being computationally efficient.
Details
Motivation: Existing object-centric methods for videos struggle with maintaining long-range temporal coherence in surgical applications. Recurrent approaches lack temporal consistency for long videos, while fully parallel processing is computationally prohibitive for medical facility hardware.
Method: Slot-BERT uses bidirectional long-range modeling to learn object-centric representations in latent space with robust temporal coherence. It scales to videos of unconstrained lengths and employs a novel slot contrastive loss to reduce redundancy and improve representation disentanglement through enhanced slot orthogonality.
Result: Slot-BERT surpasses state-of-the-art object-centric approaches under unsupervised training on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. It demonstrates efficient zero-shot domain adaptation across diverse surgical specialties and databases.
Conclusion: Slot-BERT provides an effective solution for object-centric representation learning in surgical videos that balances temporal coherence with computational efficiency, enabling practical deployment in medical facilities while supporting reasoning about objects and actions.
Abstract: Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle with maintaining the long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical for implementation on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training, achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
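The slot contrastive loss improves disentanglement by enhancing slot orthogonality. A minimal sketch in that spirit (the paper's exact formulation may differ): penalize the off-diagonal cosine similarity between slot vectors, which is zero when slots are mutually orthogonal and large when slots are redundant.

```python
# Hedged sketch in the spirit of the slot contrastive loss described
# above (the paper's exact formulation may differ): penalize the
# off-diagonal cosine similarity between slot vectors.
import numpy as np

def slot_orthogonality_loss(slots, eps=1e-8):
    """slots: (num_slots, dim) array. Returns the mean squared
    off-diagonal cosine similarity; ~0 when slots are orthogonal."""
    unit = slots / (np.linalg.norm(slots, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T                   # pairwise cosine similarities
    off_diag = sim - np.eye(len(slots))   # remove the self-similarity
    return float(np.mean(off_diag ** 2))

orth = np.eye(3)                      # three mutually orthogonal slots
dup = np.ones((3, 3))                 # three identical (redundant) slots
print(slot_orthogonality_loss(orth))  # ~0: nothing to penalize
print(slot_orthogonality_loss(dup))   # large: redundant slots penalized
```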
[655] Zero-shot CT Super-Resolution using Diffusion-based 2D Projection Priors and Signed 3D Gaussians
Jeonghyun Noh, Hyun-Jic Oh, Won-Ki Jeong
Main category: eess.IV
TL;DR: A zero-shot 3D CT super-resolution framework using diffusion-based 2D projection priors and 3D Gaussian splatting with negative alpha blending for high-resolution CT reconstruction from single low-resolution inputs.
Details
Motivation: Clinical CT diagnosis requires high-resolution images but faces radiation exposure risks. Supervised deep learning methods need paired datasets that are often unavailable, while existing zero-shot methods fail to recover fine structural details due to limited information in single low-resolution volumes.
Method: Two-stage framework: (1) Train diffusion model on abundant X-ray data to upsample low-resolution CT projections, enhancing scarce information. (2) Use 3D Gaussian splatting with novel Negative Alpha Blending (NAB-GS) to model signed residuals between diffusion-generated high-resolution and upsampled low-resolution projections for 3D volume reconstruction.
Result: Demonstrates superior quantitative and qualitative performance on two public datasets. Expert evaluations show clinical potential at 4x super-resolution.
Conclusion: The proposed zero-shot 3D CT super-resolution framework effectively overcomes limitations of existing methods by integrating diffusion-based 2D projection priors and novel 3D reconstruction techniques, showing promise for clinical applications without requiring paired training data.
Abstract: Computed tomography (CT) is important in clinical diagnosis, but acquiring high-resolution (HR) CT is constrained by radiation exposure risks. While deep learning-based super-resolution (SR) methods have shown promise for reconstructing HR CT from low-resolution (LR) inputs, supervised approaches require paired datasets that are often unavailable. Zero-shot methods address this limitation by operating on single LR inputs; however, they frequently fail to recover fine structural details due to the limited LR information within individual volumes. To overcome these limitations, we propose a novel zero-shot 3D CT SR framework that integrates diffusion-based upsampled 2D projection priors into the 3D reconstruction process. Specifically, our framework consists of two stages: (1) LR CT projection SR, training a diffusion model on abundant X-ray data to upsample LR projections, thereby enhancing the scarce information inherent in the LR inputs. (2) 3D CT volume reconstruction, using 3D Gaussian splatting with our novel Negative Alpha Blending (NAB-GS), which models positive and negative Gaussian densities to learn signed residuals between diffusion-generated HR and upsampled LR projections. Our framework demonstrates superior quantitative and qualitative performance on two public datasets, and expert evaluations demonstrate the framework's clinical potential at 4x super-resolution.
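The distinctive piece here is modeling a signed residual with Gaussians whose contribution may be negative. A hedged 1-D sketch of that idea (illustrative, not the paper's splatting renderer): amplitudes are allowed to be negative, so the blend can both add and subtract density, which a conventional non-negative alpha blend cannot.

```python
# Hedged 1-D sketch of the "signed Gaussian" idea behind NAB-GS
# (illustrative, not the paper's renderer): a residual signal modeled
# as a sum of Gaussians whose amplitudes may be negative.
import numpy as np

def render_signed_gaussians(x, params):
    """params: list of (amplitude, mean, sigma); amplitude may be < 0."""
    out = np.zeros_like(x)
    for amp, mu, sigma in params:
        out += amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return out

x = np.linspace(-3, 3, 601)
# One positive and one negative component: the rendered residual can
# dip below zero, unlike a standard non-negative density blend.
residual = render_signed_gaussians(x, [(1.0, -1.0, 0.5), (-0.6, 1.0, 0.5)])
print(residual.min() < 0 < residual.max())  # True: signed blending
```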
[656] Proceedings for the Inaugural Meeting of the International Society for Tractography – IST 2025 Bordeaux
Flavio Dell'Acqua, Maxime Descoteaux, Graham Little, Laurent Petit, Dogu Baran Aydogan, Stephanie Forkel, Alexander Leemans, Simona Schiavi, Michel Thiebaut de Schotten
Main category: eess.IV
TL;DR: Conference proceedings from the International Society for Tractography 2025 covering neuroanatomy, tractography methods, and clinical applications of diffusion MRI
Details
Motivation: To document and disseminate the latest research presented at the inaugural tractography conference, fostering collaboration between neuroanatomy, tractography methods, and clinical applications.
Method: Collection of abstracts from poster, power pitch, and oral sessions presented at the IST Conference 2025 in Bordeaux, France
Result: Proceedings covering advancements in tractography, diffusion MRI, neurological/psychiatric disorders, deep brain stimulation targeting, and brain development
Conclusion: The conference successfully brought together world-leading experts to discuss critical challenges and chart future directions in tractography research
Abstract: This collection comprises the abstracts presented during poster, power pitch, and oral sessions at the Inaugural Conference of the International Society for Tractography (IST Conference 2025), held in Bordeaux, France, from October 13-16, 2025. The conference was designed to foster meaningful exchange and collaboration between disparate fields. The overall focus was on advancing research, innovation, and community in the common fields of interest: neuroanatomy, tractography methods, and scientific/clinical applications of tractography. The included abstracts cover the latest advancements in tractography, diffusion MRI, and related fields, including new work on neurological and psychiatric disorders, deep brain stimulation targeting, and brain development. This landmark event brought together world-leading experts to discuss critical challenges and chart the future direction of the field.
[657] GLIDE-Reg: Global-to-Local Deformable Registration Using Co-Optimized Foundation and Handcrafted Features
Yunzheng Zhu, Aichi Chien, Kimaya Kulkarni, Luoting Zhuang, Stephen Park, Ricky Savjani, Daniel Low, William Hsu
Main category: eess.IV
TL;DR: GLIDE-Reg: A deformable medical image registration method using global semantic embeddings and local descriptors for robust cross-resolution registration, achieving state-of-the-art performance on lung CT datasets.
Details
Motivation: Current deformable registration methods lack robustness and generalizability across spatial resolution differences and anatomical coverage variations in medical imaging, which is crucial for applications like lesion tracking and treatment evaluation.
Method: Joint optimization of registration field and learnable dimensionality reduction module that compresses VFM embeddings to maintain registration relevance, then fuses these global semantic cues with MIND local descriptors for robust registration.
Result: Achieved average DSC of 0.859, 0.862, and 0.901 across 6 anatomical structures on Lung250M, NLST, and UCLA5DCT datasets, outperforming DEEDS with relative improvements of 3.0%, 0.5%, and 0.1%. Target registration errors of 1.58 mm on Lung250M landmarks and 1.11 mm on NLST nodule centers.
Conclusion: GLIDE-Reg demonstrates robust performance across challenging medical imaging tasks, particularly for lung cancer applications like nodule tracking, showing effectiveness in handling resolution and coverage variations.
Abstract: Deformable registration is crucial in medical imaging, with applications including lesion tracking, probabilistic atlas generation, and treatment response evaluation. However, current methods often lack robustness and generalizability across two key factors: spatial resolution and differences in anatomical coverage. We jointly optimize a registration field and a learnable dimensionality reduction module so that compressed VFM embeddings remain registration-relevant, and fuse these global semantic cues with MIND local descriptors. GLIDE-Reg achieves average Dice similarity coefficients (DSC) across 6 anatomical structures of 0.859, 0.862, and 0.901 in two public cohorts (Lung250M and NLST) and one institutional cohort (UCLA5DCT), outperforming the state-of-the-art DEEDS (0.834, 0.858, 0.900) with relative improvements of 3.0%, 0.5%, and 0.1%. For target registration error, GLIDE-Reg achieves 1.58 mm on Lung250M landmarks (compared to 1.25 mm for corrField and 1.91 mm for DEEDS) and 1.11 mm on NLST nodule centers (matching DEEDS at 1.11 mm). The performance on nodule centers also demonstrates robustness on challenging downstream tasks, such as nodule tracking, an essential prior step for early-stage lung cancer diagnosis.
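The overlap numbers above are Dice similarity coefficients. As a minimal sketch for readers unfamiliar with the metric (standard definition; masks here are synthetic), DSC = 2|A ∩ B| / (|A| + |B|):

```python
# Hedged sketch: the Dice similarity coefficient (DSC) used to report
# the registration overlap numbers above, computed on binary masks.
# The example masks are synthetic, not from the paper's cohorts.
import numpy as np

def dice(mask_a, mask_b):
    """DSC = 2|A ∩ B| / (|A| + |B|); 1.0 for identical masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks overlap perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

a = np.zeros((4, 4), dtype=bool); a[:2, :] = True   # 8 voxels
b = np.zeros((4, 4), dtype=bool); b[1:3, :] = True  # 8 voxels, 4 shared
print(dice(a, b))  # 0.5
```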