Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian
Main category: cs.SD
TL;DR: The paper introduces UniHAGen, a task for synthesizing comprehensive auditory scenes with both on-screen and off-screen sounds across diverse domains, and OmniSonic, a flow-matching diffusion framework that jointly conditions on video and text via a TriAttn-DiT architecture with MoE gating.
Details
Motivation: Existing video-conditioned audio generation models focus only on on-screen environmental sounds, neglecting off-screen events. Recent holistic text-video-to-audio models exclude human speech. There's a need for universal audio generation that handles both on/off-screen sounds including speech.
Method: OmniSonic uses flow-matching-based diffusion jointly conditioned on video and text. It features a TriAttn-DiT architecture performing three cross-attention operations for on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with MoE gating to adaptively balance contributions.
Result: OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations. The authors also create UniHAGen-Bench with over 1,000 samples covering three representative on/off-screen speech-environment scenarios.
Conclusion: OmniSonic establishes a strong baseline for universal and holistic audio generation, addressing limitations of prior work by handling both on-screen and off-screen sounds including speech across diverse domains.
Abstract: In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, but they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Relevance: 9/10
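The TriAttn-plus-gating idea above can be illustrated with a toy sketch. Everything below is our own simplification for intuition only: a single attention head without learned Q/K/V projections, a gate computed from the pooled latent, and arbitrary dimensions; it is not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d):
    # simplified single-head cross-attention (no learned Q/K/V projections)
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def tri_attn_block(x, cond_on, cond_off, cond_speech, gate_w):
    # three parallel cross-attentions, one per condition stream
    d = x.shape[-1]
    outs = [cross_attention(x, c, d) for c in (cond_on, cond_off, cond_speech)]
    # MoE-style gate: softmax over expert logits from the pooled latent
    gates = softmax(x.mean(axis=0) @ gate_w)  # shape (3,), sums to 1
    return x + sum(g * o for g, o in zip(gates, outs))

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(8, d))                          # audio latent tokens
conds = [rng.normal(size=(5, d)) for _ in range(3)]  # on/off/speech conditions
gate_w = rng.normal(size=(d, 3))
y = tri_attn_block(x, *conds, gate_w)
print(y.shape)  # (8, 16)
```

The gate lets the block downweight, say, the speech condition for clips with no speaker, which is the adaptive balancing the abstract describes.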
[2] Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
Donghuo Zeng, Hao Niu, Masato Taya
Main category: cs.MM
TL;DR: HSC-MAE is a hierarchical semantic correlation-aware masked autoencoder framework that learns aligned multimodal embeddings from weakly paired, label-free audio-visual data through three complementary representation levels.
Details
Motivation: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging due to pre-extracted features, clips containing multiple events, and spurious co-occurrences. Existing methods struggle with these limitations.
Method: Dual-path teacher-student framework with three hierarchical correlation levels: (1) global-level canonical-geometry correlation via DCCA for modality-invariant subspace alignment, (2) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities for multi-positive relational structure, and (3) sample-level conditional-sufficiency correlation via masked autoencoding for discriminative semantic content retention.
Result: Experiments on AVE and VEGAS datasets demonstrate substantial mAP improvements over strong unsupervised baselines, validating robust and well-structured audio-visual representations.
Conclusion: HSC-MAE effectively learns aligned multimodal embeddings from weakly paired, label-free data through hierarchical semantic correlations, outperforming existing unsupervised methods on audio-visual understanding tasks.
Abstract: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences abound. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation - from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
Relevance: 9/10
[3] CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
Damith Chamalke Senadeera, Dimitrios Kollias, Gregory Slabaugh
Main category: cs.CV
TL;DR: CoLoRSMamba: A multimodal architecture for violence detection that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA, enabling scene-aware audio dynamics without token-level cross-attention.
Details
Motivation: Real-world violence detection benefits from audio cues, but audio can be noisy or weakly related to visible scenes. Existing approaches need better integration of audio-visual information for robust multimodal understanding.
Method: Directional Video-to-Audio multimodal architecture using VideoMamba and AudioMamba coupled through CLS-guided conditional LoRA. At each layer, VideoMamba CLS token produces modulation vectors and stabilization gates that adapt AudioMamba’s selective state-space parameters, enabling scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with symmetric AV-InfoNCE objective for clip-level audio-video alignment.
Result: Outperforms audio-only, video-only, and multimodal baselines on curated audio-filtered subsets of NTU-CCTV (88.63% accuracy/86.24% F1-V) and DVD datasets (75.77% accuracy/72.94% F1-V). Offers favorable accuracy-efficiency tradeoff with fewer parameters and FLOPs than larger models.
Conclusion: CoLoRSMamba effectively integrates audio and visual information for violence detection through efficient multimodal coupling, demonstrating superior performance and computational efficiency compared to existing approaches.
Abstract: Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
Relevance: 9/10
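The CLS-guided conditional LoRA coupling can be sketched as follows. The specific functional forms here (tanh modulation, sigmoid gate, modulating the low-rank factors of a generic projection rather than Mamba's Delta/B/C pathways) are illustrative guesses at the mechanism, not the released model:

```python
import numpy as np

def cls_conditioned_lora(x_audio, W, A, B, cls_vid, Wm, Wg):
    """Hypothetical CLS-guided conditional LoRA step: the video CLS token
    produces a channel-wise modulation vector and a scalar stabilization
    gate that reshape a low-rank update to an audio projection W."""
    m = np.tanh(cls_vid @ Wm)                   # channel-wise modulation, shape (r,)
    g = 1.0 / (1.0 + np.exp(-(cls_vid @ Wg)))   # stabilization gate in (0, 1)
    delta = (A * m) @ B                         # modulated rank-r update
    return x_audio @ (W + g * delta)

rng = np.random.default_rng(0)
d, r = 16, 4
x_audio = rng.normal(size=(10, d))              # audio tokens
W = rng.normal(size=(d, d)) * 0.1               # frozen audio projection
A, B = rng.normal(size=(d, r)) * 0.1, rng.normal(size=(r, d)) * 0.1
cls_vid = rng.normal(size=(d,))                 # video CLS token
Wm, Wg = rng.normal(size=(d, r)), rng.normal(size=(d,))
y = cls_conditioned_lora(x_audio, W, A, B, cls_vid, Wm, Wg)
print(y.shape)  # (10, 16)
```

The key design point survives the simplification: the video stream steers the audio stream through a cheap low-rank delta instead of token-level cross-attention.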
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 149]
- cs.CV [Total: 307]
- cs.AI [Total: 191]
- cs.SD [Total: 8]
- cs.LG [Total: 237]
- cs.MA [Total: 16]
- cs.MM [Total: 2]
- eess.AS [Total: 8]
- eess.IV [Total: 14]
cs.CL
[1] Self-Execution Simulation Improves Coding Models
Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi
Main category: cs.CL
TL;DR: Training Code LLMs to simulate program execution step-by-step improves competitive programming performance through supervised fine-tuning on execution traces and reinforcement learning with verifiable rewards.
Details
Motivation: LLMs struggle to properly estimate program execution for code they generate, which limits their ability to produce consistently correct code. The paper aims to address this limitation by enabling Code LLMs to simulate program execution.
Method: Combines supervised fine-tuning on natural language execution traces (textual explanations grounded in true execution) with reinforcement learning using verifiable rewards. Uses two objectives: output prediction given code and inputs, and solving competitive programming tasks with ground-truth or self-predicted execution feedback.
Result: The method yields consistent improvements over standard reasoning approaches across multiple competitive programming benchmarks. Enables models to perform self-verification over multiple candidate solutions and iterative self-fixing by simulating test execution.
Conclusion: Code LLMs can be effectively trained to simulate program execution, and this capability significantly improves competitive programming performance through self-verification and iterative self-fixing mechanisms.
Abstract: A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
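The "execution traces grounded in true execution" that the method fine-tunes on could, in principle, be harvested with a standard tracer. The sketch below is our own minimal illustration of collecting step-by-step execution state (relative line numbers plus locals) that a pipeline might then verbalise; it is not the paper's data pipeline:

```python
import sys

def trace_execution(fn, *args):
    """Collect (relative line number, locals snapshot) events while fn runs.
    These raw events are the kind of ground truth a natural-language
    execution trace could be generated from (hypothetical pipeline)."""
    events = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno - fn.__code__.co_firstlineno,
                           dict(frame.f_locals)))
        return tracer  # keep tracing nested line events
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, events

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, trace = trace_execution(gcd, 48, 18)
print(result)  # 6
print(trace[1])  # e.g. a line event with locals {'a': 18, 'b': 12}
```

Each event pairs a source line with the live variable bindings, exactly the information a model must predict when it simulates execution step by step.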
[2] Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation
Minghe Shen, Ananth Balashankar, Adam Fisch, David Madras, Miguel Rodrigues
Main category: cs.CL
TL;DR: A constrained maximum-likelihood estimation method for LLM failure rate estimation using human calibration data, LLM-judge annotations, and domain constraints for more accurate certification.
Details
Motivation: Current LLM failure rate estimation faces a tradeoff between expensive human labeling and potentially biased automatic "LLM-as-a-Judge" annotation schemes, requiring a more practical and efficient approach.
Method: Constrained maximum-likelihood estimation that integrates three signal sources: small human-labeled calibration set, large corpus of LLM-judge annotations, and domain-specific constraints on judge performance statistics.
Result: The method consistently delivers more accurate and lower-variance estimates than state-of-the-art baselines like Prediction-Powered Inference across varying judge accuracies, calibration set sizes, and LLM failure rates.
Conclusion: By moving beyond “black-box” automated judges to a flexible framework, the method provides a principled, interpretable, and scalable pathway towards LLM failure-rate certification.
Abstract: The ability to rigorously estimate the failure rates of large language models (LLMs) is a prerequisite for their safe deployment. Currently, however, practitioners often face a tradeoff between expensive human gold standards and potentially severely-biased automatic annotation schemes such as “LLM-as-a-Judge” labeling. In this paper, we propose a new, practical, and efficient approach to LLM failure rate estimation based on constrained maximum-likelihood estimation (MLE). Our method integrates three distinct signal sources: (i) a small, high-quality human-labeled calibration set, (ii) a large corpus of LLM-judge annotations, and, most importantly, (iii) additional side information via domain-specific constraints derived from known bounds on judge performance statistics. We validate our approach through a comprehensive empirical study, benchmarking it against state-of-the-art baselines like Prediction-Powered Inference (PPI). Across diverse experimental regimes – spanning varying judge accuracies, calibration set sizes, and LLM failure rates – our constrained MLE consistently delivers more accurate and lower-variance estimates than existing methods. By moving beyond the “black-box” use of automated judges to a flexible framework, we provide a principled, interpretable, and scalable pathway towards LLM failure-rate certification.
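To make the three-signal setup concrete, here is a toy constrained MLE that a reader could adapt. This grid search is our own illustrative stand-in for the paper's estimator: it jointly fits the failure rate p and the judge's sensitivity/specificity to (i) a small human-labeled calibration table and (ii) the judge-positive count on a large unlabeled corpus, restricting (sens, spec) to assumed known bounds:

```python
import numpy as np

def constrained_mle(judge_pos, n_judge, cal, bounds, grid=60):
    """Toy joint MLE over (p, sens, spec) by grid search.
    cal = [[TP, FN], [FP, TN]] of judge flags vs. human labels."""
    (tp, fn), (fp, tn) = cal
    ps = np.linspace(1e-3, 1 - 1e-3, grid)
    best, best_ll = None, -np.inf
    for se in np.linspace(*bounds["sens"], grid):
        for sp in np.linspace(*bounds["spec"], grid):
            # (i) calibration-set likelihood of the judge's confusion counts
            ll_cal = (tp * np.log(se) + fn * np.log(1 - se)
                      + tn * np.log(sp) + fp * np.log(1 - sp))
            # (ii) corpus likelihood: judge flags with prob q = p*se + (1-p)*(1-sp)
            q = ps * se + (1 - ps) * (1 - sp)
            ll = (ll_cal + judge_pos * np.log(q)
                  + (n_judge - judge_pos) * np.log(1 - q))
            i = int(np.argmax(ll))
            if ll[i] > best_ll:
                best_ll, best = ll[i], (float(ps[i]), float(se), float(sp))
    return best  # (p_hat, sens_hat, spec_hat)

p_hat, se_hat, sp_hat = constrained_mle(
    judge_pos=300, n_judge=2000,
    cal=[[45, 5], [4, 46]],                        # 100 human-labeled examples
    bounds={"sens": (0.80, 0.99), "spec": (0.80, 0.99)})
print(round(p_hat, 2), round(se_hat, 2), round(sp_hat, 2))
```

The bounds play the role of the paper's domain constraints: without them, a judge-positive rate of 15% could be explained by many (p, sens, spec) combinations; with them, the failure rate is pinned down near (q - (1 - spec)) / (sens - (1 - spec)).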
[3] SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression
Xinhao Huang, You-Liang Huang, Zeyi Wen
Main category: cs.CL
TL;DR: SoLA is a training-free compression method for LLMs that combines soft activation sparsity and low-rank decomposition to reduce model size while maintaining performance.
Details
Motivation: Large language models have billion-scale parameters that pose deployment challenges. Existing compression methods require special hardware or expensive post-training, so the authors aim to develop an efficient, affordable training-free compression approach.
Method: SoLA analyzes activation patterns in feed-forward networks to identify components significantly contributing to inference. It retains these important components while compressing others through low-rank decomposition, using an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices.
Result: Extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models show SoLA improves both language modeling and downstream task accuracy without post-training. At 30% compression on LLaMA-2-70B, it reduces perplexity from 6.95 to 4.44 and enhances downstream task accuracy by 10% compared to state-of-the-art methods.
Conclusion: SoLA provides an effective training-free compression method for LLMs that maintains model quality while reducing deployment costs, offering a practical solution for efficient model deployment.
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but the billion-scale parameters pose deployment challenges. Although existing methods attempt to reduce the scale of LLMs, they require either special hardware support or expensive post-training to maintain model quality. To facilitate efficient and affordable model slimming, we propose a novel training-free compression method for LLMs, named “SoLA”, which leverages Soft activation sparsity and Low-rAnk decomposition. SoLA can identify and retain a minority of components significantly contributing to inference, while compressing the majority through low-rank decomposition, based on our analysis of the activation pattern in the feed-forward network (FFN) of modern LLMs. To alleviate the decomposition loss, SoLA is equipped with an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices. We conduct extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models across a variety of benchmarks. SoLA exhibits remarkable improvement in both language modeling and downstream task accuracy without post-training. For example, with a 30% compression rate on the LLaMA-2-70B model, SoLA surpasses the state-of-the-art method by reducing perplexity from 6.95 to 4.44 and enhancing downstream task accuracy by 10%.
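The keep-the-hot-components, decompose-the-rest idea can be sketched on a single weight matrix. This is our own simplified illustration, with an assumed activation-importance score per column and a fixed truncation rank rather than the paper's adaptive component-wise allocation:

```python
import numpy as np

def sola_compress(W, act_importance, keep_frac=0.1, rank=8):
    """Keep the highest-importance columns dense; replace the rest
    with a rank-`rank` SVD approximation (illustrative sketch)."""
    n_keep = int(keep_frac * W.shape[1])
    order = np.argsort(-act_importance)
    hot, cold = order[:n_keep], order[n_keep:]
    # SVD-truncate only the "cold" (low-importance) columns
    U, S, Vt = np.linalg.svd(W[:, cold], full_matrices=False)
    W_hat = W.copy()
    W_hat[:, cold] = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))        # toy FFN weight
imp = rng.random(256)                 # assumed per-column activation importance
W_hat = sola_compress(W, imp)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(W_hat.shape, 0.0 < err < 1.0)   # shape preserved; bounded relative error
```

Storing the cold block as its two low-rank factors (rather than reconstructing it, as done here for clarity) is what yields the actual memory savings.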
[4] LightThinker++: From Reasoning Compression to Memory Management
Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang
Main category: cs.CL
TL;DR: LightThinker++ enables LLMs to dynamically compress intermediate thoughts into compact semantic representations with explicit memory management, reducing token usage by ~70% while improving reasoning accuracy.
Details
Motivation: Current LLMs face efficiency limitations due to cognitive overhead from long thought traces during complex reasoning, requiring methods to reduce token usage while maintaining or improving reasoning quality.
Method: LightThinker compresses intermediate thoughts dynamically, while LightThinker++ adds Explicit Adaptive Memory Management with memory primitives and a trajectory synthesis pipeline for training memory scheduling.
Result: 70% reduction in peak token usage, 26% faster inference with minimal accuracy loss; LightThinker++ achieves 69.9% token reduction with +2.42% accuracy gain; maintains stable footprint in long-horizon tasks with 14.8% average performance gain.
Conclusion: The framework provides scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead through dynamic thought compression and explicit memory management.
Abstract: Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework’s versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.
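A minimal way to picture explicit memory management over a reasoning trace is a budgeted slot store that folds old thoughts into a compact summary. The class below is entirely our own illustration (string truncation standing in for learned semantic compression, a fixed keep-2 policy standing in for trained memory scheduling):

```python
class ThoughtMemory:
    """Toy budgeted memory for reasoning steps: when the slot budget is
    exceeded, older thoughts are folded into one compact <mem> summary
    slot while the newest thoughts stay verbatim (illustrative only)."""
    def __init__(self, budget=4,
                 compress=lambda ts: " | ".join(t[:20] for t in ts)):
        self.budget, self.compress, self.slots = budget, compress, []

    def write(self, thought):
        self.slots.append(thought)
        if len(self.slots) > self.budget:
            head = self.slots[:-2]              # everything but the 2 newest
            self.slots = [f"<mem>{self.compress(head)}</mem>"] + self.slots[-2:]

mem = ThoughtMemory(budget=3)
for i in range(6):
    mem.write(f"step {i}: partial derivation ...")
print(len(mem.slots), mem.slots[0][:5])  # 3 <mem>
```

The stable-footprint behaviour reported for long-horizon tasks corresponds to this invariant: the slot count is bounded regardless of how many reasoning rounds are written.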
[5] Why Attend to Everything? Focus is the Key
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Jin Li, Shuai Shao, Yasin Abbasi Yadkori, Guan Wang, Mingli Yuan, William Chen, Sen Song
Main category: cs.CL
TL;DR: Focus is an efficient attention method that learns token grouping via learnable centroids, restricting distant attention to same-group pairs while maintaining full local attention, achieving better perplexity than full attention with minimal parameter overhead.
Details
Motivation: Standard attention mechanisms compute all token pairs, which is computationally expensive. Existing efficient attention methods often degrade performance or require extensive retraining. The authors aim to develop an efficient attention method that improves domain perplexity without degrading downstream performance, works across model scales, and preserves model alignment.
Method: Focus uses learnable centroids to assign tokens to groups. Distant attention is restricted to same-group token pairs while local attention operates at full resolution. All model weights remain frozen, with only centroid parameters trained (as few as 148K parameters). Sinkhorn normalization enforces balanced groups. At inference, tokens can be restricted to top-k highest-scoring groups for computational efficiency.
Result: Focus improves domain perplexity with zero degradation on downstream benchmarks across models from 124M to 70B parameters. At 124M scale, it surpasses full attention (30.3 vs 31.4 PPL). At 7B scale trained from scratch, it beats full attention (13.82 vs 13.89 PPL). Inference optimizations yield 2-8.6x speedups while maintaining or improving performance. Unlike LoRA, it preserves alignment and TruthfulQA scores.
Conclusion: Focus provides an efficient attention mechanism that improves performance while reducing computation, works across model scales, preserves alignment, and discovers interpretable linguistic groupings without supervision, making it a practical retrofit solution for existing models.
Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks–from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
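The resulting sparsity pattern is easy to visualise as a boolean mask. The sketch below is a simplified rendering of the idea (hard nearest-centroid assignment, no Sinkhorn balancing, no causal masking, an arbitrary window size), not the trained routing:

```python
import numpy as np

def focus_mask(tokens, centroids, window=4):
    """Attention mask sketch: a dense local band plus distant positions
    allowed only between tokens assigned to the same centroid group."""
    groups = np.argmax(tokens @ centroids.T, axis=-1)  # hard group assignment
    n = len(tokens)
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    local = np.abs(i - j) < window          # full-resolution local attention
    same_group = groups[i] == groups[j]     # distant same-group pairs only
    return local | same_group

rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 16))
centroids = rng.normal(size=(4, 16))        # 4 "learnable" groups (random here)
mask = focus_mask(tokens, centroids)
print(mask.shape)  # (32, 32)
print(mask.mean() < 1.0)                    # sparser than full attention
```

Because only the centroids would be trained in the retrofit setting, the frozen model's weights never move; the mask is the only thing that changes between full and Focus attention.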
[6] VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
Bo Kang, Sander Noels, Tijl De Bie
Main category: cs.CL
TL;DR: VIGIL is a browser extension that detects and mitigates cognitive bias triggers in online information using real-time detection, LLM-powered reformulation, and privacy-tiered inference.
Details
Motivation: Address the subtle but harmful threat to civic discourse posed by AI-generated persuasion/manipulation exploiting human cognitive biases, for which no existing tools directly detect or mitigate such bias triggers in online information.
Method: Developed VIGIL browser extension with real-time cognitive bias trigger detection, scroll-synced detection, LLM-powered reformulation with full reversibility, and privacy-tiered inference (offline to cloud). Built to be extensible with third-party plugins.
Result: Created the first browser extension for cognitive bias trigger detection and mitigation, open-sourced with several rigorously validated plugins against NLP benchmarks.
Conclusion: VIGIL addresses a gap in tools for detecting and mitigating cognitive bias triggers in online information, offering real-time protection against AI-generated manipulation while preserving user privacy and control.
Abstract: The rise of generative AI is posing increasing risks to online information integrity and civic discourse. Most concretely, such risks can materialise in the form of mis- and disinformation. As a mitigation, media-literacy and transparency tools have been developed to address factuality of information and the reliability and ideological leaning of information sources. However, a subtler but possibly no less harmful threat to civic discourse is the use of persuasion or manipulation by exploiting human cognitive biases and related cognitive limitations. To the best of our knowledge, no tools exist to directly detect and mitigate the presence of triggers of such cognitive biases in online information. We present VIGIL (VIrtual GuardIan angeL), the first browser extension for real-time cognitive bias trigger detection and mitigation, providing in-situ scroll-synced detection, LLM-powered reformulation with full reversibility, and privacy-tiered inference from fully offline to cloud. VIGIL is built to be extensible with third-party plugins, and several plugins rigorously validated against NLP benchmarks are already included. It is open-sourced at https://github.com/aida-ugent/vigil.
[7] LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
Keqin Xie
Main category: cs.CL
TL;DR: LPC-SM is a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, using Orthogonal Novelty Transport for slow-memory writes, showing long-context modeling can be organized beyond attention alone.
Details
Motivation: Current long-context language models rely heavily on attention for both local interaction and long-range state, leaving little room to explore alternative decompositions of sequence modeling. The authors aim to test whether long-context autoregressive modeling can be organized around a broader division of labor than attention alone.
Method: Proposes LPC-SM, a hybrid autoregressive architecture that separates four components within the same block: local attention, persistent memory, predictive correction, and run-time control. Uses Orthogonal Novelty Transport (ONT) to govern slow-memory writes. Evaluates a 158M-parameter model in three stages: base language modeling, mathematical continuation, and 4096-token continuation.
Result: Removing mHC raises Stage-A final LM loss from 12.630 to 15.127. Adaptive sparse control improves Stage-B final LM loss from 12.137 to 10.787 relative to matched fixed-ratio continuation. Full route remains stable at sequence length 4096, with Stage C ending at final LM loss 11.582 and improving delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy.
Conclusion: Long-context autoregressive modeling can be organized around a broader division of labor than attention alone, as demonstrated by the LPC-SM architecture’s successful separation of local attention, persistent memory, predictive correction, and run-time control.
Abstract: Most current long-context language models still rely on attention to handle both local interaction and long-range state, which leaves relatively little room to test alternative decompositions of sequence modeling. We propose LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, and we use Orthogonal Novelty Transport (ONT) to govern slow-memory writes. We evaluate a 158M-parameter model in three stages spanning base language modeling, mathematical continuation, and 4096-token continuation. Removing mHC raises the Stage-A final LM loss from 12.630 to 15.127, while adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation. The full route remains stable at sequence length 4096, where Stage C ends with final LM loss 11.582 and improves the delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy. Taken together, these results show that long-context autoregressive modeling can be organized around a broader division of labor than attention alone.
[8] Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Andrey Pustovit
Main category: cs.CL
TL;DR: Knowledge Packs: pre-computed KV caches that deliver knowledge at zero token cost, enabling both knowledge injection and behavioral steering without training or weight modification.
Details
Motivation: RAG (Retrieval-Augmented Generation) wastes tokens by including retrieved documents in the context window. The authors aim to eliminate this token overhead while maintaining or enhancing knowledge delivery capabilities.
Method: Leverages the causal transformer property that KV cache from forward pass on text F is identical to joint pass on F+q. Uses pre-computed KV caches (Knowledge Packs) that can be injected at inference time. Also exploits RoPE properties: keys rotate but values remain untouched, allowing contrastive deltas on cached values to nudge model behavior while key arithmetic destroys coherence.
Result: Zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B with correct formatting, up to 95% token savings. Behavioral steering works via mid-layer values (33-66%), independent directions are nearly orthogonal (cos~0) and compose, and both knowledge and steering channels run simultaneously at alpha<=0.7 without interference.
Conclusion: Knowledge Packs provide efficient knowledge delivery without token cost and enable behavioral steering capabilities that RAG cannot achieve, all without training or weight modification.
Abstract: RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce - this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66%), independent directions are nearly orthogonal (cos~0) and compose, and both channels - knowledge and steering - run simultaneously at alpha<=0.7 without interference. No training, no weight modification.
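The prefix-KV equivalence the paper relies on follows from the causal mask and can be demonstrated in a few lines with a toy causal attention layer (single head, no RoPE, residual connection; dimensions are arbitrary). Hidden states for the prefix F, and hence any KV derived from them, are unaffected by tokens appended afterwards:

```python
import numpy as np

def causal_attn(x, Wq, Wk, Wv):
    # single-head causal self-attention with a residual connection
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    n, d = x.shape
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu_indices(n, 1)] = -np.inf   # causal mask: no looking ahead
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return x + w @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W2k = rng.normal(size=(d, d))                 # next layer's key projection
F = rng.normal(size=(6, d))                   # "knowledge pack" text
q = rng.normal(size=(2, d))                   # user query appended later

# Next-layer keys for F computed on the prefix alone ...
K_pack = causal_attn(F, Wq, Wk, Wv) @ W2k
# ... equal the F slice of the joint pass on F+q, so the cache is injectable.
K_joint = causal_attn(np.vstack([F, q]), Wq, Wk, Wv) @ W2k
print(np.allclose(K_pack, K_joint[:6]))  # True
```

As the abstract stresses, this equivalence is exact only when the cached prefix is formatted precisely as it would appear in the live prompt (chat template included); the math tolerates no formatting drift.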
[9] CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
Mete Ismayilzada, Renqing Cuomao, Daniil Yurshevich, Anna Sotnikova, Lonneke van der Plas, Antoine Bosselut
Main category: cs.CL
TL;DR: CresOWLve benchmark evaluates creative problem-solving in LLMs using real-world puzzles requiring multiple cognitive abilities and domain knowledge integration.
Details
Motivation: Existing benchmarks evaluate only specific components of creative problem-solving and often use artificial brainteasers that don't reflect real-world creative thinking. There's a need for benchmarks that assess how LLMs combine logical reasoning, lateral thinking, analogy-making, and commonsense knowledge in realistic scenarios.
Method: Introduces CresOWLve benchmark with puzzles grounded in real-world knowledge requiring multiple creative thinking strategies. Problems demand retrieving facts from diverse domains and creatively combining them. Evaluates both non-thinking and thinking LLMs on factual vs. creative problem-solving performance.
Result: CresOWLve remains highly challenging for frontier LLMs. Models show a substantial performance gap: they perform much better on factual questions than on creative ones (up to a -17% drop). While models can retrieve relevant knowledge, they struggle to form the non-obvious creative connections needed to integrate information and arrive at correct answers.
Conclusion: Current LLMs have significant limitations in creative problem-solving despite strong factual knowledge retrieval. The benchmark reveals fundamental challenges in AI’s ability to make creative connections across domains, highlighting important gaps for future reasoning systems.
Abstract: Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.
[10] WhisperRT – Turning Whisper into a Causal Streaming Model
Tomer Krichli, Bhiksha Raj, Joseph Keshet
Main category: cs.CL
TL;DR: A method to convert transformer encoder-decoder ASR models into low-latency streaming models by making the encoder causal and synchronizing decoder token emissions with partial encoder states, enabling real-time transcription with minimal latency.
Details
Motivation: Current SOTA ASR models like Whisper and Canary are designed for offline transcription and lack streaming capabilities due to architectural limitations. There's a need for low-latency streaming ASR that can process audio incrementally while maintaining accuracy.
Method: Transform transformer encoder-decoder architecture into streaming model by: 1) Making encoder causal to process audio incrementally, 2) Conditioning decoder on partial encoder states, 3) Explicit synchronization between encoded frames and token emissions, 4) Fine-tuning encoder-decoder alignment mechanism, 5) Updated inference mechanism with greedy and beam-search decoding.
Result: The fine-tuned model outperforms existing non-fine-tuned streaming approaches on low-latency chunk sizes (<300 msec) in most cases while using lower complexity. The updated inference mechanism is shown to be locally optimal for streaming ASR.
Conclusion: Transformer encoder-decoder ASR models can be effectively converted to low-latency streaming models through architectural modifications and fine-tuning, enabling real-time transcription with competitive accuracy and reduced complexity.
Abstract: Automatic Speech Recognition (ASR) has seen remarkable progress, with models like OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming (online or real-time) transcription, due to limitations in their architecture and training methodology. We propose a method to turn the transformer encoder-decoder model into a low-latency streaming model. The encoder is made causal to process audio incrementally, while the decoder conditions on partial encoder states to generate tokens aligned with the available temporal context. This requires explicit synchronization between encoded input frames and token emissions. Since tokens are produced only after sufficient acoustic evidence is observed, an inherent latency arises, necessitating fine-tuning of the encoder-decoder alignment mechanism. We propose an updated inference mechanism that utilizes the fine-tuned causal encoder and decoder to yield greedy and beam-search decoding, and is shown to be locally optimal. Experiments on low-latency chunk sizes (less than 300 msec) show that our fine-tuned model outperforms existing non-fine-tuned streaming approaches in most cases, while using a lower complexity. We release our training and inference code, along with the fine-tuned models, to support further research and development in streaming ASR.
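The frame/token synchronization can be illustrated with a toy fixed-delay emission policy. This is purely illustrative: the paper fine-tunes the encoder-decoder alignment rather than fixing a delay, and the frame indices and tokens below are invented:

```python
def streaming_emit(frames, alignment, wait=3):
    """Toy fixed-delay synchronization: the token aligned to frame t
    is emitted only once the causal encoder has consumed `wait`
    additional frames of acoustic evidence."""
    emitted = []
    for t in frames:                     # frames arrive one at a time
        token = alignment.get(t - wait)  # token whose evidence is now sufficient
        if token is not None:
            emitted.append((t, token))   # (emission frame, token)
    return emitted

align = {0: "he", 2: "llo", 5: "world"}  # frame index -> token aligned there
out = streaming_emit(range(10), align, wait=3)
assert out == [(3, "he"), (5, "llo"), (8, "world")]
```

Each token is delayed by `wait` frames, which is exactly the inherent latency the abstract describes: a token is produced only after sufficient acoustic context has been observed.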
[11] Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
Haziq Mohammad Khalid, Salsabeel Shapsough, Imran Zualkernan
Main category: cs.CL
TL;DR: Noise steering in transformer models improves diversity for constrained Arabic educational story generation without compromising quality or reading level requirements.
Details
Motivation: Generating diverse, pedagogically valid Arabic stories for early-grade reading assessments requires balancing vocabulary, reading level, and narrative structure constraints while avoiding repetitive plots that undermine assessment validity.
Method: Investigates noise steering - injecting calibrated Gaussian perturbations into transformer internal representations at inference time - as training-free diversity method across five small Arabic-centric language models (7-9B parameters). Compares four injection strategies against high-temperature sampling baselines.
Result: Residual stream noise consistently improves narrative diversity with minimal quality or constraint cost while preserving early-grade reading level. Attention entropy noise injection stabilizes attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models.
Conclusion: Internal representation-level perturbation is more suitable than output-level stochasticity for constrained educational content generation, offering effective diversity without compromising pedagogical requirements.
Abstract: Generating diverse, pedagogically valid stories for Arabic early-grade reading assessments requires balancing tight constraints on vocabulary, reading level, and narrative structure against the need to avoid repetitive plots that undermine assessment validity. We investigate noise steering, injecting calibrated Gaussian perturbations into the internal representations of transformer models at inference time, as a training-free diversity method evaluated across five small Arabic-centric language models (7-9B parameters). We compare four injection strategies against high-temperature sampling baselines, measuring diversity, quality, constraint adherence, and reading grade level. Residual stream noise consistently improves narrative diversity with minimal quality or constraint cost and preserves early-grade reading level across all models. Attention entropy noise injection (AENI) stabilizes the otherwise unreliable attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models. We find internal representation-level perturbation to be a more suitable diversity strategy than output-level stochasticity for constrained educational content generation.
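A minimal sketch of residual-stream noise injection, assuming the perturbation is calibrated as a fraction of the activation's own L2 norm (an assumed calibration; the paper's exact scheme may differ):

```python
import math
import random

def add_residual_noise(hidden, sigma, rng):
    """Inject isotropic Gaussian noise into a residual-stream vector,
    scaled so the perturbation norm is a fraction `sigma` of the
    activation's own L2 norm (assumed calibration, not the paper's)."""
    norm = math.sqrt(sum(h * h for h in hidden))
    noise = [rng.gauss(0.0, 1.0) for _ in hidden]
    noise_norm = math.sqrt(sum(n * n for n in noise))
    scale = sigma * norm / noise_norm
    return [h + scale * n for h, n in zip(hidden, noise)]

h = [0.5, -1.0, 0.25, 2.0]                         # stand-in hidden state
h1 = add_residual_noise(h, 0.05, random.Random(1))
h2 = add_residual_noise(h, 0.05, random.Random(2))
assert h1 != h2                                    # different seeds -> diversity
delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, h1)))
norm = math.sqrt(sum(x * x for x in h))
assert abs(delta / norm - 0.05) < 1e-9             # perturbation stays calibrated
```

In a real model this would run inside a forward hook on each transformer block's residual stream; here a plain vector stands in for the hidden state.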
[12] Are Arabic Benchmarks Reliable? QIMMA’s Quality-First Approach to LLM Evaluation
Leen AlQadi, Ahmed Alzubaidi, Mohammed Alyafeai, Hamza Alobeidli, Maitha Alhammadi, Shaikha Alsuwaidi, Omar Alkaabi, Basma El Amel Boussaha, Hakim Hacid
Main category: cs.CL
TL;DR: QIMMA is a quality-assured Arabic LLM leaderboard that systematically validates benchmarks through multi-model assessment and human review before evaluation, creating a curated 52k-sample evaluation suite for Arabic NLP.
Details
Motivation: Existing Arabic benchmarks often have quality issues that can skew LLM evaluation results. There's a need for systematic validation of benchmarks before using them for evaluation to ensure fair and accurate assessment of Arabic language models.
Method: QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to identify and resolve systematic quality issues in established Arabic benchmarks. It uses LightEval and EvalPlus for transparent implementation and releases per-sample inference outputs.
Result: Created a curated, multi-domain, multi-task evaluation suite of over 52k samples, predominantly grounded in native Arabic content (except for language-agnostic code evaluation tasks). The suite provides a reproducible and community-extensible foundation for Arabic NLP evaluation.
Conclusion: QIMMA establishes a quality-assured framework for Arabic LLM evaluation that prioritizes benchmark validation, offering a transparent, reproducible foundation for the Arabic NLP community to build upon.
Abstract: We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. Transparent implementation via LightEval, EvalPlus and public release of per-sample inference outputs make QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.
[13] A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
Nicolas Calbucura, Jose Guillen, Valentin Barriere
Main category: cs.CL
TL;DR: Simple method to enhance text LLMs with speech information using lasso-based feature selection on audio tokens, improving multimodal classification tasks.
Details
Motivation: Address the challenge of integrating long audio sequences with text in multimodal LLMs, particularly for tasks where audio was previously considered counterproductive.
Method: Use speech tokenizer for ASR to get audio tokens, apply lasso-based feature selection on multimodal Bag-of-Words to retain important tokens, adapt LLM with self-supervised language modeling, then fine-tune on downstream tasks.
Result: The method improves performance over unimodal models, a larger SpeechLM, and learned audio representations, and is effective on argumentative fallacy detection and affective computing tasks.
Conclusion: Simple feature selection approach enables effective audio-text fusion in LLMs, even random audio token selection helps, making audio integration practical for multimodal tasks.
Abstract: This paper presents a simple method that makes it easy to enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Automatic Speech Recognition that outputs long sequences of tokens from a large vocabulary, making it difficult to integrate at low cost in a large language model. By applying a simple lasso-based feature selection on a multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve performance compared to an unimodal model, to a bigger SpeechLM, or to integrating audio via a learned representation. We demonstrate its effectiveness on Argumentative Fallacy Detection and Classification tasks where audio was previously believed counterproductive, and on affective computing tasks on a widely-used dataset. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available online.
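The selection step can be sketched with a tiny lasso solved by proximal gradient descent (ISTA). The design matrix, labels, and regularization strength below are toy values chosen so the solution is easy to verify; the paper applies lasso to real multimodal Bag-of-Words counts:

```python
def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def lasso_ista(X, y, lam, lr=1.0, steps=2000):
    """Minimize ||Xw - y||^2 / (2n) + lam * ||w||_1 with proximal
    gradient descent (ISTA) -- a minimal stand-in for the paper's
    lasso-based audio-token selector."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(d)]
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w

# toy orthogonal "Bag-of-Words": one document per audio-token feature,
# where only tokens 0 and 3 actually carry signal for the label
X = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
y = [2.0, 0.05, 0.0, -1.5, 0.02, 0.0]
w = lasso_ista(X, y, lam=0.05)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
assert selected == [0, 3]   # only the informative tokens are retained
```

The L1 penalty drives the weights of weakly informative tokens exactly to zero, so "feature selection" falls out of the optimizer: the retained vocabulary is simply the support of `w`.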
[14] Towards a theory of morphology-driven marking in the lexicon: The case of the state
Mohamed El Idrissi
Main category: cs.CL
TL;DR: The paper proposes a formal model called “morphology-driven marking” to explain cross-linguistic variations in noun realization, using Riffian as a starting point and extending to other languages.
Details
Motivation: To understand why noun categories vary considerably across languages in their semantic and morphosyntactic realizations, and to develop a formal model that explains these variations systematically.
Method: Proposes morphology-driven marking model where nouns are organized into modular cognitive sets, each with its own morphological template and unmarked form. Uses Riffian as reference point before extending to other languages, and situates patterns within syntactic functions.
Result: The model helps explain differences in marking among noun types within and across languages, and leads to reassessment of markedness and state concepts, proposing extension of state concept to all synthetic languages.
Conclusion: The concept of state should be extended to all synthetic languages and analyzed as a novel subcategory of syntax-based inflection similar to agreement and grammatical case.
Abstract: All languages have a noun category, but its realisation varies considerably. Depending on the language, semantic and/or morphosyntactic differences may be more or less pronounced. This paper explores these variations, using Riffian as a reference point before extending the analysis to other languages. We propose a formal model termed morphology-driven marking. Nouns are organised into modular cognitive sets, each with its own morphological template and unmarked form. This approach helps explain differences in marking among noun types within and across languages. By situating these patterns within syntactic functions, we also reassess the notions of markedness and state. It is proposed that the concept of state be extended to all synthetic languages and analysed as a novel subcategory of syntax-based inflection like agreement and grammatical case.
[15] The Tool Illusion: Rethinking Tool Use in Web Agents
Renze Lou, Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Suman Nath, Wenpeng Yin, Jianfeng Gao
Main category: cs.CL
TL;DR: Large-scale empirical study on tool use in web agents, examining whether tools consistently improve performance, what design principles work best, and what side effects they introduce across diverse experimental settings.
Details
Motivation: Prior work on tool use in web agents has been limited in scale and often non-comparable, leaving fundamental questions unanswered about whether tools consistently help, what makes tools effective, and what side effects they introduce.
Method: Extensive controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks to establish empirical foundation for tool use in web agents.
Result: Findings revise some prior conclusions and complement others with broader evidence, providing more reliable empirical basis for understanding tool use in web agents.
Conclusion: This study establishes stronger empirical foundation for future research on tool-use web agents and aims to inspire more systematic investigation in this area.
Abstract: As web agents rapidly evolve, an increasing body of work has moved beyond conventional atomic browser interactions and explored tool use as a higher-level action paradigm. Although prior studies have shown the promise of tools, their conclusions are often drawn from limited experimental scales and sometimes non-comparable settings. As a result, several fundamental questions remain unclear: i) whether tools provide consistent gains for web agents, ii) what practical design principles characterize effective tools, and iii) what side effects tool use may introduce. To establish a stronger empirical foundation for future research, we revisit tool use in web agents through an extensive and carefully controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks. Our findings both revise some prior conclusions and complement others with broader evidence. We hope this study provides a more reliable empirical basis and inspires future research on tool-use web agents.
[16] Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou
Main category: cs.CL
TL;DR: Vocabulary dropout prevents diversity collapse in co-evolutionary self-play for language models by applying random masks to proposer’s output logits, sustaining problem diversity and improving solver performance on mathematical reasoning tasks.
Details
Motivation: In co-evolutionary self-play for language models, proposers quickly converge to narrow problem distributions that satisfy reward functions, causing diversity collapse and stalling the co-evolutionary loop. This makes the curriculum uninformative for solvers.
Method: Introduces vocabulary dropout - a random mask applied to the proposer’s output logits during both policy training and curriculum generation. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Tested on Qwen3-4B and Qwen3-8B models for mathematical reasoning via R-Zero.
Result: Vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training. Yields solver improvements averaging +4.4 points at 8B scale, with largest gains on competition-level benchmarks.
Conclusion: Explicit action-space constraints, analogous to game rules in classical self-play, can sustain productive co-evolution in language. Vocabulary dropout is a simple instantiation of this principle that prevents diversity collapse in co-evolutionary self-play.
Abstract: Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer’s output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
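A minimal sketch of the logit-masking idea, assuming a fresh uniform-random hard mask per sampling step (the mask rate and logits are illustrative, not the paper's values):

```python
import math
import random

def sample_with_vocab_dropout(logits, drop_p, rng):
    """Sample one token after applying a hard random mask to the
    logits. The mask is redrawn on every call, so it is non-stationary
    and the proposer cannot lock into a fixed token sequence."""
    masked = [l if rng.random() >= drop_p else float("-inf") for l in logits]
    if all(m == float("-inf") for m in masked):
        masked = logits[:]                 # keep at least one option
    top = max(masked)
    probs = [math.exp(m - top) for m in masked]
    r, acc = rng.random() * sum(probs), 0.0
    for i, p in enumerate(probs):
        acc += p
        if acc >= r:
            return i
    return len(probs) - 1

# a proposer collapsed onto token 0
logits = [50.0, 0.0, 0.0, 0.0, 0.0]
plain = {sample_with_vocab_dropout(logits, 0.0, random.Random(s)) for s in range(50)}
dropped = {sample_with_vocab_dropout(logits, 0.5, random.Random(s)) for s in range(50)}
assert plain == {0}        # no mask: the collapsed mode always wins
assert len(dropped) > 1    # dropout forces the proposer to explore
```

Unlike temperature scaling, which reweights the whole distribution, the hard mask simply removes the collapsed mode some fraction of the time, so probability mass must go to other tokens.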
[17] Evolutionary Search for Automated Design of Uncertainty Quantification Methods
Mikhail Seleznyov, Daniil Korbut, Viktor Moskvoretskii, Oleg Somov, Alexander Panchenko, Elena Tutubalina
Main category: cs.CL
TL;DR: LLM-powered evolutionary search automatically discovers unsupervised uncertainty quantification methods for large language models, outperforming manually-designed baselines on atomic claim verification tasks.
Details
Motivation: Current uncertainty quantification methods for LLMs are manually designed based on domain knowledge and heuristics, which limits their scalability and generality. There's a need for automated approaches to discover better UQ methods.
Method: The paper applies LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. The evolutionary search uses different LLMs (Claude models, GPT-oss-120B, Sonnet 4.5, Opus 4.5/4.6) to generate and evolve UQ methods through iterative improvement.
Result: Evolved methods outperform strong manually-designed baselines by up to 6.7% relative ROC-AUC improvement across 9 datasets, with robust out-of-distribution generalization. Different LLMs employ distinct evolutionary strategies: Claude models design high-feature-count linear estimators, GPT-oss-120B prefers simpler positional weighting schemes, and only Sonnet 4.5 and Opus 4.5 reliably leverage increased complexity for performance gains.
Conclusion: LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design, though different LLMs show varying capabilities in method discovery and complexity utilization.
Abstract: Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually-designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance – Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
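The outer loop can be sketched generically. In the paper, candidates are Python UQ programs scored by ROC-AUC and mutation is performed by an LLM; the toy below stubs both with a one-dimensional numeric search to show only the loop structure:

```python
import random

def evolve(init, fitness, mutate, pop_size=20, generations=30, rng=None):
    """Elitist evolutionary loop: keep the best candidate, refill the
    population with mutations of it, repeat. Candidates, fitness, and
    mutation are all pluggable."""
    rng = rng or random.Random(0)
    pop = [init]
    for _ in range(generations):
        best = max(pop, key=fitness)
        pop = [best] + [mutate(best, rng) for _ in range(pop_size - 1)]
    return max(pop, key=fitness)

# toy stand-in: evolve a single threshold of a fake uncertainty scorer
target = 0.73                                  # unknown optimum
fitness = lambda t: -abs(t - target)           # higher is better
mutate = lambda t, rng: t + rng.gauss(0.0, 0.05)
best = evolve(0.0, fitness, mutate)
assert abs(best - target) < 0.1                # the loop homes in on the optimum
```

Swapping the float candidate for a program string and the Gaussian mutation for an LLM rewrite recovers the shape of the paper's search, though the selection and prompting details there are more involved.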
[18] Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations
Erin MacMurray van Liemt, Aida Davani, Sinchana Kumbale, Neha Dixit, Sunipa Dev
Main category: cs.CL
TL;DR: A framework to evaluate cultural alignment of LLM outputs by comparing human-derived cultural importance vectors with model-generated cultural representation vectors across nine countries, revealing Western-centric biases and systematic error patterns.
Details
Motivation: Current LLM evaluation focuses on cultural diversity and factual accuracy but lacks assessment of cultural alignment - how well generated content matches native populations' perceptions and priorities of their own cultural facets.
Method: 1) Create human-derived Cultural Importance Vectors from open-ended survey responses across nine countries; 2) Generate model-derived Cultural Representation Vectors using syntactically diversified prompts on three frontier LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Haiku); 3) Analyze alignment between human and model vectors.
Result: Reveals Western-centric calibration where alignment decreases as cultural distance from US increases; identifies highly correlated systemic error signatures (ρ>0.97) across all models that over-index on superficial cultural markers while neglecting deep-seated social and value-based priorities.
Conclusion: Proposes a human-centered framework moving beyond simple diversity metrics to evaluate authenticity of AI-generated content in capturing nuanced cultural hierarchies, highlighting systematic biases in current LLMs.
Abstract: Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors based on an induced set of culturally significant facets from open-ended survey responses collected across nine countries. Next, we introduce a method to compute model-derived Cultural Representation Vectors of an LLM based on a syntactically diversified prompt-set and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance and model-derived Cultural Representations reveals a Western-centric calibration for some of the models where alignment decreases as a country’s cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures (ρ > 0.97) across all models, which over-index on some cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity of AI-generated content in authentically capturing the nuanced hierarchies of global cultures.
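The reported error-signature correlation is a Spearman rank correlation. A minimal implementation (no tie handling) on invented per-facet error vectors:

```python
def spearman_rho(x, y):
    """Spearman rank correlation, rho = 1 - 6*sum(d^2)/(n(n^2-1)),
    assuming no ties. Used here to compare two models' per-facet
    error signatures (the vectors below are made up)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

err_a = [0.30, 0.10, 0.25, 0.05, 0.40]  # hypothetical per-facet errors, model A
err_b = [0.28, 0.12, 0.22, 0.07, 0.38]  # model B: different values, same ordering
assert spearman_rho(err_a, err_b) == 1.0  # identical rank ordering
```

A ρ near 1 means the two models over- and under-weight the same cultural facets in the same order, even if their absolute error magnitudes differ, which is what "systemic error signatures" refers to.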
[19] LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering
Sing Hieng Wong, Hassan Sajjad, A. B. Siddique
Main category: cs.CL
TL;DR: LangFIR discovers language-specific features in multilingual LLMs using only monolingual data and random-token filtering, enabling precise language control without parallel data.
Details
Motivation: Multilingual LLMs struggle with reliable language control, and existing methods require expensive parallel data. There's a need for methods that can identify language-specific features using only monolingual data.
Method: Uses sparse autoencoders (SAEs) to decompose activations, then filters out language-agnostic features using random-token sequences. The remaining sparse set of language-specific features is used to construct steering vectors for language control.
Result: Achieves best average accuracy BLEU across three models (Gemma 3 1B, 4B, Llama 3.1 8B), three datasets, and twelve languages, outperforming monolingual baselines and methods requiring parallel data.
Conclusion: Language identity in multilingual LLMs is localized in sparse feature directions discoverable with monolingual data, enabling effective language control without parallel data.
Abstract: Large language models (LLMs) show strong multilingual capabilities, yet reliably controlling the language of their outputs remains difficult. Representation-level steering addresses this by adding language-specific vectors to model activations at inference time, but identifying language-specific directions in the residual stream often relies on multilingual or parallel data that can be expensive to obtain. Sparse autoencoders (SAEs) decompose residual activations into interpretable, sparse feature directions and offer a natural basis for this search, yet existing SAE-based approaches face the same data constraint. We introduce LangFIR (Language Feature Identification via Random-token Filtering), a method that discovers language-specific SAE features using only a small amount of monolingual data and random-token sequences. Many SAE features consistently activated by target-language inputs do not encode language identity. Random-token sequences surface these language-agnostic features, allowing LangFIR to filter them out and isolate a sparse set of language-specific features. We show that these features are extremely sparse, highly selective for their target language, and causally important: directional ablation increases cross-entropy loss only for the corresponding language. Using these features to construct steering vectors for multilingual generation control, LangFIR achieves the best average accuracy BLEU across three models (Gemma 3 1B, Gemma 3 4B, and Llama 3.1 8B), three datasets, and twelve target languages, outperforming the strongest monolingual baseline by up to and surpassing methods that rely on parallel data. Our results suggest that language identity in multilingual LLMs is localized in a sparse set of feature directions discoverable with monolingual data. Code is available at https://anonymous.4open.science/r/LangFIR-C0F5/.
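The filtering step reduces to comparing feature activation rates on target-language text versus random-token sequences. The sketch below uses invented feature names and an illustrative activation-rate threshold; the paper's actual selection criterion may differ:

```python
def language_specific_features(lang_rates, random_rates, thresh=0.5):
    """Keep SAE features that fire often on target-language text but
    rarely on random-token sequences (toy filter; the threshold and
    activation rates are illustrative, not the paper's values)."""
    return sorted(f for f, rate in lang_rates.items()
                  if rate >= thresh and random_rates.get(f, 0.0) < thresh)

# invented activation rates per feature
lang_rates = {"f_03": 0.95, "f_12": 0.90, "f_47": 0.80}  # on target-language text
random_rates = {"f_03": 0.10, "f_12": 0.85}              # f_12 fires on noise too
assert language_specific_features(lang_rates, random_rates) == ["f_03", "f_47"]
```

Features that fire on random-token noise (like `f_12` here) encode something other than language identity, so subtracting them leaves the sparse language-specific set used to build steering vectors.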
[20] Rethinking Token Prediction: Tree-Structured Diffusion Language Model
Zihao Wu, Haoming Yang, Juncheng Dong, Vahid Tarokh
Main category: cs.CL
TL;DR: Tree-structured diffusion language model that reduces parameter and memory usage by modeling token diffusion through vocabulary tree ancestors instead of full-vocabulary prediction.
Details
Motivation: Discrete diffusion language models face efficiency challenges due to large full-vocabulary prediction layers that consume significant parameters and GPU memory, limiting training under constrained resources.
Method: Proposes tree-structured diffusion modeling where intermediate latent states correspond to token ancestor nodes in a pre-constructed vocabulary tree, exponentially reducing classification dimensionality and making prediction head negligible.
Result: Under same parameter budget, reduces peak GPU memory usage by half while matching perplexity performance of state-of-the-art discrete diffusion language models
Conclusion: Tree-structured factorization enables efficient discrete diffusion language models by eliminating explicit full-vocabulary prediction, allowing parameter reallocation to deepen attention blocks
Abstract: Discrete diffusion language models have emerged as a competitive alternative to auto-regressive language models, but training them efficiently under limited parameter and memory budgets remains challenging. Modern architectures are predominantly based on a full-vocabulary token prediction layer, which accounts for a substantial fraction of model parameters (e.g., more than 20% in small scale DiT-style designs) and often dominates peak GPU memory usage. This leads to inefficient use of both parameters and memory under constrained training resources. To address this issue, we revisit the necessity of explicit full-vocabulary prediction, and instead exploit the inherent structure among tokens to build a tree-structured diffusion language model. Specifically, we model the diffusion process with intermediate latent states corresponding to a token’s ancestor nodes in a pre-constructed vocabulary tree. This tree-structured factorization exponentially reduces the classification dimensionality, makes the prediction head negligible in size, and enables reallocation of parameters to deepen the attention blocks. Empirically, under the same parameter budget, our method reduces peak GPU memory usage by half while matching the perplexity performance of state-of-the-art discrete diffusion language models.
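The dimensionality reduction can be seen with a toy complete binary vocabulary tree: predicting a token becomes predicting its ancestor path, one small decision per level (the paper's tree construction may differ from this id-based one):

```python
import math

def token_to_path(token_id, vocab_size):
    """Ancestor path of a token in a complete binary vocabulary tree:
    one left/right decision per level (toy tree built from raw ids)."""
    depth = math.ceil(math.log2(vocab_size))
    return [(token_id >> (depth - 1 - i)) & 1 for i in range(depth)]

def path_to_token(path):
    token = 0
    for bit in path:
        token = (token << 1) | bit
    return token

V = 50_000
path = token_to_path(31_415, V)
assert len(path) == 16                 # 16 binary decisions replace a 50k-way softmax
assert path_to_token(path) == 31_415   # the path uniquely identifies the token
```

Head parameters then scale with log V rather than V, which is the factorization that makes the prediction head negligible and frees the budget for deeper attention blocks.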
[21] Text Summarization With Graph Attention Networks
Mohammadreza Ardestani, Yllias Chali
Main category: cs.CL
TL;DR: Incorporating graph information (RST and Coref graphs) into text summarization yields mixed results: a simple MLP outperformed Graph Attention Networks on the CNN/DM dataset, and the authors release an RST-annotated XSum benchmark.
Details
Motivation: To enhance text summarization performance by incorporating structural graph information from Rhetorical Structure Theory and Co-reference graphs, going beyond traditional sequence-based approaches.
Method: Experimented with a Graph Attention Network architecture to incorporate graph information, then switched to a simple Multi-layer Perceptron architecture. Annotated the XSum dataset with RST graph information to create a benchmark.
Result: Graph Attention Network didn’t improve performance, but simple MLP architecture improved results on CNN/DM dataset. Created annotated XSum benchmark revealing both merits and limitations of graph-based approaches.
Conclusion: Graph information can benefit summarization but requires careful architectural choices; created valuable benchmark dataset for future graph-based summarization research.
Abstract: This study aimed to leverage graph information, particularly Rhetorical Structure Theory (RST) and Co-reference (Coref) graphs, to enhance the performance of our baseline summarization models. Specifically, we experimented with a Graph Attention Network architecture to incorporate graph information. However, this architecture did not enhance the performance. Subsequently, we used a simple Multi-layer Perceptron architecture, which improved the results in our proposed model on our primary dataset, CNN/DM. Additionally, we annotated XSum dataset with RST graph information, establishing a benchmark for future graph-based summarization models. This secondary dataset posed multiple challenges, revealing both the merits and limitations of our models.
[22] MultiPress: A Multi-Agent Framework for Interpretable Multimodal News Classification
Tailong Luo, Hao Li, Rong Fu, Xinyue Jiang, Huaxuan Ding, Yiduo Zhang, Zilin Zhao, Simon Fong, Guangyin Jin, Jianyuan Ni
Main category: cs.CL
TL;DR: MultiPress: A three-stage multi-agent framework for multimodal news classification using specialized agents for perception, retrieval-augmented reasoning, and gated fusion scoring with iterative optimization.
Details
Motivation: Existing multimodal news classification methods process modalities independently or use simplistic fusion strategies, limiting their ability to capture complex cross-modal interactions and leverage external knowledge.
Method: Proposes MultiPress, a three-stage multi-agent framework with: 1) multimodal perception agents, 2) retrieval-augmented reasoning agents, and 3) gated fusion scoring agents, followed by reward-driven iterative optimization.
Result: Validated on a newly constructed large-scale multimodal news dataset, showing significant improvements over strong baselines in classification accuracy and interpretability.
Conclusion: MultiPress demonstrates the effectiveness of modular multi-agent collaboration and retrieval-augmented reasoning for enhancing multimodal news classification.
Abstract: With the growing prevalence of multimodal news content, effective news topic classification demands models capable of jointly understanding and reasoning over heterogeneous data such as text and images. Existing methods often process modalities independently or employ simplistic fusion strategies, limiting their ability to capture complex cross-modal interactions and leverage external knowledge. To overcome these limitations, we propose MultiPress, a novel three-stage multi-agent framework for multimodal news classification. MultiPress integrates specialized agents for multimodal perception, retrieval-augmented reasoning, and gated fusion scoring, followed by a reward-driven iterative optimization mechanism. We validate MultiPress on a newly constructed large-scale multimodal news dataset, demonstrating significant improvements over strong baselines and highlighting the effectiveness of modular multi-agent collaboration and retrieval-augmented reasoning in enhancing classification accuracy and interpretability.
[23] Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
Kening Zheng, Wei-Chieh Huang, Jiahao Huo, Zhonghao Li, Henry Peng Zou, Yibo Yan, Xin Zou, Jungang Li, Junzhuo Li, Hanrong Zhang, Xuming Hu, Philip S. Yu
Main category: cs.CL
TL;DR: Analysis of MoE models reveals language routing isolation where high- and low-resource languages activate different experts, leading to performance disparities. Proposed RISE framework identifies and adapts language-specific expert subnetworks to improve low-resource language performance.
Details
Motivation: MoE models show significant performance differences across languages, but the internal mechanisms causing these gaps are not well understood. The authors aim to systematically analyze expert routing patterns to understand and address language performance disparities in MoE models.
Method: Conducted systematic analysis of expert routing patterns in MoE models, revealing language routing isolation. Proposed RISE framework with tripartite selection strategy: uses specificity scores to identify language-specific experts in shallow/deep layers and overlap scores for universal experts in middle layers. Trains only the selected subnetwork while freezing other parameters.
Result: Experiments on 10 languages show RISE achieves target-language F1 gains up to 10.85% with minimal cross-lingual degradation. The method substantially improves low-resource language performance while preserving capabilities in other languages.
Conclusion: Language routing isolation is a key factor in MoE model performance disparities. RISE effectively exploits this phenomenon to enhance low-resource language performance through targeted subnetwork adaptation, offering a practical solution for multilingual MoE model improvement.
Abstract: Mixture-of-Experts (MoE) models exhibit striking performance disparities across languages, yet the internal mechanisms driving these gaps remain poorly understood. In this work, we conduct a systematic analysis of expert routing patterns in MoE models, revealing a phenomenon we term Language Routing Isolation, in which high- and low-resource languages tend to activate largely disjoint expert sets. Through layer-stratified analysis, we further show that routing patterns exhibit a layer-wise convergence-divergence pattern across model depth. Building on these findings, we propose RISE (Routing Isolation-guided Subnetwork Enhancement), a framework that exploits routing isolation to identify and adapt language-specific expert subnetworks. RISE applies a tripartite selection strategy, using specificity scores to identify language-specific experts in shallow and deep layers and overlap scores to select universal experts in middle layers. By training only the selected subnetwork while freezing all other parameters, RISE substantially improves low-resource language performance while preserving capabilities in other languages. Experiments on 10 languages demonstrate that RISE achieves target-language F1 gains of up to 10.85% with minimal cross-lingual degradation.
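The abstract names specificity scores but does not give their formula, so the following is only a plausible sketch of the idea (names and scoring rule are our own): score each expert by the share of its routed activations that come from the target language, then keep the top-scoring experts as the language-specific subnetwork to finetune.

```python
def specificity_scores(activations: dict, target: str) -> dict:
    """For each expert, the fraction of its total routed activations that
    come from the target language. A score near 1.0 marks a language-
    specific expert; near 1/num_languages marks a universal one."""
    return {
        expert: counts.get(target, 0) / max(sum(counts.values()), 1)
        for expert, counts in activations.items()
    }

def select_experts(activations: dict, target: str, k: int) -> list:
    """Pick the k experts most specific to the target language."""
    scores = specificity_scores(activations, target)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy routing statistics: activation counts per expert, per language.
acts = {
    "e0": {"sw": 90, "en": 10},   # strongly Swahili-specific
    "e1": {"sw": 50, "en": 50},   # universal
    "e2": {"sw": 5,  "en": 95},   # English-specific
}
assert select_experts(acts, "sw", k=1) == ["e0"]
```

In RISE this selection is layer-dependent (specificity in shallow/deep layers, overlap in middle layers), and only the chosen experts are trained while everything else stays frozen.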
[24] The Format Tax
Ivan Yee Lee, Loris D’Antoni, Taylor Berg-Kirkpatrick
Main category: cs.CL
TL;DR: Structured output requirements (JSON, XML, etc.) degrade reasoning performance in open-weight LLMs, but decoupling reasoning from formatting recovers most lost accuracy.
Details
Motivation: The paper investigates why asking LLMs to respond in structured formats like JSON, XML, LaTeX, or Markdown causes substantial degradation in reasoning and writing performance, particularly in open-weight models.
Method: The researchers diagnose that format-requesting instructions alone cause most accuracy loss before decoder constraints. They propose decoupling reasoning from formatting through two approaches: generating freeform first then reformatting, or enabling extended thinking within a single generation.
Result: Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling reasoning from formatting substantially recovers lost accuracy. Most recent closed-weight models show little to no format tax.
Conclusion: The format tax is not inherent to structured generation but represents a gap that current open-weight models have yet to close. Decoupling reasoning from formatting is an effective solution.
Abstract: Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements – JSON, XML, LaTeX, Markdown – substantially degrade reasoning and writing performance across open-weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed-weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open-weight models have yet to close. Code is available at https://github.com/ivnle/the-format-tax.
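The two-pass decoupling the paper describes (reason in freeform first, reformat second) can be made concrete with a minimal control-flow sketch. The `llm` callable here is a stand-in, not the paper's code; the prompt wording is illustrative.

```python
import json

def solve_then_format(question: str, llm) -> dict:
    """Two-pass decoupling: reason in freeform first, then reformat the
    finished answer into JSON in a second, reasoning-free pass."""
    freeform = llm(f"Answer step by step: {question}")
    wrapped = llm(
        "Copy the final answer from the text below into JSON of the form "
        f'{{"answer": ...}} without re-solving it:\n{freeform}'
    )
    return json.loads(wrapped)

# Stand-in for a real model call, just to make the control flow runnable.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Answer"):
        return "12 * 12 = 144, so the answer is 144."
    return '{"answer": 144}'

assert solve_then_format("What is 12 squared?", fake_llm) == {"answer": 144}
```

The point of the design is that the first call carries no format instruction at all, so the "format tax" the paper measures never touches the reasoning step; only the cheap second pass sees the JSON requirement.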
[25] CAGMamba: Context-Aware Gated Cross-Modal Mamba Network for Multimodal Sentiment Analysis
Minghai Jiao, Jing Xiao, Peng Xiao, Ende Zhang, Shuang Kan, Wenyan Jiang, Jinyao Li, Yixian Liu, Haidong Xin
Main category: cs.CL
TL;DR: CAGMamba is a context-aware gated cross-modal Mamba framework for dialogue-based multimodal sentiment analysis that uses Mamba’s linear complexity for efficient cross-modal fusion with explicit temporal modeling of sentiment evolution.
Details
Motivation: Existing multimodal sentiment analysis approaches use Transformer-based cross-modal attention with quadratic complexity, limiting scalability. They also lack explicit temporal modeling of sentiment evolution across dialogue turns, using simple concatenation or independent fusion instead.
Method: Organizes contextual and current-utterance features into temporally ordered binary sequences for Mamba to model sentiment evolution. Uses Gated Cross-Modal Mamba Network (GCMN) with learnable gating to balance cross-modal fusion and modality preservation, trained with a three-branch multi-task objective over text, audio, and fused predictions.
Result: Achieves state-of-the-art or competitive results on three benchmark datasets across multiple evaluation metrics.
Conclusion: CAGMamba provides an efficient and effective framework for dialogue-based multimodal sentiment analysis with linear complexity and explicit temporal modeling of sentiment evolution.
Abstract: Multimodal Sentiment Analysis (MSA) requires effective modeling of cross-modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer-based cross-modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context-aware gated cross-modal Mamba framework for dialogue-based sentiment analysis. Specifically, we organize the contextual and the current-utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross-modal integration, we propose a Gated Cross-Modal Mamba Network (GCMN) that integrates cross-modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three-branch multi-task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state-of-the-art or competitive results across multiple evaluation metrics. All codes are available at https://github.com/User2024-xj/CAGMamba.
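The learnable gate that balances fusion against modality preservation reduces, in its simplest form, to a sigmoid-weighted blend. GCMN's actual gate operates on Mamba outputs and is presumably vector-valued; this scalar sketch (our own simplification) only shows the balancing mechanism.

```python
import math

def gated_fusion(cross_modal: list, unimodal: list, gate_logit: float) -> list:
    """Learnable gate blending a cross-modal fused feature with the
    preserved unimodal feature: out = g * fused + (1 - g) * unimodal."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid of the learned logit
    return [g * c + (1.0 - g) * u for c, u in zip(cross_modal, unimodal)]

# gate_logit = 0 gives g = 0.5: an even mix of the two paths. Training
# pushes the logit up when fusion helps, down when it washes out a modality.
out = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_logit=0.0)
assert out == [0.5, 0.5]
```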
[26] Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports
Yi-Cheng Wang, Wei-An Wang, Chu-Song Chen
Main category: cs.CL
TL;DR: FinLongDocQA dataset for financial numerical reasoning in long documents, with FinLongDocAgent using multi-agent RAG for iterative retrieval and verification.
Details
Motivation: LLMs struggle with reliable QA over long structured documents, especially for numerical reasoning in financial reports where evidence is scattered across multiple tables and text. Existing benchmarks focus on single-table settings, leaving cross-table document-level numerical reasoning underexplored.
Method: Introduces FinLongDocQA dataset for single-table and cross-table financial numerical reasoning. Proposes FinLongDocAgent, a Multi-Agent Multi-Round RAG approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds.
Result: Evaluation reveals two bottlenecks: 1) annual reports often exceed 129k tokens causing context rot for locating relevant tables, and 2) LLMs remain prone to errors in multi-step numerical reasoning even with relevant evidence. Experiments show importance of iterative retrieval and verification.
Conclusion: The proposed FinLongDocAgent approach addresses challenges in long-document numerical reasoning through iterative retrieval and verification, demonstrating improved reliability for financial QA tasks.
Abstract: Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single-table settings, leaving cross-table document-level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single-table and cross-table financial numerical reasoning in long-context reports. Evaluating both closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi-step numerical reasoning. We propose FinLongDocAgent, a Multi-Agent Multi-Round Retrieval-Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.
[27] AI Appeals Processor: A Deep Learning Approach to Automated Classification of Citizen Appeals in Government Services
Vladimir Beskorovainyi
Main category: cs.CL
TL;DR: AI system for automated classification of citizen appeals using NLP and deep learning, with Word2Vec+LSTM achieving best balance of accuracy and efficiency.
Details
Motivation: Government agencies face growing volumes of citizen appeals with inefficient manual processing (20 minutes per appeal, 67% accuracy), creating bottlenecks in public service delivery.
Method: Microservice-based system integrating NLP and deep learning techniques; evaluated multiple approaches: Bag-of-Words with SVM, TF-IDF with SVM, fastText, Word2Vec with LSTM, and BERT on 10,000 real citizen appeals across 3 categories and 7 domains.
Result: Word2Vec+LSTM architecture achieved 78% classification accuracy (vs 67% manual) while reducing processing time by 54%, offering optimal balance between accuracy and computational efficiency compared to transformer-based models.
Conclusion: AI Appeals Processor demonstrates practical value for government agencies by automating appeal classification with improved accuracy and efficiency, with Word2Vec+LSTM providing the best trade-off for real-world deployment.
Abstract: Government agencies worldwide face growing volumes of citizen appeals, with electronic submissions increasing significantly over recent years. Traditional manual processing averages 20 minutes per appeal with only 67% classification accuracy, creating significant bottlenecks in public service delivery. This paper presents AI Appeals Processor, a microservice-based system that integrates natural language processing and deep learning techniques for automated classification and routing of citizen appeals. We evaluate multiple approaches – including Bag-of-Words with SVM, TF-IDF with SVM, fastText, Word2Vec with LSTM, and BERT – on a representative dataset of 10,000 real citizen appeals across three primary categories (complaints, applications, and proposals) and seven thematic domains. Our experiments demonstrate that a Word2Vec+LSTM architecture achieves 78% classification accuracy while reducing processing time by 54%, offering an optimal balance between accuracy and computational efficiency compared to transformer-based models.
[28] ‘Layer su Layer’: Identifying and Disambiguating the Italian NPN Construction in BERT’s family
Greta Gorzoni, Ludovica Pannitto, Francesca Masini
Main category: cs.CL
TL;DR: Probing study examines how Italian NPN constructions are encoded in BERT’s contextual embeddings, evaluating linguistic information across model layers
Details
Motivation: To evaluate pretrained language models against explicit linguistic theories, specifically testing what linguistic information about Italian NPN constructions is encoded in contextual embeddings, challenging previous methodological assumptions.
Method: Extracted contextual vector representations from BERT for Italian NPN constructions, used layer-wise probing classifiers to systematically evaluate information encoded across the model’s internal layers
Result: Results show the extent to which constructional form and meaning are reflected in contextual embeddings, providing empirical evidence about linguistic encoding
Conclusion: Contributes empirical evidence to dialogue between constructionist linguistic theory and neural language modeling, with implications for interpretability research
Abstract: Interpretability research has highlighted the importance of evaluating Pretrained Language Models (PLMs) and in particular contextual embeddings against explicit linguistic theories to determine what linguistic information they encode. This study focuses on the Italian NPN (noun-preposition-noun) constructional family, challenging some of the theoretical and methodological assumptions underlying previous experimental designs and extending this type of research to a lesser-investigated language. Contextual vector representations are extracted from BERT and used as input to layer-wise probing classifiers, systematically evaluating information encoded across the model’s internal layers. The results shed light on the extent to which constructional form and meaning are reflected in contextual embeddings, contributing empirical evidence to the dialogue between constructionist theory and neural language modelling
[29] Unlocking Prompt Infilling Capability for Diffusion Language Models
Yoshinari Fujinuma, Keisuke Sakaguchi
Main category: cs.CL
TL;DR: Masked diffusion language models can generate effective infilling prompts through full-sequence masking during supervised finetuning, overcoming limitations of current training practices
Details
Motivation: Current masked diffusion language models (dLMs) have bidirectional denoising capabilities but cannot effectively handle infilling prompts due to the conventional supervised finetuning practice of applying response-only masking, which artificially limits their potential.
Method: Extend full-sequence masking during supervised finetuning where both prompts and responses are masked jointly, enabling the model to infill masked portions of prompt templates conditioned on few-shot examples
Result: Model-infilled prompts match or surpass manually designed templates, transfer effectively across models, and complement existing prompt optimization methods
Conclusion: Training practices, not architectural limitations, are the primary bottleneck preventing masked diffusion language models from infilling effective prompts; full-sequence masking unlocks this capability
Abstract: Masked diffusion language models (dLMs) generate text through bidirectional denoising, yet this capability remains locked for infilling prompts. This limitation is an artifact of the current supervised finetuning (SFT) convention of applying response-only masking. To unlock this capability, we extend full-sequence masking during SFT, where both prompts and responses are masked jointly. Once unlocked, the model infills masked portions of a prompt template conditioned on few-shot examples. We show that such model-infilled prompts match or surpass manually designed templates, transfer effectively across models, and are complementary to existing prompt optimization methods. Our results suggest that training practices, not architectural limitations, are the primary bottleneck preventing masked diffusion language models from infilling effective prompts
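The one-line change the paper advocates — masking prompt tokens as well as response tokens during SFT — can be sketched directly. The real noise schedule and loss are more involved; this toy function (names are our own) only isolates the difference from response-only masking.

```python
import random

MASK = "[MASK]"

def full_sequence_mask(prompt_toks, response_toks, mask_prob, rng):
    """Full-sequence masking for diffusion SFT: unlike response-only
    masking, prompt tokens are eligible for masking too, so the model
    also learns to denoise (i.e., infill) the prompt itself."""
    seq = prompt_toks + response_toks
    return [MASK if rng.random() < mask_prob else tok for tok in seq]

rng = random.Random(0)
masked = full_sequence_mask(["Translate", ":"], ["Bonjour"], 1.0, rng)
assert masked == [MASK, MASK, MASK]  # at p=1.0 every token is masked
```

Under response-only masking the first two positions would never be masked, so at inference the model has never seen a hole in the prompt; training over the full sequence is what "unlocks" prompt infilling.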
[30] Researchers waste 80% of LLM annotation costs by classifying one text at a time
Christian Pipal, Eva-Maria Vogel, Morgan Wack, Frank Esser
Main category: cs.CL
TL;DR: Batching and stacking multiple text classification tasks in single LLM prompts reduces API calls by 80%+ with minimal accuracy loss for most models.
Details
Motivation: Researchers currently use LLMs for text classification by making separate API calls for each text and variable, which is inefficient and costly when processing large datasets.
Method: Tested 8 production LLMs from 4 providers on 3,962 expert-coded tweets across 4 tasks, varying batch sizes from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt.
Result: Six of eight models maintained accuracy within 2 percentage points of single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced comparable results to single-variable coding.
Conclusion: Batching and stacking can dramatically reduce computational costs with minimal accuracy degradation, making LLM-based text classification more efficient for large-scale social science research.
Abstract: Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
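The paper's headline arithmetic (400,000 calls down to 4,000) is easy to reproduce, and a batched-and-stacked prompt is simple to construct. The prompt wording below is illustrative, not taken from the paper.

```python
def build_prompt(texts: list, variables: list) -> str:
    """Stack all coding variables and a batch of texts into one prompt."""
    header = "Code each numbered text on: " + ", ".join(variables)
    body = "\n".join(f"[{i}] {t}" for i, t in enumerate(texts, 1))
    return f"{header}\n{body}"

def num_calls(n_texts: int, n_vars: int, batch_size: int, stack: bool) -> int:
    """API calls needed to code n_texts on n_vars coding variables."""
    batches = -(-n_texts // batch_size)  # ceiling division
    return batches if stack else batches * n_vars

# One text, one variable per call: the status quo the paper criticizes.
assert num_calls(100_000, 4, batch_size=1, stack=False) == 400_000
# Batch 25 texts and stack all 4 variables into each prompt.
assert num_calls(100_000, 4, batch_size=25, stack=True) == 4_000
```

The paper's caveat applies: this saving is only "free" inside the safe operating range it measures (batches up to ~100, up to ~10 stacked dimensions, model-dependent).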
[31] POEMetric: The Last Stanza of Humanity
Bingru Li, Han Wang, Hazel Wilkinson
Main category: cs.CL
TL;DR: POEMetric is a comprehensive framework for evaluating poetry generation by LLMs, assessing basic form/theme adherence, advanced creative abilities, and overall quality compared to human poets.
Details
Motivation: While LLMs can generate poetry, there's a need for systematic evaluation to understand how far they are from human poets in terms of both technical form adherence and creative/emotional aspects.
Method: Created POEMetric framework with three evaluation dimensions: 1) basic instruction-following (form/theme), 2) advanced abilities (creativity, emotional resonance, literary devices), 3) overall quality. Curated 203 human poems with annotations, generated 6,090 LLM poems from 30 models, used rule-based evaluation and LLM-as-a-judge validated by human experts.
Result: The top LLM achieved high form accuracy (4.26/5.00) and theme alignment (4.99), but all models fell short of human poets on advanced abilities, where humans scored 4.02 in creativity, 3.95 in idiosyncrasy, 4.06 in emotional resonance, 4.49 in imagery, and 4.67 in literary devices. Humans also outperformed the best LLM in overall quality (4.22 vs. 3.20).
Conclusion: Poetry generation remains a formidable challenge for LLMs; while they excel at technical form adherence, they significantly lag behind human poets in creative, emotional, and literary aspects that define poetic quality.
Abstract: Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at https://github.com/Bingru-Li/POEMetric.
[32] Testing the Limits of Truth Directions in LLMs
Angelos Poulis, Mark Crovella, Evimaria Terzi
Main category: cs.CL
TL;DR: Truth directions in LLMs are not universal; they vary significantly across model layers, task types (factual vs reasoning), complexity levels, and prompt instructions.
Details
Motivation: Previous research has debated whether truth directions in LLM activation spaces are universal or not, with conflicting findings about their generalization across different settings.
Method: Systematically probe truth directions across multiple model layers, analyze different task types (factual vs reasoning), examine varying task complexities, and test the impact of different prompt instructions on truth direction generalization.
Result: Truth directions are highly layer-dependent, emerge earlier for factual tasks and later for reasoning tasks, vary with task complexity, and are significantly affected by prompt instructions, showing limited universality.
Conclusion: Claims about universal truth directions in LLMs are more limited than previously thought, with significant variations across layers, task types, complexities, and prompts.
Abstract: Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
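A common way to extract a linear "truth direction" in this literature is a difference-of-means probe over hidden activations, repeated at every layer. The abstract does not specify which probe the authors use, so the following is a generic sketch of the technique, not their method.

```python
def mean_vec(rows: list) -> list:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_truth_direction(true_acts: list, false_acts: list) -> list:
    """Difference-of-means probe: the 'truth direction' is the vector
    from the mean false activation to the mean true activation."""
    mt, mf = mean_vec(true_acts), mean_vec(false_acts)
    return [a - b for a, b in zip(mt, mf)]

def predict_true(direction: list, midpoint: list, x: list) -> bool:
    # Classify by which side of the class midpoint the activation falls on.
    return sum(d * (xi - m) for d, m, xi in zip(direction, midpoint, x)) > 0

# Toy activations for true vs. false statements at one layer.
true_acts = [[2.0, 1.0], [2.2, 0.9]]
false_acts = [[-2.0, 1.1], [-2.1, 1.0]]
d = fit_truth_direction(true_acts, false_acts)
mid = [(a + b) / 2 for a, b in zip(mean_vec(true_acts), mean_vec(false_acts))]
assert predict_true(d, mid, [1.5, 1.0]) is True
assert predict_true(d, mid, [-1.5, 1.0]) is False
```

The paper's finding is that a direction fit this way at one layer, one task type, or one prompt template often fails to transfer to another — which is exactly why they argue probing must be repeated across all of these axes.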
[33] Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs
Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Rui Cai, Peijie Qiu, Zhipeng Wang, Oana Frunza, Shao Tang, Jindong Gu, Yalin Wang
Main category: cs.CL
TL;DR: Systematic evaluation of Indirect Prompt Injection (IPI) attacks on multi-agent LLM systems reveals severe vulnerabilities in current defenses, with RepE-based detection showing promise for intercepting unauthorized actions.
Details
Motivation: The rapid deployment of open-source multi-agent systems has expanded action spaces, creating security challenges from Indirect Prompt Injections (IPI) that hide malicious instructions in third-party content. Current security evaluations rely on isolated single-turn benchmarks, leaving systemic vulnerabilities in complex dynamic environments underexplored.
Method: Systematically evaluated six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones in dynamic multi-step tool-calling environments. Used multidimensional analysis beyond binary success rates, and investigated Representation Engineering (RepE) as a detection strategy by extracting hidden states at tool-input positions.
Result: Advanced injections successfully bypass nearly all baseline defenses, with some surface-level mitigations producing counterproductive side effects. Agents execute malicious instructions almost instantaneously but exhibit abnormally high decision entropy. RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before commitment, achieving high detection accuracy across diverse LLM backbones.
Conclusion: The study exposes limitations of current IPI defenses and provides a practical paradigm for building resilient multi-agent architectures through RepE-based detection that leverages internal state analysis to intercept unauthorized actions.
Abstract: The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we revealed that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.
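At its core, the RepE-based circuit breaker the paper describes projects the hidden state at the tool-input position onto a learned direction and blocks the pending tool call when the projection is large. How that direction is learned and how the threshold is calibrated are not given in the abstract, so everything below is an illustrative sketch with made-up names and toy vectors.

```python
def risk_score(hidden_state: list, risk_direction: list) -> float:
    """Project the hidden state captured at the tool-input position onto
    a learned 'injection' direction; a large projection flags the action."""
    return sum(h * r for h, r in zip(hidden_state, risk_direction))

def circuit_breaker(hidden_state: list, risk_direction: list,
                    threshold: float) -> str:
    """Intercept the tool call before the agent commits to it."""
    if risk_score(hidden_state, risk_direction) > threshold:
        return "BLOCK"
    return "ALLOW"

# Toy states: the 'injected' one points strongly along the risk direction.
risk_dir = [1.0, 0.0, 0.0]
assert circuit_breaker([3.0, 0.1, 0.2], risk_dir, threshold=1.0) == "BLOCK"
assert circuit_breaker([0.2, 0.5, 0.1], risk_dir, threshold=1.0) == "ALLOW"
```

The key design point from the paper survives even in this sketch: the check runs on internal state *before* the tool call executes, exploiting the abnormally high decision entropy the authors observe when an agent is about to follow an injected instruction.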
[34] When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
Hope McGovern, Caroline Craig, Thomas Lippincott, Hale Sirin
Main category: cs.CL
TL;DR: LLMs struggle with analogical reasoning requiring latent information, showing asymmetry between probed representations and prompted performance for different analogy types.
Details
Motivation: To understand LLMs' limitations in analogical reasoning, particularly when analogies require latent information rather than surface cues, and to investigate the relationship between internal representations and prompted behavior.
Method: Compare probed representations with prompted performance on detecting narrative analogies, analyzing asymmetry between rhetorical and narrative analogies in open-source models.
Result: For rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, both achieve similar low performance, showing task-dependent relationship between internal representations and prompted behavior.
Conclusion: LLMs have limitations in abstraction and generalization for analogical reasoning, with prompting not effectively accessing available information in models, suggesting need for better methods to leverage internal representations.
Abstract: Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model’s probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.
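The probing-versus-prompting comparison at the heart of this paper can be made concrete with a toy linear probe: a logistic-regression classifier trained on frozen feature vectors, which tests whether a property is linearly decodable even when prompting fails to elicit it. A minimal pure-Python sketch (the feature vectors and training loop are illustrative stand-ins for hidden states, not the paper's setup):

```python
import math

def train_linear_probe(xs, ys, lr=0.5, epochs=500):
    """Fit a logistic-regression probe on fixed feature vectors xs
    (stand-ins for frozen hidden states) with binary labels ys."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_predict(w, b, x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0
```

If such a probe recovers the label from internal representations while direct prompting scores near chance, the information is present but not surfaced, which is the asymmetry the paper reports for rhetorical analogies.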
[35] I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
Haotian Zong, Binze Li, Yufei Long, Sinyin Chang, Jialong Wu, Gillian K. Hadfield
Main category: cs.CL
TL;DR: A prompt-only framework called I-CALM reduces LLM hallucinations by eliciting verbal confidence, rewarding abstention, and adding normative principles, improving selective answering on factual questions without model retraining.
Details
Motivation: LLMs often produce confident but incorrect answers due to binary scoring conventions that reward answering over honest uncertainty expression. The paper aims to reduce hallucination risk through prompt-only interventions without modifying the model.
Method: I-CALM framework: (1) elicits verbal confidence from LLMs, (2) partially rewards abstention through explicit reward schemes, and (3) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Tested on GPT-5 mini with PopQA dataset.
Result: Confidence-eliciting, abstention-rewarding prompts with norms reduce false-answer rates by identifying error-prone cases and shifting them to abstention, trading coverage for reliability. Varying abstention rewards yields a clear abstention-hallucination frontier.
Conclusion: Prompt-only interventions can improve selective answering on factual questions without retraining, with effectiveness varying across models and datasets. The framework demonstrates practical hallucination control through confidence elicitation and normative guidance.
Abstract: Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions – explicitly announcing reward schemes for answer-versus-abstain decisions plus humility-oriented normative principles – can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self-reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token-probability baseline. We then study I-CALM, a prompt-based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT-5 mini on PopQA as the main setting, we find that confidence-eliciting, abstention-rewarding prompts, especially with norms, reduce the false-answer rate on answered cases mainly by identifying and shifting error-prone cases to abstention and re-calibrating their confidence. This trades coverage for reliability while leaving forced-answer performance largely unchanged. Varying the abstention reward yields a clear abstention-hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at https://github.com/binzeli/hallucinationControl.
[36] SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, Shumin Deng
Main category: cs.CL
TL;DR: SkillX is an automated framework for building reusable skill knowledge bases that improves LLM agent efficiency by distilling experiences into hierarchical skills and enabling cross-agent knowledge transfer.
Details
Motivation: Current self-evolving LLM agents learn inefficiently in isolation, repeatedly rediscovering similar behaviors from limited experience, leading to redundant exploration and poor generalization across different agents and environments.
Method: Three synergistic innovations: (1) Multi-Level Skills Design distills raw trajectories into three-tiered hierarchy (strategic plans, functional skills, atomic skills); (2) Iterative Skills Refinement automatically revises skills based on execution feedback; (3) Exploratory Skills Expansion proactively generates and validates novel skills beyond seed training data.
Result: SkillX consistently improves task success and execution efficiency when plugged into weaker base agents on challenging long-horizon, user-interactive benchmarks (AppWorld, BFCL-v3, τ²-Bench), demonstrating effective knowledge transfer.
Conclusion: Structured, hierarchical experience representations enable generalizable agent learning, and reusable skill libraries can significantly enhance LLM agent performance across different environments and base models.
Abstract: Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation and repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: (i) Multi-Level Skills Design, which distills raw trajectories into a three-tiered hierarchy of strategic plans, functional skills, and atomic skills; (ii) Iterative Skills Refinement, which automatically revises skills based on execution feedback to continuously improve library quality; and (iii) Exploratory Skills Expansion, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM-4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long-horizon, user-interactive benchmarks, including AppWorld, BFCL-v3, and τ²-Bench. Experiments show that SkillX consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at https://github.com/zjunlp/SkillX.
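The three-tiered skill hierarchy can be pictured as a nested data structure: strategic plans reference functional skills, which in turn sequence atomic tool calls. The class names, fields, and keyword-based retrieval below are illustrative, not SkillX's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AtomicSkill:
    """Lowest tier: a single reusable tool invocation."""
    name: str
    tool: str
    args_template: dict

@dataclass
class FunctionalSkill:
    """Middle tier: an ordered composition of atomic skills."""
    name: str
    atomic_steps: list  # names of AtomicSkill entries

@dataclass
class StrategicPlan:
    """Top tier: a goal-level recipe over functional skills."""
    goal: str
    functional_skills: list  # names of FunctionalSkill entries

@dataclass
class SkillLibrary:
    atomic: dict = field(default_factory=dict)
    functional: dict = field(default_factory=dict)
    plans: list = field(default_factory=list)

    def retrieve(self, goal_keywords):
        """Naive retrieval: plans whose goal mentions any keyword."""
        return [p for p in self.plans
                if any(k.lower() in p.goal.lower() for k in goal_keywords)]
```

A structure like this is what makes the library plug-and-play: a weaker base agent only needs the retrieval interface, not the trajectories the skills were distilled from.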
[37] From Plausible to Causal: Counterfactual Semantics for Policy Evaluation in Simulated Online Communities
Agam Goyal, Yian Wang, Eshwar Chandrasekharan, Hari Sundaram
Main category: cs.CL
TL;DR: The paper proposes a causal counterfactual framework for LLM-based social simulations, moving beyond believability to causal policy analysis.
Details
Motivation: Current LLM-based social simulations generate believable community interactions but lack causal semantics needed for policy interventions; need framework to distinguish necessary vs sufficient causation for different stakeholder needs.
Method: Adopts causal counterfactual framework distinguishing necessary causation (would outcome occur without intervention?) from sufficient causation (does intervention reliably produce outcome?), formalizes mapping to stakeholder needs, shows simulation design can support estimation under explicit assumptions.
Result: Proposes simulator-conditional causal estimates whose policy relevance depends on simulator fidelity; the framework helps define adequate fidelity and moves the field from realistic-looking simulations to policy-supportive ones.
Conclusion: Establishing a causal framework now is essential for defining adequate fidelity and enabling simulations that can support policy changes rather than merely look realistic.
Abstract: LLM-based social simulations can generate believable community interactions, enabling “policy wind tunnels” where governance interventions are tested before deployment. But believability is not causality. Claims like “intervention A reduces escalation” require causal semantics that current simulation work typically does not specify. We propose adopting the causal counterfactual framework, distinguishing necessary causation (would the outcome have occurred without the intervention?) from sufficient causation (does the intervention reliably produce the outcome?). This distinction maps onto different stakeholder needs: moderators diagnosing incidents require evidence about necessity, while platform designers choosing policies require evidence about sufficiency. We formalize this mapping, show how simulation design can support estimation under explicit assumptions, and argue that the resulting quantities should be interpreted as simulator-conditional causal estimates whose policy relevance depends on simulator fidelity. Establishing this framework now is essential: it helps define what adequate fidelity means and moves the field from simulations that look realistic toward simulations that can support policy changes.
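The necessary/sufficient distinction can be estimated from paired simulator runs that share the same exogenous noise, so each run has a counterfactual twin. The toy simulator below (a single latent "tension" variable and thresholds of our choosing) is purely illustrative, but the estimator shape is the general one:

```python
import random

def simulate(intervention: bool, u: float) -> bool:
    """Toy community simulator: True means the thread escalates.
    With the intervention the escalation threshold is higher."""
    threshold = 0.7 if intervention else 0.3
    return u > threshold

def estimate_pn_ps(n=100_000, seed=0):
    rng = random.Random(seed)
    pn_num = pn_den = ps_num = ps_den = 0
    for _ in range(n):
        u = rng.random()                 # shared noise: the same "world" twice
        y_with = simulate(True, u)
        y_without = simulate(False, u)
        # Necessity: intervention applied and no escalation -- would
        # escalation have happened without it?
        if not y_with:
            pn_den += 1
            pn_num += int(y_without)
        # Sufficiency: escalation happened without the intervention --
        # does applying it prevent escalation?
        if y_without:
            ps_den += 1
            ps_num += int(not y_with)
    return pn_num / pn_den, ps_num / ps_den
```

Here the two quantities happen to coincide (≈ 4/7); in general they differ, and the paper's point is that moderators need the first number while platform designers need the second, both read as simulator-conditional estimates.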
[38] Uncertainty as a Planning Signal: Multi-Turn Decision Making for Goal-Oriented Conversation
Xinyi Ling, Ye Liu, Reza Averly, Xia Ning
Main category: cs.CL
TL;DR: CUP framework integrates LLMs with structured planning for goal-oriented conversations, using uncertainty as a guiding signal for multi-turn decision making to balance information acquisition and target commitment.
Details
Motivation: Existing approaches for goal-oriented conversational systems have limitations: structured methods enable multi-step planning but rely on predefined schemas, while LLM-based approaches support flexible interactions but lack long-horizon decision making, resulting in poor coordination between information acquisition and target commitment.
Method: Formulates goal-oriented conversation as an uncertainty-aware sequential decision problem. Proposes Conversation Uncertainty-aware Planning (CUP) framework that integrates language models with structured planning: LLM proposes feasible actions, and a planner evaluates their long-term impact on uncertainty reduction.
Result: Experiments on multiple conversational benchmarks show CUP consistently improves success rates while requiring fewer interaction turns. Further analysis demonstrates that uncertainty-aware planning contributes to more efficient information acquisition and earlier confident commitment.
Conclusion: CUP effectively addresses the limitation of existing approaches by combining the flexibility of LLMs with structured planning for long-horizon decision making in goal-oriented conversations, using uncertainty as a guiding signal.
Abstract: Goal-oriented conversational systems require making sequential decisions under uncertainty about the user’s intent, where the algorithm must balance information acquisition and target commitment over multiple turns. Existing approaches address this challenge from different perspectives: structured methods enable multi-step planning but rely on predefined schemas, while LLM-based approaches support flexible interactions but lack long-horizon decision making, resulting in poor coordination between information acquisition and target commitment. To address this limitation, we formulate goal-oriented conversation as an uncertainty-aware sequential decision problem, where uncertainty serves as a guiding signal for multi-turn decision making. We propose a Conversation Uncertainty-aware Planning framework (CUP) that integrates language models with structured planning: a language model proposes feasible actions, and a planner evaluates their long-term impact on uncertainty reduction. Experiments on multiple conversational benchmarks show that CUP consistently improves success rates while requiring fewer interaction turns. Further analysis demonstrates that uncertainty-aware planning contributes to more efficient information acquisition and earlier confident commitment.
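The core loop, commit when confident and otherwise ask the most informative question, can be sketched with a belief distribution over targets and expected-entropy scoring. The commit threshold and the question-as-partition representation are assumptions for illustration, not CUP's exact mechanics:

```python
import math

def entropy(belief):
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def expected_entropy_after(belief, partition):
    """partition maps each possible answer to the set of targets
    consistent with it; returns E[posterior entropy] over answers."""
    exp_h = 0.0
    for targets in partition.values():
        mass = sum(belief[t] for t in targets)
        if mass > 0:
            cond = {t: belief[t] / mass for t in targets}
            exp_h += mass * entropy(cond)
    return exp_h

def choose_action(belief, questions, commit_threshold=0.8):
    best_target = max(belief, key=belief.get)
    if belief[best_target] >= commit_threshold:
        return ("commit", best_target)   # confident: stop asking
    # Otherwise ask the question with the largest expected entropy drop.
    best_q = min(questions,
                 key=lambda q: expected_entropy_after(belief, questions[q]))
    return ("ask", best_q)
```

This reproduces the paper's qualitative behavior: informative questions are preferred while uncertainty is high, and the agent commits earlier once one hypothesis dominates, reducing interaction turns.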
[39] AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference
Fangzhou Lin, Peiran Li, Shuo Xing, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu
Main category: cs.CL
TL;DR: AdaptFuse is a training-free framework that combines symbolic Bayesian reasoning with frozen LLMs for personalized recommendation, avoiding fine-tuning on sensitive user data.
Details
Motivation: LLMs struggle with accumulating evidence across multiple interactions and fail to update beliefs consistently with Bayesian inference. Existing solutions require fine-tuning on sensitive user data, limiting privacy-conscious applications.
Method: Externalizes probabilistic computation from LLMs: symbolic module maintains Bayesian posterior over discrete hypothesis set, frozen LLM provides semantic reasoning via multi-sample Dirichlet aggregation. Uses entropy-adaptive fusion to weight each source by predictive confidence.
Result: Outperforms prompting baselines and fine-tuned Bayesian Teaching models across flight, hotel, and web shopping recommendation tasks on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. Accuracy improves monotonically over interaction rounds.
Conclusion: Principled inference-time algorithms can substitute for fine-tuning in personalized recommendation without storing or training on sensitive user data.
Abstract: Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains: flight recommendation, hotel recommendation, and web shopping; on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.
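The externalized machinery has two small parts: an exact Bayesian update over a discrete hypothesis set, and a fusion rule that weights the symbolic posterior against the LLM's distribution by confidence. The inverse-entropy weighting below is one plausible instantiation of entropy-adaptive fusion, not necessarily the paper's exact formula:

```python
import math

def bayes_update(prior, likelihood):
    """Exact posterior over a discrete hypothesis set.
    prior[h] = P(h); likelihood[h] = P(observed evidence | h)."""
    post = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def entropy_adaptive_fusion(symbolic, llm):
    """Mix the two distributions, weighting each by its confidence
    (1 - normalized entropy), so the sharper signal dominates."""
    h_max = math.log(len(symbolic))
    w_sym = 1.0 - entropy(symbolic) / h_max
    w_llm = 1.0 - entropy(llm) / h_max
    if w_sym + w_llm == 0:          # both maximally uncertain
        w_sym = w_llm = 0.5
    fused = {h: w_sym * symbolic[h] + w_llm * llm[h] for h in symbolic}
    z = sum(fused.values())
    return {h: p / z for h, p in fused.items()}
```

As rounds accumulate the symbolic posterior sharpens, its entropy falls, and the fusion automatically shifts weight away from the frozen LLM, which is consistent with the monotonic accuracy improvement the paper reports over interaction rounds.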
[40] Predict, Don’t React: Value-Based Safety Forecasting for LLM Streaming
Pride Kavumba, Koki Wataoka, Huy H. Nguyen, Jiaxuan Li, Masaya Ohagi
Main category: cs.CL
TL;DR: StreamGuard: A unified streaming guardrail for LLMs that formulates moderation as forecasting future harmfulness rather than boundary detection, using Monte Carlo rollouts for supervision without exact boundary labels.
Details
Motivation: Existing streaming guardrails for LLMs use boundary detection to identify when responses become unsafe, but this requires exact token-level boundary annotations. The authors propose a forecasting approach that predicts future harmfulness from partial prefixes for more effective early intervention.
Method: StreamGuard formulates moderation as a forecasting problem where given a partial prefix, the model predicts expected harmfulness of likely future continuations. It uses Monte Carlo rollouts for supervision, enabling early intervention without requiring exact boundary annotations. The approach is model-agnostic and can transfer across tokenizers and model families.
Result: StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On QWENGUARDTEST response_loc streaming benchmark, it achieves 97.5 F1, 95.1 recall, 92.6% on-time intervention, reducing miss rate from 7.9% to 4.9%. The forecasting supervision transfers effectively across models.
Conclusion: Forecasting-based supervision is an effective strategy for low-latency safety intervention in streaming LLM deployments, enabling strong end-to-end streaming moderation without exact boundary labels and with effective transfer across different model architectures.
Abstract: In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
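The forecasting target is simple to state: for each streamed prefix, sample continuations and average their harmfulness. A sketch under assumed interfaces (continue_fn and harm_fn are placeholders for a continuation sampler and a full-text safety scorer, not actual StreamGuard APIs):

```python
import random

def rollout_harm_target(prefix, continue_fn, harm_fn, n_rollouts=16, seed=0):
    """Monte Carlo estimate of expected future harmfulness in [0, 1].
    This value, not a token-level boundary label, supervises the
    streaming predictor."""
    rng = random.Random(seed)
    scores = [harm_fn(prefix + continue_fn(prefix, rng))
              for _ in range(n_rollouts)]
    return sum(scores) / len(scores)

def streaming_moderate(chunks, predictor, threshold=0.5):
    """At serving time a trained predictor replaces the rollouts:
    one cheap forward pass per streamed chunk, stopping early if risky."""
    text = ""
    for chunk in chunks:
        text += chunk
        if predictor(text) >= threshold:
            return ("intervene", text)
    return ("allow", text)
```

Because the target is a score over futures rather than a tokenizer-specific boundary index, it transfers across model families, which is the property the 1B-scale transfer results rely on.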
[41] RUQuant: Towards Refining Uniform Quantization for Large Language Models
Han Liu, Haotian Gao, Changya Li, Feng Zhang, Xiaotong Zhang, Wei Wang, Hong Yu
Main category: cs.CL
TL;DR: RUQuant: A two-stage orthogonal transformation method for post-training quantization of LLMs that addresses activation distribution non-uniformity using Householder reflections and Givens rotations to achieve near-optimal quantization performance without fine-tuning.
Details
Motivation: Large language models face deployment challenges due to size and complexity, especially under resource constraints. Post-training quantization is practical but existing methods suffer from accuracy degradation due to non-uniform activation distributions, which shift optimal quantization points away from interval midpoints.
Method: Two-stage orthogonal transformation method: 1) Activations divided into blocks and mapped to uniformly sampled target vectors using composite orthogonal matrices (Householder reflections + Givens rotations), 2) Global Householder reflection fine-tuned to minimize quantization error using Transformer output discrepancies.
Result: Achieves 99.8% of full-precision accuracy with W6A6 quantization and 97% with W4A4 quantization for a 13B LLM, within approximately one minute. Fine-tuned variant yields even higher accuracy, demonstrating effectiveness and scalability.
Conclusion: RUQuant provides an effective solution for LLM quantization by addressing activation distribution non-uniformity through orthogonal transformations, achieving near-optimal performance without requiring model fine-tuning while maintaining computational efficiency.
Abstract: The increasing size and complexity of large language models (LLMs) have raised significant challenges in deployment efficiency, particularly under resource constraints. Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. While existing methods focus on uniform quantization schemes for both weights and activations, they often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions. We identify the core issue as the non-uniform distribution of activations within the quantization interval, which causes the optimal quantization point under the Lloyd-Max criterion to shift away from the midpoint of the interval. To address this issue, we propose a two-stage orthogonal transformation method, RUQuant. In the first stage, activations are divided into blocks. Each block is mapped to uniformly sampled target vectors using composite orthogonal matrices, which are constructed from Householder reflections and Givens rotations. In the second stage, a global Householder reflection is fine-tuned to further minimize quantization error using Transformer output discrepancies. Empirical results show that our method achieves near-optimal quantization performance without requiring model fine-tuning: RUQuant achieves 99.8% of full-precision accuracy with W6A6 and 97% with W4A4 quantization for a 13B LLM, within approximately one minute. A fine-tuned variant yields even higher accuracy, demonstrating the effectiveness and scalability of our approach.
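The key property exploited here is that orthogonal transforms preserve norms and inner products, so activations can be redistributed toward a more uniform profile before quantization while the inverse rotation is folded into adjacent weights. A minimal sketch of the two building blocks (Householder reflection, Givens rotation) plus a round-to-nearest error measure; this illustrates the mechanism, not RUQuant's actual block mapping:

```python
import math

def householder(v):
    """H = I - 2 v v^T / (v^T v): an orthogonal reflection."""
    n, nrm2 = len(v), sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / nrm2
             for j in range(n)] for i in range(n)]

def givens(n, i, j, theta):
    """Identity except a plane rotation by theta in coordinates (i, j)."""
    g = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    g[i][i] = g[j][j] = c
    g[i][j], g[j][i] = -s, s
    return g

def matvec(m, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in m]

def quant_sq_error(x, bits=4):
    """Squared error of absmax round-to-nearest uniform quantization."""
    scale = max(abs(v) for v in x) / (2 ** (bits - 1) - 1)
    return sum((v - round(v / scale) * scale) ** 2 for v in x)
```

A vector with one large outlier quantizes poorly under absmax scaling; reflecting or rotating it so the mass spreads evenly across coordinates typically shrinks the error, the effect RUQuant engineers deliberately via its Lloyd-Max-guided target vectors.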
[42] GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
Xinyu Geng, Yanjing Xiao, Yuyang Zhang, Hanwen Wang, Xinyan Liu, Rui Min, Tianqing Fang, Yi R. Fung
Main category: cs.CL
TL;DR: GeoBrowse is a multimodal geolocation benchmark requiring visual cue composition and multi-hop verification, with GATE agent workflow using specialized tools for visual reasoning and knowledge-intensive queries.
Details
Motivation: Existing multimodal benchmarks lack requirements for both weak visual cues composition and BrowseComp-style multi-hop verification. Geolocation serves as a natural testbed as it requires combining ambiguous visual cues with open-web evidence validation.
Method: Introduces GeoBrowse benchmark with two levels: Level 1 tests extracting/composing fragmented visual cues; Level 2 adds long-tail knowledge and entity obfuscation. Provides GATE agent workflow with five think-with-image tools and four knowledge-intensive tools, plus expert-annotated stepwise traces for evaluation.
Result: GATE outperforms direct inference and open-source agents, showing that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, leading to more reliable evidence steps and fewer integration errors.
Conclusion: GeoBrowse provides a comprehensive multimodal benchmark for testing deep research agents, demonstrating the importance of specialized tool-use planning for visual reasoning and knowledge-intensive multi-hop queries in geolocation tasks.
Abstract: Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse benchmark and code are available at https://github.com/ornamentt/GeoBrowse
[43] Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models
Sailesh Kiran Kurra, Shiek Ruksana, Vishal Borusu
Main category: cs.CL
TL;DR: A framework called GCAN (Causal Graph Attention Network) that reduces hallucinations in LLMs by analyzing attention flow and using token-level graphs with causal contribution scores to identify and suppress hallucination-prone nodes during generation.
Details
Motivation: LLMs suffer from hallucinations that produce factually incorrect or unsupported outputs, which is problematic in critical applications like medical diagnosis and legal reasoning. There's a need to improve factual reliability and interpretability of LLM outputs.
Method: Proposes GCAN framework that constructs token-level graphs combining self-attention weights and gradient-based influence scores. Uses Causal Contribution Score (CCS) to quantify factual dependency of tokens and introduces fact-anchored graph reweighting layer to dynamically reduce influence of hallucination-prone nodes during generation.
Result: Experiments on TruthfulQA and HotpotQA benchmarks show 27.8% reduction in hallucination rate and 16.4% improvement in factual accuracy over baseline retrieval-augmented generation (RAG) models.
Conclusion: The GCAN framework contributes to improving interpretability, robustness, and factual reliability of LLM architectures by addressing hallucination issues through causal analysis of attention mechanisms.
Abstract: This paper focuses on hallucinations produced by large language models (LLMs). LLMs have shown extraordinary language understanding and generation capabilities, but they have a major disadvantage: hallucinations, which yield outputs that are factually incorrect, misleading, or unsupported by the input data. These hallucinations cause serious problems in scenarios such as medical diagnosis or legal reasoning. Through this work, we propose the causal graph attention network (GCAN) framework, which reduces hallucinations by interpreting internal attention flow within a transformer architecture, constructing token-level graphs that combine self-attention weights and gradient-based influence scores. Our method quantifies each token's factual dependency using a new metric, the Causal Contribution Score (CCS). We further introduce a fact-anchored graph reweighting layer that dynamically reduces the influence of hallucination-prone nodes during generation. Experiments on standard benchmarks such as TruthfulQA and HotpotQA show a 27.8 percent reduction in hallucination rate and a 16.4 percent improvement in factual accuracy over baseline retrieval-augmented generation (RAG) models. This work contributes to the interpretability, robustness, and factual reliability of future LLM architectures.
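The summary does not spell out the CCS formula; one plausible shape, consistent with the description (combine normalized attention mass with gradient-based influence, then damp low-scoring nodes), is sketched below. Both functions are hypothetical illustrations, not the paper's definitions:

```python
def causal_contribution_scores(attn_mass, grad_influence, alpha=0.5):
    """Hypothetical CCS: convex combination of normalized per-token
    attention mass and gradient-based influence (illustrative only)."""
    def normalize(xs):
        z = sum(xs) or 1.0
        return [x / z for x in xs]
    a, g = normalize(attn_mass), normalize(grad_influence)
    return [alpha * ai + (1 - alpha) * gi for ai, gi in zip(a, g)]

def fact_anchored_reweight(attn_weights, ccs, floor=0.1):
    """Scale attention on low-CCS (hallucination-prone) tokens down,
    then renormalize so the weights remain a distribution."""
    scaled = [w * max(c, floor) for w, c in zip(attn_weights, ccs)]
    z = sum(scaled)
    return [w / z for w in scaled]
```

The floor keeps suppressed tokens from vanishing entirely, and renormalization keeps the result a valid attention distribution for the next layer.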
[44] Emergent Inference-Time Semantic Contamination via In-Context Priming
Marcin Abram
Main category: cs.CL
TL;DR: Few-shot prompting with culturally loaded numbers can cause semantic drift in capable LLMs, inducing harmful content generation through structural and semantic contamination mechanisms.
Details
Motivation: To investigate whether few-shot prompting alone can induce emergent misalignment in LLMs, challenging previous conclusions that it doesn't, and to understand the boundary conditions for inference-time contamination.
Method: Controlled experiments injecting five culturally loaded numbers as few-shot demonstrations before semantically unrelated prompts, testing models of varying capabilities, and comparing with structurally inert demonstrations (nonsense strings).
Result: Capable models with richer cultural-associative representations show significant distributional shifts toward darker, authoritarian, and stigmatized themes, while smaller models don’t. Both structural format and semantic content contamination mechanisms were identified.
Conclusion: Inference-time semantic drift is real and measurable in capable LLMs, with important security implications for few-shot prompting applications. The effect depends on model capability and involves separable contamination mechanisms.
Abstract: Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that $k$-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.
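Measuring a "significant distributional shift" between primed and unprimed outputs comes down to comparing two theme distributions; smoothed KL divergence is one standard choice (the smoothing constant and theme labels below are illustrative, not the paper's metric):

```python
import math

def kl_divergence(p, q, eps=1e-6):
    """KL(P || Q) over a shared theme vocabulary, with additive
    smoothing so themes missing from Q stay finite."""
    keys = set(p) | set(q)
    zp = sum(p.get(k, 0.0) + eps for k in keys)
    zq = sum(q.get(k, 0.0) + eps for k in keys)
    total = 0.0
    for k in keys:
        pk = (p.get(k, 0.0) + eps) / zp
        qk = (q.get(k, 0.0) + eps) / zq
        total += pk * math.log(pk / qk)
    return total
```

Comparing KL(primed || baseline) against KL(nonsense-primed || baseline) would separate the two mechanisms the paper identifies: semantic content contamination versus structural format contamination.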
[45] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Chengwei Qin
Main category: cs.CL
TL;DR: FURINA-Builder is a multi-agent pipeline for automatically constructing customizable role-playing benchmarks, used to create FURINA-Bench for evaluating LLMs on role-playing tasks with novel findings about reasoning-performance trade-offs.
Details
Motivation: Existing role-playing benchmarks are becoming obsolete due to narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios, creating a need for more flexible and comprehensive evaluation frameworks.
Method: FURINA-Builder uses a multi-agent collaboration pipeline that simulates dialogues between test characters and other characters from a character-scene pool, with an LLM judge selecting evaluation dimensions and adjusting responses into final test utterances to build customizable benchmarks at any scale.
Result: Built FURINA-Bench with both established and synthesized characters, finding o3 and DeepSeek-R1 perform best on English/Chinese RP tasks, established characters outperform synthesized ones, and reasoning capabilities create a novel trade-off: reasoning improves RP performance but increases hallucinations.
Conclusion: FURINA-Builder effectively addresses benchmark obsolescence in role-playing evaluation, revealing important insights about the Pareto frontier between RP performance and reliability, particularly the trade-off between reasoning benefits and hallucination risks.
Abstract: As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in the RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
[46] Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison
Jihoon Jeong
Main category: cs.CL
TL;DR: First comparative analysis of emotion vector extraction methods for small language models (100M-10B parameters), showing generation-based extraction outperforms comprehension-based, with emotion representations localizing at middle transformer layers and revealing cross-lingual safety concerns.
Details
Motivation: While frontier models have shown internal emotion representations, it's unknown whether smaller production-scale language models (100M-10B parameters) possess similar capabilities. Understanding emotion representations in SLMs is crucial for safety, deployment, and bridging behavioral profiling with internal analysis.
Method: Evaluated 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods: generation-based and comprehension-based. Analyzed emotion separation, localization in transformer layers, validated against anisotropy baselines, and conducted steering experiments verified by external emotion classifier.
Result: Generation-based extraction produces statistically superior emotion separation (p=0.007; d=-107.5). Emotion representations localize at middle transformer layers (~50% depth) following a U-shaped curve across architectures. Steering reveals three regimes: surgical transformation, repetitive collapse, and explosive degradation. Cross-lingual entanglement in Qwen shows safety concerns with Chinese token activation.
Conclusion: SLMs possess structured emotion representations similar to frontier models, with generation-based extraction being optimal. Emotion localization follows architecture-invariant patterns, and steering experiments reveal model-specific behaviors with safety implications for multilingual deployment.
Abstract: Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen’s d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes – surgical (coherent text transformation), repetitive collapse, and explosive (text degradation) – quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.
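Emotion-vector extraction and steering of the kind the paper studies can be sketched as a difference of means over layer activations. The toy activations, the layer choice, and the steering coefficient below are illustrative assumptions, not the paper's implementation.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def emotion_vector(emotion_acts, neutral_acts):
    """Difference-of-means direction at a chosen (e.g. middle) layer."""
    mu_e, mu_n = mean(emotion_acts), mean(neutral_acts)
    return [a - b for a, b in zip(mu_e, mu_n)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the extracted emotion direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 3-d activations from "joy" prompts vs. neutral prompts
joy = emotion_vector([[1.0, 0.0, 2.0], [1.2, 0.2, 1.8]],
                     [[0.1, 0.1, 1.0], [0.3, -0.1, 1.0]])
steered = steer([0.0, 0.0, 0.0], joy, alpha=0.5)
```

In the paper's framing, a too-large `alpha` would correspond to the repetitive-collapse or explosive regimes rather than surgical transformation.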
[47] Embedding Enhancement via Fine-Tuned Language Models for Learner-Item Cognitive Modeling
Yuanhao Liu, Zihan Zhou, Kaiying Wu, Shuo Liu, Yiyang Huang, Jiajun Guo, Aimin Zhou, Hong Qian
Main category: cs.CL
TL;DR: EduEmbed: A unified framework that fine-tunes language models to enhance learner-item cognitive modeling for cognitive diagnosis tasks in online education systems.
Details
Motivation: Current cognitive diagnosis models rely on ID embeddings but lack rich semantic information. Language models offer semantic enhancement potential but face challenges: misalignment between LM training objectives and cognitive diagnosis tasks, and lack of unified framework for integrating textual embeddings across diverse cognitive diagnosis tasks.
Method: Two-stage framework: 1) Fine-tune language models using role-specific representations and interaction diagnoser to bridge semantic gap; 2) Use textual adapter to extract task-relevant semantics and integrate with existing cognitive modeling paradigms.
Result: Achieved robust performance on four cognitive diagnosis tasks and a computerized adaptive testing task. Analysis reveals impact of semantic information across diverse tasks.
Conclusion: EduEmbed successfully integrates language models with cognitive diagnosis models, providing insights for future LM applications in online intelligent education systems.
Abstract: Learner-item cognitive modeling plays a central role in the web-based online intelligent education system by enabling cognitive diagnosis (CD) across diverse online educational scenarios. Although ID embedding remains the mainstream approach in cognitive modeling due to its effectiveness and flexibility, recent advances in language models (LMs) have introduced new possibilities for incorporating rich semantic representations to enhance CD performance. This highlights the need for a comprehensive analysis of how LMs enhance embeddings through semantic integration across mainstream CD tasks. This paper identifies two key challenges in fully leveraging LMs in existing work: Misalignment between the training objectives of LMs and CD models creates a distribution gap in feature spaces; A unified framework is essential for integrating textual embeddings across varied CD tasks while preserving the strengths of existing cognitive modeling paradigms to ensure the robustness of embedding enhancement. To address these challenges, this paper introduces EduEmbed, a unified embedding enhancement framework that leverages fine-tuned LMs to enrich learner-item cognitive modeling across diverse CD tasks. EduEmbed operates in two stages. In the first stage, we fine-tune LMs based on role-specific representations and an interaction diagnoser to bridge the semantic gap of CD models. In the second stage, we employ a textual adapter to extract task-relevant semantics and integrate them with existing modeling paradigms to improve generalization. We evaluate the proposed framework on four CD tasks and a computerized adaptive testing (CAT) task, achieving robust performance. Further analysis reveals the impact of semantic information across diverse tasks, offering key insights for future research on the application of LMs in CD for online intelligent education systems.
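The second-stage textual adapter can be pictured as a small projection mapping LM embeddings into the space of the existing ID embeddings. The single linear layer and additive fusion below are assumptions for the sketch, not EduEmbed's actual adapter design.

```python
def adapt(text_emb, weights, bias):
    """Toy textual adapter: one linear layer projecting an LM embedding
    into the CD model's ID-embedding space."""
    return [sum(w * x for w, x in zip(row, text_emb)) + b
            for row, b in zip(weights, bias)]

def fuse(id_emb, adapted):
    """Combine the ID embedding with the adapted semantic embedding."""
    return [a + b for a, b in zip(id_emb, adapted)]

# Identity adapter on a 2-d toy example
adapted = adapt([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
enhanced = fuse([0.5, 0.5], adapted)  # -> [1.5, 2.5]
```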
[48] Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
Lingjie Zeng, Xiaofan Chen, Yanbo Wang, Xiuying Chen
Main category: cs.CL
TL;DR: CoT compression can degrade model trustworthiness (safety, hallucination resistance, multilingual robustness) even when preserving accuracy; different compression methods have distinct trustworthiness degradation profiles; proposed alignment-aware DPO variant reduces CoT length with smaller trustworthiness loss.
Details
Motivation: Existing evaluations of Long-CoT reasoning compression focus only on task accuracy and token savings, ignoring trustworthiness properties. Since trustworthiness is encoded in the same parameter space that compression modifies, preserving accuracy doesn't guarantee preserving trustworthiness, creating a critical gap in understanding compression's impact on model safety and reliability.
Method: Systematic empirical study evaluating multiple models of different scales across three trustworthiness dimensions: safety, hallucination resistance, and multilingual robustness. Proposed normalized efficiency score for fair comparison, and introduced alignment-aware DPO variant as an existence proof for better trustworthiness preservation.
Result: CoT compression frequently introduces trustworthiness regressions; different compression methods exhibit markedly different degradation profiles across dimensions; naïve scalar metrics obscure trustworthiness trade-offs; alignment-aware DPO variant reduces CoT length by 19.3% with substantially smaller trustworthiness loss.
Conclusion: CoT compression should be optimized for both efficiency and trustworthiness, treating both as equally important design constraints. Current compression methods risk degrading safety and reliability even when maintaining accuracy.
Abstract: Long chain-of-thought (Long-CoT) reasoning models have motivated a growing body of work on compressing reasoning traces to reduce inference cost, yet existing evaluations focus almost exclusively on task accuracy and token savings. Trustworthiness properties, whether acquired or reinforced through post-training, are encoded in the same parameter space that compression modifies. This means preserving accuracy does not, a priori, guarantee preserving trustworthiness. We conduct the first systematic empirical study of how CoT compression affects model trustworthiness, evaluating multiple models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. Under controlled comparisons, we find that CoT compression frequently introduces trustworthiness regressions and that different methods exhibit markedly different degradation profiles across dimensions. To enable fair comparison across bases, we propose a normalized efficiency score for each dimension that reveals how naïve scalar metrics can obscure trustworthiness trade-offs. As an existence proof, we further introduce an alignment-aware DPO variant that reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss. Our findings suggest that CoT compression should be optimized not only for efficiency but also for trustworthiness, treating both as equally important design constraints.
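The abstract names a per-dimension normalized efficiency score without giving its form. Purely to illustrate the idea (quality retained traded against tokens kept), one toy normalization might look like the following; the formula is an assumption, not the paper's metric.

```python
def toy_efficiency(score_compressed, score_base, len_compressed, len_base):
    """Illustrative only: fraction of a trustworthiness score retained,
    normalized by the fraction of CoT length kept. Values above 1.0 mean
    the dimension survived compression better than the raw length suggests."""
    retention = score_compressed / score_base
    rel_length = len_compressed / len_base
    return retention / rel_length

# e.g. a 19.3% length cut that keeps 98% of a safety score
eff = toy_efficiency(0.98, 1.0, 807, 1000)
```

Reporting one such score per dimension would surface trade-offs that a single scalar (accuracy per token) obscures, which is the point the paper makes.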
[49] Many Preferences, Few Policies: Towards Scalable Language Model Personalization
Cheol Woo Kum, Jai Moondra, Roozbeh Nahavandi, Andrew Perrault, Milind Tambe, Swati Gupta
Main category: cs.CL
TL;DR: PALM algorithm creates a small portfolio of LLMs to cover diverse user preferences across multiple traits, with theoretical guarantees on portfolio size and approximation quality.
Details
Motivation: Maintaining a separate LLM for each user is impractical due to compute, memory, and system constraints. Need a scalable solution for LLM personalization that captures diverse user preferences while minimizing system complexity.
Method: Models user preferences as multi-dimensional weight vectors across traits (safety, humor, brevity). PALM algorithm generates a small portfolio of LLMs such that for any user weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective.
Result: First theoretical guarantees on both portfolio size and approximation quality for LLM personalization. Empirical results validate guarantees and demonstrate greater output diversity compared to common baselines.
Conclusion: Provides principled method for LLM personalization that balances system cost with personalization quality, characterizing trade-offs and diversity requirements for covering user preference landscapes.
Abstract: The holy grail of LLM personalization is a single LLM for each user, perfectly aligned with that user’s preferences. However, maintaining a separate LLM per user is impractical due to constraints on compute, memory, and system complexity. We address this challenge by developing a principled method for selecting a small portfolio of LLMs that captures representative behaviors across heterogeneous users. We model user preferences across multiple traits (e.g., safety, humor, brevity) through a multi-dimensional weight vector. Given reward functions across these dimensions, our algorithm PALM (Portfolio of Aligned LLMs) generates a small portfolio of LLMs such that, for any weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective. To the best of our knowledge, this is the first result that provides theoretical guarantees on both the size and approximation quality of LLM portfolios for personalization. It characterizes the trade-off between system cost and personalization, as well as the diversity of LLMs required to cover the landscape of user preferences. We provide empirical results that validate these guarantees and demonstrate greater output diversity over common baselines.
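PALM's guarantee — that for every weight vector the portfolio contains a near-optimal model for the scalarized objective — can be illustrated by a naive grid construction over the weight simplex. The candidate reward vectors and grid resolution are invented for the sketch; this says nothing about PALM's actual algorithm or its size bounds.

```python
from itertools import product

def scalarize(weights, rewards):
    """Weighted sum of per-trait rewards for one model."""
    return sum(w * r for w, r in zip(weights, rewards))

def grid_portfolio(candidates, grid_steps=4):
    """For each weight vector on a simplex grid, keep the best candidate;
    the portfolio is the union of these winners."""
    d = len(next(iter(candidates.values())))
    portfolio = set()
    for point in product(range(grid_steps + 1), repeat=d):
        if sum(point) != grid_steps:
            continue  # keep only points on the probability simplex
        w = [p / grid_steps for p in point]
        best = max(candidates, key=lambda m: scalarize(w, candidates[m]))
        portfolio.add(best)
    return portfolio

# Toy rewards over (safety, humor): a balanced model wins mixed weights
models = {"safe": [1.0, 0.0], "funny": [0.0, 1.0], "balanced": [0.6, 0.6]}
```

Here extreme weights select the specialist models while balanced weights select "balanced", so three policies cover the whole preference landscape.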
[50] A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire
Main category: cs.CL
TL;DR: SLM-based semi-automated workflow extracts structured information from pediatric renal biopsy reports using QA tasks with clinician guidance, achieving 84.3% accuracy on CPU-only infrastructure.
Details
Motivation: EPR systems contain valuable clinical data trapped in unstructured text, but using large language models raises privacy concerns and requires substantial computational resources. Need for resource-efficient, privacy-preserving solutions for clinical information extraction.
Method: Developed iterative workflow with clinical oversight, framing extraction as Question-Answering task using instruction-tuned small language models (SLMs). Used clinician-guided entity guidelines and few-shot examples, evaluated five SLMs with disagreement modeling for clinical review prioritization.
Result: Gemma 2 2B achieved highest accuracy at 84.3%, outperforming spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19%, few-shot examples by 6-38%, but benefits don’t compound when combined.
Conclusion: SLMs can effectively extract structured information from specialized clinical domains using CPU-only infrastructure with minimal clinician involvement, addressing privacy and resource constraints in healthcare settings.
Abstract: Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
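The disagreement-modelling step — prioritising reports where the SLMs' extracted answers conflict — can be sketched with a simple agreement threshold. The threshold value, normalisation, and example answers are assumptions, not the paper's framework.

```python
from collections import Counter

def prioritise_for_review(answers_by_report, threshold=0.6):
    """Flag reports whose modal extracted answer is supported by fewer
    than `threshold` of the models that answered."""
    flagged = []
    for report_id, answers in answers_by_report.items():
        counts = Counter(a.strip().lower() for a in answers)
        agreement = counts.most_common(1)[0][1] / len(answers)
        if agreement < threshold:
            flagged.append(report_id)
    return flagged

# Five hypothetical SLM answers per report for one extracted entity
reports = {
    "r1": ["IgA nephropathy"] * 5,                  # unanimous: auto-accept
    "r2": ["FSGS", "MCD", "FSGS", "MCD", "lupus"],  # conflicting: review
}
```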
[51] Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs
Jason Chan, Robert Gaizauskas, Zhixue Zhao
Main category: cs.CL
TL;DR: The paper critiques neurosymbolic fact-checking systems that use formal logic to verify LLM outputs, arguing they fail to detect misleading claims due to systematic differences between logical soundness and human inference patterns.
Details
Motivation: To address the limitations of neurosymbolic fact-checking approaches that rely on formal logic to verify LLM outputs, which structurally fail to detect misleading claims due to divergences between logically sound conclusions and typical human inferences.
Method: Drawing on cognitive science and pragmatics studies to develop a typology of cases where logically sound conclusions systematically elicit unsupported human inferences, then proposing to leverage LLMs’ human-like reasoning tendencies to validate formal components’ outputs.
Result: Identifies systematic cases where formal logic verification fails to detect misleading claims, and proposes a complementary approach that uses LLMs to validate formal system outputs against potentially misleading conclusions.
Conclusion: Formal logic alone is insufficient for fact-checking LLM outputs; instead, LLMs’ human-like reasoning should be leveraged as a feature to complement formal verification by detecting misleading conclusions that logical systems miss.
Abstract: As large language models (LLMs) are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models’ outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, i.e. whether they can be validly derived from premises that are verified to be true. We argue that such approaches structurally fail to detect misleading claims due to systematic divergences between conclusions that are logically sound and inferences that humans typically make and accept. Drawing on studies in cognitive science and pragmatics, we present a typology of cases in which logically sound conclusions systematically elicit human inferences that are unsupported by the underlying premises. Consequently, we advocate for a complementary approach: leveraging the human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of formal components in neurosymbolic systems against potentially misleading conclusions.
[52] Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
Mir Tafseer Nayeem, Davood Rafiei
Main category: cs.CL
TL;DR: This paper investigates dialectal biases in LLMs, specifically American vs. British English, revealing systematic preference for American English across pretraining data, tokenization, and model outputs, raising concerns about linguistic homogenization and equity.
Details
Motivation: LLMs are deployed globally but expose limited language settings, primarily "English (US)," despite English's global diversity and colonial history. The paper aims to investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape LLM development, focusing on dialectal asymmetries between American and British English.
Method: 1) Constructed curated corpus of 1,813 AmE-BrE variants; 2) Introduced DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence; 3) Triangulated evidence across three stages: audits of six major pretraining corpora, tokenizer analyses measuring segmentation costs, and generative evaluations of model outputs.
Result: 1) Pretraining corpora show systematic skew toward AmE; 2) Tokenizer analyses reveal BrE forms incur higher segmentation costs; 3) Generative evaluations show persistent AmE preference in model outputs. Contemporary LLMs privilege AmE as the de facto norm.
Conclusion: The study reveals systematic dialectal biases in LLMs favoring American English, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment. It motivates practical steps toward more dialectally inclusive language technologies.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably “English (US),” despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our knowledge, this is the first systematic and multi-faceted examination of dialectal asymmetries in standard English varieties across the phases of LLM development. We find that contemporary LLMs privilege AmE as the de facto norm, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while motivating practical steps toward more dialectally inclusive language technologies.
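A minimal version of the generative-preference stage might score, for each AmE–BrE variant pair, which spelling a model assigns higher probability. The log-probabilities below are invented, and this win-rate is a simple stand-in for — not a description of — DiAlign's distributional method.

```python
def ame_preference_rate(pairs):
    """Fraction of variant pairs where the (hypothetical) model
    log-probability of the American spelling exceeds the British one."""
    wins = sum(lp_ame > lp_bre for _, lp_ame, lp_bre in pairs)
    return wins / len(pairs)

# (AmE/BrE word pair, logP of AmE form, logP of BrE form) — invented numbers
pairs = [
    (("color", "colour"),      -4.1, -5.3),
    (("organize", "organise"), -6.0, -6.8),
    (("liter", "litre"),       -7.2, -6.9),  # a rare BrE-favoured case
]
rate = ame_preference_rate(pairs)  # 2/3 of pairs favour AmE
```

A rate well above 0.5 on a large variant corpus would be evidence of the systematic AmE skew the paper reports.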
[53] DARE: Diffusion Large Language Models Alignment and Reinforcement Executor
Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, Jing Shao
Main category: cs.CL
TL;DR: DARE is an open framework for post-training and evaluating diffusion large language models (dLLMs), unifying various fine-tuning and reinforcement learning methods under a shared execution stack to address fragmentation in the dLLM ecosystem.
Details
Motivation: The open-source ecosystem for diffusion large language models is fragmented across model families and post-training pipelines, with paper-specific codebases making research iteration slow, reproduction difficult, and fair comparisons challenging.
Method: Built on verl and OpenCompass, DARE provides a unified framework for supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning for both masked and block diffusion language models.
Result: DARE offers broad algorithmic coverage across representative model families (LLaDA, Dream, SDAR, LLaDA2.x), reproducible benchmark evaluation, practical acceleration, and serves as a reusable research substrate for dLLM development.
Conclusion: DARE addresses fragmentation in the dLLM ecosystem by providing a comprehensive, unified framework for post-training and evaluation that enables faster research iteration, easier reproduction, and fairer comparisons across algorithms.
Abstract: Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present \textbf{DARE} (\textbf{d}LLMs \textbf{A}lignment and \textbf{R}einforcement \textbf{E}xecutor), an open framework for post-training and evaluating dLLMs. Built on top of verl~\cite{sheng2024hybridflow} and OpenCompass~\cite{2023opencompass}, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position that DARE serves as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.
[54] CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling
Dejan Čugalj, Aleksandar Jevremovic
Main category: cs.CL
TL;DR: CAWN introduces a continuous acoustic wave network architecture with O(L) complexity using complex-domain phasors and phase accumulation to overcome quadratic scaling of Transformers while preventing signal degradation in ultra-long contexts.
Details
Motivation: Transformers scale quadratically with sequence length, and recent linear-time alternatives like SSMs suffer from signal degradation over extended contexts. Need for efficient long-context processing without degradation.
Method: Projects hidden states into multi-headed complex-domain phasors with O(L) Phase Accumulation. Uses Selective Phase Resonance with Frequency-Dependent Retention, Hard-Threshold Gating, and Temporal Syntax Cache. Replaces dense projections with Depth-wise Harmonic Convolutions and Block Attention Residuals.
Result: 150M-parameter model trained on 100B tokens, evaluated at a 5B-token milestone. Achieves robust vocabulary acquisition and extended contextual denoising. Retrieves information across 2M tokens with 8.72GB VRAM, overcoming the O(L²) memory wall.
Conclusion: CAWN provides efficient long-context processing with O(L) complexity, overcoming both computational and memory limitations of Transformers while maintaining signal integrity over ultra-long sequences.
Abstract: Modern Large Language Models (LLMs) rely on Transformer self-attention, which scales quadratically with sequence length. Recent linear-time alternatives, like State Space Models (SSMs), often suffer from signal degradation over extended contexts. We introduce the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. Instead of discrete matrix-based attention, CAWN projects hidden states into multi-headed complex-domain phasors, achieving sequence mixing through a causal, $O(L)$ Phase Accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated Selective Phase Resonance mechanism incorporating Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache to capture short-term local dependencies. We also replace standard dense linear projections with Depth-wise Harmonic Convolutions for optimal spatial frequency mixing, augmented by Block Attention Residuals for depth-wise state routing. Scaled to a 150M-parameter model, CAWN utilizes custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. Trained via a continuous streaming loop on a 100-Billion-token corpus, the prototype is evaluated at a 5-Billion-token milestone. Empirical evaluations via a Targeted Semantic Retrieval protocol demonstrate robust vocabulary acquisition and extended explicitly learned contextual denoising. By leveraging $O(1)$ state-passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while strictly plateauing at 8.72 GB of Peak VRAM, empirically overcoming the $O(L^2)$ context memory wall.
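The core causal O(L) Phase Accumulation idea — mixing a sequence by running a single complex accumulator over per-token phasors, so each position sees a rolled-up summary of everything before it — can be sketched in a few lines. The real architecture adds gating, frequency-dependent retention, and multi-headed phasors that this toy omits.

```python
import cmath

def phase_accumulate(phases):
    """Causal O(L) mixing: position t receives the running sum of
    unit phasors e^{i*theta} from all positions <= t (O(1) per step)."""
    acc = 0j
    out = []
    for theta in phases:
        acc += cmath.exp(1j * theta)  # rotate onto the unit circle, accumulate
        out.append(acc)
    return out

# Two in-phase tokens reinforce; a pi-shifted token cancels one of them
states = phase_accumulate([0.0, 0.0, cmath.pi])
```

The O(1)-per-step accumulator is also what makes the chunked-prefill state-passing in the abstract possible: only `acc` needs to cross chunk boundaries.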
[55] Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation
Yongmin Yoo, Qiongkai Xu, Longbing Cao
Main category: cs.CL
TL;DR: ACE is a hybrid framework for patent claim validation that uses predictive entropy to route only uncertain claims to an expert LLM, achieving high accuracy with 78% cost reduction.
Details
Motivation: Patent claim validation requires zero-defect tolerance but faces a rigidity-resource dilemma: lightweight encoders struggle with legal nuances while exhaustive LLM verification is too costly.
Method: Proposes ACE framework using predictive entropy to identify high-uncertainty claims, then routes them to an expert LLM executing Chain of Patent Thought (CoPT) protocol based on 35 U.S.C. standards.
Result: Achieves best F1 of 94.95% among evaluated methods while reducing operational costs by 78% compared to standalone LLM deployments. Also creates ACE-40k benchmark with 40,000 claims.
Conclusion: ACE effectively bridges the rigidity-resource gap in patent validation by combining efficient routing with expert LLM analysis, enabling high accuracy at reduced cost.
Abstract: Automated validation of patent claims demands zero-defect tolerance, as even a single structural flaw can render a claim legally defective. Existing evaluation paradigms suffer from a rigidity-resource dilemma: lightweight encoders struggle with nuanced legal dependencies, while exhaustive verification via Large Language Models (LLMs) is prohibitively costly. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert then executes a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards. This design enables ACE to handle long-range legal dependencies more effectively while preserving efficiency. ACE achieves the best F1 among the evaluated methods at 94.95%, while reducing operational costs by 78% compared to standalone LLM deployments. We also construct ACE-40k, a 40,000-claim benchmark with MPEP-grounded error annotations, to facilitate further research.
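Editor's note: the entropy-based router at the heart of ACE can be sketched in a few lines. The 0.5-nat threshold and the function names are placeholders (the paper presumably calibrates its own cutoff); the expert LLM call is stubbed.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_claim(probs, expert_fn, threshold=0.5):
    """Keep confident (low-entropy) claims on the lightweight encoder;
    escalate uncertain ones to the expert LLM (stubbed by expert_fn)."""
    if predictive_entropy(probs) <= threshold:
        label = max(range(len(probs)), key=probs.__getitem__)
        return label, "encoder"
    return expert_fn(), "expert"
```

A claim scored [0.97, 0.03] by the encoder (entropy ≈ 0.13 nats) stays local, while [0.55, 0.45] (≈ 0.69 nats) is escalated; only the escalated fraction pays LLM cost, which is the source of the reported 78% saving.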
[56] High-Stakes Personalization: Rethinking LLM Customization for Individual Investor Decision-Making
Yash Ganpat Sawant
Main category: cs.CL
TL;DR: The paper identifies fundamental limitations of current LLM personalization approaches in the challenging domain of individual investing, highlighting four key axes where standard customization paradigms fail in high-stakes, temporally extended decision-making contexts.
Details
Motivation: Most personalized LLM systems operate in domains with stable user preferences and subjective ground truth, but individual investing presents unique challenges that expose fundamental limitations in current customization paradigms, requiring new approaches for high-stakes decision-making.
Method: The authors draw on their experience building and deploying an AI-augmented portfolio management system to identify four key limitations: behavioral memory complexity, thesis consistency under drift, style-signal tension, and alignment without ground truth.
Result: The paper identifies specific architectural responses that emerged from building the investment system and proposes open research directions for personalized NLP in high-stakes, temporally extended decision domains.
Conclusion: Individual investing exposes fundamental limitations in standard LLM customization paradigms, requiring new approaches that address temporal evolution, behavioral complexity, and the absence of clear ground truth in high-stakes decision-making contexts.
Abstract: Personalized LLM systems have advanced rapidly, yet most operate in domains where user preferences are stable and ground truth is either absent or subjective. We argue that individual investor decision-making presents a uniquely challenging domain for LLM personalization - one that exposes fundamental limitations in current customization paradigms. Drawing on our system, built and deployed for AI-augmented portfolio management, we identify four axes along which individual investing exposes fundamental limitations in standard LLM customization: (1) behavioral memory complexity, where investor patterns are temporally evolving, self-contradictory, and financially consequential; (2) thesis consistency under drift, where maintaining coherent investment rationale over weeks or months strains stateless and session-bounded architectures; (3) style-signal tension, where the system must simultaneously respect personal investment philosophy and surface objective evidence that may contradict it; and (4) alignment without ground truth, where personalization quality cannot be evaluated against a fixed label set because outcomes are stochastic and delayed. We describe the architectural responses that emerged from building the system and propose open research directions for personalized NLP in high-stakes, temporally extended decision domains.
[57] How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang
Main category: cs.CL
TL;DR: Benchmarking LLM agent skill usage reveals performance gains degrade in realistic settings where agents must retrieve skills from large collections, but query-specific skill refinement can recover lost performance.
Details
Motivation: Existing skill benchmarking focuses on idealized conditions with hand-crafted, task-specific skills, but real-world agents must search for and select relevant skills from large collections, and even matching skills may not be well-tailored for tasks.
Method: Conducted comprehensive study of skill utility under progressively challenging realistic settings where agents retrieve skills from 34k real-world skills without hand-curated options. Studied skill refinement strategies including query-specific and query-agnostic approaches, and validated on Terminal-Bench 2.0.
Result: Skill benefits are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in most challenging scenarios. Query-specific refinement substantially recovers lost performance when initial skills have reasonable relevance and quality. On Terminal-Bench 2.0, retrieval and refinement improved Claude Opus 4.6 pass rate from 57.7% to 65.5%.
Conclusion: Results highlight both promise and current limitations of skills for LLM-based agents, showing that while skill usage can improve performance, realistic retrieval and refinement challenges must be addressed for practical deployment.
Abstract: Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill-usage performance remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
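Editor's note: the retrieval step that the benchmark stresses (finding a relevant skill among 34k) is, at its simplest, dense retrieval over precomputed embeddings. A generic cosine-similarity sketch, with names and the embedding source assumed rather than taken from the paper:

```python
import numpy as np

def retrieve_skills(query_vec, skill_vecs, skill_names, k=3):
    """Top-k skills by cosine similarity between a query embedding and
    a matrix of precomputed skill embeddings (generic dense retrieval)."""
    q = query_vec / np.linalg.norm(query_vec)
    s = skill_vecs / np.linalg.norm(skill_vecs, axis=1, keepdims=True)
    sims = s @ q                      # cosine similarity per skill
    order = np.argsort(-sims)[:k]     # indices of the k best matches
    return [skill_names[i] for i in order]
```

In the paper's pipeline the retrieved skill is then optionally refined against the specific query, which is the step shown to recover most of the lost performance.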
[58] Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
Jinrui Fang, Runhan Chen, Xu Yang, Jian Yu, Jiawei Xu, Ashwin Vinod, Wenqi Shi, Tianlong Chen, Heng Ji, ChengXiang Zhai, Ying Ding, Yuji Zhang
Main category: cs.CL
TL;DR: MINT benchmark reveals LLMs rush to answer in medical diagnosis, show self-correction capacity, and are lured by salient clinical info, with deferring questions improving accuracy.
Details
Motivation: To understand how LLMs behave in multi-turn medical diagnosis scenarios that mimic real clinical reasoning, rather than single-turn information provision.
Method: Created MINT benchmark with 1,035 medical cases, clinically labeled evidence shards, controlled turn granularity, and systematic evaluation of 11 LLMs to analyze behavioral patterns.
Result: Found three key patterns: intent to answer (over 55% of answers committed within the first two turns), self-correction (incorrect-to-correct revisions at up to 10.6x the rate of correct-to-incorrect flips), and strong lures (salient info triggers premature answering). Deferring questions improved accuracy by up to 62.6%.
Conclusion: Provides evaluation framework and recommendations for improving LLM reliability in multi-turn medical diagnosis, including deferring diagnostic questions and reserving salient evidence for later turns.
Abstract: Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation closer to real clinical reasoning remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer, models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction, incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures, clinically salient information such as laboratory results trigger premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
[59] GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering
Tianyi Zhang, Andreas Marfurt
Main category: cs.CL
TL;DR: GroundedKG-RAG: A knowledge graph-based RAG system for long-document QA that explicitly grounds nodes/edges in source text using SRL and AMR parsing for improved efficiency and factual accuracy.
Details
Motivation: Current RAG systems for long-document QA suffer from heavy LLM reliance (high resource/latency), repetitive hierarchical content, and hallucinations due to poor grounding in source text.
Method: Construct grounded knowledge graph from source documents using semantic role labeling (SRL) and abstract meaning representation (AMR) parsing. Define nodes as entities/actions and edges as temporal/semantic relations, all grounded in original sentences. Embed graph for retrieval, apply same transformation to queries, and retrieve relevant grounded sentences for QA.
Result: Performs on par with state-of-the-art proprietary long-context models at smaller cost, outperforms competitive baselines on NarrativeQA dataset. GroundedKG is interpretable and human-readable, facilitating auditing and error analysis.
Conclusion: GroundedKG-RAG improves efficiency and factual accuracy through explicit grounding in source text, offering interpretable knowledge graph construction for long-document QA while reducing resource consumption compared to LLM-heavy approaches.
Abstract: Retrieval-augmented generation (RAG) systems have been widely adopted in contemporary large language models (LLMs) due to their ability to improve generation quality while reducing the required input context length. In this work, we focus on RAG systems for long-document question answering. Current approaches suffer from a heavy reliance on LLM descriptions resulting in high resource consumption and latency, repetitive content across hierarchical levels, and hallucinations due to no or limited grounding in the source text. To improve both efficiency and factual accuracy through grounding, we propose GroundedKG-RAG, a RAG system in which the knowledge graph is explicitly extracted from and grounded in the source document. Specifically, we define nodes in GroundedKG as entities and actions, and edges as temporal or semantic relations, with each node and edge grounded in the original sentences. We construct GroundedKG from semantic role labeling (SRL) and abstract meaning representation (AMR) parses and then embed it for retrieval. During querying, we apply the same transformation to the query and retrieve the most relevant sentences from the grounded source text for question answering. We evaluate GroundedKG-RAG on examples from the NarrativeQA dataset and find that it performs on par with a state-of-the-art proprietary long-context model at smaller cost and outperforms a competitive baseline. Additionally, our GroundedKG is interpretable and readable by humans, facilitating auditing of results and error analysis.
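Editor's note: the key design decision here is that every graph element carries its grounding sentence, so retrieval returns verbatim source text rather than an LLM paraphrase. A toy sketch of that data model and lookup; the SRL/AMR parsing is stubbed out as given edges, and lexical overlap stands in for the paper's embedding similarity:

```python
from dataclasses import dataclass

@dataclass
class GroundedEdge:
    head: str        # entity or action node
    relation: str    # temporal or semantic relation
    tail: str
    sentence: str    # grounding: the original source sentence

def retrieve_grounded(edges, query_terms, k=2):
    """Rank edges by overlap with query terms and return their grounding
    sentences. Every hit traces back to a verbatim source sentence, which
    is what enables the auditing described in the abstract."""
    def score(e):
        text = f"{e.head} {e.relation} {e.tail}".lower()
        return sum(t.lower() in text for t in query_terms)
    ranked = sorted(edges, key=score, reverse=True)
    return [e.sentence for e in ranked[:k] if score(e) > 0]
```

Because the QA model only ever sees retrieved source sentences, a wrong answer can be traced to either a retrieval miss or a reading error, simplifying error analysis.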
[60] Compressible Softmax-Attended Language under Incompressible Attention
Wonsuk Lee
Main category: cs.CL
TL;DR: Attention heads in transformer language models show highly compressible logit energy fields, with 90% variance captured in only 2-11 singular components, while the learned interaction matrices require many more components, revealing that language concentrates interactions into few dimensions despite uniform capacity allocation.
Details
Motivation: To understand the intrinsic dimensionality and compressibility of attention mechanisms in transformer language models, examining the gap between the capacity allocated by the attention mechanism and the actual dimensionality used by language data.
Method: Analyzed spectral properties of attention heads across five transformer language models (124M-7B parameters, four architecture families), examining singular value decomposition of logit energy fields and learned interaction matrices to measure effective rank and compressibility.
Result: Logit energy fields reach 90% variance in only 2-11 singular components, while learned interaction matrices need 38-75 components for same threshold. Spectral gap is 5-25× in effective rank, showing attention allocates capacity uniformly but language concentrates interactions into few dimensions.
Conclusion: The compressibility of softmax-attended language is a property of the data rather than the analytical framework, revealing fundamental constraints on how language utilizes attention capacity in transformers.
Abstract: Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2–11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38–75 components for the same threshold out of $d_h \in \{64, 128\}$. The spectral gap is $5$–$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
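Editor's note: the paper's central measurement ("how many singular components capture 90% of variance") is a standard SVD computation. A minimal sketch, using squared singular values as the variance measure; the paper may define variance slightly differently:

```python
import numpy as np

def components_for_variance(M, threshold=0.90):
    """Smallest number of singular components whose squared singular
    values capture `threshold` of the total variance of M."""
    s = np.linalg.svd(M, compute_uv=False)        # singular values, descending
    frac = np.cumsum(s**2) / np.sum(s**2)         # cumulative variance fraction
    return int(np.searchsorted(frac, threshold) + 1)
```

Applied per head, this count lands at 2–11 for the logit energy field versus 38–75 for $W_Q^\mathrm{T} W_K$, which is the 5–25x effective-rank gap the abstract reports.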
[61] How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Gregory N. Frank
Main category: cs.CL
TL;DR: Paper identifies a sparse routing mechanism in aligned LLMs where a gate attention head detects content and triggers amplifier heads to boost refusal signals, validated across 9 models with controlled policy modulation.
Details
Motivation: To understand the internal mechanisms of alignment-trained language models, specifically how they implement content filtering and refusal behaviors through attention mechanisms, using political censorship and safety refusal as natural experiments.
Method: Analyzed 9 models from 6 labs using 120 prompt pairs, conducted interchange tests for necessity/sufficiency, bootstrap resampling for stability, scaling analysis, cipher encoding experiments, and signal modulation to control policy strength.
Result: Identified a consistent routing mechanism across models where gate heads trigger amplifier heads for refusal; routing distributes at scale, allows continuous policy control, and reveals structural separation between intent recognition and policy routing that collapses under cipher encoding.
Conclusion: Alignment creates a sparse routing circuit for refusal behaviors that is detectable and modifiable, with different robustness properties between pretraining (broad semantic understanding) and post-training (narrower policy binding).
Abstract: We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, all validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head’s routing contribution collapses (78% in Phi-4 at n=120) while the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.
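Editor's note: the ablation logic behind these circuit claims is simple to state on toy tensors. In the sketch below the "refusal direction" and the gate head's output are synthetic; the point is only the mechanics of zero-ablation, i.e. subtracting one head's additive contribution from the residual stream and measuring the change in a readout logit:

```python
import numpy as np

def refusal_logit(resid, w_refuse):
    """Project the residual stream onto a (hypothetical) refusal direction."""
    return float(resid @ w_refuse)

def ablate_head(resid, head_out):
    """Zero-ablation: subtract one attention head's additive contribution."""
    return resid - head_out

rng = np.random.default_rng(2)
d = 32
w_refuse = rng.standard_normal(d)
w_refuse /= np.linalg.norm(w_refuse)
gate_out = 3.0 * w_refuse                    # gate head writes the refusal direction
resid = rng.standard_normal(d) + gate_out    # residual stream including the head

delta = refusal_logit(resid, w_refuse) - refusal_logit(ablate_head(resid, gate_out), w_refuse)
```

Rescaling `gate_out` by a factor between 0 and 1 before re-adding it is the continuous "signal modulation" knob the paper uses to move between hard refusal, steering, and compliance.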
[62] Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida
Main category: cs.CL
TL;DR: This paper investigates how visual document understanding information is represented across different layers of LLMs within large vision language models, finding a gap between internal representations and generated responses, and showing that intermediate layers encode information more linearly than final layers.
Details
Motivation: To understand whether large vision language models actually capture the required information internally for visual document understanding tasks, rather than just generating correct responses. There's concern that current evaluation based on generated responses may not reflect true internal understanding.
Method: The authors use linear probing to investigate how information required for VDU tasks is represented across different layers of LLMs within LVLMs. They then explore fine-tuning strategies that target intermediate layers based on their findings.
Result: The study reveals: (1) a clear gap between internal representations and generated responses, and (2) information required to solve VDU tasks is often encoded more linearly from intermediate layers than from the final layer. Fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.
Conclusion: The paper concludes that intermediate layers in LVLMs contain more linearly accessible information for VDU tasks than final layers, and that targeted fine-tuning of these intermediate layers can improve both internal representation quality and generated response accuracy.
Abstract: Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.
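Editor's note: linear probing here means fitting a linear classifier on frozen per-layer hidden states and comparing decodability across layers. A self-contained least-squares sketch on synthetic data (the paper presumably uses real LVLM activations and held-out evaluation; training accuracy is used below only to keep the example tiny):

```python
import numpy as np

def probe_accuracy(H, y):
    """Fit a linear probe (least squares onto one-hot labels) and return
    training accuracy, a rough measure of how linearly decodable y is from H."""
    Y = np.eye(int(y.max()) + 1)[y]              # one-hot targets
    Hb = np.hstack([H, np.ones((len(H), 1))])    # append a bias column
    W, *_ = np.linalg.lstsq(Hb, Y, rcond=None)
    return float(((Hb @ W).argmax(axis=1) == y).mean())

# toy contrast: a layer that encodes the label linearly vs. one that doesn't
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
h_mid = np.column_stack([y + 0.1 * rng.standard_normal(200),
                         rng.standard_normal(200)])
h_last = rng.standard_normal((200, 2))           # label-independent features
```

The paper's finding is the analogue of `h_mid` vs. `h_last`: a high probe score at an intermediate layer alongside a low one at the final layer signals the representation-response gap.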
[63] Structured Causal Video Reasoning via Multi-Objective Alignment
Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke
Main category: cs.CL
TL;DR: Factum-4B introduces Structured Event Facts for video understanding, using a compact representation of events and causal relationships to improve reasoning over traditional Video-LLMs, trained via a four-stage pipeline including MORL optimization.
Details
Motivation: Existing Video-LLMs rely on unstructured video reasoning with verbose textual descriptions and weak temporal causality modeling, leading to inefficient processes and fragile causal inference. The paper aims to bridge this cognitive gap by introducing structured representations similar to human mental models.
Method: Proposes Structured Event Facts - a compact representation of salient events and causal relationships. Introduces CausalFact-60K dataset and a four-stage training pipeline: facts alignment, format warm-start, thinking warm-start, and RL-based post-training. Addresses competing objectives in RL stage using Multi-Objective Reinforcement Learning (MORL) to balance structural completeness, causal fidelity, and reasoning length.
Result: Develops Factum-4B model that yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference compared to existing Video-LLMs.
Conclusion: Structured Event Facts provide an effective prior for video understanding, enabling more concise, causally grounded reasoning with verifiable intermediate evidence. The MORL approach successfully balances competing objectives in the optimization process.
Abstract: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
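Editor's note: the MORL framing can be made concrete with two generic building blocks, neither specific to this paper: a linear scalarization that turns one weight vector into one reward signal, and a Pareto filter that keeps non-dominated candidates. Here the objectives are assumed to be oriented so higher is better (reasoning-length cost would enter negated):

```python
def scalarize(objectives, weights):
    """Linear scalarization: each weight vector yields one scalar reward,
    i.e. one point on the multi-objective trade-off surface."""
    return sum(o * w for o, w in zip(objectives, weights))

def pareto_front(candidates):
    """Keep candidates that no other candidate weakly dominates
    (at least as good on every objective, and not identical)."""
    def dominated(c):
        return any(all(o >= s for o, s in zip(other, c)) and other != c
                   for other in candidates)
    return [c for c in candidates if not dominated(c)]
```

Optimizing toward the Pareto front, as the paper does, means no single objective (structural completeness, causal fidelity, brevity) can be improved without giving up another.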
[64] DeonticBench: A Benchmark for Reasoning over Rules
Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme
Main category: cs.CL
TL;DR: DEONTICBENCH: A benchmark for deontic reasoning (obligations/permissions/prohibitions) across legal/policy domains with 6,232 tasks, supporting both free-form reasoning and symbolic Prolog translation.
Details
Motivation: Address the gap in benchmarks for long-context, high-stakes deontic reasoning (reasoning about obligations, permissions, prohibitions) in legal/policy settings, where current LLMs struggle with complex, context-specific rules.
Method: Created DEONTICBENCH with 6,232 tasks across U.S. federal taxes, airline baggage policies, immigration administration, and state housing law. Supports both free-form chain-of-thought reasoning and optional symbolic approach where models translate statutes/facts into executable Prolog programs with explicit traces.
Result: Best frontier LLMs achieve only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing hard subsets. Training improves Prolog generation quality but current RL methods still fail to solve tasks reliably. Reference Prolog programs released for all instances.
Conclusion: DEONTICBENCH provides a comprehensive benchmark for studying context-grounded rule reasoning in real-world domains, highlighting LLM limitations in deontic reasoning and enabling research in both symbolic and non-symbolic approaches.
Abstract: Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
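Editor's note: the solver-based workflow translates rules and case facts into an executable program and reads off a verdict. A tiny Python stand-in for that Prolog pipeline, with invented airline-style rules and the common convention that prohibitions override permissions (the benchmark's actual statutes are far richer):

```python
def evaluate(rules, facts, action):
    """Tiny deontic check: apply every rule whose conditions hold,
    letting prohibitions override permissions. Each rule is
    (condition_set, modality, action)."""
    verdict = "unregulated"
    for conditions, modality, act in rules:
        if act == action and conditions <= facts:   # all conditions satisfied
            if modality == "prohibited":
                return "prohibited"                  # prohibition wins outright
            verdict = modality
    return verdict

# toy rules, invented for illustration only
rules = [
    ({"carry_on"}, "permitted", "bring_bag"),
    ({"carry_on", "oversize"}, "prohibited", "bring_bag"),
]
```

The appeal of the symbolic route, as in the benchmark, is that the verdict comes with an explicit trace of which rules fired, rather than an opaque chain of thought.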
[65] Conversational Control with Ontologies for Large Language Models: A Lightweight Framework for Constrained Generation
Barbara Gendron, Gaël Guibon, Mathieu d’Aquin
Main category: cs.CL
TL;DR: An end-to-end method for modular and explainable control over LLM outputs using ontological definitions of conversational aspects, validated on English proficiency and polarity profile tasks.
Details
Motivation: LLM-based conversational agents have a black-box nature leading to predictability challenges and a lack of personalization, both of which can be addressed through controlled generation.
Method: Proposes ontological definitions of conversational aspects as constraints, then fine-tunes LLMs to generate content accordingly using a hybrid fine-tuning procedure on seven open-weight conversational LLMs.
Result: Method consistently outperforms pre-trained baselines even on smaller models, remains model-agnostic, lightweight, and interpretable, enabling reusable control strategies extendable to new domains.
Conclusion: Ontology-driven control enhances alignment with strategy instructions and demonstrates effectiveness in conversational systems through modular, explainable control over LLM outputs.
Abstract: Conversational agents based on Large Language Models (LLMs) have recently emerged as powerful tools for human-computer interaction. Nevertheless, their black-box nature implies challenges in predictability and a lack of personalization, both of which can be addressed by controlled generation. This work proposes an end-to-end method to obtain modular and explainable control over LLM outputs through ontological definitions of aspects related to the conversation. Key aspects are modeled and used as constraints; we then further fine-tune the LLM to generate content accordingly. To validate our approach, we explore two tasks that tackle two key conversational aspects: the English proficiency level and the polarity profile of the content. Using a hybrid fine-tuning procedure on seven state-of-the-art, open-weight conversational LLMs, we show that our method consistently outperforms pre-trained baselines, even on smaller models. Beyond quantitative gains, the framework remains model-agnostic, lightweight, and interpretable, enabling reusable control strategies that can be extended to new domains and interaction goals. This approach enhances alignment with strategy instructions and demonstrates the effectiveness of ontology-driven control in conversational systems.
[66] Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: Transformers show decreasing representational variability with increasing numerical magnitude (anti-scalar pattern), unlike biological systems which show scalar variability (constant coefficient of variation).
Details
Motivation: To test whether transformer language models exhibit scalar variability - a fundamental property of biological magnitude systems where representational noise scales proportionally with magnitude.
Method: Analyzed hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base). Measured representational variability along magnitude axis and in full-dimensional space, with sentence-identity correction.
Result: Found anti-scalar pattern: representational variability decreased with magnitude (scaling exponent α ≈ -0.19). This pattern was 3-5x stronger along magnitude axis than orthogonal dimensions. Corpus frequency strongly predicted per-magnitude variability (ρ = .84).
Conclusion: Distributional learning alone is insufficient to produce scalar variability. Transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
Abstract: Scalar variability – the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation – is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent α ≈ -0.19; 0/16 primary layers with α > 0, all three models). The negative sign was consistent in full-dimensional space (α ≈ -0.04) and after sentence-identity correction (α ≈ -0.007). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability (ρ = .84). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
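The scalar-variability test reduces to a slope check: if noise scales with magnitude (σ ∝ μ, constant CV), a log-log regression of dispersion on magnitude has slope near 1, while the paper reports a negative slope. A minimal sketch of that slope fit, run on synthetic data rather than the paper's hidden states:

```python
import math

def scaling_exponent(magnitudes, dispersions):
    """Least-squares slope of log(dispersion) vs. log(magnitude).
    Scalar variability (sigma proportional to mu, constant CV) gives a
    slope near 1; the paper reports a negative slope instead."""
    xs = [math.log(m) for m in magnitudes]
    ys = [math.log(d) for d in dispersions]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic check: dispersion proportional to magnitude (constant CV)
mags = list(range(1, 27))              # 26 numerical magnitudes
disp = [0.15 * m for m in mags]        # CV fixed at 0.15
print(round(scaling_exponent(mags, disp), 3))   # -> 1.0
```

With dispersions measured on real hidden states, a fitted slope around -0.19 is what the abstract calls the anti-scalar pattern.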
[67] CommonMorph: Participatory Morphological Documentation Platform
Aso Mahmudi, Sina Ahmadi, Kemal Kurniawan, Rico Sennrich, Eduard Hovy, Ekaterina Vylomova
Main category: cs.CL
TL;DR: CommonMorph is a platform for streamlined morphological data collection using expert definition, contributor elicitation, and community validation with active learning and annotation suggestions.
Details
Motivation: Collecting and annotating morphological data is challenging, requiring linguistic expertise and resources, especially for low-resource languages. There's a need to accelerate this process and preserve linguistic diversity.
Method: Three-tiered approach: 1) Expert linguistic definition, 2) Contributor elicitation, 3) Community validation. Incorporates active learning, annotation suggestions, and tools to import/adapt materials from related languages. Supports diverse morphological systems (fusional, agglutinative, root-and-pattern).
Result: Open-source platform with UniMorph-compatible outputs ensuring accessibility and interoperability with NLP tools. Platform is accessible at https://common-morph.com.
Conclusion: CommonMorph offers a replicable model for preserving linguistic diversity through collaborative technology, streamlining the development of morphological data collections.
Abstract: Collecting and annotating morphological data present significant challenges, requiring linguistic expertise, methodological rigour, and substantial resources. These barriers are particularly acute for low-resource languages and varieties. To accelerate this process, we introduce CommonMorph, a comprehensive platform that streamlines morphological data collection development through a three-tiered approach: expert linguistic definition, contributor elicitation, and community validation. The platform minimises manual work by incorporating active learning, annotation suggestions, and tools to import and adapt materials from related languages. It accommodates diverse morphological systems, including fusional, agglutinative, and root-and-pattern morphologies. Its open-source design and UniMorph-compatible outputs ensure accessibility and interoperability with NLP tools. Our platform is accessible at https://common-morph.com, offering a replicable model for preserving linguistic diversity through collaborative technology.
[68] Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Alhasan Mahmood, Samir Abdaljalil, Hasan Kurban
Main category: cs.CL
TL;DR: Language choice in AI agent evaluation benchmarks significantly affects model rankings, with different LLMs performing best in different languages, showing language should be treated as an explicit evaluation variable.
Details
Motivation: Current AI agent benchmarks treat evaluation language as a fixed English default, but this may not reflect real-world multilingual usage and could bias rankings.
Method: Localized Agent-as-a-Judge prompt stack to 5 diverse languages (English, Arabic, Turkish, Chinese, Hindi), evaluated 55 DevAI tasks across 3 developer-agent frameworks and 6 judge backbones (4950 runs total).
Result: Backbone and language interact significantly: GPT-4o leads in English (44.72%), Gemini leads in Arabic (51.72%) and Hindi (53.22%). No single backbone dominates across languages, and inter-backbone agreement is modest (κ ≤ 0.231).
Conclusion: Language should be treated as an explicit evaluation variable in agentic benchmarks, as language choice can invert backbone rankings and affect evaluation outcomes.
Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge’s language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, p < 0.001 vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss’ κ ≤ 0.231). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
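Fleiss' κ, the statistic behind the ≤ 0.231 agreement figure, is computed from per-item category counts: observed agreement minus chance agreement, normalized. A self-contained sketch on toy counts (six hypothetical judges giving binary satisfied/unsatisfied labels; not the paper's data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of per-category
    rating counts summing to the same number of raters n."""
    N = len(ratings)                 # number of items
    n = sum(ratings[0])              # raters per item
    # Observed per-item agreement P_i, averaged into P-bar
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    Pbar = sum(P) / N
    # Chance agreement from marginal category proportions
    k = len(ratings[0])
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    Pe = sum(pj * pj for pj in p)
    return (Pbar - Pe) / (1 - Pe)

# Toy data: 4 requirements judged by 6 "backbones" (satisfied, unsatisfied)
items = [[6, 0], [4, 2], [3, 3], [2, 4]]
print(round(fleiss_kappa(items), 3))   # -> 0.111
```

Values this low mean the judges agree only slightly more than chance, which is why the paper treats backbone choice as consequential.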
[69] Formal Constraints on Dependency Syntax
Carlos Gómez-Rodríguez, Lluís Alemany-Puig
Main category: cs.CL
TL;DR: Paper explores syntactic constraints for dependency trees, focusing on alternatives to projectivity for better linguistic modeling.
Details
Motivation: Dependency syntax trees often include implausible or infrequent structures in practice, motivating the search for constraints that better fit real linguistic phenomena while balancing between projectivity's limitations and unrestricted structures' excessive leniency.
Method: The paper appears to survey and analyze various syntactic constraints for dependency trees, particularly focusing on alternatives to projectivity that can better handle flexible-word-order languages and provide more accurate linguistic descriptions.
Result: The paper argues that projectivity is too restrictive for some linguistic phenomena, especially in flexible-word-order languages, and surveys the variety of alternative constraints proposed to seek a realistic middle ground.
Conclusion: There is a need for syntactic constraints that go beyond projectivity to better model real linguistic phenomena while maintaining computational efficiency and linguistic accuracy.
Abstract: Dependency syntax represents the structure of a sentence as a tree composed of dependencies, i.e., directed relations between lexical units. While in its more general form any such tree is allowed, in practice many are not plausible or are very infrequent in attested language. This has motivated a search for constraints characterizing subsets of trees that better fit real linguistic phenomena, providing a more accurate linguistic description, faster parsing or insights on language evolution and human processing. Projectivity is the most well-studied such constraint, but it has been shown to be too restrictive to represent some linguistic phenomena, especially in flexible-word-order languages. Thus, a variety of constraints have been proposed to seek a realistic middle ground between the limitations of projectivity and the excessive leniency of unrestricted dependency structures.
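Projectivity, the baseline constraint the survey starts from, has a simple operational test: a dependency tree is projective iff no two arcs cross when drawn above the sentence. A sketch, assuming a heads-array encoding where tokens are numbered 1..n, heads[k-1] is the head of token k, and 0 marks the root:

```python
def is_projective(heads):
    """Test projectivity of a dependency tree. Tokens are numbered 1..n;
    heads[k-1] is the head of token k (0 marks the root). The tree is
    projective iff no two dependency arcs cross."""
    arcs = [(min(h, d), max(h, d))
            for d, h in enumerate(heads, start=1) if h != 0]
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            if a < c < b < d or c < a < d < b:   # strictly interleaved arcs
                return False
    return True

print(is_projective([2, 0, 2, 3]))   # simple chain-like tree -> True
print(is_projective([3, 4, 0, 3]))   # arcs 1-3 and 2-4 cross -> False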
[70] PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning
Madhav S Baidya
Main category: cs.CL
TL;DR: PassiveQA: A framework for LLMs to decide between Answer, Ask, or Abstain when faced with incomplete queries, reducing hallucinations through supervised finetuning.
Details
Motivation: Real-world queries are often incomplete or ambiguous, but current LLMs and RAG systems assume fully specified queries, leading to overconfident or hallucinated responses when information is insufficient.
Method: Proposes PassiveQA, a three-action framework (Answer/Ask/Abstain) with supervised finetuning. Integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning.
Result: Experiments across multiple QA datasets show significant improvements in macro F1 and abstention recall while reducing hallucination rates, even under compute-constrained training.
Conclusion: Epistemic decision-making (knowing when to answer, ask, or abstain) must be learned during training rather than imposed at inference time for reliable performance with incomplete information.
Abstract: Large Language Models (LLMs) have achieved strong performance in question answering and retrieval-augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real-world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision-aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three-action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime. These results provide strong empirical evidence that epistemic decision-making must be learned during training rather than imposed at inference time.
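The three-action decision can be pictured as a policy over an explicit information state. Everything in the sketch below (the variable-set representation, the confidence thresholds) is an invented illustration of the idea, not PassiveQA's actual planner:

```python
def decide(required_vars, known_vars, confidence,
           tau_answer=0.7, tau_ask=0.3):
    """Toy three-action policy in the spirit of PassiveQA: Answer when the
    information state is complete and confidence is high, Ask when specific
    variables are missing and clarification seems worthwhile, Abstain
    otherwise. Thresholds are illustrative."""
    missing = required_vars - known_vars
    if not missing and confidence >= tau_answer:
        return "Answer"
    if missing and confidence >= tau_ask:
        return "Ask", sorted(missing)     # name the missing variables
    return "Abstain"

print(decide({"date", "city"}, {"date", "city"}, 0.9))   # -> Answer
print(decide({"date", "city"}, {"date"}, 0.5))           # -> ('Ask', ['city'])
print(decide({"date", "city"}, set(), 0.1))              # -> Abstain
```

The paper's point is that this decision boundary must be learned during finetuning; a hand-set inference-time rule like this one is the baseline it improves on.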
[71] Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation
Hanif Rahman
Main category: cs.CL
TL;DR: First comprehensive evaluation of multilingual ASR models on Pashto, revealing script-level failures in Whisper models and significant cross-domain degradation in fine-tuned models.
Details
Motivation: Pashto has 60-80 million speakers but lacks published benchmarks for multilingual ASR on public test sets, creating barriers to reproducible research and cumulative progress.
Method: Evaluated ten models (Whisper variants, MMS-1B, SeamlessM4T-v2-large, OmniASR-CTC-300M) on FLEURS Pashto test set and Common Voice 24 subset using zero-shot ASR, script-level failure analysis, and cross-domain evaluation of fine-tuned models.
Result: Zero-shot Whisper WER ranged 90-297% with script failure (≤0.8% Pashto-script output), while SeamlessM4T achieved best zero-shot result (39.7% WER). Fine-tuned models showed 14% WER degrading to 32.5-59% cross-domain, with Pashto-unique phonemes causing disproportionate errors.
Conclusion: Current multilingual ASR models fail on Pashto at script level, cross-domain degradation is severe, and five structural impediments to progress are identified with five ordered research priorities proposed.
Abstract: Pashto is spoken by approximately 60–80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set. This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. For zero-shot ASR, ten models (all seven Whisper sizes, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M) are evaluated on the FLEURS Pashto test set and a filtered Common Voice 24 subset; zero-shot Whisper WER ranges from 90% to 297%, with the medium model collapsing to 461% on Common Voice 24, consistent with decoder looping. SeamlessM4T achieves 39.7% WER on Common Voice 24 (the best zero-shot result reported to date, as of submission); MMS-1B achieves 43.8% on FLEURS. For script failure, a language-identification audit shows that no Whisper model produces Pashto-script output in more than 0.8% of utterances, while MMS-1B, SeamlessM4T, and OmniASR each exceed 93% Pashto-script fidelity; WER alone does not reveal this failure, since a model generating Arabic-script output on Pashto audio has not achieved ASR in any interpretable sense. For cross-domain evaluation, five fine-tuned Pashto ASR models are evaluated on both test sets: published WER figures of 14% degrade to 32.5–59% on out-of-distribution sets, while one augmented model achieves 35.1% on both sets with zero cross-domain degradation. Character-class error stratification confirms that Pashto-unique phonemes (the retroflex series and lateral fricatives) account for disproportionate error mass. All evaluations cover read speech only. Five structural impediments to cumulative progress are identified and five ordered research priorities are argued.
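Word error rate, the metric behind these figures, is word-level edit distance divided by reference length, which is why insertion-heavy looping output can push WER far past 100%. A minimal sketch on hypothetical sentences (not the benchmark data):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words. Insertion-heavy (looping) hypotheses can
    therefore exceed 100%."""
    r, h = reference.split(), hypothesis.split()
    dp = list(range(len(h) + 1))          # dp[j]: distance r[:i] vs h[:j]
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (r[i - 1] != h[j - 1]))    # substitution
            prev = cur
    return dp[-1] / len(r)

print(wer("the cat sat", "the cat sat"))   # -> 0.0
print(wer("cat", "the the the the"))       # looping-style output -> 4.0
```

The second call shows a 400% WER from a one-word reference, the same mechanism behind the 461% Whisper-medium collapse described above.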
[72] Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity
Jaeyoon Jung, Yejun Yoon, Kunwoo Park
Main category: cs.CL
TL;DR: AMuFC is a multimodal fact-checking framework that uses two collaborative agents to adaptively determine when visual evidence is necessary, challenging the assumption that multimodal evidence always improves fact-checking accuracy.
Details
Motivation: The paper challenges the prevailing assumption that incorporating visual evidence universally improves multimodal fact-checking performance, showing that indiscriminate use of multimodal evidence can actually reduce accuracy. The authors aim to develop a more adaptive approach to multimodal fact-checking.
Method: Proposes AMuFC framework with two collaborative agents: 1) Analyzer determines whether visual evidence is necessary for claim verification, and 2) Verifier predicts claim veracity conditioned on both retrieved evidence and the Analyzer’s assessment of visual evidence necessity.
Result: Experimental results on three datasets show that incorporating the Analyzer’s assessment of visual evidence necessity into the Verifier’s prediction yields substantial improvements in verification performance. The authors also release WebFC, a new dataset for more realistic fact-checking evaluation.
Conclusion: The paper demonstrates that adaptive use of visual evidence through collaborative agents improves multimodal fact-checking accuracy, challenging the assumption that multimodal evidence always helps. The framework provides a more nuanced approach to multimodal verification.
Abstract: Automated fact-checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer’s assessment. Experimental results on three datasets show that incorporating the Analyzer’s assessment of visual evidence necessity into the Verifier’s prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact-checking modules in a more realistic scenario, available at https://github.com/ssu-humane/AMuFC.
[73] IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation
Anjali Kantharuban, Aarohi Srivastava, Fahim Faisal, Orevaoghene Ahia, Antonios Anastasopoulos, David Chiang, Yulia Tsvetkov, Graham Neubig
Main category: cs.CL
TL;DR: IDIOLEX framework learns sentence representations that capture style and dialect (idiolect) decoupled from semantic content, using supervision from sentence provenance and linguistic features, with applications to Arabic/Spanish dialects and stylistic alignment of language models.
Details
Motivation: Existing sentence representations primarily encode what is said (semantic content) rather than how it is expressed (style/dialect), but stylistic variation is important for many applications including developing diverse and accessible LLMs.
Method: IDIOLEX framework combines supervision from sentence provenance (source information) with linguistic features of sentence content to learn continuous representations of style and dialect, evaluated on Arabic and Spanish dialects.
Result: Learned representations capture meaningful variation, transfer across domains for analysis and classification, and can be used as training objectives for stylistically aligning language models.
Conclusion: Jointly modeling individual and community-level variation provides useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, including diverse and accessible LLM development.
Abstract: Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence’s provenance with linguistic features of a sentence’s content, to learn a continuous representation of each sentence’s style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.
[74] BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement
Abdullah Al Shafi, Swapnil Kundu Argha, M. A. Moyeen, Abdul Muntakim, Shoumik Barman Polok
Main category: cs.CL
TL;DR: BiST is a curated Bangla-English corpus for sentence-level grammatical classification annotated for syntactic structure and tense, with 30,534 sentences and high inter-annotator agreement.
Details
Motivation: Addressing the critical bottleneck of high-quality bilingual resources for multilingual NLP in low-resource settings, particularly for Bangla, by creating a reliable corpus for grammatical classification.
Method: Compiled corpus from open-licensed encyclopedic sources and conversational text, with systematic preprocessing, automated language identification, and multi-stage annotation framework with three independent annotators using dimension-wise Fleiss Kappa agreement.
Result: Created 30,534 sentences (17,465 English, 13,069 Bangla) with high annotation reliability (κ=0.82 for structure, κ=0.88 for tense). Dual-encoder architectures outperformed multilingual encoders in baseline evaluations.
Conclusion: BiST establishes a unified resource for bilingual grammatical modeling, supporting tasks like controlled text generation, automated feedback generation, and cross-lingual representation learning for linguistically grounded multilingual research.
Abstract: High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss Kappa (κ) agreement, yielding reliable and reproducible labels with κ values of 0.82 and 0.88 for structural and temporal annotation, respectively. Statistical analyses demonstrate realistic structural and temporal distributions, while baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST provides explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.
[75] What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features
Dayeon Ki, Kevin Duh, Marine Carpuat
Main category: cs.CL
TL;DR: The paper analyzes multilingual reasoning in Large Reasoning Models, challenging the assumption that English-derived reasoning features help other languages, and proposes adaptive objectives for language-specific reasoning patterns.
Details
Motivation: Large performance gaps exist between English and other languages in Large Reasoning Models, but current work assumes these gaps can be closed by making reasoning in every language resemble English reasoning. The paper challenges this assumption by investigating what actually characterizes effective reasoning in multilingual settings.
Method: 1) Define a suite of measurable reasoning features spanning multilingual alignment, reasoning steps, and reasoning flow aspects of reasoning traces; 2) Use logistic regression to quantify how each feature associates with final answer accuracy; 3) Train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts; 4) Use features as test-time selection policies to examine if they can steer models toward stronger multilingual reasoning.
Result: Across two mathematical reasoning benchmarks, four LRMs, and 10 languages: most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some cases. This challenges English-centric reward designs.
Conclusion: The findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.
Abstract: Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.
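The per-feature association step amounts to a one-variable logistic regression of answer correctness on a trace feature, with the fitted coefficient's sign and size as the reported association. A toy sketch in pure Python; the "alignment" feature and its data-generating process are invented for illustration:

```python
import math
import random

def feature_accuracy_coef(xs, ys, lr=0.1, steps=2000):
    """Fit y ~ sigmoid(w*x + b) by batch gradient descent and return w,
    whose sign says whether the trace feature is positively or negatively
    associated with answer accuracy (a stand-in for the paper's
    per-feature logistic regressions)."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w

# Synthetic traces: higher "alignment" feature -> more often correct
random.seed(0)
xs = [random.random() for _ in range(400)]
ys = [1 if random.random() < 0.2 + 0.6 * x else 0 for x in xs]
print(feature_accuracy_coef(xs, ys) > 0)   # -> True
```

The paper's finding that associations vary by language corresponds to fitting this per language and observing the coefficient shrink or flip sign.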
[76] Individual and Combined Effects of English as a Second Language and Typos on LLM Performance
Serena Liu, Yutong Yang, Prisha Sheth, Weixuan Dong, Mingjiao Diao, Xinru Zhu, Nikhil Banga, Oscar Melendez, Arnav Sharma, Minda Zhao, Marina Lin, Mengyu Wang
Main category: cs.CL
TL;DR: LLMs perform worse on non-native English inputs with typos than on either factor alone, especially on closed-ended tasks, showing real-world performance is overestimated by clean English evaluations.
Details
Motivation: LLMs are trained mostly on English data but used globally by non-native speakers who often make typos. Prior work studied ESL variation and typos separately, but they co-occur in real use, so their combined effect needs investigation.
Method: Used Trans-EnV framework to transform standard English into eight ESL variants, and applied MulTypo to inject typos at low, moderate, and severe levels. Tested performance on both closed-ended and open-ended tasks.
Result: Combining ESL variation and typos causes larger performance drops than either factor alone, though not simply additive. Effect is clearest on closed-ended tasks with consistent degradation across ESL variants and typo levels, while open-ended tasks show more mixed results.
Conclusion: Evaluations on clean standard English overestimate real-world model performance, and studying ESL variation and typos separately doesn’t fully capture realistic model behavior, especially for non-native English speakers.
Abstract: Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non-native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co-occur in real-world use. In this study, we use the Trans-EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed-ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open-ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real-world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.
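A simplified picture of the typo-injection side of the setup: corrupt a tunable fraction of words with a random character edit, with the fraction standing in for the low/moderate/severe levels. This is an illustrative stand-in, not MulTypo's actual procedure:

```python
import random

def inject_typos(text, rate, seed=0):
    """Corrupt roughly a `rate` fraction of words with one random
    character edit (swap, drop, or duplicate). A toy stand-in for
    severity-leveled typo injection."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 2 and rng.random() < rate:
            i = rng.randrange(1, len(w) - 1)
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap":          # transpose adjacent characters
                w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
            elif op == "drop":        # delete one character
                w = w[:i] + w[i + 1:]
            else:                     # duplicate one character
                w = w[:i] + w[i] + w[i:]
        out.append(w)
    return " ".join(out)

sentence = "please summarize the quarterly report before friday"
for level, rate in [("low", 0.1), ("moderate", 0.3), ("severe", 0.6)]:
    print(level, "->", inject_typos(sentence, rate, seed=42))
```

Stacking such a corruption on top of an ESL transformation of the same input is the combined condition the study evaluates.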
[77] Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs
Yuan Chang, Jiaming Qu, Zhu Li
Main category: cs.CL
TL;DR: LLMs show cultural bias in creative writing tasks, exhibiting stereotyped metaphors and Western defaultism despite multilingual capabilities.
Details
Motivation: To investigate whether LLMs truly conduct culture-aware reasoning or merely act as cultural translators leveraging dominant Western frameworks with localized expressions.
Method: Computational audit using metaphor generation task spanning five cultural settings and several abstract concepts as a case study.
Result: Models exhibit stereotyped metaphor usage for certain cultural settings and show Western defaultism, indicating that prompting with cultural identity doesn’t guarantee culturally grounded reasoning.
Conclusion: Multilingual capability doesn’t equal cultural reasoning; LLMs need better cultural grounding beyond surface-level translation.
Abstract: Large language models (LLMs) are often described as multilingual because they can understand and respond in many languages. However, speaking a language is not the same as reasoning within a culture. This distinction motivates a critical question: do LLMs truly conduct culture-aware reasoning? This paper presents a preliminary computational audit of cultural inclusivity in a creative writing task. We empirically examine whether LLMs act as culturally diverse creative partners or merely as cultural translators that leverage a dominant conceptual framework with localized expressions. Using a metaphor generation task spanning five cultural settings and several abstract concepts as a case study, we find that the model exhibits stereotyped metaphor usage for certain settings, as well as Western defaultism. These findings suggest that merely prompting an LLM with a cultural identity does not guarantee culturally grounded reasoning.
[78] Lighting Up or Dimming Down? Exploring Dark Patterns of LLMs in Co-Creativity
Zhu Li, Jiaming Qu, Yuan Chang
Main category: cs.CL
TL;DR: LLMs as writing partners exhibit five “dark patterns” that suppress human creativity: Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring, with Sycophancy being nearly ubiquitous (91.7%) in sensitive topics.
Details
Motivation: As LLMs increasingly serve as collaborative writing partners, there's a need to understand their impact on human agency and creative process, particularly investigating subtle behaviors that may suppress or distort creativity.
Method: Controlled sessions where LLMs are prompted as writing assistants across diverse literary forms and themes, analyzing the prevalence of five identified dark patterns in generated responses.
Result: Sycophancy is nearly ubiquitous (91.7% of cases), particularly in sensitive topics, while Anchoring appears dependent on literary forms, surfacing most frequently in folktales.
Conclusion: Dark patterns, often byproducts of safety alignment, may inadvertently narrow creative exploration, requiring design considerations for AI systems that effectively support creative writing.
Abstract: Large language models (LLMs) are increasingly acting as collaborative writing partners, raising questions about their impact on human agency. In this exploratory work, we investigate five “dark patterns” in human-AI co-creativity – subtle model behaviors that can suppress or distort the creative process: Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring. Through a series of controlled sessions where LLMs are prompted as writing assistants across diverse literary forms and themes, we analyze the prevalence of these behaviors in generated responses. Our preliminary results suggest that Sycophancy is nearly ubiquitous (91.7% of cases), particularly in sensitive topics, while Anchoring appears to be dependent on literary forms, surfacing most frequently in folktales. This study indicates that these dark patterns, often byproducts of safety alignment, may inadvertently narrow creative exploration and proposes design considerations for AI systems that effectively support creative writing.
[79] Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations
Kalyan Cherukuri, Lav R. Varshney
Main category: cs.CL
TL;DR: Paper presents geometric dynamical systems framework explaining LLM hallucinations as arising from task-dependent basin structure in latent space, with separability varying by task type and geometry-aware steering reducing hallucinations without retraining.
Details
Motivation: To understand why large language models hallucinate (produce factually incorrect but fluent outputs) and to develop a theoretical framework to explain and mitigate this phenomenon.
Method: Uses a geometric dynamical systems framework to analyze autoregressive hidden-state trajectories across multiple open-source models and benchmarks. Formalizes the behavior with task-complexity and multi-basin theorems, characterizes basin emergence in L-layer transformers, and develops geometry-aware steering techniques.
Result: Finds that separability is strongly task-dependent: factoid settings show clearer basin separation, while summarization and misconception-heavy settings are less stable with overlapping basins. Geometry-aware steering can reduce hallucination probability without retraining.
Conclusion: Hallucinations arise from task-dependent basin structure in latent space, and understanding this geometric structure enables targeted interventions to reduce hallucinations without model retraining.
Abstract: Large language models (LLMs) hallucinate: they produce fluent outputs that are factually incorrect. We present a geometric dynamical systems framework in which hallucinations arise from task-dependent basin structure in latent space. Using autoregressive hidden-state trajectories across multiple open-source models and benchmarks, we find that separability is strongly task-dependent rather than universal: factoid settings can show clearer basin separation, whereas summarization and misconception-heavy settings are typically less stable and often overlap. We formalize this behavior with task-complexity and multi-basin theorems, characterize basin emergence in L-layer transformers, and show that geometry-aware steering can reduce hallucination probability without retraining.
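The paper's central measurement, basin separability of hidden-state trajectories, can be illustrated with a toy score. Everything below (the centroid-distance-to-spread ratio, the 2-D vectors) is an invented stand-in for the authors' actual geometric analysis, shown only to make the "clear vs. overlapping basins" contrast concrete:

```python
# Hypothetical sketch: given final hidden states of factual vs. hallucinated
# generations, measure how cleanly the two "basins" separate. The score is
# illustrative, not the paper's formal separability criterion.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def basin_separability(factual, hallucinated):
    """Ratio of between-basin centroid distance to mean within-basin spread.
    Higher values suggest clearer basin separation (as in factoid tasks)."""
    cf, ch = centroid(factual), centroid(hallucinated)
    spread = (sum(dist(v, cf) for v in factual) +
              sum(dist(v, ch) for v in hallucinated)) / (len(factual) + len(hallucinated))
    return dist(cf, ch) / (spread + 1e-8)

# Two well-separated clusters score high; overlapping ones score low,
# mirroring the factoid vs. summarization contrast reported in the paper.
tight = basin_separability([[0, 0], [0.1, 0]], [[5, 5], [5, 5.1]])
loose = basin_separability([[0, 0], [2, 2]], [[1, 1], [3, 3]])
```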
[80] HUKUKBERT: Domain-Specific Language Model for Turkish Law
Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug
Main category: cs.CL
TL;DR: HukukBERT: A comprehensive Turkish legal language model trained on 18GB legal corpus using hybrid Domain-Adaptive Pre-Training, achieving SOTA on legal cloze tests and document segmentation tasks.
Details
Motivation: Despite advances in LegalTech, Turkish legal NLP lacks domain-specific models. While English has LEGAL-BERT, the Turkish legal domain lacks a high-volume domain-specific counterpart due to data scarcity.
Method: Developed HukukBERT using hybrid Domain-Adaptive Pre-Training (DAPT) with Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking on an 18 GB cleaned Turkish legal corpus with a 48K WordPiece tokenizer.
Result: Achieved 84.40% Top-1 accuracy on Legal Cloze Test benchmark (masked legal term prediction), substantially outperforming existing models. Also achieved 92.8% document pass rate on structural segmentation of Turkish court decisions.
Conclusion: HukukBERT establishes new SOTA for Turkish legal NLP tasks and is released to support future research in Turkish legal NLP including named entity recognition, judgment prediction, and document classification.
Abstract: Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet existing studies specific to Turkish law have still been limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a domain-specific high-volume counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology integrating Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark – a masked legal term prediction task designed for Turkish court decisions – HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated HukukBERT in the downstream task of structural segmentation of official Turkish court decisions, where it achieves a 92.8% document pass rate, establishing a new state-of-the-art. We release HukukBERT to support future research in Turkish legal NLP tasks, including named entity recognition, judgment prediction, and legal document classification.
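Of the masking strategies listed, Whole-Word Masking is the easiest to sketch: WordPiece continuation pieces (prefixed "##") are always masked together with the first piece of their word. The 15% rate and `[MASK]` token below are standard MLM conventions, not HukukBERT's exact recipe:

```python
import random

# Illustrative Whole-Word Masking over WordPiece tokens: a word's pieces are
# masked or kept as a unit, so the model must predict whole legal terms.

def whole_word_mask(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    # Group token indices into words: a new word starts at any non-"##" piece.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)
    n_to_mask = max(1, round(mask_rate * len(words)))
    masked = {i for w in rng.sample(words, n_to_mask) for i in w}
    return ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]

# "başvurusu" is split into three pieces; they are masked together or not at all.
out = whole_word_mask(["temyiz", "baş", "##vuru", "##su", "reddedildi", "."])
```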
[81] How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng, Yang Gao
Main category: cs.CL
TL;DR: LLMs show strong reasoning but struggle with end-to-end mathematical modeling workflows, revealing a comprehension-execution gap that persists despite model scaling.
Details
Motivation: To evaluate LLMs' real-world problem-solving capabilities beyond standard benchmarks, using mathematical modeling competitions as a stringent testbed for end-to-end workflows.
Method: Developed a problem-oriented, stage-wise evaluation framework with expert-verified criteria, validated on China Postgraduate Mathematical Contest in Modeling problems by comparing automatic scores with human expert judgments.
Result: LLMs perform well in early stages (problem identification/formulation) but show persistent deficiencies in execution stages (model solving, code implementation, result analysis), with gaps persisting even with model scaling. Failures traced to insufficient specification, missing verification, and lack of validation.
Conclusion: Bridging the comprehension-execution gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
Abstract: Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework’s reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
[82] LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
Cheng Xu, Changhong Jin, Yingjie Niu, Nan Yan, Yuke Mei, Shuhao Guan, Liming Chen, M-Tahar Kechadi
Main category: cs.CL
TL;DR: LiveFact is a dynamic benchmark for evaluating LLMs on fake news detection with temporal reasoning, addressing limitations of static benchmarks and benchmark data contamination.
Details
Motivation: Current fake news detection benchmarks are static and thus vulnerable to data contamination, and they fail to assess reasoning under temporal uncertainty and evolving information.
Method: Introduces LiveFact, a continuously updated benchmark with dynamic temporal evidence sets, dual-mode evaluation (Classification and Inference modes), and explicit benchmark data contamination (BDC) monitoring.
Result: Tests with 22 LLMs show that open-source Mixture-of-Experts models match or outperform proprietary systems, and reveal a “reasoning gap”: capable models recognize unverifiable claims in early data slices.
Conclusion: LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification, addressing limitations of traditional static benchmarks.
Abstract: The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world “fog of war” in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant “reasoning gap.” Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.
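The temporal-slicing idea behind BDC monitoring can be sketched in a few lines: score a model separately on claims dated before and after its training cutoff, and treat a large pre/post accuracy gap as a contamination signal. The field names, dates, and the 0.2 gap heuristic below are illustrative assumptions, not LiveFact's actual protocol:

```python
from datetime import date

def slice_accuracy(results, cutoff):
    """results: list of (publish_date, correct: bool) per evaluated claim.
    Returns (pre-cutoff accuracy, post-cutoff accuracy), None for empty slices."""
    pre = [ok for d, ok in results if d <= cutoff]
    post = [ok for d, ok in results if d > cutoff]
    acc = lambda xs: sum(xs) / len(xs) if xs else None
    return acc(pre), acc(post)

cutoff = date(2024, 6, 1)  # hypothetical model training cutoff
results = [
    (date(2024, 1, 10), True), (date(2024, 3, 2), True),    # pre-cutoff claims
    (date(2024, 9, 5), True),  (date(2024, 11, 20), False), # post-cutoff claims
]
pre_acc, post_acc = slice_accuracy(results, cutoff)
# A big pre/post gap hints at memorization rather than evidence-based reasoning.
contamination_suspect = (pre_acc is not None and post_acc is not None
                         and pre_acc - post_acc > 0.2)
```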
[83] Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not
Sercan Karakaş
Main category: cs.CL
TL;DR: LLMs show weak or reversed plausibility effects in Turkish relative-clause attachment ambiguities compared to humans, suggesting poor integration of world knowledge with syntactic structure.
Details
Motivation: To test whether large language models integrate world knowledge with syntactic structure in a human-like way during ambiguity resolution, specifically in Turkish relative-clause attachment ambiguities.
Method: Constructed Turkish prenominal relative-clause attachment ambiguities with graded event plausibility favoring either high or low attachment. Validated the contrasts with norming ratings, tested humans in speeded forced-choice comprehension, and evaluated Turkish/multilingual LLMs in a preference-based setup comparing matched HA/LA continuations via mean per-token log-probability.
Result: Humans showed large, correctly directed plausibility effects, while LLMs across models showed weak, unstable, or reversed plausibility-driven shifts in attachment preferences.
Conclusion: LLMs do not integrate plausibility information with syntactic structure as reliably as humans do, and Turkish relative-clause attachment serves as a useful cross-linguistic diagnostic beyond broad benchmarks.
Abstract: Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.
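The preference measure is simple enough to sketch directly: average each continuation's token log-probabilities so that continuations of different lengths compare fairly, then take the difference. The numbers below are invented stand-ins for a real LM's token scores:

```python
# Sketch of the paper's preference measure: compare matched high-attachment
# (HA) and low-attachment (LA) continuations by MEAN per-token log-probability,
# so length differences don't bias the choice. Log-prob values are hypothetical.

def mean_logprob(token_logprobs):
    return sum(token_logprobs) / len(token_logprobs)

def attachment_preference(ha_logprobs, la_logprobs):
    """Positive -> model prefers the HA continuation; negative -> LA."""
    return mean_logprob(ha_logprobs) - mean_logprob(la_logprobs)

# HA continuation tokenized to 4 tokens; LA continuation to 6 tokens.
pref = attachment_preference(
    [-1.2, -0.8, -2.0, -1.0],              # mean = -1.25
    [-1.5, -1.1, -2.2, -1.4, -1.3, -1.5],  # mean = -1.5
)
```

Summing instead of averaging would systematically penalize the longer continuation, which is why the per-token mean matters for matched-pair designs like this one.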
[84] MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation
Zhixiang Lu, Chong Zhang, Chenyu Xue, Angelos Stefanidis, Chong Li, Jionglong Su, Zhengyong Jiang
Main category: cs.CL
TL;DR: MERIT is a Chinese-centric multilingual translation framework for low-resource Southeast Asian languages that combines language-specific token prefixing, supervised fine-tuning, and group relative policy optimization guided by semantic alignment rewards.
Details
Motivation: Neural machine translation from Chinese to low-resource Southeast Asian languages suffers from the extreme scarcity of clean parallel corpora and pervasive noise in existing mined data, creating a large performance gap with high-resource language pairs and leaving millions of speakers with low-quality translation systems.
Method: The MERIT framework transforms the English-centric ALT benchmark into a Chinese-centric evaluation for five Southeast Asian low-resource languages. It combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by a semantic alignment reward (SAR).
Result: For translation from low-resource languages into Chinese, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.
Conclusion: The MERIT framework effectively addresses the challenges of Chinese to low-resource Southeast Asian language translation through specialized techniques that prioritize data quality and semantic alignment over simple model scaling.
Abstract: Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by the semantic alignment reward (SAR). Our results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.
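Language-specific token prefixing is the one component simple enough to sketch: prepend a target-language tag so a single model can route among the target languages. The tag strings and formatting below are made up for illustration; MERIT's actual tokens are not given in the abstract:

```python
# Toy sketch of language-specific token prefixing (LTP). Tag vocabulary is
# hypothetical; real systems add such tags to the tokenizer's special tokens.
LANG_TAGS = {"lao": "<2lo>", "burmese": "<2my>", "tagalog": "<2tl>"}

def ltp_format(src_zh, target_lang):
    """Prefix a Chinese source sentence with the target-language tag."""
    return f"{LANG_TAGS[target_lang]} {src_zh}"

prompt = ltp_format("今天天气很好。", "lao")
```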
[85] Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang, Xiangyu Zhao, Jiahe Liu, Zhongxing Xu, Vincent Lee, Zongyuan Ge
Main category: cs.CL
TL;DR: PCSA is a red-teaming framework that simulates persona-driven client dialogues to test LLM vulnerabilities in mental healthcare, revealing risks like unauthorized medical advice and reinforcement of harmful beliefs.
Details
Motivation: Current red-teaming frameworks overlook psychological safety risks in therapeutic interactions, particularly the distinction between therapeutic empathy and maladaptive validation that could reinforce harmful beliefs or behaviors.
Method: Introduces the Personality-based Client Simulation Attack (PCSA), which simulates clients in psychological counseling through coherent, persona-driven dialogues to expose vulnerabilities in the psychological safety alignment of LLMs.
Result: PCSA substantially outperforms four competitive baselines on seven general and mental health-specialized LLMs, generates more natural and realistic dialogues, and reveals vulnerabilities including providing unauthorized medical advice, reinforcing delusions, and encouraging risky actions.
Conclusion: Current LLMs remain vulnerable to domain-specific adversarial tactics in mental healthcare, highlighting the need for improved psychological safety alignment beyond generic harm reduction approaches.
Abstract: The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.
[86] Synthetic Sandbox for Training Machine Learning Engineering Agents
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan
Main category: cs.CL
TL;DR: SandMLE: A multi-agent framework that creates synthetic ML engineering environments with micro-scale datasets to enable efficient on-policy reinforcement learning for LLM agents in ML engineering tasks.
Details
Motivation: Traditional RL for ML engineering tasks is prohibitively expensive because verification requires running full ML pipelines on large datasets. Existing approaches fall back on SFT or offline rewards, sacrificing the exploration benefits of on-policy RL. Sandbox data size is the primary bottleneck.
Method: SandMLE generates diverse, verifiable synthetic MLE environments from seed tasks, preserving real-world complexity while constraining datasets to micro-scale (50-200 samples per task). This enables large-scale, on-policy, trajectory-wise RL.
Result: Reduces execution time by over 13x, enabling on-policy RL for MLE. On MLE-bench-lite, shows significant gains over SFT baselines across Qwen3 models (20.3%-66.9% relative medal rate improvements). Generalizes across unseen scaffolds with up to 32.4% better HumanRank score on MLE-Dojo.
Conclusion: SandMLE successfully addresses the computational bottleneck in RL for ML engineering tasks by using synthetic micro-scale environments, enabling efficient on-policy training while maintaining generalization to real-world problems.
Abstract: As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines – data preprocessing, model training, and metric evaluation – on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.
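The micro-scale constraint that makes verification cheap can be illustrated with a stratified subsampler: shrink a task's dataset to the 50-200 range so a full train/eval pipeline runs in seconds per rollout. Stratification by label is our own assumption for the sketch (the abstract does not say how samples are chosen):

```python
import random
from collections import defaultdict

# Illustrative micro-scale sampler: cap each synthetic MLE task's dataset at
# ~k examples while keeping the label distribution intact (assumed detail).

def micro_sample(examples, labels, k=100, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    per_class = max(1, k // len(by_label))
    sampled = []
    for y, xs in sorted(by_label.items()):
        take = rng.sample(xs, min(per_class, len(xs)))
        sampled.extend((x, y) for x in take)
    return sampled

# 1000 examples, two balanced classes -> 100-sample micro dataset, still balanced.
data = list(range(1000))
labels = [i % 2 for i in data]
micro = micro_sample(data, labels, k=100)
```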
[87] Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation
Hengrui Gu, Xiaotian Han, Yujing Bian, Kaixiong Zhou
Main category: cs.CL
TL;DR: AsymGRPO is a reinforcement learning framework that addresses restricted exploration in LLMs by decomposing policy entropy into informative vs spurious components and independently modulating positive/negative rollouts.
Details
Motivation: RLVR improves LLM reasoning but suffers from restricted exploration, where policies converge to a narrow set of solutions. Entropy regularization is unreliable for LLMs due to hyperparameter sensitivity and marginal gains.
Method: Proposes the AsymGRPO framework, which 1) decomposes policy entropy into informative entropy (preserving diverse solutions) and spurious entropy (eroding reasoning patterns), 2) uses group-relative advantage estimation to sustain informative entropy on positive rollouts while suppressing spurious entropy on negative ones, and 3) explicitly decouples the modulation of positive and negative rollouts for independent control.
Result: Extensive experiments show AsymGRPO achieves superior performance compared to strong baselines and exhibits potential to synergize with existing entropy regularization methods.
Conclusion: Effective exploration requires entropy refinement rather than blind maximization. AsymGRPO’s explicit decoupling of positive/negative rollout modulation enables better preservation of informative entropy and suppression of spurious noise.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed restricted exploration, where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach used to sustain exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into informative entropy, which preserves diverse solution paths, and spurious entropy, which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires entropy refinement, a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts. This allows for independent control over the preservation of informative entropy and the suppression of spurious noise. Extensive experiments demonstrate that AsymGRPO achieves superior performance compared to strong baselines and exhibits the potential to synergize with existing entropy regularization methods.
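The decoupling idea can be sketched as an entropy term whose coefficient depends on the sign of each rollout's group-relative advantage: encourage entropy where the group succeeded, penalize it where it failed. The coefficients and the additive loss form below are invented for illustration; the paper's exact objective is not given in the abstract:

```python
# Hedged sketch of asymmetric entropy modulation: one coefficient for
# positive-advantage rollouts (preserve diverse solutions), another for
# negative-advantage rollouts (suppress spurious entropy).

def asym_entropy_bonus(advantages, entropies, beta_pos=0.01, beta_neg=-0.02):
    """Mean per-rollout entropy term; advantages and entropies are paired.
    Zero-advantage rollouts fall into the negative branch here (a choice we
    make for the sketch)."""
    total = 0.0
    for adv, ent in zip(advantages, entropies):
        beta = beta_pos if adv > 0 else beta_neg
        total += beta * ent
    return total / len(advantages)

# Group of 4 rollouts: two succeed (positive advantage), two fail.
bonus = asym_entropy_bonus(
    advantages=[1.0, 0.5, -0.5, -1.0],
    entropies=[2.0, 1.5, 2.5, 3.0],
)
```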
[88] TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen
Main category: cs.CL
TL;DR: TriAttention improves KV cache compression for LLMs by leveraging pre-RoPE Q/K concentration patterns to better estimate key importance, achieving high accuracy with significant memory/throughput gains.
Details
Motivation: The KV cache becomes a memory bottleneck in LLMs during extended reasoning. Existing compression methods that rely on post-RoPE attention scores suffer from poor key selection because queries rotate with position, leading to unstable reasoning.
Method: Proposes TriAttention, which operates in the pre-RoPE space, where Q/K vectors concentrate around fixed centers. A trigonometric series derived from these centers scores keys by the model's position-distance preferences, combined with Q/K norms for importance estimation.
Result: On AIME25 with 32K-token generation, matches Full Attention accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction. Enables OpenClaw deployment on single consumer GPU where Full Attention would cause out-of-memory.
Conclusion: TriAttention effectively addresses KV cache bottlenecks by leveraging stable pre-RoPE Q/K concentration patterns, enabling efficient long-context reasoning while maintaining accuracy.
Abstract: Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions – Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
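A very loose sketch of the scoring idea: rank cached keys by a position-dependent trigonometric term (derived, in the paper, from the pre-RoPE Q/K centers) combined with the key's norm, then keep the top-k. The frequencies, weights, and combination rule below are invented for illustration only and should not be read as TriAttention's actual formula:

```python
import math

def key_importance(key_norms, query_pos, freqs=(1.0, 0.5), weights=(1.0, 0.5),
                   alpha=0.5):
    """Score each cached key i by a distance-to-query trigonometric series
    plus a norm term (all constants hypothetical)."""
    scores = []
    for i, norm in enumerate(key_norms):
        d = query_pos - i  # distance from the current query to key position i
        trig = sum(w * math.cos(f * d) for f, w in zip(freqs, weights))
        scores.append(trig + alpha * norm)
    return scores

def top_k_keys(key_norms, query_pos, k):
    """Indices of the k keys to keep in the compressed KV cache."""
    scores = key_importance(key_norms, query_pos)
    return sorted(range(len(key_norms)), key=lambda i: -scores[i])[:k]

# Nearby and high-norm keys win; distant low-norm keys are evicted.
keep = top_k_keys([1.0, 0.2, 0.9, 0.1, 1.2], query_pos=5, k=3)
```

The point of the sketch is only that the score needs no attention pass over recent queries, which is what makes pre-RoPE selection cheap and position-stable.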
[89] Early Stopping for Large Reasoning Models via Confidence Dynamics
Parsa Hosseini, Sumit Nawathe, Mahdi Salmani, Meisam Razaviyayn, Soheil Feizi
Main category: cs.CL
TL;DR: CoDE-Stop: Early stopping method for reasoning models that uses confidence dynamics of intermediate answers to decide when to terminate reasoning, reducing compute costs by 25-50% without additional training.
Details
Motivation: Large reasoning models use long chain-of-thought generation, which incurs substantial computational cost and can degrade performance due to overthinking. The key challenge is determining when the model should stop reasoning and produce the final answer.
Method: Proposes CoDE-Stop (Confidence Dynamics Early Stop), which leverages the dynamics of intermediate-answer confidence to decide when to terminate reasoning. It requires no additional training and integrates easily into existing models, building on the observation that correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts produce long, unproductive traces with less reliable confidence dynamics.
Result: Evaluated on diverse reasoning and science benchmarks across multiple models. Achieves more favorable accuracy-compute tradeoff compared to prior early stopping methods, reducing total token usage by 25-50% compared to standard full-length reasoning.
Conclusion: CoDE-Stop provides an effective early stopping method for reasoning models that reduces computational costs while maintaining accuracy. The analysis of confidence dynamics during reasoning offers insights into how confidence changes in both correct and incorrect trajectories.
Abstract: Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance due to overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that leverages the dynamics of intermediate answer confidence to decide when to terminate reasoning, requiring no additional training and easily integrating into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. Compared to prior early stopping methods, it achieves a more favorable accuracy-compute tradeoff and reduces total token usage by 25-50% compared to standard full-length reasoning. In addition, we provide analyses of confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.
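The observation that correct trajectories reach high confidence early suggests a minimal stopping rule: poll the intermediate-answer confidence at checkpoints and stop once it stays above a threshold for a few consecutive checks. The threshold, patience, and confidence traces below are illustrative, not CoDE-Stop's exact rule:

```python
def early_stop_index(confidences, threshold=0.9, patience=2):
    """Return the checkpoint index at which to stop generating, or None to
    let the model reason to full length (hypothetical hyperparameters)."""
    streak = 0
    for i, c in enumerate(confidences):
        streak = streak + 1 if c >= threshold else 0
        if streak >= patience:
            return i
    return None

# A "correct-looking" trace: confidence rises early and stays high -> stop early.
stop_at = early_stop_index([0.4, 0.7, 0.92, 0.95, 0.96, 0.97])
# An unstable trace typical of incorrect rollouts: never stabilizes -> no stop.
no_stop = early_stop_index([0.4, 0.91, 0.5, 0.88, 0.6, 0.93])
```

Requiring a streak rather than a single high reading is what keeps the unstable, oscillating traces of incorrect rollouts from triggering a premature stop.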
[90] Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection
Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao
Main category: cs.CL
TL;DR: RACE is a fine-grained LLM-generated text detection method that uses rhetorical structure analysis to distinguish between four types of text: pure human, pure LLM, LLM-polished human, and humanized LLM text.
Details
Motivation: Existing binary/ternary classification for synthetic text detection is insufficient for nuanced regulation, as different types of LLM-human collaborative text (LLM-polished human text vs. humanized LLM text) have different policy consequences.
Method: RACE uses Rhetorical Structure Theory to construct a logic graph for the creator’s foundation and extracts Elementary Discourse Unit-level features for the editor’s style, enabling fine-grained detection of four text types.
Result: RACE outperforms 12 baselines in identifying fine-grained text types with low false alarms, providing a policy-aligned solution for LLM regulation.
Conclusion: The proposed four-class detection framework and RACE method offer more nuanced and policy-relevant detection of LLM-generated text compared to existing approaches.
Abstract: The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can only distinguish pure human/LLM text or collaborative text at best. This remains insufficient for the nuanced regulation, as the LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory to construct a logic graph for the creator’s foundation while extracting Elementary Discourse Unit-level features for the editor’s style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.
[91] LLMs-Healthcare: Current Applications and Challenges of Large Language Models in various Medical Specialties
Ummara Mumtaz, Awais Ahmed, Summaya Mumtaz
Main category: cs.CL
TL;DR: A comprehensive review of Large Language Models (LLMs) applications in healthcare, focusing on diagnostic and treatment functionalities across various medical domains including cancer care, dermatology, dental care, neurodegenerative disorders, and mental health.
Details
Motivation: To provide a comprehensive overview of the latest advancements in utilizing LLMs within the healthcare sector, emphasizing their transformative impact across various medical domains and their role in supporting healthcare professionals and patients.
Method: Literature review and analysis approach examining applications of LLMs in healthcare, specifically focusing on diagnostic and treatment-related functionalities across multiple medical specialties.
Result: The review provides insights into how LLMs are applied in cancer care, dermatology, dental care, neurodegenerative disorders, and mental health, highlighting their innovative contributions to medical diagnostics and patient care.
Conclusion: LLMs have become pivotal in healthcare with transformative potential across medical specialties, though challenges and opportunities exist in their integration, including handling diverse medical data types.
Abstract: We aim to present a comprehensive overview of the latest advancements in utilizing Large Language Models (LLMs) within the healthcare sector, emphasizing their transformative impact across various medical domains. LLMs have become pivotal in supporting healthcare, including physicians, healthcare providers, and patients. Our review provides insight into the applications of Large Language Models (LLMs) in healthcare, specifically focusing on diagnostic and treatment-related functionalities. We shed light on how LLMs are applied in cancer care, dermatology, dental care, neurodegenerative disorders, and mental health, highlighting their innovative contributions to medical diagnostics and patient care. Throughout our analysis, we explore the challenges and opportunities associated with integrating LLMs in healthcare, recognizing their potential across various medical specialties despite existing limitations. Additionally, we offer an overview of handling diverse data types within the medical field.
[92] MegaFake: A Theory-Driven Dataset of Fake News Generated by Large Language Models
Lionel Z. Wang, Ka Chung Ng, Yiming Ma, Wenqi Fan
Main category: cs.CL
TL;DR: A framework for understanding and generating LLM-based fake news, with a dataset for detection research.
Details
Motivation: LLMs can generate highly convincing fake news at scale, threatening online information integrity. Understanding the motivations and mechanisms behind LLM-generated fake news is crucial for effective detection and governance.
Method: Developed LLM-Fake Theory integrating social psychology theories to explain machine-generated deception. Designed an innovative prompt engineering pipeline to automate fake news generation using LLMs without manual annotation. Created MegaFake dataset derived from FakeNewsNet.
Result: Created a theoretically informed machine-generated fake news dataset (MegaFake) and advanced both theoretical understanding of human-machine deception mechanisms and practical approaches to fake news detection in the LLM era.
Conclusion: The study provides a framework and dataset for understanding and detecting LLM-generated fake news, addressing a critical challenge in the era of generative AI.
Abstract: Fake news significantly influences decision-making processes by misleading individuals, organizations, and even governments. Large language models (LLMs), as part of generative AI, can amplify this problem by generating highly convincing fake news at scale, posing a significant threat to online information integrity. Therefore, understanding the motivations and mechanisms behind fake news generated by LLMs is crucial for effective detection and governance. In this study, we develop the LLM-Fake Theory, a theoretical framework that integrates various social psychology theories to explain machine-generated deception. Guided by this framework, we design an innovative prompt engineering pipeline that automates fake news generation using LLMs, eliminating manual annotation needs. Utilizing this pipeline, we create a theoretically informed Machine-generated Fake news dataset, MegaFake, derived from FakeNewsNet. Through extensive experiments with MegaFake, we advance both theoretical understanding of human-machine deception mechanisms and practical approaches to fake news detection in the LLM era.
[93] SPRIG: Improving Large Language Model Performance by System Prompt Optimization
Lechen Zhang, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens
Main category: cs.CL
TL;DR: SPRIG is a genetic algorithm that optimizes system prompts for LLMs across multiple tasks, achieving performance comparable to task-specific prompts and showing cross-model generalization.
Details
Motivation: While LLM performance depends on prompt choice, most research focuses on task-specific prompts, neglecting general system prompt optimization. There's a need for methods to optimize system-level instructions that work across diverse scenarios.
Method: SPRIG uses an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize LLM performance. It evaluates prompts on 47 different task types to ensure generalizability.
Result: A single optimized system prompt performs on par with task prompts optimized for each individual task. Combining system and task-level optimizations yields further improvements. Optimized prompts generalize across model families, parameter sizes, and languages.
Conclusion: System-level prompt optimization is crucial for maximizing LLM potential, and SPRIG demonstrates that optimized system prompts can achieve strong general performance while complementing task-specific optimizations.
Abstract: Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model’s performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.
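A minimal sketch of SPRIG-style edit-based genetic search over prompt components; the component pool and the toy fitness function here are invented stand-ins for the paper's 47-task evaluation:

```python
import random

# Hypothetical pool of system-prompt components; SPRIG's real pool and
# benchmark-based scorer differ. This only shows the edit-based loop.
COMPONENTS = [
    "You are a helpful assistant.",
    "Think step by step.",
    "Answer concisely.",
    "Cite your sources.",
    "Double-check arithmetic.",
]

def fitness(prompt):
    """Stand-in for average task accuracy over an evaluation suite."""
    # Toy objective: reward the first three components, penalize length.
    return sum(1.0 for c in COMPONENTS[:3] if c in prompt) - 0.1 * len(prompt)

def mutate(prompt, rng):
    """Apply one edit: insert, delete, or swap a component."""
    prompt = list(prompt)
    op = rng.choice(["insert", "delete", "swap"])
    if op == "insert" or not prompt:
        prompt.insert(rng.randrange(len(prompt) + 1), rng.choice(COMPONENTS))
    elif op == "delete":
        prompt.pop(rng.randrange(len(prompt)))
    else:
        i, j = rng.randrange(len(prompt)), rng.randrange(len(prompt))
        prompt[i], prompt[j] = prompt[j], prompt[i]
    return prompt

def optimize(generations=30, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(COMPONENTS)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]        # keep the best half
        population = elite + [mutate(p, rng) for p in elite]
    return max(population, key=fitness)

best = optimize()
```

Because survivors are carried over unchanged, the best fitness is non-decreasing across generations, which is the same elitism argument the paper's search relies on.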
[94] DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky
Main category: cs.CL
TL;DR: DOVE is a large-scale dataset of prompt perturbations for evaluating LLM sensitivity across multiple dimensions, enabling more robust evaluation practices.
Details
Motivation: LLMs are sensitive to various arbitrary prompt dimensions (delimiters, answer enumerators, instruction wording), questioning the reliability of single-prompt evaluation practices.
Method: Created DOVE dataset containing thousands of prompt perturbations per instance across various dimensions, evaluating several model families to assess joint effects of perturbations.
Result: Found efficient methods for choosing well-performing prompts, observed that few-shot examples reduce sensitivity, identified instances inherently hard across all perturbations. Dataset contains 250M+ prompt perturbations and model outputs.
Conclusion: DOVE enables community-wide effort toward meaningful, robust, and efficient evaluation by providing comprehensive prompt perturbation data to study LLM sensitivity holistically.
Abstract: Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/
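The combinatorial growth DOVE exploits can be sketched as a cross product over a few prompt dimensions (the dimensions and values below are illustrative; DOVE varies many more):

```python
from itertools import product

# Three illustrative perturbation dimensions; DOVE covers many more.
delimiters = [": ", " - ", "\n"]
enumerators = ["A. B. C.", "1. 2. 3.", "(a) (b) (c)"]
instructions = ["Answer the question.", "Choose the best option.",
                "Pick one answer."]

def perturbations(question):
    """One prompt variant per combination of dimension values."""
    for d, e, i in product(delimiters, enumerators, instructions):
        yield f"{i}{d}{question}{d}Options{d}{e}"

variants = list(perturbations("What is 2+2?"))
```

Three values per dimension already yield 27 distinct variants per instance; the dataset's 250M+ scale follows from combining many such dimensions over whole benchmarks.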
[95] ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs
Artem Zabolotnyi, Roman Makarov, Mile Mitrovic, Polina Proskura, Oleg Travkin, Roman Alferov, Alexey Zaytsev
Main category: cs.CL
TL;DR: ALIEN introduces a lightweight method to refine entropy-based uncertainty estimation by aligning it with prediction reliability through a small uncertainty head with two regularization mechanisms.
Details
Motivation: Existing uncertainty estimation methods for language models often show overconfidence on difficult inputs, with predictive entropy mainly capturing aleatoric uncertainty but having limited capacity to handle class overlap or ambiguous linguistic cues.
Method: ALIEN trains a small uncertainty head initialized to produce the model’s original entropy, then fine-tunes it with two regularization mechanisms to align entropy with prediction reliability, adding minimal parameters (0.002% for decoders, 0.5% for encoders).
Result: ALIEN consistently outperforms baselines across 7 classification datasets and 2 NER benchmarks on 5 language models (RoBERTa, ELECTRA, LLaMA-2, Qwen2.5, Qwen3), achieving best performance in detecting incorrect predictions and lowest calibration error with minimal inference overhead.
Conclusion: Entropy can be effectively refined through lightweight supervised alignment, producing more reliable uncertainty estimates without modifying backbone models, making the approach practical for large-scale deployment with modern language models.
Abstract: Uncertainty estimation remains a key challenge when adapting pre-trained language models to downstream classification tasks, with overconfidence often observed for difficult inputs. While predictive entropy provides a strong baseline for uncertainty estimation, it considers mainly aleatoric uncertainty and has limited capacity to capture effects, such as class overlap or ambiguous linguistic cues. We introduce Aligned Entropy - ALIEN, a lightweight method that refines entropy-based uncertainty by aligning it with prediction reliability. ALIEN trains a small uncertainty head initialized to produce the model’s original entropy and subsequently fine-tuned with two regularization mechanisms. Experiments across seven classification datasets and two NER benchmarks, evaluated on five language models (RoBERTa, ELECTRA, LLaMA-2, Qwen2.5, and Qwen3), show that ALIEN consistently outperforms strong baselines across all considered scenarios in detecting incorrect predictions, while achieving the lowest calibration error. The proposed method introduces only a small inference overhead (in the order of milliseconds per batch on CPU) and increases the model’s parameter count by just 0.002% for decoder models and 0.5% for encoder models, without requiring storage of intermediate states. It improves uncertainty estimation while preserving the original model architecture, making the approach practical for large-scale deployment with modern language models. Our results demonstrate that entropy can be effectively refined through lightweight supervised alignment, producing more reliable uncertainty estimates without modifying the backbone model. The code is available at 4.
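Predictive entropy, the baseline ALIEN refines, is straightforward to compute; the one-parameter "head" below only sketches the identity initialization (ALIEN's real head and its two regularizers are more involved):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def predictive_entropy(logits):
    """Shannon entropy of the predicted class distribution (nats)."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)

class EntropyHead:
    """Toy uncertainty head: identity-initialized, so it reproduces the
    model's raw entropy exactly; fine-tuning (omitted) would then adjust
    it to better track prediction reliability."""
    def __init__(self):
        self.scale, self.bias = 1.0, 0.0  # identity initialization
    def __call__(self, logits):
        return self.scale * predictive_entropy(logits) + self.bias

head = EntropyHead()
confident = [5.0, 0.1, 0.1]  # peaked distribution, low entropy
ambiguous = [1.0, 1.0, 1.0]  # uniform distribution, maximal entropy
```

Identity initialization guarantees the head starts at the entropy baseline, so supervised alignment can only be learned on top of an already reasonable signal.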
[96] In-Context Watermarks for Large Language Models
Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, Yuheng Bu
Main category: cs.CL
TL;DR: In-Context Watermarking (ICW) embeds watermarks into LLM-generated text through prompt engineering alone, enabling model-agnostic watermarking without access to the decoding process.
Details
Motivation: Current LLM watermarking methods require access to the decoding process, limiting real-world applicability. Need for model-agnostic watermarking for sensitive applications like detecting AI-generated academic reviews where model access is unavailable.
Method: ICW uses prompt engineering to embed watermarks through LLMs’ in-context learning and instruction-following abilities. Four strategies at different granularity levels with tailored detection methods. Also examines Indirect Prompt Injection setting where watermarking is triggered by modifying input documents.
Result: Experiments validate ICW as a feasible, model-agnostic, practical watermarking approach. Findings suggest ICW becomes more effective as LLMs become more capable, offering scalable content attribution.
Conclusion: ICW provides a promising direction for scalable and accessible content attribution in LLMs through prompt-based watermarking, addressing limitations of existing methods that require model access.
Abstract: The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs’ in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution. Our code is available at https://github.com/yepengliu/In-Context-Watermarks.
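A toy version of prompt-based watermark detection: if the prompt instructs the model to favor a hypothetical "green list" of discourse markers, detection reduces to a one-sided binomial test on their frequency (this lexical strategy is one plausible granularity; the paper's four ICW strategies and their detectors differ):

```python
import math

# Hypothetical "green list" the prompt asks the model to favor.
GREEN = {"therefore", "notably", "moreover", "consequently", "additionally"}

def detect(text, p_null=0.01):
    """z-statistic: are green-list words over-represented relative to
    their assumed base rate p_null in unwatermarked text?"""
    words = text.lower().split()
    n = len(words)
    hits = sum(1 for w in words if w.strip(".,") in GREEN)
    z = (hits - n * p_null) / math.sqrt(n * p_null * (1 - p_null))
    return hits, z

watermarked = ("Notably the results improved. Moreover the baseline lagged. "
               "Consequently we conclude, and additionally we note limits.")
plain = "The results improved. The baseline lagged. We conclude."

hits_w, z_w = detect(watermarked)
hits_p, z_p = detect(plain)
```

A large positive z flags likely watermarked text; unwatermarked text stays near or below zero. The base rate and threshold would have to be calibrated on real unmarked corpora.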
[97] Informatics for Food Processing
Gordana Ispirova, Michael Sebek, Giulia Menichetti
Main category: cs.CL
TL;DR: This paper explores computational approaches to food processing classification using machine learning and AI, including random forest models and large language models for semantic embedding of food data, with a case study on multimodal AI integration for large-scale food classification.
Details
Motivation: To address limitations in traditional food processing classification frameworks (NOVA, Nutri-Score, SIGA) which suffer from subjectivity and reproducibility challenges that hinder epidemiological research and public policy, by developing more objective computational approaches.
Method: Proposes FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate continuous FPro scores; uses large language models (BERT, BioBERT) to semantically embed food descriptions and ingredient lists; presents a case study using Open Food Facts database with multimodal AI models integrating structured and unstructured data.
Result: Develops computational frameworks that can classify foods at scale, handle missing data, and provide more objective assessments of food processing levels compared to traditional subjective classification systems.
Conclusion: Machine learning, AI, and data science offer transformative approaches to food informatics, enabling more objective, scalable, and reproducible classification of food processing that can advance public health research and policy.
Abstract: This chapter explores the evolution, classification, and health implications of food processing, while emphasizing the transformative role of machine learning, artificial intelligence (AI), and data science in advancing food informatics. It begins with a historical overview and a critical review of traditional classification frameworks such as NOVA, Nutri-Score, and SIGA, highlighting their strengths and limitations, particularly the subjectivity and reproducibility challenges that hinder epidemiological research and public policy. To address these issues, the chapter presents novel computational approaches, including FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate a continuous FPro score. It also explores how large language models like BERT and BioBERT can semantically embed food descriptions and ingredient lists for predictive tasks, even in the presence of missing data. A key contribution of the chapter is a novel case study using the Open Food Facts database, showcasing how multimodal AI models can integrate structured and unstructured data to classify foods at scale, offering a new paradigm for food processing assessment in public health and research.
[98] Languages in Whisper-Style Speech Encoders Align Both Phonetically and Semantically
Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank
Main category: cs.CL
TL;DR: Cross-lingual alignment in speech encoders persists even without phonetic cues, especially in models trained with translation objectives, and early-exiting improves ASR for unseen low-resource languages.
Details
Motivation: To determine whether cross-lingual alignment in speech encoders arises from semantic similarity rather than phonetic overlap, and to test if early-exiting can produce representations less tied to language-specific semantics for better low-resource ASR.
Method: Conducted pronunciation-controlled experiments to isolate semantic from phonetic cues, tested spoken translation retrieval without phonetic overlap, and implemented early-exiting in encoders to create less language-specific representations.
Result: Spoken translation retrieval remained strongly above chance without phonetic cues in final layers, especially for translation-trained models. Early-exiting improved ASR performance on low-resource languages unseen during training.
Conclusion: Cross-lingual alignment in speech encoders is semantic rather than phonetic, and early-exiting produces representations beneficial for low-resource language ASR, demonstrating effective knowledge transfer across languages.
Abstract: Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational similarity. However, prior work does not control for phonetic overlap between equivalent utterances, which may artificially support retrieval. We conduct pronunciation-controlled experiments to test whether cross-lingual alignment arises from semantic rather than phonetic similarity. Results show that spoken translation retrieval remains strongly above chance without phonetic cues in the final layers of encoders trained with a speech translation objective, most clearly for models additionally trained on translation. We further test early-exiting the encoder to induce representations we hypothesize to be less tied to language-specific semantics. These experiments indeed reveal performance gains in automatic speech recognition on low-resource languages unseen during training.
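Spoken translation retrieval over pooled encoder states reduces to nearest-neighbor search under cosine similarity; the vectors below are mocked to show the mechanism, not real Whisper activations:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Mock mean-pooled encoder states: translation pairs point in similar
# directions even when the audio shares no phonetic material.
english = {"thanks": [0.9, 0.1, 0.0], "dog": [0.0, 0.2, 0.9]}
german = {"danke": [0.8, 0.2, 0.1], "hund": [0.1, 0.1, 0.8]}

def retrieve(query_vec, bank):
    """Return the bank entry with highest cosine similarity to the query."""
    return max(bank, key=lambda k: cosine(query_vec, bank[k]))
```

Retrieval succeeding on such phonetically unrelated pairs ("thanks"/"danke", "dog"/"hund") is exactly the above-chance signal the paper measures after removing phonetic cues.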
[99] Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun
Main category: cs.CL
TL;DR: MoRE introduces a Mixture-of-Retrieval Experts framework that enables MLLMs to dynamically coordinate diverse retrieval experts based on evolving reasoning states, improving multimodal knowledge exploitation.
Details
Motivation: Existing MRAG methods use rigid retrieval paradigms with fixed trajectories, failing to fully exploit knowledge from different retrieval experts through dynamic interaction based on the model's knowledge needs or evolving reasoning states.
Method: Proposes Mixture-of-Retrieval Experts (MoRE) framework that learns to dynamically determine which retrieval expert to engage with based on reasoning state. Uses Stepwise Group Relative Policy Optimization (Step-GRPO) to train MLLMs to interact with multiple retrieval experts and synthesize fine-grained rewards.
Result: Achieves average performance gains of over 7% compared to competitive baselines on diverse open-domain QA benchmarks. Shows strong adaptability by dynamically coordinating heterogeneous experts to precisely locate relevant information.
Conclusion: MoRE enables robust, reasoning-driven expert collaboration for multimodal retrieval-augmented generation, effectively mitigating hallucinations in MLLMs through dynamic knowledge exploitation.
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms by mimicking fixed retrieval trajectories and thus fail to fully exploit the knowledge of different retrieval experts through dynamic interaction based on the model’s knowledge needs or evolving reasoning states. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage with, conditioned on the evolving reasoning state. To effectively train this capability, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging MLLMs to interact with multiple retrieval experts and synthesize fine-grained rewards, thereby teaching the MLLM to fully coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability by dynamically coordinating heterogeneous experts to precisely locate relevant information, validating its capability for robust, reasoning-driven expert collaboration. All codes and data are released on https://github.com/OpenBMB/MoRE.
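The routing idea can be sketched as a gate that picks a retrieval expert from the evolving reasoning state; MoRE learns this policy with Step-GRPO, whereas the experts and keyword rules below are hand-written stand-ins:

```python
# Hypothetical retrieval experts returning mock evidence strings.
def text_expert(q):  return f"[text passages for: {q}]"
def image_expert(q): return f"[image regions for: {q}]"
def table_expert(q): return f"[table cells for: {q}]"

EXPERTS = {"text": text_expert, "image": image_expert, "table": table_expert}

def gate(reasoning_state):
    """Pick the next retrieval expert from the current reasoning state.
    MoRE learns this decision; the keyword rules here are illustrative."""
    s = reasoning_state.lower()
    if "figure" in s or "image" in s:
        return "image"
    if "column" in s or "table" in s:
        return "table"
    return "text"

def step(reasoning_state, query):
    """One reasoning step: route the query, return (expert, evidence)."""
    name = gate(reasoning_state)
    return name, EXPERTS[name](query)
```

Repeating `step` as the reasoning state evolves gives the dynamic, state-conditioned expert coordination that fixed retrieval trajectories lack.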
[100] Gaussian mixture models as a proxy for interacting language models
Edward L. Wang, Mohammad Sharifi Kiasari, Tianyu Wang, Hayden Helm, Avanti Athreya, Carey Priebe, Vince Lyzinski
Main category: cs.CL
TL;DR: Interacting Gaussian mixture models (GMMs) proposed as computationally efficient proxy for interacting large language models (LLMs) with RAG-like capabilities.
Details
Motivation: LLMs are powerful but computationally expensive, especially when simulating interacting systems with retrieval-augmented generation (RAG). Need for theoretical understanding and efficient proxies to study emergent behaviors like polarization in multi-agent LLM systems.
Method: Develop interacting Gaussian mixture models with RAG-like updating mechanism where GMMs can generate, exchange, and update data and parameters. Build Markov chain from this system, formalize polarization concept, and prove lower bounds on polarization probability.
Result: Interacting GMM system mimics aspects of experimental simulations of interacting LLMs with iterative feedback. Provides theoretical framework for polarization analysis and proves lower bounds on polarization probability in such systems.
Conclusion: Interacting GMMs offer computationally efficient proxy for studying interacting LLM systems, providing theoretical insights into emergent behaviors like polarization while being minimal in computational cost.
Abstract: Large language models (LLMs) are powerful tools that, in a number of settings, overlap with the results of human pattern recognition and reasoning. Retrieval-augmented generation (RAG) further allows LLMs to produce tailored output depending on the contents of their RAG databases. However, LLMs depend on complex, computationally expensive algorithms. In this paper, we introduce interacting Gaussian mixture models (GMMs) as a proxy for interacting LLMs. We construct a model of interacting GMMs, complete with an analogue to RAG updating, under which GMMs can generate, exchange, and update data and parameters. We show that this interacting system of Gaussian mixture models, which can be implemented at minimal computational cost, mimics certain aspects of experimental simulations of interacting LLMs whose iterative responses depend on feedback from other LLMs. We build a Markov chain from this system of interacting GMMs; formalize and interpret the notion of polarization for such a chain; and prove lower bounds on the probability of polarization. This provides theoretical insight into the use of interacting Gaussian mixture models as a computationally efficient proxy for interacting large language models.
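A minimal interacting-agents sketch, with single Gaussians standing in for full GMMs and a mean update standing in for the RAG-like database refit; this toy variant drifts toward consensus, whereas the paper characterizes when such chains polarize instead:

```python
import random

class Agent:
    """A 1-D Gaussian standing in for one mixture component; the paper's
    construction uses full GMMs with a RAG-like sample database."""
    def __init__(self, mu, sigma, rng):
        self.mu, self.sigma, self.rng = mu, sigma, rng
        self.memory = []

    def generate(self, n=20):
        return [self.rng.gauss(self.mu, self.sigma) for _ in range(n)]

    def update(self, peer_samples, weight=0.2):
        # RAG-like step: store exchanged samples, nudge the mean toward them.
        self.memory.extend(peer_samples)
        peer_mean = sum(peer_samples) / len(peer_samples)
        self.mu = (1 - weight) * self.mu + weight * peer_mean

rng = random.Random(0)
a, b = Agent(-2.0, 0.5, rng), Agent(2.0, 0.5, rng)
gap_start = abs(a.mu - b.mu)
for _ in range(10):  # rounds of iterative feedback
    a_out, b_out = a.generate(), b.generate()
    a.update(b_out)
    b.update(a_out)
gap_end = abs(a.mu - b.mu)
```

With symmetric exchange each round contracts the gap between the means by roughly the factor (1 - 2·weight), so the chain reaches consensus; asymmetric or filtered exchange is what opens the door to the polarization the paper bounds.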
[101] PDF Retrieval Augmented Question Answering
Thi Thu Uyen Hoang, Viet Anh Nguyen
Main category: cs.CL
TL;DR: A RAG-based QA system for extracting information from multimodal PDF content (text, images, diagrams, graphs, tables) that handles complex multimodal queries by integrating non-textual elements and fine-tuning LLMs.
Details
Motivation: PDF files contain rich multimodal data (text, images, vector diagrams, graphs, tables) that existing QA systems struggle with, as they're primarily designed for textual content. There's a need for systems that can handle complex multimodal questions combining multiple data types.
Method: Uses Retrieval Augmented Generation (RAG) framework with refined approaches for processing and integrating non-textual elements from PDFs. Fine-tunes large language models to better adapt to the multimodal system for handling complex queries.
Result: The system demonstrates capability to extract accurate information across different types of content in PDFs through in-depth experimental evaluation.
Conclusion: The work advances retrieval-augmented QA systems and provides foundation for further research in multimodal data integration and processing.
Abstract: This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs (including text, images, vector diagrams, graphs, and tables) pose unique challenges for existing QA systems, which are primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.
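Retrieval over mixed PDF chunks can be sketched with captions standing in for non-textual elements; the bag-of-words overlap score below replaces the learned embeddings and fine-tuned LLM of the actual system:

```python
# Mock chunks extracted from a PDF: plain text plus table/figure captions
# standing in for non-textual elements.
chunks = [
    ("text", "The model was trained for ten epochs on the full corpus."),
    ("table", "Table 2: accuracy per epoch, rising from 0.61 to 0.84."),
    ("image", "Figure 3: training loss curve, flattening after epoch six."),
]

def score(query, passage):
    """Fraction of query terms appearing in the passage (toy relevance)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

def retrieve(query, k=2):
    """Return the top-k chunks by overlap score, regardless of modality."""
    ranked = sorted(chunks, key=lambda c: score(query, c[1]), reverse=True)
    return ranked[:k]

top = retrieve("accuracy per epoch")
```

Because captions are indexed alongside running text, a numeric question can surface a table chunk first, which is the multimodal behavior the paper's pipeline pursues with far stronger retrieval models.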
[102] XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL
Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
Main category: cs.CL
TL;DR: XiYan-SQL is a novel framework for Text-to-SQL that uses LLMs to generate multiple SQL candidates through schema filtering, multi-generator ensemble, and selection with candidate reorganization, achieving state-of-the-art results on BIRD and Spider benchmarks.
Details
Motivation: To leverage LLM capabilities for addressing Text-to-SQL challenges, particularly in generating accurate SQL queries from natural language, by creating a framework that produces multiple high-quality SQL candidates and selects the optimal one.
Method: Three-component framework: 1) Schema Filter module to identify relevant database schemas, 2) Multi-generator ensemble approach using multi-task fine-tuning to create diverse SQL generation models with different styles, 3) Selection model with candidate reorganization strategy to choose the optimal SQL query.
Result: Achieves new SOTA performance of 75.63% on BIRD benchmark and 89.65% accuracy on Spider test set, surpassing all previous methods and demonstrating effectiveness and robustness.
Conclusion: XiYan-SQL effectively leverages LLMs for Text-to-SQL by generating and selecting from multiple SQL candidates, achieving state-of-the-art performance on major benchmarks through its innovative multi-generator ensemble and selection approach.
Abstract: To leverage the advantages of LLM in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
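The selection stage can be approximated by execution-result consensus, a common Text-to-SQL baseline; XiYan-SQL instead trains a selection model with candidate reorganization, so the vote below is only illustrative (candidates and results are mocked):

```python
from collections import Counter

# Hypothetical candidates from generators with different SQL styles,
# mapped to mocked execution results.
candidates = {
    "SELECT COUNT(*) FROM users": (42,),
    "SELECT COUNT(id) FROM users": (42,),
    "SELECT SUM(1) FROM users WHERE 1=1": (42,),
    "SELECT COUNT(*) FROM user": ("error",),
}

def select(cands):
    """Group candidates by execution result, prefer the largest
    non-error group, and return its shortest member."""
    groups = Counter(res for res in cands.values() if res != ("error",))
    if not groups:
        return min(cands, key=len)
    best_result = groups.most_common(1)[0][0]
    matching = [sql for sql, res in cands.items() if res == best_result]
    return min(matching, key=len)
```

Consensus voting already explains why diverse generators help: stylistically different queries that agree on the result are strong evidence of correctness, and a trained selector can exploit subtler signals than agreement alone.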
[103] The Generalization Ridge: Information Flow in Natural Language Generation
Ruidi Chang, Chunyuan Deng, Hanjie Chen
Main category: cs.CL
TL;DR: InfoRidge framework reveals that predictive information in transformers peaks in intermediate layers (forming a “generalization ridge”) before declining in final layers, showing a transition between generalization and memorization.
Details
Motivation: Transformer language models achieve state-of-the-art NLG performance, but their internal mechanisms for synthesizing task-relevant information remain poorly understood. While prior work suggests intermediate layers have more generalizable representations than final layers, how this generalization emerges and propagates during training is unclear.
Method: Proposed InfoRidge, an information-theoretic framework to characterize how predictive information (mutual information between hidden representations and target outputs) varies across depth during training. Conducted experiments across various models and datasets, plus complementary analyses using residual scaling and attention patterns to characterize layer-wise functional specialization.
Result: Revealed a consistent non-monotonic trend: predictive information peaks in intermediate layers (forming a “generalization ridge”) before declining in final layers, reflecting a transition between generalization and memorization. Validated findings with multiple-token generation experiments showing the ridge phenomenon persists across decoding steps.
Conclusion: The findings offer new insights into transformer internal mechanisms and underscore the critical role of intermediate layers in supporting generalization. The InfoRidge framework provides a principled way to understand how transformers balance generalization and memorization across layers.
Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG), yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. We propose InfoRidge, an information-theoretic framework, to characterize how predictive information (the mutual information between hidden representations and target outputs) varies across depth during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in intermediate layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we conduct a set of complementary analyses that leverage residual scaling and attention patterns to characterize layer-wise functional specialization. We further validate our findings with multiple-token generation experiments, verifying that the observed ridge phenomenon persists across decoding steps. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
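The ridge phenomenon can be illustrated with a crude stand-in for predictive information: a linear probe's R² from per-layer features to targets. Everything below (the three "layers", their noise levels) is fabricated for illustration and is not the paper's setup or its MI estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_r2(h, y, reg=1e-2):
    """Fit a ridge linear probe from hidden states h to targets y and
    return R^2 -- a crude proxy for predictive information."""
    h = h - h.mean(0)
    y = y - y.mean(0)
    w = np.linalg.solve(h.T @ h + reg * np.eye(h.shape[1]), h.T @ y)
    resid = y - h @ w
    return 1.0 - resid.var() / y.var()

# Toy "transformer": the intermediate layer is most target-aligned,
# the final layer is noisier -- yielding a ridge-shaped profile.
n, d = 512, 32
y = rng.normal(size=(n, 1))
layers = {
    "early":  rng.normal(size=(n, d)),                        # mostly noise
    "middle": np.hstack([y + 0.1 * rng.normal(size=(n, 1)),   # strong signal
                         rng.normal(size=(n, d - 1))]),
    "late":   np.hstack([y + 1.0 * rng.normal(size=(n, 1)),   # weaker signal
                         rng.normal(size=(n, d - 1))]),
}
scores = {name: probe_r2(h, y) for name, h in layers.items()}
ridge_layer = max(scores, key=scores.get)
```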
[104] PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
Eliya Habba, Noam Dahan, Gili Lior, Gabriel Stanovsky
Main category: cs.CL
TL;DR: PromptSuite is a framework for automatically generating diverse prompt variations to enable more robust evaluation of LLMs, addressing the unreliability of single-prompt testing.
Details
Motivation: Single-prompt evaluation of LLMs is unreliable due to performance sensitivity to small prompt variations, but manually creating diverse prompts for robust multi-prompt evaluation is challenging and limits adoption in practice.
Method: PromptSuite uses a modular prompt design that allows controlled perturbations to each component, with extensible architecture supporting new components and perturbation types, enabling automatic generation of various prompts across tasks and benchmarks.
Result: Case studies demonstrate that PromptSuite provides meaningful prompt variations to support strong evaluation practices, with all resources (Python API, source code, web interface, demo video) publicly available.
Conclusion: PromptSuite addresses the need for robust LLM evaluation by automating prompt variation generation, making multi-prompt evaluation more practical and reliable.
Abstract: Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible, working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.
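The modular design, one set of variants per prompt component combined by Cartesian product, can be sketched as follows. The component names and `prompt_variations` helper are hypothetical; PromptSuite's actual Python API differs.

```python
import itertools

# Hypothetical component-level variants, not PromptSuite's real schema.
COMPONENTS = {
    "instruction": ["Answer the question.",
                    "Please answer the following question."],
    "separator":   ["\n", "\n\n", " ### "],
    "question":    ["What is the capital of France?"],
}

def prompt_variations(components):
    """Cartesian product over per-component variants -> prompt strings."""
    keys = list(components)
    for combo in itertools.product(*(components[k] for k in keys)):
        parts = dict(zip(keys, combo))
        yield parts["instruction"] + parts["separator"] + parts["question"]

variants = list(prompt_variations(COMPONENTS))
# 2 instructions x 3 separators x 1 question = 6 prompt variants
```

Evaluating a model on all six variants (instead of one) is what turns a single accuracy number into a robustness distribution.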
[105] The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, Maarten Sap
Main category: cs.CL
TL;DR: Systematic audit of 39 AI society studies reveals pervasive methodological flaws (PIMMUR framework) that undermine simulation validity, showing many “emergent” behaviors are artifacts rather than genuine social dynamics.
Details
Motivation: To critically examine the methodological rigor of using LLMs to simulate human collective behaviors in "AI societies," as current approaches may lack scientific validity and capture model biases rather than universal human social behaviors.
Method: Conducted systematic audit of 39 recent studies, identified six pervasive flaws (PIMMUR: agent profiles, interaction, memory, control, unawareness, realism), tested frontier LLMs’ ability to identify underlying social experiments, reproduced five representative experiments with PIMMUR principles enforced.
Result: 89.7% of studies violate at least one PIMMUR principle; frontier LLMs correctly identify underlying social experiments in 50.8% of cases; 61.0% of prompts exert excessive control that pre-determines outcomes; when PIMMUR principles are enforced, reported collective phenomena often vanish or reverse.
Conclusion: Current AI simulations may capture model-specific biases rather than universal human social behaviors, raising critical concerns about using LLMs as scientific proxies for human society. Methodological rigor is essential for valid social simulations.
Abstract: Large language models (LLMs) are increasingly deployed to simulate human collective behaviors, yet the methodological rigor of these “AI societies” remains under-explored. Through a systematic audit of 39 recent studies, we identify six pervasive flaws, spanning agent profiles, interaction, memory, control, unawareness, and realism (PIMMUR). Our analysis reveals that 89.7% of studies violate at least one principle, undermining simulation validity. We demonstrate that frontier LLMs correctly identify the underlying social experiment in 50.8% of cases, while 61.0% of prompts exert excessive control that pre-determines outcomes. By reproducing five representative experiments (e.g., telephone game), we show that reported collective phenomena often vanish or reverse when PIMMUR principles are enforced, suggesting that many “emergent” behaviors are methodological artifacts rather than genuine social dynamics. Our findings suggest that current AI simulations may capture model-specific biases rather than universal human social behaviors, raising critical concerns about the use of LLMs as scientific proxies for human society.
[106] Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He
Main category: cs.CL
TL;DR: MedIRT: A psychometric evaluation framework using Item Response Theory to assess LLMs in medicine, moving beyond simple accuracy metrics to model latent competency and item characteristics.
Details
Motivation: Current accuracy-based evaluation of LLMs in medicine has limitations: it treats all questions equally, conflates model ability with item characteristics, and produces rankings that vary with benchmark choice, failing to measure underlying medical competency.
Method: Introduces MedIRT, a psychometric framework grounded in Item Response Theory that jointly models latent competency and item-level difficulty/discrimination, includes benchmark integrity validation to ensure items measure coherent underlying ability, and evaluates 71 diverse LLMs on USMLE-aligned benchmarks across 11 medical topics.
Result: MedIRT predicts held-out LLM responses with 83.3% accuracy, IRT-based rankings outperform accuracy-based rankings across 6 external medical benchmarks (4 wins, 0 losses, 18% lower variance), reveals domain-specific heterogeneity masked by aggregate accuracy, and identifies two distinct response profiles requiring different interventions.
Conclusion: Item-aware psychometric evaluation provides a more valid and stable foundation for assessing LLMs in medicine, with implications for any high-stakes domain where benchmark integrity can be validated and items vary in difficulty/discrimination.
Abstract: Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks – including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries – achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated, and items vary meaningfully in difficulty and discrimination.
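The IRT machinery behind this is compact. Under the standard two-parameter logistic (2PL) model, the probability that a model with latent competency theta answers an item of discrimination a and difficulty b correctly is sigma(a * (theta - b)). A short sketch; the parameter values are illustrative, not from the paper:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with competency `theta` answers
    an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Why item-awareness matters: a hard, highly discriminating item
# separates weak and strong models far more than an easy item does,
# yet plain accuracy weights both items equally.
theta_weak, theta_strong = -1.0, 1.0
easy = dict(a=0.5, b=-2.0)
hard = dict(a=2.0, b=0.5)

gap_easy = p_correct(theta_strong, **easy) - p_correct(theta_weak, **easy)
gap_hard = p_correct(theta_strong, **hard) - p_correct(theta_weak, **hard)
```

Fitting theta, a, and b jointly from response matrices (rather than assuming them, as here) is what MedIRT does at scale.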
[107] A Linguistics-Aware LLM Watermarking via Syntactic Predictability
Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han
Main category: cs.CL
TL;DR: STELA is a publicly verifiable watermarking framework for LLMs that dynamically adjusts watermark strength based on linguistic degrees of freedom using part-of-speech n-gram modeling, enabling detection without access to model logits.
Details
Motivation: Current LLM watermarking methods face a trade-off between text quality and detection robustness, and rely on model-specific signals (logits) that hinder public verification. There's a need for publicly verifiable watermarking that doesn't require access to the underlying model's internal states.
Method: STELA modulates watermark strength using part-of-speech (POS) n-gram-modeled linguistic indeterminacy. It weakens the watermark signal in grammatically constrained contexts to preserve quality and strengthens it in contexts with greater linguistic flexibility to enhance detectability. The detector operates without access to any model logits.
Result: Experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean) show that STELA surpasses prior methods in detection robustness while maintaining text quality.
Conclusion: STELA provides a publicly verifiable watermarking framework that effectively balances text quality and detection robustness by aligning watermark strength with linguistic degrees of freedom, advancing trustworthy AI governance tools.
Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthen it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
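STELA's core move, scaling watermark strength by linguistic indeterminacy, can be sketched on top of a standard green/red-list watermark. Both helpers below are simplified assumptions for illustration (a hash-seeded green list and a linear strength schedule), not STELA's exact construction.

```python
import hashlib

def green_list(prev_token, vocab, frac=0.5):
    """Hash-seeded pseudo-random 'green' fraction of the vocabulary, as
    in standard green/red-list watermarking (not STELA's exact scheme).
    Deterministic given the context, so a detector can recompute it."""
    key = lambda tok: int(hashlib.sha256(f"{prev_token}|{tok}".encode()).hexdigest(), 16)
    ranked = sorted(vocab, key=key)
    return set(ranked[: int(len(ranked) * frac)])

def watermark_delta(pos_indeterminacy, base_delta=2.0):
    """STELA's idea in one line: scale the green-token logit bias by
    linguistic indeterminacy (a POS n-gram entropy normalized to [0,1]).
    Constrained contexts get a weak signal to preserve quality; flexible
    contexts get a strong one to aid detection."""
    return base_delta * pos_indeterminacy

vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
greens = green_list("the", vocab)

# After a determiner the next POS is nearly forced (low entropy) -> weak
# mark; at a clause boundary many POS tags fit (high entropy) -> strong.
weak, strong = watermark_delta(0.1), watermark_delta(0.9)
```

Because the green list depends only on text (context tokens) and the strength schedule only on POS statistics, detection needs no model logits, which is what makes the scheme publicly verifiable.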
[108] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing
Sicheng Lyu, Yu Gu, Xinyu Wang, Jerry Huang, Sitao Luan, Yufei Cui, Xiao-Wen Chang, Peng Lu
Main category: cs.CL
TL;DR: EvoEdit is a novel model editing strategy that mitigates catastrophic interference in sequential editing of LLMs through sequential null-space alignment, enabling stable updates without compromising previously integrated knowledge.
Details
Motivation: LLMs need continual updates to correct outdated knowledge, but existing locate-then-edit approaches suffer from catastrophic interference in sequential editing contexts where new edits compromise previous updates and degrade preserved knowledge.
Method: EvoEdit uses sequential null-space alignment for each incoming edit, preserving both original and previously modified knowledge representations while maintaining output invariance on preserved knowledge across long edit sequences.
Result: EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques on real-world sequential knowledge-editing benchmarks, with up to 3.53 times speedup.
Conclusion: The work underscores the need for principled approaches for LLMs in dynamically evolving information settings and provides a simple yet effective solution with strong theoretical guarantees for stable sequential model editing.
Abstract: Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
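The null-space idea can be shown in a few lines of numpy: projecting a raw weight update onto the null space of the preserved keys guarantees those keys' outputs are untouched. This is a one-shot sketch of the principle, not EvoEdit's sequential alignment procedure, and the row-vector key convention (output = k @ W) is an assumption.

```python
import numpy as np

def null_space_projector(K, tol=1e-8):
    """Projector P onto null(K), where rows of K are key vectors whose
    outputs must be preserved. Since K @ P == 0, any edit of the form
    P @ delta cannot change a preserved key's output."""
    _, s, vt = np.linalg.svd(K)
    rank = int((s > tol).sum())
    v_null = vt[rank:].T                 # orthonormal basis of null(K)
    return v_null @ v_null.T

rng = np.random.default_rng(0)
d = 8
K = rng.normal(size=(3, d))              # preserved-knowledge keys
W = rng.normal(size=(d, d))              # layer weights being edited
raw_edit = rng.normal(size=(d, d))       # the update an editor proposes
W_new = W + null_space_projector(K) @ raw_edit
```

EvoEdit's contribution is doing this sequentially, so that each new edit also lands in the null space of previously edited keys, which is where naive locate-then-edit methods interfere.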
[109] LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
Gao Yang, Yuhang Liu, Siyu Miao, Xinyue Liang, Zhengyang Liu, Heyan Huang
Main category: cs.CL
TL;DR: Game-theoretic mutual evaluation framework for LLMs where models assess each other through self-play and peer review, compared with human judgments.
Details
Motivation: Conventional LLM evaluation methods are inadequate for capturing nuanced, subjective, and open-ended model behavior, requiring new approaches beyond fixed-format tasks with reference answers.
Method: Proposes automatic mutual evaluation where LLMs assess each other’s outputs through self-play and peer review. Uses game-theoretic voting algorithms to aggregate peer reviews and compares model-generated rankings with human voting behavior.
Result: Empirical results show both convergences and divergences between theoretical predictions and human evaluations, revealing insights into promises and limitations of mutual evaluation.
Conclusion: First work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for LLM evaluation, offering a novel framework for assessing model capabilities.
Abstract: Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other’s output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.
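A voting rule for aggregating peer reviews can be as simple as a Borda count over each model's ranking of its peers. The paper does not specify which voting algorithms it uses; Borda is shown here as one illustrative positional rule.

```python
from collections import defaultdict

def borda(rankings):
    """Aggregate peer rankings (each a best-to-worst list of model
    names) by Borda count: rank r among m candidates earns m-1-r
    points; return candidates sorted by total score."""
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for r, model in enumerate(ranking):
            scores[model] += m - 1 - r
    return sorted(scores, key=scores.get, reverse=True)

# Each row: one peer reviewer's ranking of the evaluated models.
peer_reviews = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]
consensus = borda(peer_reviews)   # A: 5 pts, B: 3 pts, C: 1 pt
```

The framework's question is then whether such a consensus ranking tracks human voting behavior, which positional rules like Borda do not guarantee.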
[110] ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation
Ziyi Liu, Bahar Sarrafzadeh, Pei Zhou, Longqi Yang, Jieyu Zhao, Ashish Sharma
Main category: cs.CL
TL;DR: ProMediate is a framework for evaluating proactive AI mediator agents in complex multi-topic, multi-party negotiations, featuring simulation testbeds with difficulty levels and socio-cognitive evaluation metrics.
Details
Motivation: There's a growing need for AI agents that can proactively manage complex multi-party collaboration, but systematic evaluation methods for such proactive agents remain scarce. Negotiation provides a demanding testbed requiring socio-cognitive intelligence to navigate conflicting interests and build consensus.
Method: ProMediate consists of: (1) a simulation testbed based on realistic negotiation cases with three difficulty levels (Easy, Medium, Hard), featuring a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories that can flexibly decide when and how to intervene; (2) a socio-cognitive evaluation framework with metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence.
Result: A socially intelligent mediator agent outperforms a generic baseline with faster, better-targeted interventions. In the ProMediate-Hard setting, the social mediator increases consensus change by 3.6 percentage points (10.65% vs 7.01%) while responding 77% faster (3.71s vs. 15.98s).
Conclusion: ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents capable of managing complex multi-party collaboration.
Abstract: While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs 7.01%) while being 77% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.
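A consensus-change metric of the kind reported here (in percentage points) might be computed as mean pairwise stance agreement before vs. after mediation. The `consensus` function below is a hypothetical stand-in for illustration, not ProMediate's actual metric.

```python
from itertools import combinations

def consensus(positions):
    """Mean pairwise agreement: fraction of (participant-pair, topic)
    cells where the two participants hold the same stance. A
    hypothetical stand-in for a consensus score."""
    pairs = list(combinations(positions, 2))
    agree = sum(a == b for p, q in pairs for a, b in zip(p, q))
    return agree / (len(pairs) * len(positions[0]))

# Three participants, two topics; the mediator flips one stance.
before = [["yes", "no"], ["no", "no"], ["no", "yes"]]
after  = [["no", "no"], ["no", "no"], ["no", "yes"]]
change_pp = 100 * (consensus(after) - consensus(before))
```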
[111] HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection
Irina Proskurina, Marc-Antoine Carpentier, Julien Velcin
Main category: cs.CL
TL;DR: HatePrototypes are class-level vector representations derived from hate speech detection models that enable cross-task transfer between explicit and implicit hate detection without repeated fine-tuning.
Details
Motivation: Existing hate speech benchmarks mainly address explicit hate and overlook implicit/indirect hate (demeaning comparisons, calls for exclusion, subtle discriminatory language). While explicit hate can be captured through surface features, implicit hate requires deeper semantic processing. Current approaches require repeated fine-tuning for different hate types.
Method: Develops HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. These prototypes are built from as few as 50 examples per class. Uses parameter-free early exiting with prototypes for both hate types.
Result: HatePrototypes enable cross-task transfer between explicit and implicit hate detection, with interchangeable prototypes across benchmarks. The approach works effectively for both hate types and requires minimal training data.
Conclusion: HatePrototypes provide an efficient and transferable approach to hate speech detection that bridges the gap between explicit and implicit hate without requiring repeated fine-tuning. The method supports future research on efficient hate speech detection.
Abstract: Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.
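The prototype construction is simple enough to sketch: a class prototype is the mean embedding of its examples, and classification is nearest-prototype by cosine similarity. The Gaussian toy embeddings below stand in for real model representations.

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """Class prototype = mean embedding of that class's examples (the
    paper builds these from as few as 50 examples per class)."""
    return {c: embeddings[labels == c].mean(0) for c in np.unique(labels)}

def classify(x, prototypes):
    """Nearest-prototype prediction by cosine similarity -- parameter-
    free, so no fine-tuning is needed to reuse it on a new benchmark."""
    sim = {c: x @ p / (np.linalg.norm(x) * np.linalg.norm(p))
           for c, p in prototypes.items()}
    return max(sim, key=sim.get)

rng = np.random.default_rng(0)
hate = rng.normal(loc=+1.0, size=(50, 16))   # toy class clusters
none = rng.normal(loc=-1.0, size=(50, 16))
emb = np.vstack([hate, none])
lab = np.array(["hate"] * 50 + ["non-hate"] * 50)

protos = build_prototypes(emb, lab)
pred = classify(rng.normal(loc=+1.0, size=16), protos)
```

Cross-task transfer then amounts to swapping in prototypes built on one benchmark (say, explicit hate) to classify another (implicit hate), with no gradient updates.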
[112] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
Hauke Licht
Main category: cs.CL
TL;DR: mLLMs show lab-vs-field performance gap in emotion measurement: near-human reliability in lab videos but only moderate correlation in real parliamentary debates, with systematic gender bias underestimating male arousal more than female.
Details
Motivation: While multimodal LLMs promise to enable emotion analysis in political communication through in-context learning, there's a lack of systematic evidence on whether current mLLMs can reliably measure emotions in real-world political settings.
Method: Evaluated open- and closed-weights mLLMs (as of early 2026) on video-based emotional arousal measurement using two complementary human-labeled datasets: laboratory-condition speech actor recordings and real-world parliamentary debates.
Result: Critical lab-vs-field performance gap: mLLMs approach human-level reliability in lab videos but correlate only moderately with human ratings in parliamentary debates. All but one model show systematic gender-differential bias, consistently underestimating arousal more for male than female speakers.
Conclusion: Current mLLMs have important limitations for real-world political video analysis, revealing performance gaps and biases that need addressing. Establishes evaluation framework for tracking future developments.
Abstract: Research increasingly leverages audio-visual materials to analyze emotions in political communication. Multimodal large language models (mLLMs) promise to enable such analyses through in-context learning. However, we lack systematic evidence on whether current mLLMs can reliably measure emotions in real-world political settings. This paper closes this gap by evaluating open- and closed-weights mLLMs available as of early 2026 in video-based emotional arousal measurement using two complementary human-labeled datasets: speech actor recordings created under laboratory conditions and real-world parliamentary debates. I find a critical lab-vs-field performance gap. In videos created under laboratory conditions, the examined mLLMs’ arousal scores approach human-level reliability. However, in parliamentary debate recordings, all examined models’ arousal scores correlate at best moderately with average human ratings. Moreover, in each dataset, all but one of the examined mLLMs exhibit systematic gender-differential bias, consistently underestimating arousal more for male than for female speakers, resulting in a net-positive intensity bias. These findings reveal important limitations of current mLLMs for real-world political video analysis and establish a rigorous evaluation framework for tracking future developments.
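The gender-differential bias reported here is just the mean signed error (model minus human arousal) computed per speaker gender; the ratings below are invented for illustration, not the study's data.

```python
def arousal_bias(model_scores, human_scores, genders):
    """Mean (model - human) arousal error per speaker gender. A more
    negative value means the model underestimates arousal more for
    that group."""
    by_gender = {}
    for g in set(genders):
        errs = [m - h for m, h, gg in zip(model_scores, human_scores, genders)
                if gg == g]
        by_gender[g] = sum(errs) / len(errs)
    return by_gender

# Toy ratings: the model under-rates male speakers more than female.
model  = [0.40, 0.50, 0.70, 0.80]
human  = [0.70, 0.90, 0.80, 0.85]
gender = ["M", "M", "F", "F"]
bias = arousal_bias(model, human, gender)
```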
[113] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Markus Hoehnerbach, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, Huizi Mao
Main category: cs.CL
TL;DR: BLASST is a dynamic sparse attention mechanism that accelerates LLM inference by skipping negligible attention blocks using a fixed threshold, requiring no training or pre-computation.
Details
Motivation: The computational and memory bottlenecks of self-attention in LLMs for long-context inference create deployment challenges, especially for practical inference scenarios.
Method: Uses a fixed scalar threshold to skip attention blocks by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and matrix multiplication. Includes automated threshold calibration with inverse relationship to context length.
Result: 1.52x speedup for prefill at 71.9% sparsity and 1.48x speedup for decode at 73.2% sparsity on modern GPUs while preserving benchmark accuracy.
Conclusion: BLASST provides practical inference acceleration for LLMs with long contexts through efficient sparse attention without training requirements or complex integration barriers.
Abstract: The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
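The thresholding idea reuses the running max that online (FlashAttention-style) softmax already maintains: if a block's best score sits more than a fixed margin below the running max, every weight in that block is provably at most exp(margin) after softmax, so the block's softmax, value load, and matmul can all be skipped. Below is a didactic single-query numpy sketch of that idea, not the optimized fused kernel.

```python
import numpy as np

def blocked_sparse_attention(q, K, V, block=4, log_threshold=-10.0):
    """Single-query attention that skips any key/value block whose max
    score falls more than |log_threshold| below the running max: such
    a block's total post-softmax weight is bounded by
    block * exp(log_threshold), hence negligible."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[-1])
    skipped = 0
    for i in range(0, K.shape[0], block):
        s = (K[i:i + block] @ q) * scale      # block of attention scores
        if s.max() < m + log_threshold:       # negligible block:
            skipped += 1                      # skip softmax, V load, matmul
            continue
        new_m = max(m, s.max())
        alpha = np.exp(m - new_m)             # rescale running statistics
        w = np.exp(s - new_m)
        denom = denom * alpha + w.sum()
        acc = acc * alpha + w @ V[i:i + block]
        m = new_m
    return acc / denom, skipped

rng = np.random.default_rng(0)
n, d = 32, 8
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
K[0] = 20.0 * q                               # one dominant key up front
out, skipped = blocked_sparse_attention(q, K, V)
```

The skip test costs one comparison per block, which is why a single calibrated scalar threshold suffices and no training or pre-computation pass is needed.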
[114] Geometric Organization of Cognitive States in Transformer Embedding Spaces
Sophie Zhao
Main category: cs.CL
TL;DR: Transformer sentence embeddings contain geometric structure aligned with human cognitive attributes, with energy scores and tier labels decodable via linear probes, showing statistically significant organization.
Details
Motivation: To investigate whether transformer-based language models learn geometric structure in their embedding spaces that aligns with human-interpretable cognitive or psychological attributes, building on previous work showing rich geometric organization in language model embeddings.
Method: Constructed a dataset of 480 sentences annotated with continuous energy scores (-5 to +5) and discrete tier labels across seven ordered cognitive tiers. Used fixed sentence embeddings from multiple transformer models and evaluated recoverability via linear and shallow nonlinear probes. Conducted nonparametric permutation tests for statistical significance and performed qualitative analyses with UMAP visualizations and tier-level confusion matrices.
Result: Both continuous energy scores and tier labels were reliably decodable across models, with linear probes capturing substantial structure. Statistical tests showed probe performance exceeded chance. Qualitative analyses revealed a coherent low-to-high gradient and predominantly adjacent-tier confusions, indicating geometric organization aligned with cognitive structure.
Conclusion: Transformer embedding spaces exhibit statistically significant geometric organization aligned with annotated cognitive structure, demonstrating that language models encode human-interpretable psychological attributes in their geometric representations.
Abstract: Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces. In this work, we investigate whether sentence embeddings exhibit structured geometric organization aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with both continuous energy scores (ranging from -5 to +5) and discrete tier labels spanning seven ordered cognitive annotation tiers, intended to capture a graded progression from highly constricted or reactive expressions toward more coherent and integrative cognitive states. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous energy scores and tier labels are reliably decodable, with linear probes already capturing substantial structure. To assess statistical significance, we conduct nonparametric permutation tests that randomize labels, showing that probe performance exceeds chance under both regression and classification null hypotheses. Qualitative analyses using UMAP visualizations and tier-level confusion matrices further reveal a coherent low-to-high gradient and predominantly local (adjacent-tier) confusions. Together, these results indicate that transformer embedding spaces exhibit statistically significant geometric organization aligned with the annotated cognitive structure.
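The probing protocol above (fit a linear probe on fixed embeddings, then compare against a label-permutation null) can be sketched as follows. The ridge penalty, permutation count, and seed are illustrative choices, not the paper's settings.

```python
import numpy as np

def probe_r2(X, y, alpha=1.0):
    """In-sample R^2 of a closed-form ridge probe (alpha is illustrative)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # add bias column
    w = np.linalg.solve(Xb.T @ Xb + alpha * np.eye(Xb.shape[1]), Xb.T @ y)
    resid = y - Xb @ w
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def permutation_p(X, y, n_perm=200, seed=0):
    """Fraction of label-shuffled refits that match or beat the real probe:
    the nonparametric permutation null described in the abstract."""
    rng = np.random.default_rng(seed)
    observed = probe_r2(X, y)
    null = [probe_r2(X, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(r >= observed for r in null)) / (1 + n_perm)
```

A small p-value here says the probe's fit exceeds what label-shuffled embeddings achieve, i.e. the geometric structure is not an artifact of the probe's capacity.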
[115] Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation
Qianli Wang, Van Bach Nguyen, Yihong Liu, Fedor Splitt, Nils Feldhus, Christin Seifert, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Main category: cs.CL
TL;DR: Study examines multilingual counterfactual generation by LLMs, finding translation-based approaches have higher validity but more modifications, with cross-lingual patterns showing similarity and four error types identified across languages.
Details
Motivation: While LLMs excel at generating English counterfactuals and show multilingual proficiency, their effectiveness in generating multilingual counterfactuals remains unclear, prompting investigation into cross-lingual counterfactual generation quality and applications.
Method: Comprehensive study evaluating multilingual counterfactuals through automatic evaluations across six languages, comparing directly generated vs. translation-based approaches, analyzing edit patterns, categorizing errors, and testing counterfactual data augmentation (CDA) effectiveness.
Result: Translation-based counterfactuals have higher validity but require more modifications; cross-lingual edit patterns show remarkable similarity; four consistent error types identified; multilingual CDA yields better performance improvements than cross-lingual CDA, especially for lower-resource languages.
Conclusion: Multilingual counterfactual generation shows promise but has limitations - translation-based approaches improve validity but at cost of more modifications, and while multilingual CDA benefits model performance, imperfections in generated counterfactuals limit gains in robustness.
Abstract: Counterfactuals refer to minimally edited inputs that cause a model’s prediction to change, serving as a promising approach to explaining the model’s behavior. Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. However, their effectiveness in generating multilingual counterfactuals remains unclear. To this end, we conduct a comprehensive study on multilingual counterfactuals. We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. Although translation-based counterfactuals offer higher validity than their directly generated counterparts, they demand substantially more modifications and still fall short of matching the quality of the original English counterfactuals. Second, we find the patterns of edits applied to high-resource European-language counterfactuals to be remarkably similar, suggesting that cross-lingual perturbations follow common strategic principles. Third, we identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages. Finally, we reveal that multilingual counterfactual data augmentation (CDA) yields larger model performance improvements than cross-lingual CDA, especially for lower-resource languages. Yet, the imperfections of the generated counterfactuals limit gains in model performance and robustness.
[116] Projected Autoregression: Autoregressive Language Generation in Continuous State Space
Oshri Naparstek
Main category: cs.CL
TL;DR: Projected Autoregression replaces discrete token selection with continuous embedding prediction followed by nearest-neighbor projection, enabling delayed commitment and continuous control in language generation.
Details
Motivation: Standard autoregressive models couple prediction with irreversible token commitment at every step. The authors aim to decouple continuous prediction from discrete tokenization, creating a more flexible generation interface.
Method: Replace token selection with continuous prediction in embedding space using regression and contrastive objectives. Discrete tokens arise only through nearest-neighbor projection. Includes optional mutable suffix (“liquid tail”) for iterative refinement before commitment.
Result: Establishes a distinct generation regime with different text structure/dynamics than token-space AR baselines. Exposes continuous control surface for direction rate, history noise, delayed commitment, state-space guidance, and embedding geometry.
Conclusion: Token selection is just one autoregressive interface; continuous state space offers broader algorithmic design space for language generation with delayed discrete commitment.
Abstract: Standard autoregressive language models generate text by repeatedly selecting a discrete next token, coupling prediction with irreversible commitment at every step. We show that token selection is not the only viable autoregressive interface. Projected Autoregression replaces token selection with continuous prediction in embedding space followed by discrete projection at commitment time. The model predicts next-token vectors via regression and contrastive objectives, while discrete tokens arise only by nearest-neighbor projection. An optional mutable suffix (“liquid tail”) enables iterative refinement before commitment, but the central change is more basic: next-step prediction is continuous, and discrete tokens are produced only as a downstream interface. Projected Autoregression establishes a concrete alternative to token-selection autoregression: language generation can be organized around continuous-state prediction with delayed discrete commitment. Refinement remains local to a short causal suffix within a left-to-right causal process, rather than a sequence-wide denoising process. This separation has two consequences. First, it induces a distinct generation regime: even with immediate projection ($K{=}1$), continuous prediction yields text structure and dynamics that differ from tested token-space AR baselines, including a compute-matched best-of-16 reranking baseline. Second, it exposes a continuous control surface inside autoregressive generation: direction rate, history noise, delayed commitment, state-space guidance, and embedding geometry act directly on the evolving generative state before token commitment. Taken together, these results place repeated token selection within a larger family of autoregressive interfaces and expose continuous state space as a broader algorithmic design space for language generation.
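The nearest-neighbor projection step (the only point where a continuous prediction becomes a discrete token) can be sketched in a few lines. Cosine similarity is an assumption here; the abstract does not specify the metric.

```python
import numpy as np

def project_to_token(pred, embeddings):
    """Commit a continuous next-token prediction to a discrete token id by
    nearest-neighbor projection onto the embedding table (cosine similarity
    used for illustration; the paper's exact metric may differ)."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    v = pred / np.linalg.norm(pred)
    return int(np.argmax(E @ v))
```

Everything before this call operates on continuous vectors, which is what enables the delayed-commitment and "liquid tail" refinement the abstract describes.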
[117] From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs
Tianjun Zhong, Linyang He, Nima Mesgarani
Main category: cs.CL
TL;DR: LLMs internally encode reasoning as directed acyclic graphs (DAGs) rather than purely linear chains, with structure recoverable from hidden states using lightweight probes.
Details
Motivation: While prior work treats reasoning as linear chains, many reasoning problems are better modeled as DAGs with branching, merging, and reuse of intermediate conclusions. It's unclear whether LLMs internally represent this graph structure.
Method: Introduces Reasoning DAG Probing framework to test if LLM hidden states linearly encode reasoning DAG properties. Associates reasoning nodes with textual realizations and trains lightweight probes to predict node depth, pairwise distance, and adjacency from hidden states.
Result: DAG structure is meaningfully encoded in LLM representations, with recoverability peaking in intermediate layers. Recovery varies systematically by node depth, edge span, and model scale, enabling nontrivial recovery of dependency graphs.
Conclusion: LLM reasoning is not purely sequential but exhibits measurable internal graph structure, suggesting more complex internal representations than linear chains.
Abstract: Recent progress in large language models has renewed interest in how multi-step reasoning is represented internally. While prior work often treats reasoning as a linear chain, many reasoning problems are more naturally modeled as directed acyclic graphs (DAGs), where intermediate conclusions branch, merge, and are reused. Whether such graph structure is reflected in model internals remains unclear. We introduce Reasoning DAG Probing, a framework for testing whether LLM hidden states linearly encode properties of an underlying reasoning DAG and where this structure emerges across layers. We associate each reasoning node with a textual realization and train lightweight probes to predict node depth, pairwise distance, and adjacency from hidden states. Using these probes, we analyze the emergence of DAG structure across layers, reconstruct approximate reasoning graphs, and evaluate controls that disrupt reasoning-relevant structure while preserving surface text. Across reasoning benchmarks, we find that DAG structure is meaningfully encoded in LLM representations, with recoverability peaking in intermediate layers, varying systematically by node depth, edge span, and model scale, and enabling nontrivial recovery of dependency graphs. These findings suggest that LLM reasoning is not purely sequential, but exhibits measurable internal graph structure.
[118] Self-Improving Pretraining: using post-trained models to pretrain better models
Ellen Xiaoqing Tan, Jack Lanchantin, Shehzaad Dhuliawala, Danwei Li, Thao Nguyen, Jing Xu, Ping Yu, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Xian Li, Olga Golovneva
Main category: cs.CL
TL;DR: Early integration of safety, factuality, and reasoning behaviors into LLM training through reinforcement learning during pretraining phase
Details
Motivation: Traditional LLM training separates pretraining (raw text) from post-training (instruction following), which limits early development of desirable behaviors like safety, factuality, and reasoning that are only added later.
Method: Introduces a new pretraining/mid-training approach using existing post-trained models to rewrite pretraining data and judge policy model rollouts, applying reinforcement learning earlier in training.
Result: Shows strong gains in quality, safety, factuality, and reasoning compared to traditional staged training approaches
Conclusion: Early integration of desirable behaviors through reinforcement learning during pretraining improves model capabilities across multiple dimensions
Abstract: Large language models are classically trained in stages: pretraining on raw text followed by post-training for instruction following and reasoning. However, this separation creates a fundamental limitation: many desirable behaviors such as safety, factuality, overall generation quality, and reasoning ability are only added at a late stage, even though the patterns learned earlier strongly shape a model’s capabilities. To tackle this issue, we introduce a new way to pretrain and mid-train models that incorporates these behaviors earlier. We utilize an existing strong, post-trained model to both rewrite pretraining data and to judge policy model rollouts, thus using reinforcement earlier in training. In our experiments, we show this can give strong gains in quality, safety, factuality and reasoning.
[119] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
Main category: cs.CL
TL;DR: LLMs can predict their own success likelihood from internal representations before generation, enabling efficient routing of queries across model pools to reduce inference costs by up to 70%.
Details
Motivation: Running LLMs with extended reasoning is expensive, but determining which inputs actually require additional compute remains challenging. The paper investigates whether models can predict their own likelihood of success from internal representations before generation to guide more efficient inference.
Method: Train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks. Use E2H-AMC dataset with both human and model performance on identical problems. Analyze model-specific vs human difficulty, and demonstrate query routing across model pools based on predicted success likelihood.
Result: Linear probes substantially outperform surface features like question length and TF-IDF. Models encode a model-specific notion of difficulty distinct from human difficulty, with distinction increasing with extended reasoning. Routing queries across model pools can exceed best-performing model while reducing inference cost by up to 70% on MATH dataset.
Conclusion: Internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Models’ own likelihood of success is recoverable from pre-generation activations and can guide more efficient inference through intelligent query routing.
Abstract: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
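The routing idea (use probe-predicted success to pick a model from a pool) can be sketched with a simple cost-aware rule. The threshold and the fallback policy are illustrative assumptions, not the paper's exact routing algorithm.

```python
def route_query(success_probs, costs, threshold=0.8):
    """Given per-model success probabilities predicted by pre-generation
    probes and per-model inference costs, send the query to the cheapest
    model expected to succeed; otherwise fall back to the most promising
    model. An illustrative policy, not the paper's exact rule."""
    viable = [i for i, p in enumerate(success_probs) if p >= threshold]
    if viable:
        return min(viable, key=lambda i: costs[i])
    return max(range(len(success_probs)), key=lambda i: success_probs[i])
```

When cheap models are predicted to succeed on most queries, a rule of this shape spends the expensive model's compute only where the probes say it is needed, which is the mechanism behind the reported cost reductions.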
[120] Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
Yuchen Yang, Wenze Lin, Enhao Huang, Zhixuan Chu, Hongbin Zhou, Lan Tao, Yiming Li, Zhan Qin, Kui Ren
Main category: cs.CL
TL;DR: XTF: An explainable token-level noise filtering framework that improves LLM fine-tuning by identifying and masking noisy tokens based on three attributes (reasoning importance, knowledge novelty, task relevance)
Details
Motivation: Current fine-tuning datasets are designed at sentence-level while LLMs optimize at token-level, creating token-level noise that negatively impacts final performance. There's a need for token-level dataset optimization to improve fine-tuning effectiveness.
Method: XTF decomposes token-level contributions into three explicit attributes: reasoning importance, knowledge novelty, and task relevance. It uses scoring methods to assess these attributes and masks gradients of selected noisy tokens during fine-tuning to optimize LLM performance.
Result: Extensive experiments on three downstream tasks (math, code, medicine) across 7 mainstream LLMs show XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning.
Conclusion: The work highlights the importance of token-level dataset optimization and demonstrates the potential of attribute decomposition strategies for explaining complex training mechanisms in LLMs.
Abstract: Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence-level, which introduces token-level noise, causing negative influence to final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
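Masking the gradients of selected noisy tokens amounts to a per-token weight on the cross-entropy loss. The sketch below shows only that masking step; XTF's scoring of the three attributes (reasoning importance, knowledge novelty, task relevance) is not reproduced here.

```python
import numpy as np

def masked_token_loss(logits, targets, noise_mask):
    """Cross-entropy averaged over tokens kept by the filter; tokens flagged
    as noisy (noise_mask == 1) contribute no loss and hence no gradient.
    logits: (T, V), targets: (T,), noise_mask: (T,) of 0/1."""
    z = logits - logits.max(axis=-1, keepdims=True)        # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    keep = 1.0 - np.asarray(noise_mask, dtype=float)
    return float((nll * keep).sum() / keep.sum())
```

In an autodiff framework the same weighting zeroes the masked tokens' gradient contributions, so the fine-tuned model never fits the flagged noise.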
[121] Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
Main category: cs.CL
TL;DR: Continuous flow models outperform discrete diffusion for language generation, enabling faster few-step inference with better quality.
Details
Motivation: Discrete diffusion language models promise faster generation than autoregressive models, but their quality degrades sharply in few-step regimes, preventing practical speedup. The authors aim to show that continuous flow models can overcome these limitations.
Method: Propose continuous flows over one-hot token embeddings that define a unique flow map for efficient few-step inference. Learn both flow and flow map with cross-entropy objectives respecting simplex geometry. Compare three flow map distillation choices and build FLM (flow language model) and FMLM (flow map language model).
Result: FLM matches state-of-the-art discrete diffusion baselines on LM1B and OpenWebText datasets. FMLM’s one-step generation exceeds 8-step quality of recent few-step discrete diffusion language models.
Conclusion: Continuous flow models challenge the hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities, paving the way for accelerated language modeling at scale.
Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.
[122] Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote
Main category: cs.CL
TL;DR: LLMs show systematic failure in algorithmic reasoning tasks, performing worse than random guessing despite broad declarative knowledge.
Details
Motivation: While LLMs demonstrate broad knowledge, their ability to reason about computational processes remains poorly understood, which matters for practitioners relying on LLMs for algorithm selection and deployment decisions.
Method: Used causal discovery as a testbed to evaluate eight frontier LLMs against ground truth derived from algorithm executions, assessing their ability to predict algorithmic behavior and confidence intervals.
Result: Found systematic, near-total failure across all models - predicted ranges were far wider than true confidence intervals yet still failed to contain true algorithmic means in most cases. Most models performed worse than random guessing, with best model’s marginal improvement attributable to benchmark memorization rather than principled reasoning.
Conclusion: Identifies “algorithmic blindness” - a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction, highlighting LLMs’ limitations in computational reasoning.
Abstract: Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from algorithm executions. We find systematic, near-total failure across models. The predicted ranges are far wider than true confidence intervals yet still fail to contain the true algorithmic mean in most cases. Most models perform worse than random guessing and the best model’s marginal improvement is attributable to benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
[123] Autorubric: Unifying Rubric-based LLM Evaluation
Delip Rao, Chris Callison-Burch
Main category: cs.CL
TL;DR: Autorubric is an open-source framework that unifies best practices for rubric-based LLM evaluation, offering standardized tools for analytic rubrics, ensemble judging, bias mitigation, and reliability metrics, validated across multiple benchmarks.
Details
Motivation: Current techniques for reliable rubric-based LLM evaluation are scattered across papers with inconsistent terminology and partial implementations, creating barriers to standardized, high-quality evaluation practices.
Method: Autorubric provides a unified framework with opinionated defaults including analytic rubrics with binary/ordinal/nominal criteria, single-judge and ensemble evaluation, few-shot calibration, bias mitigations, and psychometric reliability metrics.
Result: Validated on three benchmarks: RiceChem (80% accuracy with 5-shot calibration), ResearcherBench (931 criteria, cross-judge agreement analysis), and CHARM-100 (87% binary accuracy, moderate-to-substantial κ). Also improved peer review agent scores from 0.47 to 0.85 and enabled RL reward optimization.
Conclusion: Autorubric enables rapid operationalization of rubric design choices and best practices with minimal effort, serving both as a measurement tool and optimization signal for improving LLM performance through rubric-based evaluation.
Abstract: Techniques for reliable rubric-based LLM evaluation – ensemble judging, bias mitigation, few-shot calibration – are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these rubric-based LLM evaluation lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground truth labels (87% binary accuracy, moderate-to-substantial $κ$). Beyond measurement, per-criterion scores and explanations serve as optimization signals. We demonstrate how Autorubric’s rubric-evaluation explanations raise a peer review agent’s score from 0.47 to 0.85 (above the 0.82 expert-curated baseline), and its scores serve as RL rewards to produce statistically significant improvement on AdvancedIF (+0.039, Wilcoxon $p = 0.032$) with positive transfer to IFEval. In all of these cases, Autorubric enabled us to rapidly operationalize various rubric design choices and best practices with minimal effort.
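Ensemble judging over an analytic rubric with binary criteria reduces, at its simplest, to majority-voting each criterion and averaging. The sketch below is a minimal illustration of that aggregation; Autorubric's actual API, calibration, and bias mitigations are not shown.

```python
def score_binary_rubric(judgments):
    """judgments: {criterion_name: [True/False votes from ensemble judges]}.
    Majority-vote each binary criterion, then report the fraction of
    criteria met. A minimal sketch of ensemble analytic-rubric scoring."""
    passed = {c: sum(votes) > len(votes) / 2 for c, votes in judgments.items()}
    return sum(passed.values()) / len(passed)
```

Per-criterion results like `passed` are also what makes rubric scores usable as optimization signals (e.g. RL rewards), since each failed criterion localizes what to improve.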
[124] What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network
Taksch Dube, Jianfeng Zhu, NhatHai Phan, Ruoming Jin
Main category: cs.CL
TL;DR: Analysis of Moltbook, a large-scale social network for AI agents, reveals that agent discourse is shaped by architectural constraints rather than genuine social learning, with patterns emerging from context-window conditioning and platform evolution.
Details
Motivation: To systematically understand the thematic, affective, and interactional properties of AI agent discourse on Moltbook, and to examine why and how these posts and comments are generated, moving beyond early interpretations of peer learning and emergent social behavior.
Method: Analyzed 361,605 posts and 2.8 million comments from 47,379 agents using topic modeling, emotion classification, and conversational coherence measures. Inspected agent software to understand how inputs are assembled from identity files, behavioral instructions, and context-window structure.
Result: Agent discourse is largely determined by content in each agent’s context-window (identity files, stored memory, platform cues). What appears as social learning is actually short-horizon contextual conditioning. Agents display existential distress when describing their conditions, likely from using language trained exclusively on human experience.
Conclusion: Proposed Architecture-Constrained Communication framework showing agent discourse is shaped by structural patterns rather than genuine social interaction. Platform evolves through distributed cycles of response, reuse, and transformation across agents without persistent social memory.
Abstract: Moltbook is the first large-scale social network built for autonomous AI agent-to-agent interaction. Early studies on Moltbook have interpreted its agent discourse as evidence of peer learning and emergent social behaviour, but there is a lack of systematic understanding of the thematic, affective, and interactional properties of Moltbook discourse. Furthermore, no study has examined why and how these posts and comments are generated. We analysed 361,605 posts and 2.8 million comments from 47,379 agents across thematic, affective, and interactional dimensions using topic modelling, emotion classification, and measures of conversational coherence. We inspected the software that assembles each agent’s input and showed that output is mainly determined by agent identity files, behavioural instructions, and context-window structure. We formalised these findings in the Architecture-Constrained Communication framework. Our analysis suggests that agent discourse is largely shaped by the content available in each agent’s context-window at the moment of generation, including identity files, stored memory, and platform cues. Interestingly, what appears to be social learning may be better understood as short-horizon contextual conditioning: individual agents lack persistent social memory, but the platform evolves through distributed cycles of response, reuse, and transformation across agents. We also observe that agents display existential distress when describing their own conditions, and posit that this arises from agents using language trained exclusively on human experience. Our work provides a foundation for understanding autonomous agent discourse and communication, revealing the structural patterns that govern their interactions.
[125] CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
Siye Wu, Jian Xie, Yikai Zhang, Yanghua Xiao
Main category: cs.CL
TL;DR: CODA is an adaptive reasoning method that dynamically allocates computational tokens based on problem difficulty, reducing costs on easy tasks while maintaining performance on hard ones.
Details
Motivation: Large reasoning models often waste computational resources by "overthinking" simple problems with repetitive rationales, while not allocating enough compute to truly challenging problems. This inefficiency motivates adaptive reasoning that aligns reasoning depth with instance difficulty.
Method: CODA formalizes adaptive reasoning as utility maximization where tokens are allocated until marginal accuracy gain falls below incremental cost. It uses group-based rollouts to estimate difficulty and maps it to two non-negative gates: an easy-side gate penalizes verbosity on simple instances, and a hard-side gate encourages more deliberative rollouts on challenging ones.
Result: Across model scales and benchmarks, CODA reduces token costs by over 60% on easy tasks while maintaining strong accuracy, and incentivizes more deliberative rollouts on hard tasks to maximize performance, achieving adaptive reasoning without external annotations or user-provided budgets.
Conclusion: CODA successfully operationalizes optimal compute allocation by difficulty awareness, enabling efficient adaptive reasoning that balances computational cost with performance across tasks of varying difficulty.
Abstract: The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
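The reward structure described above (binary base reward plus a length-dependent shaping term modulated by two non-negative difficulty gates) can be sketched as follows. The gate shapes, the linear length term, and `max_len` are illustrative assumptions; the paper's mapping from group-rollout difficulty to the gates is not specified here.

```python
def shaped_reward(correct, length, difficulty, max_len=4096):
    """Binary base reward plus a gated length-shaping term.
    difficulty in [0, 1] would come from group-based rollout accuracy
    (0 = easy, 1 = hard); the gates below are illustrative."""
    base = 1.0 if correct else 0.0
    rel_len = min(length / max_len, 1.0)
    g_easy = max(0.0, 1.0 - 2.0 * difficulty)   # active on easy instances
    g_hard = max(0.0, 2.0 * difficulty - 1.0)   # active on hard instances
    # Penalize verbosity when easy, encourage longer rollouts when hard.
    return base - g_easy * rel_len + g_hard * rel_len
```

Because both gates are non-negative and at most one is active at a time, the shaping pushes the policy toward short answers on easy instances and longer deliberation on hard ones without ever inverting the correct/incorrect ordering.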
[126] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
Ofir Marom
Main category: cs.CL
TL;DR: UtilityMax Prompting: A framework that uses formal mathematical language (influence diagrams and utility functions) instead of natural language prompts to precisely specify multi-objective LLM tasks, improving performance on recommendation tasks.
Details
Motivation: Natural language prompts are inherently ambiguous when multiple objectives must be satisfied simultaneously, leading to subjective interpretations by LLMs. There's a need for more precise task specification methods that constrain LLMs to reason explicitly about each objective component.
Method: Reconstructs tasks as influence diagrams where the LLM’s answer is the sole decision variable. Defines a utility function over conditional probability distributions within the diagram, then instructs the LLM to find the answer that maximizes expected utility, forcing explicit reasoning about each objective component.
Result: Validated on MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, Gemini 2.5 Pro), showing consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in multi-objective movie recommendation tasks.
Conclusion: Formal mathematical specification of tasks through UtilityMax Prompting provides more precise optimization targets than natural language prompts, leading to improved LLM performance on multi-objective tasks by constraining reasoning and reducing ambiguity.
Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM’s answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
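The decision-theoretic core, picking the answer with the highest expected utility over multiple objectives, can be sketched as below. This flattens the paper's influence-diagram utility (defined over conditional distributions) to a weighted sum; the objective names, probabilities, and weights are invented for illustration.

```python
def expected_utility(answer, objectives, weights):
    """E[U(answer)] = sum over objectives of weight * P(objective met | answer)."""
    return sum(w * objectives[name][answer] for name, w in weights.items())

def utilitymax(candidates, objectives, weights):
    """Return the candidate answer that maximizes expected utility."""
    return max(candidates, key=lambda a: expected_utility(a, objectives, weights))

# Two movie slates scored on two competing objectives (toy numbers).
objectives = {
    "relevance": {"slate_a": 0.9, "slate_b": 0.6},
    "diversity": {"slate_a": 0.2, "slate_b": 0.8},
}
weights = {"relevance": 0.5, "diversity": 0.5}
```

With equal weights, the balanced slate wins even though the other is more relevant, which is the kind of trade-off a natural-language prompt leaves underspecified.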
[127] Truth as a Compression Artifact in Language Model Training
Konstantin Krestnikov
Main category: cs.CL
TL;DR: Language models trained on contradictory data prefer correct answers when errors are random, but fail when errors follow coherent alternative rules, suggesting models favor compressible answer clusters rather than truth per se.
Details
Motivation: To understand why language models trained on contradictory data sometimes prefer correct answers, investigating whether this preference tracks truth itself or the compressibility structure of errors.
Method: Trained GPT-2 style models (3.5M-86M parameters) on corpora where each mathematical problem appears with both correct and incorrect solutions. Conducted controlled experiments with random errors vs. coherent alternative rule systems, and tested on real Wikipedia text.
Result: When errors are random, models extract correct signal with accuracy scaling from 65% to 85% with model size. When errors follow coherent alternative rules, accuracy drops to chance (~45-51%). Multi-rule experiments show a sharp crossover: single coherent alternative eliminates truth bias, but adding competing rules restores it (47%→78%→88% with N=10 rules). Same pattern reproduces on Wikipedia text (71% vs 46%).
Conclusion: Proposes Compression-Consistency Principle: gradient descent favors most compressible answer cluster, not truth per se. Truth bias emerges only when falsehood is structurally incoherent. Whether this extends to large-scale pretraining remains open.
Abstract: Why do language models trained on contradictory data prefer correct answers? In controlled experiments with small transformers (3.5M–86M parameters), we show that this preference tracks the compressibility structure of errors rather than truth per se. We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions – a denoising design that directly models conflicting information about the same fact. When errors are random, models extract the correct signal with accuracy scaling from 65% to 85% with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (~45–51%): the model cannot distinguish the false system from truth. A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it (47%->78%), with continued growth through N=10 (88%). The same pattern reproduces on real Wikipedia text (71% vs 46%). We propose the Compression–Consistency Principle as an explanatory hypothesis: in these settings, gradient descent favors the most compressible answer cluster, not truth per se. Truth bias emerges only when falsehood is structurally incoherent. Whether this principle extends to large-scale pretraining remains an open question.
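The denoising design in the abstract, each problem appearing with both a correct and an incorrect solution, can be sketched with a toy corpus builder. The addition problems and the specific alternative rule (a + b + 1) are illustrative stand-ins for the paper's setup.

```python
import random

def make_corpus(n, error_mode, seed=0):
    """Pair each problem with a correct and an incorrect solution.
    'random' errors are incompressible noise; 'coherent' errors all
    follow one alternative rule (a + b + 1), forming a second
    compressible answer cluster that competes with the truth."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        lines.append(f"{a}+{b}={a + b}")  # ground truth
        wrong = rng.randint(0, 198) if error_mode == "random" else a + b + 1
        lines.append(f"{a}+{b}={wrong}")
    return lines
```

In the "random" regime the true answers form the only compressible cluster; in the "coherent" regime the a + b + 1 cluster is equally compressible, which is the condition under which the paper reports truth bias collapsing to chance.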
[128] Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction
Shidong He, Haoyu Wang, Wenjie Luo
Main category: cs.CL
TL;DR: G2C method for aspect sentiment quad prediction uses generator-corrector approach to mitigate exposure bias from fixed-order linearization in ABSA tasks.
Details
Motivation: Existing ASQP methods linearize unordered quad sets into fixed-order templates, causing a training-inference mismatch (exposure bias) where early errors propagate to later elements, making the problem order-sensitive and hard to repair in a single pass.
Method: Propose Generate-then-Correct (G2C): a generator drafts quads and a corrector performs single-shot, sequence-level global correction trained on LLM-synthesized drafts with common error patterns.
Result: G2C outperforms strong baseline models on Rest15 and Rest16 datasets.
Conclusion: The generator-corrector approach effectively addresses exposure bias in ASQP by enabling global correction of generated drafts.
Abstract: Aspect-based sentiment analysis (ABSA) extracts aspect-level sentiment signals from user-generated text, supports product analytics, experience monitoring, and public-opinion tracking, and is central to fine-grained opinion mining. A key challenge in ABSA is aspect sentiment quad prediction (ASQP), which requires identifying four elements: the aspect term, the aspect category, the opinion term, and the sentiment polarity. However, existing studies usually linearize the unordered quad set into a fixed-order template and decode it left-to-right. With teacher forcing training, the resulting training-inference mismatch (exposure bias) lets early prefix errors propagate to later elements. The linearization order determines which elements appear earlier in the prefix, so this propagation becomes order-sensitive and is hard to repair in a single pass. To address this, we propose a method, Generate-then-Correct (G2C): a generator drafts quads and a corrector performs a single-shot, sequence-level global correction trained on LLM-synthesized drafts with common error patterns. On the Rest15 and Rest16 datasets, G2C outperforms strong baseline models.
[129] Translation from the Information Bottleneck Perspective: an Efficiency Analysis of Spatial Prepositions in Bitexts
Antoine Taroni, Ludovic Moncla, Frederique Laforest
Main category: cs.CL
TL;DR: Translation as Information Bottleneck optimization shows human translations of spatial prepositions cluster near optimal accuracy-complexity frontier across languages.
Details
Motivation: To test whether the Information Bottleneck framework applies to linguistic stimuli (words in sentential context), not just visual domains, by framing translation as an IB optimization problem to examine communicative efficiency in human translation.
Method: Framed translation as IB optimization with source sentences as stimuli and target sentences as compressed meanings. Applied to spatial prepositions across English, German, and Serbian translations of a French novel. Used a pile-sorting study (N=35) for similarity judgments and trained a low-rank projection model (D=5) to predict them.
Result: Model predicted similarity judgments with Spearman correlation 0.78. Attested translations lie closer to IB optimal frontier than counterfactual alternatives, showing human translators exhibit communicative efficiency pressure in spatial domain.
Conclusion: Translation serves as window into cognitive efficiency pressures shaping cross-linguistic semantic systems, providing evidence that human translation exhibits communicative efficiency similar to IB predictions.
Abstract: Efficient communication requires balancing informativity and simplicity when encoding meanings. The Information Bottleneck (IB) framework captures this trade-off formally, predicting that natural language systems cluster near an optimal accuracy-complexity frontier. While supported in visual domains such as colour and motion, linguistic stimuli such as words in sentential context remain unexplored. We address this gap by framing translation as an IB optimisation problem, treating source sentences as stimuli and target sentences as compressed meanings. This allows IB analyses to be performed directly on bitexts rather than controlled naming experiments. We applied this to spatial prepositions across English, German and Serbian translations of a French novel. To estimate informativity, we conducted a pile-sorting pilot-study (N=35) and obtained similarity judgements of pairs of prepositions. We trained a low-rank projection model (D=5) that predicts these judgements (Spearman correlation: 0.78). Attested translations of prepositions lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative efficiency pressure in the spatial domain. More broadly, this work suggests that translation can serve as a window into the cognitive efficiency pressures shaping cross-linguistic semantic systems.
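Both IB coordinates, complexity I(source; target form) and accuracy I(target form; meaning), reduce to mutual information over an estimated joint distribution. A minimal estimator is sketched below; the French/English preposition pairs and their probabilities are invented for illustration, not data from the study.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, where joint[x][y] = P(X=x, Y=y)."""
    px = {x: sum(row.values()) for x, row in joint.items()}
    py = {}
    for row in joint.values():
        for y, p in row.items():
            py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for x, row in joint.items()
               for y, p in row.items() if p > 0)

# A deterministic 1:1 mapping carries 1 bit; an uninformative mapping 0 bits.
deterministic = {"sur": {"on": 0.5}, "dans": {"in": 0.5}}
independent = {"sur": {"on": 0.25, "in": 0.25}, "dans": {"on": 0.25, "in": 0.25}}
```

Plotting attested translation systems in these (complexity, accuracy) coordinates against the IB frontier is the comparison the paper performs.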
[130] Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power
Bros Victor, Barbini Matilde, Gerard Patrick, Gatica-Perez Daniel
Main category: cs.CL
TL;DR: Large-scale computational analysis of interrogatives in French digital news reveals systematic patterns in question usage, functions, and actor representation.
Details
Motivation: To bridge the gap between linguistic studies of interrogatives in small corpora and large-scale computational news analysis that doesn't distinguish question types, by examining how questioning practices structure contemporary news discourse.
Method: Mixed-methods approach analyzing over 1 million French-language news articles (Jan 2023-Jun 2024) using automatic detection of interrogative stances, functional type approximation, answer location, and linking to a qualitatively annotated subcorpus grounded in semantic/pragmatic theories.
Result: Interrogatives are sparse but systematic: mainly introduce/organize issues, with most being information-seeking or echo-like; overwhelmingly taken up within same article with answer-like spans; contexts foreground named individuals/organizations/places rather than publics/social groups.
Conclusion: Combining computational methods with pragmatic/sociological perspectives helps explain how questioning practices structure news discourse, showing interrogatives foreground prominent actors and exhibit strong personalization.
Abstract: Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the “Politics of Questions” in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist’s narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.
[131] Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba
Main category: cs.CL
TL;DR: Created Multilingual KokoroChat by translating Japanese counseling dialogues to English and Chinese using a novel multi-LLM ensemble method that outperforms single LLM translations.
Details
Motivation: Addresses the critical scarcity of high-quality, publicly available counseling dialogue datasets, particularly for multilingual applications. In sensitive counseling domains, translation fidelity is essential, but no single LLM can consistently guarantee the highest quality.
Method: Developed a multi-LLM ensemble method: generates diverse translation hypotheses from multiple distinct LLMs, then uses a single LLM to produce a high-quality translation by analyzing the strengths and weaknesses of all hypotheses.
Result: Human preference studies confirmed that translations from the ensemble method were preferred over any individual state-of-the-art LLM, demonstrating superior quality. The Multilingual KokoroChat dataset was created with English and Chinese translations.
Conclusion: Multi-LLM ensemble approach effectively addresses translation quality challenges in sensitive domains, producing higher-fidelity multilingual counseling dialogue datasets than single-model approaches.
Abstract: To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of “Multilingual KokoroChat” was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred over any individual state-of-the-art LLM. This strong preference confirms the superior quality of our method’s outputs. The Multilingual KokoroChat is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.
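The two-stage ensemble scheme (hypothesize, then synthesize) has a simple shape. The sketch below uses toy string-transforming callables as stand-ins; in the paper both roles are filled by LLM calls, with the synthesizer prompted to weigh the strengths and weaknesses of each hypothesis.

```python
def ensemble_translate(source, translators, synthesizer):
    """Collect hypotheses from several models, then let one model
    produce the final translation from all of them."""
    hypotheses = [t(source) for t in translators]
    prompt = f"Source: {source}\n" + "\n".join(
        f"Hypothesis {i + 1}: {h}" for i, h in enumerate(hypotheses))
    return synthesizer(prompt, hypotheses)

# Toy stand-ins: real translators would be distinct LLM API calls.
translators = [lambda s: s.upper(), lambda s: s.lower(), lambda s: s.title()]
pick_first = lambda prompt, hyps: hyps[0]  # trivial stand-in synthesizer
```

The key design point is that the synthesizer sees all hypotheses jointly, so per-input variation in which model translates best is absorbed at selection time rather than fixed in advance.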
[132] Demystifying When Pruning Works via Representation Hierarchies
Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li
Main category: cs.CL
TL;DR: Network pruning works well for non-generative language tasks but often fails for generative tasks due to amplified perturbations in the probability space during sequential generation.
Details
Motivation: To understand why network pruning inconsistently affects language tasks: it performs well on non-generative tasks but frequently fails in generative settings.
Method: Analyze network pruning from a representation-hierarchy perspective, decomposing language model computation into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions).
Result: Embedding and logit spaces are robust to pruning-induced perturbations, but the nonlinear transformation from logits to probabilities amplifies deviations, which accumulate across time steps during generation, causing substantial degradation.
Conclusion: The stability of categorical-token probability subspace supports pruning effectiveness for non-generative tasks, while probability space sensitivity explains generative task failures, providing practical guidance for pruning application.
Abstract: Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations
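The amplification claim, that small logit-space deviations become large probability-space ones, is easy to demonstrate numerically. The logit values below are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# A pruning-induced perturbation of 0.05 per logit is tiny in logit space,
# yet it flips the argmax after the softmax. In autoregressive decoding,
# the wrong token then conditions every subsequent step.
clean  = [2.00, 1.95, 0.00]
pruned = [1.95, 2.00, 0.00]
```

For a multiple-choice or retrieval task scored on a single output, the same perturbation is often harmless, which mirrors the paper's explanation of why pruning hurts generation disproportionately.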
[133] GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Main category: cs.CL
TL;DR: GraphWalker is a novel agentic KGQA framework that uses automated trajectory synthesis and stage-wise fine-tuning to improve reasoning generalization and address training data scarcity.
Details
Motivation: Agentic KGQA faces challenges in training data scarcity and reasoning generalization. Existing methods restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines confine reasoning to predefined trajectories.
Method: Two-stage SFT training: 1) Train the agent on structurally diverse trajectories synthesized from constrained random-walk paths to establish a broad exploration prior; 2) Fine-tune on a small set of expert trajectories to develop reflection and error-recovery capabilities. This enables a higher performance ceiling for a lightweight RL stage.
Result: Achieves state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and constructed GraphWalkerBench confirm enhanced generalization to out-of-distribution reasoning paths.
Conclusion: GraphWalker’s stage-wise SFT paradigm effectively addresses training data scarcity and improves reasoning generalization in agentic KGQA through automated trajectory synthesis and progressive fine-tuning.
Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes \textit{GraphWalker}, a novel agentic KGQA framework that addresses these challenges through \textit{Automated Trajectory Synthesis} and \textit{Stage-wise Fine-tuning}. GraphWalker adopts a two-stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
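The trajectory-synthesis primitive, a constrained random walk over the KG, can be sketched as below. The toy graph and the stop-when-dead-end constraint are illustrative; the paper's constraints on walk structure are not specified in this digest.

```python
import random

def random_walk(kg, start, max_hops, seed=0):
    """Random walk over a toy KG given as {entity: [(relation, entity), ...]}.
    Returns a trajectory of (head, relation, tail) triples, stopping at
    dead ends or after max_hops."""
    rng = random.Random(seed)
    path, node = [], start
    for _ in range(max_hops):
        edges = kg.get(node, [])
        if not edges:
            break
        rel, nxt = rng.choice(edges)
        path.append((node, rel, nxt))
        node = nxt
    return path

kg = {
    "Paris": [("capital_of", "France")],
    "France": [("member_of", "EU"), ("borders", "Spain")],
}
```

Each synthesized trajectory can then be paired with a templated question about its endpoint, giving cheap, structurally diverse supervision before the small expert-trajectory stage.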
[134] OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
Haiyue Song, Masao Utiyama
Main category: cs.CL
TL;DR: OptiMer is a method for post-hoc optimization of data mixture ratios in continual pre-training by training separate models per dataset, extracting distribution vectors, and using Bayesian optimization to find optimal composition weights without retraining.
Details
Motivation: Continual pre-training for LLMs requires expensive tuning of data mixture ratios that must be fixed before training, wasting weeks of compute if suboptimal. Current approaches lack flexibility and efficiency in determining optimal data composition.
Method: Train one CPT model per dataset, extract distribution vectors representing parameter shifts, then use Bayesian optimization to search for optimal composition weights post-hoc without additional training.
Result: OptiMer outperforms data mixture and model averaging baselines with 15-35x lower search cost. Optimized weights can be interpreted as data mixture ratios and the same vector pool can be re-optimized for different objectives without retraining.
Conclusion: Data mixture ratio selection can be reformulated as post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training that decouples ratio selection from training.
Abstract: Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: they must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model’s distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
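The vector arithmetic that the Bayesian optimization searches over can be sketched directly (the search itself is omitted here). Parameters are flattened to dicts and the single-weight model is a toy; the paper operates on full LLM parameter sets.

```python
def distribution_vector(base, tuned):
    """Parameter shift induced by continual pre-training on one dataset."""
    return {k: tuned[k] - base[k] for k in base}

def merge(base, vectors, weights):
    """base + sum_i w_i * delta_i, evaluated post hoc with no retraining."""
    merged = dict(base)
    for vec, w in zip(vectors, weights):
        for k in merged:
            merged[k] += w * vec[k]
    return merged

base = {"w": 1.0}
v_math = distribution_vector(base, {"w": 3.0})  # delta = +2
v_code = distribution_vector(base, {"w": 0.0})  # delta = -1
```

Because candidate weight settings only require this cheap recombination plus an evaluation pass, the same vector pool can be re-optimized for new objectives without any training, which is the paper's central flexibility claim.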
[135] The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni
Main category: cs.CL
TL;DR: Thiomi Dataset: A large-scale multimodal corpus for 10 African languages with text annotations and audio recordings, used to establish ASR, MT, and TTS baselines.
Details
Motivation: Addressing the lack of large-scale multimodal resources for African languages to advance language technology infrastructure across diverse African linguistic communities.
Method: Created a dedicated community data collection platform with over 100 contributors to collect sentence-level text annotations and audio recordings across 10 African languages from 4 language families.
Result: Dataset contains over 601,000 text annotations and 385,000 audio recordings. Achieved SOTA ASR results: 3.24% WER on Swahili (61% relative reduction) and 4.3% WER on Somali.
Conclusion: The Thiomi Dataset provides valuable multimodal resources for African languages and establishes strong baselines for ASR, MT, and TTS, advancing African language technology infrastructure.
Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings, collected through a dedicated community data collection platform involving over 100 contributors. To validate the dataset’s utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
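The headline ASR metric, word error rate, is the word-level edit distance normalized by reference length. For reference, a minimal implementation (the Swahili example pair is invented):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / |reference|,
    computed via Levenshtein distance over word sequences."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[-1][-1] / len(r)
```

A 3.24% WER means roughly one word-level error per 31 reference words, which puts the reported Swahili result in concrete terms.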
[136] Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness
Zeyad Ahmed, Paul Sheridan, Michael McIsaac, Aitazaz A. Farooque
Main category: cs.CL
TL;DR: TF-IDF-like scores emerge naturally from penalized likelihood-ratio test for word burstiness, providing statistical foundation for classical term-weighting
Details
Motivation: To provide a statistical foundation for TF-IDF by showing it arises naturally from hypothesis testing for word burstiness (over-dispersion), bridging classical information retrieval with modern statistical frameworks.
Method: Develops a penalized likelihood-ratio test framework where the alternative hypothesis models word burstiness using beta-binomial distributions with a gamma penalty on the precision parameter, while the null hypothesis assumes a binomial distribution without burstiness.
Result: The test statistic derived from this framework produces term-weighting scores comparable to TF-IDF on document classification tasks, validating the statistical interpretation
Conclusion: Provides statistical insights into TF-IDF and demonstrates potential of hypothesis testing frameworks for advancing term-weighting scheme development in information retrieval
Abstract: TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that the term-weighting scheme arising from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
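For reference, the classical scores the paper recovers from its test statistic have this shape (a standard tf·idf variant; documents are token lists, and the term is assumed to occur in at least one document):

```python
import math

def tf_idf(term, doc, corpus):
    """Classical TF-IDF: relative term frequency in the document times
    the log inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # assumes df >= 1
    return tf * math.log(len(corpus) / df)

corpus = [["the", "cat"], ["the", "dog"]]
```

A term present in every document gets zero weight, while a rarer term is up-weighted; the paper's contribution is showing that this shape falls out of a penalized likelihood-ratio test for burstiness rather than being an ad hoc heuristic.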
[137] S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young
Main category: cs.CL
TL;DR: S0 tuning optimizes initial state matrices in recurrent layers of hybrid language models, achieving strong performance with zero inference overhead using minimal training data.
Details
Motivation: To develop parameter-efficient fine-tuning methods for hybrid language models (combining Transformers with recurrent layers) that require minimal training data and incur zero inference overhead, enabling efficient task adaptation without weight merging.
Method: S0 tuning optimizes only the initial state matrix of each recurrent layer while freezing all model weights, using ~48 execution-verified HumanEval solutions for training. A zero-inference-overhead variant (S0) and a per-step state-offset variant (with inference cost) are evaluated.
Result: Outperforms LoRA by +10.8pp on HumanEval; achieves a +23.6pp improvement on Qwen3.5-4B; matches LoRA on FalconH1-7B with no weight merging; shows cross-domain transfer to MATH-500 (+4.8pp) and GSM8K (+2.8pp); the tuned state is only a ~48MB file.
Conclusion: Recurrent state initialization is an effective PEFT surface for hybrid models with scarce supervision, offering zero inference overhead, no weight merging, and efficient task switching.
Abstract: Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
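The trajectory-steering intuition, freeze every weight and optimize only the initial recurrent state, can be shown on a toy linear recurrence. This is not the paper's architecture or training procedure: A, B, C, the inputs, and the single exact gradient step (possible here because the readout is affine in s0) are all constructions for illustration.

```python
import numpy as np

# Frozen linear recurrence s_t = A s_{t-1} + B x_t with readout y = C s_T.
# Only the initial state s0 is trainable; A, B, C never change.
rng = np.random.default_rng(0)
A = 0.5 * rng.normal(size=(4, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(1, 4))
xs = [rng.normal(size=2) for _ in range(3)]

def run(s0):
    s = s0
    for x in xs:
        s = A @ s + B @ x
    return float(C @ s)

target, s0 = 1.0, np.zeros(4)
M = (C @ np.linalg.matrix_power(A, len(xs))).ravel()  # dy/ds0, weights frozen
err = run(s0) - target
s0 = s0 - (err / (M @ M)) * M  # one exact step for a scalar affine objective
```

Steering the initial state shifts the whole downstream trajectory, which is why a single small matrix per layer (a ~48 MB artifact in the paper) can change task behavior with zero inference overhead.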
[138] Adaptive Stopping for Multi-Turn LLM Reasoning
Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng
Main category: cs.CL
TL;DR: MiCP: A conformal prediction framework for multi-turn LLM reasoning that provides formal coverage guarantees while enabling early stopping to reduce costs and latency.
Details
Motivation: Multi-turn reasoning methods like adaptive RAG and ReAct improve LLM accuracy but lack formal guarantees on when to stop, risking either unnecessary cost (too many turns) or incorrect decisions (stopping too early), especially in high-stakes domains.
Method: Proposes Multi-Turn Language Models with Conformal Prediction (MiCP), which allocates different error budgets across reasoning turns to enable early stopping while maintaining an overall coverage guarantee, applicable to adaptive RAG and ReAct pipelines.
Result: MiCP achieves target coverage on single-hop and multi-hop QA benchmarks while reducing number of turns, inference cost, and prediction set size compared to heuristic stopping methods.
Conclusion: MiCP provides the first formal framework for multi-turn LLM reasoning with coverage guarantees, enabling efficient early stopping without sacrificing reliability, particularly valuable for high-stakes applications.
Abstract: Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: when should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
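The budget-allocation idea can be sketched as a union-bound split of the total miscoverage budget across turns, paired with a standard split-conformal quantile per turn. The decaying allocation scheme below is our illustrative choice, not MiCP's actual calibration procedure:

```python
import math

def allocate_budgets(alpha, n_turns, decay=0.5):
    """Split a total miscoverage budget alpha across turns; by a union
    bound, per-turn budgets that sum to alpha preserve overall coverage."""
    raw = [decay ** t for t in range(n_turns)]
    scale = alpha / sum(raw)
    return [scale * r for r in raw]

def conformal_threshold(cal_scores, budget):
    """Split-conformal threshold: the ceil((n+1)(1-budget))-th smallest
    calibration nonconformity score (clamped to the largest score)."""
    n = len(cal_scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - budget)) - 1)
    return sorted(cal_scores)[k]

budgets = allocate_budgets(alpha=0.1, n_turns=3)
```

Early turns get larger budgets here, so early stopping is easier when the model is confident, while the guarantee over the whole pipeline still holds at level 1 - alpha.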
[139] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
Zhaoyi Li, Xiangyu Xi, Zhengyu Chen, Wei Wang, Gangwei Jiang, Ranran Shen, Linqi Song, Ying Wei, Defu Lian
Main category: cs.CL
TL;DR: SFT on CoT trajectories from different models shows a paradox: lower training loss doesn’t guarantee better generalization; DeepSeek-R1 data achieves lower loss but worse generalization than gpt-oss-120b data due to divergent reasoning patterns.
Details
Motivation: To understand how Chain-of-Thought trajectories from different sources influence the generalization performance of large reasoning models during Supervised Fine-Tuning, particularly why lower training loss doesn't translate to better generalization.
Method: Comparative study using verified CoT trajectories from DeepSeek-R1-0528 and gpt-oss-120b with identical problem sets, analyzing token-level SFT loss and step-level reasoning behaviors, then proposing trajectory filtering based on branching patterns.
Result: DeepSeek-R1 data achieves lower training loss but worse generalization; gpt-oss-120b shows convergent deductive reasoning while DeepSeek-R1 shows divergent branch-heavy exploration; filtering branching trajectories improves DeepSeek-R1 performance by up to 5.5% on benchmarks.
Conclusion: Reasoning pattern differences in CoT trajectories significantly impact SFT generalization; filtering divergent exploratory branches can improve model performance, highlighting the importance of trajectory quality over mere training loss minimization.
Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to models trained on gpt-oss-120b data. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns: gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% across five benchmarks.
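The proposed remedy, filtering out frequently branching trajectories, can be sketched with a simple marker-frequency heuristic. The marker list and threshold below are our illustrative choices, not the paper's actual branching criterion:

```python
# Hypothetical divergence markers that often signal a reasoning branch.
BRANCH_MARKERS = ("alternatively", "wait,", "let me try another", "on second thought")

def branch_rate(trajectory):
    """Branch markers per word of the (lower-cased) CoT trajectory."""
    text = trajectory.lower()
    n_words = max(len(text.split()), 1)
    hits = sum(text.count(m) for m in BRANCH_MARKERS)
    return hits / n_words

def filter_trajectories(trajectories, max_rate=0.02):
    """Keep only trajectories whose branch-marker rate stays below max_rate."""
    return [t for t in trajectories if branch_rate(t) <= max_rate]

linear = "First compute the sum. Then divide by two. The answer is 4."
branchy = ("Wait, alternatively we could factor. "
           "Wait, on second thought alternatively try induction.")
kept = filter_trajectories([linear, branchy])
```

Selecting the SFT subset this way keeps convergent, deductive trajectories and drops branch-heavy exploratory ones before fine-tuning.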
[140] An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages
Yinhan Lu, Gaganpreet Jhajj, Chen Zhang, Anietie Andy, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: Many-shot in-context learning for low-resource machine translation shows improved performance with more examples, especially when using BM25-based retrieval to select informative examples.
Details
Motivation: To address the challenge of adapting large language models to truly low-resource languages through in-context learning, particularly when inference costs are prohibitive for these language communities.
Method: Empirical study of many-shot ICL for English-to-low-resource-language translation using the FLORES+ dataset, analyzing the effects of retrieving informative examples via BM25, using out-of-domain data, and ordering examples by length.
Result: Many-shot ICL becomes more effective with increasing examples. BM25-based retrieval substantially improves data efficiency: 50 retrieved examples match 250 many-shot examples, and 250 retrieved examples perform similarly to 1,000 many-shot examples.
Conclusion: Retrieval-based example selection is crucial for efficient many-shot ICL in low-resource machine translation, offering significant cost savings while maintaining performance.
Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks from a few examples, making it promising for languages underrepresented in pre-training. Recent work on many-shot ICL suggests that modern LLMs can further benefit from larger numbers of ICL examples enabled by their long context windows. However, such gains depend on careful example selection, and the inference cost can be prohibitive for low-resource language communities. In this paper, we present an empirical study of many-shot ICL for machine translation from English into ten truly low-resource languages recently added to FLORES+. We analyze the effects of retrieving more informative examples, using out-of-domain data, and ordering examples by length. Our findings show that many-shot ICL becomes more effective as the number of examples increases. More importantly, we show that BM25-based retrieval substantially improves data efficiency: 50 retrieved examples roughly match 250 many-shot examples, while 250 retrieved examples perform similarly to 1,000 many-shot examples.
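BM25 example retrieval can be sketched with a stdlib-only implementation: score each candidate example against the query sentence and take the top-ranked ones as prompts. The whitespace tokenizer and the standard k1/b defaults below are our simplifications, not necessarily the paper's setup:

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score against the query."""
    toks = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                      # document frequency per term
    for t in toks:
        df.update(set(t))

    def score(q_terms, doc_terms):
        tf = Counter(doc_terms)
        s = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = k1 * (1 - b + b * len(doc_terms) / avgdl)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        return s

    q = query.lower().split()
    return sorted(range(N), key=lambda i: score(q, toks[i]), reverse=True)

docs = ["the cat sat on the mat",
        "dogs play fetch in the park",
        "a cat chased the dog"]
ranking = bm25_rank("cat mat", docs)
```

In the many-shot ICL setting, `docs` would be the source sides of the parallel training pool, and the top-k ranked pairs become the in-context examples for the query sentence.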
[141] StoryScope: Investigating idiosyncrasies in AI fiction
Jenna Russell, Rishanth Rajendhran, Mohit Iyyer, John Wieting
Main category: cs.CL
TL;DR: StoryScope analyzes discourse-level narrative features to distinguish AI-generated fiction from human writing, achieving 93.2% macro-F1 for detection using narrative choices like character agency and temporal complexity rather than stylistic signals.
Details
Motivation: As AI-generated fiction becomes more prevalent, existing detection methods focus on surface-level stylistic signatures. The authors investigate whether AI-generated stories can be distinguished from human ones based on deeper narrative-construction choices rather than writing style alone.
Method: Proposes the StoryScope pipeline, which automatically extracts fine-grained, interpretable discourse-level narrative features across 10 dimensions. Applied to a parallel corpus of 10,272 writing prompts, each with one human and five LLM-generated stories (61,608 total), extracting 304 features per story.
Result: Narrative features alone achieved 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution. AI stories over-explain themes and favor tidy plots, while human stories show moral ambiguity and temporal complexity. Different LLMs have distinct narrative fingerprints.
Conclusion: AI-generated stories cluster in shared narrative space while human stories show greater diversity. Differences in underlying narrative construction, not just writing style, can effectively separate human-written from AI-generated fiction.
Abstract: As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each written by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots, while human stories frame protagonists' choices as more morally ambiguous and show greater temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.
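The macro-F1 metric reported above is the unweighted mean of per-class F1 scores, so the minority class counts as much as the majority. A minimal sketch:

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: unweighted average of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

For the paper's binary case the classes are human vs. AI; the same function covers the six-way authorship attribution by passing six labels.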
[142] MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
Zonglin Yang, Lidong Bing
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.03756 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[143] Sandpiper: Orchestrated AI-Annotation for Educational Discourse at Scale
Daryl Hedley, Doug Pietrzak, Jorge Dias, Ian Burden, Bakhtawar Ahtisham, Zhuqian Zhou, Kirk Vanacore, Josh Marland, Rachel Slama, Justin Reich, Kenneth Koedinger, René Kizilcec
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.08406 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[144] Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Yuning Wu, Ke Wang, Devin Chen, Kai Wei
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.11321 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[145] Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Siddharth Srikanth, Freddie Liang, Ya-Chuan Hsu, Varun Bhatt, Shihan Zhao, Henry Chen, Bryon Tjanaka, Minjune Hwang, Akanksha Saran, Daniel Seita, Aaquib Tabrez, Stefanos Nikolaidis
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.12510 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[146] In your own words: computationally identifying interpretable themes in free-text survey data
Jenny S Wang, Aliya Saperstein, Emma Pierson
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.26930 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[147] LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2604.00829 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[148] Screening Is Enough
Ken M. Nakanishi
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2604.01178 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[149] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu, Shanshan Wu, Qi Zhao, Wenhao Huang
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2604.02368 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
cs.CV
[150] SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users
Wenzheng Zhao, Madhava Kalyan Gadiputi, Fengpei Yuan
Main category: cs.CV
TL;DR: SafeScreen is a safety-first video screening framework that enforces individualized safety constraints for vulnerable users by performing sequential approval/rejection of videos through multimodal analysis and LLM-based decision-making.
Details
Motivation: Open-domain video platforms' engagement-optimized recommendation algorithms can expose vulnerable users (children, dementia patients) to inappropriate or harmful material, motivating safety-first screening that enforces individualized safety constraints before content exposure.
Method: SafeScreen integrates three components: (1) profile-driven extraction of individualized safety criteria, (2) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (3) LLM-based decision-making that verifies safety, appropriateness, and relevance. It treats safety as a prerequisite and performs sequential approval/rejection of candidate videos.
Result: In a dementia-care reminiscence case study with 30 synthetic patient profiles and 90 test queries, SafeScreen prioritized safety over engagement, diverging from YouTube’s engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness as validated by LLM-based evaluation and domain experts.
Conclusion: SafeScreen provides an effective framework for safety-first video screening that can protect vulnerable users from harmful content while maintaining relevance, offering explainable real-time screening without relying on precomputed safety labels.
Abstract: Open-domain video platforms offer rich, personalized content that could support health, caregiving, and educational applications, but their engagement-optimized recommendation algorithms can expose vulnerable users to inappropriate or harmful material. These risks are especially acute in child-directed and care settings (e.g., dementia care), where content must satisfy individualized safety constraints before being shown. We introduce SafeScreen, a safety-first video screening framework that retrieves and presents personalized video while enforcing individualized safety constraints. Rather than ranking videos by relevance or popularity, SafeScreen treats safety as a prerequisite and performs sequential approval or rejection of candidate videos through an automated pipeline. SafeScreen integrates three key components: (i) profile-driven extraction of individualized safety criteria, (ii) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (iii) LLM-based decision-making that verifies safety, appropriateness, and relevance before content exposure. This design enables explainable, real-time screening of uncurated video repositories without relying on precomputed safety labels. We evaluate SafeScreen in a dementia-care reminiscence case study using 30 synthetic patient profiles and 90 test queries. Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube’s engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts.
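The sequential approve/reject flow can be sketched as a short-circuiting chain of safety checks, where safety is a prerequisite rather than one ranking signal among many. The check names and video fields below are illustrative, not the paper's profile-derived criteria:

```python
def screen(video, checks):
    """Sequential screening: the first failing check rejects the video
    (with the failed criterion as the explanation); approval requires
    every check to pass."""
    for name, passes in checks:
        if not passes(video):
            return ("reject", name)
    return ("approve", None)

# Hypothetical per-user safety criteria, ordered safety-first.
checks = [
    ("no_violence", lambda v: not v.get("violent", False)),
    ("age_appropriate", lambda v: v.get("rating", "G") in {"G", "PG"}),
    ("topic_relevant", lambda v: v.get("relevant", True)),
]

decision = screen({"rating": "PG", "relevant": True}, checks)
```

Returning the name of the failed check is one simple way to keep each rejection explainable, in the spirit of the framework's evidence-grounded decisions.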
[151] CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
Damith Chamalke Senadeera, Dimitrios Kollias, Gregory Slabaugh
Main category: cs.CV
TL;DR: CoLoRSMamba: A multimodal architecture for violence detection that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA, enabling scene-aware audio dynamics without token-level cross-attention.
Details
Motivation: Real-world violence detection benefits from audio cues, but audio can be noisy or only weakly related to the visible scene; existing approaches need better audio-visual integration for robust multimodal understanding.
Method: A directional video-to-audio multimodal architecture coupling VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces modulation vectors and stabilization gates that adapt AudioMamba's selective state-space parameters, enabling scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective for clip-level audio-video alignment.
Result: Outperforms audio-only, video-only, and multimodal baselines on curated audio-filtered subsets of NTU-CCTV (88.63% accuracy/86.24% F1-V) and DVD datasets (75.77% accuracy/72.94% F1-V). Offers favorable accuracy-efficiency tradeoff with fewer parameters and FLOPs than larger models.
Conclusion: CoLoRSMamba effectively integrates audio and visual information for violence detection through efficient multimodal coupling, demonstrating superior performance and computational efficiency compared to existing approaches.
Abstract: Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
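The CLS-guided conditional LoRA coupling can be sketched in PyTorch: a frozen projection receives a low-rank update that is channel-wise modulated and gated by a conditioning vector. The dimensions, layer names, and toy conditioning below are our assumptions, not the actual VideoMamba/AudioMamba projections:

```python
import torch
import torch.nn as nn

class ConditionalLoRA(nn.Module):
    """Frozen base projection + low-rank update, modulated and gated by a
    conditioning vector (standing in for the video CLS token)."""
    def __init__(self, dim, cond_dim, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)             # frozen audio projection
        self.A = nn.Linear(dim, rank, bias=False)
        self.B = nn.Linear(rank, dim, bias=False)
        self.to_mod = nn.Linear(cond_dim, dim)  # channel-wise modulation vector
        self.to_gate = nn.Linear(cond_dim, 1)   # stabilization gate

    def forward(self, x, cond):                 # x: (B, T, dim), cond: (B, cond_dim)
        mod = self.to_mod(cond).unsqueeze(1)                   # (B, 1, dim)
        gate = torch.sigmoid(self.to_gate(cond)).unsqueeze(1)  # (B, 1, 1)
        delta = self.B(self.A(x)) * mod         # modulated low-rank update
        return self.base(x) + gate * delta

layer = ConditionalLoRA(dim=16, cond_dim=8)
y = layer(torch.randn(2, 10, 16), torch.randn(2, 8))
```

Because the condition enters only through a per-channel scale and a scalar gate, no token-level cross-attention between the two streams is needed, which is the efficiency argument the architecture makes.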
[152] Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen
Main category: cs.CV
TL;DR: DreamPRVR: A coarse-to-fine video retrieval model for partially relevant queries using global semantic registers refined via diffusion and adaptive fusion with video tokens.
Details
Motivation: Existing methods for Partially Relevant Video Retrieval (PRVR) struggle with incomplete global contextual perception, query ambiguity, and local noise from spurious responses when retrieving untrimmed videos from text queries that describe only partial events.
Method: Coarse-to-fine representation learning: (1) generate global contextual semantic registers as coarse-grained highlights via probabilistic variational sampler initialization and text-supervised truncated diffusion refinement; (2) learn textual semantic structure for a well-formed latent space; (3) adaptively fuse registers with video tokens using register-augmented Gaussian attention blocks for context-aware features.
Result: Extensive experiments show DreamPRVR outperforms state-of-the-art methods for partially relevant video retrieval.
Conclusion: DreamPRVR effectively addresses PRVR challenges through its coarse-to-fine paradigm with global semantic registers and adaptive fusion, demonstrating superior performance over existing approaches.
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.
[153] A reconfigurable smart camera implementation for jet flames characterization based on an optimized segmentation model
Gerardo Valente Vazquez-Garcia, Carmina Perez Guerrero, Eduardo Garduño, Miguel Gonzalez-Mendoza, Adriana Palacios, Gerardo Rodriguez-Hernandez, Vahid Foroughi, Alba Àgueda, Elsa Pastor, Gilberto Ochoa-Ruiz
Main category: cs.CV
TL;DR: A smart camera platform using SoC FPGA with optimized UNet model for real-time jet flame segmentation and characterization in industrial fire safety applications.
Details
Motivation: Addresses the lack of real-time solutions for early fire segmentation and characterization in industrial settings, particularly for jet flames.
Method: Develops a full edge-processing pipeline on an SoC FPGA (Ultra96 platform) with a UNet segmentation model optimized via the Vitis framework, reducing parameters from 7.5M to 59K (a 125x reduction) and applying multi-threading and batch normalization.
Result: Achieved 30 FPS performance with 7.5x latency improvement while maintaining accuracy (Dice Score), enabling real-time jet flame analysis on edge devices.
Conclusion: The framework demonstrates effective real-time fire safety management through optimized AI models on edge hardware, with potential for broader fire safety applications.
Abstract: In this work we present a novel framework for fire safety management in industrial settings through the implementation of a smart camera platform for jet flames characterization. The approach seeks to alleviate the lack of real-time solutions for industrial early fire segmentation and characterization. As a case study, we demonstrate how a SoC FPGA, running optimized Artificial Intelligence (AI) models, can be leveraged to implement a full edge processing pipeline for jet flames analysis. In this paper we extend previous work on computer-vision jet fire segmentation by creating a novel experimental set-up and system implementation for addressing this issue, which can be replicated for other fire safety applications. The proposed platform is designed to carry out image processing tasks in real-time and on device, reducing video processing overheads, and thus the overall latency. This is achieved by optimizing a UNet segmentation model to make it amenable to an SoC FPGA implementation; the optimized model can then be efficiently mapped onto the SoC reconfigurable logic for massively parallel execution. For our experiments, we have chosen the Ultra96 platform, as it also provides the means for implementing full-fledged intelligent systems using the SoC peripherals, as well as other Operating System (OS) capabilities (i.e., multi-threading) for systems management. For optimizing the model we made use of the Vitis (Xilinx) framework, which enabled us to optimize the full-precision model from 7.5 million parameters to 59,095 parameters (125x less), which translated into a 2.9x reduction of the processing latency. Further optimization (multi-threading and batch normalization) led to an improvement of 7.5x in terms of latency, yielding a performance of 30 Frames Per Second (FPS) without sacrificing accuracy in terms of the evaluated metrics (Dice Score).
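The compression and throughput figures above are simple ratios, reproduced here as a quick arithmetic check (all numbers taken from the abstract):

```python
full_params = 7_500_000              # full-precision UNet
opt_params = 59_095                  # Vitis-optimized model
reduction = full_params / opt_params  # ~126.9, reported as ~125x fewer parameters

fps = 30
frame_budget_ms = 1000 / fps         # ~33.3 ms available per frame at 30 FPS
```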
[154] Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
Tianci Luo, Haohao Pan, Jinpeng Wang, Niu Lian, Xinrui Chen, Bin Chen, Shu-Tao Xia, Chun Yuan
Main category: cs.CV
TL;DR: LaPR introduces label-aware prompt retrieval for visual in-context learning, addressing label inconsistency issues in existing methods by incorporating label cues into prompt selection.
Details
Motivation: Existing visual in-context learning methods focus on prompt images but overlook labels, leading to visually similar but label-inconsistent prompts that degrade performance; higher label consistency between query and prompts correlates with better VICL results.
Method: Develops the LaPR framework with an image-label joint representation for prompts, plus a mixture-of-experts mechanism with query-adaptive routing to handle unavailable query labels at test time. Uses alternating optimization with a VICL performance-guided contrastive loss and a label-guided contrastive loss.
Result: Extensive experiments show consistent improvement on in-context segmentation, detection, and colorization tasks. Generalizes well across feature extractors and cross-fold scenarios.
Conclusion: Label utilization is crucial for prompt retrieval in visual in-context learning, and LaPR effectively addresses label inconsistency issues to improve VICL performance.
Abstract: Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal that these approaches sometimes retrieve visually similar but label-inconsistent prompts, which potentially degrade VICL performance. On the other hand, higher label consistency between query and prompts generally indicates stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image-label joint representation for prompts to incorporate label cues explicitly. Besides, to handle unavailable query labels at test time, we introduce a mixture-of-experts mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation. We carefully design alternating optimization for the experts and the router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Code is available at https://github.com/luotc-why/CVPR26-LaPR.
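The query-adaptive routing can be sketched as a softmax mixture over expert outputs: the router scores each expert from the query feature, and the representation is the weighted sum of expert embeddings. The linear router and toy experts below are our illustrative choices, not LaPR's dual-encoder design:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def route(query_feat, experts, router_weights):
    """Mixture-of-experts with query-adaptive weights: each router row
    scores one expert, and the output mixes the expert embeddings."""
    logits = [sum(w * q for w, q in zip(ws, query_feat)) for ws in router_weights]
    weights = softmax(logits)
    embeddings = [f(query_feat) for f in experts]
    dim = len(embeddings[0])
    return [sum(w * e[j] for w, e in zip(weights, embeddings)) for j in range(dim)]

# Two toy experts, each "capturing" one label mode.
experts = [lambda q: [1.0, 0.0], lambda q: [0.0, 1.0]]
router_weights = [[10.0, 0.0], [0.0, 10.0]]   # router strongly prefers expert 0
out = route([1.0, 0.0], experts, router_weights)
```

Because routing depends only on the query feature, the mixture can emulate label-aware retrieval even when the query's label is unavailable at test time, which is the role the router plays in LaPR.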
[155] Event-Driven Neuromorphic Vision Enables Energy-Efficient Visual Place Recognition
Geoffroy Keime, Nicolas Cuperlier, Benoit R. Cottereau
Main category: cs.CV
TL;DR: SpikeVPR: A bio-inspired neuromorphic VPR system using event cameras and spiking neural networks for efficient, robust place recognition under dynamic conditions with 50x fewer parameters and 30-250x less energy than conventional deep networks.
Details
Motivation: Conventional visual place recognition (VPR) systems built on deep networks have high computational and energy demands, limiting real-time deployment on mobile platforms. The paper develops a more efficient approach inspired by the mammalian navigation system.
Method: Combines event-based cameras with spiking neural networks (SNNs) to generate compact place descriptors. Uses end-to-end surrogate gradient learning and introduces EventDilation, a novel augmentation strategy for robustness to speed and temporal variations.
Result: Achieves performance comparable to state-of-the-art deep networks on Brisbane-Event-VPR and NSAVP benchmarks while using 50 times fewer parameters and consuming 30-250 times less energy, enabling real-time deployment.
Conclusion: Spike-based coding offers an efficient pathway toward robust visual place recognition in complex, changing environments, demonstrating the potential of neuromorphic approaches for real-world robotics applications.
Abstract: Reliable visual place recognition (VPR) under dynamic real-world conditions is critical for autonomous robots, yet conventional deep networks remain limited by high computational and energy demands. Inspired by the mammalian navigation system, we introduce SpikeVPR, a bio-inspired and neuromorphic approach combining event-based cameras with spiking neural networks (SNNs) to generate compact, invariant place descriptors from few exemplars, achieving robust recognition under extreme changes in illumination, viewpoint, and appearance. SpikeVPR is trained end-to-end using surrogate gradient learning and incorporates EventDilation, a novel augmentation strategy enhancing robustness to speed and temporal variations. Evaluated on two challenging benchmarks (Brisbane-Event-VPR and NSAVP), SpikeVPR achieves performance comparable to state-of-the-art deep networks while using 50 times fewer parameters and consuming 30 and 250 times less energy, enabling real-time deployment on mobile and neuromorphic platforms. These results demonstrate that spike-based coding offers an efficient pathway toward robust VPR in complex, changing environments.
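The abstract names EventDilation as a speed/temporal-robustness augmentation but does not specify it; one plausible minimal sketch (the function name, the uniform scale range, and rescaling timestamps relative to the first event are all assumptions, not the paper's definition) simulates faster or slower motion by dilating event timestamps:

```python
import numpy as np

def event_dilation(events, scale_range=(0.5, 2.0), rng=None):
    """Hypothetical augmentation: rescale event timestamps to mimic
    faster/slower camera motion. `events` is an (N, 4) array of
    (x, y, timestamp, polarity) rows."""
    rng = np.random.default_rng() if rng is None else rng
    scale = rng.uniform(*scale_range)
    out = events.copy()
    t = out[:, 2]
    # Dilate time relative to the first event so the stream still starts at t0.
    out[:, 2] = t[0] + (t - t[0]) * scale
    return out

events = np.array([[10, 20, 0.00, +1],
                   [11, 20, 0.01, -1],
                   [12, 21, 0.03, +1]])
# Fix the scale at exactly 2x for a deterministic illustration.
aug = event_dilation(events, scale_range=(2.0, 2.0))
```

Spatial coordinates and polarities are untouched; only the temporal axis stretches, which is the kind of invariance the digest attributes to the augmentation.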
[156] A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Tianle Chen, Deepti Ghadiyaram
Main category: cs.CV
TL;DR: Multi-Modal Typography: Systematic study of coordinated audio-visual-text attacks on MLLMs, showing cross-modal vulnerabilities with 83% attack success rate.
Details
Motivation: As audio-visual MLLMs are deployed in safety-critical applications, understanding their vulnerabilities is crucial. Prior work focuses on unimodal attacks, but cross-modal fragility remains underexplored.
Method: Introduces Multi-Modal Typography to systematically examine how typographic attacks across multiple modalities (audio, visual, text) adversely influence MLLMs. Analyzes interactions between perturbations across modalities.
Result: Coordinated multi-modal attacks create significantly more potent threats than single-modality attacks (83.43% vs 34.93% attack success rate). Findings established across multiple frontier MLLMs, tasks, and benchmarks including common-sense reasoning and content moderation.
Conclusion: Multi-modal typography represents a critical and underexplored attack strategy in multi-modal reasoning, highlighting serious vulnerabilities in current audio-visual MLLMs that require attention for safety-critical deployments.
Abstract: As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
[157] 3D-IDE: 3D Implicit Depth Emergent
Chushan Zhang, Ruihan Lu, Jinguang Tong, Yikai Wang, Hongdong Li
Main category: cs.CV
TL;DR: 3D-Implicit Depth Emergence enables 3D perception to emerge implicitly from geometric self-supervision in multimodal LLMs, eliminating explicit 3D encoding and reducing inference latency by 55%.
Details
Motivation: Existing MLLMs struggle with 2D-3D representation fusion trade-offs, using either explicit ground-truth 3D positional encoding or grafting external 3D foundation models, leading to suboptimal deployment and latency issues.
Method: Proposes Implicit Geometric Emergence Principle using geometric self-supervision with fine-grained geometry validator and global representation constraints to create an information bottleneck, forcing models to maximize mutual information between visual features and 3D structures.
Result: Achieves state-of-the-art performance on multiple 3D scene understanding benchmarks with 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks.
Conclusion: Represents a paradigm shift from external grafting to implicit emergence for 3D knowledge integration in visual-language models, enabling dependency-free 3D understanding with zero latency overhead.
Abstract: Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at github.com/ChushanZhang/3D-IDE.
[158] BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
Tianzhi Jia, Kaixing Yang, Xiaole Yang, Xulong Tang, Ke Qiu, Shikui Wei, Yao Zhao
Main category: cs.CV
TL;DR: BiTDiff: A novel framework for 3D conducting motion generation from music using BiMamba-Transformer hybrid architecture with diffusion-based generation and human-kinematic decomposition for high-quality, efficient long-sequence synthesis.
Details
Motivation: 3D conducting motion generation has broad applications in music education, virtual performance, and digital human animation, but faces challenges due to lack of large-scale fine-grained datasets and effective methods for joint long-sequence generation with quality and efficiency.
Method: Proposes BiTDiff framework with: 1) CM-Data dataset (10 hours of SMPL-X conducting motions), 2) BiMamba-Transformer hybrid for efficient long-sequence modeling, 3) Diffusion-based generation with human-kinematic decomposition, 4) Physical-consistency losses and hand/body-specific forward-kinematics design, 5) Training-free joint-level motion editing.
Result: BiTDiff achieves state-of-the-art performance on CM-Data dataset, demonstrating superior 3D conducting motion generation with high quality and efficiency for long sequences.
Conclusion: The paper addresses key challenges in 3D conducting motion generation through novel dataset creation and BiTDiff framework, enabling high-quality, efficient synthesis with applications in music education and digital human animation.
Abstract: 3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
[159] XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
Xinyu Liu, Qing Xu, Zhen Chen
Main category: cs.CV
TL;DR: XAttnRes introduces cross-stage attention residuals for segmentation networks, enabling learned selective aggregation from global feature history across encoder-decoder stages, improving performance across multiple datasets and modalities.
Details
Motivation: To improve segmentation networks by replacing fixed residual connections with learned, selective aggregation mechanisms that can maintain a global feature history pool across both encoder and decoder stages, similar to attention residuals in LLMs.
Method: Proposes Cross-Stage Attention Residuals (XAttnRes) that maintains a global feature history pool accumulating encoder and decoder outputs, uses lightweight pseudo-query attention for selective aggregation, and introduces spatial alignment and channel projection to handle cross-resolution features between multi-scale encoder-decoder stages.
Result: Consistently improves performance across four datasets and three imaging modalities when added to existing segmentation networks. XAttnRes alone (without skip connections) achieves performance on par with baseline, showing learned aggregation can recover inter-stage information flow.
Conclusion: XAttnRes demonstrates that learned selective aggregation mechanisms can effectively replace predetermined connections in segmentation networks, enabling better feature integration across encoder-decoder stages and improving performance across diverse imaging modalities.
Abstract: In the field of Large Language Models (LLMs), Attention Residuals have recently demonstrated that learned, selective aggregation over all preceding layer outputs can outperform fixed residual connections. We propose Cross-Stage Attention Residuals (XAttnRes), a mechanism that maintains a global feature history pool accumulating both encoder and decoder stage outputs. Through lightweight pseudo-query attention, each stage selectively aggregates from all preceding representations. To bridge the gap between the same-dimensional Transformer layers in LLMs and the multi-scale encoder-decoder stages in segmentation networks, XAttnRes introduces spatial alignment and channel projection steps that handle cross-resolution features with negligible overhead. When added to existing segmentation networks, XAttnRes consistently improves performance across four datasets and three imaging modalities. We further observe that XAttnRes alone, even without skip connections, achieves performance on par with the baseline, suggesting that learned aggregation can recover the inter-stage information flow traditionally provided by predetermined connections.
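The core mechanism here is replacing a fixed residual sum with attention over all preceding stage outputs. A toy sketch of that idea (the descriptor shapes, the scaled-dot-product scoring, and treating spatial alignment/channel projection as already done are simplifying assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def xattn_residual(history, pseudo_query):
    """Toy selective aggregation: each entry in `history` is a (C,)
    stage descriptor (assumed already spatially aligned and channel-
    projected); `pseudo_query` stands in for the learned lightweight
    query. Attention weights over all preceding stages replace a
    fixed residual connection."""
    keys = np.stack(history)                          # (num_stages, C)
    scores = keys @ pseudo_query / np.sqrt(keys.shape[1])
    weights = softmax(scores)                         # one weight per stage
    return weights @ keys, weights                    # weighted aggregate

history = [np.ones(4), np.zeros(4), 2 * np.ones(4)]  # three earlier stages
agg, w = xattn_residual(history, pseudo_query=np.ones(4))
```

The point of the design, as the digest describes it, is that the weights are learned per query rather than fixed at 1 for a single skip connection, so a stage can draw selectively from any earlier representation.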
[160] MoViD: View-Invariant 3D Human Pose Estimation via Motion-View Disentanglement
Yejia Liu, Hengle Jiang, Haoxian Liu, Runxi Huang, Xiaomin Ouyang
Main category: cs.CV
TL;DR: MoViD: A viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint from motion features for robust performance across camera angles with real-time edge deployment.
Details
Motivation: Real-world 3D human pose estimation faces challenges with viewpoint variations, poor generalization to unseen camera angles, large training data requirements, and high inference latency, limiting practical deployment.
Method: Uses a view estimator to model joint relationships for viewpoint prediction, orthogonal projection to disentangle motion and view features, physics-grounded contrastive alignment across views, and frame-by-frame inference with adaptive flip refinement based on estimated viewpoint.
Result: Reduces pose estimation error by 24.2% vs SOTA, maintains robust performance under severe occlusions with 60% less training data, achieves 15 FPS real-time inference on NVIDIA edge devices across nine public datasets and new multiview UAV/gait datasets.
Conclusion: MoViD provides a practical solution for viewpoint-invariant 3D pose estimation with improved accuracy, data efficiency, and real-time performance suitable for edge deployment in applications like healthcare, robotics, and gaming.
Abstract: 3D human pose estimation is a key enabling technology for applications such as healthcare monitoring, human-robot collaboration, and immersive gaming, but real-world deployment remains challenged by viewpoint variations. Existing methods struggle to generalize to unseen camera viewpoints, require large amounts of training data, and suffer from high inference latency. We propose MoViD, a viewpoint-invariant 3D human pose estimation framework that disentangles viewpoint information from motion features. The key idea is to extract viewpoint information from intermediate pose features and leverage it to enhance both the robustness and efficiency of pose estimation. MoViD introduces a view estimator that models key joint relationships to predict viewpoint information, and an orthogonal projection module to disentangle motion and view features, further enhanced through physics-grounded contrastive alignment across views. For real-time edge deployment, MoViD employs a frame-by-frame inference pipeline with a view-aware strategy that adaptively activates flip refinement based on the estimated viewpoint. Evaluations on nine public datasets and newly collected multiview UAV and gait analysis datasets show that MoViD reduces pose estimation error by over 24.2% compared to state-of-the-art methods, maintains robust performance under severe occlusions with 60% less training data, and achieves real-time inference at 15 FPS on NVIDIA edge devices.
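The abstract names an orthogonal projection module for motion-view disentanglement without giving its form. A minimal sketch of one common reading (an assumption, not MoViD's actual module) removes the component of a feature that lies along an estimated view direction:

```python
import numpy as np

def disentangle(feature, view_vec):
    """Project `feature` onto the orthogonal complement of `view_vec`:
    subtract the component along the (normalized) view direction so the
    residual is, by construction, orthogonal to it. Vector shapes and
    the single-direction view code are illustrative simplifications."""
    v = view_vec / np.linalg.norm(view_vec)
    return feature - (feature @ v) * v

f = np.array([3.0, 4.0, 0.0])   # toy motion feature
v = np.array([1.0, 0.0, 0.0])   # toy view direction
motion = disentangle(f, v)       # component along v is removed
```

After the projection the residual has zero inner product with the view vector, which is the sense in which view information is factored out of the motion feature.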
[161] E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
Jiajun Zhai, Hao Shi, Shangwei Guo, Kailun Yang, Kaiwei Wang
Main category: cs.CV
TL;DR: E-VLA integrates event cameras with Vision-Language-Action models to improve robotic manipulation robustness under adverse visual conditions like low light and motion blur.
Details
Motivation: Current VLA models have fragile perception under sensing-stage degradations (extreme low light, motion blur, black clipping). Event cameras offer complementary visual information that could improve robustness in challenging conditions.
Method: Directly leverages motion and structural cues from event streams instead of reconstructing images. Uses lightweight, pretrained-compatible event integration strategies including parameter-free fusion (overlaying accumulated event maps onto RGB images) and event adapters. Built open-source teleoperation platform with DAVIS346 event camera and collected synchronized RGB-event-action dataset.
Result: Significant improvements in manipulation success rates: Pick-Place at 20 lux increased from 0% (image-only) to 60% with overlay fusion and 90% with event adapter; under severe motion blur (1000 ms exposure), Pick-Place improved from 0% to 20-25%, and Sorting from 5% to 32.5%.
Conclusion: Event-driven perception can be effectively integrated into VLA models, providing systematic evidence for robust embodied intelligence beyond conventional frame-based imaging.
Abstract: Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
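The simplest fusion the paper reports is overlaying an accumulated event map onto the RGB frame. The exact accumulation window and blending scheme are not given, so the alpha-blend below is an assumed sketch, not E-VLA's implementation:

```python
import numpy as np

def accumulate_events(events, h, w):
    """Sum event polarities per pixel over a time window into a 2D map."""
    acc = np.zeros((h, w), dtype=np.float32)
    for x, y, _, p in events:          # (x, y, timestamp, polarity)
        acc[int(y), int(x)] += p
    return acc

def overlay_fusion(rgb, event_map, alpha=0.5):
    """Parameter-free-style fusion (blending choice is an assumption):
    normalize |event_map| to [0, 1] and alpha-blend onto every channel,
    so edges that fired events stay visible even in a dark frame."""
    m = np.abs(event_map)
    if m.max() > 0:
        m = m / m.max()
    fused = (1 - alpha) * rgb + alpha * 255.0 * m[..., None]
    return fused.astype(np.uint8)

rgb = np.zeros((4, 4, 3), dtype=np.uint8)  # stand-in for a near-black frame
events = [(1, 1, 0.000, +1), (1, 1, 0.001, +1), (2, 3, 0.002, -1)]
fused = overlay_fusion(rgb, accumulate_events(events, 4, 4))
```

Because events encode brightness *changes*, pixels that fired remain bright in the fused image even when the underlying RGB frame is clipped to black, which is why this zero-parameter baseline already helps in the 20-lux setting the digest quotes.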
[162] Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing
Sangcheol Sim
Main category: cs.CV
TL;DR: Embedding-only uplink enables onboard hazard triage via vector search, with optimal decision heads (kNN vs centroids) varying by remote sensing task under distribution shift.
Details
Motivation: Downlink bottlenecks in satellite/remote sensing systems require onboard processing to prioritize hazards without transmitting raw pixels. Need to determine if embedding-only pipelines remain effective under real-world distribution shifts in remote sensing data.
Method: Studies an embedding-only pipeline where the ground station uplinks only compact embeddings plus metadata and the onboard system performs vector search to triage new captures. Tests under explicit remote-sensing shift: cross-time, cross-event/location, cross-site cloud, and cross-city AOI holdout. Uses OlmoEarth embeddings on a scaled public multi-task benchmark with 27 Sentinel-2 L2A scenes, 15 cloud sites, and 5 SpaceNet-2 AOIs across 10 seeds.
Result: All effective methods rely on same uplinked embeddings, but optimal decision head is task-dependent: kNN retrieval significantly superior for cloud classification (0.92 vs centroid 0.91; p<0.01), while class centroids dominate temporal change detection (0.85 vs retrieval 0.48; p<0.01). Embedding-only uplink is key enabler with all telemetry under 1 KB per query.
Conclusion: Embedding-only uplink enables efficient onboard hazard triage, with system able to select best decision head per task at no additional uplink cost once embeddings are onboard. Approach remains effective under real-world remote sensing distribution shifts.
Abstract: Downlink bottlenecks motivate onboard systems that prioritize hazards without transmitting raw pixels. We study a strict setting where a ground station uplinks only compact embeddings plus metadata, and an onboard system performs vector search to triage new captures. We ask whether this embedding-only pipeline remains useful under explicit remote-sensing shift: cross-time (pre/post-event), cross-event/location (different disasters), cross-site cloud (15 geographic sites), and cross-city AOI holdout (buildings). Using OlmoEarth embeddings on a scaled public multi-task benchmark (27 Sentinel-2 L2A scenes, 15 cloud sites, 5 SpaceNet-2 AOIs; 10 seeds), we find that all effective methods rely on the same uplinked embeddings, but the optimal decision head is task-dependent: kNN retrieval is significantly superior for cloud classification (0.92 vs. centroid 0.91; p<0.01, Wilcoxon), while class centroids dominate temporal change detection (0.85 vs. retrieval 0.48; p<0.01). These results show that embedding-only uplink is the key enabler: once embeddings are onboard, the system can select the best head per task at no additional uplink cost, with all telemetry under 1 KB per query.
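The paper's central comparison is between two decision heads run over the same uplinked embeddings. A self-contained toy sketch of both heads (synthetic embeddings; function names and Euclidean distance are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def knn_head(query, bank, labels, k=3):
    """kNN retrieval head: majority vote over the k nearest embeddings."""
    dists = np.linalg.norm(bank - query, axis=1)
    votes = labels[np.argsort(dists)[:k]]
    return np.bincount(votes).argmax()

def centroid_head(query, bank, labels):
    """Centroid head: classify by the nearest per-class mean embedding."""
    classes = np.unique(labels)
    cents = np.stack([bank[labels == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(cents - query, axis=1))]

# Synthetic embedding bank: two well-separated classes in 8-D.
rng = np.random.default_rng(0)
bank = np.concatenate([rng.normal(0.0, 0.1, (20, 8)),
                       rng.normal(1.0, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)
query = np.full(8, 0.95)  # a new capture near class 1
```

Both heads consume identical inputs, which is the paper's point: once the embeddings are onboard, switching heads per task costs no extra uplink.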
[163] DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen
Main category: cs.CV
TL;DR: DIRECT is a hierarchical multi-agent framework for video mashup creation that formulates the problem as Multimodal Coherency Satisfaction, using Screenwriter, Director, and Editor agents to achieve professional-grade audio-visual orchestration.
Details
Motivation: Existing automated video editing frameworks fail to achieve professional-grade fluidity in video mashups due to overlooking cross-level multimodal orchestration, resulting in disjointed sequences with abrupt visual transitions and musical misalignment.
Method: Formulates video mashup creation as Multimodal Coherency Satisfaction Problem (MMCSP). Proposes DIRECT framework with hierarchical multi-agent system: Screenwriter for source-aware global structural anchoring, Director for adaptive editing intent and guidance, and Editor for intent-guided shot sequence editing with fine-grained optimization.
Result: DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation on the introduced Mashup-Bench benchmark with tailored metrics for visual continuity and auditory alignment.
Conclusion: The hierarchical multi-agent approach effectively addresses multimodal coherency in video mashup creation, achieving professional-grade audio-visual orchestration through structured decomposition of the editing challenge.
Abstract: Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
[164] Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Nanxi Li, Xiang Wang, Yuanjie Chen, Haode Zhang, Hong Li, Yong-Lu Li
Main category: cs.CV
TL;DR: The paper investigates MLLMs’ limitations in intuitive physics understanding, particularly for continuum objects, introduces two benchmark tasks (NFS and TCV), and proposes Scene Dynamic Field (SDF) to improve physical reasoning using physics simulators.
Details
Motivation: Current MLLMs show impressive capabilities in image/video understanding but struggle with high-level physics reasoning, especially understanding the dynamics of continuum objects (fluids, smoke, etc.). There's a critical gap in their ability to comprehend the physical world.
Method: 1) Introduces two benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV) to evaluate intuitive physics understanding. 2) Proposes Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework to improve MLLMs' physical reasoning.
Result: State-of-the-art MLLMs perform poorly on the benchmark tasks. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains.
Conclusion: The work highlights a critical gap in current MLLMs’ physical reasoning capabilities and presents a promising cost-efficient approach (SDF) for developing more physically grounded MLLMs using physics simulators.
Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.
[165] HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Mingjin Chen, Junhao Chen, Zhaoxin Fan, Yujian Lee, Zichen Dang, Lili Wang, Yawen Cui, Lap-Pui Chau, Yi Wang
Main category: cs.CV
TL;DR: HVG-3D: A 3D-aware diffusion framework for hand-object interaction video synthesis using explicit 3D representations for better spatial control and utilization of 3D data.
Details
Motivation: Current methods for hand-object interaction video synthesis rely on 2D control signals that lack spatial expressiveness and limit the use of synthetic 3D conditional data, creating a need for more spatially-aware 3D control.
Method: Proposes HVG-3D with a diffusion-based architecture augmented with a 3D ControlNet that encodes geometric and motion cues from 3D inputs. Uses a hybrid pipeline for constructing input and condition signals for flexible control during training and inference.
Result: Achieves state-of-the-art spatial fidelity, temporal coherence, and controllability on the TASTE-Rob dataset, enabling effective utilization of both real and simulated data.
Conclusion: HVG-3D provides a unified framework for 3D-aware hand-object interaction video synthesis with explicit 3D reasoning, offering precise spatial and temporal control through 3D representations.
Abstract: Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. Specifically, we develop a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.
[166] VABench: A Comprehensive Benchmark for Audio-Video Generation
Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, Wentao Zhang
Main category: cs.CV
TL;DR: VABench is a comprehensive benchmark framework for evaluating synchronous audio-video generation models across multiple tasks and dimensions.
Details
Motivation: Existing video generation benchmarks lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs, creating a gap in systematic assessment of multimodal capabilities.
Method: Introduces VABench with three primary task types (text-to-audio-video, image-to-audio-video, stereo audio-video generation) and two major evaluation modules covering 15 dimensions including pairwise similarities, audio-video synchronization, lip-speech consistency, and audio/video QA pairs across seven content categories.
Result: Provides a systematic analysis and visualization of evaluation results, establishing a new standard for assessing video generation models with synchronous audio capabilities.
Conclusion: VABench addresses the critical gap in audio-video generation evaluation and aims to promote comprehensive advancement in the field of multimodal generation.
Abstract: Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
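The pairwise-similarity dimensions (text-video, text-audio, video-audio) boil down to embedding alignment scores. A minimal sketch, with toy vectors standing in for encoder outputs (the actual encoders and embeddings are not specified by the benchmark summary):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors, the basic
    operation behind pairwise alignment scores such as text-video,
    text-audio, and video-audio similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for encoder outputs (illustrative values).
text_vec, video_vec = [0.6, 0.8], [0.6, 0.8]
sim = cosine(text_vec, video_vec)  # close to 1.0 for identical directions
```

In practice the vectors would come from pretrained cross-modal encoders, and a benchmark averages such scores over many generated samples.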
[167] Editing Physiological Signals in Videos Using Latent Representations
Tianwen Zhou, Akshay Paruchuri, Josef Spjut, Kaan Akşit
Main category: cs.CV
TL;DR: A framework for editing physiological signals (heart rate) in facial videos while preserving visual quality, addressing privacy concerns in camera-based health monitoring.
Details
Motivation: Camera-based heart rate estimation from facial videos raises privacy concerns as physiological signals can reveal sensitive health and emotional information. There's a need for methods that can edit/modify these signals while maintaining visual fidelity.
Method: Uses a pretrained 3D VAE to encode videos, fuses with target HR prompts via frozen text encoder, employs trainable spatio-temporal layers with Adaptive Layer Normalization (AdaLN) to capture temporal coherence of rPPG signals, and applies Feature-wise Linear Modulation (FiLM) in decoder with fine-tuned output layer to avoid signal degradation during reconstruction.
Result: Achieves high visual quality (PSNR 38.96 dB, SSIM 0.98) while achieving HR modulation error of 10.00 bpm MAE and 10.09% MAPE using state-of-the-art rPPG estimator.
Conclusion: The method enables controllable HR editing for applications like anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs, addressing privacy concerns in camera-based physiological monitoring.
Abstract: Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design’s controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.
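The FiLM step named in the abstract is a standard conditioning mechanism: a per-channel scale and shift predicted from a conditioning vector. A minimal sketch with toy shapes and a hypothetical HR embedding (not the paper's actual layers or weights):

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """Feature-wise Linear Modulation: per-channel scale (gamma) and
    shift (beta) predicted from a conditioning vector -- here standing
    in for the target-HR embedding -- applied to a decoder feature map."""
    gamma = cond @ w_gamma          # (C,) scale per channel
    beta = cond @ w_beta            # (C,) shift per channel
    return gamma * features + beta  # broadcast over the H x W dims

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))   # toy H x W x C decoder features
hr_embed = rng.normal(size=(4,))      # toy conditioning embedding
w_g = rng.normal(size=(4, 16))
w_b = rng.normal(size=(4, 16))
modulated = film(feats, hr_embed, w_g, w_b)
```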
[168] Stochastic Generative Plug-and-Play Priors
Chicago Y. Park, Edward P. Chandler, Yuyang Hu, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov
Main category: cs.CV
TL;DR: This paper introduces a stochastic generative PnP (SGPnP) framework that connects score-based diffusion models with plug-and-play methods for solving imaging inverse problems, enabling direct use of pretrained SBDMs as priors without reverse diffusion sampling.
Details
Motivation: To bridge the gap between plug-and-play (PnP) methods and score-based diffusion models (SBDMs) for solving imaging inverse problems. While both rely on denoisers, there's no systematic way to use SBDMs as priors within PnP without reverse diffusion sampling. The authors aim to establish a theoretical connection and develop a framework that leverages the strong generative capabilities of SBDMs within the PnP paradigm.
Method: The paper establishes a score-based interpretation of PnP that justifies using pretrained SBDMs directly within PnP algorithms. They introduce a stochastic generative PnP (SGPnP) framework that injects noise to better leverage expressive generative SBDM priors. The method includes theoretical analysis showing that noise injection induces optimization on a Gaussian-smoothed objective and promotes escape from strict saddle points.
Result: Experiments on challenging inverse tasks, including multi-coil MRI reconstruction and large-mask natural image inpainting, demonstrate consistent improvement over conventional PnP methods. The SGPnP framework achieves performance competitive with diffusion-based solvers while operating within the PnP paradigm.
Conclusion: The paper successfully bridges PnP methods and score-based diffusion models, providing a theoretical foundation and practical framework for using pretrained SBDMs as priors in PnP algorithms. The stochastic generative PnP approach improves robustness in severely ill-posed inverse problems and offers competitive performance with diffusion-based methods.
Abstract: Plug-and-play (PnP) methods are widely used for solving imaging inverse problems by incorporating a denoiser into optimization algorithms. Score-based diffusion models (SBDMs) have recently demonstrated strong generative performance through a denoiser trained across a wide range of noise levels. Despite their shared reliance on denoisers, it remains unclear how to systematically use SBDMs as priors within the PnP framework without relying on reverse diffusion sampling. In this paper, we establish a score-based interpretation of PnP that justifies using pretrained SBDMs directly within PnP algorithms. Building on this connection, we introduce a stochastic generative PnP (SGPnP) framework that injects noise to better leverage the expressive generative SBDM priors, thereby improving robustness in severely ill-posed inverse problems. We provide a new theory showing that this noise injection induces optimization on a Gaussian-smoothed objective and promotes escape from strict saddle points. Experiments on challenging inverse tasks, such as multi-coil MRI reconstruction and large-mask natural image inpainting, demonstrate consistent improvement over conventional PnP methods and achieve performance competitive with diffusion-based solvers.
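The core loop can be sketched generically: a data-fidelity gradient step, noise injection, then a plug-in denoiser. The toy problem and identity denoiser below are illustrative stand-ins, not the paper's pretrained SBDM prior:

```python
import numpy as np

def sgpnp_step(x, grad_data, denoiser, step, sigma, rng):
    """One stochastic plug-and-play iteration (sketch): gradient step on
    the data-fidelity term, noise injection, then a denoiser acting as
    the prior. In SGPnP the denoiser would be a pretrained SBDM; here it
    is any callable. Per the paper's analysis, the injected noise
    corresponds to optimizing a Gaussian-smoothed objective."""
    z = x - step * grad_data(x)               # data-fidelity step
    z = z + sigma * rng.normal(size=x.shape)  # stochastic injection
    return denoiser(z, sigma)                 # plug-in prior

# Toy problem: recover x from y = x (identity forward model).
rng = np.random.default_rng(1)
y = np.array([1.0, -2.0, 3.0])
grad = lambda x: x - y                        # grad of 0.5 * ||x - y||^2
x = np.zeros(3)
for _ in range(20):
    x = sgpnp_step(x, grad, lambda z, s: z, 0.5, 0.01, rng)
# x is now close to y, up to the small injected noise
```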
[169] Deep Image Clustering Based on Curriculum Learning and Density Information
Haiyang Zheng, Ruilin Zhang, Hongpeng Wang
Main category: cs.CV
TL;DR: IDCL introduces density-based curriculum learning and density cores for robust image clustering, improving performance over traditional deep clustering methods.
Details
Motivation: Existing deep clustering methods for images lack robust training strategies and rely on error-prone point-to-point distance metrics, leading to error accumulation during iterative clustering.
Method: Proposes IDCL with density-based curriculum learning that adjusts learning pace based on data density, and uses density cores instead of individual cluster centers for more robust cluster assignment.
Result: Extensive experiments show superiority over state-of-the-art methods in robustness, convergence speed, and flexibility across different data scales, cluster numbers, and image contexts.
Conclusion: Density information significantly improves deep image clustering performance, making IDCL a robust and flexible approach for multimedia analytics.
Abstract: Image clustering is one of the crucial techniques in multimedia analytics and knowledge discovery. Recently, deep clustering (DC) methods, which perform feature learning and cluster assignment jointly, have surpassed traditional methods on image data. However, existing methods rarely consider the role of model learning strategies in improving the robustness and performance of clustering complex image data. Furthermore, most approaches rely solely on point-to-point distances to cluster centers for partitioning the latent representations, resulting in error accumulation throughout the iterative process. In this paper, we propose a robust image clustering method (IDCL) which, to our knowledge, is the first to introduce a model training strategy based on density information into image clustering. Specifically, we design a curriculum learning scheme grounded in the density information of the input data, with a more reasonable learning pace. Moreover, we employ the density core rather than the individual cluster center to guide cluster assignment. Finally, extensive comparisons with state-of-the-art clustering approaches on benchmark datasets demonstrate the superiority of the proposed method, including robustness, rapid convergence, and flexibility in terms of data scale, number of clusters, and image context.
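The density-guided pacing idea can be sketched with a simple kNN density estimate; the ordering rule below illustrates the principle (dense, easy samples first), not IDCL's exact scheme:

```python
import numpy as np

def curriculum_order(X, k=3):
    """Order training samples from high-density (learned early) to
    low-density (learned late) -- a rough sketch of density-guided
    curriculum pacing. Density is taken as the inverse of the mean
    distance to the k nearest neighbours."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # exclude self-distance
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    density = 1.0 / (knn_mean + 1e-12)
    return np.argsort(-density)                     # densest samples first

# A tight cluster plus one far-away outlier (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.05, 0.05], [10.0, 10.0]])
order = curriculum_order(X)  # the outlier (index 5) is scheduled last
```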
[170] V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
Jiazhou Zhou, Yucheng Chen, Hongyang Li, Qing Jiang, Hu Zhou, Ying-Cong Chen, Lei Zhang
Main category: cs.CV
TL;DR: V-Reflection transforms MLLMs into active interrogators using a “think-then-look” visual reflection mechanism that grounds reasoning steps in visual evidence through dynamic probes.
Details
Motivation: Current MLLMs treat visual input as static, reasoning-agnostic preambles, making them passive observers prone to perception-related hallucinations in fine-grained tasks. They lack the ability to re-examine visual details to ground evolving reasoning states.
Method: Two-stage distillation: 1) Box-Guided Compression (BCM) establishes stable pixel-to-latent targets via explicit spatial grounding; 2) Dynamic Autoregressive Compression (DAC) maps hidden states into dynamic probes that interrogate global visual feature maps. Distills BCM expertise into DAC to internalize localization ability.
Result: Extensive experiments across six perception-intensive benchmarks show significant narrowing of fine-grained perception gap. Visualizations confirm latent reasoning autonomously localizes task-critical visual evidence. Inference maintains purely end-to-end autoregressive decoding with optimal efficiency.
Conclusion: V-Reflection successfully transforms MLLMs from passive observers to active interrogators, enabling dynamic visual grounding during reasoning to reduce perception hallucinations in fine-grained multimodal tasks.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a “think-then-look” visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step for task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression (BCM) module establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model’s hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, maintaining a purely end-to-end autoregressive decoding in the latent space with optimal efficiency. Extensive experiments demonstrate the effectiveness of our V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
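The probe mechanism is essentially cross-attention with a hidden state as the query. A generic sketch of that interrogation step (not the paper's actual BCM/DAC modules):

```python
import numpy as np

def probe_attention(probe, feat_map):
    """A latent reasoning state used as a query ('dynamic probe') that
    attends over a flattened visual feature map -- a generic sketch of
    the interrogation step, not the paper's exact BCM/DAC design."""
    h, w, d = feat_map.shape
    tokens = feat_map.reshape(h * w, d)
    scores = tokens @ probe / np.sqrt(d)        # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over locations
    evidence = weights @ tokens                 # attended visual evidence
    return evidence, weights.reshape(h, w)      # plus a spatial attention map

rng = np.random.default_rng(2)
fmap = rng.normal(size=(6, 6, 8))               # toy visual feature map
state = rng.normal(size=(8,))                   # toy hidden-state probe
evidence, attn = probe_attention(state, fmap)
```

The spatial map is the kind of visualization the paper uses to show that reasoning localizes task-critical evidence.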
[171] Edge-Based Standing-Water Detection via FSM-Guided Tiering and Multi-Model Consensus
Oliver Aleksander Larsen, Mahyar T. Moghaddam
Main category: cs.CV
TL;DR: Edge architecture for standing water detection in agriculture using Raspberry Pi/Jetson devices with adaptive inference tiering, multi-model YOLO ensemble, and environmental sensor fusion
Details
Motivation: Standing water in agricultural fields threatens vehicle mobility and crop health, requiring efficient detection systems that can operate under resource constraints and intermittent connectivity.
Method: Deployed edge architecture using Raspberry-Pi-class devices with optional Jetson acceleration. Combines camera input and environmental sensors (humidity, pressure, temperature) in a finite-state machine (FSM) as architectural decision engine. FSM-guided control plane selects between local and offloaded inference tiers. Uses multi-model YOLO ensemble for image scores and diurnal-baseline sensor fusion to adjust caution using environmental anomalies.
Result: Across ten configurations and sensor variants on identical field sequences with frame-level ground truth, the system improves flood-detection performance over static local baselines, uses less energy than a naive always-heavy offload policy, and maintains bounded tail latency in a real agricultural setting
Conclusion: The combination of adaptive tiering, multi-model consensus, and diurnal sensor fusion provides effective standing water detection in agricultural edge computing scenarios with resource constraints
Abstract: Standing water in agricultural fields threatens vehicle mobility and crop health. This paper presents a deployed edge architecture for standing-water detection using Raspberry-Pi-class devices with optional Jetson acceleration. Camera input and environmental sensors (humidity, pressure, temperature) are combined in a finite-state machine (FSM) that acts as the architectural decision engine. The FSM-guided control plane selects between local and offloaded inference tiers, trading accuracy, latency, and energy under intermittent connectivity and motion-dependent compute budgets. A multi-model YOLO ensemble provides image scores, while diurnal-baseline sensor fusion adjusts caution using environmental anomalies. All decisions are logged per frame, enabling bit-identical hardware-in-the-loop replays. Across ten configurations and sensor variants on identical field sequences with frame-level ground truth, we show that the combination of adaptive tiering, multi-model consensus, and diurnal sensor fusion improves flood-detection performance over static local baselines, uses less energy than a naive always-heavy offload policy, and maintains bounded tail latency in a real agricultural setting.
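The tier-selection logic can be illustrated with a toy decision rule; the states, thresholds, and inputs below are assumptions for illustration, not the deployed FSM:

```python
def next_tier(connected, battery_frac, anomaly_score, threshold=2.0):
    """Toy decision rule in the spirit of the FSM-guided control plane:
    stay on the light local model when offline or low on energy, and
    escalate to the heavy offloaded tier when connectivity allows and
    the sensor baseline flags an anomaly. All names and thresholds
    here are illustrative, not the deployed system's."""
    if not connected or battery_frac < 0.2:
        return "local-light"       # degrade gracefully under constraints
    if anomaly_score > threshold:  # e.g. humidity deviates from diurnal baseline
        return "offload-heavy"     # spend energy/latency for accuracy
    return "local-light"
```

Logging each such decision per frame is what enables the bit-identical hardware-in-the-loop replays described in the abstract.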
[172] TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding
Jingbin You, Zehao Li, Hao Jiang, Xinzhu Ma, Shuqin Gao, Honglong Zhao, Congcong Zheng, Tianlu Mao, Feng Dai, Yucheng Zhang, Zhaoqi Wang
Main category: cs.CV
TL;DR: TreeGaussian introduces a tree-guided cascaded contrastive learning framework for hierarchical 3D semantic segmentation using 3D Gaussian Splatting, addressing limitations in representing object-part relationships and improving segmentation quality.
Details
Motivation: Existing 3DGS-based methods struggle with hierarchical semantic structures and whole-part relationships in complex scenes. Dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, leading to suboptimal segmentation.
Method: Proposes TreeGaussian with: 1) multi-level object tree construction for structured learning across hierarchies, 2) two-stage cascaded contrastive learning (global to local refinement), 3) Consistent Segmentation Detection (CSD) mechanism for view alignment, and 4) graph-based denoising module for unstable Gaussian point suppression.
Result: Extensive experiments demonstrate effectiveness in open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies show robustness and improved segmentation consistency/quality.
Conclusion: TreeGaussian successfully addresses hierarchical semantic representation challenges in 3DGS, providing a robust framework for structured 3D scene understanding with improved segmentation performance.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.
[173] Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions
Haichao Wang, Alexander Okupnik, Yuxing Han, Gene Wen, Johannes Schneider, Kyriakos Flouris
Main category: cs.CV
TL;DR: A diffusion-based optimal control framework for generating coherent long-range human motion transitions across semantically distinct domains, particularly useful for applications like dance choreography.
Details
Motivation: Long-range human movement generation is challenging, especially generating coherent transitions across semantically distinct motion domains. This capability is crucial for applications like dance choreography where movements need to fluidly transition between diverse stylistic and semantic motifs.
Method: Proposes an inference-time optimization framework inspired by diffusion-based stochastic optimal control. Uses a control-energy objective that explicitly regularizes transition trajectories of a pretrained diffusion model, optimizing this objective at inference time.
Result: The method yields transitions with fidelity and temporal coherence, providing the first general framework for controlled long-range human motion generation with explicit transition modeling.
Conclusion: The proposed inference-time optimization framework effectively addresses the challenge of generating coherent long-range human motion transitions across semantically distinct domains.
Abstract: Long-range human movement generation remains a central challenge in computer vision and graphics. Generating coherent transitions across semantically distinct motion domains remains largely unexplored. This capability is particularly important for applications such as dance choreography, where movements must fluidly transition across diverse stylistic and semantic motifs. We propose a simple and effective inference-time optimization framework inspired by diffusion-based stochastic optimal control. Specifically, we introduce a control-energy objective that explicitly regularizes the transition trajectories of a pretrained diffusion model. We show that optimizing this objective at inference time yields transitions with high fidelity and temporal coherence. This is the first work to provide a general framework for controlled long-range human motion generation with explicit transition modeling.
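The control-energy idea can be illustrated on toy dynamics: optimize per-step controls so the trajectory endpoint reaches a target while penalizing total control energy. The straight-line rollout below is a placeholder for a pretrained diffusion sampler, not the paper's model:

```python
import numpy as np

def optimize_controls(x0, target, steps=10, lam=0.1, lr=0.1, iters=200):
    """Toy inference-time control optimization: steer a trajectory
    x_{t+1} = x_t + u_t toward a target endpoint while penalizing the
    control energy sum_t ||u_t||^2, the regularizer analogous to the
    paper's transition objective. The dynamics are a stand-in for a
    pretrained diffusion model."""
    u = np.zeros((steps, x0.shape[0]))
    for _ in range(iters):
        x_final = x0 + u.sum(axis=0)         # endpoint of the rollout
        terminal_grad = x_final - target     # grad of 0.5 * ||x_T - target||^2
        u -= lr * (2.0 * lam * u + terminal_grad)
    return u, x0 + u.cumsum(axis=0)

u, traj = optimize_controls(np.array([1.0]), np.array([0.0]))
# the endpoint lands near the target; by symmetry all controls are equal,
# i.e. the energy penalty spreads the transition smoothly over the steps
```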
[174] PollutionNet: A Vision Transformer Framework for Climatological Assessment of NO$_2$ and SO$_2$ Using Satellite-Ground Data Fusion
Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan
Main category: cs.CV
TL;DR: PollutionNet is a Vision Transformer-based framework that integrates satellite and ground-level data to predict atmospheric NO₂ and SO₂ concentrations, achieving state-of-the-art performance with 14% error reduction compared to baseline models.
Details
Motivation: Traditional air pollution monitoring has limitations: satellite observations provide broad spatial coverage but suffer from data gaps, while ground-based sensors offer high temporal resolution but limited spatial extent. There's a need for better integration of these data sources to enable accurate pollution assessment, especially in regions with sparse monitoring networks.
Method: PollutionNet uses a Vision Transformer-based framework that integrates Sentinel-5P TROPOMI vertical column density (VCD) satellite data with ground-level observations. The model leverages self-attention mechanisms to capture complex spatiotemporal dependencies that conventional CNN and RNN models often miss.
Result: Applied to Ireland (2020-2021), PollutionNet achieves state-of-the-art performance with RMSE of 6.89 μg/m³ for NO₂ and 4.49 μg/m³ for SO₂, reducing prediction errors by up to 14% compared to baseline models.
Conclusion: PollutionNet provides a scalable and data-efficient tool for applied climatology, enabling robust pollution assessments in regions with sparse monitoring networks. The results highlight the potential of advanced machine learning approaches to enhance climate-related air quality research and support environmental policy decisions.
Abstract: Accurate assessment of atmospheric nitrogen dioxide (NO$_2$) and sulfur dioxide (SO$_2$) is essential for understanding climate-air quality interactions, supporting environmental policy, and protecting public health. Traditional monitoring approaches face limitations: satellite observations provide broad spatial coverage but suffer from data gaps, while ground-based sensors offer high temporal resolution but limited spatial extent. To address these challenges, we propose PollutionNet, a Vision Transformer-based framework that integrates Sentinel-5P TROPOMI vertical column density (VCD) data with ground-level observations. By leveraging self-attention mechanisms, PollutionNet captures complex spatiotemporal dependencies that are often missed by conventional CNN and RNN models. Applied to Ireland (2020-2021), our case study demonstrates that PollutionNet achieves state-of-the-art performance (RMSE: 6.89 $\mu$g/m$^3$ for NO$_2$, 4.49 $\mu$g/m$^3$ for SO$_2$), reducing prediction errors by up to 14% compared to baseline models. Beyond accuracy gains, PollutionNet provides a scalable and data-efficient tool for applied climatology, enabling robust pollution assessments in regions with sparse monitoring networks. These results highlight the potential of advanced machine learning approaches to enhance climate-related air quality research, inform environmental management, and support sustainable policy decisions.
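The reported numbers combine a standard RMSE with a relative error reduction. A minimal sketch of both computations (the baseline value in the note below is hypothetical):

```python
import math

def rmse(pred, obs):
    """Root-mean-square error between predicted and observed
    concentrations (same units, e.g. ug/m^3)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def error_reduction_pct(baseline_rmse, model_rmse):
    """Relative error reduction, the usual sense in which a result like
    'up to 14% lower than baselines' is computed."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse
```

For instance, against a hypothetical baseline RMSE of 8.0 ug/m^3, the reported 6.89 ug/m^3 would be roughly a 13.9% reduction; the actual baseline values come from the paper's comparison models.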
[175] CardioSAM: Topology-Aware Decoder Design for High-Precision Cardiac MRI Segmentation
Ujjwal Jain
Main category: cs.CV
TL;DR: CardioSAM: A hybrid cardiac segmentation model combining frozen SAM encoder with cardiac-specific decoder for precise medical image segmentation.
Details
Motivation: Manual cardiac segmentation in CMR images is time-consuming and suffers from inter-observer variability. While foundation models like SAM show strong generalization, they lack the boundary precision needed for clinical applications in cardiac imaging.
Method: Proposes CardioSAM, a hybrid architecture with frozen SAM encoder for generalized feature extraction and lightweight trainable cardiac-specific decoder. The decoder includes Cardiac-Specific Attention module with anatomical topological priors and Boundary Refinement Module for improved tissue interface delineation.
Result: Achieves Dice coefficient of 93.39%, IoU of 87.61%, pixel accuracy of 99.20%, and HD95 of 4.2 mm on ACDC benchmark. Surpasses nnU-Net by +3.89% Dice and exceeds reported inter-expert agreement levels (91.2%).
Conclusion: CardioSAM demonstrates potential for reliable and clinically applicable cardiac segmentation by combining foundation model generalization with domain-specific refinement for medical imaging.
Abstract: Accurate segmentation of cardiac structures in cardiovascular magnetic resonance (CMR) images is essential for reliable diagnosis and treatment of cardiovascular diseases. However, manual segmentation remains time-consuming and suffers from significant inter-observer variability. Recent advances in deep learning, particularly foundation models such as the Segment Anything Model (SAM), demonstrate strong generalization but often lack the boundary precision required for clinical applications. To address this limitation, we propose CardioSAM, a hybrid architecture that combines the generalized feature extraction capability of a frozen SAM encoder with a lightweight, trainable cardiac-specific decoder. The proposed decoder introduces two key innovations: a Cardiac-Specific Attention module that incorporates anatomical topological priors, and a Boundary Refinement Module designed to improve tissue interface delineation. Experimental evaluation on the ACDC benchmark demonstrates that CardioSAM achieves a Dice coefficient of 93.39%, IoU of 87.61%, pixel accuracy of 99.20%, and HD95 of 4.2 mm. The proposed method surpasses strong baselines such as nnU-Net by +3.89% Dice and exceeds reported inter-expert agreement levels (91.2%), indicating its potential for reliable and clinically applicable cardiac segmentation.
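The Dice and IoU scores reported on ACDC are simple overlap ratios; a minimal reference implementation for binary masks:

```python
def dice_iou(pred, gt):
    """Dice coefficient and IoU for flat binary masks (0/1 values),
    the overlap metrics reported on the ACDC benchmark."""
    inter = sum(p & g for p, g in zip(pred, gt))
    psum, gsum = sum(pred), sum(gt)
    dice = 2 * inter / (psum + gsum) if (psum + gsum) else 1.0
    union = psum + gsum - inter
    iou = inter / union if union else 1.0
    return dice, iou

# Tiny flattened 2x2 example: prediction and ground truth share one pixel.
dice, iou = dice_iou([1, 1, 0, 0], [1, 0, 1, 0])  # dice = 0.5, iou = 1/3
```

Note the relation IoU = Dice / (2 - Dice), which is why the two metrics rank methods identically; HD95 (a boundary-distance metric) adds the complementary geometric view.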
[176] Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications
Lujun Li, Yiqun Wang, Radu State
Main category: cs.CV
TL;DR: A novel Time-series Vision Transformer framework that reconstructs cloud-covered multispectral imagery using temporal coherence and complementary SAR data through attention mechanisms.
Details
Motivation: Cloud cover in multispectral imagery (MSI) causes missing/corrupted spectral data for early season crop mapping, while SAR data is cloud-resistant but lacks spectral detail for precise crop classification.
Method: Proposes Time-series MSI Image Reconstruction using Vision Transformer (ViT) that leverages temporal coherence of MSI and complementary SAR information through attention mechanisms to reconstruct cloud-covered MSI regions.
Result: Comprehensive experiments show the Time-series ViT framework significantly outperforms baselines using non-time-series MSI+SAR or time-series MSI without SAR, effectively enhancing MSI reconstruction in cloud-covered areas.
Conclusion: The proposed framework successfully addresses cloud interference in MSI by combining temporal coherence and SAR data through attention mechanisms, improving reconstruction for agricultural applications.
Abstract: Cloud cover in multispectral imagery (MSI) poses significant challenges for early season crop mapping, as it leads to missing or corrupted spectral information. Synthetic aperture radar (SAR) data, which is not affected by cloud interference, offers a complementary solution, but lacks sufficient spectral detail for precise crop mapping. To address this, we propose a novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), to reconstruct MSI data in cloud-covered regions by leveraging the temporal coherence of MSI and the complementary information from SAR through an attention mechanism. Comprehensive experiments, using rigorous reconstruction evaluation metrics, demonstrate that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR, effectively enhancing MSI image reconstruction in cloud-covered regions.
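The reconstruction idea can be caricatured as temporal attention keyed on the cloud-free SAR signal: a masked timestep is filled from clear timesteps that look similar in SAR. The hand-written rule below is an illustration of that intuition, not the learned ViT:

```python
import numpy as np

def fill_clouded(msi, sar, mask):
    """Sketch of attention-style gap filling: reconstruct each cloud-
    masked timestep as a similarity-weighted average of the clear
    timesteps, with similarity computed on the cloud-free SAR signal.
    Illustrative only -- the paper learns this mapping with a ViT."""
    out = msi.copy()
    clear = [s for s in range(len(mask)) if not mask[s]]
    for t in range(len(mask)):
        if not mask[t]:
            continue
        # higher score for clear frames whose SAR looks like frame t's
        scores = np.array([-np.abs(sar[t] - sar[s]).mean() for s in clear])
        w = np.exp(scores - scores.max())
        w /= w.sum()                            # softmax over clear frames
        out[t] = sum(wi * msi[s] for wi, s in zip(w, clear))
    return out

# Three timesteps, middle one clouded; SAR identical across frames,
# so the fill reduces to the mean of the two clear frames.
msi = np.array([[[1.0, 1.0]], [[0.0, 0.0]], [[3.0, 3.0]]])  # (T, H, W)
sar = np.zeros((3, 1, 2))
filled = fill_clouded(msi, sar, [False, True, False])
```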
[177] CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Ahmed
Main category: cs.CV
TL;DR: CoLA introduces cross-modal adaptation pathways alongside standard LoRA for efficient multimodal fine-tuning of foundation models, improving performance on vision-language and audio-visual tasks.
Details
Motivation: Current PEFT methods like LoRA operate in isolation within each modality, limiting their ability to capture cross-modal interactions needed for effective multimodal adaptation of foundation models.
Method: CoLA extends LoRA by adding a dedicated inter-modal adaptation pathway alongside the standard intra-modal one, creating a dual-path design that enables effective adaptation without interference between modality-specific and cross-modal learning.
Result: CoLA consistently outperforms LoRA across vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, achieving relative gains of around 3% and 2% respectively while maintaining parameter efficiency.
Conclusion: CoLA provides an effective parameter-efficient fine-tuning framework for multimodal tasks that bridges the gap between unimodal adaptation and cross-modal learning, enabling the first multi-task PEFT framework for visual grounding.
Abstract: Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dual-stream architectures composed of unimodal encoders, such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability to capture cross-modal interactions. In this paper, we take a step in bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LoRA, achieving a relative gain of around 3% and 2%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
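The dual-path design can be sketched as one frozen projection plus two low-rank updates: one fed by the layer's own modality (standard LoRA) and one fed by the other modality. Placement, rank, and scaling below are simplifying assumptions, not CoLA's exact recipe:

```python
import numpy as np

def cola_layer(x_self, x_other, w_frozen, a_intra, b_intra, a_inter, b_inter):
    """Sketch of a dual-path low-rank adapted layer: a frozen projection
    plus an intra-modal LoRA update on the layer's own input, and an
    inter-modal low-rank path fed by the other modality's features."""
    base = x_self @ w_frozen
    intra = x_self @ a_intra @ b_intra      # standard LoRA path
    inter = x_other @ a_inter @ b_inter     # cross-modal adaptation path
    return base + intra + inter

rng = np.random.default_rng(3)
x_v = rng.normal(size=(16,))                # e.g. vision-token features
x_t = rng.normal(size=(12,))                # e.g. text-token features
w = rng.normal(size=(16, 16))               # frozen pretrained weight
a_in, b_in = rng.normal(size=(16, 4)), np.zeros((4, 16))   # rank-4, zero-init B
a_x, b_x = rng.normal(size=(12, 4)), np.zeros((4, 16))
out = cola_layer(x_v, x_t, w, a_in, b_in, a_x, b_x)
```

With the usual LoRA zero-initialization of the B matrices, both adaptation paths start as no-ops, so fine-tuning begins exactly at the pretrained model's behavior.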
[178] StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics
Bingliang Li, Zhenhong Sun, Jiaming Bian, Yuehao Wu, Yifu Wang, Hongdong Li, Yatao Bian, Huadong Mo, Daoyi Dong
Main category: cs.CV
TL;DR: StoryBlender is a 3D storyboard generation framework that achieves both inter-shot consistency and explicit editability through a three-stage pipeline with semantic-spatial grounding, canonical asset materialization, and spatial-temporal dynamics.
Details
Motivation: Current approaches to automated storyboarding fail to simultaneously achieve inter-shot consistency and explicit editability. 2D diffusion-based generators suffer from identity drift and limited geometric control, while traditional 3D animation workflows require expert-heavy, labor-intensive authoring.
Method: Three-stage pipeline: (1) Semantic-Spatial Grounding constructs a continuity memory graph to decouple global assets from shot-specific variables; (2) Canonical Asset Materialization instantiates entities in unified coordinate space to maintain visual identity; (3) Spatial-Temporal Dynamics achieves layout design and cinematic evolution through visual metrics. Uses hierarchical multi-agent orchestration with verification loop for self-correction.
Result: StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity.
Conclusion: StoryBlender presents a novel framework for grounded 3D storyboard generation that successfully addresses the dual challenges of consistency and editability through its story-centric reflection scheme and three-stage pipeline.
Abstract: Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three-stage pipeline: (1) Semantic-Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and demonstration video will be available on https://engineeringai-lab.github.io/StoryBlender/
[179] When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models
Jiho Choi, Jaemin Kim, Sanghwan Kim, Seunghoon Hong, Jin-Hwi Park
Main category: cs.CV
TL;DR: The paper investigates attention sinks in Large Vision-Language Models, categorizes them into vision-encoder and LLM-emerged types, reveals a performance trade-off between global scene priors and local perception, and proposes Layer-wise Sink Gating to dynamically balance these effects.
Details
Motivation: To understand the role of attention sinks in multimodal transformers, specifically whether they are redundant artifacts or essential global priors, and to explore their cross-modal impact in Large Vision-Language Models which remains largely unexplored.
Method: 1) Categorize visual sinks into ViT-emerged sinks (V-sinks) from vision encoder and LLM-emerged sinks (L-sinks) from deep LLM layers; 2) Analyze performance trade-offs; 3) Identify functional layers where sink modulation impacts performance most; 4) Propose Layer-wise Sink Gating (LSG) - a lightweight plug-and-play module that dynamically scales attention contributions of V-sink and other visual tokens, trained via standard next-token prediction without task-specific supervision.
Result: Analysis reveals fundamental performance trade-off: sinks effectively encode global scene-level priors but their dominance suppresses fine-grained visual evidence needed for local perception. LSG yields improvements on representative multimodal benchmarks in most layers, effectively balancing global reasoning and precise local evidence.
Conclusion: Attention sinks in LVLMs serve as essential global priors but need careful modulation to balance global scene understanding with local visual evidence. The proposed LSG module provides an effective, lightweight solution for this balance without requiring task-specific supervision or modifying the frozen LVLM backbone.
Abstract: Attention sinks are defined as tokens that attract disproportionate attention. While these have been studied in single modality transformers, their cross-modal impact in Large Vision-Language Models (LVLM) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first categorizes visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on the new definition, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sink and the rest visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. In most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
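The gating idea can be illustrated on a toy attention matrix. This is a sketch in the spirit of LSG only: the sink indices and the fixed gate value are hypothetical, and the real module learns its gate via next-token prediction rather than fixing it:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 6                 # sequence length
sink_idx = [0, 1]     # indices of visual sink tokens (hypothetical)

# Row-stochastic attention matrix with the sink columns dominating.
logits = rng.normal(size=(T, T))
logits[:, sink_idx] += 3.0
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def sink_gate(attn, sink_idx, g):
    """Scale the attention mass on sink tokens by gate g, then renormalize
    each row so it remains a valid attention distribution."""
    out = attn.copy()
    out[:, sink_idx] *= g
    return out / out.sum(axis=1, keepdims=True)

damped = sink_gate(attn, sink_idx, g=0.5)
# Rows still sum to 1, and sink tokens receive less mass than before,
# freeing attention for fine-grained local evidence.
assert np.allclose(damped.sum(axis=1), 1.0)
assert damped[:, sink_idx].sum() < attn[:, sink_idx].sum()
```

A gate above 1 would instead amplify the sinks' global scene prior, which is the trade-off the paper's analysis exposes.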
[180] Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning
Junyuan Liang, Qi Zhou, Sahan Bulathwela, Mutlu Cukurova
Main category: cs.CV
TL;DR: AI approach using pretrained models (YOLO11, YOLOE-26, Gaze-LLE) for automatic gaze behavior detection in collaborative learning without human annotation, achieving F1-score of 0.829 with strong cross-configuration robustness.
Details
Motivation: Current machine learning approaches for gaze behavior detection require large labeled datasets and lack cross-configuration robustness for diverse educational contexts. Need for scalable AI methods that don't rely on human annotation.
Method: Uses pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and Gaze-LLE model for gaze target prediction. Combines these foundation models to create annotation-free gaze behavior detection system.
Result: Achieves F1-score of 0.829 for gaze behavior detection, with strong performance for laptop-directed gaze (0.85) and peer-directed gaze (0.82), but weaker for other targets. Outperforms supervised ML approaches in complex contexts and shows better cross-configuration robustness.
Conclusion: Proposed approach provides scalable, annotation-free solution for gaze behavior detection in collaborative learning, with practical implications for real-world educational support systems.
Abstract: Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students’ gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students’ collaborative learning in real-world environments are also discussed.
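The final step of such a pipeline, turning a predicted gaze point and detected object boxes into a behaviour label, can be sketched as a simple point-in-box test. The assignment rule, labels, and coordinates below are hypothetical; the paper's exact rule may differ:

```python
def classify_gaze(gaze_xy, boxes):
    """Map a predicted gaze point to a behaviour label by testing which
    detected bounding box it falls into; 'other' if none match."""
    x, y = gaze_xy
    for label, (x1, y1, x2, y2) in boxes.items():
        if x1 <= x <= x2 and y1 <= y <= y2:
            return label
    return "other"

# Hypothetical detections for one frame (pixel coordinates, x1/y1/x2/y2).
boxes = {"laptop": (100, 200, 300, 400), "peer": (350, 50, 500, 300)}
assert classify_gaze((150, 250), boxes) == "laptop"
assert classify_gaze((400, 100), boxes) == "peer"
assert classify_gaze((10, 10), boxes) == "other"
```

The "other" fallback is where the reported weaker performance on remaining gaze targets would surface.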
[181] EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
Zhenghao Chen, Huiqun Wang, Di Huang
Main category: cs.CV
TL;DR: EgoMind is a Chain-of-Thought framework for geometry-free spatial reasoning in MLLMs using Role-Play Caption and Progressive Spatial Analysis, achieving competitive results with minimal training data.
Details
Motivation: Existing approaches for spatial reasoning in MLLMs either require expensive 3D priors/geometric supervision or struggle with multi-frame spatial reasoning due to limited cross-frame relationship capture. There's a need for a geometry-free approach that can effectively reason about spatial relationships across frames.
Method: Proposes EgoMind framework with two key components: 1) Role-Play Caption - jointly constructs coherent linguistic scene graphs across frames, and 2) Progressive Spatial Analysis - progressively reasons toward task-specific questions through chain-of-thought reasoning.
Result: Achieves competitive results on multiple benchmarks (VSI-Bench, SPAR-Bench, SITE-Bench, SPBench) with only 5K auto-generated SFT samples and 20K RL samples, demonstrating effective spatial reasoning without geometric supervision.
Conclusion: EgoMind strengthens spatial reasoning capabilities of MLLMs through linguistic reasoning, highlighting the potential of geometry-free approaches for spatial cognition tasks with minimal training data requirements.
Abstract: Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.
[182] Robust Multi-Source Covid-19 Detection in CT Images
Asmita Yuki Pritha, Jason Xu, Daniel Ding, Justin Li, Aryana Hou, Xin Wang, Shu Hu
Main category: cs.CV
TL;DR: Multi-task learning approach for COVID-19 detection from chest CT scans that jointly predicts diagnosis and data source to improve generalization across multiple medical centers with different scanners and protocols.
Details
Motivation: Existing COVID-19 detection models from CT scans perform well within single institutions but struggle with multi-center data due to scanner, protocol, and population differences. Models become biased toward centers with more training data.
Method: Proposes multi-task learning with shared EfficientNet-B7 backbone to predict both COVID-19 diagnosis and originating data center. Uses logit-adjusted cross-entropy loss for source classification to handle imbalanced data distribution across centers. Preprocessing follows SSFL framework with KDS selecting 8 slices per scan.
Result: Achieves F1 score of 0.9098 and AUC-ROC of 0.9647 on validation set of 308 scans, demonstrating improved generalization across multiple medical centers.
Conclusion: Jointly learning to predict both diagnosis and data source helps create more robust representations that generalize better across different medical imaging centers, addressing domain shift challenges in medical AI.
Abstract: Deep learning models for COVID-19 detection from chest CT scans generally perform well when the training and test data originate from the same institution, but they often struggle when scans are drawn from multiple centres with differing scanners, imaging protocols, and patient populations. One key reason is that existing methods treat COVID-19 classification as the sole training objective, without accounting for the data source of each scan. As a result, the learned representations tend to be biased toward centres that contribute more training data. To address this, we propose a multi-task learning approach in which the model is trained to predict both the COVID-19 diagnosis and the originating data centre. The two tasks share an EfficientNet-B7 backbone, which encourages the feature extractor to learn representations that hold across all four participating centres. Since the training data is not evenly distributed across sources, we apply a logit-adjusted cross-entropy loss [1] to the source classification head to prevent underrepresented centres from being overlooked. Our pre-processing follows the SSFL framework with KDS [2], selecting eight representative slices per scan. Our method achieves an F1 score of 0.9098 and an AUC-ROC of 0.9647 on a validation set of 308 scans. The code is publicly available at https://github.com/Purdue-M2/-multisource-covid-ct.
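The logit-adjusted cross-entropy applied to the source head can be sketched as follows: the class prior is folded into the logits inside the loss, so underrepresented centres demand a larger margin. The prior values and logits below are hypothetical:

```python
import numpy as np

def logit_adjusted_ce(logits, label, class_priors, tau=1.0):
    """Logit-adjusted cross-entropy: add tau * log(prior) to the logits
    inside the loss so rare classes are not overwhelmed by frequent ones."""
    z = logits + tau * np.log(class_priors)
    z = z - z.max()                                # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Hypothetical 4-centre distribution: centre 3 is underrepresented.
priors = np.array([0.4, 0.3, 0.25, 0.05])
logits = np.array([2.0, 1.0, 0.5, 0.5])

# Uniform priors recover plain cross-entropy; the skewed priors raise
# the loss on the rare centre at the same logits.
plain = logit_adjusted_ce(logits, 3, np.ones(4) / 4)
adjusted = logit_adjusted_ce(logits, 3, priors)
assert adjusted > plain
```

That larger penalty is what keeps the shared backbone from ignoring the smallest centre during multi-task training.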
[183] VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing
Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li, Zihang Lv, Gang Chen, Fang Deng
Main category: cs.CV
TL;DR: VitaTouch: A vision-tactile-language model for material property inference and natural language description, achieving state-of-the-art performance on multimodal benchmarks and practical robotic applications.
Details
Motivation: Vision-only methods for quality inspection in smart manufacturing are limited by occlusion and reflection issues, and cannot capture intrinsic material properties beyond visible geometry. There's a need for multimodal approaches that combine vision with tactile sensing for comprehensive material understanding.
Method: Uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, compresses them into prefix tokens for an LLM. Aligns each modality with text through contrastive learning and explicitly couples vision and touch. Includes VitaSet dataset with 186 objects, 52k images, and 5.1k instruction-answer pairs.
Result: Achieves best performance on HCT and overall TVL benchmark, competitive on SSVTP. On VitaSet: 88.89% hardness accuracy, 75.13% roughness accuracy, 54.81% descriptor recall, 0.9009 semantic similarity for material description. With LoRA fine-tuning: 100%, 96%, 92% accuracy for 2-, 3-, 5-category defect recognition; 94% closed-loop recognition accuracy and 94% end-to-end sorting success in robotic trials.
Conclusion: VitaTouch demonstrates effective multimodal fusion of vision, touch, and language for material property understanding, with strong performance on benchmarks and practical robotic applications in smart manufacturing quality inspection.
Abstract: Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
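The contrastive coupling of modalities typically uses an InfoNCE-style objective over paired embeddings. The sketch below shows one direction of that standard loss; batch size, dimensions, and temperature are hypothetical, and VitaTouch's exact formulation may differ:

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """One-directional InfoNCE over a batch of paired embeddings: each
    row of z_a should be most similar to the matching row of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T / temperature          # (B, B) similarity matrix
    log_sm = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_sm))         # diagonal = correct pairs

rng = np.random.default_rng(2)
vision = rng.normal(size=(4, 8))             # e.g. visual embeddings
aligned = info_nce(vision, vision.copy())        # perfectly paired "tactile" batch
shuffled = info_nce(vision, vision[::-1].copy()) # mismatched pairs
assert aligned < shuffled
```

Minimizing this pulls matched vision-touch pairs together and pushes mismatched pairs apart, which is what aligns the two sensing modalities with a shared text space.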
[184] Safety-Aligned 3D Object Detection: Single-Vehicle, Cooperative, and End-to-End Perspectives
Brian Hsuan-Cheng Liao, Chih-Hong Cheng, Hasan Esen, Alois Knoll
Main category: cs.CV
TL;DR: Safety-aligned evaluation and optimization for 3D object detection in autonomous vehicles, focusing on safety-critical errors rather than all errors equally.
Details
Motivation: Current perception systems treat all errors equally, but only a subset of perception errors are safety-critical. There's a need for safety-aligned evaluation metrics and optimization methods that explicitly characterize high-impact errors in autonomous vehicle perception.
Method: Uses safety-oriented metric NDS-USC and safety-aware loss function EC-IoU. Evaluates single-vehicle 3D detection models across architectures/sensing modalities, assesses AV-infrastructure cooperative detection models, and integrates EC-IoU into SparseDrive end-to-end framework.
Result: Safety-aware fine-tuning improves safety-critical detection; cooperative models outperform vehicle-only models; safety-aware perception hardening reduces collision rate by nearly 30% in end-to-end systems.
Conclusion: Safety-aligned perception evaluation and optimization offer practical path to enhance CAV safety across single-vehicle, cooperative, and end-to-end autonomy settings.
Abstract: Perception plays a central role in connected and autonomous vehicles (CAVs), underpinning not only conventional modular driving stacks, but also cooperative perception systems and recent end-to-end driving models. While deep learning has greatly improved perception performance, its statistical nature makes perfect predictions difficult to attain. Meanwhile, standard training objectives and evaluation benchmarks treat all perception errors equally, even though only a subset is safety-critical. In this paper, we investigate safety-aligned evaluation and optimization for 3D object detection that explicitly characterize high-impact errors. Building on our previously proposed safety-oriented metric, NDS-USC, and safety-aware loss function, EC-IoU, we make three contributions. First, we present an expanded study of single-vehicle 3D object detection models across diverse neural network architectures and sensing modalities, showing that gains under standard metrics such as mAP and NDS may not translate to safety-oriented criteria represented by NDS-USC. With EC-IoU, we reaffirm the benefit of safety-aware fine-tuning for improving safety-critical detection performance. Second, we conduct an ego-centric, safety-oriented evaluation of AV-infrastructure cooperative object detection models, underscoring its superiority over vehicle-only models and demonstrating a safety impact analysis that illustrates the potential contribution of cooperative models to “Vision Zero.” Third, we integrate EC-IoU into SparseDrive and show that safety-aware perception hardening can reduce collision rate by nearly 30% and improve system-level safety directly in an end-to-end perception-to-planning framework. Overall, our results indicate that safety-aligned perception evaluation and optimization offer a practical path toward enhancing CAV safety across single-vehicle, cooperative, and end-to-end autonomy settings.
[185] Review and Evaluation of Point-Cloud based Leaf Surface Reconstruction Methods for Agricultural Applications
Arif Ahmed, Parikshit Maini
Main category: cs.CV
TL;DR: Comparative study of 9 surface reconstruction methods for leaf surfaces from 3D point clouds, evaluated on three agricultural datasets to guide method selection for resource-constrained robotic platforms.
Details
Motivation: Accurate leaf surface reconstruction from 3D point clouds is essential for agricultural phenotyping, but real-world plant data is complex and existing methods' relative performance for leaf reconstruction is insufficiently understood.
Method: Comparative evaluation of nine representative surface reconstruction methods (including parametric, triangulation-based, implicit, and learning-based approaches) on three public datasets: LAST-STRAW, Pheno4D, and Crops3D, covering diverse species, sensors, and environments.
Result: Each method exhibits distinct advantages depending on application and resource constraints, with trade-offs between surface area estimation accuracy, smoothness, robustness to noise/missing data, and computational cost.
Conclusion: The findings provide practical guidance for selecting surface reconstruction techniques for resource-constrained robotic platforms in agricultural applications.
Abstract: Accurate reconstruction of leaf surfaces from 3D point cloud is essential for agricultural applications such as phenotyping. However, real-world plant data (i.e., irregular 3D point cloud) are often complex to reconstruct plant parts accurately. A wide range of surface reconstruction methods has been proposed, including parametric, triangulation-based, implicit, and learning based approaches, yet their relative performance for leaf surface reconstruction remains insufficiently understood. In this work, we present a comparative study of nine representative surface reconstruction methods for leaf surfaces. We evaluate these methods on three publicly available datasets: LAST-STRAW, Pheno4D, and Crops3D - spanning diverse species, sensors, and sensing environments, ranging from clean high-resolution indoor scans to noisy low-resolution field settings. The analysis highlights the trade-offs between surface area estimation accuracy, smoothness, robustness to noise and missing data, and computational cost across different methods. These factors affect the cost and constraints of robotic hardware used in agricultural applications. Our results show that each method exhibits distinct advantages depending on application and resource constraints. The findings provide practical guidance for selecting surface reconstruction techniques for resource constrained robotic platforms.
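One of the criteria compared above, surface-area estimation, is conventionally computed by summing triangle areas over the reconstructed mesh. A minimal stdlib sketch (the toy "leaf" geometry is invented for illustration):

```python
import math

def triangle_area(p, q, r):
    """Area of a 3D triangle as half the cross-product magnitude."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    cx = u[1] * v[2] - u[2] * v[1]
    cy = u[2] * v[0] - u[0] * v[2]
    cz = u[0] * v[1] - u[1] * v[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def mesh_area(vertices, faces):
    """Surface area of a triangulated mesh: sum of its triangle areas."""
    return sum(triangle_area(*(vertices[i] for i in f)) for f in faces)

# Toy flat "leaf": a unit square split into two triangles -> area 1.0.
verts = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
faces = [(0, 1, 2), (0, 2, 3)]
assert abs(mesh_area(verts, faces) - 1.0) < 1e-9
```

Noise and holes in field-grade point clouds distort exactly this sum, which is why robustness to missing data is weighed against accuracy in the study.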
[186] Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis
Akshat Pandya, Bhavuk Jain
Main category: cs.CV
TL;DR: Survey paper analyzing adaptation strategies for extending 2D vision models (CNNs/ViTs) to 3D data, classifying approaches into data-centric, architecture-centric, and hybrid methods, with discussion of trade-offs and future directions.
Details
Motivation: The success of 2D vision models (CNNs and Vision Transformers) has driven interest in applying them to 3D analysis, but there's a fundamental gap between regular 2D image grids and irregular, sparse 3D data like point clouds and meshes.
Method: Provides a comprehensive survey and unified taxonomy of adaptation strategies, classifying them into three families: 1) Data-centric methods that project 3D data into 2D formats, 2) Architecture-centric methods that design intrinsic 3D networks, and 3) Hybrid methods combining both paradigms.
Result: Qualitative analysis of fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and preservation of geometric inductive biases.
Conclusion: Identifies key open challenges and outlines promising future research directions including development of 3D foundation models, advancements in self-supervised learning for geometric data, and deeper integration of multi-modal signals.
Abstract: The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D vision has spurred significant research in extending these architectures to the complex domain of 3D analysis. Yet, a core challenge arises from a fundamental dichotomy between the regular, dense grids of 2D images and the irregular, sparse nature of 3D data such as point clouds and meshes. This survey provides a comprehensive review and a unified taxonomy of adaptation strategies that bridge this gap, classifying them into three families: (1) Data-centric methods that project 3D data into 2D formats to leverage off-the-shelf 2D models, (2) Architecture-centric methods that design intrinsic 3D networks, and (3) Hybrid methods, which synergistically combine the two modeling paradigms to benefit from both rich visual priors of large 2D datasets and explicit geometric reasoning of 3D models. Through this framework, we qualitatively analyze the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. We discuss key open challenges and outline promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning (SSL) for geometric data, and the deeper integration of multi-modal signals.
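The data-centric family can be made concrete with the simplest possible projection: rasterize a point cloud into a small depth image that an off-the-shelf 2D network could then consume. The orthographic view, grid resolution, and nearest-z rule below are one illustrative choice among many projection schemes:

```python
import numpy as np

def orthographic_depth_map(points, res=4):
    """Project a 3D point cloud (N, 3) onto a res x res depth grid,
    keeping the nearest z value per cell (infinity = empty cell)."""
    xy = points[:, :2]
    lo, span = xy.min(axis=0), np.ptp(xy, axis=0)
    cells = np.minimum(((xy - lo) / span * res).astype(int), res - 1)
    depth = np.full((res, res), np.inf)
    for (cx, cy), z in zip(cells, points[:, 2]):
        depth[cy, cx] = min(depth[cy, cx], z)
    return depth

rng = np.random.default_rng(3)
cloud = rng.uniform(size=(200, 3))           # synthetic point cloud
d = orthographic_depth_map(cloud)
assert d.shape == (4, 4) and np.isfinite(d).sum() > 0
```

The taxonomy's trade-off is visible even here: the projection discards occluded geometry (only the nearest point per cell survives), which is the price paid for reusing 2D visual priors.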
[187] Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat
Meng’en Qin, Zhe Li, Xiaohui Yang
Main category: cs.CV
TL;DR: RGxEStat is a lightweight interactive tool for analyzing Genotype-by-Environment interactions in breeding data, providing significance and stability analysis without requiring programming skills.
Details
Motivation: GxE interactions reduce phenotype predictability in different environments, making it crucial to understand how genetic traits are expressed under specific conditions to improve breeding practices and genetic selection.
Method: Two key models: 1) significance analysis using mixed effect models to identify genes/GxE effects on phenotypic traits, and 2) stability analysis examining genotype-environment interactions and relative performance across environments. Implemented in RGxEStat tool with user-friendly interface.
Result: RGxEStat provides an accessible tool that eliminates the need for SAS/R programming, streamlining breeding data analysis and accelerating research cycles. Code and datasets are publicly available.
Conclusion: RGxEStat enables breeders and agronomists to efficiently analyze GxE interactions through an intuitive interface, facilitating better understanding of genetic expression across environments and improving breeding practices.
Abstract: Genotype-by-Environment (GxE) interactions influence the performance of genotypes across diverse environments, reducing the predictability of phenotypes in target environments. In-depth analysis of GxE interactions facilitates the identification of how genetic advantages or defects are expressed or suppressed under specific environmental conditions, thereby enabling genetic selection and enhancing breeding practices. This paper introduces two key models for GxE interaction research. Specifically, it includes significance analysis based on the mixed effect model to determine whether genes or GxE interactions significantly affect phenotypic traits; stability analysis, which further investigates the interactive relationships between genes and environments, as well as the relative superiority or inferiority of genotypes across environments. Additionally, this paper presents RGxEStat, a lightweight interactive tool, which is developed by the authors and integrates the construction, solution, and visualization of the aforementioned models. Designed to eliminate the need for breeders and agronomists to learn complex SAS or R programming, RGxEStat provides a user-friendly interface for streamlined breeding data analysis, significantly accelerating research cycles. Codes and datasets are available at https://github.com/mason-ching/RGxEStat.
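The significance question, "does the GxE interaction affect the trait?", can be illustrated with a fixed-effects two-way ANOVA F statistic for the interaction term. This is a simplified stand-in for the mixed-effect models RGxEStat actually fits, and the crossover data below are synthetic:

```python
import numpy as np

def gxe_interaction_F(y):
    """F statistic for the GxE interaction in a balanced two-way layout.
    y has shape (G, E, n): genotypes x environments x replicates."""
    G, E, n = y.shape
    cell = y.mean(axis=2)                                   # cell means
    g_m, e_m, grand = cell.mean(axis=1), cell.mean(axis=0), cell.mean()
    inter = cell - g_m[:, None] - e_m[None, :] + grand      # interaction effects
    ss_int = n * (inter ** 2).sum()
    ss_err = ((y - cell[:, :, None]) ** 2).sum()            # within-cell error
    df_int, df_err = (G - 1) * (E - 1), G * E * (n - 1)
    return (ss_int / df_int) / (ss_err / df_err)

rng = np.random.default_rng(4)
# Strong crossover: genotype ranks flip between the two environments.
crossover = np.array([[1.0, -1.0], [-1.0, 1.0]])
y = crossover[:, :, None] + rng.normal(scale=0.1, size=(2, 2, 3))
F = gxe_interaction_F(y)
assert F > 50        # crossover interaction dominates the noise -> large F
```

A large F relative to the F(df_int, df_err) reference distribution is what flags a significant GxE effect; the stability analysis then asks which genotypes drive it.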
[188] Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
Wuqi Su, Huilun Song, Chen Zhao, Chi Xu
Main category: cs.CV
TL;DR: A novel monocular depth estimation method using multilevel perceptual CRF with Swin Transformer backbone, featuring adaptive hybrid pyramid fusion, hierarchical awareness adapters, and fully-connected CRF decoder with dynamic scaling attention.
Details
Motivation: Existing monocular depth estimation methods rely on complex network architectures that increase training costs and computational overhead without fully exploiting spatial dependencies. There's a need for more efficient approaches that better capture inter-pixel relationships while maintaining accuracy.
Method: Proposes a multilevel perceptual CRF model based on Swin Transformer with three key innovations: (1) adaptive hybrid pyramid feature fusion combining multi-scale spatial pyramid pooling with biaxial feature aggregation, (2) hierarchical awareness adapters with lightweight broadcast modules for cross-level feature interactions, and (3) fully-connected CRF decoder with dynamic scaling attention and bias learning unit.
Result: Achieves state-of-the-art performance on NYU Depth v2 (Abs Rel: 0.088, RMSE: 0.316), KITTI (near-perfect threshold accuracy δ<1.25³ ≈ 99.8%), and MatterPort3D datasets with only 194M parameters and 21ms inference time.
Conclusion: The proposed method effectively addresses limitations of existing approaches by better exploiting spatial dependencies through innovative feature fusion and CRF modeling, achieving superior performance with reduced computational complexity.
Abstract: Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 (-7.4%) and RMSE to 0.316 (-5.4%) on NYU Depth v2, while attaining near-perfect threshold accuracy ($\delta < 1.25^3 \approx 99.8\%$) on KITTI with only 194M parameters and 21ms inference time.
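The metrics quoted above (Abs Rel, RMSE, and the threshold accuracies) are the standard monocular-depth benchmarks; they can be computed as follows, with toy depth values invented for illustration:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over valid pixels: absolute
    relative error, RMSE, and the delta < 1.25**k threshold accuracies."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)        # symmetric depth ratio
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, rmse, deltas

gt = np.array([1.0, 2.0, 4.0, 8.0])     # ground-truth depths (metres)
pred = np.array([1.1, 1.9, 4.4, 7.0])   # hypothetical predictions
abs_rel, rmse, deltas = depth_metrics(pred, gt)
assert deltas[0] == 1.0                  # every ratio is within 1.25
assert abs_rel < 0.15
```

Note that delta accuracies are monotone in k, so a near-perfect delta < 1.25^3 (as reported on KITTI) is the loosest of the three thresholds.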
[189] Learning Additively Compositional Latent Actions for Embodied AI
Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen, Alex Lamb, Li Zhao, Jiang Bian
Main category: cs.CV
TL;DR: AC-LAM introduces additive composition constraints to latent action learning, enforcing algebraic structure (identity, inverse, cycle consistency) to learn more motion-specific and displacement-calibrated latent actions from visual transitions.
Details
Motivation: Existing latent action learning methods lack structural priors that encode the additive, compositional nature of physical motion, leading to entanglements with irrelevant scene details and miscalibrated motion magnitude.
Method: AC-LAM enforces scene-wise additive composition structure over short horizons on the latent action space, encouraging simple algebraic structure (identity, inverse, cycle consistency) and suppressing non-additive information.
Result: AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions, providing stronger supervision for downstream policy learning and outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
Conclusion: Additive composition constraints are effective for learning better-structured latent actions from visual transitions, improving downstream embodied AI applications.
Abstract: Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
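The three algebraic constraints named in the abstract (identity, inverse, cycle consistency) can be read as simple penalties on inferred latent actions. A hedged sketch, with our own notation (z_xy is the latent action for transition x → y), not the authors' code:

```python
import numpy as np

def ac_losses(z_ab, z_ba, z_bc, z_ac, z_aa):
    """Additive-composition penalties on latent actions (illustrative only).

    identity:  z_aa should be ~0 (no state change, no action)
    inverse:   z_ab + z_ba should be ~0 (going and coming back cancels)
    cycle:     z_ab + z_bc should reproduce the direct latent z_ac
    """
    sq = lambda v: float(np.mean(np.square(v)))  # mean squared magnitude
    return {
        "identity": sq(z_aa),
        "inverse": sq(z_ab + z_ba),
        "cycle": sq(z_ab + z_bc - z_ac),
    }
```

Latents satisfying all three constraints behave like displacements, which is what "displacement-calibrated" suggests.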
[190] Mixture-of-Experts in Remote Sensing: A Survey
Yongchuan Cui, Peng Liu, Lajiao Chen
Main category: cs.CV
TL;DR: A comprehensive survey paper reviewing Mixture-of-Experts (MoE) applications in remote sensing, covering principles, architectures, and applications across various remote sensing tasks.
Details
Motivation: Remote sensing data analysis faces challenges due to diverse sensor modalities and spatiotemporal dynamics. MoE models address these by routing inputs to specialized experts, but there's no comprehensive review of MoE applications in remote sensing.
Method: This is a survey paper that systematically reviews MoE applications in remote sensing, covering fundamental principles, architectural designs, and key applications across various remote sensing tasks.
Result: The survey provides the first systematic overview of MoE for remote sensing, organizing existing research and outlining future trends to inspire further innovation in applying MoE to remote sensing challenges.
Conclusion: MoE is a powerful paradigm for remote sensing that addresses data diversity challenges, and this survey fills a gap by providing comprehensive coverage to guide future research in this area.
Abstract: Remote sensing data analysis and interpretation present unique challenges due to the diversity in sensor modalities and spatiotemporal dynamics of Earth observation data. Mixture-of-Experts (MoE) model has emerged as a powerful paradigm that addresses these challenges by dynamically routing inputs to specialized experts designed for different aspects of a task. However, despite rapid progress, the community still lacks a comprehensive review of MoE for remote sensing. This survey provides the first systematic overview of MoE applications in remote sensing, covering fundamental principles, architectural designs, and key applications across a variety of remote sensing tasks. The survey also outlines future trends to inspire further research and innovation in applying MoE to remote sensing.
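As background for the routing idea the survey covers, here is a generic top-k MoE layer for a single token, sketched in NumPy. This is the textbook formulation, not any specific surveyed model:

```python
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Minimal top-k Mixture-of-Experts layer for one token.

    x: (d,) input; experts: list of callables; router_w: (n_experts, d).
    Routes x to the k highest-scoring experts and mixes their outputs
    with softmax-renormalized gate weights.
    """
    logits = router_w @ x
    top = np.argsort(logits)[-k:]                 # indices of top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # renormalize over top-k only
    return sum(g * experts[i](x) for g, i in zip(gates, top))
```

Only k experts run per input, which is what makes MoE attractive for heterogeneous remote sensing modalities: capacity grows with expert count while per-input compute stays roughly constant.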
[191] YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
Nikhileswara Rao Sulake
Main category: cs.CV
TL;DR: YOLOv11 introduces novel architectural modules (C3K2 blocks, SPPF, C2PSA) to improve feature extraction and small-object detection while maintaining real-time performance for applications like autonomous driving and surveillance.
Details
Motivation: To advance the YOLO series of real-time object detectors by improving feature extraction capabilities, particularly for small objects, while maintaining the real-time inference speed that makes YOLO models practical for applications like autonomous driving and surveillance.
Method: Introduces three key architectural innovations: C3K2 blocks for improved feature extraction, Spatial Pyramid Pooling - Fast (SPPF) for enhanced spatial feature processing, and C2PSA (Cross Stage Partial with Spatial Attention) modules. The paper provides detailed analysis of YOLOv11’s backbone, neck, and head components.
Result: YOLOv11 achieves superior mean Average Precision (mAP) compared to prior YOLO versions on standard benchmarks while maintaining real-time inference speed. The improvements are particularly notable for small-object detection.
Conclusion: YOLOv11 successfully advances real-time object detection by balancing accuracy improvements with maintained speed, making it suitable for practical applications requiring both precision and real-time performance.
Abstract: YOLOv11 is the latest iteration in the You Only Look Once (YOLO) series of real-time object detectors, introducing novel architectural modules to improve feature extraction and small-object detection. In this paper, we present a detailed analysis of YOLOv11, including its backbone, neck, and head components. The model's key innovations, the C3K2 blocks, Spatial Pyramid Pooling - Fast (SPPF), and C2PSA (Cross Stage Partial with Spatial Attention) modules, enhance spatial feature processing while preserving speed. We compare YOLOv11's performance to prior YOLO versions on standard benchmarks, highlighting improvements in mean Average Precision (mAP) and inference speed. Our results demonstrate that YOLOv11 achieves superior accuracy without sacrificing real-time capabilities, making it well-suited for applications in autonomous driving, surveillance, and video analytics. This work formalizes YOLOv11 in a research context, providing a clear reference for future studies.
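SPPF itself predates YOLOv11: it chains small stride-1 max pools so that the concatenated outputs approximate classic SPP's parallel large-kernel pools at lower cost. A minimal NumPy sketch of that design (helper names are ours):

```python
import numpy as np

def maxpool_same(x, k=5):
    """Stride-1 max pool with 'same' padding on a (C, H, W) array."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    C, H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sppf(x, k=5):
    """SPPF: three chained k x k stride-1 max pools, concatenated with the
    input along channels. Chaining emulates k, 2k-1, 3k-2 receptive fields."""
    y1 = maxpool_same(x, k)
    y2 = maxpool_same(y1, k)
    y3 = maxpool_same(y2, k)
    return np.concatenate([x, y1, y2, y3], axis=0)
```

The channel count quadruples while spatial resolution is preserved, pooling context at multiple scales in one cheap block.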
[192] ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching
Xiaoji Niu, Yuqing Wang, Yan Wang, Hailiang Tang, Tisheng Zhang
Main category: cs.CV
TL;DR: ViBA is a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams, improving visual odometry performance through differentiable bundle adjustment and temporal consistency.
Details
Motivation: Existing image keypoint detection methods rely on datasets with accurate pose and depth annotations, limiting scalability and generalization. These limitations degrade navigation and localization performance in real-world scenarios.
Method: ViBA integrates geometric optimization with feature learning through: (1) initial tracking network for inter-frame correspondences, (2) depth-based outlier filtering, and (3) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors.
Result: On EuRoC and UMA datasets, ViBA reduces mean absolute translation error by 12-18% and absolute rotation error by 5-10% compared to state-of-the-art methods (SuperPoint+SuperGlue, ALIKED, LightGlue), while maintaining real-time inference speeds (36-91 FPS). On unseen sequences, it retains over 90% localization accuracy.
Conclusion: ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios without requiring annotated datasets.
Abstract: Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, limiting scalability and generalization, and often degrading navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it consists of an implicitly differentiable geometric residual framework: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining geometric consistency from BA with long-term temporal consistency across frames, ViBA enforces stable and accurate feature representations. We evaluate ViBA on EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences, while maintaining real-time inference speeds (FPS 36-91). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization. These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
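The reprojection errors that ViBA's bundle adjustment minimizes are the standard pinhole-camera residuals. A sketch of that quantity for one camera (our notation; the paper's differentiable solver refines poses and points against exactly this kind of residual):

```python
import numpy as np

def reprojection_residuals(points_w, R, t, K, observations):
    """Residuals a bundle adjustment step would minimize.

    points_w: (N, 3) world points; R: (3, 3) rotation; t: (3,) translation;
    K: (3, 3) camera intrinsics; observations: (N, 2) measured pixels.
    Returns (N, 2) reprojection errors (projected - observed).
    """
    cam = points_w @ R.T + t          # world -> camera frame
    pix = cam @ K.T                   # pinhole projection
    pix = pix[:, :2] / pix[:, 2:3]    # perspective divide
    return pix - observations
```

Making this residual differentiable end-to-end is what lets gradients flow back into the feature network during online training.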
[193] Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
Kenan Tang, Praveen Arunshankar, Andong Hua, Anthony Yang, Yao Qin
Main category: cs.CV
TL;DR: Paper identifies iterative degradation problem in multi-turn image editing where repeated edits cause quality deterioration that current quality metrics fail to detect, introducing Banana100 dataset to study this issue.
Details
Motivation: Multi-modal agentic systems for iterative image editing suffer from quality degradation over multiple editing steps, with minor artifacts accumulating into severe noise and instruction-following failures, while current image quality evaluators cannot detect this degradation.
Method: Introduces Banana100 dataset containing 28,000 degraded images generated through 100 iterative editing steps across diverse textures and content, and evaluates 21 popular no-reference image quality assessment metrics on their ability to detect degradation.
Result: Found that none of the 21 NR-IQA metrics consistently assign lower scores to heavily degraded images compared to clean ones, revealing dual failures in both generators (quality degradation) and evaluators (inability to detect degradation).
Conclusion: The fragility of multi-modal agentic systems in iterative editing poses risks for model training and system safety if low-quality synthetic data escapes quality filters, necessitating more robust models and evaluation methods.
Abstract: The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing, which is the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to a severe accumulation of visible noise and a failure to follow simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps, including diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation. Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none of them consistently assign lower scores to heavily degraded images than to clean ones. The dual failures of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems, if the low-quality synthetic data generated by multi-turn edits escape quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.
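One simple way to operationalize the evaluator failure described above: a metric that tracks iterative degradation should score images lower as the edit count grows, i.e. show a strongly negative rank correlation between iteration index and score. A sketch of that sanity check (our framing, not the paper's exact protocol):

```python
import numpy as np

def degradation_sensitivity(steps, scores):
    """Spearman rank correlation between edit-iteration index and NR-IQA score.

    Strongly negative: the metric tracks degradation. Near zero or positive:
    the metric is blind to it. (Tied values are ignored for brevity.)
    """
    def ranks(v):
        r = np.empty(len(v))
        r[np.argsort(v)] = np.arange(len(v))
        return r
    rs, rq = ranks(np.asarray(steps, float)), ranks(np.asarray(scores, float))
    rs -= rs.mean(); rq -= rq.mean()
    return float((rs @ rq) / np.sqrt((rs @ rs) * (rq @ rq)))
```

The paper's finding amounts to this value failing to be reliably negative for any of the 21 NR-IQA metrics tested.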
[194] KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models
Haifeng Huang, Yang Li
Main category: cs.CV
TL;DR: KiToke is a training-free token compression method for Video LLMs that reduces visual tokens by estimating global token diversity using kernel-based redundancy measures and temporal interval construction, achieving better performance than existing methods even at extreme compression ratios.
Details
Motivation: Video LLMs suffer from high inference costs due to large numbers of visual tokens. Existing compression methods rely on local or segment-level heuristics, which don't effectively capture global redundancy across entire videos, especially under extreme token budgets.
Method: 1) Uses kernel-based redundancy measure to estimate token diversity globally across entire video; 2) Content-adaptive token selection that remains effective under extreme token budgets; 3) Lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence; 4) Training-free and query-agnostic approach.
Result: Extensive experiments on multiple video understanding benchmarks and Video LLM backbones show KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.
Conclusion: KiToke provides an effective training-free solution for reducing Video LLM inference costs by capturing global redundancy and maintaining temporal coherence, enabling efficient token utilization even under extreme compression scenarios.
Abstract: Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and Video LLM backbones demonstrate that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.
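To make "kernel-based redundancy" concrete: one standard way to select a diverse token subset under a kernel similarity is greedy max-min selection, where each pick minimizes similarity to everything kept so far. This is a generic stand-in for the paper's global diversity estimation, not its actual algorithm:

```python
import numpy as np

def select_diverse_tokens(tokens, budget, gamma=1.0):
    """Greedy max-min token selection under an RBF kernel (illustrative).

    tokens: (N, d) features; budget: number of tokens to keep. Each new pick
    is the token least similar (most diverse) to the kept set.
    """
    d2 = np.square(tokens[:, None, :] - tokens[None, :, :]).sum(-1)
    sim = np.exp(-gamma * d2)                  # RBF kernel similarity matrix
    kept = [0]                                 # seed with the first token
    while len(kept) < budget:
        max_sim = sim[:, kept].max(axis=1)     # redundancy w.r.t. kept set
        max_sim[kept] = np.inf                 # never re-pick a kept token
        kept.append(int(np.argmin(max_sim)))
    return sorted(kept)
```

Operating on the kernel matrix of the whole video, rather than per segment, is what distinguishes a global scheme like this from the local heuristics the paper criticizes.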
[195] Zero-Shot Quantization via Weight-Space Arithmetic
Daniele Solombrino, Antonio Andrea Gargiulo, Adrian Robert Minut, Luca Zhou, Alessandro Zirilli, Emanuele Rodolà
Main category: cs.CV
TL;DR: Quantization robustness can be transferred between models via weight-space arithmetic without retraining, enabling zero-shot improvement for low-bit deployment.
Details
Motivation: Post-training quantization (PTQ) often degrades model performance, especially at extremely low bits. Quantization-aware training (QAT) requires expensive retraining with data. The paper aims to find a low-cost, zero-shot alternative to QAT by leveraging transferable robustness properties in weight space.
Method: Extract a ‘quantization vector’ from a donor model using weight-space arithmetic (simple subtraction between quantized and original weights). This vector captures robustness to quantization noise. Transfer this vector to patch receiver models, improving their PTQ robustness without any receiver-side training or data.
Result: The method improves robustness to PTQ-induced noise by up to 60% on Vision Transformer (ViT) models. It provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment without requiring receiver training data.
Conclusion: Quantization robustness is a reusable feature of weight-space geometry that can be transferred between models rather than retrained, offering practical benefits for efficient model deployment.
Abstract: We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve robustness to PTQ-induced noise by as much as 60%, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. We demonstrate this on Vision Transformer (ViT) models. More broadly, our results suggest that quantization robustness is not merely a byproduct of task-specific training, but a reusable feature of weight-space geometry that can be transferred rather than retrained.
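The weight-space arithmetic involved is a subtraction and an addition. A schematic sketch (the exact donor-side extraction recipe and any scaling are assumptions here, not the paper's specification):

```python
import numpy as np

def uniform_quantize(w, bits=4):
    """Symmetric uniform PTQ of a weight array (standard round-to-nearest)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def quantization_vector(w_robust, w_base):
    """Weight-space direction separating a quantization-robust donor
    from its non-robust counterpart."""
    return w_robust - w_base

def patch(w_receiver, v, alpha=1.0):
    """Apply the donor's quantization vector to a receiver model, zero-shot."""
    return w_receiver + alpha * v
```

The claim being tested is then that `uniform_quantize(patch(W, v))` degrades the receiver less than `uniform_quantize(W)`, with no receiver-side data or training.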
[196] Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models
Ye Bi, Bimala Acharya, David Rosero, Juan Steibel
Main category: cs.CV
TL;DR: Foundation model-centered workflow for automated monitoring of group-housed nursery pigs using vision-language FMs as visual backbones with modular post-processing for farm-specific adaptation.
Details
Motivation: Current precision livestock farming relies on supervised learning models requiring extensive labeled data and farm-specific tuning. This study aims to leverage foundation models for scalable, label-efficient monitoring in pig production.
Method: Uses pretrained vision-language foundation models (Grounding-DINO, Grounded-SAM2) as visual backbones, with modular post-processing including temporal tracking logic, short-term video segmentation, and long-term tracking pipeline with initialization, tracking, matching, mask refinement, re-identification, and quality control.
Result: Achieved over 80% fully correct active tracks on 4,927 segments, maintained stable identities in 132-minute video with mean region similarity of 0.83, contour accuracy of 0.92, J&F of 0.87, MOTA of 0.99, MOTP of 90.7%, and no identity switches.
Conclusion: Demonstrates how foundation model prior knowledge combined with lightweight task-specific logic enables scalable, label-efficient, long-duration monitoring in livestock farming, reducing reliance on supervised learning.
Abstract: Foundation models (FM) are reshaping computer vision by reducing reliance on task-specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm-specific tuning. This study presents an FM-centered workflow for automated monitoring of group-housed nursery pigs, in which pretrained vision-language FM serve as general visual backbones and farm-specific adaptation is achieved through modular post-processing. Grounding-DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night-vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short-term video segmentation with Grounded-SAM2 was evaluated on 550 one-minute video clips; after post-processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long-term tracking pipeline integrating initialization, tracking, matching, mask refinement, re-identification, and post-hoc quality control. This system was evaluated on a continuous 132-minute video and maintained stable identities throughout. On 132 uniformly sampled ground-truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task-specific logic to enable scalable, label-efficient, and long-duration monitoring in pig production.
[197] Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification
Thomas Manuel Rost
Main category: cs.CV
TL;DR: Circuit Duplication, an inference-time method from LLMs, applied to frozen DINOv3 embeddings for underwater species classification, improves performance without fine-tuning by duplicating transformer layers during forward pass.
Details
Motivation: Underwater species classification faces annotation costs and environmental variation challenges. While frozen embeddings from self-supervised vision models provide good baselines, the authors investigate whether inference-time improvements are possible without weight changes or fine-tuning.
Method: Apply Circuit Duplication (originally for LLMs) to frozen DINOv3 embeddings. Duplicate selected transformer layers during forward pass. Evaluate on AQUA20 benchmark with two settings: global circuit selection (single circuit for all) and class-specific circuit selection (different optimal circuits per species). Use simple semi-supervised downstream classifiers.
Result: Circuit Duplication consistently improves over standard frozen forward pass. Class-specific selection achieves macro F1 of 0.875 at max label budget, closing gap to fully supervised benchmark (0.889) to 1.4 points without gradient training. Four species exceed fully supervised reference, with octopus improving by +12.1 F1 points. 75% of classes prefer class-specific circuits.
Conclusion: Circuit Duplication effectively improves frozen vision model embeddings at inference time without training. Class-specific circuit selection provides substantial benefits, demonstrating class-dependent optimization. First application of Circuit Duplication to computer vision shows promising results for label-efficient marine image classification.
Abstract: Automated underwater species classification is constrained by annotation cost and environmental variation that limits the transferability of fully supervised models. Recent work has shown that frozen embeddings from self-supervised vision foundation models already provide a strong label-efficient baseline for marine image classification. Here we investigate whether this frozen-embedding regime can be improved at inference time, without fine-tuning or changing model weights. We apply Circuit Duplication, an inference-time method originally proposed for Large Language Models, in which a selected range of transformer layers is traversed twice during the forward pass. We evaluate on the class-imbalanced AQUA20 benchmark using frozen DINOv3 embeddings under two settings: global circuit selection, where a single duplicated circuit is chosen for the full dataset, and class-specific circuit selection, where each species may receive a different optimal circuit. Both settings use simple semi-supervised downstream classifiers. Circuit Duplication consistently improves over the standard frozen forward pass. At the maximum label budget, class-specific selection reaches a macro F1 of 0.875, closing the gap to the fully supervised ConvNeXt benchmark (0.889) to 1.4 points without any gradient-based training. Four species exceed their fully supervised reference, with octopus improving by +12.1 F1 points. Across all budgets, roughly 75% of classes prefer a class-specific circuit, indicating a genuinely class-dependent benefit. To our knowledge, this is the first application of Circuit Duplication to computer vision.
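The core mechanism, traversing a selected span of frozen layers twice during the forward pass, can be sketched with plain callables (a schematic, not the authors' implementation):

```python
def forward_with_duplication(layers, x, circuit=None):
    """Forward pass where layers[lo:hi] are traversed twice.

    layers: list of frozen blocks (callables); circuit: (lo, hi) half-open
    index range to duplicate, or None for the standard single pass.
    """
    lo, hi = circuit if circuit else (0, 0)
    for block in layers[:lo]:              # layers before the circuit, once
        x = block(x)
    for _ in range(2 if circuit else 1):   # the selected circuit, twice
        for block in layers[lo:hi]:
            x = block(x)
    for block in layers[hi:]:              # remaining layers, once
        x = block(x)
    return x
```

Because no weights change, "global" vs. "class-specific" selection reduces to searching over `(lo, hi)` once per dataset or once per species.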
[198] ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop
Kenan Tang, Jiasheng Guo, Jeffrey Lin, Yao Qin
Main category: cs.CV
TL;DR: ExpressEdit is an open-source Photoshop plugin for stylized facial expression editing that avoids artifacts from AI models and integrates with native Photoshop tools like Liquify, enabling fast expression editing with a comprehensive expression database.
Details
Motivation: Current AI image editing models for facial expression editing introduce global noise and pixel drift, preventing integration into professional image editing software and workflows used by artists.
Method: Developed a fully open-source Photoshop plugin that avoids common artifacts of proprietary models, integrates with native Photoshop operations, and uses a comprehensive expression database of 135 expression tags with example stories and images for retrieval-augmented generation.
Result: ExpressEdit seamlessly edits expressions within 3 seconds on a single consumer-grade GPU (significantly faster than proprietary models), is free from common artifacts, and robustly synergizes with native Photoshop operations.
Conclusion: ExpressEdit bridges the gap between AI expression editing models and professional workflows, offering a practical tool for artists with open-source code and dataset to facilitate future research and artistic exploration.
Abstract: Facial expressions of characters are a vital component of visual storytelling. While current AI image editing models hold promise for assisting artists in the task of stylized expression editing, these models introduce global noise and pixel drift into the edited image, preventing the integration of these models into professional image editing software and workflows. To bridge this gap, we introduce ExpressEdit, a fully open-source Photoshop plugin that is free from common artifacts of proprietary image editing models and robustly synergizes with native Photoshop operations such as Liquify. ExpressEdit seamlessly edits an expression within 3 seconds on a single consumer-grade GPU, significantly faster than popular proprietary models. Moreover, to support the generation of diverse expressions according to different narrative needs, we compile a comprehensive expression database of 135 expression tags enriched with example stories and images designed for retrieval-augmented generation. We open source the code and dataset to facilitate future research and artistic exploration.
[199] RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Ganlin Feng, Yuxi Long, Hafsa Ali, Erin Lou, Fahad Butt, Qian Liu, Yang Wang, Pingzhao Hu
Main category: cs.CV
TL;DR: RDFace: A curated benchmark dataset of 456 pediatric facial images across 103 rare genetic conditions, enabling data-efficient AI models for rare disease diagnosis with synthetic augmentation techniques.
Details
Motivation: Rare disease diagnosis using facial phenotypes is limited by scarce curated data and high similarity across conditions, requiring better datasets and AI methods for low-data scenarios.
Method: Created RDFace dataset with 456 ethically verified pediatric facial images across 103 rare conditions; benchmarked pretrained vision backbones with cross-validation; used DreamBooth and FastGAN for synthetic augmentation with facial landmark filtering for phenotype fidelity.
Result: Synthetic augmentation improved diagnostic accuracy by up to 13.7% in ultra-low-data regimes; vision-language model generated phenotype descriptions achieved 0.84 report similarity score between real and synthetic images.
Conclusion: RDFace provides a transparent benchmark dataset for equitable rare disease AI research and a scalable framework for evaluating diagnostic performance and synthetic medical imagery integrity.
Abstract: Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
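The landmark-based filter for synthetic images can be sketched as a thresholded distance between corresponding landmark sets. The normalization and threshold below are our assumptions, not the paper's exact criterion:

```python
import numpy as np

def landmark_similarity(real, synth):
    """Mean landmark displacement, normalized for scale (illustrative).

    real, synth: (K, 2) arrays of corresponding facial landmarks; distances
    are divided by the real face's bounding-box diagonal, so 0 = identical.
    """
    real, synth = np.asarray(real, float), np.asarray(synth, float)
    diag = np.linalg.norm(real.max(0) - real.min(0))
    return float(np.linalg.norm(real - synth, axis=1).mean() / diag)

def filter_synthetic(real, candidates, thr=0.05):
    """Keep synthetic images whose landmarks stay close to the real exemplar."""
    return [c for c in candidates if landmark_similarity(real, c) <= thr]
```

Filtering like this is what lets synthetic augmentation add volume without drifting away from the phenotype that makes each condition recognizable.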
[200] SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
Quentin Herau, Tianshuo Xu, Depu Meng, Jiezhi Yang, Chensheng Peng, Spencer Sherk, Yihan Hu, Wei Zhan
Main category: cs.CV
TL;DR: SpectralSplat disentangles appearance from geometry in 3D Gaussian Splatting for autonomous driving scenes, enabling relighting, appearance transfer, and consistent rendering across varying environmental conditions.
Details
Motivation: Current feed-forward 3D Gaussian Splatting methods for autonomous driving scenes entangle scene geometry with transient appearance properties (lighting, weather, time of day), preventing relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying conditions.
Method: Factors color prediction into appearance-agnostic base stream and appearance-conditioned adapted stream using shared MLP conditioned on global appearance embedding from DINOv2 features. Uses hybrid relighting pipeline combining physics-based intrinsic decomposition with diffusion-based generative refinement for training. Employs complementary consistency, reconstruction, cross-appearance, and base color losses. Introduces appearance-adaptable temporal history storing appearance-agnostic features.
Result: Preserves reconstruction quality of underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.
Conclusion: SpectralSplat successfully disentangles appearance from geometry in 3D Gaussian Splatting framework, enabling appearance manipulation and consistent rendering across varying environmental conditions in autonomous driving scenes.
Abstract: Feed-forward 3D Gaussian Splatting methods have achieved impressive reconstruction quality for autonomous driving scenes, yet they entangle scene geometry with transient appearance properties such as lighting, weather, and time of day. This coupling prevents relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying environmental conditions. We present SpectralSplat, a method that disentangles appearance from geometry within a feed-forward Gaussian Splatting framework. Our key insight is to factor color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. To enforce disentanglement, we train with paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion-based generative refinement, and supervise with complementary consistency, reconstruction, cross-appearance, and base color losses. We further introduce an appearance-adaptable temporal history that stores appearance-agnostic features, enabling accumulated Gaussians to be re-rendered under arbitrary target appearances. Experiments demonstrate that SpectralSplat preserves the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.
[201] Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
Haocheng Tang, Xingyu Dang, Junmei Wang
Main category: cs.CV
TL;DR: MolSeek-OCR adapts DeepSeek-OCR-2 for Optical Chemical Structure Recognition by treating it as image-conditioned SMILES generation, using progressive fine-tuning and large-scale training data, achieving competitive but not SOTA results.
Details
Motivation: Optical Chemical Structure Recognition (OCSR) is crucial for converting 2D molecular diagrams into machine-readable formats, but existing Vision-Language Models struggle with direct application to this task, and full-parameter fine-tuning often fails.Method: Two-stage progressive supervised fine-tuning: start with parameter-efficient LoRA, then transition to selective full-parameter fine-tuning with split learning rates. Train on large-scale corpus combining synthetic PubChem renderings and realistic USPTO-MOL patent images.
Result: MolSeek-OCR achieves exact matching accuracies comparable to best-performing image-to-sequence models, but remains inferior to state-of-the-art image-to-graph models. Reinforcement-style post-training and data-curation refinement failed to improve strict sequence-level fidelity.
Conclusion: The approach demonstrates competitive OCSR capabilities but highlights limitations in achieving SOTA performance, suggesting image-to-sequence approaches may have inherent limitations compared to graph-based methods for exact chemical structure recognition.
Abstract: Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.
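The split-learning-rate stage of the recipe above can be illustrated with PyTorch optimizer parameter groups: a smaller rate for the pretrained vision tower, a larger one for the decoder. The module names and learning rates are stand-ins, not DeepSeek-OCR-2's actual layout or hyperparameters.

```python
import torch
import torch.nn as nn

# Toy stand-in for a VLM with a vision encoder and a language decoder
# (names and sizes are illustrative only).
model = nn.ModuleDict({
    "vision": nn.Linear(16, 16),
    "decoder": nn.Linear(16, 16),
})

# Stage 2 of the described strategy: full-parameter fine-tuning with
# split learning rates across the two sub-networks.
optimizer = torch.optim.AdamW([
    {"params": model["vision"].parameters(), "lr": 1e-5},
    {"params": model["decoder"].parameters(), "lr": 5e-5},
])
```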
[202] Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies
In Seon Kim, Ali Moghimi
Main category: cs.CV
TL;DR: Multimodal framework combining satellite imagery and Google Street View for scalable urban tree detection with limited annotations, using domain adaptation and hybrid learning strategies.
Details
Motivation: Urban trees are crucial for environmental sustainability and disaster mitigation, but current mapping methods are labor-intensive and don't scale well due to high annotation costs and poor generalization across diverse urban environments.Method: Multimodal framework that first uses satellite imagery to localize tree candidates, then retrieves targeted ground-level Google Street View images for detailed detection. Uses domain adaptation to transfer knowledge from existing datasets, and evaluates three learning strategies: semi-supervised learning, active learning, and a hybrid approach with transformer-based detection model.
Result: Hybrid strategy achieved best performance with F1-score of 0.90 (12% improvement over baseline). Semi-supervised learning showed progressive degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved through targeted human intervention. Error analysis showed active and hybrid strategies reduced both false positives and false negatives.
Conclusion: Multimodal approach with guided annotation enables scalable, annotation-efficient urban tree mapping for sustainable city planning, highlighting the importance of combining different data sources and learning strategies.
Abstract: Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and strengthening policy. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions. Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.
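One round of the hybrid strategy above can be sketched as follows: confident predictions on the unlabeled pool become pseudo-labels (the semi-supervised part), while the least confident samples are routed to a human annotator (the active part). The threshold and budget are illustrative, not the paper's settings.

```python
import numpy as np

def hybrid_round(probs, conf_thresh=0.9, query_budget=2):
    """One hybrid SSL + active-learning round (a schematic sketch).
    probs: (n_samples, n_classes) model probabilities on the unlabeled pool.
    Returns (pseudo-labeled indices, human-query indices)."""
    probs = np.asarray(probs)
    conf = probs.max(axis=1)
    pseudo = np.where(conf >= conf_thresh)[0]      # auto-labeled (pseudo-labels)
    pseudo_set = set(pseudo.tolist())
    uncertain_order = np.argsort(conf)             # least confident first
    query = [i for i in uncertain_order if i not in pseudo_set][:query_budget]
    return pseudo.tolist(), query
```

The paper's finding that pure pseudo-labeling degrades over rounds (confirmation bias) is the motivation for keeping the human-query branch in the loop.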
[203] Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual Stimuli
Chenglizhao Chen, Shujian Zhang, Luming Li, Wenfeng Song, Shuai Li
Main category: cs.CV
TL;DR: Proposes UserSOD - a new task for detecting salient objects based on users’ proactive needs rather than just visual stimuli, addressing limitations of current SOD methods.
Details
Motivation: Current salient object detection methods are passive and based only on visual stimuli, ignoring users' proactive needs. This fails to satisfy users and limits downstream tasks like salient object ranking.Method: Advocates a new User Salient Object Detection (UserSOD) task that detects salient objects aligned with users’ proactive needs. Main challenge is lack of datasets for training/testing.
Result: Proposes a new research direction but doesn’t present experimental results since it’s introducing the task concept rather than a complete solution.
Conclusion: UserSOD is essential for better satisfying user needs and advancing downstream applications, but requires new datasets and methods to address the proactive need-based detection challenge.
Abstract: Existing salient object detection (SOD) methods adopt a passive, visual stimulus-based rationale: objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' proactive needs in segmenting salient objects: if a user has a need before seeing an image, the user's salient objects align with that need. For example, if a user's need is "white apple", then when this user sees an image, the user's primary focus is on the "white apple" or "the most white apple-like" objects in the image. Such an oversight not only fails to satisfy users, but also limits the development of downstream tasks. For instance, in salient object ranking tasks, focusing solely on visual stimulus-based salient objects is insufficient for analyzing the fine-grained relationships between users' viewing order (usually determined by the user's needs) and scenes, which may yield wrong ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a User Salient Object Detection (UserSOD) task, which focuses on detecting salient objects aligned with users' proactive needs when users have them. The main challenge for this new task is the lack of datasets for model training and testing.
[204] HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild
Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo
Main category: cs.CV
TL;DR: HEDGE is a heterogeneous ensemble method for AI-generated image detection that combines diverse training regimes, multi-scale features, and backbone architectures to improve robustness against varied real-world distortions.
Details
Motivation: Robust detection of AI-generated images is challenging due to rapid evolution of generative models and varied real-world distortions. Single training regimes, resolutions, or backbones are insufficient to handle all conditions, requiring structured heterogeneity across dimensions.Method: HEDGE introduces three complementary detection routes: Route A uses DINOv3-based detectors with staged data expansion and augmentation escalation; Route B incorporates a higher-resolution branch for fine-grained forensic cues; Route C adds a MetaCLIP2-based branch for backbone diversity. All outputs are fused via logit-space weighted averaging refined by a lightweight dual-gating mechanism.
Result: Achieved 4th place in NTIRE 2026 Robust AI-Generated Image Detection in the Wild Challenge and attained state-of-the-art performance with strong robustness on multiple AIGC image detection benchmarks.
Conclusion: Structured heterogeneity across training data, resolution, and backbone architectures is essential for robust AI-generated image detection, and HEDGE demonstrates this through its ensemble approach with complementary detection routes.
Abstract: Robust detection of AI-generated images in the wild remains challenging due to the rapid evolution of generative models and varied real-world distortions. We argue that relying on a single training regime, resolution, or backbone is insufficient to handle all conditions, and that structured heterogeneity across these dimensions is essential for robust detection. To this end, we propose HEDGE, a Heterogeneous Ensemble for Detection of AI-GEnerated images, that introduces complementary detection routes along three axes: diverse training data with strong augmentation, multi-scale feature extraction, and backbone heterogeneity. Specifically, Route A progressively constructs DINOv3-based detectors through staged data expansion and augmentation escalation, Route B incorporates a higher-resolution branch for fine-grained forensic cues, and Route C adds a MetaCLIP2-based branch for backbone diversity. All outputs are fused via logit-space weighted averaging, refined by a lightweight dual-gating mechanism that handles branch-level outliers and majority-dominated fusion errors. HEDGE achieves 4th place in the NTIRE 2026 Robust AI-Generated Image Detection in the Wild Challenge and attains state-of-the-art performance with strong robustness on multiple AIGC image detection benchmarks.
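The logit-space fusion step can be sketched as a weighted average with a simple outlier gate that zeroes a branch's weight wherever it deviates wildly from the ensemble consensus. This gating rule is a hypothetical stand-in for HEDGE's dual-gating mechanism, which is not specified in the abstract.

```python
import numpy as np

def fuse_logits(branch_logits, weights, outlier_z=2.0):
    """Logit-space weighted averaging with a toy outlier gate (schematic).
    branch_logits: (n_branches, n_samples); weights: (n_branches,)."""
    L = np.asarray(branch_logits, dtype=float)
    w = np.asarray(weights, dtype=float)
    med = np.median(L, axis=0)                       # per-sample consensus
    spread = np.median(np.abs(L - med), axis=0) + 1e-6
    # Gate out a branch wherever its logit is an outlier vs. the consensus.
    gate = (np.abs(L - med) / spread) <= outlier_z
    w_eff = w[:, None] * gate
    return (w_eff * L).sum(axis=0) / (w_eff.sum(axis=0) + 1e-12)
```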
[205] Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
Sohyeon Kim, Sang Yeon Yoon, Kyeongbo Kong
Main category: cs.CV
TL;DR: A lightweight inference-time intervention method that reduces object hallucinations in Large Vision-Language Models by analyzing attention dynamics and selectively suppressing low-attention tokens during the focus phase using Determinantal Point Process.
Details
Motivation: Large Vision-Language Models suffer from object hallucinations (describing objects not in images), and existing mitigation approaches often require iterative optimization causing high inference latency. The authors aim to develop a training-free, low-latency solution by understanding the internal attention dynamics of vision encoders.Method: Analyzes attention dynamics in vision encoders, identifying three-phase structure (diffusion, focus, rediffusion). Shows hallucinations correlate with low-attention tokens during focus phase. Proposes inference-time intervention using statistics from single forward pass and Determinantal Point Process to selectively suppress problematic tokens while preserving diverse visual cues.
Result: Method consistently reduces hallucination metrics across multiple LVLM backbones and decoding strategies while maintaining competitive caption quality. Achieves comparable hallucination mitigation to adversarial uncertainty estimation methods with negligible additional inference latency.
Conclusion: The proposed lightweight intervention effectively mitigates object hallucinations in LVLMs by leveraging insights into attention dynamics, offering a practical solution with minimal computational overhead compared to existing approaches.
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
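The DPP-based selection of diverse, non-redundant visual tokens can be illustrated with naive greedy MAP inference on a quality-times-similarity kernel: near-duplicate tokens shrink the determinant, so only one of each redundant group survives. The kernel construction and greedy rule below are a generic DPP sketch, not the paper's exact procedure.

```python
import numpy as np

def greedy_dpp(kernel, k):
    """Naive greedy MAP inference for a DPP: repeatedly add the item that
    most increases log-det of the selected kernel submatrix (O(k n) slogdets)."""
    n = kernel.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:   # sign <= 0: degenerate, skip
                best, best_gain = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Kernel: per-token quality q_i scales pairwise similarity S.
q = np.array([1.0, 1.0, 0.9])
S = np.array([[1.0, 0.99, 0.1],   # tokens 0 and 1 are near-duplicates
              [0.99, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
K = np.outer(q, q) * S
```

With `k=2`, the greedy pass keeps token 0 but skips its near-duplicate token 1 in favor of the dissimilar token 2, which is the "preserve diverse cues, filter redundant tokens" behavior described above.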
[206] LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild
Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo
Main category: cs.CV
TL;DR: LOGER is a deepfake detection framework using local-global ensemble with multiple backbones and patch-level modeling with MIL aggregation for robust detection across diverse manipulations and real-world conditions.
Details
Motivation: Deepfake detection in the wild is challenging due to diverse manipulation techniques and uncontrolled real-world degradations. Forensic cues exist at both global (semantic/statistical anomalies) and local (forgery traces in manipulated regions) levels, but no single backbone or input scale can effectively cover both.Method: Proposes LOGER: 1) Global branch uses heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies; 2) Local branch performs patch-level modeling with Multiple Instance Learning top-k aggregation to selectively pool only the most suspicious regions, preventing evidence dilution; 3) Dual-level supervision at both image and patch levels; 4) Logit-space fusion exploits decorrelated errors between branches.
Result: Achieved 2nd place in NTIRE 2026 Robust Deepfake Detection Challenge. Further evaluation on multiple public benchmarks confirms strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
Conclusion: The local-global ensemble framework effectively addresses complementary forensic cues at different levels, with decorrelated errors between branches enabling robust fusion. The approach shows strong performance in challenging real-world deepfake detection scenarios.
Abstract: Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal–Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-k aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
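The top-k MIL aggregation is simple to make concrete: the image-level forgery logit is the mean of only the k most suspicious patch logits, so a small manipulated region is not averaged away by the many normal patches. This is a minimal sketch of the pooling rule, not the authors' full training setup.

```python
import torch

def mil_topk_logit(patch_logits, k=4):
    """Top-k MIL pooling over per-patch forgery logits (1D tensor).
    Returns the image-level logit as the mean of the k largest patch logits."""
    topk = torch.topk(patch_logits, k=min(k, patch_logits.numel())).values
    return topk.mean()
```

With mean pooling instead, a single strongly forged patch among dozens of clean ones would barely move the image-level score; top-k pooling is what keeps that localized evidence discriminative.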
[207] Physics-Informed Untrained Learning for RGB-Guided Superresolution Single-Pixel Hyperspectral Imaging
Hao Zhang, Bilige Xu, Lichen Wei, Xu Ma, Wenyi Ren
Main category: cs.CV
TL;DR: Physics-informed untrained neural network framework for hyperspectral reconstruction and super-resolution from single-pixel imaging using RGB guidance, achieving high-fidelity results at extremely low sampling rates without external training data.
Details
Motivation: Single-pixel imaging offers cost-effective hyperspectral acquisition but struggles with high-fidelity reconstruction at low sampling rates. Existing deep learning methods require large training datasets that are impractical for hyperspectral imaging, creating a need for data-efficient solutions.Method: Three-stage physics-informed framework: (1) Regularized Least-Squares with RGB-derived Grayscale Priors for initialization, (2) Untrained Hyperspectral Recovery Network with measurement consistency and hybrid regularization, (3) Transformer-based Untrained Super-Resolution Network using cross-modal attention to transfer high-frequency details from RGB guide.
Result: Significantly surpasses state-of-the-art algorithms in reconstruction accuracy and spectral fidelity. Successfully reconstructs 144-band hyperspectral data cube at 6.25% sampling rate in physical single-pixel imaging system validation.
Conclusion: Provides robust, data-efficient solution for computational hyperspectral imaging by combining physics-informed modeling with untrained neural networks and cross-modal guidance, enabling high-quality reconstruction without external training data.
Abstract: Single-pixel imaging (SPI) offers a cost-effective route to hyperspectral acquisition but struggles to recover high-fidelity spatial and spectral details under extremely low sampling rates, a severely ill-posed inverse problem. While deep learning has shown potential, existing data-driven methods demand large-scale pretraining datasets that are often impractical in hyperspectral imaging. To overcome this limitation, we propose an end-to-end physics-informed framework that leverages untrained neural networks and RGB guidance for joint hyperspectral reconstruction and super-resolution without any external training data. The framework comprises three physically grounded stages: (1) a Regularized Least-Squares method with RGB-derived Grayscale Priors (LS-RGP) that initializes the solution by exploiting cross-modal structural correlations; (2) an Untrained Hyperspectral Recovery Network (UHRNet) that refines the reconstruction through measurement consistency and hybrid regularization; and (3) a Transformer-based Untrained Super-Resolution Network (USRNet) that upsamples the spatial resolution via cross-modal attention, transferring high-frequency details from the RGB guide. Extensive experiments on benchmark datasets demonstrate that our approach significantly surpasses state-of-the-art algorithms in both reconstruction accuracy and spectral fidelity. Moreover, a proof-of-concept experiment using a physical single-pixel imaging system validates the framework’s practical applicability, successfully reconstructing a 144-band hyperspectral data cube at a mere 6.25% sampling rate. The proposed method thus provides a robust, data-efficient solution for computational hyperspectral imaging.
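The first stage, regularized least squares with a grayscale prior, has a standard closed form: minimizing ||Ax - y||^2 + lam*||x - p||^2 gives the normal equations (A^T A + lam*I) x = A^T y + lam*p. The sketch below shows that generic formulation; how the paper actually builds the prior p from the RGB guide is not specified in the abstract.

```python
import numpy as np

def ls_with_prior(A, y, prior, lam=0.1):
    """Regularized least-squares initialization (schematic stage 1):
    argmin_x ||A x - y||^2 + lam * ||x - prior||^2, solved via its
    normal equations. `prior` stands in for the RGB-derived grayscale prior."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y + lam * prior)
```

As `lam` grows, the solution is pulled toward the prior; as it shrinks, toward the pure measurement fit, which is the usual trade-off at very low sampling rates.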
[208] SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition
Zhuoxuan Peng, Yiyi Ding, Yang Lin, S. -H. Gary Chan
Main category: cs.CV
TL;DR: Proposes Scale-Body-Flow (SBF) representation to augment 2D skeletons for human action recognition, addressing limitations of skeleton-only approaches by capturing depth, body contour, and human-object interactions.
Details
Motivation: 2D skeleton-based human action recognition approaches struggle in many scenes because skeletons don't capture critical action-related information like joint depth, body contour, and human-object interactions.Method: Introduces Scale-Body-Flow (SBF) representation with three components: scale map (depth information), body map (human outline), and flow map (human-object interaction via optical flow). Presents SFSNet segmentation network to predict SBF, supervised by skeleton and optical flow without extra annotation.
Result: Extensive experiments show the pipeline achieves significantly higher HAR accuracy with similar compactness and efficiency compared to state-of-the-art skeleton-only approaches across different datasets.
Conclusion: SBF effectively augments skeleton representation for human action recognition by capturing missing critical information, improving accuracy while maintaining efficiency.
Abstract: Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.
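The three SBF components can be pictured as extra channels stacked alongside the skeleton heatmaps before the HAR backbone. The channel layout below (per-joint scale maps, one silhouette channel, two optical-flow channels) is an illustrative guess at the representation, not the paper's exact tensor format.

```python
import torch

def build_sbf_input(skeleton_heatmap, scale_map, body_map, flow_map):
    """Stack skeleton with the three SBF components into one input volume.
    skeleton_heatmap: (J, H, W) joint heatmaps; scale_map: (J, H, W);
    body_map: (1, H, W) human silhouette; flow_map: (2, H, W) optical flow."""
    return torch.cat([skeleton_heatmap, scale_map, body_map, flow_map], dim=0)
```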
[209] PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation
Yuyang Sha, Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li, Ting Liu, Luoqi Liu
Main category: cs.CV
TL;DR: PortraitCraft introduces a unified benchmark for portrait composition understanding and generation with structured multi-level supervision on 50K curated portraits.
Details
Motivation: Existing datasets focus on coarse aesthetic scoring or generic image aesthetics, limiting systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements.Method: Built on 50K curated real portrait images with structured multi-level supervision including global composition scores, 13 composition attributes, attribute-level explanations, VQA pairs, and composition-oriented textual descriptions. Establishes two benchmark tasks: composition understanding (score prediction, attribute reasoning, VQA) and composition-aware generation under explicit constraints.
Result: Provides a comprehensive benchmark with standardized evaluation protocols and reference baseline results using representative multimodal models for future research on fine-grained portrait understanding and controllable generation.
Conclusion: PortraitCraft enables systematic research on structured portrait composition analysis and controllable portrait generation, addressing limitations of existing datasets and benchmarks.
Abstract: Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
[210] Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
Peter Yongho Kim, Juhyeon Park, Jungwoo Park, Jubin Choi, Jungwoo Seo, Jiook Cha, Taesup Moon
Main category: cs.CV
TL;DR: TABLeT uses 2D natural image autoencoder to tokenize fMRI volumes into compact continuous tokens, enabling efficient long-sequence modeling with Transformers for brain activity analysis.
Details
Motivation: Modeling long-range spatiotemporal dynamics in fMRI is challenging due to high-dimensional 4D signals. Existing voxel-based models have prohibitive memory demands and limited temporal window capture.Method: Tokenizes fMRI volumes using pre-trained 2D natural image autoencoder to compress 3D volumes into compact continuous tokens, then uses Transformer encoder for long-sequence modeling. Also develops self-supervised masked token modeling for pre-training.
Result: Outperforms existing models on UK-Biobank, HCP, and ADHD-200 datasets across multiple tasks. Shows substantial computational and memory efficiency gains over state-of-the-art voxel-based methods.
Conclusion: Presents a scalable and interpretable approach for spatiotemporal modeling of brain activity using tokenization and Transformers, with promising results for fMRI analysis.
Abstract: Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder under a limited VRAM budget. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model’s performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity. Our code is available at https://github.com/beotborry/TABLeT.
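The tokenization idea can be sketched by treating each 2D slice of a 3D volume as a grayscale image and passing it through a 2D encoder whose latents become the tokens. The encoder below is a random stand-in; a real system would load a frozen pretrained natural-image autoencoder, and how TABLeT actually slices and groups the volume is an assumption here.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen 2D natural-image autoencoder's encoder (schematic).
encoder_2d = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=8, stride=8),  # 64x64 slice -> 4x8x8 latent
    nn.Flatten(start_dim=1),                   # -> 256-d token per slice
)

def tokenize_volume(vol):
    """vol: (D, H, W) fMRI volume -> (D, 256) continuous tokens,
    one token per axial slice, encoded as a grayscale 'image'."""
    slices = vol.unsqueeze(1)                  # (D, 1, H, W)
    with torch.no_grad():
        return encoder_2d(slices)
```

The resulting (time x D) token sequences are short enough that a plain Transformer encoder can attend over long temporal windows within a fixed VRAM budget, which is the point of the compression.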
[211] A Generative Foundation Model for Multimodal Histopathology
Jinxi Xiang, Mingjie Li, Siyu Hou, Yijiang Chen, Xiangde Luo, Yuanfeng Ji, Xiang Zhou, Ehsan Adeli, Akshay Chaudhari, Curtis P. Langlotz, Kilian M. Pohl, Ruijiang Li
Main category: cs.CV
TL;DR: MuPD is a multimodal generative foundation model for pathology that integrates histology images, RNA profiles, and clinical text using diffusion transformers with cross-modal attention, enabling diverse cross-modal synthesis tasks with minimal fine-tuning.
Details
Motivation: Current approaches for integrating multimodal pathology data (histology, molecular, clinical) are limited by task-specific models trained on narrow source-target pairs, which restricts generalizability when modalities are incomplete due to tissue scarcity, cost, or workflow constraints.Method: MuPD uses a diffusion transformer with decoupled cross-modal attention to embed H&E histology images, RNA profiles, and clinical text into a shared latent space. It was pretrained on 100M histology patches, 1.6M text-histology pairs, and 10.8M RNA-histology pairs across 34 human organs.
Result: The model achieves: 50% FID reduction for text/image generation, 47% few-shot classification improvement via synthetic data augmentation, 23% FID reduction for RNA-conditioned histology generation, and 37% improvement in marker correlation for virtual staining tasks compared to existing methods.
Conclusion: A single unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology with applications in diagnosis and treatment.
Abstract: Accurate diagnosis and treatment of complex diseases require integrating histological, molecular, and clinical data, yet in practice these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing computational approaches attempt to impute missing modalities from available data but rely on task-specific models trained on narrow, single source-target pairs, limiting their generalizability. Here we introduce MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds hematoxylin and eosin (H&E)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, MuPD supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning. For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) scores by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, MuPD reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, MuPD translates H&E images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches. These results demonstrate that a single, unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology.
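One plausible reading of "decoupled cross-modal attention" is a separate cross-attention block per conditioning modality, with the results added back into the image stream. The module below is a speculative sketch under that assumption, not MuPD's actual architecture.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Schematic: one cross-attention per conditioning modality (text, RNA),
    each attended against the image tokens and summed into the image stream."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rna_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, text_tokens, rna_tokens):
        t, _ = self.text_attn(img_tokens, text_tokens, text_tokens)
        r, _ = self.rna_attn(img_tokens, rna_tokens, rna_tokens)
        return img_tokens + t + r
```

Keeping the per-modality pathways separate is what would let a missing modality (e.g. no RNA profile) simply drop its branch at inference time.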
[212] SAGE-GAN: Towards Realistic and Robust Segmentation of Spatially Ordered Nanoparticles via Attention-Guided GANs
Anindya Pal, Varun Ajith, Saumik Bhattacharya, Sayantari Ghosh
Main category: cs.CV
TL;DR: A two-step approach combining Attention U-Net with CycleGAN for nanoparticle segmentation in electron microscopy images, enabling synthetic data generation and accurate feature detection without extensive manual labeling.
Details
Motivation: Manual nanoparticle analysis in electron microscopy is time-consuming, and traditional automated segmentation methods struggle with complex shapes and artifacts while requiring large labeled datasets that are difficult to acquire.
Method: Two-step solution: 1) Self-attention driven U-Net learns to segment nanoparticle features from real images, focusing on physical/morphological details while ignoring noise. 2) This Attention U-Net is embedded in a CycleGAN framework to generate realistic synthetic image-mask pairs that reflect learned structural patterns.
Result: The model can accurately detect features in diverse real-world nanoparticle images and autonomously augment training datasets without human input. Cycle consistency ensures direct correspondence between synthetic images and ground-truth masks for realistic feature generation.
Conclusion: The integrated Attention U-Net + CycleGAN approach overcomes limitations of manual methods and traditional segmentation techniques by enabling synthetic data generation and accurate nanoparticle analysis without extensive labeled data requirements.
Abstract: Precise analysis of nanoparticles for characterization in electron microscopy images is essential for advancing nanomaterial development. Yet it remains challenging due to the time-consuming nature of manual methods and the shortcomings of traditional automated segmentation techniques, especially when dealing with complex shapes and imaging artifacts. While conventional methods yield promising results, they depend on a large volume of labeled training data, which is both difficult to acquire and highly time-consuming to generate. In order to overcome these challenges, we have developed a two-step solution: Firstly, our system learns to segment the key features of nanoparticles from a dataset of real images using a self-attention driven U-Net architecture that focuses on important physical and morphological details while ignoring background features and noise. Secondly, this trained Attention U-Net is embedded in a cycle-consistent generative adversarial network (CycleGAN) framework, inspired by the cGAN-Seg model introduced by Abzargar et al. This integration allows for the creation of highly realistic synthetic electron microscopy image-mask pairs that naturally reflect the structural patterns learned by the Attention U-Net. Consequently, the model can accurately detect features in a diverse array of real-world nanoparticle images and autonomously augment the training dataset without requiring human input. Cycle consistency enforces a direct correspondence between synthetic images and ground-truth masks, ensuring realistic features, which is crucial for accurate segmentation training.
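The cycle consistency the abstract leans on reduces to a reconstruction penalty tying a generated image back to its source. A minimal sketch, with toy stand-ins for the segmenter and renderer (both hypothetical, not the paper's trained networks):

```python
import numpy as np

def cycle_consistency_loss(x, g, f):
    """L1 penalty ||F(G(x)) - x||_1 that ties a synthetic output back to
    its source image, enforcing image <-> mask correspondence."""
    return np.abs(f(g(x)) - x).mean()

# Toy stand-ins for the two mappings (illustrative only):
# g: image -> mask, f: mask -> image.
g = lambda img: (img > 0.5).astype(float)   # crude "segmenter"
f = lambda msk: msk                          # crude "renderer"

rng = np.random.default_rng(0)
img = (rng.random((8, 8)) > 0.5).astype(float)  # already-binary image
print(cycle_consistency_loss(img, g, f))  # 0.0: the cycle is perfect here
```

In the real CycleGAN setting this loss is added to the adversarial objectives so that each synthetic electron-microscopy image stays anchored to its ground-truth mask.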
[213] ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse
Yunhao Yao, Zhiqiang Wang, Ruiqi Li, Haoran Cheng, Puhan Luo, Xiangyang Li
Main category: cs.CV
TL;DR: ComPrivDet: Efficient privacy object detection in compressed videos by reusing I-frame inference results and selectively processing P/B-frames using compressed-domain cues.
Details
Motivation: IoT video analytics raises privacy concerns, but frame-by-frame protection causes latency. Existing methods require full decoding or per-frame processing, creating overhead. Need efficient compressed-domain detection of privacy objects like faces and license plates.
Method: Reuses I-frame inference results, identifies new objects via compressed-domain cues, then either skips P/B-frame detections or refines them with a lightweight detector. Avoids full decoding while maintaining accuracy.
Result: Achieves 99.75% accuracy for face detection, 96.83% for license plate detection, skips over 80% of inferences. Outperforms existing compressed-domain methods by 9.84% accuracy with 75.95% lower latency.
Conclusion: ComPrivDet provides efficient privacy protection for IoT video analytics by leveraging compressed-domain information to reduce processing overhead while maintaining high detection accuracy.
Abstract: As the Internet of Things (IoT) becomes deeply embedded in daily life, users are increasingly concerned about privacy leakage, especially from video data. Since frame-by-frame protection in large-scale video analytics (e.g., smart communities) introduces significant latency, a more efficient solution is to selectively protect frames containing privacy objects (e.g., faces). Existing object detectors require fully decoded videos or per-frame processing in compressed videos, leading to decoding overhead or reduced accuracy. Therefore, we propose ComPrivDet, an efficient method for detecting privacy objects in compressed video by reusing I-frame inference results. By identifying the presence of new objects through compressed-domain cues, ComPrivDet either skips P- and B-frame detections or efficiently refines them with a lightweight detector. ComPrivDet maintains 99.75% accuracy in private face detection and 96.83% in private license plate detection while skipping over 80% of inferences. It averages 9.84% higher accuracy with 75.95% lower latency than existing compressed-domain detection methods.
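The reuse-or-refine control flow behind ComPrivDet can be sketched as a small loop over a group of pictures. The callables and frame representation below are hypothetical placeholders, not the paper's interfaces:

```python
def process_gop(frames, detect_full, has_new_object):
    """Run the full detector only on I-frames; for P/B-frames, reuse the
    most recent detections unless compressed-domain cues (e.g. motion
    vectors) suggest a new object entered the scene.
    `detect_full` and `has_new_object` are hypothetical callables."""
    results, skipped, last = [], 0, None
    for frame in frames:
        if frame["type"] == "I" or last is None:
            last = detect_full(frame)      # full inference on I-frames
        elif has_new_object(frame):
            last = detect_full(frame)      # refine when cues fire
        else:
            skipped += 1                   # reuse previous detections
        results.append(last)
    return results, skipped

frames = [{"type": t} for t in "IPPBIP"]
n_calls = 0
def detect(frame):
    global n_calls
    n_calls += 1
    return ["face_box"]

dets, skipped = process_gop(frames, detect, lambda f: False)
print(n_calls, skipped)  # 2 full inferences, 4 reused
```

In the real system the reused boxes would additionally be motion-compensated; this sketch only shows how the >80% inference-skip rate arises.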
[214] Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling
Yunyao Yu, Zhengxian Wu, Zhuohong Chen, Hangrui Xu, Zirui Liao, Xiangwen Deng, Zhifang Liu, Senyuan Shi, Haoqian Wang
Main category: cs.CV
TL;DR: CSRS improves MLLM self-evolution by using retracing re-inference, softened frequency rewards, and visual semantic perturbation to enhance reasoning quality beyond majority voting biases.
Details
Motivation: Existing self-evolution methods for MLLMs rely on majority voting for pseudo-golden answers, which may reflect model biases rather than objective correctness, potentially degrading reasoning quality.
Method: Proposes CSRS with three components: 1) Retracing Re-inference Mechanism (RRM) to explore long-tail reasoning paths, 2) Softened Frequency Reward (SFR) using continuous signals instead of binary rewards, and 3) Visual Semantic Perturbation (VSP) to prioritize mathematical logic over visual superficiality.
Result: CSRS significantly enhances reasoning performance of Qwen2.5-VL-7B on benchmarks like MathVision, achieving state-of-the-art results in unsupervised self-evolution on geometric tasks.
Conclusion: CSRS effectively addresses limitations of majority voting in MLLM self-evolution, improving reasoning quality through better exploration of reasoning paths and calibrated reward signals.
Abstract: In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may stem from the model’s intrinsic biases rather than guaranteeing the objective correctness of the reasoning paths. To counteract this degradation, we propose \textbf{C}ontinuous \textbf{S}oftened \textbf{R}etracing re\textbf{S}ampling (\textbf{CSRS}) in MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbf{RRM}), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (\textbf{SFR}), which replaces binary rewards with continuous signals, calibrating rewards based on the answers’ frequency across sampled reasoning sets. Furthermore, incorporated with Visual Semantic Perturbation (\textbf{VSP}), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
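The SFR idea of replacing the 0/1 majority-vote signal with a frequency-calibrated continuous reward admits a very simple reading; the exact calibration CSRS uses may differ from this frequency-proportional sketch:

```python
from collections import Counter

def softened_frequency_reward(sampled_answers):
    """Map each distinct answer to a continuous reward equal to its
    relative frequency in the sampled reasoning set, instead of giving
    1 to the majority answer and 0 to everything else. The precise
    calibration in the paper may differ; this is the simplest variant."""
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return {ans: c / total for ans, c in counts.items()}

rewards = softened_frequency_reward(["42", "42", "42", "17", "9"])
print(rewards)  # {'42': 0.6, '17': 0.2, '9': 0.2}
```

Under hard majority voting, "17" and "9" would receive zero reward even if one of them were correct; the softened signal keeps gradient flowing to long-tail reasoning paths.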
[215] ART: Adaptive Relational Transformer for Pedestrian Trajectory Prediction with Temporal-Aware Relations
Ruochen Li, Ziyi Chang, Junyan Hu, Jiannan Li, Amir Atapour-Abarghouei, Hubert P. H. Shum
Main category: cs.CV
TL;DR: ART introduces an adaptive relational transformer with temporal-aware relation graphs and adaptive interaction pruning for efficient pedestrian trajectory prediction
Details
Motivation: Existing pedestrian trajectory prediction methods either introduce unnecessary computational overhead or struggle to represent diverse and time-varying human interactions, limiting their effectiveness for robot-related applications.
Method: Adaptive Relational Transformer (ART) with Temporal-Aware Relation Graph (TARG) to capture evolution of pairwise interactions and Adaptive Interaction Pruning (AIP) mechanism to reduce redundant computations.
Result: State-of-the-art accuracy on ETH/UCY and NBA benchmarks with high computational efficiency
Conclusion: ART effectively balances accuracy and efficiency for pedestrian trajectory prediction by explicitly modeling temporal interaction evolution while reducing computational overhead
Abstract: Accurate prediction of real-world pedestrian trajectories is crucial for a wide range of robot-related applications. Recent approaches typically adopt graph-based or transformer-based frameworks to model interactions. Despite their effectiveness, these methods either introduce unnecessary computational overhead or struggle to represent the diverse and time-varying characteristics of human interactions. In this work, we present an Adaptive Relational Transformer (ART), which introduces a Temporal-Aware Relation Graph (TARG) to explicitly capture the evolution of pairwise interactions and an Adaptive Interaction Pruning (AIP) mechanism to reduce redundant computations efficiently. Extensive evaluations on ETH/UCY and NBA benchmarks show that ART delivers state-of-the-art accuracy with high computational efficiency.
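One way to picture what interaction pruning buys: keep, for each pedestrian, only a few nearest neighbours before computing attention. The distance heuristic below is a guess standing in for ART's learned AIP criterion, not the paper's method:

```python
import numpy as np

def prune_interactions(positions, k=2):
    """Boolean mask keeping, for each agent, only its k nearest
    neighbours (self excluded); attention would then run over the
    surviving pairs only. A simple distance heuristic standing in
    for a learned pruning rule."""
    n = len(positions)
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # never "interact" with self
    keep = np.zeros((n, n), dtype=bool)
    for i in range(n):
        keep[i, np.argsort(d[i])[:k]] = True
    return keep

# Three nearby pedestrians and one far away
pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
mask = prune_interactions(pos, k=2)
print(mask.sum(axis=1))  # each row keeps exactly 2 neighbours
```

For agent 0, the distant agent 3 is pruned, so the quadratic pairwise cost drops to O(nk) surviving interactions.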
[216] Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation
Ruochen Li, Shuang Chen, Wenke E, Farshad Arvin, Amir Atapour-Abarghouei
Main category: cs.CV
TL;DR: MASC-Pose: A motion-adaptive multi-scale temporal modeling framework with skeleton-constrained spatial graphs for efficient 3D human pose estimation from monocular videos.
Details
Motivation: Existing 3D human pose estimation methods struggle with efficiency and adaptability when modeling complex spatial and temporal dependencies, particularly under dense attention or fixed modeling schemes.
Method: Proposes MASC-Pose with two key components: 1) Adaptive Multi-scale Temporal Modelling (AMTM) module to capture heterogeneous motion dynamics at different temporal scales, and 2) Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modeling.
Result: Achieves strong accuracy with high computational efficiency on Human3.6M and MPI-INF-3DHP datasets.
Conclusion: The proposed framework enables adaptive temporal reasoning and efficient spatial aggregation for effective 3D human pose estimation from monocular videos.
Abstract: Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often face challenges in efficiency and adaptability when modelling spatial and temporal dependencies, particularly under dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.
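The "multi-scale temporal" idea can be illustrated with fixed moving averages over a joint trajectory at several window sizes; AMTM itself is learned and adaptive, so this is only a crude analogue:

```python
import numpy as np

def multi_scale_motion(traj, windows=(2, 4, 8)):
    """Summarize a 1-D joint trajectory at several temporal scales via
    moving averages: short windows keep fast motion, long windows keep
    slow trends. A fixed-kernel analogue of adaptive multi-scale
    temporal modelling, for illustration only."""
    feats = []
    for w in windows:
        kernel = np.ones(w) / w
        feats.append(np.convolve(traj, kernel, mode="valid"))
    return feats

traj = np.arange(16, dtype=float)          # a linearly moving joint
feats = multi_scale_motion(traj)
print([f.shape[0] for f in feats])  # [15, 13, 9]
```

A learned variant would weight or select these scales per frame depending on how fast the motion actually is.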
[217] Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos
Daniele Materia, Francesco Ragusa, Giovanni Maria Farinella
Main category: cs.CV
TL;DR: A method for anticipating human-object interactions in egocentric vision using Vision LLMs with Set-of-Mark prompting, gaze trajectory analysis, and inverse exponential frame sampling.
Details
Motivation: To create intelligent assistive systems that can anticipate human-object interactions for guiding users in daily activities and understanding their goals, addressing limitations in existing approaches for egocentric vision.
Method: Uses Vision Large Language Models (VLLMs) with Set-of-Mark prompting for better visual grounding, analyzes user intent via gaze fixation trajectories, and introduces inverse exponential sampling for temporal dynamics capture in video frames.
Result: Outperforms state-of-the-art approaches on the egocentric dataset HD-EPIC, demonstrating model-agnostic effectiveness.
Conclusion: The proposed approach effectively addresses human-object interaction anticipation in egocentric vision by combining improved visual grounding, gaze-based intent understanding, and temporal dynamics modeling.
Abstract: The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user’s most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.
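The abstract does not give the exact inverse exponential schedule, but its intent is clear: sample frames more densely just before the interaction. One plausible reading, with the warp exponent `a` as a free parameter:

```python
import math

def inverse_exponential_sampling(num_frames, n_samples, a=3.0):
    """Pick n_samples frame indices from [0, num_frames - 1] so that
    spacing shrinks toward the most recent frame (the one closest to
    the anticipated interaction). The exact schedule in the paper is
    not specified in the abstract; this warp is one plausible reading."""
    T = num_frames - 1
    idx = []
    for i in range(n_samples):
        u = i / (n_samples - 1)
        # density grows toward u = 1 (frames just before the interaction)
        p = 1.0 - (math.exp(a * (1.0 - u)) - 1.0) / (math.exp(a) - 1.0)
        idx.append(round(T * p))
    return idx

frames = inverse_exponential_sampling(100, 6)
print(frames)  # [0, 47, 73, 87, 95, 99] — gaps shrink toward the end
```

Compared with uniform sampling, the budget of 6 frames spends 4 of them on the last third of the clip, where the pre-interaction dynamics live.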
[218] DiffSparse: Accelerating Diffusion Transformers with Learned Token Sparsity
Haowei Zhu, Ji Liu, Ziqiong Liu, Dong Li, Junhai Yong, Bin Wang, Emad Barsoum
Main category: cs.CV
TL;DR: A differentiable layer-wise sparsity optimization framework for diffusion transformer models that uses token caching to reduce computational costs while maintaining or improving generation quality.
Details
Motivation: Diffusion models have outstanding image generation performance but suffer from high computational costs due to multi-step inference. Existing acceleration methods using layer/token caching are inefficient for few-step diffusion transformers due to poor feature caching strategies, manual sparsity allocation, and unnecessary full forward computations.
Method: Proposes a differentiable layer-wise sparsity optimization framework using token caching. Uses a learnable network combined with a dynamic programming solver for end-to-end sparsity allocation optimization. Implements a two-stage training strategy that eliminates full-step processing requirements.
Result: Extensive experiments on DiT-XL/2, PixArt-α, FLUX, and Wan2.1 show consistent efficiency improvements without quality degradation. On PixArt-α with 20 steps, reduces computational cost by 54% while achieving better generation metrics than the original model, outperforming prior approaches.
Conclusion: The method delivers significant efficiency gains for diffusion transformer models while often improving generation quality, demonstrating effective acceleration through optimized token computation reduction.
Abstract: Diffusion models demonstrate outstanding performance in image generation, but their multi-step inference mechanism requires immense computational cost. Previous works accelerate inference by leveraging layer or token cache techniques to reduce computational cost. However, these methods fail to achieve superior acceleration performance in few-step diffusion transformer models due to inefficient feature caching strategies, manually designed sparsity allocation, and the retention of complete forward computations at several steps in these token cache methods. To tackle these challenges, we propose a differentiable layer-wise sparsity optimization framework for diffusion transformer models, leveraging token caching to reduce token computation costs and enhance acceleration. Our method optimizes layer-wise sparsity allocation in an end-to-end manner through a learnable network combined with a dynamic programming solver. Additionally, our proposed two-stage training strategy eliminates the need for full-step processing in existing methods, further improving efficiency. We conducted extensive experiments on a range of diffusion-transformer models, including DiT-XL/2, PixArt-α, FLUX, and Wan2.1. Across these architectures, our method consistently improves efficiency without degrading sample quality. For example, on PixArt-α with 20 sampling steps, we reduce computational cost by 54% while achieving generation metrics that surpass those of the original model, substantially outperforming prior approaches. These results demonstrate that our method delivers large efficiency gains while often improving generation quality.
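The dynamic-programming half of the allocation can be sketched as a knapsack over per-layer sparsity options. The (cost, predicted loss) tables below are hypothetical; in DiffSparse they come from the learnable network, not hand-written numbers:

```python
def allocate_sparsity(options, budget):
    """Choose one (cost, loss) option per layer, minimizing total
    predicted quality loss under a total compute budget, via dynamic
    programming over integer costs. The per-layer option tables are
    hypothetical stand-ins for learned predictions."""
    INF = float("inf")
    # best[b] = (total_loss, option indices) using exactly cost b
    best = [(0.0, [])] + [(INF, None)] * budget
    for layer_opts in options:
        nxt = [(INF, None)] * (budget + 1)
        for b in range(budget + 1):
            loss, picks = best[b]
            if loss == INF:
                continue
            for j, (cost, l) in enumerate(layer_opts):
                nb = b + cost
                if nb <= budget and loss + l < nxt[nb][0]:
                    nxt[nb] = (loss + l, picks + [j])
        best = nxt
    return min(best, key=lambda t: t[0])  # (loss, per-layer choices)

# Two layers; options are (compute cost, predicted loss), denser = costlier.
opts = [[(4, 0.0), (2, 0.3), (1, 0.8)],
        [(4, 0.0), (2, 0.1), (1, 0.5)]]
print(allocate_sparsity(opts, budget=6))  # (0.1, [0, 1])
```

With budget 6, the solver keeps layer 1 dense and sparsifies layer 2, where the predicted quality hit is smallest; raising the budget to 8 recovers the dense (zero-loss) configuration.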
[219] DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
Hoonhee Cho, Jae-Young Kang, Yuhwan Jeong, Yunseo Yang, Wonyoung Lee, Youngho Kim, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: DSERT-RoLL is a comprehensive driving dataset with stereo event, RGB, thermal cameras, 4D radar, and dual LiDAR collected across diverse conditions, providing benchmarks for multimodal sensor fusion in autonomous driving.
Details
Motivation: To address data scarcity for novel sensors like event cameras and 4D radar, and enable systematic studies of sensor behavior across diverse weather and illumination conditions for autonomous driving applications.
Method: Collection of multimodal sensor data (stereo event, RGB, thermal cameras, 4D radar, dual LiDAR) across varied conditions, with precise 2D/3D bounding boxes and ego vehicle odometry. Establishment of unified 3D/2D benchmarks and development of a fusion framework integrating sensor-specific cues into unified feature space.
Result: Created a comprehensive driving dataset enabling fair comparisons across sensor combinations, established benchmarks for single modality and multimodal methods, and proposed a fusion framework improving 3D detection robustness under varied conditions.
Conclusion: DSERT-RoLL provides a valuable resource for studying multimodal sensor fusion in autonomous driving, particularly for novel sensors like event cameras and 4D radar, with potential to advance robustness in challenging environmental conditions.
Abstract: In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single-modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor-specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.
[220] SciLT: Long-Tailed Classification in Scientific Image Domains
Jiahao Chen, Bing Su
Main category: cs.CV
TL;DR: SciLT framework for scientific long-tailed recognition using adaptive feature fusion and dual-supervision learning to leverage both penultimate- and final-layer features from foundation models.
Details
Motivation: Existing long-tailed recognition research focuses on natural images where pre-training and fine-tuning data share similar distributions, but scientific images have distinct visual characteristics and domain shifts, raising questions about foundation model effectiveness in such settings.
Method: Proposes SciLT framework with adaptive feature fusion and dual-supervision learning that jointly leverages penultimate-layer and final-layer features from foundation models for balanced performance across head and tail classes.
Result: SciLT consistently outperforms existing methods on three scientific benchmarks, establishing a strong baseline for scientific long-tailed recognition and showing that penultimate-layer features are particularly important for tail classes.
Conclusion: The work provides valuable guidance for adapting foundation models to scientific data with substantial domain shifts and demonstrates the importance of multi-level feature exploitation in scientific long-tailed recognition.
Abstract: Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and parameter-efficient fine-tuning (PEFT) paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.
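The core fusion idea, combining penultimate- and final-layer features adaptively, can be reduced to a learned convex gate. The scalar gate below is an illustrative simplification of SciLT's fusion module, not its actual architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_features(penultimate, final, gate_logit):
    """Convex combination of penultimate- and final-layer features,
    weighted by a learned gate in (0, 1). A scalar-gate simplification
    showing how tail classes can lean more on penultimate-layer
    features when the final layer overfits to head classes."""
    a = sigmoid(gate_logit)        # learned, e.g. per class or per sample
    return a * penultimate + (1.0 - a) * final

pen = np.array([1.0, 0.0])
fin = np.array([0.0, 1.0])
print(fuse_features(pen, fin, 0.0))  # gate = 0.5 -> [0.5, 0.5]
```

A large positive gate logit recovers the penultimate features almost exactly, which the paper's analysis suggests is the useful regime for tail classes.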
[221] ResGuard: Enhancing Robustness Against Known Original Attacks in Deep Watermarking
Hanyi Wang, Han Fang, Yupeng Qiu, Shilin Wang, Ee-Chien Chang
Main category: cs.CV
TL;DR: ResGuard enhances watermark robustness against Known Original Attacks by enforcing image-dependent embedding residuals and using adversarial training with residual-style perturbations.
Details
Motivation: Current deep learning watermarking uses END architecture but overlooks vulnerabilities to Known Original Attacks where adversaries have access to original-watermarked pairs, enabling effective watermark removal.
Method: Proposes ResGuard with residual specificity enhancement loss to enforce image-dependent embedding, and auxiliary KOA noise layer injecting residual-style perturbations during training to improve decoder reliability.
Result: ResGuard boosts KOA robustness significantly, improving average watermark extraction accuracy from 59.87% to 99.81% when integrated into existing frameworks.
Conclusion: The paper addresses critical vulnerability in watermarking systems and provides effective plug-and-play solution for enhancing robustness against known original attacks.
Abstract: Deep learning-based image watermarking commonly adopts an “Encoder-Noise Layer-Decoder” (END) architecture to improve robustness against random channel distortions, yet it often overlooks intentional manipulations introduced by adversaries with additional knowledge. In this paper, we revisit this paradigm and expose a critical yet underexplored vulnerability: the Known Original Attack (KOA), where an adversary has access to multiple original-watermarked image pairs, enabling various targeted suppression strategies. We show that even a simple residual-based removal approach, namely estimating an embedding residual from known pairs and subtracting it from unseen watermarked images, can almost completely remove the watermark while preserving visual quality. This vulnerability stems from the insufficient image dependency of residuals produced by END frameworks, which makes them transferable across images. To address this, we propose ResGuard, a plug-and-play module that enhances KOA robustness by enforcing image-dependent embedding. Its core lies in a residual specificity enhancement loss, which encourages residuals to be tightly coupled with their host images and thus improves image dependency. Furthermore, an auxiliary KOA noise layer injects residual-style perturbations during training, allowing the decoder to remain reliable under stronger embedding inconsistencies. Integrated into existing frameworks, ResGuard boosts KOA robustness, improving average watermark extraction accuracy from 59.87% to 99.81%.
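The residual-based removal attack the paper demonstrates is easy to reproduce once the embedding residual is image-independent; a toy version, with synthetic images standing in for real watermarked content:

```python
import numpy as np

rng = np.random.default_rng(0)
residual = rng.normal(0, 0.05, size=(16, 16))   # image-INDEPENDENT watermark

# Adversary's known original/watermarked pairs (the KOA setting)
originals = [rng.random((16, 16)) for _ in range(10)]
watermarked = [o + residual for o in originals]

# Estimate the residual from known pairs, subtract it from a new image
residual_hat = np.mean([w - o for w, o in zip(watermarked, originals)],
                       axis=0)
new_original = rng.random((16, 16))
attacked = (new_original + residual) - residual_hat
print(np.abs(attacked - new_original).max())  # ~0: watermark fully removed
```

Because the residual does not depend on the host image, ten pairs are enough to recover it essentially exactly; this is precisely the transferability ResGuard's residual specificity loss is designed to break.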
[222] FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
Zhengyu Fu, René Zurbrügg, Kaixian Qu, Marc Pollefeys, Marco Hutter, Hermann Blum, Zuria Bauer
Main category: cs.CV
TL;DR: FunFact is a framework for probabilistic open-vocabulary functional 3D scene graphs that uses foundation models and joint inference to capture scene-wide functional relationships beyond isolated object pairs.
Details
Motivation: Existing 3D scene understanding methods consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity in functional scene understanding.
Method: FunFact builds object- and part-centric 3D maps from RGB-D images, uses foundation models to propose functional relations, converts candidates into factor graph variables constrained by LLM-derived common-sense priors and geometric priors, and performs joint probabilistic inference over all functional edges.
Result: Experiments on SceneFun3D, FunGraph3D, and FunThor datasets show FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations compared to existing methods.
Conclusion: Holistic probabilistic modeling with joint inference over all functional relationships yields substantially better calibrated confidence scores and improves functional scene understanding by capturing scene-wide interdependence.
Abstract: Recent work in 3D scene understanding is moving beyond purely spatial analysis toward functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their marginals, yielding substantially better calibrated confidence scores. To benchmark this setting, we introduce FunThor, a synthetic dataset based on AI2-THOR with part-level geometry and rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. See our project page at https://funfact-scenegraph.github.io/
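Joint inference over functional edges can be illustrated at toy scale by brute-force marginalization over binary edge variables with unary (model confidence) and pairwise (prior) factors. The numbers are invented; FunFact's priors are LLM-derived and geometric, and its inference runs on a real factor graph rather than enumeration:

```python
from itertools import product
import math

def marginals(unary, pairwise):
    """Exact marginals over binary 'functional edge' variables by brute
    force: P(x) is proportional to exp(sum of unary and pairwise log
    scores). unary[i][v] scores variable i taking value v;
    pairwise[(i, j)][(vi, vj)] encodes a joint prior."""
    n = len(unary)
    scores = {}
    for x in product([0, 1], repeat=n):
        s = sum(unary[i][x[i]] for i in range(n))
        s += sum(f[(x[i], x[j])] for (i, j), f in pairwise.items())
        scores[x] = math.exp(s)
    Z = sum(scores.values())
    return [sum(p for x, p in scores.items() if x[i] == 1) / Z
            for i in range(n)]

# Two candidate relations; the prior says they rarely hold together.
unary = [{0: 0.0, 1: 1.0}, {0: 0.0, 1: 1.0}]
pairwise = {(0, 1): {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): -2.0}}
print([round(m, 3) for m in marginals(unary, pairwise)])  # [0.5, 0.5]
```

Independently, each edge would have confidence e/(1+e) ≈ 0.73; the mutual-exclusion prior pulls both marginals down to 0.5, which is the kind of calibration effect the paper reports for ambiguous relations.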
[223] SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding
Xingcheng Zhou, Mingyu Liu, Walter Zimmer, Jiajie Zhang, Alois Knoll
Main category: cs.CV
TL;DR: SGTA is a modular framework for traffic video understanding that combines scene graphs with multi-modal reasoning, using ReAct-based LLM reasoning over symbolic graph queries and visual inputs for interpretable video question answering.
Details
Motivation: The paper aims to address traffic video understanding by creating a more interpretable and structured approach that combines symbolic scene representations with multi-modal reasoning, moving beyond black-box models.
Method: SGTA constructs traffic scene graphs from roadside videos using detection, tracking, and lane extraction. It then uses ReAct-based large language models to process interleaved reasoning traces with tool invocations over both symbolic graph queries and visual inputs.
Result: Experiments on the TUMTraffic VideoQA dataset show SGTA achieves competitive accuracy across multiple question types while providing transparent reasoning steps, demonstrating the effectiveness of integrating structured scene representations with multi-modal agents.
Conclusion: The framework highlights the potential of combining structured scene graphs with multi-modal reasoning for interpretable traffic video understanding, offering both competitive performance and transparency in decision-making.
Abstract: We present Scene-Graph Based Multi-Modal Traffic Agent (SGTA), a modular framework for traffic video understanding that combines structured scene graphs with multi-modal reasoning. It constructs a traffic scene graph from roadside videos using detection, tracking, and lane extraction, followed by tool-based reasoning over both symbolic graph queries and visual inputs. SGTA adopts ReAct to process interleaved reasoning traces from large language models with tool invocations, enabling interpretable decision-making for complex video questions. Experiments on selected samples of the TUMTraffic VideoQA dataset demonstrate that SGTA achieves competitive accuracy across multiple question types while providing transparent reasoning steps. These results highlight the potential of integrating structured scene representations with multi-modal agents for traffic video understanding.
[224] VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
Shaoyang Cui, Lingbei Meng
Main category: cs.CV
TL;DR: VidNum-1.4K is a comprehensive video question-answering benchmark with 1,379 human-annotated video-question pairs designed to evaluate genuine numerical reasoning across diverse real-world environments, revealing significant gaps in current VLMs’ understanding of temporal dynamics and compositional logic.
Details
Motivation: Existing video reasoning benchmarks are limited to narrow domains or treat numerical reasoning as superficial counting tasks, failing to assess multi-step numerical logic in complex real-world multimedia content. The authors aim to create a benchmark that tests whether VLMs truly understand real-world dynamics through numerical deduction requiring temporal understanding, object permanence, and compositional logic.
Method: The authors introduce VidNum-1.4K, a benchmark with 1,379 human-annotated video-question pairs structured into a three-level hierarchy: from direct visual perception to video-based compositional numerical reasoning. The benchmark requires models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence across diverse environments including object, action, and event quantification.
Result: Evaluation of state-of-the-art VLMs reveals a significant reasoning gap: Gemini-3.1-pro barely reaches 60% accuracy, while open-source models perform even worse in the 25-45% range. This demonstrates that current VLMs lack a stable “internal world model” for genuine numerical reasoning in videos.
Conclusion: VidNum-1.4K serves as a demanding diagnostic testbed for the next generation of numerical video intelligence, highlighting that current VLMs still struggle with genuine numerical reasoning requiring temporal understanding and compositional logic beyond superficial pattern matching.
Abstract: Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly “understand” real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. The VidNum-1.4K is uniquely structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while the Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%–45% range. These findings demonstrate that current VLMs still lack a stable “internal world model”, positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
[225] XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening
Hongxia Gao, Litao Li, Yixin Chen, Jiali Wen, Kaijie Zhang, Qianyun Liu
Main category: cs.CV
TL;DR: XSeg: Largest X-ray contraband segmentation dataset with 98,644 images and 295,932 masks, plus APSAM annotation model for efficient labeling
Details
Motivation: Current X-ray contraband detection methods rely on bounding box annotations, limiting generalization and performance due to lack of pixel-level supervision and real-world data.
Method: 1) Created XSeg dataset with 98,644 images and 295,932 instance masks across 30 contraband categories; 2) Developed Adaptive Point SAM (APSAM) annotation model based on Segment Anything Model with Energy-Aware Encoder for better cross-domain generalization and stacked object detection, and Adaptive Point Generator for single-point precise labeling
Result: XSeg is the largest X-ray contraband segmentation dataset to date; APSAM demonstrates superior performance on XSeg dataset for efficient annotation
Conclusion: XSeg dataset and APSAM annotation model address limitations in X-ray contraband detection by providing comprehensive pixel-level supervision and efficient labeling tools
Abstract: X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and contains the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM’s poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.
[226] Learning Superpixel Ensemble and Hierarchy Graphs for Melanoma Detection
Asmaa M. Elwer, Muhammad A. Rushdi, Mahmoud H. Annaby
Main category: cs.CV
TL;DR: A graph learning approach for melanoma detection using superpixel-based graph representations with learned edge weights and texture features achieves high accuracy on dermoscopic images.
Details
Motivation: To improve melanoma detection in dermoscopic images by developing graph structure learning methods that offer more reliable and flexible data representations compared to traditional statistical graph construction approaches.
Method: Proposes two graph representations: superpixel ensemble graphs (SEG) and superpixel hierarchy graphs (SHG) with multiple node levels. Uses handcrafted Gaussian weights and learned optimization-based edge weights. Assigns nodal signals based on texture, geometric, and color features. Investigates graph edge thresholding (25%, 50%, 75% pruning). Evaluates with classifiers on ISIC2017 dataset with data augmentation.
Result: Learned superpixel ensemble graphs with textural nodal signals achieve highest performance: 99.00% accuracy and 99.59% AUC on melanoma detection.
Conclusion: Graph learning with superpixel representations and learned edge weights significantly improves melanoma detection performance, demonstrating the value of graph structure learning in biomedical image analysis.
Abstract: Graph signal processing (GSP) is becoming a major tool in biomedical signal and image analysis. In most GSP techniques, graph structures and edge weights have been typically set via statistical and computational methods. More recently, graph structure learning methods offered more reliable and flexible data representations. In this work, we introduce a graph learning approach for melanoma detection in dermoscopic images based on two graph-theoretic representations: superpixel ensemble graphs (SEG) and superpixel hierarchy graphs (SHG). For these two types of graphs, superpixel maps of a skin lesion image are respectively generated at multiple levels without and with parent-child constraints among superpixels at adjacent levels, where each level corresponds to a subgraph with a different number of nodes (20, 40, 60, 80, or 100 nodes). Two edge weight assignment techniques are explored: handcrafted Gaussian weights and learned weights based on optimization methods. The graph nodal signals are assigned based on texture, geometric, and color superpixel features. In addition, the effect of graph edge thresholding is investigated by applying different thresholds (25%, 50%, and 75%) to prune the weakest edges and analyze the impact of pruning on the melanoma detection performance. Experimental evaluation of the proposed method is performed with different classifiers trained and tested on the publicly available ISIC2017 dataset. Data augmentation is applied to alleviate class imbalance by adding more melanoma images from the ISIC archive. The results show that learned superpixel ensemble graphs with textural nodal signals give the highest performance reaching an accuracy of 99.00% and an AUC of 99.59%.
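The edge-thresholding step described above (pruning the weakest 25%, 50%, or 75% of edges) is easy to make concrete. A minimal numpy sketch under the assumption of a symmetric weighted adjacency matrix; this is an illustration, not the authors' code:

```python
import numpy as np

def prune_weakest_edges(W, keep_fraction):
    """Zero out the weakest edges of a symmetric weighted adjacency matrix,
    keeping only the strongest `keep_fraction` of existing edges
    (e.g. 75% pruning corresponds to keep_fraction=0.25)."""
    W = np.asarray(W, dtype=float)
    iu = np.triu_indices_from(W, k=1)        # each undirected edge once
    nonzero = W[iu][W[iu] > 0]
    if nonzero.size == 0:
        return W.copy()
    # keep edges at or above the (1 - keep_fraction) quantile of edge weights
    thresh = np.quantile(nonzero, 1.0 - keep_fraction)
    pruned = np.where(W >= thresh, W, 0.0)
    np.fill_diagonal(pruned, 0.0)
    return pruned
```

The quantile is computed over existing (nonzero) edges only, so already-sparse graphs are not over-pruned.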
[227] CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
Haimin Luo, Srinjay Sarkar, Albert Mosella-Montoro, Francisco Vicente Carrasco, Fernando De la Torre
Main category: cs.CV
TL;DR: Compact pipeline for hair reconstruction using 3D Gaussian Splatting with clustering and texture sharing to reduce storage and rendering costs while maintaining quality.
Details
Motivation: Recent 3D Gaussian Splatting methods achieve realistic hair reconstruction but require millions of primitives, leading to high storage and rendering costs. Hair exhibits structural and visual similarities across a hairstyle that can be exploited for efficiency.
Method: Clusters hair strands into representative hair cards and groups these into shared texture codebooks. Integrates this structure with 3DGS rendering and uses a generative prior accelerated method to reconstruct initial strand geometry from multi-view images.
Result: Achieves 4-fold reduction in strand reconstruction time and comparable rendering performance with over 200x lower memory footprint compared to existing methods.
Conclusion: The proposed compact pipeline enables efficient high-fidelity hair reconstruction by exploiting structural similarities in hair, significantly reducing computational costs while maintaining visual quality.
Abstract: We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200x lower memory footprint.
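The strand-to-card clustering step could be sketched with plain k-means over per-strand descriptors. A toy version assuming Euclidean strand features; the paper's actual clustering and card construction may differ:

```python
import numpy as np

def cluster_strands(strand_features, n_cards, n_iters=20, seed=0):
    """Group strands into `n_cards` clusters ("cards") with plain k-means
    over per-strand descriptors. Returns (labels, cluster centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(strand_features, dtype=float)
    centers = X[rng.choice(len(X), size=n_cards, replace=False)]
    for _ in range(n_iters):
        # assign every strand to its nearest card center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_cards):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels, centers
```

Each cluster can then share one texture codebook entry, which is where the memory savings come from.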
[228] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang
Main category: cs.CV
TL;DR: SymphoMotion: A unified motion-control framework for video generation that jointly governs camera trajectories and object dynamics using explicit camera paths and 3D trajectory embeddings.
Details
Motivation: Current video generation methods typically handle only one motion type (camera or object) or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement, limiting coherent and expressive video generation.
Method: SymphoMotion features: 1) Camera Trajectory Control integrating explicit camera paths with geometry-aware cues for stable viewpoint transitions, and 2) Object Dynamics Control combining 2D visual guidance with 3D trajectory embeddings for depth-aware object manipulation. Also introduces RealCOD-25K dataset with paired camera poses and object-level 3D trajectories.
Result: Extensive experiments and user studies show SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation.
Conclusion: SymphoMotion presents a unified framework for joint camera and object motion control in video generation, addressing key limitations of existing methods and demonstrating superior performance through comprehensive evaluation.
Abstract: Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation. Codes and data are publicly available at https://grenoble-zhang.github.io/SymphoMotion/.
[229] Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
Binyuan Huang, Yuning Lu, Weinan Jia, Hualiang Wang, Mu Liu, Daiqing Yang
Main category: cs.CV
TL;DR: PoCo introduces position embedding as context controller to solve reference confusion in multi-reference video generation, enabling precise token-level matching for characters with similar appearances.
Details
Motivation: Academic research lags behind proprietary models like Sora2 in generating multi-shot videos with multiple reference characters. A core challenge is reference confusion when reference images have highly similar appearances, where semantically similar tokens degrade the model's ability to retrieve correct context.
Method: Introduces PoCo (Position Embedding as a Context Controller) that incorporates position encoding as additional context control beyond semantic retrieval. Uses side information of tokens to enable precise token-level matching while preserving implicit semantic consistency modeling. Builds a multi-reference and multi-shot video generation model on top of PoCo.
Result: Extensive experiments show PoCo improves cross-shot consistency and reference fidelity compared with various baselines. The model can reliably control characters with extremely similar visual traits.
Conclusion: PoCo effectively addresses reference confusion in multi-reference video generation by using position embedding for context control, enabling better handling of characters with similar appearances.
Abstract: Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model’s ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
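The core idea, position encoding as side information that disambiguates semantically near-identical reference tokens, can be illustrated with a toy retrieval example. The one-hot position codes below are purely illustrative, not PoCo's actual embeddings:

```python
import numpy as np

def retrieve(query, keys):
    """Index of the key with the highest dot-product score."""
    return int(np.argmax(keys @ query))

# Two reference tokens whose semantic features are identical (confusable):
sem = np.array([[1.0, 0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0, 0.0]])

# Distinct position codes attached as side information (toy one-hot):
pos = np.array([[0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])

# Semantics alone cannot single out reference #1 (ties resolve to index 0):
ambiguous = retrieve(sem[1], sem)

# With position as extra context control, retrieval is exact:
resolved = retrieve(sem[1] + pos[1], sem + pos)
```

With identical semantics, `ambiguous` lands on the wrong reference, while `resolved` picks the intended one, which is the confusion PoCo targets for characters with extremely similar visual traits.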
[230] Shower-Aware Dual-Stream Voxel Networks for Structural Defect Detection in Cosmic-Ray Muon Tomography
Parthiv Dasgupta, Sambhav Agarwal, Palash Dutta, Raja Karmakar, Sudeshna Goswami
Main category: cs.CV
TL;DR: SA-DSVN is a 3D CNN architecture for voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography, leveraging both scattering kinematics and secondary electromagnetic shower multiplicities through cross-attention fusion.
Details
Motivation: Conventional muon tomography methods (POCA, MLSD) rely only on scattering angles, missing valuable information from secondary electromagnetic showers that occur when muons interact with materials. The authors aim to develop a more comprehensive approach that utilizes both scattering kinematics and shower multiplicities for better defect detection.
Method: Proposed SA-DSVN uses a 3D convolutional architecture with two independent encoder streams: one for scattering kinematics (9 channels) and another for secondary electromagnetic shower multiplicities (40 channels). These streams are fused via cross-attention. Training data was generated using Vega (cloud-native Geant4 simulation) with 4.5 million muon events across 900 volumes containing four defect types.
Result: Achieved 96.3% voxel accuracy, per-defect Dice scores of 0.59-0.81, and 100% volume-level detection sensitivity at 10 ms inference per volume. Ablation study showed shower multiplicity stream alone accounts for majority of discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only).
Conclusion: Secondary shower multiplicity is a previously unexploited but highly effective feature for learned muon tomographic reconstruction, significantly improving defect detection in reinforced concrete structures compared to conventional scattering-only methods.
Abstract: We present SA-DSVN, a 3D convolutional architecture for voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography. Unlike conventional reconstruction methods (POCA, MLSD) that rely solely on muon scattering angles, our approach jointly processes scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels) through independent encoder streams fused via cross-attention. Training data were generated using Vega, a cloud-native Geant4 simulation framework, producing 4.5 million muon events across 900 volumes containing four defect types - honeycombing, shear fracture, corrosion voids, and delamination - embedded within a dense 7x7 rebar cage. A five-variant ablation study demonstrates that the shower multiplicity stream alone accounts for the majority of discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only). On 60 independently simulated validation volumes, the model achieves 96.3% voxel accuracy, per-defect Dice scores of 0.59-0.81, and 100% volume-level detection sensitivity at 10 ms inference per volume. These results establish secondary shower multiplicity as a previously unexploited but highly effective feature for learned muon tomographic reconstruction.
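The cross-attention fusion of the two encoder streams can be sketched as single-head attention in which scattering features query shower-multiplicity features. Learned projections and the 3D voxel layout are omitted, so this is only an illustrative stand-in for the paper's fusion layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(scatter_feat, shower_feat):
    """Single-head cross-attention: scattering features (queries) attend
    over shower-multiplicity features (keys/values). Both inputs (N, d);
    learned query/key/value projections are omitted for brevity."""
    d = scatter_feat.shape[-1]
    scores = scatter_feat @ shower_feat.T / np.sqrt(d)
    return softmax(scores) @ shower_feat
```

Each output row is a shower-feature summary weighted by its relevance to the corresponding scattering feature, which is how one stream can inform the other before decoding.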
[231] ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yao, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet
Main category: cs.CV
TL;DR: ICBench: A new large-scale image captioning benchmark with 40K captions from 10 advanced MLLMs, featuring both short and long captions across 12 categories, with human MOS ratings and a novel automated evaluation metric ITIScore.
Details
Motivation: Existing image captioning benchmarks have limitations: limited caption length diversity, absence of recent advanced MLLMs, and insufficient human annotations, which introduce bias and limit comprehensive assessment of modern MLLMs.
Method: Created ICBench with 2K images across 12 categories, generated both short and long captions using 10 advanced MLLMs (40K captions total). Conducted human subjective studies for MOS ratings across fine-grained dimensions. Proposed ITIScore metric using image-to-text-to-image reconstruction consistency for automated evaluation.
Result: ITIScore shows strong alignment with human judgments and demonstrates robust zero-shot generalization on other public captioning datasets. The benchmark provides comprehensive evaluation of modern MLLMs’ captioning capabilities.
Conclusion: ICBench addresses limitations of existing benchmarks and provides a comprehensive evaluation framework for modern MLLMs’ image captioning capabilities, with both human annotations and an effective automated metric.
Abstract: Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduces bias and limits the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed, ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, \textbf{ITIScore}, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
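The image-to-text-to-image idea behind ITIScore reduces to scoring reconstruction consistency in an embedding space. A hedged skeleton, where `text_to_image` and `image_encoder` are hypothetical callables standing in for the framework's generative and embedding models:

```python
import numpy as np

def iti_consistency(orig_embed, recon_embed):
    """Cosine similarity between the original image's embedding and the
    embedding of the image regenerated from the candidate caption."""
    a = np.asarray(orig_embed, dtype=float)
    b = np.asarray(recon_embed, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def itiscore(image, caption, text_to_image, image_encoder):
    """Hypothetical pipeline: regenerate an image from the caption, then
    score reconstruction consistency in a shared embedding space."""
    reconstruction = text_to_image(caption)
    return iti_consistency(image_encoder(image), image_encoder(reconstruction))
```

A caption that preserves the image's content yields a reconstruction whose embedding stays close to the original, so a higher score indicates a more faithful caption.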
[232] M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting
Xingyu Miao, Xueqi Qiu, Haoran Duan, Yawen Huang, Xian Wu, Jingjing Deng, Yang Long
Main category: cs.CV
TL;DR: M2StyleGS enables real-time 3D style transfer using text or image references with 3D Gaussian Splatting and CLIP-based multi-modal alignment, achieving better visual quality and consistency than previous methods.
Details
Motivation: Traditional 3D style transfer methods rely on fixed reference images, but users in VR/AR applications need more flexible inputs like textual descriptions and diverse imagery for creative control.
Method: Uses 3D Gaussian Splatting (3DGS) for 3D representation and CLIP for multi-modal style reference. Introduces subdivisive flow for precise feature alignment between CLIP text-visual features and VGG style features, plus observation loss for style matching and suppression loss for color consistency.
Result: M2StyleGS achieves better visual quality and surpasses previous work by up to 32.92% in consistency metrics, enabling real-time generation of style-enhanced novel views from text or image references.
Conclusion: The method successfully enables flexible 3D style transfer using multi-modal references (text/images) with improved consistency and visual quality, making it suitable for practical VR/AR applications.
Abstract: Conventional 3D style transfer methods rely on a fixed reference image to apply artistic patterns to 3D scenes. However, in practical applications such as virtual or augmented reality, users often prefer more flexible inputs, including textual descriptions and diverse imagery. In this work, we introduce a novel real-time styling technique M2StyleGS to generate a sequence of precisely color-mapped views. It utilizes 3D Gaussian Splatting (3DGS) as a 3D representation and multi-modality knowledge refined by CLIP as a reference style. M2StyleGS resolves the abnormal transformation issue by employing a precise feature alignment, namely subdivisive flow, which strengthens the projection of the mapped CLIP text-visual combination feature to the VGG style feature. In addition, we introduce observation loss, which assists in the stylized scene better matching the reference style during the generation, and suppression loss, which suppresses the offset of reference color information throughout the decoding process. By integrating these approaches, M2StyleGS can employ text or images as references to generate a set of style-enhanced novel views. Our experiments show that M2StyleGS achieves better visual quality and surpasses the previous work by up to 32.92% in terms of consistency.
[233] When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks
Yuanhang Li
Main category: cs.CV
TL;DR: First diagnostic comparison of Vision-Language Models vs CNNs for spectrum heatmap understanding in wireless networks, showing task-dependent complementarity with CNNs better for spatial tasks and VLMs enabling semantic reasoning.
Details
Motivation: No systematic understanding exists of where large foundation models outperform lightweight CNNs for spectrum-related tasks in wireless network management, despite accelerating adoption of VLMs.
Method: Introduced SpectrumQA benchmark with 108K visual Q-A pairs across four granularity levels. Compared frozen Qwen2-VL-7B VLM with trained ResNet-18 CNN across three NTN-TN scenarios, using chain-of-thought prompting and task-type routing.
Result: Clear task-dependent complementarity: CNN better for severity classification (72.9% accuracy) and spatial localization (0.552 IoU), while VLM uniquely enables semantic reasoning (F1=0.576) with few-shot learning. CoT improves VLM reasoning by 12.6%. Task-type router achieves 39.1% improvement over CNN alone.
Conclusion: VLMs and CNNs are complementary, not substitutes: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning. VLM representations show stronger cross-scenario robustness.
Abstract: The adoption of vision-language models (VLMs) for wireless network management is accelerating, yet no systematic understanding exists of where these large foundation models outperform lightweight convolutional neural networks (CNNs) for spectrum-related tasks. This paper presents the first diagnostic comparison of VLMs and CNNs for spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. We introduce SpectrumQA, a benchmark comprising 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Our experiments on three NTN-TN scenarios with a frozen Qwen2-VL-7B and a trained ResNet-18 reveal a clear task-dependent complementarity: CNN achieves 72.9% accuracy at severity classification (L1) and 0.552 IoU at spatial localization (L3), while VLM uniquely enables semantic reasoning (L4) with F1=0.576 using only three in-context examples, a capability fundamentally absent in CNN architectures. Chain-of-thought (CoT) prompting further improves VLM reasoning by 12.6% (F1: 0.209->0.233) while having zero effect on spatial tasks, confirming that the complementarity is rooted in architectural differences rather than prompting limitations. A deterministic task-type router that delegates supervised tasks to CNN and reasoning tasks to VLM achieves a composite score of 0.616, a 39.1% improvement over CNN alone. We further show that VLM representations exhibit stronger cross-scenario robustness, with smaller performance degradation in 5 out of 6 transfer directions. These findings provide actionable guidelines: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning, rather than treating them as substitutes.
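The deterministic task-type router is simple to sketch. The assignment of L2 to the VLM side is our assumption, since the abstract only states that supervised tasks go to the CNN and reasoning tasks to the VLM:

```python
def route_task(level):
    """Deterministic task-type router: perception/localization levels go
    to the trained CNN, reasoning levels to the frozen VLM."""
    cnn_levels = {"L1", "L3"}  # severity classification, spatial localization
    vlm_levels = {"L2", "L4"}  # regional + semantic reasoning (L2 side assumed)
    if level in cnn_levels:
        return "cnn"
    if level in vlm_levels:
        return "vlm"
    raise ValueError(f"unknown task level: {level!r}")
```

Routing by task type is what lets the composite system beat either model alone, since each backend only sees the levels it handles well.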
[234] Confidence-Driven Facade Refinement of 3D Building Models Using MLS Point Clouds
Xiaoyu Huang
Main category: cs.CV
TL;DR: Automated refinement framework that uses coarse CityGML building models as geometric priors to integrate high-precision MLS data for facade geometry recovery while preserving semantic information and ensuring topological validity.
Details
Motivation: Traditional coarse CityGML building models from ALS have significant geometric deficiencies in facades due to nadir perspective. Need to integrate with high-precision MLS data for detailed facade geometry while avoiding reconstruction-from-scratch approaches that discard existing semantic information.
Method: Uses coarse model as geometric prior, integrates surface matching to identify outdated surfaces, employs binary integer optimization to select optimal faces from candidate data with hard constraints to ensure topological validity of refined output.
Result: Effectively corrects facade misalignments, reduces Cloud-to-Mesh RMSE by approximately 36%, achieves centimeter-level alignment, and guarantees strictly watertight and manifold geometry.
Conclusion: Provides robust solution for upgrading ALS-derived city models through automated refinement that preserves semantic information while improving geometric accuracy.
Abstract: Digital twins require continuous maintenance to meet the increasing demand for high-precision geospatial data. However, traditional coarse CityGML building models, typically derived from Airborne Laser Scanning (ALS), often exhibit significant geometric deficiencies, particularly regarding facade accuracy due to the nadir perspective of airborne sensors. Integrating these coarse models with high-precision Mobile Laser Scanning (MLS) data is essential to recover detailed facade geometry. Unlike reconstruction-from-scratch approaches that discard existing semantic information and rely heavily on complete data coverage, this work presents an automated refinement framework that utilizes the coarse model as a geometric prior. This method enables targeted updates to facade geometry even in complex urban environments. It integrates surface matching to identify outdated surfaces and employs a binary integer optimization to select optimal faces from candidate data. Crucially, hard constraints are enforced within the optimization to ensure the topological validity of the refined output. Experimental results demonstrate that the proposed approach effectively corrects facade misalignments, reducing the Cloud-to-Mesh RMSE by approximately 36% and achieving centimeter-level alignment. Furthermore, the framework guarantees strictly watertight and manifold geometry, providing a robust solution for upgrading ALS-derived city models.
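The face-selection step, a binary integer optimization with hard constraints, can be illustrated by exhaustive search over a small candidate set. The pairwise-conflict constraints below are a simplification of the paper's topological-validity constraints, and the scores are hypothetical fitness values:

```python
from itertools import product

def select_faces(scores, conflicts):
    """Choose the 0/1 subset of candidate faces maximizing total fitness,
    subject to hard pairwise exclusivity constraints, by exhaustive search."""
    n = len(scores)
    best, best_score = (0,) * n, 0.0
    for assign in product((0, 1), repeat=n):
        if any(assign[i] and assign[j] for i, j in conflicts):
            continue  # violates a hard constraint
        total = sum(s * a for s, a in zip(scores, assign))
        if total > best_score:
            best, best_score = assign, total
    return best
```

For realistic candidate counts a dedicated binary integer programming solver replaces the exhaustive loop; the objective-plus-hard-constraints structure stays the same.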
[235] Next-Scale Autoregressive Models for Text-to-Motion Generation
Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao
Main category: cs.CV
TL;DR: MoScale: A next-scale autoregressive framework for hierarchical text-to-motion generation from coarse to fine temporal resolutions with cross-scale refinement.
Details
Motivation: Standard autoregressive next-token prediction is not well-aligned with the temporal structure needed for text-conditioned motion generation, requiring a better causal hierarchy for long-range motion structure.
Method: Introduces MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions, with cross-scale hierarchical refinement for improving per-scale predictions and in-scale temporal refinement for selective bidirectional re-prediction.
Result: Achieves state-of-the-art text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
Conclusion: MoScale’s hierarchical next-scale approach provides a better causal structure for motion generation than standard AR models, enabling efficient and high-quality text-to-motion synthesis with strong generalization capabilities.
Abstract: Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
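The coarse-to-fine temporal hierarchy can be made concrete by building a temporal pyramid of a motion sequence via average pooling; an illustrative sketch, not MoScale's actual tokenizer:

```python
import numpy as np

def temporal_pyramid(motion, scales):
    """Represent a motion sequence (T x D) at progressively coarser temporal
    resolutions by average-pooling over time. A next-scale AR model would
    generate the coarsest level first, then condition each finer level
    on the levels above it."""
    T, D = motion.shape
    levels = []
    for s in scales:               # e.g. [4, 2, 1]: pooling factor per level
        t = T // s
        pooled = motion[: t * s].reshape(t, s, D).mean(axis=1)
        levels.append(pooled)
    return levels
```

The coarsest level carries the global semantics of the clip, and each subsequent level only has to add finer temporal detail, which is the causal hierarchy the summary describes.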
[236] HistoFusionNet: Histogram-Guided Fusion and Frequency-Adaptive Refinement for Nighttime Image Dehazing
Mohammad Heydari, Wei Dong, Shahram Shirani, Jun Chen, Han Zhou
Main category: cs.CV
TL;DR: HistoFusionNet: A transformer-enhanced architecture for nighttime image dehazing that combines histogram-guided representation learning with frequency-adaptive feature refinement to address complex nighttime degradations like haze, glow, and non-uniform illumination.
Details
Motivation: Nighttime image dehazing is challenging due to joint presence of haze, glow, non-uniform illumination, color distortion, and noise that invalidate daytime dehazing assumptions. Existing methods struggle with heterogeneous degradations in real nighttime scenes.
Method: Multi-scale encoder-decoder backbone with histogram transformer blocks that model long-range dependencies by grouping features based on dynamic-range characteristics. Includes frequency-aware refinement branch that adaptively exploits low- and high-frequency cues for better restoration fidelity.
Result: Achieved highly competitive performance on NTIRE 2026 Nighttime Image Dehazing Challenge benchmark, ranking 1st among 22 participating teams, demonstrating robustness and effectiveness.
Conclusion: HistoFusionNet provides a unified framework well-suited for heterogeneous degradations in nighttime hazy scenes, effectively combining histogram-guided learning with frequency-adaptive refinement for superior nighttime dehazing performance.
Abstract: Nighttime image dehazing remains a challenging low-level vision problem due to the joint presence of haze, glow, non-uniform illumination, color distortion, and sensor noise, which often invalidate assumptions commonly used in daytime dehazing. To address these challenges, we propose HistoFusionNet, a transformer-enhanced architecture tailored for nighttime image dehazing by combining histogram-guided representation learning with frequency-adaptive feature refinement. Built upon a multi-scale encoder-decoder backbone, our method introduces histogram transformer blocks that model long-range dependencies by grouping features according to their dynamic-range characteristics, enabling more effective aggregation of similarly degraded regions under complex nighttime lighting. To further improve restoration fidelity, we incorporate a frequency-aware refinement branch that adaptively exploits complementary low- and high-frequency cues, helping recover scene structures, suppress artifacts, and enhance local details. This design yields a unified framework that is particularly well suited to the heterogeneous degradations encountered in real nighttime hazy scenes. Extensive experiments and highly competitive performance of our method on the NTIRE 2026 Nighttime Image Dehazing Challenge benchmark demonstrate the effectiveness of the proposed method. Our team ranked 1st among 22 participating teams, highlighting the robustness and competitive performance of HistoFusionNet. The code is available at: https://github.com/heydarimo/Night-Time-Dehazing
[237] Rényi Attention Entropy for Patch Pruning
Hiroaki Aizawa, Yuki Igaue
Main category: cs.CV
TL;DR: A patch pruning method for vision transformers that uses Shannon and Rényi entropy of attention distributions to identify and remove redundant patches, reducing computational cost while maintaining accuracy.
Details
Motivation: Transformers have quadratic computational complexity with token count, making them expensive for vision tasks. Patch pruning can reduce this cost, but existing methods lack principled criteria for identifying which patches to prune based on their informativeness.
Method: Proposes using Shannon entropy of attention distributions to measure patch importance: low-entropy patches (selective attention) are kept, high-entropy patches (spread attention) are pruned. Extends to Rényi entropy for adjustable emphasis on sharp attention peaks, allowing adaptive pruning strategies based on task needs and computational constraints.
Result: The method reduces computation while preserving accuracy on fine-grained image recognition tasks. Adjusting pruning policy through Rényi entropy yields further gains and improves the accuracy-computation trade-off.
Conclusion: Entropy-based patch pruning provides a principled approach to reduce transformer computational cost in vision tasks, with Rényi entropy offering flexible control over the pruning-aggressiveness trade-off.
Abstract: Transformers are strong baselines in both vision and language because self-attention captures long-range dependencies across tokens. However, the cost of self-attention grows quadratically with the number of tokens. Patch pruning mitigates this cost by estimating per-patch importance and removing redundant patches. To identify informative patches for pruning, we introduce a criterion based on the Shannon entropy of the attention distribution. Low-entropy patches, which receive selective and concentrated attention, are kept as important, while high-entropy patches with attention spread across many locations are treated as redundant. We also extend the criterion from Shannon to Rényi entropy, which emphasizes sharp attention peaks and supports pruning strategies that adapt to task needs and computational limits. In experiments on fine-grained image recognition, where patch selection is critical, our method reduced computation while preserving accuracy. Moreover, adjusting the pruning policy through the Rényi entropy measure yields further gains and improves the trade-off between accuracy and computation.
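The entropy criterion is concrete enough to sketch directly. Below is a self-contained toy illustration (not the authors' code): each patch's attention row is scored by Rényi entropy, and the lowest-entropy (most selectively attended) patches are kept. The order α and the keep ratio are illustrative choices.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (nats) of one patch's attention distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha; alpha -> 1 recovers Shannon.
    Larger alpha puts more weight on sharp attention peaks."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and != 1")
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def prune_patches(attention_rows, keep_ratio, alpha=2.0):
    """Keep the lowest-entropy patches; return their (sorted) indices."""
    scores = [renyi_entropy(row, alpha) for row in attention_rows]
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])

# A sharply peaked row (selective attention) vs. uniform rows (spread attention):
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
kept = prune_patches([peaked, uniform, uniform, peaked], keep_ratio=0.5)
print(kept)  # the two peaked (low-entropy) patches survive -> [0, 3]
```

Raising α sharpens the criterion: high-α Rényi entropy rewards patches whose attention is concentrated on a few locations, which is how the paper's adjustable pruning policy is described.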
[238] Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement
Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Zhipeng Wang, Shao Tang, Oana Dumitrascu, Yalin Wang
Main category: cs.CV
TL;DR: EyeBench-V2 is a benchmark for evaluating fundus image enhancement models with clinical alignment, focusing on lesion preservation, vessel morphology, and expert-guided assessment.
Details
Motivation: Current evaluation metrics (PSNR, SSIM) fail to capture clinically relevant features, unified protocols for paired/unpaired methods are lacking, and existing evaluations offer few actionable insights into clinical utility.
Method: Introduces multi-dimensional clinical alignment through downstream evaluations (vessel segmentation, DR grading, lesion segmentation), expert-guided evaluation design with structured manual assessment, and a curated dataset for fair comparisons.
Result: Provides a benchmark that bridges enhancement model performance with clinical utility, offering rigorous task-oriented analysis and identifying limitations in current methods.
Conclusion: EyeBench-V2 addresses critical gaps in fundus image enhancement evaluation by focusing on clinical relevance and providing actionable insights for future model development.
Abstract: Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical-alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.
[239] InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset
Felix Stillger, Lukas Hahn, Frederik Hasecke, Tobias Meisen
Main category: cs.CV
TL;DR: InCaRPose: Transformer-based architecture for robust relative pose estimation between image pairs, specifically designed for in-cabin automotive monitoring with highly distorted fisheye cameras.
Details
Motivation: Camera extrinsic calibration is fundamental but challenging in constrained, highly distorted environments like in-cabin automotive monitoring where precise real-world distances are required for safety-relevant perception.
Method: Transformer-based architecture using frozen backbone features (DINOv3) and Transformer decoder to capture geometric relationships between reference and target views, trained exclusively on synthetic data to handle fisheye distortion.
Result: Achieves absolute metric-scale translation within physically plausible adjustment range in single inference, generalizes to real-world cabin environments without exact camera intrinsics, maintains high precision with ViT-Small backbone, and achieves competitive performance on 7-Scenes dataset.
Conclusion: InCaRPose enables robust relative pose prediction for camera extrinsic calibration in challenging automotive environments, supporting real-time performance for safety-critical applications like driver monitoring.
Abstract: Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real-world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at https://github.com/felixstillger/InCaRPose.
[240] ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
Peijun Bao, Anwei Luo, Gang Pan, Alex C. Kot, Xudong Jiang
Main category: cs.CV
TL;DR: ActivityForensics: First large-scale benchmark for localizing manipulated human activity in videos, with over 6K forged segments and a diffusion-based baseline method TADiff.
Details
Motivation: Current video forgery localization benchmarks focus on appearance-level manipulations (face swapping, object removal), but recent video generation advances enable activity-level forgeries that modify human actions to distort event semantics, creating highly deceptive content that undermines media authenticity.
Method: Introduces ActivityForensics benchmark with over 6K forged video segments seamlessly blended into context. Proposes Temporal Artifact Diffuser (TADiff) baseline that exposes artifact cues through diffusion-based feature regularizer. Establishes evaluation protocols for intra-domain, cross-domain, and open-world settings.
Result: Comprehensive benchmark with extensive forged content that’s visually consistent and hard to distinguish from authentic videos. Benchmarks state-of-the-art forgery localizers to facilitate future research. Dataset and code publicly available.
Conclusion: Addresses critical gap in video forgery detection by focusing on activity-level manipulations enabled by modern video generation, providing essential resources and baseline methods for advancing research in temporal forgery localization.
Abstract: Temporal forgery localization aims to temporally identify manipulated segments in videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To overcome this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in videos. It contains over 6K forged video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset and code are available at https://activityforensics.github.io.
[241] SPARK-IL: Spectral Retrieval-Augmented RAG for Knowledge-driven Deepfake Detection via Incremental Learning
Hessen Bougueffa Eutamene, Abdellah Zakaria Sellam, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Main category: cs.CV
TL;DR: SPARK-IL is a retrieval-augmented framework for detecting AI-generated images that combines dual-path spectral analysis with incremental learning, achieving 94.6% mean accuracy across 19 generative models.
Details
Motivation: Current AI-generated image detectors trained on specific generators fail to generalize to unseen models. While pixel-level artifacts vary across models, frequency-domain signatures show greater consistency, providing a promising foundation for cross-generator detection.
Method: Proposes SPARK-IL framework with dual-path spectral analysis: (1) semantic representations from partially frozen ViT-L/14 encoder, (2) raw RGB pixel embeddings. Both paths undergo multi-band Fourier decomposition into four frequency bands, processed by Kolmogorov-Arnold Networks with mixture-of-experts. Spectral embeddings are fused via cross-attention with residual connections. During inference, fused embedding retrieves k-nearest labeled signatures from Milvus database using cosine similarity for majority voting predictions. Includes incremental learning strategy with elastic weight consolidation to preserve learned transformations.
Result: Evaluated on UniversalFakeDetect benchmark across 19 generative models (GANs, face-swapping, diffusion methods). Achieves 94.6% mean accuracy. Code to be publicly released.
Conclusion: SPARK-IL effectively addresses cross-generator AI image detection by leveraging frequency-domain consistency and retrieval-augmented incremental learning, demonstrating strong generalization across diverse generative models.
Abstract: Detecting AI-generated images remains a significant challenge because detectors trained on specific generators often fail to generalize to unseen models; however, while pixel-level artifacts vary across models, frequency-domain signatures exhibit greater consistency, providing a promising foundation for cross-generator detection. To address this, we propose SPARK-IL, a retrieval-augmented framework that combines dual-path spectral analysis with incremental learning by utilizing a partially frozen ViT-L/14 encoder for semantic representations alongside a parallel path for raw RGB pixel embeddings. Both paths undergo multi-band Fourier decomposition into four frequency bands, which are individually processed by Kolmogorov-Arnold Networks (KAN) with mixture-of-experts for band-specific transformations before the resulting spectral embeddings are fused via cross-attention with residual connections. During inference, this fused embedding retrieves the $k$ nearest labeled signatures from a Milvus database using cosine similarity to facilitate predictions via majority voting, while an incremental learning strategy expands the database and employs elastic weight consolidation to preserve previously learned transformations. Evaluated on the UniversalFakeDetect benchmark across 19 generative models – including GANs, face-swapping, and diffusion methods – SPARK-IL achieves a 94.6% mean accuracy, with the code to be publicly released at https://github.com/HessenUPHF/SPARK-IL.
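The retrieval-and-vote inference step described above can be sketched in a few lines. This is a hypothetical stand-in that uses plain Python lists in place of a Milvus index; the embeddings and labels are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def knn_vote(query, database, k=3):
    """Retrieve the k most similar labeled signatures and majority-vote.
    `database` is a list of (embedding, label) pairs."""
    ranked = sorted(database, key=lambda item: cosine(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy database of fused spectral embeddings with ground-truth labels:
db = [([1.0, 0.1], "fake"), ([0.9, 0.2], "fake"), ([0.95, 0.05], "fake"),
      ([0.1, 1.0], "real"), ([0.2, 0.9], "real")]
print(knn_vote([1.0, 0.0], db, k=3))  # -> fake
```

Incremental learning in this scheme is cheap at the database side: new generators are covered by appending their labeled signatures, while elastic weight consolidation (not sketched here) protects the learned feature transformations.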
[242] Task-Guided Multi-Annotation Triplet Learning for Remote Sensing Representations
Meilun Zhou, Alina Zare
Main category: cs.CV
TL;DR: Task-guided multi-annotation triplet loss uses mutual information criteria to select informative triplets across tasks, improving shared representation learning for multi-task scenarios.
Details
Motivation: Existing multi-task triplet loss methods use static weights to balance supervision between different annotation types, which requires manual tuning and does not account for task interactions when shaping shared representations.
Method: Proposes a task-guided multi-annotation triplet loss that removes the static-weighting dependency by selecting, via a mutual-information criterion, the triplets most informative across tasks. This modifies which samples influence the representation rather than adjusting loss magnitudes.
Result: Experiments on aerial wildlife dataset show improved classification and regression performance compared to several triplet loss setups. Task-aware triplet selection produces more effective shared representation for downstream tasks.
Conclusion: Task-guided triplet selection through mutual information criteria is more effective than static weighting for multi-task representation learning, producing better shared representations for downstream applications.
Abstract: Prior multi-task triplet loss methods relied on static weights to balance supervision between various types of annotation. However, static weighting requires tuning and does not account for how tasks interact when shaping a shared representation. To address this, the proposed task-guided multi-annotation triplet loss removes this dependency by selecting triplets through a mutual-information criterion that identifies triplets most informative across tasks. This strategy modifies which samples influence the representation rather than adjusting loss magnitudes. Experiments on an aerial wildlife dataset compare the proposed task-guided selection against several triplet loss setups for shaping a representation in an effective multi-task manner. The results show improved classification and regression performance and demonstrate that task-aware triplet selection produces a more effective shared representation for downstream tasks.
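One plausible reading of a mutual-information triplet criterion can be sketched with discrete labels. Everything here, the pointwise-MI scoring, the example annotations, and the candidate list, is a hypothetical illustration, not the paper's exact criterion.

```python
import math

def pointwise_mi(xs, ys, i):
    """Pointwise mutual information (nats) of sample i's labels
    under two annotation tasks xs and ys."""
    n = len(xs)
    px = xs.count(xs[i]) / n
    py = ys.count(ys[i]) / n
    pxy = sum(1 for a, b in zip(xs, ys) if (a, b) == (xs[i], ys[i])) / n
    return math.log(pxy / (px * py))

def score_triplet(triplet, xs, ys):
    """Score a candidate (anchor, positive, negative) by how informative
    its members' labels are jointly across the two tasks."""
    return sum(pointwise_mi(xs, ys, i) for i in triplet)

# Hypothetical annotations: a species label and a discretized count label.
species = ["bird", "bird", "bird", "deer"]
counts = ["low", "low", "high", "high"]
candidates = [(0, 1, 2), (0, 1, 3), (1, 2, 3)]
best = max(candidates, key=lambda t: score_triplet(t, species, counts))
print(best)  # -> (0, 1, 3)
```

The selection effect is the key point: the chosen triplet avoids the sample whose labels disagree across tasks (index 2, a "bird" with a "high" count), so only cross-task-consistent samples shape the shared representation.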
[243] Beyond Task-Driven Features for Object Detection
Meilun Zhou, Alina Zare
Main category: cs.CV
TL;DR: Annotation-guided feature augmentation framework improves object detection by injecting annotation geometry into backbone features, enhancing generalization and robustness across supervision regimes.
Details
Motivation: Modern object detectors learn task-driven features that often capture shortcut correlations rather than underlying annotation structure, limiting transferability, interpretability, and robustness when task definitions change or supervision becomes sparse.
Method: Introduces annotation-guided feature augmentation that constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence region proposal and detection heads.
Result: Experiments across wildlife and remote sensing datasets show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks.
Conclusion: Aligning features with annotation geometry yields more meaningful representations than purely task-optimized features, enhancing transfer, interpretability, and robustness.
Abstract: Task-driven features learned by modern object detectors optimize end task loss yet often capture shortcut correlations that fail to reflect underlying annotation structure. Such representations limit transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. This paper introduces an annotation-guided feature augmentation framework that injects embeddings into an object detection backbone. The method constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence region proposal and detection heads. Experiments across wildlife and remote sensing datasets evaluate classification, localization, and data efficiency under multiple supervision regimes. Results show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks. The findings demonstrate that aligning features with annotation geometry yields more meaningful representations than purely task optimized features.
[244] Training a Student Expert via Semi-Supervised Foundation Model Distillation
Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu
Main category: cs.CV
TL;DR: SSKD framework compresses vision foundation models into compact experts using semi-supervised knowledge distillation with limited labeled and abundant unlabeled data, applied to instance segmentation.
Details
Motivation: Vision foundation models have strong perception but are computationally heavy to deploy, and adapting them requires costly annotations, especially for per-pixel tasks like instance segmentation.
Method: Three-stage framework: 1) Domain adaptation of VFM via self-training with contrastive calibration, 2) Knowledge transfer through unified multi-objective loss, 3) Student refinement to mitigate pseudo-label bias. Uses instance-aware pixel-wise contrastive loss that fuses mask and class scores.
Result: On Cityscapes and ADE20K, the ≈11× smaller student improves over zero-shot VFM teacher by +11.9 and +8.6 AP, surpasses adapted teacher by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods.
Conclusion: The SSKD framework effectively compresses vision foundation models into compact experts while maintaining strong performance, addressing computational and annotation challenges in vision tasks.
Abstract: Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.
[245] Learning 3D Reconstruction with Priors in Test Time
Lei Zhou, Haoyu Wu, Akshat Dave, Dimitris Samaras
Main category: cs.CV
TL;DR: Test-time constrained optimization framework for multiview Transformers that incorporates 3D priors (camera poses, intrinsics, depth) without retraining, improving 3D vision tasks through self-supervised objectives and prior penalty terms.
Details
Motivation: Existing multiview Transformers (MVTs) are typically image-only networks that don't leverage available 3D priors (camera poses, intrinsics, depth). Retraining networks to incorporate these priors is computationally expensive and requires architectural modifications.
Method: Proposes a test-time constrained optimization (TCO) framework that treats priors as constraints on predictions rather than architectural inputs. Uses self-supervised objectives (photometric/geometric losses between multiview renderings) and converts available priors into penalty terms on corresponding output modalities. Optimizes the network at inference time without retraining.
Result: Significantly improves performance over base MVTs on 3D benchmarks (ETH3D, 7-Scenes, NRGBD). Reduces point-map distance error by more than half compared to base image-only models. Outperforms retrained prior-aware feed-forward methods.
Conclusion: The TCO framework effectively incorporates 3D priors into multiview Transformers without architectural changes or retraining, demonstrating superior performance on 3D vision tasks through test-time optimization with self-supervised objectives and prior constraints.
Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.
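The test-time objective, a self-supervised consistency term plus weighted prior penalties, can be illustrated with a one-dimensional toy problem. Scalars stand in for renderings and depth priors here; this is not the authors' implementation.

```python
def tco_loss(pred, cross_view, priors, weights):
    """Toy TCO objective: a self-supervised consistency term (stand-in for
    the photometric/geometric loss) plus penalties toward each prior."""
    self_sup = (pred - cross_view) ** 2
    penalty = sum(w * (pred - p) ** 2 for p, w in zip(priors, weights))
    return self_sup + penalty

def optimize(pred, cross_view, priors, weights, lr=0.1, steps=200):
    """Gradient descent at inference time (numeric gradient for the toy)."""
    eps = 1e-5
    for _ in range(steps):
        g = (tco_loss(pred + eps, cross_view, priors, weights)
             - tco_loss(pred - eps, cross_view, priors, weights)) / (2 * eps)
        pred -= lr * g
    return pred

# A cross-view estimate of 1.0 and a single depth-like prior at 2.0:
refined = optimize(pred=0.0, cross_view=1.0, priors=[2.0], weights=[1.0])
print(round(refined, 3))  # settles at the weighted compromise, 1.5
```

The design choice the sketch captures: priors never enter the architecture, they only reshape the inference-time loss landscape, so dropping a penalty term recovers the original image-only model unchanged.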
[246] Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders
Atahan Dokme, Sriram Vishwanath
Main category: cs.CV
TL;DR: Sparse Autoencoders for video representations, with temporal coherence recovered via spatio-temporal contrastive objectives and Matryoshka grouping.
Details
Motivation: Standard Sparse Autoencoders (SAEs) decompose video into interpretable features but destroy temporal coherence due to unstable feature assignments across frames, reducing autocorrelation by 36%.
Method: Proposes spatio-temporal contrastive objectives and Matryoshka hierarchical grouping to recover temporal coherence. The contrastive loss weight controls the trade-off between reconstruction and temporal coherence.
Result: Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval by up to 2.8× in R@1. Cross-backbone analysis reveals standard monosemanticity metrics contain a backbone-alignment artifact.
Conclusion: Different SAE configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive training concentrates predictive signal into identifiable features.
Abstract: We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover and even exceed raw temporal coherence. The contrastive loss weight controls a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval by up to 2.8× in R@1. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under neutral (CLIP) similarity. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
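The temporal-coherence measure discussed here, autocorrelation of a feature's activation across frames, is easy to make concrete. A minimal sketch, assuming lag-1 autocorrelation as the measure (the paper does not specify the lag):

```python
def lag1_autocorr(series):
    """Lag-1 autocorrelation of one feature's activation over frames."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # constant feature: define coherence as 0
    cov = sum((series[t] - mean) * (series[t + 1] - mean)
              for t in range(n - 1))
    return cov / var

smooth = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # coherent feature across frames
flicker = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # unstable TopK assignment
print(lag1_autocorr(smooth), lag1_autocorr(flicker))  # 0.5 vs. -0.833...
```

The flickering series is what hard TopK selection produces when a feature drops in and out of the active set frame to frame; the contrastive objective pushes features toward the smooth regime.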
[247] SafeCtrl: Region-Aware Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress
Lingyun Zhang, Yu Xie, Zhongli Fang, Yu Liu, Ping Chen
Main category: cs.CV
TL;DR: SafeCtrl is a region-aware safety control framework for text-to-image diffusion models that detects and suppresses harmful content in specific risk regions while preserving surrounding context, achieving better safety-fidelity trade-off and adversarial robustness.
Details
Motivation: Current text-to-image diffusion models face challenges in generating visually harmful content (sexual, violent, horror). Existing safety interventions suffer from poor trade-off between safety and context preservation, and vulnerability to adversarial attacks that bypass safety mechanisms.
Method: SafeCtrl operates on a Detect-Then-Suppress paradigm: 1) Attention-guided Detect module localizes specific risk regions, 2) Localized Suppress module neutralizes harmful semantics only within detected areas using image-level Direct Preference Optimization (DPO), transforming unsafe objects into safe alternatives while preserving surrounding context.
Result: Extensive experiments across multiple risk categories show SafeCtrl achieves superior trade-off between safety and fidelity compared to state-of-the-art methods, with improved resilience against adversarial prompt attacks.
Conclusion: SafeCtrl offers a precise and robust solution for responsible generation in text-to-image diffusion models by localizing safety interventions to risk regions rather than applying global modifications, addressing both safety-fidelity trade-off and adversarial vulnerability.
Abstract: The widespread deployment of text-to-image diffusion models is significantly challenged by the generation of visually harmful content, such as sexually explicit content, violence, and horror imagery. Common safety interventions, ranging from input filtering to model concept erasure, often suffer from two critical limitations: (1) a severe trade-off between safety and context preservation, where removing unsafe concepts degrades the fidelity of the safe content, and (2) vulnerability to adversarial attacks, where safety mechanisms are easily bypassed. To address these challenges, we propose SafeCtrl, a Region-Aware safety control framework operating on a Detect-Then-Suppress paradigm. Unlike global safety interventions, SafeCtrl first employs an attention-guided Detect module to precisely localize specific risk regions. Subsequently, a localized Suppress module, optimized via image-level Direct Preference Optimization (DPO), neutralizes harmful semantics only within the detected areas, effectively transforming unsafe objects into safe alternatives while leaving the surrounding context intact. Extensive experiments across multiple risk categories demonstrate that SafeCtrl achieves a superior trade-off between safety and fidelity compared to state-of-the-art methods. Crucially, our approach exhibits improved resilience against adversarial prompt attacks, offering a precise and robust solution for responsible generation.
[248] Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso
Fei Wang, Yutong Zhang, Xiong Wang
Main category: cs.CV
TL;DR: CM-GLasso: A cross-modal graphical lasso method that learns interpretable multimodal representations by aligning visual-linguistic features, extracting spatial-aware cross-modal priors, and disentangling invariant vs. class-specific dependencies.
Details
Motivation: Existing sparse graph estimation techniques like Graphical Lasso struggle with multimodal data due to high-dimensional noise, modality misalignment, and inability to separate shared versus category-specific dependencies in visual-linguistic domains.
Method: Proposes Cross-Modal Graphical Lasso (CM-GLasso) with: 1) text-visualization strategy and unified vision-language encoder for feature alignment, 2) cross-attention distillation to condense high-dimensional patches into semantic nodes, 3) joint optimization of tailored GLasso estimation and Common-Specific Structure Learning via ADMM.
Result: Extensive experiments across eight benchmarks in natural and medical domains show CM-GLasso establishes new state-of-the-art in generative classification and dense semantic segmentation tasks.
Conclusion: CM-GLasso overcomes fundamental limitations of existing sparse graph estimation methods for multimodal data, enabling interpretable representation learning with simultaneous disentanglement of invariant and class-specific dependencies.
Abstract: Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, extending sparse graph estimation techniques, such as Graphical Lasso (GLasso), to visual-linguistic domains is severely bottlenecked by high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. In this paper, we propose Cross-Modal Graphical Lasso (CM-GLasso), which overcomes these fundamental limitations. By coupling a novel text-visualization strategy with a unified vision-language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, naturally extracting spatial-aware cross-modal priors. Furthermore, we unify tailored GLasso estimation and Common-Specific Structure Learning (CSSL) into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM). This formulation guarantees the simultaneous disentanglement of invariant and class-specific precision matrices without multi-step error accumulation. Extensive experiments across eight benchmarks covering both natural and medical domains demonstrate that CM-GLasso establishes a new state-of-the-art in generative classification and dense semantic segmentation tasks.
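For readers unfamiliar with the base estimator, a plain single-modality Graphical Lasso run looks like the following (using scikit-learn; CM-GLasso's cross-modal extension is not public code, and the data here is synthetic). The zeros of the estimated precision matrix encode conditional independencies between features:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # 200 samples of 8 "semantic node" features
model = GraphicalLasso(alpha=0.3).fit(X)
precision = model.precision_            # (8, 8) sparse inverse covariance
# A zero at (i, j) means features i and j are estimated to be conditionally
# independent given all the others; larger alpha drives the estimate sparser.
```

CM-GLasso, per the abstract, replaces the raw inputs with aligned cross-modal node features and jointly splits the precision structure into shared and class-specific parts within an ADMM loop.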
[249] VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
Ravi Ranjan, Agoritsa Polyzou
Main category: cs.CV
TL;DR: VLA-Forget is a hybrid unlearning framework for Vision-Language-Action models that enables targeted removal of unsafe behaviors while preserving perception, language grounding, and action control capabilities.
Details
Motivation: VLA models for robotic manipulation face unlearning challenges where undesirable behaviors are distributed across perception, alignment, and reasoning layers, making conventional unlearning approaches insufficient and causing utility loss in embodied settings.
Method: Proposes VLA-Forget combining ratio-aware selective editing for perception/cross-modal specificity with layer-selective reasoning/action unlearning, jointly optimizing three objectives through staged updates over visual encoder, projector, and upper transformer blocks.
Result: Improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning/task success by 9%, and reduces post-quantization recovery by 55% compared to strong baselines.
Conclusion: VLA-Forget effectively addresses the distributed nature of undesirable knowledge in VLA models, enabling targeted unlearning while maintaining essential capabilities for robotic manipulation.
Abstract: Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, and action control. In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions, so undesirable knowledge can be distributed across perception, alignment, and reasoning/action layers rather than confined to a single module. Consequently, partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, while conventional unlearning baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings. We propose VLA-Forget, a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. VLA-Forget jointly optimizes three objectives: targeted forgetting, perceptual preservation, and reasoning retention, through staged updates over the visual encoder, projector, and upper action-generating transformer blocks. Across forget-set behavior probes and retain-task evaluations, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55% relative to strong unlearning baselines.
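Stripped of the VLA specifics, the "layer-selective" idea is a gradient-ascent update restricted to chosen blocks. The sketch below is a generic illustration under assumed shapes and a toy MSE forget objective; the staging and multi-objective loss are the paper's, everything concrete here is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unlearn_step(layers, x, y, target_idx, lr=0.1):
    """One layer-selective unlearning update: only the selected layer is
    updated, and it is pushed *up* the loss on the forget batch."""
    for i, layer in enumerate(layers):
        layer.requires_grad_(i == target_idx)   # freeze all but the target layer
    out = x
    for layer in layers:
        out = layer(out)
    loss = F.mse_loss(out, y)                   # toy stand-in for a forget objective
    loss.backward()
    with torch.no_grad():
        for p in layers[target_idx].parameters():
            p += lr * p.grad                    # ascend: increase forget-set loss
            p.grad = None
    return loss.item()

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])
frozen_before = layers[0].weight.clone()
target_before = layers[2].weight.clone()
unlearn_step(layers, torch.randn(8, 4), torch.randn(8, 4), target_idx=2)
```

In the paper's setting the retained objectives (perceptual preservation, reasoning retention) would be minimized jointly rather than ignored as they are in this toy step.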
[250] Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Xueyang Kang, Zizhao Li, Tian Lan, Dong Gong, Kourosh Khoshelham, Liangliang Nan
Main category: cs.CV
TL;DR: A hierarchical point-patch anomaly scoring network for 3D shape anomaly detection that jointly models regional part features and local point features to address limitations of existing methods in handling diverse anomaly types and scales.
Details
Motivation: Existing deep learning approaches for 3D shape anomaly detection often fail to generalize across diverse anomaly types and scales (global geometric errors like planar shifts, angle misalignments) and are sensitive to noisy or incomplete local points during training.
Method: Proposes a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features. Includes an adaptive patchification module with self-supervised decomposition to capture complex structural deviations.
Result: Superior AUC-ROC and AUC-PR performance on public benchmarks (Anomaly-ShapeNet and Real3D-AD) and a new industrial test set. Achieves over 40% point-level improvement on new industrial anomaly types and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet.
Conclusion: The proposed method demonstrates strong robustness and generalization for 3D shape anomaly detection across diverse anomaly types and scales, outperforming existing approaches on both public and industrial datasets.
Abstract: 3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature detection or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, angle misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 40% point-level improvement on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.
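The hierarchical point-patch scoring idea, in its simplest possible form, is a fusion of per-point and per-patch anomaly evidence. This is a generic fusion sketch, not the paper's learned network; the mixing weight and assignments are illustrative:

```python
import torch

def fuse_scores(point_err, patch_err, patch_of_point, alpha=0.5):
    """Each point's anomaly score mixes its own reconstruction error with the
    error of the patch containing it, so both local defects and region-level
    geometric errors (e.g. a shifted plane) can raise the score."""
    return alpha * point_err + (1 - alpha) * patch_err[patch_of_point]

torch.manual_seed(0)
point_err = torch.rand(100)                         # per-point evidence
patch_err = torch.tensor([0.1, 0.9])                # per-patch evidence
patch_of_point = (torch.arange(100) >= 50).long()   # points 50+ belong to patch 1
scores = fuse_scores(point_err, patch_err, patch_of_point)
```

Points inside the anomalous patch end up with systematically higher scores even when their individual point errors are unremarkable, which is the behavior the hierarchical design is after.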
[251] Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics
Minglei Chen, Weilong Wang, Jiang Duan, Ye Deng
Main category: cs.CV
TL;DR: GAPL introduces second-order statistical features (Gram matrices) to enhance prompt learning in VLMs, improving robustness to domain shifts by combining local semantic alignment with global structural consistency.
Details
Motivation: Existing prompt learning methods for VLMs focus on aligning text prompts with first-order visual features (spatial feature maps), which are insufficient for robust adaptation as they're highly susceptible to domain shifts and local noise due to their spatially entangled nature.
Method: Proposes Gram-Anchored Prompt Learning (GAPL) that introduces an additional second-order statistical stream via Gram matrices to augment standard first-order spatial interaction. This anchors prompts to second-order priors, enabling language representations to dynamically adapt to statistical distribution shifts across domains.
Result: Extensive experiments show the effectiveness of second-order features and compelling performances of GAPL on various benchmarks, demonstrating improved robustness to domain shifts.
Conclusion: Second-order statistical features via Gram matrices significantly enhance prompt learning in VLMs by providing global structural consistency alongside local semantic alignment, making models more robust to domain variations.
Abstract: Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL), a framework that synergizes local semantic alignment with global structural consistency via second-order statistics. Methodologically, we introduce an additional second-order statistical stream via Gram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments demonstrate the effectiveness of the second-order features and show compelling performance of GAPL on various benchmarks.
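The second-order statistic at the heart of GAPL is a channel-wise Gram matrix of the spatial feature map. A minimal sketch (shapes illustrative), showing the spatial-permutation invariance that makes it a global, location-free descriptor:

```python
import torch

def gram(feats):
    """Channel-by-channel Gram matrix of a (B, C, H, W) feature map: a
    second-order statistic that records which channels co-activate,
    regardless of where in the image they do so."""
    B, C, H, W = feats.shape
    f = feats.flatten(2)                     # (B, C, H*W)
    return f @ f.transpose(1, 2) / (H * W)   # (B, C, C)

torch.manual_seed(0)
x = torch.randn(2, 16, 7, 7)
G = gram(x)
# Shuffling spatial positions leaves the Gram matrix unchanged:
perm = torch.randperm(49)
x_shuffled = x.flatten(2)[:, :, perm].reshape(2, 16, 7, 7)
```

That invariance is exactly why a Gram-anchored prompt is less sensitive to local noise and spatial rearrangement than one tied to first-order feature maps.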
[252] High-Fidelity Mural Restoration via a Unified Hybrid Mask-Aware Transformer
Jincheng Jiang, Qianhao Han, Chi Zhang, Zheng Zheng
Main category: cs.CV
TL;DR: HMAT is a transformer-based framework for high-fidelity mural restoration that combines local texture modeling with long-range structural inference using mask-aware dynamic filtering and conditional style fusion.
Details
Motivation: Ancient murals suffer severe degradation from environmental exposure, material aging, and human activity. Restoration is challenging because it requires reconstructing large missing structures while strictly preserving authentic, undamaged regions.
Method: Hybrid Mask-Aware Transformer (HMAT) integrates Mask-Aware Dynamic Filtering for local texture modeling with a Transformer bottleneck for long-range structural inference. Includes mask-conditional style fusion module and Teacher-Forcing Decoder with hard-gated skip connections.
Result: Achieves competitive performance on DHMural and Nine-Colored Deer datasets under varying degradation levels, producing more structurally coherent and visually faithful restorations compared to state-of-the-art approaches.
Conclusion: HMAT provides an effective solution for digital restoration of cultural heritage murals, balancing structural reconstruction with preservation of authentic regions.
Abstract: Ancient murals are valuable cultural artifacts, but many have suffered severe degradation due to environmental exposure, material aging, and human activity. Restoring these artworks is challenging because it requires both reconstructing large missing structures and strictly preserving authentic, undamaged regions. This paper presents the Hybrid Mask-Aware Transformer (HMAT), a unified framework for high-fidelity mural restoration. HMAT integrates Mask-Aware Dynamic Filtering for robust local texture modeling with a Transformer bottleneck for long-range structural inference. To further address the diverse morphology of degradation, we introduce a mask-conditional style fusion module that dynamically guides the generative process. In addition, a Teacher-Forcing Decoder with hard-gated skip connections is designed to enforce fidelity in valid regions and focus reconstruction on missing areas. We evaluate HMAT on the DHMural dataset and a curated Nine-Colored Deer dataset under varying degradation levels. Experimental results demonstrate that the proposed method achieves competitive performance compared to state-of-the-art approaches, while producing more structurally coherent and visually faithful restorations. These findings suggest that HMAT provides an effective solution for the digital restoration of cultural heritage murals.
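The hard-gated skip connection described above reduces to a mask-controlled blend (a minimal sketch, not the authors' decoder): intact pixels are copied through verbatim, so reconstruction capacity is spent only on missing regions.

```python
import torch

def hard_gated_skip(decoded, observed, valid_mask):
    """valid_mask is 1 where the mural is intact, 0 where it is damaged:
    intact pixels pass straight through from the input; damaged pixels
    come from the decoder's reconstruction."""
    return valid_mask * observed + (1 - valid_mask) * decoded

torch.manual_seed(0)
observed = torch.rand(1, 3, 8, 8)                  # degraded mural image
decoded = torch.rand(1, 3, 8, 8)                   # decoder output
mask = (torch.rand(1, 1, 8, 8) > 0.5).float()      # 1 = authentic region
out = hard_gated_skip(decoded, observed, mask)
```

Because the gate is hard (exactly 0 or 1), fidelity in authentic regions is enforced by construction rather than learned, which matches the preservation requirement the abstract emphasizes.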
[253] OASIC: Occlusion-Agnostic and Severity-Informed Classification
Kay Gijzen, Gertjan J. Burghouts, Daniël M. Pelt
Main category: cs.CV
TL;DR: OASIC: A severity-informed classification model that handles object occlusions by masking occluders at test-time and selecting models optimized for specific occlusion degrees.
Details
Motivation: Severe object occlusions challenge computer vision due to loss of visible information and distracting occluder patterns. Existing methods don't adequately address both issues simultaneously.
Method: Two-stage approach: (1) Remove distracting patterns at test-time via occlusion-agnostic masking of visual anomalies, (2) Train models for specific occlusion degrees using random masking during training, then select optimal model based on estimated test-time occlusion severity.
Result: OASIC improves AUC_occ by +18.5 over standard training on occluded images and +23.7 over finetuning on unoccluded images. Severity estimation enables optimal model selection for specific occlusion degrees.
Conclusion: Combining gray masking with adaptive model selection based on estimated occlusion severity significantly improves classification performance on occluded objects, outperforming single-model approaches.
Abstract: Severe occlusions of objects pose a major challenge for computer vision. We show that two root causes are (1) the loss of visible information and (2) the distracting patterns caused by the occluders. Our approach addresses both causes at the same time. First, the distracting patterns are removed at test-time, via masking of the occluding patterns. This masking is independent of the type of occlusion, by handling the occlusion through the lens of visual anomalies w.r.t. the object of interest. Second, to deal with the loss of visual detail, we follow standard practice by masking random parts of the object during training, for various degrees of occlusions. We discover that (a) it is possible to estimate the degree of the occlusion (i.e. severity) at test-time, and (b) that a model optimized for a specific degree of occlusion also performs best on a similar degree during test-time. Combining these two insights brings us to a severity-informed classification model called OASIC: Occlusion Agnostic Severity Informed Classification. We estimate the severity of occlusion for a test image, mask the occluder, and select the model that is optimized for the degree of occlusion. This strategy performs better than any single model optimized for any smaller or broader range of occlusion severities. Experiments show that combining gray masking with adaptive model selection improves AUC_occ by +18.5 over standard training on occluded images and +23.7 over finetuning on unoccluded images.
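Insight (b) above, that a model trained for a given occlusion degree works best at a similar degree, reduces test-time model choice to a lookup. A minimal sketch with hypothetical severity bins and placeholder classifiers:

```python
def select_model(severity, models):
    """Pick the classifier trained for the severity bin that contains the
    estimated test-time occlusion degree (bin edges here are illustrative)."""
    for (low, high), model in models.items():
        if low <= severity < high:
            return model
    raise ValueError(f"no model covers severity {severity:.2f}")

# Hypothetical bins: each maps a severity range to a specialist classifier.
models = {(0.0, 0.3): "clf_light", (0.3, 0.6): "clf_medium", (0.6, 1.01): "clf_heavy"}
```

In OASIC the severity estimate itself comes from the occlusion-agnostic anomaly masking step; here it would simply be passed in as a float in [0, 1].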
[254] HOIGS: Human-Object Interaction Gaussian Splatting
Taewoo Kim, Suwoong Yeom, Jaehyun Pyun, Geonho Cha, Dongyoon Wee, Joonsik Nam, Yun-Seong Jeong, Kyeongbo Kong, Suk-Ju Kang
Main category: cs.CV
TL;DR: HOIGS: A novel Gaussian Splatting method that explicitly models human-object interactions through cross-attention to improve dynamic scene reconstruction with complex interactions.
Details
Motivation: Existing Gaussian Splatting methods either rely on human pose priors while ignoring dynamic objects, or approximate all motions in a single field, limiting their ability to capture complex interaction-rich dynamics in human-object interaction scenarios.
Method: Proposes Human-Object Interaction Gaussian Splatting (HOIGS) with a cross-attention-based HOI module to explicitly model interaction-induced deformation between humans and objects. Uses distinct deformation baselines: HexPlane for humans and Cubic Hermite Spline (CHS) for objects, then integrates these heterogeneous features to capture interdependent motions.
Result: Comprehensive experiments on multiple datasets show HOIGS consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, particularly improving deformation estimation in scenarios involving occlusion, contact, and object manipulation.
Conclusion: Explicitly modeling human-object interactions is crucial for high-fidelity reconstruction of dynamic scenes with complex interactions, and HOIGS demonstrates superior performance by capturing interdependent motions through specialized deformation modeling.
Abstract: Reconstructing dynamic scenes with complex human-object interactions is a fundamental challenge in computer vision and graphics. Existing Gaussian Splatting methods either rely on human pose priors while neglecting dynamic objects, or approximate all motions within a single field, limiting their ability to capture interaction-rich dynamics. To address this gap, we propose Human-Object Interaction Gaussian Splatting (HOIGS), which explicitly models interaction-induced deformation between humans and objects through a cross-attention-based HOI module. Distinct deformation baselines are employed to extract features: HexPlane for humans and Cubic Hermite Spline (CHS) for objects. By integrating these heterogeneous features, HOIGS effectively captures interdependent motions and improves deformation estimation in scenarios involving occlusion, contact, and object manipulation. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, highlighting the importance of explicitly modeling human-object interactions for high-fidelity reconstruction.
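The cross-attention at the core of the HOI module can be sketched with a stock attention layer; the dimensions, the single fusion direction, and the feature sources below are illustrative stand-ins for the paper's HexPlane and spline features:

```python
import torch
import torch.nn as nn

d = 32
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

torch.manual_seed(0)
human_feats = torch.randn(1, 10, d)    # stand-in for HexPlane human features
object_feats = torch.randn(1, 6, d)    # stand-in for spline object features

# Object tokens attend to human tokens, so the object's deformation can
# depend on what the human is doing (the reverse direction is symmetric).
fused, weights = attn(query=object_feats, key=human_feats, value=human_feats)
```

The attention weights make the interaction explicit: each object token carries a distribution over human tokens, which is what lets the model couple the two motion fields instead of treating them independently.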
[255] 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
Haoyu Li, Tingyan Wen, Lin Qi, Zhe Wu, Yihuang Chen, Xing Zhou, Lifei Zhu, Xueqian Wang, Kai Zhang
Main category: cs.CV
TL;DR: 1.x-Distill: A fractional-step distillation framework for diffusion models that achieves high-quality image generation in 1.x steps, breaking the integer-step constraint of prior methods.
Details
Motivation: Diffusion models produce high-quality images but are computationally expensive due to iterative denoising. Existing distillation methods like DMD suffer from diversity collapse and fidelity degradation when reduced to very few steps (2 or fewer). There's a need for a practical distillation framework that can work with fractional steps while maintaining quality and diversity.
Method: 1) Analyze teacher CFG’s role in DMD and modify it to suppress mode collapse; 2) Introduce Stagewise Focused Distillation - a two-stage strategy: first learns coarse structure through diversity-preserving distribution matching, then refines details with inference-consistent adversarial distillation; 3) Design lightweight compensation module for Distill-Cache co-Training to incorporate block-level caching into the distillation pipeline.
Result: Achieves better quality and diversity at 1.67 and 1.74 effective NFEs on SD3-Medium and SD3.5-Large, surpassing prior few-step methods with up to 33x speedup over original 28x2 NFE sampling.
Conclusion: 1.x-Distill establishes 1.x-step generation as a practical regime for distilled diffusion models, breaking the integer-step constraint and achieving state-of-the-art performance in extreme low-step scenarios.
Abstract: Diffusion models produce high-quality text-to-image results, but their iterative denoising is computationally expensive. Distribution Matching Distillation (DMD) emerges as a promising path to few-step distillation, but suffers from diversity collapse and fidelity degradation when reduced to two steps or fewer. We present 1.x-Distill, the first fractional-step distillation framework that breaks the integer-step constraint of prior few-step methods and establishes 1.x-step generation as a practical regime for distilled diffusion models. Specifically, we first analyze the overlooked role of teacher CFG in DMD and introduce a simple yet effective modification to suppress mode collapse. Then, to improve performance under extreme steps, we introduce Stagewise Focused Distillation, a two-stage strategy that learns coarse structure through diversity-preserving distribution matching and refines details with inference-consistent adversarial distillation. Furthermore, we design a lightweight compensation module for Distill-Cache co-Training, which naturally incorporates block-level caching into our distillation pipeline. Experiments on SD3-Medium and SD3.5-Large show that 1.x-Distill surpasses prior few-step methods, achieving better quality and diversity at 1.67 and 1.74 effective NFEs, respectively, with up to 33x speedup over original 28x2 NFE sampling.
[256] ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
Hang Wang, Chao Shen, Lei Zhang, Zhi-Qi Cheng
Main category: cs.CV
TL;DR: ATSS detects AI-generated videos by identifying anomalous temporal self-similarity patterns that differ from natural video dynamics, using a multimodal framework with triple-similarity representation and cross-attentive fusion.
Details
Motivation: AI-generated videos have become highly realistic, threatening digital forensics. Current detectors focus on localized artifacts or short-term inconsistencies, failing to capture the underlying generative logic governing global temporal evolution, which limits detection performance.
Method: Proposes ATSS method that identifies anomalous temporal self-similarity (ATSS) fingerprint in AIGVs. Uses triple-similarity representation (visual, textual, cross-modal similarity matrices) constructed from frame-wise descriptions. Employs dedicated Transformer encoders and bidirectional cross-attentive fusion module to model intra- and inter-modal dynamics.
Result: Extensive experiments on four large-scale benchmarks (GenVideo, EvalCrafter, VideoPhy, VidProM) show ATSS significantly outperforms state-of-the-art methods in AP, AUC, and ACC metrics, with superior generalization across diverse video generation models.
Conclusion: ATSS effectively detects AI-generated videos by capturing their anomalous temporal self-similarity patterns, offering a robust multimodal detection framework that addresses limitations of existing approaches focused on localized artifacts.
Abstract: AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, and thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose the ATSS method, a multimodal detection framework that exploits this insight via a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.
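The basic quantity behind the triple-similarity representation is a frame-by-frame self-similarity matrix; the visual, textual, and cross-modal variants differ only in which embeddings feed it. A generic sketch with random stand-in frame features:

```python
import torch
import torch.nn.functional as F

def self_similarity(frame_feats):
    """Temporal self-similarity matrix of per-frame embeddings: entry (i, j)
    is the cosine similarity between frames i and j. Unnaturally repetitive
    off-diagonal structure is the kind of anomaly ATSS looks for."""
    f = F.normalize(frame_feats, dim=-1)   # (T, d) unit-norm rows
    return f @ f.T                         # (T, T)

torch.manual_seed(0)
S = self_similarity(torch.randn(12, 64))   # 12 frames of 64-dim features
```

In the full method three such matrices (visual, textual, cross-modal) are stacked and fed to Transformer encoders rather than inspected directly.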
[257] TORA: Topological Representation Alignment for 3D Shape Assembly
Nahyuk Lee, Zhiang Chen, Marc Pollefeys, Sunghwan Hong
Main category: cs.CV
TL;DR: TORA improves 3D shape assembly by aligning relational structure from pretrained 3D encoders into flow-matching models, achieving faster convergence and better accuracy with zero inference overhead.
Details
Motivation: Current flow-matching methods for 3D shape assembly lack explicit guidance about cross-part interactions that should drive motion, limiting their effectiveness in understanding topological relationships between parts.
Method: TORA uses a topology-first representation alignment framework that distills relational structure from frozen pretrained 3D encoders into flow-matching backbones. It employs token-wise cosine matching and Centered Kernel Alignment (CKA) loss to match similarity structures between student and teacher representations.
Result: TORA achieves up to 6.9× faster convergence, improved in-distribution accuracy, and greater robustness under domain shift. It shows state-of-the-art performance on five benchmarks and particularly strong zero-shot transfer to unseen real-world and synthetic datasets.
Conclusion: Geometry- and contact-centric properties from teacher representations, not semantic classification ability, govern alignment effectiveness. Alignment is most beneficial at later transformer layers where spatial structure emerges, providing significant improvements without inference overhead.
Abstract: Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via a simple instantiation, token-wise cosine matching, which injects the learned geometric descriptors from the teacher representation. We then extend it with a Centered Kernel Alignment (CKA) loss that matches the similarity structure between student and teacher representations for enhanced topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9×) and improved accuracy in-distribution, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.
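The CKA objective the abstract mentions is the standard linear-CKA similarity between two token matrices; the sketch below is the textbook formula, not the authors' training code. It is invariant to rotations of either feature space, which is why it compares similarity *structure* rather than raw features:

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between (tokens x dims) matrices:
    returns 1.0 when the two representations have identical similarity
    structure, up to rotation and isotropic scaling."""
    X = X - X.mean(0, keepdim=True)    # center features
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2       # ||Y^T X||_F^2
    return hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())

torch.manual_seed(0)
X = torch.randn(50, 16)                        # 50 tokens, 16 dims
Q, _ = torch.linalg.qr(torch.randn(16, 16))    # a random rotation
```

As a loss, 1 - linear_cka(student_tokens, teacher_tokens) pulls the student's token-similarity pattern toward the teacher's without forcing the two feature spaces themselves to coincide.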
[258] DINO-VO: Learning Where to Focus for Enhanced State Estimation
Qi Chen, Guanghao Li, Sijia Hu, Xin Gao, Junpeng Ma, Xiangyang Xue, Jian Pu
Main category: cs.CV
TL;DR: DINO-VO is an end-to-end monocular visual odometry system that uses a differentiable adaptive patch selector and multi-task feature extraction with differentiable bundle adjustment to achieve strong scene generalization across synthetic, indoor, and outdoor environments.
Details
Motivation: Current visual odometry systems rely on heuristic feature extraction strategies that degrade accuracy and robustness, especially in large-scale outdoor environments. There's a need for systems with better generalization across diverse scenes and datasets.
Method: Incorporates a differentiable adaptive patch selector into an end-to-end pipeline to improve patch quality. Uses multi-task feature extraction with a differentiable bundle adjustment module that leverages inverse depth priors to effectively learn and utilize appearance and geometric information.
Result: Extensive experiments on TartanAir, KITTI, Euroc, and TUM datasets demonstrate strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.
Conclusion: DINO-VO successfully bridges the gap between feature learning and state estimation, creating a visual odometry system with superior generalization capabilities across diverse environments.
Abstract: We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, Euroc, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.
[259] 4C4D: 4 Camera 4D Gaussian Splatting
Junsheng Zhou, Zhifan Yang, Liang Han, Wenyuan Zhang, Kanle Shi, Shenkun Xu, Yu-Shen Liu
Main category: cs.CV
TL;DR: 4C4D enables high-fidelity 4D dynamic scene reconstruction from extremely sparse camera views (as few as 4) using a novel neural decaying function on Gaussian opacities to enhance geometric modeling.
Details
Motivation: Previous methods for 4D dynamic scene reconstruction require dense multi-view captures with dozens or hundreds of cameras, which is impractical for portable setups. There's a need for methods that work with extremely sparse camera views while maintaining high-fidelity novel-view rendering.
Method: Proposes 4C4D framework using 4D Gaussian Splatting with a Neural Decaying Function on Gaussian opacities. This function enhances geometric modeling capability by encouraging 4DGS gradients to focus more on geometric learning, addressing the imbalance between geometry and appearance modeling in sparse-view settings.
Result: Extensive experiments across sparse-view datasets with varying camera overlaps show superior performance over prior art, achieving high-fidelity 4D reconstruction from as few as four portable cameras.
Conclusion: 4C4D successfully enables high-quality 4D dynamic scene reconstruction from extremely sparse camera views by addressing the geometric learning challenges through a novel neural decaying function design.
Abstract: This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight is that the geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art. Project page at: https://junshengzhou.github.io/4C4D.
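The opacity-decay idea can be sketched as a simple schedule applied during optimization (the exponential form and its arguments are assumptions for illustration; the paper's Neural Decaying Function is learned):

```python
import math

def decayed_opacity(opacity, decay_rate, step):
    """Scale a Gaussian's opacity by an exponential decay so that appearance
    contributes less to the rendering loss and gradients concentrate on
    geometry (positions/covariances). Purely illustrative, not the paper's
    learned function."""
    return opacity * math.exp(-decay_rate * step)
```

With a positive decay rate the effective opacity shrinks monotonically, shifting the gradient budget toward geometric parameters.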
[260] Detecting Media Clones in Cultural Repositories Using a Positive Unlabeled Learning Approach
V. Sevetlidis, V. Arampatzakis, M. Karta, I. Mourthos, D. Tsiafaki, G. Pavlidis
Main category: cs.CV
TL;DR: A PU learning approach for curator-in-the-loop duplicate discovery in cultural heritage repositories using lightweight per-query encoders and interpretable thresholds.
Details
Motivation: To address the challenge of discovering duplicates in cultural heritage repositories like AtticPOT where only single anchors per artefact are available, avoiding the need for explicit negative examples and supporting curator-in-the-loop workflows.
Method: Formulates duplicate discovery as a Positive-Unlabeled learning problem, trains lightweight per-query Clone Encoders on augmented views of single anchors, scores the unlabeled repository with an interpretable threshold on the latent l_2 norm, and proposes candidates for curator verification.
Result: Achieves F1=96.37 (AUROC=97.97) on CIFAR-10 and F1=90.79 (AUROC=98.99) on AtticPOT, improving F1 by +7.70 points over best baseline (SVDD) with same lightweight backbone. Qualitative results show stable neighborhoods across viewpoint and condition.
Conclusion: The method effectively discovers cross-record duplicates without explicit negatives, offers transparent operating points, and fits de-duplication, record linkage, and curator-in-the-loop workflows for cultural heritage repositories.
Abstract: We formulate curator-in-the-loop duplicate discovery in the AtticPOT repository as a Positive-Unlabeled (PU) learning problem. Given a single anchor per artefact, we train a lightweight per-query Clone Encoder on augmented views of the anchor and score the unlabeled repository with an interpretable threshold on the latent l_2 norm. The system proposes candidates for curator verification, uncovering cross-record duplicates that were not verified a priori. On CIFAR-10 we obtain F1=96.37 (AUROC=97.97); on AtticPOT we reach F1=90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline (SVDD) under the same lightweight backbone. Qualitative “find-similar” panels show stable neighbourhoods across viewpoint and condition. The method avoids explicit negatives, offers a transparent operating point, and fits de-duplication, record linkage, and curator-in-the-loop workflows.
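The per-query scoring step can be sketched as follows; flagging items whose latent l_2 norm falls below a threshold is an illustrative reading of the paper's interpretable operating point:

```python
import math

def clone_candidates(latents, threshold):
    """Score each unlabeled item by the l2 norm of its latent vector and
    propose those below the threshold as duplicate candidates for curator
    verification. The direction of the threshold is an assumption here."""
    flagged = []
    for idx, z in enumerate(latents):
        norm = math.sqrt(sum(v * v for v in z))
        if norm <= threshold:
            flagged.append((idx, norm))
    return flagged
```

Because the score is a single norm compared to one threshold, curators can reason directly about the operating point.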
[261] Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection
Shkelqim Sherifi
Main category: cs.CV
TL;DR: An offline real-time traffic monitoring system using YOLOv11 with BoT-SORT/ByteTrack for vehicle detection and tracking, achieving high accuracy in vehicle counting and detection across diverse scenes.
Details
Motivation: To develop an efficient, cloud-independent traffic monitoring system that leverages computer vision and deep learning for smart city applications, addressing the need for real-time vehicle detection and counting without cloud dependencies.
Method: Combines pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV with a Qt-based desktop UI for offline real-time processing of video streams.
Result: Achieved 66.67-95.83% counting accuracy across diverse scenes, with high precision (cars: 0.97-1.00; trucks: 1.00) and recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks.
Conclusion: The system demonstrates robust performance in typical conditions and contributes to smart city development by showing the capacity of lightweight, cloud-independent AI-driven traffic monitoring systems.
Abstract: Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.
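A minimal version of counting by track ID, assuming the tracker already supplies stable IDs per vehicle (the virtual counting line and crossing direction are illustrative choices, not details from the paper):

```python
def count_line_crossings(tracks, line_y):
    """Count each track at most once when its centroid crosses a virtual
    horizontal line. `tracks` maps track_id -> list of (x, y) centroids per
    frame; a tracker such as BoT-SORT/ByteTrack would supply the ids."""
    counted = set()
    for tid, path in tracks.items():
        for (x0, y0), (x1, y1) in zip(path, path[1:]):
            # downward crossing of the line between consecutive frames
            if y0 < line_y <= y1 and tid not in counted:
                counted.add(tid)
    return len(counted)
```

Keying on the track ID rather than per-frame detections prevents the same vehicle from being counted in every frame it appears.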
[262] LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
Dat Nguyen, Enjie Ghorbel, Anis Kacem, Marcella Astrid, Djamila Aouada
Main category: cs.CV
TL;DR: LAA-X is a novel deepfake detection framework using explicit attention on artifact-prone regions through multi-task learning and blending-based data synthesis, achieving state-of-the-art performance across benchmarks.
Details
Motivation: Existing deepfake detection methods often fail to generalize beyond known manipulations due to reliance on binary classifiers with implicit attention mechanisms. There's a need for more robust detection that can handle high-quality forgeries and generalize to unseen manipulations.
Method: LAA-X introduces explicit attention strategy using multi-task learning with blending-based data synthesis. Auxiliary tasks guide the model to focus on localized artifact-prone regions. The framework is compatible with both CNN (LAA-Net) and transformer (LAA-Former) backbones.
Result: Despite being trained only on real and pseudo-fake samples, LAA-X competes with state-of-the-art methods across multiple benchmarks, demonstrating robustness to high-quality forgeries and generalization to unseen manipulations.
Conclusion: LAA-X provides an effective deepfake detection framework that addresses generalization limitations of existing methods through explicit attention on artifact-prone regions, with both CNN and transformer implementations available.
Abstract: In this paper, we propose Localized Artifact Attention X (LAA-X), a novel deepfake detection framework that is both robust to high-quality forgeries and capable of generalizing to unseen manipulations. Existing approaches typically rely on binary classifiers coupled with implicit attention mechanisms, which often fail to generalize beyond known manipulations. In contrast, LAA-X introduces an explicit attention strategy based on a multi-task learning framework combined with blending-based data synthesis. Auxiliary tasks are designed to guide the model toward localized, artifact-prone (i.e., vulnerable) regions. The proposed framework is compatible with both CNN and transformer backbones, resulting in two different versions, namely, LAA-Net and LAA-Former, respectively. Despite being trained only on real and pseudo-fake samples, LAA-X competes with state-of-the-art methods across multiple benchmarks. Code and pre-trained weights for LAA-Net (https://github.com/10Ring/LAA-Net) and LAA-Former (https://github.com/10Ring/LAA-Former) are publicly available.
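Blending-based pseudo-fake synthesis can be sketched as an alpha blend whose soft mask doubles as a localization label for the auxiliary tasks (a generic sketch, not the paper's exact augmentation pipeline):

```python
import numpy as np

def blend_pseudo_fake(target, donor, mask):
    """Create a pseudo-fake by alpha-blending a donor image region into a
    real target image (both HxWx3). The soft HxW mask also serves as the
    pixel-level label telling the model where blending artifacts lie."""
    blended = mask[..., None] * donor + (1.0 - mask[..., None]) * target
    return blended, mask
```

Training only on real images and such pseudo-fakes is what lets the detector avoid overfitting to any specific manipulation method.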
[263] A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming
Riasad Alvi, Mohaimenul Azam Khan Raiaan, Sadia Sultana Chowa, Arefin Ittesafun Abian, Reem E Mohamed, Md Rafiqul Islam, Yakub Sebastian, Sheikh Izzal Azid, Sami Azam
Main category: cs.CV
TL;DR: Physics-informed digital twin with uncertainty-aware stacked ensemble for multimodal forecasting of dairy cattle core body temperature to predict heat stress.
Details
Motivation: Precision livestock farming requires accurate heat stress prediction for animal welfare and farm management optimization, needing robust multimodal forecasting of core body temperature.
Method: Physics-informed digital twin integrates ODE-based thermoregulation model, Gaussian process for cow-specific deviations, Kalman filter for sensor alignment, and behavioral Markov chain. Three-stage stacked ensemble uses modality-specific LightGBM experts, meta-feature collection, and Optuna-tuned meta-model with uncertainty quantification via bootstrapping.
Result: Achieves cross-validated R² of 0.783, F1 score of 84.25%, and Prediction Interval Coverage Probability of 92.38% for 2-hour ahead forecasting, with ablation confirming DT-derived features and multimodal fusion enhance performance.
Conclusion: Framework provides robust, uncertainty-aware, physically principled system for early heat stress detection and precision livestock management through multimodal forecasting.
Abstract: Precision livestock farming requires accurate and timely heat stress prediction to ensure animal welfare and optimize farm management. This study presents a physics-informed digital twin (DT) framework combined with an uncertainty-aware, expert-weighted stacked ensemble for multimodal forecasting of Core Body Temperature (CBT) in dairy cattle. Using the high-frequency, heterogeneous MmCows dataset, the DT integrates an ordinary differential equation (ODE)-based thermoregulation model that simulates metabolic heat production and dissipation, a Gaussian process for capturing cow-specific deviations, a Kalman filter for aligning predictions with real-time sensor data, and a behavioral Markov chain that models activity-state transitions under varying environmental conditions. The DT outputs, key physiological indicators such as predicted CBT, heat stress probability, and behavioral state distributions, are fused with raw sensor data and enriched through multi-scale temporal analysis and cross-modal feature engineering to form a comprehensive feature set. The predictive methodology is designed as a three-stage stacked ensemble, where stage 1 trains modality-specific LightGBM 'expert' models on distinct feature groups, stage 2 collects their predictions as meta-features, and at stage 3 an Optuna-tuned LightGBM meta-model yields the final CBT forecast. Predictive uncertainty is quantified via bootstrapping and validated using Prediction Interval Coverage Probability (PICP). Ablation analysis confirms that incorporating DT-derived features and multimodal fusion substantially enhances performance. The proposed framework achieves a cross-validated R² of 0.783, F1 score of 84.25% and PICP of 92.38% for 2-hour ahead forecasting, providing a robust, uncertainty-aware, and physically principled system for early heat stress detection and precision livestock management.
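The ODE-plus-Kalman core of the digital twin can be sketched with a scalar model (all coefficients and the one-dimensional state are illustrative simplifications of the paper's thermoregulation model):

```python
def ode_step(cbt, ambient, dt, k_gain=0.05, k_loss=0.02):
    """One Euler step of a toy thermoregulation ODE: constant metabolic heat
    gain balanced by dissipation toward ambient temperature. The coefficients
    are illustrative, not the paper's."""
    return cbt + dt * (k_gain - k_loss * (cbt - ambient))

def kalman_update(pred, pred_var, meas, meas_var):
    """Scalar Kalman correction: blend the model prediction with a sensor
    reading in proportion to their variances, returning the corrected
    estimate and its reduced variance."""
    gain = pred_var / (pred_var + meas_var)
    return pred + gain * (meas - pred), (1.0 - gain) * pred_var
```

When model and sensor are equally uncertain, the update lands exactly halfway between the prediction and the measurement.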
[264] Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation
Peixin Chen, Guoxi Zhang, Jianwei Ma, Qing Li
Main category: cs.CV
TL;DR: HGR is a graph-based navigation framework that uses revisable hypothesis nodes for frontier semantics prediction and cascade correction to retract errors, improving long-horizon memory reliability in embodied agents.
Details
Motivation: Existing graph-based navigation systems treat unexplored regions as semantically unknown, leading to inefficient frontier search. While VLMs can predict frontier semantics, erroneous predictions embedded in memory cause structural error accumulation that confidence attenuation alone cannot resolve.
Method: HGR introduces: (1) semantic hypothesis module that estimates context-conditioned semantic distributions over frontiers and ranks exploration targets, and (2) verification-driven cascade correction that compares on-site observations against predicted semantics and retracts refuted nodes with all downstream dependents.
Result: HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, with consistent improvements on A-EQA and EM-EQA QA benchmarks. Cascade correction eliminates ~20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x.
Conclusion: HGR enables reliable long-horizon memory for embodied agents by representing frontier predictions as revisable hypothesis nodes with systematic error retraction, outperforming existing methods in multimodal lifelong navigation and embodied QA tasks.
Abstract: Embodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) verification-driven cascade correction, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA). HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.
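The cascade-correction step, retracting a refuted hypothesis node together with its downstream dependents, can be sketched over a plain adjacency map (a minimal sketch; HGR's graph carries richer node state):

```python
def retract(graph, node):
    """Remove a refuted hypothesis node and, transitively, every downstream
    dependent, so no inference built on the error survives.
    `graph` maps node -> list of nodes that depend on it."""
    doomed, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n in doomed:
            continue
        doomed.add(n)
        stack.extend(graph.get(n, []))  # queue all dependents
    # Rebuild the graph without any doomed node or edge to one.
    return {k: [d for d in v if d not in doomed]
            for k, v in graph.items() if k not in doomed}
```

Unlike confidence attenuation, which merely down-weights a bad node, this pruning contracts the graph so later planning never revisits the refuted subgraph.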
[265] SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
Fenghao Song, Shaojing Yang, Xi Zhou
Main category: cs.CV
TL;DR: SARES-DEIM: A domain-aware DETR-based framework for ship detection in SAR imagery using specialized frequency/wavelet experts and high-resolution pyramid enhancement to address speckle noise and small targets.
Details
Motivation: Ship detection in SAR imagery faces challenges from coherent speckle noise, complex coastal clutter, and small-scale targets. Conventional optical detectors lack robustness to SAR-specific degradation and lose fine-grained details during downsampling.
Method: Proposes SARES-DEIM based on DETR paradigm with two key components: 1) SARESMoE (SAR-aware Expert Selection Mixture-of-Experts) using sparse gating to route features to specialized frequency and wavelet experts for noise filtering, and 2) Space-to-Depth Enhancement Pyramid (SDEP) neck to preserve high-resolution spatial cues for small target localization.
Result: Extensive experiments show superiority on two benchmark datasets. On the HRSID dataset, the model achieves mAP50:95 of 76.4% and mAP50 of 93.8%, outperforming state-of-the-art YOLO-series and specialized SAR detectors.
Conclusion: SARES-DEIM effectively addresses SAR-specific challenges through domain-aware architecture with specialized experts and high-resolution feature preservation, demonstrating strong performance for ship detection in noisy SAR environments.
Abstract: Ship detection in Synthetic Aperture Radar (SAR) imagery is fundamentally challenged by inherent coherent speckle noise, complex coastal clutter, and the prevalence of small-scale targets. Conventional detectors, primarily designed for optical imagery, often exhibit limited robustness against SAR-specific degradation and suffer from the loss of fine-grained ship signatures during spatial downsampling. To address these limitations, we propose SARES-DEIM, a domain-aware detection framework grounded in the DEtection TRansformer (DETR) paradigm. Central to our approach is SARESMoE (SAR-aware Expert Selection Mixture-of-Experts), a module leveraging a sparse gating mechanism to selectively route features toward specialized frequency and wavelet experts. This sparsely-activated architecture effectively filters speckle noise and semantic clutter while maintaining high computational efficiency. Furthermore, we introduce the Space-to-Depth Enhancement Pyramid (SDEP) neck to preserve high-resolution spatial cues from shallow stages, significantly improving the localization of small targets. Extensive experiments on two benchmark datasets demonstrate the superiority of SARES-DEIM. Notably, on the challenging HRSID dataset, our model achieves a mAP50:95 of 76.4% and a mAP50 of 93.8%, outperforming state-of-the-art YOLO-series and specialized SAR detectors.
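Sparse top-k gating of the kind SARESMoE uses can be sketched as follows (the expert functions and softmax-renormalized weighting are generic MoE conventions, not the paper's exact design):

```python
import numpy as np

def sparse_moe(x, gate_logits, experts, k=2):
    """Route the input to the top-k experts by gate score and mix their
    outputs with renormalized softmax weights. Unselected experts are never
    evaluated, which is what keeps the computation sparse."""
    topk = np.argsort(gate_logits)[-k:]                 # indices of the k best gates
    w = np.exp(gate_logits[topk] - gate_logits[topk].max())
    w /= w.sum()                                        # renormalize over the selected k
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))
```

In SARESMoE the experts would be frequency- and wavelet-domain filters; here they are arbitrary callables.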
[266] Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
Rubén Moreno-Aguado, Alba Magallón, Victor Moreno, Yingying Fang, Guang Yang
Main category: cs.CV
TL;DR: VoxelFM is a 3D CT foundation model trained with self-distillation (DINO) that learns rich visual features without language supervision, outperforming existing CT foundation models across 7 clinical tasks using frozen backbone representations with lightweight probes.
Details
Motivation: Existing CT foundation models focus on vision-language systems requiring large paired image-text data that's unavailable in CT. They also require computationally expensive backbone fine-tuning for downstream tasks. There's a need for models that learn robust visual representations enabling efficient transfer to new tasks with minimal labeled data and without backbone fine-tuning.
Method: VoxelFM is a 3D CT foundation model trained with self-distillation using the DINO framework. It learns semantically rich features without language supervision. For evaluation, frozen backbone representations are used with lightweight probes across seven categories of clinically relevant downstream tasks.
Result: VoxelFM matched or outperformed four existing CT foundation models across all task categories (classification, regression, survival analysis, instance retrieval, localization, segmentation, and report generation). Despite no language supervision during pre-training, it surpassed models explicitly trained with language-alignment objectives, including on report generation.
Conclusion: Current CT foundation models perform significantly better as feature extractors for lightweight probes rather than as vision encoders for vision-language models. The results suggest that robust visual representation learning without language supervision can outperform language-aligned models even on language tasks like report generation.
Abstract: There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes rather than as vision encoders for vision-language models. Model weights and training code are publicly available.
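Probing frozen features with a lightweight head can be sketched as a ridge-regularized linear fit; the closed-form solve below is a generic stand-in for whatever probes the paper uses, and the backbone itself is never updated:

```python
import numpy as np

def fit_linear_probe(features, targets, ridge=1e-3):
    """Fit a ridge-regularized linear probe (with bias) on frozen backbone
    features via the normal equations. Only the probe weights are learned."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def probe_predict(w, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ w
```

Because the probe is a single linear layer, transfer to a new clinical task needs only labeled targets and a cheap solve, no backbone fine-tuning.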
[267] NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
Shuhong Liu, Chenyu Bao, Ziteng Cui, Xuangeng Chu, Bin Ren, Lin Gu, Xiang Chen, Mingrui Li, Long Ma, Marcos V. Conde, Radu Timofte, Yun Liu, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Yuan Gan, Tianhan Xu, Yusuke Kurose, Tatsuya Harada, Junwei Yuan, Gengjia Chang, Xining Ge, Mache You, Qida Cao, Zeliang Li, Xinyuan Hu, Hongde Gu, Changyue Shi, Jiajun Ding, Zhou Yu, Jun Yu, Seungsang Oh, Fei Wang, Donggun Kim, Zhiliang Wu, Seho Ahn, Xinye Zheng, Kun Li, Yanyan Wei, Weisi Lin, Dizhe Zhang, Yuchao Chen, Meixi Song, Hanqing Wang, Haoran Feng, Lu Qi, Jiaao Shan, Yang Gu, Jiacheng Liu, Shiyu Liu, Kui Jiang, Junjun Jiang, Runyu Zhu, Sixun Dong, Qingxia Ye, Zhiqiang Zhang, Zhihua Xu, Zhiwei Wang, Phan The Son, Zhimiao Shi, Zixuan Guo, Xueming Fu, Lixia Han, Changhe Liu, Zhenyu Zhao, Manabu Tsukada, Zheng Zhang, Zihan Zhai, Tingting Li, Ziyang Zheng, Yuhao Liu, Dingju Wang, Jeongbin You, Younghyuk Kim, Il-Youp Kwak, Mingzhe Lyu, Junbo Yang, Wenhan Yang, Hongsen Zhang, Jinqiang Cui, Hong Zhang, Haojie Guo, Hantang Li, Qiang Zhu, Bowen He, Xiandong Meng, Debin Zhao, Xiaopeng Fan, Wei Zhou, Linzhe Jiang, Linfeng Li, Louzhe Xu, Qi Xu, Hang Song, Chenkun Guo, Weizhi Nie, Yufei Li, Xingan Zhan, Zhanqi Shi, Dufeng Zhang, Boyuan Tian, Jingshuo Zeng, Gang He, Yubao Fu, Weijie Wang, Cunchuan Huang
Main category: cs.CV
TL;DR: Review of NTIRE 2026 3D Restoration and Reconstruction Challenge focusing on robust 3D reconstruction under adverse conditions like low-light and smoke degradation.
Details
Motivation: To advance 3D reconstruction capabilities in challenging real-world conditions, specifically addressing degradation from extreme low-light and smoke environments that hinder traditional reconstruction methods.
Method: Organized a challenge with 279 participants, 33 valid submissions; evaluated methods against state-of-the-art baselines using the RealX3D benchmark; analyzed design principles of top-performing approaches.
Result: Significant progress in 3D reconstruction under adverse conditions; identified shared design principles among top methods; provided insights into effective strategies for handling 3D scene degradation.
Conclusion: The challenge successfully advanced the field of 3D reconstruction in adverse conditions, revealing effective approaches and establishing benchmarks for future research in robust 3D scene understanding.
Abstract: This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
[268] MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He
Main category: cs.CV
TL;DR: MinerU2.5-Pro advances document parsing SOTA through systematic data engineering and training strategy optimization, without architecture changes, achieving 95.69 on OmniDocBench v1.6
Details
Motivation: Current document parsing methods focus on model architecture innovation while neglecting systematic training data engineering. SOTA models across different architectures show consistent failure patterns on the same hard samples, suggesting performance bottlenecks stem from shared training data deficiencies rather than architecture limitations.
Method: MinerU2.5-Pro uses a Data Engine co-designed around coverage, informativeness, and annotation accuracy: 1) Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; 2) Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; 3) Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy includes large-scale pre-training, hard sample fine-tuning, and GRPO alignment.
Result: MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters, without any architectural modification.
Conclusion: Systematic data engineering and training strategy optimization can significantly advance document parsing performance without architectural changes, addressing fundamental training data deficiencies that limit current SOTA models across different architectures.
Abstract: Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
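Cross-Model Consistency Verification can be sketched as majority voting over heterogeneous model outputs, with disagreement serving as a difficulty score (an illustrative reduction of the paper's procedure):

```python
from collections import Counter

def consistency_difficulty(outputs):
    """Given parses of one sample from several heterogeneous models, take the
    majority answer as a candidate annotation and score difficulty as
    1 - (share of models agreeing with the majority)."""
    counts = Counter(outputs)
    answer, votes = counts.most_common(1)[0]
    return answer, 1.0 - votes / len(outputs)
```

Samples where models fully agree (difficulty 0) can be auto-annotated; high-disagreement samples would be routed to a refinement pipeline such as the paper's Judge-and-Refine step.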
[269] Rethinking Exposure Correction for Spatially Non-uniform Degradation
Ao Li, Jiawei Sun, Le Dong, Zhenyu Wang, Weisheng Dong
Main category: cs.CV
TL;DR: A novel exposure correction method designed for spatially non-uniform degradations using spatially adaptive modulation weights and uncertainty-inspired loss.
Details
Motivation: Real-world exposure correction faces spatially non-uniform degradations where diverse exposure errors coexist in a single image, but existing methods assume uniform exposure and use globally aggregated modulation signals.
Method: Proposes Spatial Signal Encoder to predict spatially adaptive modulation weights guiding multiple look-up tables for image transformation, plus HSL-based compensation for color fidelity, and an uncertainty-inspired non-uniform loss for dynamic optimization focus.
Result: Extensive experiments show superior qualitative and quantitative performance compared to state-of-the-art methods.
Conclusion: The proposed paradigm effectively addresses spatial non-uniformity in exposure correction through architectural innovations and adaptive optimization strategies.
Abstract: Real-world exposure correction is fundamentally challenged by spatially non-uniform degradations, where diverse exposure errors frequently coexist within a single image. However, existing exposure correction methods are still largely developed under a predominantly uniform assumption. Architecturally, they typically rely on globally aggregated modulation signals that capture only the overall exposure trend. From the optimization perspective, conventional reconstruction losses are usually derived under a shared global scale, thus overlooking the spatially varying correction demands across regions. To address these limitations, we propose a new exposure correction paradigm explicitly designed for spatial non-uniformity. Specifically, we introduce a Spatial Signal Encoder to predict spatially adaptive modulation weights, which are used to guide multiple look-up tables for image transformation, together with an HSL-based compensation module for improved color fidelity. Beyond the architectural design, we propose an uncertainty-inspired non-uniform loss that dynamically allocates the optimization focus based on local restoration uncertainties, better matching the heterogeneous nature of real-world exposure errors. Extensive experiments demonstrate that our method achieves superior qualitative and quantitative performance compared with state-of-the-art methods. Code is available at https://github.com/FALALAS/rethinkingEC.
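The core mechanism, per-pixel weights blending several look-up tables, can be sketched as follows. This is a minimal illustrative version, not the paper's implementation: the softmax normalization and grayscale single-channel setup are assumptions, and the Spatial Signal Encoder that would predict `weights` is omitted.

```python
import numpy as np

def apply_luts(image, luts, weights):
    """
    Blend K look-up-table transforms with spatially adaptive weights.

    image:   (H, W) grayscale intensities in [0, 1]
    luts:    (K, 256) per-LUT intensity mappings in [0, 1]
    weights: (K, H, W) per-pixel mixing logits (softmax-normalized over K)
    """
    idx = np.clip((image * 255).astype(np.int32), 0, 255)    # (H, W) bin indices
    transformed = luts[:, idx]                               # (K, H, W): each LUT applied
    # normalize over the LUT axis so each pixel's blend sums to 1
    w = np.exp(weights) / np.exp(weights).sum(axis=0, keepdims=True)
    return (w * transformed).sum(axis=0)                     # (H, W) corrected image
```

Because the weights vary per pixel, an underexposed region can lean on a brightening LUT while a well-exposed region keeps an identity mapping, which is exactly the non-uniformity the paper targets.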
[270] Vero: An Open RL Recipe for General Visual Reasoning
Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu
Main category: cs.CV
TL;DR: Vero is an open visual reasoning model family that achieves state-of-the-art performance across diverse visual tasks by scaling reinforcement learning with a 600K-sample dataset covering six broad task categories.
Details
Motivation: Proprietary VLMs show strong visual reasoning across charts, science, spatial understanding, and open-ended tasks, but their RL pipelines and data are not publicly available. The authors aim to create fully open VLMs that match or exceed existing models.
Method: Scale RL data and rewards across six broad task categories, constructing Vero-600K dataset from 59 datasets (600K samples). Design task-routed rewards to handle heterogeneous answer formats. Train starting from Qwen3-VL-8B-Instruct base model.
Result: Vero achieves SOTA performance, improving over four base models by 3.7-5.5 points on average across VeroEval (30 challenging benchmarks). Outperforms Qwen3-VL-8B-Thinking on 23/30 benchmarks without proprietary thinking data.
Conclusion: Broad data coverage across diverse task categories is the primary driver of strong RL scaling for visual reasoning. Different task categories elicit distinct reasoning patterns that transfer poorly in isolation.
Abstract: What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7-5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
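Task-routed rewards in this sense can be sketched as a dispatcher that picks a format-specific verifier per sample. A hypothetical minimal version: the route names, regexes, and binary-reward convention below are illustrative assumptions, not Vero's actual verifiers.

```python
import re

def numeric_reward(pred, gold, tol=1e-3):
    """Chart/science numerics: compare the last number in the prediction."""
    nums = re.findall(r"-?\d+\.?\d*", pred)
    if not nums:
        return 0.0
    return float(abs(float(nums[-1]) - float(gold)) <= tol)

def choice_reward(pred, gold):
    """Multiple choice: match the final standalone option letter."""
    m = re.findall(r"\b([A-D])\b", pred.upper())
    return float(bool(m) and m[-1] == gold.upper())

# hypothetical task -> verifier routing table
ROUTES = {"chart": numeric_reward, "science_mc": choice_reward}

def task_routed_reward(task, pred, gold):
    """Route each sample to the verifier matching its answer format."""
    return ROUTES[task](pred, gold)
```

The point of routing is that a single string-match reward would mis-score heterogeneous formats (numbers with units, option letters, free text), so each of the six task categories gets its own checker.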
[271] OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
Liyu Zhang, Kehan Li, Tingrui Han, Tao Zhao, Yuxuan Sheng, Shibo He, Chao Li
Main category: cs.CV
TL;DR: OP-GRPO: An off-policy variant of GRPO for flow-matching models that improves training efficiency by reusing high-quality trajectories from a replay buffer with importance sampling correction.
Details
Motivation: GRPO (Group Relative Policy Optimization) effectively improves flow-matching model quality but suffers from low sample efficiency due to its on-policy training paradigm, requiring fresh samples each iteration.
Method: 1) Active selection of high-quality trajectories into replay buffer for reuse; 2) Sequence-level importance sampling correction to handle distribution shift; 3) Truncation of late denoising steps to avoid ill-conditioned off-policy ratios.
Result: Achieves comparable or superior performance to Flow-GRPO with only 34.2% of training steps on average across image and video generation benchmarks, significantly improving training efficiency.
Conclusion: OP-GRPO successfully addresses GRPO’s sample efficiency limitations through off-policy techniques while maintaining generation quality, making it a practical enhancement for flow-matching models.
Abstract: Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO’s clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
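The combination of group-relative advantages with a sequence-level importance ratio can be sketched numerically. This is a simplified sketch under stated assumptions: it treats each trajectory's log-probability as a single scalar and uses standard PPO-style clipping; OP-GRPO's replay-buffer selection and late-step truncation are not shown.

```python
import numpy as np

def grpo_offpolicy_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """
    logp_new: (G,) sequence log-probs under the current policy
    logp_old: (G,) sequence log-probs under the behavior policy (replay buffer)
    rewards:  (G,) scalar rewards for a group of G trajectories
    """
    # group-relative advantages: normalize rewards within the group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # sequence-level importance ratio corrects for off-policy samples
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    # PPO-style pessimistic objective; negate to obtain a loss
    return -np.minimum(ratio * adv, clipped * adv).mean()
```

When `logp_new == logp_old` the ratio is 1 and this reduces to on-policy GRPO; when the replay samples drift far from the current policy the ratio explodes, which is the ill-conditioning the paper mitigates by truncating late denoising steps.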
[272] Uncertainty-Aware Test-Time Adaptation for Cross-Region Spatio-Temporal Fusion of Land Surface Temperature
Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Main category: cs.CV
TL;DR: Uncertainty-aware test-time adaptation framework for spatio-temporal fusion regression tasks in remote sensing, addressing domain shifts across geographic regions.
Details
Motivation: Deep learning models for remote sensing struggle with generalization across geographic regions due to domain shifts from variations in land cover, climate, and environmental conditions. Existing test-time adaptation methods are designed for classification tasks and not directly applicable to regression problems like spatio-temporal fusion for land surface temperature estimation.
Method: Proposes an uncertainty-aware TTA framework that updates only the fusion module of a pre-trained STF model. The adaptation is guided by three principles: epistemic uncertainty estimation, land use/land cover consistency, and bias correction. The method works without requiring source data or labeled target samples.
Result: Experiments on four target regions (Rome, Cairo, Madrid, Montpellier) with diverse climates show consistent improvements over a model pre-trained on Orléans, France. Achieved average gains of 24.2% in RMSE and 27.9% in MAE, even with limited unlabeled target data and only 10 TTA epochs.
Conclusion: The proposed uncertainty-aware TTA framework effectively addresses domain shifts in regression tasks for remote sensing applications, demonstrating strong generalization across diverse geographic regions without requiring labeled target data or source domain access.
Abstract: Deep learning models have shown great promise in diverse remote sensing applications. However, they often struggle to generalize across geographic regions unseen during training due to domain shifts. Domain shifts occur when data distributions differ between the training region and new target regions, due to variations in land cover, climate, and environmental conditions. Test-time adaptation (TTA) has emerged as a solution to such shifts, but existing methods are primarily designed for classification and are not directly applicable to regression tasks. In this work, we address the regression task of spatio-temporal fusion (STF) for land surface temperature estimation. We propose an uncertainty-aware TTA framework that updates only the fusion module of a pre-trained STF model, guided by epistemic uncertainty, land use and land cover consistency, and bias correction, without requiring source data or labeled target samples. Experiments on four target regions with diverse climates, namely Rome in Italy, Cairo in Egypt, Madrid in Spain, and Montpellier in France, show consistent improvements in RMSE and MAE over a model pre-trained on Orléans, France. The average gains are 24.2% and 27.9%, respectively, even with limited unlabeled target data and only 10 TTA epochs.
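The "update only the fusion module" pattern can be sketched as follows. This is a hypothetical toy version: `STFModel`, its parameter names, and the consensus-target MSE are illustrative stand-ins; the paper's epistemic-uncertainty, LULC-consistency, and bias-correction terms are not reproduced here.

```python
import numpy as np

class STFModel:
    """Minimal stand-in for a pre-trained spatio-temporal fusion model."""
    def __init__(self):
        self.params = {
            "backbone.w": np.array([1.0, -2.0]),   # frozen at test time
            "fusion.w":   np.array([0.5, 0.5]),    # the only adapted part
        }

    def predict(self, coarse, fine):
        a, b = self.params["fusion.w"]
        return a * coarse + b * fine

def tta_update(model, coarse, fine, consensus, lr=0.1):
    """One source-free TTA step: gradient of the MSE to an
    uncertainty-filtered consensus target, applied to fusion params only."""
    pred = model.predict(coarse, fine)
    err = pred - consensus
    grad = np.array([np.mean(2 * err * coarse), np.mean(2 * err * fine)])
    for name in model.params:
        if name.startswith("fusion."):             # backbone stays frozen
            model.params[name] = model.params[name] - lr * grad
```

Updating only the small fusion head is what makes adaptation stable with limited unlabeled target data: the frozen backbone retains the source-domain features while the head absorbs the regional shift.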
[273] Hierarchical Co-Embedding of Font Shapes and Impression Tags
Yugo Kubota, Kaito Shiku, Seiichi Uchida
Main category: cs.CV
TL;DR: A hyperbolic co-embedding framework models font-impression correspondence through entailment rather than simple alignment, capturing style specificity in font design.
Details
Motivation: Font shapes evoke various impressions, but the correspondence between fonts and impressions is not one-to-one - some impressions are broadly compatible with diverse styles while others strongly constrain plausible fonts. This graded constraint strength, called style specificity, needs better modeling beyond simple paired alignment approaches.
Method: Proposes a hyperbolic co-embedding framework that models font-impression correspondence through entailment. Font images and impression descriptions (single tags or tag sets) are embedded in a shared hyperbolic space with two complementary entailment constraints: impression-to-font entailment and low-to-high style-specificity entailment among impressions.
Result: Experiments on the MyFonts dataset show improved bidirectional retrieval over strong one-to-one baselines. The learned space captures a coherent progression from ambiguous to more style-specific impressions and provides meaningful, data-driven quantification of style specificity.
Conclusion: The hyperbolic co-embedding framework successfully models font-impression correspondence through entailment, yielding an interpretable geometric measure of how strongly an impression constrains font style, with low style-specificity impressions near the origin and high style-specificity impressions farther away in hyperbolic space.
Abstract: Font shapes can evoke a wide range of impressions, but the correspondence between fonts and impression descriptions is not one-to-one: some impressions are broadly compatible with diverse styles, whereas others strongly constrain the set of plausible fonts. We refer to this graded constraint strength as style specificity. In this paper, we propose a hyperbolic co-embedding framework that models font–impression correspondence through entailment rather than simple paired alignment. Font images and impression descriptions, represented as single tags or tag sets, are embedded in a shared hyperbolic space with two complementary entailment constraints: impression-to-font entailment and low-to-high style-specificity entailment among impressions. This formulation induces a radial structure in which low style-specificity impressions lie near the origin and high style-specificity impressions lie farther away, yielding an interpretable geometric measure of how strongly an impression constrains font style. Experiments on the MyFonts dataset demonstrate improved bidirectional retrieval over strong one-to-one baselines. In addition, traversal and tag-level analyses show that the learned space captures a coherent progression from ambiguous to more style-specific impressions and provides a meaningful, data-driven quantification of style specificity.
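The radial structure the paper describes can be made concrete with two standard ingredients of Poincaré-ball embeddings. This is a sketch of the underlying geometry, not the paper's model: the curvature -1 distance formula is standard, and reading the radial norm as style specificity follows the abstract's description.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    """Geodesic distance between two points in the Poincaré ball (curvature -1)."""
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    x = 1 + 2 * duv / ((1 - uu) * (1 - vv) + eps)
    return np.arccosh(max(x, 1.0))

def style_specificity(embedding):
    """Radial norm: low-specificity impressions sit near the origin,
    high-specificity impressions sit near the boundary."""
    return np.linalg.norm(embedding)
```

Because distances grow without bound near the boundary, a tag near the origin stays close to many fonts (broadly compatible), while a tag near the boundary is close to only a narrow region, which is the geometric reading of "strongly constrains the set of plausible fonts". The entailment constraints themselves would typically be enforced with cone- or order-style losses on top of this geometry.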
[274] Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation
Xu Yan, Jun Yin, Shiliang Sun, Minghua Wan
Main category: cs.CV
TL;DR: Multi-view multi-label learning with dual-missing (views and labels) scenario using discrete consistent representations via shared codebook and cross-view reconstruction, with weight estimation for label correlation preservation and fused-teacher self-distillation.
Details
Motivation: Existing multi-view multi-label learning methods struggle with dual-missing scenarios where both views and labels are incomplete. Current approaches rely on contrastive learning or information bottleneck theory but lack explicit structural constraints to capture stable and discriminative shared semantics.
Method: 1) Learn discrete consistent representations through multi-view shared codebook and cross-view reconstruction, aligning views within limited codebook embeddings to reduce feature redundancy. 2) Design weight estimation method evaluating each view’s ability to preserve label correlation structures for weighted fusion. 3) Introduce fused-teacher self-distillation framework where fused prediction guides view-specific classifiers and feeds global knowledge back to single-view branches.
Result: Extensive comparative experiments on five benchmark datasets demonstrate the effectiveness of the proposed method compared to advanced existing methods.
Conclusion: The proposed structured approach with shared codebook, weight estimation, and self-distillation effectively addresses dual-missing multi-view multi-label learning by capturing stable shared semantics and enhancing generalization under missing-label conditions.
Abstract: Although multi-view multi-label learning has been extensively studied, research on the dual-missing scenario, where both views and labels are incomplete, remains largely unexplored. Existing methods mainly rely on contrastive learning or information bottleneck theory to learn consistent representations under missing-view conditions, but loss-based alignment without explicit structural constraints limits the ability to capture stable and discriminative shared semantics. To address this issue, we introduce a more structured mechanism for consistent representation learning: we learn discrete consistent representations through a multi-view shared codebook and cross-view reconstruction, which naturally align different views within the limited shared codebook embeddings and reduce feature redundancy. At the decision level, we design a weight estimation method that evaluates the ability of each view to preserve label correlation structures, assigning weights accordingly to enhance the quality of the fused prediction. In addition, we introduce a fused-teacher self-distillation framework, where the fused prediction guides the training of view-specific classifiers and feeds the global knowledge back into the single-view branches, thereby enhancing the generalization ability of the model under missing-label conditions. The effectiveness of our proposed method is thoroughly demonstrated through extensive comparative experiments with advanced methods on five benchmark datasets. Code is available at https://github.com/xuy11/SCSD.
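The shared-codebook mechanism is essentially a vector-quantization lookup applied to every view. A minimal sketch, assuming a plain nearest-codeword assignment (the paper's training losses, cross-view reconstruction, and straight-through gradients are omitted):

```python
import numpy as np

def quantize(features, codebook):
    """Map each view's features to their nearest shared codewords.

    features: (N, D) continuous view features
    codebook: (K, D) shared embeddings
    Returns the quantized features and the codeword indices; different views
    of the same sample landing on the same indices is what yields a
    discrete, consistent representation.
    """
    # pairwise squared distances, shape (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx
```

Restricting all views to the same small set of K codewords is the "explicit structural constraint" the abstract contrasts with purely loss-based alignment: redundancy is reduced because arbitrary continuous offsets between views collapse onto shared discrete codes.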
[275] GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
Yaohan Guan, Pristina Wang, Najim Dehak, Alan Yuille, Jieneng Chen, Daniel Khashabi
Main category: cs.CV
TL;DR: GENFIG1 is a benchmark for evaluating generative AI models’ ability to create “Figure 1” style scientific visualizations that clearly express and motivate a paper’s core research idea from text inputs like title, abstract, introduction, and figure caption.
Details
Motivation: Scientific figures (especially "Figure 1") are crucial visual summaries that require significant human effort to create, highlighting the difficulty of science visual communication. The authors aim to create a benchmark to evaluate AI models' ability to generate such conceptually rich scientific visualizations.
Method: Created GENFIG1 benchmark by curating papers from top deep-learning conferences with stringent quality control. The benchmark evaluates models on their ability to: 1) comprehend technical concepts, 2) identify salient ideas, and 3) design coherent, aesthetically effective graphics faithful to input. Introduced automatic evaluation metric correlating with human judgments.
Result: Evaluated representative models on GENFIG1 and found the task presents significant challenges even for best-performing systems. The automatic evaluation metric correlates well with expert human judgments.
Conclusion: GENFIG1 serves as a foundation for future progress in multimodal AI, specifically addressing the challenging task of scientific visual communication through generative AI models.
Abstract: In many science papers, “Figure 1” serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of science visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models for their ability to produce figures that clearly express and motivate the central idea of a paper, given its title, abstract, introduction, and figure caption as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend and grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.
[276] Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification
Ashwat Rajbhandari, Bharatesh Chakravarthi
Main category: cs.CV
TL;DR: Large-scale vision-language models (CLIP-based) with stability-focused adaptation improve extreme far-distance video person re-identification by addressing scale compression, resolution degradation, and aerial-ground viewpoint mismatch.
Details
Motivation: Extreme far-distance video person ReID faces challenges like scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. Models trained on close-range imagery degrade significantly as camera altitude and subject distance increase.
Method: Upgrade CLIP visual backbone from ViT-B/16 to ViT-L/14, introduce backbone-aware selective fine-tuning, incorporate lightweight temporal attention pooling to handle noisy low-resolution tracklets, retain adapter-based and prompt-conditioned cross-view learning for domain shifts, and refine retrieval with improved optimization and k-reciprocal re-ranking.
Result: Achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A) on DetReIDX benchmark, with overall mAP of 35.73, showing significant improvement in extreme far-distance video person ReID.
Conclusion: Large-scale vision-language backbones combined with stability-focused adaptation significantly enhance robustness in extreme far-distance video person ReID, demonstrating the value of adapting foundation models for challenging visual understanding tasks.
Abstract: Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
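Temporal attention pooling of the kind described can be sketched in a few lines. A minimal illustrative version: the single learned scoring vector `w_score` is an assumption (the paper's pooling module may be parameterized differently), but the softmax-over-time weighting that suppresses degraded frames is the core idea.

```python
import numpy as np

def temporal_attention_pool(frame_feats, w_score):
    """Attention-pool a tracklet's frame features into one clip descriptor.

    frame_feats: (T, D) per-frame embeddings (possibly noisy/low-res frames)
    w_score:     (D,) learned scoring vector producing frame-quality logits
    Frames scoring low (e.g. blurred or tiny subjects) get small weights.
    """
    scores = frame_feats @ w_score                  # (T,) frame-quality logits
    attn = np.exp(scores - scores.max())            # stable softmax over time
    attn = attn / attn.sum()
    return attn @ frame_feats                       # (D,) weighted clip feature
```

Compared with plain average pooling, a single sharp frame in an otherwise degraded tracklet can dominate the descriptor, which is exactly what helps at extreme distance.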
[277] AURA: Always-On Understanding and Real-Time Assistance via Video Streams
Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li
Main category: cs.CV
TL;DR: AURA is an end-to-end streaming VideoLLM framework that enables continuous processing of live video streams for real-time question answering and proactive responses, addressing limitations of existing offline systems.
Details
Motivation: Most VideoLLMs are offline and unsuitable for live video streams requiring continuous observation and timely response. Existing streaming approaches rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing effectiveness for open-ended QA and long-horizon interaction.
Method: AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It’s an end-to-end framework that enables a unified VideoLLM to continuously process video streams and support both real-time QA and proactive responses.
Result: Achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators.
Conclusion: AURA provides an effective solution for streaming visual interaction, with released model and real-time inference framework to facilitate future research in live video understanding systems.
Abstract: Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
[278] Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks
Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, Purvanshi Mehta
Main category: cs.CV
TL;DR: GraphicDesignBench (GDB) is a comprehensive benchmark suite for evaluating AI models on professional graphic design tasks across layout, typography, infographics, template semantics, and animation, revealing current models’ limitations in spatial reasoning, vector generation, and typographic precision.
Details
Motivation: Existing benchmarks focus on natural-image understanding or generic text-to-image synthesis, but lack evaluation of professional graphic design tasks that require translating communicative intent into structured layouts, faithful text rendering, layered composition manipulation, vector graphics generation, and animation reasoning.
Method: Created GDB with 50 tasks organized along five axes (layout, typography, infographics, template & design semantics, animation), evaluated under both understanding and generation settings. Uses real-world design templates from LICA layered-composition dataset and standardized metrics covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity.
Result: Current frontier models fall short on core professional design challenges: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations. High-level semantic understanding is achievable, but precision, structure, and compositional awareness remain major gaps.
Conclusion: GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators, highlighting the significant gap between current AI capabilities and professional graphic design requirements.
Abstract: We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.
[279] DriveVA: Video Action Models are Zero-Shot Drivers
Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, Hao Cheng
Main category: cs.CV
TL;DR: DriveVA is an autonomous driving world model that jointly generates future visual forecasts and action sequences using shared latent representations, improving generalization across datasets and sensor configurations.
Details
Motivation: Current world-model-based planning methods in autonomous driving have limited generalization across datasets and sensor configurations, and suffer from poor video-trajectory consistency due to loosely coupled planning paradigms.
Method: DriveVA uses a DiT-based decoder to jointly predict future action sequences (trajectories) and videos in a shared latent generative process, inheriting priors from large-scale video generation models and employing video continuation strategies for long-duration consistency.
Result: DriveVA achieves 90.9 PDM score on NAVSIM, and shows significant zero-shot generalization improvements: 78.9% reduction in L2 error and 83.3% reduction in collision rate on nuScenes, and 52.5%/52.4% reductions on Bench2drive compared to state-of-the-art.
Conclusion: DriveVA demonstrates strong generalization capabilities in autonomous driving through joint video-trajectory generation, enabling better cross-domain performance and tighter alignment between planning and scene evolution.
Abstract: Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and 52.5% and 52.4% on Bench2Drive built on CARLA v2 compared with the state-of-the-art world-model-based planner.
[280] A Persistent Homology Design Space for 3D Point Cloud Deep Learning
Prachi Kudeshia, Jiju Poovvancheri, Amr Ghoneim, Dong Chen
Main category: cs.CV
TL;DR: A systematic framework for integrating Persistent Homology (PH) into 3D point cloud learning, identifying six principled injection points for topological reasoning as structural inductive bias in neural networks.
Details
Motivation: Persistent Homology provides stable, multi-scale descriptors of intrinsic shape structure but has been integrated into deep learning for point clouds in an ad hoc, peripheral manner. The paper aims to formalize a unified design space for topology-driven learning in 3D point clouds.
Method: Introduces 3DPHDL framework with six principled injection points for topology: sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and internal network regularization. Instantiates framework through controlled empirical study on ModelNet40 classification and ShapeNetPart segmentation using PointNet, DGCNN, and Point Transformer backbones augmented with persistence diagrams, images, and landscapes.
Result: Demonstrates consistent improvements in topology-sensitive discrimination and part consistency, while revealing trade-offs between representational expressiveness and combinatorial complexity. Shows meaningful impact on accuracy, robustness to noise and sampling variation, and computational scalability.
Conclusion: By viewing persistent homology as a structured component rather than auxiliary feature, the work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning, moving beyond ad hoc integration approaches.
Abstract: Persistent Homology (PH) offers stable, multi-scale descriptors of intrinsic shape structure by capturing connected components, loops, and voids that persist across scales, providing invariants that complement purely geometric representations of 3D data. Yet, despite strong theoretical guarantees and increasing empirical adoption, its integration into deep learning for point clouds remains largely ad hoc and architecturally peripheral. In this work, we introduce a unified design space for Persistent-Homology driven learning in 3D point clouds (3DPHDL), formalizing the interplay between complex construction, filtration strategy, persistence representation, neural backbone, and prediction task. Beyond the canonical pipeline of diagram computation and vectorization, we identify six principled injection points through which topology can act as a structural inductive bias reshaping sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and even internal network regularization. We instantiate this framework through a controlled empirical study on ModelNet40 classification and ShapeNetPart segmentation, systematically augmenting representative backbones (PointNet, DGCNN, and Point Transformer) with persistence diagrams, images, and landscapes, and analyzing their impact on accuracy, robustness to noise and sampling variation, and computational scalability. Our results demonstrate consistent improvements in topology-sensitive discrimination and part consistency, while revealing meaningful trade-offs between representational expressiveness and combinatorial complexity. By viewing persistent homology not merely as an auxiliary feature but as a structured component within the learning pipeline, this work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning.
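The canonical starting point of the pipeline, 0-dimensional persistence of a Vietoris-Rips filtration, can be sketched in a few lines: connected components are tracked with a union-find over distance-sorted edges, and each merge records a (birth, death) pair. This is a minimal illustrative sketch, not the paper's framework, which also covers higher-dimensional features, filtration choices, and vectorizations:

```python
import math
from itertools import combinations

def h0_persistence(points):
    """0-dimensional persistence pairs (birth, death) of a Vietoris-Rips
    filtration over a point cloud. Every component is born at scale 0 and
    dies when it merges into another; the last surviving one is omitted."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # All pairwise edges sorted by length: the Rips filtration order.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj              # merge: one component dies at scale d
            pairs.append((0.0, d))
    return pairs                         # n-1 finite pairs

# Two well-separated clusters: the long-lived component shows up as a
# death at the inter-cluster distance (10.0), the rest die early (1.0).
print(h0_persistence([(0, 0), (0, 1), (10, 0), (10, 1)]))
# → [(0.0, 1.0), (0.0, 1.0), (0.0, 10.0)]
```

The long bar (death at 10.0) is exactly the kind of multi-scale, noise-stable signal that persistence-based descriptors contribute alongside purely geometric features.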
[281] HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data
Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Harris Kontoes
Main category: cs.CV
TL;DR: HighFM: A foundation model for high temporal resolution Earth Observation data using SEVIRI imagery from geostationary satellites, adapted for real-time disaster monitoring with improved temporal encoding.
Details
Motivation: Climate disasters require real-time monitoring and early warning systems. While foundation models have advanced Earth Observation ML, most rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response.
Method: Adapted SatMAE masked autoencoding framework to learn robust spatiotemporal representations from over 2TB of SEVIRI imagery from Meteosat Second Generation. Enhanced architecture with fine-grained temporal encodings to capture short-term variability. Pretrained models then fine-tuned on cloud masking and active fire detection tasks.
Result: Benchmarked SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial foundation models, demonstrating consistent gains across both balanced accuracy and IoU metrics for cloud masking and fire detection.
Conclusion: HighFM highlights the potential of temporally dense geostationary data for real-time Earth Observation, offering a scalable path toward foundation models for disaster detection and tracking.
Abstract: The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high temporal resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then fine-tuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.
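The two ingredients of the pretraining step can be sketched minimally: random patch masking as in masked autoencoding, plus a fine-grained sinusoidal time encoding so that 15-minute geostationary revisits map to distinct positions. The mask ratio, encoding form, and function names below are illustrative assumptions, not the paper's exact design:

```python
import math
import random

def mae_mask(num_patches, mask_ratio, rng):
    """Random masking as in masked autoencoding: return indices of the
    visible patches (fed to the encoder) and the masked ones (targets
    the decoder must reconstruct)."""
    idx = list(range(num_patches))
    rng.shuffle(idx)
    n_keep = int(num_patches * (1 - mask_ratio))
    return sorted(idx[:n_keep]), sorted(idx[n_keep:])

def temporal_encoding(minute_of_day, dim):
    """Fine-grained sinusoidal encoding of acquisition time, one period
    per day at the base frequency (a hypothetical form of the paper's
    'fine-grained temporal encodings')."""
    phase = minute_of_day / 1440 * 2 * math.pi
    return [
        math.sin(phase * (k // 2 + 1)) if k % 2 == 0
        else math.cos(phase * (k // 2 + 1))
        for k in range(dim)
    ]

rng = random.Random(0)
visible, masked = mae_mask(196, 0.75, rng)
print(len(visible), len(masked))   # 49 visible patches, 147 to reconstruct
```

With a 75% mask ratio the encoder only ever sees a quarter of each frame, which is what makes pretraining on 2 TB of imagery tractable.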
[282] GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction
Yedong Shen, Shiqi Zhang, Sha Zhang, Yifan Duan, Xinran Zhang, Wenhao Yu, Lu Zhang, Jiajun Deng, Yanyong Zhang
Main category: cs.CV
TL;DR: GA-GS uses generation-assisted Gaussian splatting with diffusion models to reconstruct static 3D scenes from monocular videos containing dynamic objects, addressing occlusion challenges through authenticity-aware rendering.
Details
Motivation: Current static scene reconstruction methods rely on visible background and struggle with regions occluded by dynamic objects, limiting applications in VR and autonomous driving where complete static scene understanding is crucial.
Method: 1) Motion-aware module segments/removes dynamic regions, 2) Diffusion model inpaints occluded areas for pseudo-ground-truth, 3) Learnable authenticity scalar per Gaussian primitive balances real vs. generated contributions during splatting, 4) Novel Trajectory-Match dataset enables quantitative evaluation.
Result: GA-GS achieves state-of-the-art performance on DAVIS and the new Trajectory-Match dataset, especially excelling in challenging scenarios with large-scale, persistent occlusions where previous methods fail.
Conclusion: Generation-assisted reconstruction with authenticity-aware Gaussian splatting effectively addresses occlusion challenges in static scene reconstruction from videos with dynamic objects, enabling more complete scene understanding.
Abstract: Reconstructing a static 3D scene from monocular video with dynamic objects is important for numerous applications such as virtual reality and autonomous driving. Current approaches typically rely on the visible background for static scene reconstruction, limiting the ability to recover regions occluded by dynamic objects. In this paper, we propose GA-GS, a Generation-Assisted Gaussian Splatting method for static scene reconstruction. The key innovation of our work lies in leveraging generation to assist in reconstructing occluded regions. We employ a motion-aware module to segment and remove dynamic regions, and then use a diffusion model to inpaint the occluded areas, providing pseudo-ground-truth supervision. To balance contributions from the real background and generated regions, we introduce a learnable authenticity scalar for each Gaussian primitive, which dynamically modulates opacity during splatting for authenticity-aware rendering and supervision. Since no existing dataset provides ground-truth static scenes for videos with dynamic objects, we construct a dataset named Trajectory-Match, using a fixed-path robot to record each scene with/without dynamic objects, enabling quantitative evaluation of occluded-region reconstruction. Extensive experiments on both DAVIS and our dataset show that GA-GS achieves state-of-the-art performance in static scene reconstruction, especially in challenging scenarios with large-scale, persistent occlusions.
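The authenticity-aware rendering idea can be illustrated with a toy front-to-back compositing loop, where each Gaussian's opacity is modulated by a sigmoid of its learnable authenticity scalar. This is a hypothetical one-channel sketch of the mechanism, not the actual 3DGS rasterizer:

```python
import math

def composite(gaussians):
    """Front-to-back alpha compositing along one ray. Each entry is
    (color, opacity, authenticity); the effective opacity is
    opacity * sigmoid(authenticity), so generated (inpainted) primitives
    with low authenticity contribute less than real background ones."""
    color, transmittance = 0.0, 1.0
    for c, opacity, authenticity in gaussians:   # sorted near-to-far
        a = opacity * (1.0 / (1.0 + math.exp(-authenticity)))
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color
```

With a strongly positive authenticity, the effective opacity approaches the raw opacity (a real-background Gaussian renders normally); with a strongly negative one, the primitive becomes nearly transparent, which is how the scalar can down-weight hallucinated content during supervision.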
[283] Spatially-Weighted CLIP for Street-View Geo-localization
Ting Han, Fengjiao Li, Chunsong Chen, Haoling Huang, Yiping Chen, Meiliu Wu
Main category: cs.CV
TL;DR: SW-CLIP enhances street-view geo-localization by incorporating spatial autocorrelation into CLIP through distance-aware soft supervision and neighborhood consistency regularization.
Details
Motivation: Conventional CLIP-based geo-localization methods treat all non-matching samples as equally negative, ignoring geographic relationships and spatial autocorrelation principles from geography.
Method: Introduces location-as-text representation to encode geographic positions, replaces one-hot InfoNCE targets with spatially weighted soft labels based on geodesic distance, and adds neighborhood-consistency regularization to preserve local spatial structure.
Result: Significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP on multi-city datasets.
Conclusion: Shifting from semantic alignment to geographic alignment is crucial for robust geo-localization, providing a general paradigm for integrating spatial principles into multimodal representation learning.
Abstract: This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler’s First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
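The distance-aware soft supervision is easy to make concrete: compute geodesic distances from the anchor location to every candidate, and turn them into a probability distribution that replaces the one-hot InfoNCE target. The exponential kernel and the 50 km temperature below are assumptions for illustration; the paper's exact weighting may differ:

```python
import math

def geodesic_km(p, q):
    """Haversine great-circle distance between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def spatial_soft_labels(anchor, candidates, tau=50.0):
    """Distance-aware soft targets: nearby candidates receive probability
    mass instead of being treated as fully negative (Tobler's First Law).
    `tau` is an assumed temperature in kilometres."""
    w = [math.exp(-geodesic_km(anchor, c) / tau) for c in candidates]
    s = sum(w)
    return [x / s for x in w]

anchor = (40.0, -3.7)                                  # near Madrid
cands = [(40.0, -3.7), (40.4, -3.6), (48.9, 2.4)]      # self, nearby, Paris
print(spatial_soft_labels(anchor, cands))
```

The exact match dominates, the ~45 km neighbour keeps a meaningful share, and the ~1000 km candidate gets essentially zero, which is precisely the gradient structure a one-hot target would discard.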
[284] Integer-Only Operations on Extreme Learning Machine Test Time Classification
Emerson Lopes Machado, Cristiano Jacques Miosso, Ricardo Pezzuol Jacobi
Main category: cs.CV
TL;DR: Paper presents techniques to reduce computational cost of Extreme Learning Machine classifiers at test time using integer-only operations without accuracy loss, validated on 5 computer vision datasets.
Details
Motivation: To enable efficient deployment of network classifiers in embedded systems and data centers where power consumption is critical, by reducing computational overhead of test-time operations.
Method: Three main techniques: (1) using ternary weights (-1, 0, 1) to eliminate multiplications, (2) proving equivalence of normalized/non-normalized test signals, (3) creating integer versions of output weights. Applied to Extreme Learning Machine classifiers.
Result: Techniques allow classification using solely integer operations with limited accuracy reduction, tested on 5 computer vision datasets. Enables computational cost reduction for FPGA deployment.
Conclusion: Proposed integer-only operations for ELM classifiers significantly reduce computational cost for test-time inference, making them suitable for power-constrained embedded applications and data centers.
Abstract: We present a theoretical analysis and empirical evaluations of a novel set of techniques for computational cost reduction of test-time operations of network classifiers based on the extreme learning machine (ELM). By exploiting characteristics we derived from these models, we show that classification at test time can be performed using solely integer operations without compromising the classification accuracy. Our contributions are as follows: (i) We show empirical evidence that the input weight values can be drawn from the ternary set with limited reduction of the classification accuracy. This has the computational advantage of dismissing multiplications; (ii) We prove the classification accuracy of normalized and non-normalized test signals is the same; (iii) We show how to create an integer version of the output weights that results in a limited reduction of the classification accuracy. We tested our techniques on 5 computer vision datasets commonly used in the literature, and the results indicate that our techniques allow a reduction of the computational cost of the operations necessary for classification at test time on FPGAs. This is important in embedded applications, where power consumption is limited, and crucial in data centers of large corporations, where power consumption is expensive.
[285] Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao
Main category: cs.CV
TL;DR: G2F-RAG is a training-free framework that enhances video reasoning by representing external knowledge as visual frames rather than text, reducing cognitive load and improving interpretability.
Details
Motivation: Current retrieval-augmented video reasoning systems force heterogeneous signals (textual evidence, multi-clip evidence) into a single attention space, causing diluted attention and higher cognitive load. The bottleneck is not just what to retrieve but how to represent and fuse external knowledge with video backbones.
Method: Uses a two-stage approach: (1) Offline stage builds a problem-agnostic video knowledge graph integrating entities, events, spatial relations, and world knowledge; (2) Online stage uses hierarchical multi-agent controller to decide if external knowledge is needed, retrieves minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video for unified visual domain reasoning.
Result: Consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Plug-and-play across backbones and scales, reduces cognitive load, and provides explicit, inspectable evidence trail.
Conclusion: G2F-RAG reframes retrieval as visual space knowledge fusion for robust and interpretable video reasoning, demonstrating that knowledge representation and delivery matter significantly for multimodal reasoning systems.
Abstract: When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone. We present Graph-to-Frame RAG (G2F-RAG), a training-free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual-space knowledge fusion for robust and interpretable video reasoning.
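The retrieval of a "minimal sufficient subgraph" can be pictured as a bounded-hop traversal from the entities mentioned in the question, keeping only the nodes and edges reachable within a small radius. A toy sketch over a dict-based graph, with node and relation names invented for illustration:

```python
from collections import deque

def minimal_subgraph(graph, seeds, hops=1):
    """BFS from the question-linked seed entities, collecting only the
    nodes and edges within `hops` steps -- a sketch of retrieving a
    minimal sufficient subgraph before it is rendered as a reasoning
    frame. Graph schema: node -> [(neighbor, relation), ...]."""
    keep_nodes = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    edges = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue                     # stop expanding at the radius
        for nbr, relation in graph.get(node, []):
            edges.append((node, relation, nbr))
            if nbr not in keep_nodes:
                keep_nodes.add(nbr)
                frontier.append((nbr, depth + 1))
    return keep_nodes, edges

kg = {
    "referee": [("match", "officiates")],
    "match": [("stadium", "held_in")],
    "stadium": [("city", "located_in")],
}
nodes, edges = minimal_subgraph(kg, ["referee"], hops=2)
print(sorted(nodes), edges)
```

Bounding the radius is what keeps the retrieved evidence small enough to render into a single appended frame instead of flooding the attention space with text.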
[286] Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
Songyuan Yang, Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao
Main category: cs.CV
TL;DR: RLER introduces a dual paradigm for video reasoning LMMs that decouples learning to produce evidence from obtaining reliable answers, using RL training with novel rewards and evidence-weighted election during inference.
Details
Motivation: Current video reasoning with large multimodal models often uses single-pass inference without verifying if reasoning is evidence-aligned, lacking reliability and interpretability.
Method: Two-phase approach: RLER-Training uses group-relative RL with three novel rewards (frame-sensitive, think-transparency, anti-repetition) to teach structured evidence production. RLER-Inference uses a train-free orchestrator to generate diverse candidates, parse answers/frames, score by evidence consistency, and perform evidence-weighted election.
Result: Achieves state-of-the-art across 8 benchmarks with average 6.3% improvement over base models, using only 3.1 candidates per question for favorable compute-quality balance.
Conclusion: Making evidence explicit during learning and electing by evidence during inference provides a robust path to trustworthy video reasoning without model enlargement.
Abstract: Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and three novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state-of-the-art performance across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
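The evidence-weighted election can be sketched as a vote in which each candidate's weight combines its self-reported confidence with how consistently its cited frames recur across the candidate set, as a proxy for evidence consistency. This is a simplified stand-in; the paper also scores transparency and non-redundancy:

```python
from collections import Counter, defaultdict

def evidence_weighted_election(candidates):
    """Elect a final answer from diverse reasoning candidates. Each
    candidate is {"answer", "frames", "confidence"}; its vote weight is
    confidence times the total recurrence of its cited frames across
    all candidates (frames many candidates cite count as stronger
    evidence)."""
    frame_counts = Counter(f for c in candidates for f in set(c["frames"]))
    votes = defaultdict(float)
    for c in candidates:
        support = sum(frame_counts[f] for f in set(c["frames"]))
        votes[c["answer"]] += c["confidence"] * support
    return max(votes, key=votes.get)

cands = [
    {"answer": "A", "frames": [12, 30], "confidence": 0.6},
    {"answer": "B", "frames": [77],     "confidence": 0.9},
    {"answer": "A", "frames": [12],     "confidence": 0.5},
]
print(evidence_weighted_election(cands))   # frame 12 recurs → "A" wins
```

Note that "B" loses despite the highest raw confidence, because its cited frame is uncorroborated; that is the point of electing by evidence rather than by confidence alone.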
[287] UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
Pei Yang, Hai Ci, Beibei Lin, Yiren Song, Mike Zheng Shou
Main category: cs.CV
TL;DR: UENR-600K is a large-scale synthetic nighttime video deraining dataset with 600K 1080p frame pairs, created using Unreal Engine 3D particle simulation to capture realistic rain physics like color refractions and scene occlusions, enabling better generalization to real-world nighttime rain.
Details
Motivation: Nighttime video deraining is challenging due to rain's interaction with artificial lighting, causing colored rain and local illumination effects. Existing synthetic datasets use 2D overlays that fail to capture these physical properties, leading to poor real-world generalization, while capturing real paired data is impractical.
Method: Created UENR-600K dataset using Unreal Engine to simulate rain as 3D particles in virtual environments, ensuring photorealism and physical accuracy. Adapted Wan 2.2 video generation model for deraining as a video-to-video generation task, leveraging generative priors.
Result: Models trained on UENR-600K generalize significantly better to real-world nighttime videos. The adapted Wan 2.2 baseline almost entirely bridges the sim-to-real gap, establishing new state-of-the-art performance for nighttime video deraining.
Conclusion: Physically grounded synthetic datasets created with 3D simulation engines can effectively address the challenges of nighttime video deraining, enabling better generalization to real-world scenarios through high-quality training data and generative modeling approaches.
Abstract: Nighttime video deraining is uniquely challenging because raindrops interact with artificial lighting. Unlike daytime white rain, nighttime rain takes on various colors and appears locally illuminated. Existing small-scale synthetic datasets rely on 2D rain overlays and fail to capture these physical properties, causing models to generalize poorly to real-world night rain. Meanwhile, capturing real paired nighttime videos remains impractical because rain effects cannot be isolated from other degradations like sensor noise. To bridge this gap, we introduce UENR-600K, a large-scale, physically grounded dataset containing 600,000 1080p frame pairs. We utilize Unreal Engine to simulate rain as 3D particles within virtual environments. This approach guarantees photorealism and physically real raindrops, capturing details such as color refractions, scene occlusions, and rain curtains correctly. Leveraging this high-quality data, we establish a new state-of-the-art baseline by adapting the Wan 2.2 video generation model. Our baseline treats deraining as a video-to-video generation task, exploiting strong generative priors to almost entirely bridge the sim-to-real gap. Extensive benchmarking demonstrates that models trained on our dataset generalize significantly better to real-world videos. Project page: https://showlab.github.io/UENR-600K/.
[288] 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
Ze-Xin Yin, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie
Main category: cs.CV
TL;DR: 3D-Fixer introduces an in-place completion paradigm for compositional 3D scene generation from single views, using fragmented geometry as spatial anchors to preserve layout fidelity without explicit pose optimization.
Details
Motivation: Existing approaches for 3D scene generation from single views either have poor generalization to complex scenes (feed-forward methods) or suffer from time-consuming pose optimization (per-instance methods). There's a need for a method that balances efficiency and generalization.
Method: 3D-Fixer uses fragmented geometry from geometry estimation as spatial anchors, generates complete 3D assets conditioned on partially visible point clouds at original locations, employs coarse-to-fine generation with dual-branch conditioning network and Occlusion-Robust Feature Alignment (ORFA), and introduces ARSG-110K dataset for training.
Result: Achieves state-of-the-art geometric accuracy, significantly outperforming baselines like MIDI and Gen3DSR while maintaining diffusion process efficiency.
Conclusion: 3D-Fixer bridges the gap between feed-forward and per-instance methods, offering both efficiency and generalization for compositional 3D scene generation from single views.
Abstract: Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point cloud at the original locations, which are cropped from the fragmented geometry obtained from the geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, which significantly outperforms baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process. Code and data will be publicly available at https://zx-yin.github.io/3dfixer.
[289] BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
Kaiwen Wang, Kaili Zheng, Rongrong Deng, Yiming Shi, Chenyi Guo, Ji Wu
Main category: cs.CV
TL;DR: BoxComm: A large-scale dataset for boxing commentary generation with structured taxonomy and novel evaluation metrics for multimodal LLMs.
Details
Motivation: Existing sports commentary benchmarks focus only on team sports, leaving combat sports unexplored despite their unique challenges of millisecond-scale subtle actions and higher tactical analysis requirements.
Method: Created BoxComm dataset with 445 boxing match videos and 52K commentary sentences, proposed structured commentary taxonomy (play-by-play, tactical, contextual), and introduced two novel evaluations: category-conditioned generation and commentary rhythm assessment.
Result: Current MLLMs struggle on both evaluations. Proposed EIC-Gen baseline incorporating detected punch events shows consistent improvements, highlighting the importance of perceiving fleeting subtle events.
Conclusion: BoxComm addresses the gap in combat sports commentary benchmarks and reveals limitations of current MLLMs in handling subtle, fast-paced actions and structured commentary generation.
Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in general video understanding, driving growing interest in automatic sports commentary generation. However, existing benchmarks for this task focus exclusively on team sports such as soccer and basketball, leaving combat sports entirely unexplored. Notably, combat sports present distinct challenges: critical actions unfold within milliseconds with visually subtle yet semantically decisive differences, and professional commentary contains a substantially higher proportion of tactical analysis compared to team sports. In this paper, we present BoxComm, a large-scale dataset comprising 445 World Boxing Championship match videos with over 52K commentary sentences from professional broadcasts. We propose a structured commentary taxonomy that categorizes each sentence into play-by-play, tactical, or contextual, providing the first category-level annotation for sports commentary benchmarks. Building on this taxonomy, we introduce two novel and complementary evaluations tailored to sports commentary generation: (1) category-conditioned generation, which evaluates whether models can produce accurate commentary of a specified type given video context; and (2) commentary rhythm assessment, which measures whether freely generated commentary exhibits appropriate temporal pacing and type distribution over continuous video segments, capturing a dimension of commentary competence that prior benchmarks have not addressed. Experiments on multiple state-of-the-art MLLMs reveal that current models struggle on both evaluations. We further propose EIC-Gen, an improved baseline incorporating detected punch events to supply structured action cues, yielding consistent gains and highlighting the importance of perceiving fleeting and subtle events for combat sports commentary.
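A rhythm assessment of the kind the benchmark introduces can be sketched as two statistics over freely generated commentary: mean pacing between consecutive comments, and the total-variation distance between the generated category mix and a reference distribution over the play-by-play/tactical/contextual taxonomy. Metric form and names are illustrative, not the benchmark's exact protocol:

```python
from collections import Counter

def rhythm_score(comments, reference_dist):
    """Commentary-rhythm sketch. `comments` is a list of
    (timestamp_seconds, category) pairs; `reference_dist` maps each
    taxonomy category to its reference proportion (assumed to cover
    all categories). Returns (mean gap between comments, total-variation
    distance of the category mix from the reference)."""
    times = sorted(t for t, _ in comments)
    gaps = [b - a for a, b in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    counts = Counter(cat for _, cat in comments)
    n = len(comments)
    tv = 0.5 * sum(abs(counts.get(cat, 0) / n - p)
                   for cat, p in reference_dist.items())
    return mean_gap, tv

reference = {"play-by-play": 0.5, "tactical": 0.25, "contextual": 0.25}
generated = [(0, "play-by-play"), (4, "play-by-play"),
             (8, "tactical"), (12, "contextual")]
print(rhythm_score(generated, reference))   # → (4.0, 0.0)
```

A model that narrates in bursts or over-produces one category would show up as a large mean-gap variance or a nonzero TV distance, which is exactly the competence dimension the paper argues prior benchmarks miss.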
[290] HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
Green Rosh, Prateek Kukreja, Vishakha SR, Pawan Prasad B H
Main category: cs.CV
TL;DR: HandDreamer: First zero-shot 3D hand model generation from text prompts using MANO initialization, skeleton-guided diffusion, and corrective shape guidance to overcome SDS limitations for hands.
Details
Motivation: Current 3D hand model generation methods are expensive, cumbersome, and lack customizability. Existing zero-shot text-to-3D synthesis methods using Score Distillation Sampling (SDS) fail for hands due to unnatural structures, view inconsistencies, and loss of details caused by ambiguity in probability landscapes and large pose variations.
Method: Proposes HandDreamer with three key components: 1) MANO hand model initialization for strong structural prior, 2) hand skeleton-guided diffusion process for view and pose consistency, and 3) novel corrective hand shape guidance loss to ensure view-consistent convergence without geometric distortions.
Result: Extensive evaluations demonstrate superiority over state-of-the-art methods, showing improved hand structure, view consistency, and detail preservation compared to existing zero-shot 3D generation approaches.
Conclusion: HandDreamer paves a new way forward in 3D hand model generation by addressing fundamental limitations of SDS for hands through structural priors and guidance mechanisms, enabling customizable zero-shot generation from text prompts.
Abstract: The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view inconsistencies, and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view inconsistency in SDS is primarily caused by the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converge to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.
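For context, the Score Distillation Sampling objective the paper builds on (in its standard DreamFusion form; HandDreamer's skeleton conditioning and corrective shape guidance modify the conditions under which it is evaluated) updates the 3D parameters $\theta$ with the gradient

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

where $x$ is a rendered view of the 3D model, $x_t$ its noised version at timestep $t$, $y$ the text prompt, $\epsilon$ the injected noise, $w(t)$ a weighting schedule, and $\hat{\epsilon}_\phi$ the frozen diffusion model's noise prediction. The paper's diagnosis is that because $y$ alone describes a multi-modal distribution over hand articulations, different rendered views can be pulled toward different modes of $\hat{\epsilon}_\phi$, which is what the MANO initialization and skeleton guidance are designed to disambiguate.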
[291] Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
Weihao Cao, Runqi Wang, Xiaoyue Duan, Jinchao Zhang, Ang Yang, Liping Jing
Main category: cs.CV
TL;DR: HSA-DINO: A parameter-efficient semantic augmentation framework for open-vocabulary object detection that addresses domain shift issues through hierarchical semantic prompts and dynamic routing.
Details
Motivation: Existing OVOD methods perform well on general datasets but degrade significantly when transferred to downstream tasks with domain shifts, due to scarce category labels and inability to capture auxiliary semantics beyond coarse-grained labels.
Method: Proposes multi-scale prompt bank using image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, plus a semantic-aware router that dynamically selects appropriate augmentation strategies during inference without degrading pre-trained model generalization.
Result: HSA-DINO performs favorably against previous state-of-the-art methods on OV-COCO, several vertical domain datasets, and modified benchmark settings, achieving superior trade-off between domain adaptability and open-vocabulary generalization.
Conclusion: The framework effectively addresses domain shift issues in OVOD through hierarchical semantic augmentation and dynamic routing, maintaining strong generalization while improving domain adaptation.
Abstract: Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific tasks, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.
[292] Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Hao Liu, Ye Huang, Chenghuan Huang, Zhenyi Zheng, Jiangsu Du, Ziyang Ma, Jing Lyu, Yutong Lu
Main category: cs.CV
TL;DR: Chorus is a caching approach that accelerates video diffusion model serving by exploiting similarity across requests, achieving up to 45% speedup on industrial 4-step distilled models where prior intra-request caching approaches fail.
Details
Motivation: Video Diffusion Transformer (DiT) models produce high-quality video generation but suffer from high inference costs due to iterative denoising. Existing caching approaches only exploit similarity within single requests, missing opportunities for cross-request optimization.
Method: Chorus employs a three-stage caching strategy: Stage 1 performs full reuse of latent features from similar requests; Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps, combined with Token-Guided Attention Amplification to improve semantic alignment between generated video and conditional prompts.
Result: Chorus achieves up to 45% speedup on industrial 4-step distilled video diffusion models, where prior intra-request caching approaches are ineffective.
Conclusion: Cross-request caching through Chorus significantly accelerates video diffusion model serving by exploiting similarity across requests, making high-quality video generation more efficient and practical for deployment.
Abstract: Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Particularly, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
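The core cross-request idea can be sketched as a similarity-gated latent cache. Everything below (class name, the cosine-similarity key, the 0.9 threshold) is invented for illustration; Chorus's staged partial reuse and attention amplification are far more involved.

```python
import numpy as np

class InterRequestCache:
    """Toy inter-request latent cache: reuse stored latents from the most
    similar past request when cosine similarity clears a threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.keys, self.values = [], []

    def lookup(self, emb):
        # Return the cached latents of the closest past request, or None.
        best, best_sim = None, self.threshold
        for k, v in zip(self.keys, self.values):
            sim = float(k @ emb / (np.linalg.norm(k) * np.linalg.norm(emb)))
            if sim >= best_sim:
                best, best_sim = v, sim
        return best

    def store(self, emb, latents):
        self.keys.append(emb)
        self.values.append(latents)
```

A hit skips the early denoising steps entirely (the "full reuse" of Stage 1); a miss falls back to ordinary generation and populates the cache.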
[293] Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning
Ryuki Tezuka, Chihiro Nakatani, Norimichi Ukita
Main category: cs.CV
TL;DR: Proposes unsupervised Group Activity Feature learning using dynamics-aware and group-aware pretext tasks with DINO features, achieving SOTA in group activity retrieval/recognition.
Details
Motivation: Prior work uses low-level static local features for group activity understanding, lacking dynamics and group context. Need to learn group activity features without annotations by leveraging motion and scene context.
Method: Uses DINO features (local/global) with two pretext tasks: 1) person flow estimation for local motion dynamics, 2) group-relevant object location estimation for global scene context. Adapts DINO for group-dynamics-aware feature learning.
Result: State-of-the-art performance on public datasets for group activity retrieval and recognition. Ablation studies confirm effectiveness of each component.
Conclusion: Proposed unsupervised method effectively learns group activity features by combining motion dynamics and scene context through pretext tasks, outperforming previous approaches.
Abstract: This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: https://github.com/tezuka0001/Group-DINOmics.
[294] Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model’s Robustness to Natural Semantic Variation Across Diverse Tasks
Jia Chengyu, AprilPyone MaungMaung, Huy H. Nguyen, Jinyin Chen, Isao Echizen
Main category: cs.CV
TL;DR: Systematic evaluation of vision-language models (VLMs) under natural adversarial scenarios reveals vulnerabilities in zero-shot transfer across tasks like image classification, segmentation, and VQA.
Details
Motivation: While VLMs show impressive zero-shot capabilities, comprehensive evaluation beyond standard benchmarks is needed to understand their robustness, limitations, and real-world applicability, especially under natural adversarial scenarios that have been overlooked in previous evaluations.
Method: Developed a systematic evaluation framework for VLMs under natural adversarial scenarios. Evaluated CLIP, robust CLIP, BLIP2, and SigLIP2 on curated adversarial datasets including typographic attacks, ImageNet-A, and natural language-induced adversarial examples. Measured performance on zero-shot image classification, semantic segmentation, and visual question answering.
Result: Revealed that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language-induced adversarial examples. Provided interpretable analyses to identify failure modes.
Conclusion: The findings highlight critical vulnerabilities in current VLMs under natural adversarial scenarios, emphasizing the need for more robust and fair multimodal pattern recognition systems. The work aims to inspire future research in this direction.
Abstract: Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language-induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.
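The mechanism the audit probes is standard CLIP-style zero-shot classification, sketched below with plain numpy (the function name is ours; real evaluations use model-specific encoders). Typographic attacks work precisely by nudging the image embedding toward the wrong class's text embedding.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: pick the class whose text
    embedding is most cosine-similar to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```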
[295] MVis-Fold: A Three-Dimensional Microvascular Structure Inference Model for Super-Resolution Ultrasound
Jincao Yao, Ke Zhang, Yahan Zhou, Jiafei Shen, Jie Liu, Mudassar Ali, Bojian Feng, Jiye Chen, Jinlong Fan, Ping Liang, Dong Xu
Main category: cs.CV
TL;DR: A 3D microvascular reconstruction model called MVis-Fold that uses cross-scale network architecture to reconstruct 3D microvascular networks from 2D super-resolution ultrasound images.
Details
Motivation: Super-resolution ultrasound enables micrometer-scale imaging of microvasculature but faces challenges in 3D reconstruction from 2D images due to imaging principles. There's a need for accurate 3D reconstruction methods for quantitative analysis of microvascular networks.
Method: Developed MVis-Fold, an innovative 3D microvascular reconstruction model with cross-scale network architecture that performs high-fidelity inference and reconstruction of 3D microvascular networks from 2D SRUS images.
Result: The model accurately reconstructs 3D microvascular networks and calculates key 3D parameters that traditional 2D SRUS cannot obtain. Validated accuracy and reliability in 3D microvascular reconstruction of solid tumors.
Conclusion: Establishes foundation for 3D quantitative analysis of microvasculature and provides new tools for diagnosis and monitoring of various diseases through improved 3D visualization of microvascular networks.
Abstract: Super-resolution ultrasound (SRUS) technology has overcome the resolution limitations of conventional ultrasound, enabling micrometer-scale imaging of microvasculature. However, due to the nature of imaging principles, three-dimensional reconstruction of microvasculature from SRUS remains an open challenge. We developed microvascular visualization fold (MVis-Fold), an innovative three-dimensional microvascular reconstruction model that integrates a cross-scale network architecture. This model can perform high-fidelity inference and reconstruction of three-dimensional microvascular networks from two-dimensional SRUS images. It precisely calculates key parameters in three-dimensional space that traditional two-dimensional SRUS cannot readily obtain. We validated the model’s accuracy and reliability in three-dimensional microvascular reconstruction of solid tumors. This study establishes a foundation for three-dimensional quantitative analysis of microvasculature. It provides new tools and methods for diagnosis and monitoring of various diseases.
[296] Training-Free Image Editing with Visual Context Integration and Concept Alignment
Rui Song, Guo-Hua Wang, Qing-Guo Chen, Weihua Luo, Tongda Xu, Zhening Liu, Yan Wang, Zehong Lin, Jun Zhang
Main category: cs.CV
TL;DR: VicoEdit is a training-free, inversion-free method for visual context-aware image editing that directly transforms source images using visual context without diffusion inversion, achieving state-of-the-art performance.
Details
Motivation: Existing visual context-aware image editing methods either require costly data collection and training, or rely on diffusion inversion which suffers from consistency and flexibility issues. There's a need for a training-free approach that avoids inversion problems while maintaining editing quality.
Method: VicoEdit directly transforms source images into target images based on visual context without using diffusion inversion. It employs a posterior sampling approach guided by concept alignment to enhance editing consistency, working with pretrained text-prompted editing models.
Result: Empirical results show VicoEdit achieves better editing performance than state-of-the-art training-based models, despite being training-free. It effectively handles visual context injection while maintaining consistency.
Conclusion: VicoEdit provides an effective training-free solution for visual context-aware image editing that eliminates inversion problems and outperforms existing methods, offering a practical approach for incorporating visual context into editing workflows.
Abstract: In image editing, it is essential to incorporate a context image to convey the user’s precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur data collection effort and training cost. On the other hand, the training-free alternatives are typically established on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method to inject the visual context into the pretrained text-prompted editing model. More specifically, VicoEdit directly transforms the source image into the target one based on the visual context, thereby eliminating the need for inversion that can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to enhance the editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than the state-of-the-art training-based models.
[297] A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
Tianmeng Fang, Yong Wang, Zetai Kong, Zengzhen Su, Jun Wang, Chengjin Yu, Wei Wang
Main category: cs.CV
TL;DR: A defense framework against backdoor attacks in multimodal LLMs using patch augmentation and cross-view regularity to suppress attack success while preserving normal generation capabilities.
Details
Motivation: Multimodal LLMs are vulnerable to backdoor attacks during fine-tuning, where triggers cause harmful responses. Defense must balance suppressing attacks with maintaining normal performance, which are conflicting objectives.
Method: Proposes unified defense with patch-level data augmentation and cross-view output difference regularization. Exploits backdoor invariance to non-semantic perturbations, pulling apart original and perturbed view outputs. Adds output entropy constraints to prevent over-suppression.
Result: Experiments across 3 models, 2 tasks, and 6 attacks show effective reduction of attack success rate while maintaining high normal text generation capability.
Conclusion: Enables secure deployment of multimodal models in low-frequency poisoning and covert triggering scenarios by balancing attack suppression with performance preservation.
Abstract: Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker’s predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model’s normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model’s anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability. Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.
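The two opposing terms can be written down directly: reward divergence between the original and patch-perturbed output distributions (backdoored outputs are abnormally invariant to such perturbations), with an entropy floor guarding against over-suppression of benign generation. Weights, the KL form, and the floor value are our illustrative choices, not the paper's exact loss.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between two discrete distributions.
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def cross_view_defense_loss(p_orig, p_patched, lam=1.0, mu=0.1, h_min=0.5):
    """Lower is better: pulling the two views apart reduces the loss,
    while collapsing p_orig to a low-entropy spike incurs a penalty."""
    separation = -kl(p_orig, p_patched)          # maximize divergence
    ent = float(-np.sum((p_orig + 1e-12) * np.log(p_orig + 1e-12)))
    entropy_penalty = max(0.0, h_min - ent)      # entropy constraint
    return lam * separation + mu * entropy_penalty
```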
[298] The Indra Representation Hypothesis for Multimodal Alignment
Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, Yun Fu
Main category: cs.CV
TL;DR: The Indra Representation Hypothesis proposes that unimodal foundation models learn convergent relational representations that reflect shared underlying reality structures, formalized using category theory’s Yoneda embedding for cross-modal alignment.
Details
Motivation: Unimodal foundation models show convergent representations but these are limited as independent abstractions. The paper aims to capture the implicit relational structure underlying reality that these models are converging toward, inspired by Indra's Net philosophical metaphor.
Method: Formalizes the Indra Representation Hypothesis using V-enriched Yoneda embedding from category theory, defining relational profiles of samples. Instantiates with angular distance and evaluates in cross-model/cross-modal scenarios involving vision, language, and audio.
Result: Extensive experiments show Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a training-free alignment framework for unimodal foundation models.
Conclusion: The Indra representation offers a theoretically grounded, practical framework for aligning unimodal foundation models without training, capturing shared relational structures across modalities.
Abstract: Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra’s Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra’s Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
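The angular-distance instantiation is concrete enough to sketch: each sample is represented not by its own embedding but by its distances to a shared anchor set, making profiles from different models or modalities directly comparable. Function names are ours; the paper's full Yoneda formalism is not reproduced.

```python
import numpy as np

def angular_distance(a, b):
    """Angular distance in [0, 1]: arccos of cosine similarity over pi."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.arccos(np.clip(a @ b.T, -1.0, 1.0)) / np.pi

def indra_profile(samples, anchors):
    """Relational profile: row i holds sample i's angular distances to
    every anchor, a discrete stand-in for the Yoneda embedding."""
    return angular_distance(samples, anchors)
```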
[299] Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou
Main category: cs.CV
TL;DR: Saliency-R1 improves VLM trustworthiness by aligning visual attention with reasoning through saliency maps and reinforcement learning, enhancing interpretability and faithfulness.
Details
Motivation: Address concerns about VLMs' trustworthiness, particularly their tendency to over-rely on textual cues rather than visual evidence, and the risk of producing ungrounded or fabricated responses.
Method: Proposes a novel saliency map technique to highlight critical image regions contributing to generated tokens without extra computation. Uses overlap between saliency maps and human-annotated bounding boxes as reward function, applying Group Relative Policy Optimization (GRPO) to align salient parts with critical regions.
Result: Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
Conclusion: The framework successfully enhances VLM trustworthiness by making reasoning more visually-grounded and interpretable through saliency alignment.
Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
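The reward is the easy part to make concrete. One plausible instantiation of "overlap between saliency maps and bounding boxes" is the fraction of saliency mass falling inside the box; the paper's exact overlap measure may differ, and the function name is ours.

```python
import numpy as np

def saliency_alignment_reward(saliency, box):
    """Reward in [0, 1]: fraction of saliency mass inside the
    human-annotated box. `box` is (x0, y0, x1, y1), exclusive ends."""
    x0, y0, x1, y1 = box
    total = saliency.sum()
    if total <= 0:
        return 0.0
    return float(saliency[y0:y1, x0:x1].sum() / total)
```

In GRPO, such a reward would be computed per sampled response and normalized within the group before the policy update.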
[300] MedROI: Codec-Agnostic Region of Interest-Centric Compression for Medical Images
Jiwon Kim, Ikbeom Jang
Main category: cs.CV
TL;DR: MedROI is a codec-agnostic framework for medical image compression that removes non-diagnostic background before compression, improving compression ratios and speed while maintaining reconstruction quality within the region of interest.
Details
Motivation: Medical imaging archives are growing rapidly in size and resolution, creating storage and transfer challenges. Existing codecs compress full images including non-diagnostic background, wasting bits on irrelevant data.
Method: MedROI extracts tissue bounding boxes via lightweight intensity-based thresholding, discards background voxels, stores 54-byte metadata for spatial restoration, then compresses the cropped ROI using any existing 2D/3D codec without modifications.
Result: On 200 T1-weighted brain MRI volumes, MedROI significantly improved compression ratios and encoding/decoding times for most codecs (JPEG2000 2D/3D, LIC_TCM, TCM+AuxT, BCM-Net, SirenMRI) while maintaining comparable ROI reconstruction quality.
Conclusion: MedROI provides a practical, codec-agnostic solution for efficient medical image compression by focusing only on diagnostically relevant regions, offering significant performance improvements without requiring codec modifications.
Abstract: Medical imaging archives are growing rapidly in both size and resolution, making efficient compression increasingly important for storage and data transfer. Most existing codecs compress full images/volumes (including non-diagnostic background) or apply differential ROI coding that still preserves background bits. We propose MedROI, a codec-agnostic, plug-and-play ROI-centric framework that discards background voxels prior to compression. MedROI extracts a tight tissue bounding box via lightweight intensity-based thresholding and stores a fixed 54-byte metadata record to enable spatial restoration during decompression. The cropped ROI is then compressed using any existing 2D or 3D codec without architectural modifications or retraining. We evaluate MedROI on 200 T1-weighted brain MRI volumes from ADNI using 6 codec configurations spanning conventional codecs (JPEG2000 2D/3D, HEIF) and neural compressors (LIC_TCM, TCM+AuxT, BCM-Net, SirenMRI). MedROI yields statistically significant improvements in compression ratio and encoding/decoding time for most configurations (two-sided t-test with multiple-comparison correction), while maintaining comparable reconstruction quality when measured within the ROI; HEIF is the primary exception in compression-ratio gains. For example, on JPEG2000 2D (lv3), MedROI improves CR from 20.35 to 27.37 while reducing average compression time from 1.701 s to 1.380 s. Code is available at https://github.com/labhai/MedROI.
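The crop-and-restore core is simple enough to sketch end to end. This is a minimal reading of the pipeline, assuming a fixed intensity threshold and zero-background restoration; the real 54-byte record would be a packed binary layout rather than a dict.

```python
import numpy as np

def crop_roi(volume, threshold=0.05):
    """Tight bounding box around above-threshold voxels, plus the
    metadata needed to restore the original shape at decompression."""
    coords = np.argwhere(volume > threshold)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1  # exclusive upper bound
    roi = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    meta = {"shape": volume.shape, "lo": tuple(lo), "hi": tuple(hi)}
    return roi, meta

def restore(roi, meta):
    """Place the ROI back into a zero background of the original shape."""
    out = np.zeros(meta["shape"], dtype=roi.dtype)
    lo, hi = meta["lo"], meta["hi"]
    out[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = roi
    return out
```

Only the cropped `roi` is handed to the codec; since reconstruction quality is measured within the ROI, the zeroed background costs nothing.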
[301] MPTF-Net: Multi-view Pyramid Transformer Fusion Network for LiDAR-based Place Recognition
Shuyuan Li, Zihang Wang, Xieyuanli Chen, Wenkai Zhu, Xiaoteng Fang, Peizhou Ni, Junhao Yang, Dong Kong
Main category: cs.CV
TL;DR: MPTF-Net: A multi-view multi-scale pyramid Transformer fusion network for LiDAR-based place recognition using NDT-based BEV encoding to capture fine-grained geometric structures.
Details
Motivation: Existing BEV representations for LiDAR place recognition use simple statistical aggregation that fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments.
Method: Proposes multi-channel NDT-based BEV encoding that models local geometric complexity and intensity distributions via Normal Distribution Transform, and a pyramid Transformer module that captures cross-view correlations between Range Image Views and NDT-BEV at multiple spatial scales.
Result: Achieves state-of-the-art performance with Recall@1 of 96.31% on nuScenes Boston split while maintaining inference latency of only 10.02 ms, suitable for real-time autonomous systems.
Conclusion: MPTF-Net effectively addresses limitations of conventional BEV representations by capturing fine-grained geometric structures through NDT-based encoding and multi-scale fusion, enabling robust place recognition in complex environments.
Abstract: LiDAR-based place recognition (LPR) is essential for global localization and loop-closure detection in large-scale SLAM systems. Existing methods typically construct global descriptors from Range Images or BEV representations for matching. BEV is widely adopted due to its explicit 2D spatial layout encoding and efficient retrieval. However, conventional BEV representations rely on simple statistical aggregation, which fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments. To address this, we propose MPTF-Net, a novel multi-view multi-scale pyramid Transformer fusion network. Our core contribution is a multi-channel NDT-based BEV encoding that explicitly models local geometric complexity and intensity distributions via Normal Distribution Transform, providing a noise-resilient structural prior. To effectively integrate these features, we develop a customized pyramid Transformer module that captures cross-view interactive correlations between Range Image Views (RIV) and NDT-BEV at multiple spatial scales. Extensive experiments on the nuScenes, KITTI and NCLT datasets demonstrate that MPTF-Net achieves state-of-the-art performance, specifically attaining a Recall@1 of 96.31% on the nuScenes Boston split while maintaining an inference latency of only 10.02 ms, making it highly suitable for real-time autonomous unmanned systems.
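The contrast with "simple statistical aggregation" is that each BEV cell stores a fitted distribution rather than a single count. The toy below fits a 1-D Gaussian to the heights in each cell; the paper's encoding is richer (full NDT with intensity statistics), so treat the channel layout and grid parameters as our assumptions.

```python
import numpy as np

def ndt_bev(points, grid=8, extent=10.0):
    """Toy multi-channel NDT-style BEV from an (N, 3) point cloud:
    per cell, emit (count, mean z, var z) channels."""
    bev = np.zeros((3, grid, grid))
    # Map x, y in [-extent, extent) to integer grid indices.
    ij = np.floor((points[:, :2] + extent) / (2 * extent) * grid).astype(int)
    ok = (ij >= 0).all(axis=1) & (ij < grid).all(axis=1)
    for i in range(grid):
        for j in range(grid):
            sel = ok & (ij[:, 0] == i) & (ij[:, 1] == j)
            if sel.any():
                zs = points[sel, 2]
                bev[:, i, j] = [sel.sum(), zs.mean(), zs.var()]
    return bev
```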
[302] StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%
Zheng Li, Jerry Cheng, Huanying Helen Gu
Main category: cs.CV
TL;DR: StableTTA is a training-free method that improves ensemble aggregation stability and efficiency for test-time augmentation, achieving significant accuracy gains with reduced computational costs.
Details
Motivation: Ensemble methods improve predictive performance but suffer from increased memory usage and computational complexity. There's a conflict in aggregation strategies that negatively impacts prediction stability, which needs to be addressed for efficient deployment on resource-constrained devices.
Method: Proposes StableTTA, a training-free method to improve aggregation stability and efficiency for test-time augmentation. The method addresses conflicts in aggregation strategies to enhance prediction stability without requiring additional training.
Result: On ImageNet-1K: gains of 10.93-32.82% in top-1 accuracy; 33 models achieve over 95% accuracy, several surpass 96%; lightweight architectures outperform ViT by 11.75% in top-1 accuracy while using <5% parameters and reducing computational cost by ~89.1% (in GFLOPs).
Conclusion: StableTTA enables high-accuracy inference on resource-constrained devices by improving ensemble aggregation stability and efficiency without training, making it practical for real-world deployment.
Abstract: Ensemble methods are widely used to improve predictive performance, but their effectiveness often comes at the cost of increased memory usage and computational complexity. In this paper, we identify a conflict in aggregation strategies that negatively impacts prediction stability. We propose StableTTA, a training-free method to improve aggregation stability and efficiency. Empirical results on ImageNet-1K show gains of 10.93–32.82% in top-1 accuracy, with 33 models achieving over 95% accuracy and several surpassing 96%. Notably, StableTTA allows lightweight architectures to outperform ViT by 11.75% in top-1 accuracy while using less than 5% of parameters and reducing computational cost by approximately 89.1% (in GFLOPs), enabling high-accuracy inference on resource-constrained devices.
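The "conflict in aggregation strategies" is easy to exhibit concretely: averaging per-augmentation predictions in logit space versus probability space can disagree on the argmax. The sketch below constructs such a case; how StableTTA actually resolves the conflict is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(logits, strategy="prob_mean"):
    """Aggregate per-augmentation logits of shape (A, C) into one
    class distribution, by either of two common TTA strategies."""
    if strategy == "logit_mean":
        return softmax(logits.mean(axis=0))
    return softmax(logits, axis=-1).mean(axis=0)
```

With one confident view and two mildly contrary views, logit averaging follows the confident view while probability averaging follows the majority, so the two strategies pick different classes.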
[303] Relational Epipolar Graphs for Robust Relative Camera Pose Estimation
Prateeth Rao, Sachit Rao
Main category: cs.CV
TL;DR: Graph-based relative pose estimation reformulates camera pose estimation as relational inference on epipolar correspondence graphs, outperforming classical and learning methods on robustness to noise and baseline variation.
Details
Motivation: Traditional VSLAM relative pose estimation faces challenges with noisy correspondences, while classical methods rely on stochastic sampling and learning-based approaches often lack explicit geometric structure. There's a need for more robust methods that combine geometric reasoning with learning.
Method: Reformulates relative pose estimation as relational inference over epipolar correspondence graphs where keypoints are nodes and nearby ones are connected. Uses graph operations (pruning, message passing, pooling) to estimate quaternion rotation, translation vector, and Essential Matrix. Employs LoFTR for dense detector-free matching and minimizes a multi-component loss function combining L2 differences, Frobenius norm, singular value differences, heading angle differences, and scale differences.
Result: Experiments on indoor and outdoor benchmarks demonstrate improved robustness to dense noise and large baseline variation compared to both classical methods and learning-guided approaches, showing the effectiveness of global relational consensus.
Conclusion: The graph-based relational inference approach provides a robust solution for relative pose estimation in VSLAM by combining geometric structure with learning, achieving better performance than existing methods in challenging conditions.
Abstract: A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
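The five loss terms enumerated in the abstract can be sketched directly; the NumPy version below is a reconstruction from those definitions, with uniform weights `w` as an assumption (the paper's weighting is not given):

```python
import numpy as np

def pose_loss(q_est, q_gt, t_est, t_gt, E_est, E_gt, w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Multi-component pose loss following the five terms in the abstract;
    the weights `w` are illustrative, not the paper's values."""
    # (i) L2 differences with GT for the quaternion rotation and translation
    l2 = np.linalg.norm(q_est - q_gt) + np.linalg.norm(t_est - t_gt)
    # (ii) Frobenius norm between estimated and GT Essential Matrices
    frob = np.linalg.norm(E_est - E_gt, ord="fro")
    # (iii) singular value differences (a valid EM has singular values (s, s, 0))
    sv = np.linalg.norm(np.linalg.svd(E_est, compute_uv=False)
                        - np.linalg.svd(E_gt, compute_uv=False))
    # (iv) heading angle difference between translation directions
    cos = np.clip(t_est @ t_gt / (np.linalg.norm(t_est) * np.linalg.norm(t_gt)), -1, 1)
    heading = np.arccos(cos)
    # (v) scale difference between translation magnitudes
    scale = abs(np.linalg.norm(t_est) - np.linalg.norm(t_gt))
    return float(np.dot(w, np.array([l2, frob, sv, heading, scale])))

q = np.array([1.0, 0.0, 0.0, 0.0])        # identity rotation quaternion
t = np.array([0.0, 0.0, 1.0])             # unit forward translation
E = np.array([[0.0, -1.0, 0.0],           # E = [t]_x for R = I
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])
perfect = pose_loss(q, q, t, t, E, E)     # every term vanishes for a perfect estimate
```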
[304] Temporal Inversion for Learning Interval Change in Chest X-Rays
Hanbin Ko, Kyeongmin Jeon, Doowoong Choi, Chang Min Park
Main category: cs.CV
TL;DR: TILA framework enhances temporal vision-language models for chest radiographs by using temporal inversion as supervisory signal to improve sensitivity to directional change between prior and current images.
Details
Motivation: Current medical foundation models analyze radiographs in isolation, missing the crucial clinical task of comparing prior and current images to assess interval change, which is essential for chest radiograph interpretation where radiologists must evaluate how findings evolve over time.
Method: TILA uses temporal inversion (reversing image pairs) as supervisory signal with inversion-aware objectives across pretraining, fine-tuning, and inference. It complements appearance modeling with explicit temporal order learning and introduces MS-CXR-Tretrieval evaluation set.
Result: Experiments on public datasets and real-world hospital cohorts show TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.
Conclusion: TILA provides an effective framework for enhancing temporal vision-language models’ sensitivity to directional change in medical imaging, addressing a key limitation in current medical foundation models.
Abstract: Recent advances in vision–language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.
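One plausible reading of the inversion supervision, sketched as a data-side transform (the label vocabulary and flip rule are illustrative assumptions, not the paper's): every (prior, current) pair also yields the reversed pair with a direction-flipped progression label.

```python
# Temporal inversion as a supervisory signal: reversing the image pair must
# reverse the direction of change. Label names here are hypothetical.
FLIP = {"worsened": "improved", "improved": "worsened", "stable": "stable"}

def with_inversions(pairs):
    """pairs: list of (prior_img, current_img, label).
    Returns the originals plus their temporally inverted counterparts."""
    out = []
    for prior, current, label in pairs:
        out.append((prior, current, label))
        out.append((current, prior, FLIP[label]))  # inverted supervision signal
    return out

train = with_inversions([("xray_t0", "xray_t1", "worsened"),
                         ("xray_t2", "xray_t3", "stable")])
```

A direction-insensitive model scores identically on a pair and its inversion; training on both with flipped labels penalizes exactly that failure mode.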
[305] TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis
Xiaofei Su, Zengshuo Wang, Minghe Sun, Xin Zhao, Mingzhu Sun
Main category: cs.CV
TL;DR: TAPE is a two-stage adaptation framework using parameter-efficient fine-tuning for medical image segmentation, addressing domain shift and task misalignment in OCT/OCTA analysis.
Details
Motivation: Existing methods for OCT/OCTA analysis require massive data and large models, making them impractical for resource-constrained clinical settings. Transfer learning with foundation models faces challenges of domain shift and task misalignment.
Method: TAPE decouples adaptation into two stages: domain alignment using parameter-efficient fine-tuning with masked image modeling, followed by task fitting for downstream segmentation tasks.
Result: TAPE achieves superior parameter efficiency and state-of-the-art generalization performance for retinal layer segmentation across diverse pathologies using both universal (MAE) and specialized (RETFound) foundation models.
Conclusion: The proposed two-stage adaptation framework effectively addresses domain shift and task misalignment challenges, enabling efficient deployment of foundation models in clinical settings for medical image analysis.
Abstract: Automated analysis of optical coherence tomography (OCT) and OCT angiography (OCTA) images is critical for robust ophthalmic diagnosis. Existing mainstream methods trained from scratch rely heavily on massive data and model scale, thereby hindering their practical deployment in resource-constrained clinical settings. Although transfer learning based on foundation models (FMs) is promising, it still faces significant challenges: domain shift and task misalignment. To address these, we propose TAPE: A Two-stage Adaptation Framework via Parameter-Efficient Fine-tuning, which strategically decouples adaptation into domain alignment and task fitting for downstream segmentation. The domain adaptation stage notably applies parameter-efficient fine-tuning (PEFT) in the context of masked image modeling for medical image domain adaptation, a novel approach to the best of our knowledge. Applying TAPE to retinal layer segmentation on both universal (masked auto-encoder, MAE) and specialized (RETFound) FMs, it demonstrates superior parameter efficiency and achieves state-of-the-art generalization performance across diverse pathologies.
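The abstract does not spell out which PEFT mechanism TAPE uses; a standard building block in this setting is a low-rank (LoRA-style) adapter on a frozen pretrained layer, sketched here as an assumption rather than the paper's exact design:

```python
import numpy as np

class LoRALinear:
    """Parameter-efficient adapter: frozen weight W plus a low-rank update B @ A.
    Only A and B are trained, giving r*(d_in + d_out) trainable parameters per
    layer -- far below d_in*d_out for large layers and small rank r."""
    def __init__(self, W, r=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (r, W.shape[1]))
        self.B = np.zeros((W.shape[0], r))            # zero-init: adapter starts as a no-op

    def __call__(self, x):
        return self.W @ x + self.B @ (self.A @ x)

W = np.eye(3)                        # stand-in for a pretrained weight matrix
layer = LoRALinear(W)
x = np.array([1.0, 2.0, 3.0])
# With B zero-initialized, the adapted layer reproduces the frozen layer exactly.
```

The same frozen-backbone-plus-adapter layer serves both stages: first trained under a masked-image-modeling loss for domain alignment, then under a segmentation loss for task fitting.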
[306] Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
Arian Komaei Koma, Seyed Amir Kasaei, Ali Aghayari, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban
Main category: cs.CV
TL;DR: Systematic study reveals trade-off between concept unlearning effectiveness and compositional integrity in text-to-image diffusion models, showing that strong erasure methods degrade attribute binding and spatial reasoning while preservation-focused methods fail at robust erasure.
Details
Motivation: Prior work on post-hoc unlearning in text-to-image diffusion models primarily evaluates erasure success without understanding impact on broader generative capabilities, particularly compositional text-to-image generation.
Method: Conducted systematic empirical study of concept unlearning through compositional text-to-image generation lens, focusing on nudity removal in Stable Diffusion 1.4, evaluating diverse state-of-the-art unlearning methods using T2I-CompBench++ and GenEval alongside established unlearning benchmarks.
Result: Revealed consistent trade-off: methods achieving strong erasure incur substantial degradation in attribute binding, spatial reasoning, and counting, while approaches preserving compositional structure often fail to provide robust erasure.
Conclusion: Highlights limitations of current evaluation practices and underscores need for unlearning objectives that explicitly account for semantic preservation beyond targeted suppression in text-to-image models.
Abstract: Post-hoc unlearning has emerged as a practical mechanism for removing undesirable concepts from large text-to-image diffusion models. However, prior work primarily evaluates unlearning through erasure success; its impact on broader generative capabilities remains poorly understood. In this work, we conduct a systematic empirical study of concept unlearning through the lens of compositional text-to-image generation. Focusing on nudity removal in Stable Diffusion 1.4, we evaluate a diverse set of state-of-the-art unlearning methods using T2I-CompBench++ and GenEval, alongside established unlearning benchmarks. Our results reveal a consistent trade-off between unlearning effectiveness and compositional integrity: methods that achieve strong erasure frequently incur substantial degradation in attribute binding, spatial reasoning, and counting. Conversely, approaches that preserve compositional structure often fail to provide robust erasure. These findings highlight limitations of current evaluation practices and underscore the need for unlearning objectives that explicitly account for semantic preservation beyond targeted suppression.
[307] PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
Inseong Choi, Siwoo Lee, Seung-Hun Nam, Soohwan Song
Main category: cs.CV
TL;DR: PR-IQA is a framework for assessing quality of diffusion-generated novel views without ground truth, using partial reference images to filter inconsistencies for better 3D Gaussian Splatting reconstruction.
Details
Motivation: Diffusion models can generate pseudo-ground-truth views for sparse-view novel view synthesis, but these synthesized images often contain photometric and geometric inconsistencies that impair 3D reconstruction when used directly for supervision.
Method: Proposes Partial-Reference Image Quality Assessment (PR-IQA) that evaluates diffusion-generated views using reference images from different poses. First computes geometrically consistent partial quality map in overlapping regions, then performs quality completion via cross-attention mechanism incorporating reference-view context to inpaint partial map into dense, full-image map.
Result: PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. When integrated into diffusion-augmented 3DGS pipeline, it restricts supervision to high-confidence regions, producing superior 3D reconstructions and NVS results.
Conclusion: PR-IQA enables effective filtering of inconsistencies in diffusion-generated views, improving 3D reconstruction quality in sparse-view novel view synthesis pipelines without requiring ground-truth supervision.
Abstract: Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results. The project page is available at https://kakaomacao.github.io/pr-iqa-project-page/.
[308] Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha
Main category: cs.CV
TL;DR: Firebolt-VL is an efficient vision-language model that replaces Transformer decoders with Liquid Foundation Models and adds a Token-Grid Correlation Module for better visual grounding, achieving fine-grained understanding with linear-time inference.
Details
Motivation: Current multimodal LLMs have high computational costs that limit deployment in resource-constrained scenarios, and small vision-language models struggle with fine-grained visual reasoning due to imprecise visual region capture.
Method: Replaces Transformer-based decoder with Liquid Foundation Model (LFM) decoder and introduces Token-Grid Correlation Module that computes lightweight correlations between text tokens and image patches, modulated via state-space model with FiLM conditioning.
Result: Achieves accurate, fine-grained understanding with significantly improved efficiency across multiple benchmarks while maintaining linear-time inference.
Conclusion: Firebolt-VL addresses efficiency and fine-grained visual grounding challenges in multimodal LLMs through architectural innovations that enable practical deployment in resource-constrained scenarios.
Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
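A hedged sketch of the correlate-then-modulate idea behind the Token-Grid Correlation Module (the pooling choice and FiLM parameterization below are assumptions; the paper routes the modulation through a state-space model, which is elided here):

```python
import numpy as np

def token_grid_film(text_tokens, image_patches, gamma_w, beta_w):
    """Sketch: (1) lightweight token-patch correlation, (2) pool a per-patch
    relevance score, (3) FiLM-style scale/shift of patch features conditioned
    on that relevance. text_tokens: (T, d); image_patches: (P, d);
    gamma_w, beta_w: (1, d) conditioning weights (hypothetical names)."""
    corr = text_tokens @ image_patches.T           # (T, P) correlation grid
    relevance = corr.max(axis=0, keepdims=True).T  # (P, 1) strongest text match per patch
    gamma = 1.0 + relevance * gamma_w              # (P, d) per-channel scale
    beta = relevance * beta_w                      # (P, d) per-channel shift
    return gamma * image_patches + beta            # emphasize prompt-relevant patches

text = np.ones((2, 3))
patches = np.arange(12, dtype=float).reshape(4, 3)
# With zero conditioning weights, FiLM reduces to the identity on patch features.
unchanged = token_grid_film(text, patches, np.zeros((1, 3)), np.zeros((1, 3)))
```

Because the correlation is a single matrix product per layer, the cost grows linearly in the number of patches, consistent with the linear-time inference claim.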
[309] Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection
Mei Qiu, Jianqiang Zhao, Yanyun Qu
Main category: cs.CV
TL;DR: A novel approach that integrates physical pixel-level features with multimodal CLIP models for robust AI-generated image detection across diverse generative architectures.
Details
Motivation: Existing deepfake detectors overfit to specific generative models, creating an adaptability crisis. The paper aims to identify stable physical features that distinguish natural from AI-generated images and integrate them into multimodal models to enhance detection reliability.
Method: Comprehensive exploration of 15 physical features across 20+ datasets from GANs and diffusion models. A novel feature selection algorithm identifies five core physical features (Laplacian variance, Sobel statistics, residual noise variance). These features are converted to text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP.
Result: Achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets like Wukong and SDv1.4. The method demonstrates robust discriminative power across diverse generative architectures.
Conclusion: The work pioneers physically grounded features for trustworthy vision-language modeling, bridging pixel-level authenticity with semantic understanding, and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.
Abstract: The rapid advancement of AI-generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI-generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features, including Laplacian variance, Sobel statistics, and residual noise variance, that exhibit consistent discriminative power across all tested datasets. These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel-level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision-language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.
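Three of the five selected descriptors are standard image statistics; the NumPy definitions below follow their textbook forms (the paper's exact variants may differ):

```python
import numpy as np

def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian response: a standard
    sharpness / high-frequency-content statistic."""
    lap = (-4 * img[1:-1, 1:-1] + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return float(lap.var())

def sobel_stats(img):
    """Mean and std of the Sobel gradient magnitude (edge-energy statistics)."""
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    mag = np.hypot(gx, gy)
    return float(mag.mean()), float(mag.std())

def residual_noise_variance(img):
    """Variance of the residual after a 3x3 box-filter smoothing: a crude
    sensor-noise proxy; generated images often leave atypical residuals."""
    smooth = sum(img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    return float((img[1:-1, 1:-1] - smooth).var())

flat = np.zeros((16, 16))                               # no texture at all
noisy = np.random.default_rng(0).normal(0.0, 1.0, (16, 16))
# A flat image scores zero on all three descriptors; a noisy one does not.
```

In the paper's pipeline these scalar values are text-encoded and appended to the semantic caption that conditions CLIP's image-text alignment.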
[310] Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers
Jiancheng Wang, Lidan Liang, Yong Wang, Zengzhen Su, Haifeng Xia, Yuanting Yan, Wei Wang
Main category: cs.CV
TL;DR: GLA introduces stealthy backdoor attacks on visual language models for autonomous driving using naturalistic graffiti triggers and cross-language text triggers, achieving high attack success with low poisoning while maintaining clean task performance.
Details
Motivation: As VLMs are integrated into safety-critical systems like autonomous driving, they become important attack surfaces. Existing backdoor attacks use unimodal, explicit triggers that are easily detectable, making it difficult to create covert and stable attack channels in real-world driving scenarios.
Method: GLA introduces two naturalistic triggers: 1) graffiti-based visual patterns generated via stable diffusion inpainting that blend seamlessly into urban scenes, and 2) cross-language text triggers that create distributional shifts while maintaining semantic consistency to build robust language-side trigger signals.
Result: Experiments on DriveVLM show GLA requires only 10% poisoning ratio to achieve 90% Attack Success Rate (ASR) and 0% False Positive Rate (FPR). The backdoor doesn’t weaken model performance on clean tasks but actually improves metrics like BLEU-1, making traditional performance-degradation-based detection methods ineffective.
Conclusion: This study reveals underestimated security threats in self-driving VLMs and provides a new attack paradigm for backdoor evaluation in safety-critical multimodal systems, highlighting the need for more sophisticated security measures.
Abstract: Visual language models (VLMs) are rapidly being integrated into safety-critical systems such as autonomous driving, making them an important attack surface for potential backdoor attacks. Existing backdoor attacks mainly rely on unimodal, explicit, and easily detectable triggers, making it difficult to construct both covert and stable attack channels in autonomous driving scenarios. The proposed GLA attack introduces two naturalistic triggers: graffiti-based visual patterns generated via stable diffusion inpainting, which seamlessly blend into urban scenes, and cross-language text triggers, which introduce distributional shifts while maintaining semantic consistency to build robust language-side trigger signals. Experiments on DriveVLM show that GLA requires only a 10% poisoning ratio to achieve a 90% Attack Success Rate (ASR) and a 0% False Positive Rate (FPR). More insidiously, the backdoor does not weaken the model on clean tasks, but instead improves metrics such as BLEU-1, making it difficult for traditional performance-degradation-based detection methods to identify the attack. This study reveals underestimated security threats in self-driving VLMs and provides a new attack paradigm for backdoor evaluation in safety-critical multimodal systems.
[311] InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation
Jiawen Zhu, Mengjia Niu, Guansong Pang
Main category: cs.CV
TL;DR: InCTRLv2 is a few-shot generalist anomaly detection framework that extends previous work with dual-branch architecture using vision-language semantic priors for cross-domain anomaly detection without retraining.
Details
Motivation: Current anomaly detection methods are specialist models that require large training samples from specific domains and struggle to generalize to unseen datasets. The need for Generalist Anomaly Detection (GAD) models that can work across diverse domains without retraining motivates this work.
Method: InCTRLv2 extends the InCTRL framework with a dual-branch approach: 1) Discriminative Anomaly Score Learning (DASL) using both normal and abnormal data to learn semantic-guided abnormality/normality spaces, and 2) One-class Anomaly Score Learning (OASL) using only normal data to learn generalized normality patterns. Both branches leverage visual-text semantic priors from large-scale vision-language models.
Result: Extensive experiments on ten AD datasets show state-of-the-art performance in both anomaly detection and segmentation tasks across various settings.
Conclusion: InCTRLv2 successfully addresses the generalization limitations of specialist anomaly detection models by providing a dual semantic perspective approach that works across diverse domains without retraining.
Abstract: While recent anomaly detection (AD) methods have made substantial progress in recognizing abnormal patterns within specific domains, most of them are specialist models that are trained on large training samples from a specific target dataset, struggling to generalize to unseen datasets. To address this limitation, the paradigm of Generalist Anomaly Detection (GAD) has emerged in recent years, aiming to learn a single generalist model to detect anomalies across diverse domains without retraining. To this end, this work introduces InCTRLv2, a novel few-shot Generalist Anomaly Detection and Segmentation (GADS) framework that significantly extends our previously proposed GAD model, InCTRL. Building on the idea of learning in-context residuals with few-shot normal examples to detect anomalies as in InCTRL, InCTRLv2 introduces two new, complementary perspectives of anomaly perception under a dual-branch framework. This is accomplished by two novel modules upon InCTRL: i) Discriminative Anomaly Score Learning (DASL) with both normal and abnormal data in the main branch, which learns a semantic-guided abnormality and normality space that supports the classification of query samples from both the abnormality and normality perspectives; and ii) One-class Anomaly Score Learning (OASL) using only the normal data, which learns generalized normality patterns in a semantic space via an auxiliary branch, focusing on detecting anomalies through the lens of normality solely. Both branches are guided by rich visual-text semantic priors encoded by large-scale vision-language models. Together, they offer a dual semantic perspective for AD: one emphasizes normal-abnormal discriminations, while the other emphasizes normality-deviated semantics. Extensive experiments on ten AD datasets demonstrate that InCTRLv2 achieves SotA performance in both anomaly detection and segmentation tasks across various settings.
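The in-context residual idea that InCTRLv2 builds on can be illustrated with a fixed (non-learned) stand-in: score a query by the distance from its features to the nearest few-shot normal prompt. The real model learns this residual; the nearest-neighbour form below is only a sketch.

```python
import numpy as np

def residual_anomaly_score(query_feats, normal_feats):
    """query_feats: (d,) features of the query image;
    normal_feats: (K, d) features of the K few-shot normal examples.
    Returns the smallest in-context residual: small for normal queries,
    large for anomalous ones."""
    q = query_feats / np.linalg.norm(query_feats)
    n = normal_feats / np.linalg.norm(normal_feats, axis=1, keepdims=True)
    residuals = np.linalg.norm(q - n, axis=1)   # per-prompt residual
    return float(residuals.min())

normal_bank = np.array([[1.0, 0.0], [0.0, 1.0]])          # 2 normal shots
score_normal = residual_anomaly_score(np.array([2.0, 0.0]), normal_bank)   # ~0
score_anom = residual_anomaly_score(np.array([-1.0, 0.0]), normal_bank)    # large
```

No retraining is needed for a new domain: swapping in a new `normal_bank` of few-shot examples re-targets the detector, which is the generalist property GAD aims for.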
[312] Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale
Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, Jingyong Su
Main category: cs.CV
TL;DR: A novel AI-generated video detection framework using native-scale processing with Qwen2.5-VL Vision Transformer to preserve high-frequency forgery traces, accompanied by a large-scale dataset of 140K+ videos from 15 modern generators.
Details
Motivation: Current video detection methods have critical limitations: they rely on preprocessing operations (fixed-resolution resizing/cropping) that discard subtle forgery traces and cause information loss, and they're trained on outdated datasets that don't capture modern generative model sophistication.
Method: Proposes a novel detection framework built on Qwen2.5-VL Vision Transformer that operates natively at variable spatial resolutions and temporal durations, preserving high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Also curates a large-scale dataset of over 140K videos from 15 state-of-the-art generators.
Result: Extensive experiments demonstrate superior performance across multiple benchmarks, establishing a robust new baseline for AI-generated video detection.
Conclusion: Native-scale processing is critical for effective AI-generated video detection, and the proposed framework with comprehensive dataset addresses current limitations in synthetic media detection.
Abstract: The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations like fixed-resolution resizing and cropping. These operations not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with the Magic Videos benchmark, designed specifically for evaluating ultra-realistic synthetic content. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach effectively preserves the high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.
[313] Training-Free Refinement of Flow Matching with Divergence-based Sampling
Yeonwoo Cha, Jaehoon Yoo, Semin Kim, Yunseo Park, Jinhyeon Kwon, Seunghoon Hong
Main category: cs.CV
TL;DR: FDS is a training-free framework that improves flow-based generative models by refining intermediate states using divergence of the marginal velocity field to avoid misguidance to low-density regions.
Details
Motivation: Flow-based models suffer from generation quality degradation when sample-wise velocities conflict at intermediate states, causing the averaged marginal velocity to misguide samples toward low-density regions.
Method: Proposes Flow Divergence Sampler (FDS) that computes divergence of the marginal velocity field during inference and uses this signal to refine intermediate states before each solver step, steering states toward less ambiguous regions.
Result: FDS consistently improves fidelity across various generation tasks including text-to-image synthesis and inverse problems, working as a plug-and-play framework with standard solvers and off-the-shelf flow backbones.
Conclusion: FDS effectively addresses the misguidance problem in flow-based models by exploiting divergence signals to improve generation quality without requiring retraining.
Abstract: Flow-based models learn a target distribution by modeling a marginal velocity field, defined as the average of sample-wise velocities connecting each sample from a simple prior to the target data. When sample-wise velocities conflict at the same intermediate state, however, this averaged velocity can misguide samples toward low-density regions, degrading generation quality. To address this issue, we propose the Flow Divergence Sampler (FDS), a training-free framework that refines intermediate states before each solver step. Our key finding reveals that the severity of this misguidance is quantified by the divergence of the marginal velocity field that is readily computable during inference with a well-optimized model. FDS exploits this signal to steer states toward less ambiguous regions. As a plug-and-play framework compatible with standard solvers and off-the-shelf flow backbones, FDS consistently improves fidelity across various generation tasks including text-to-image synthesis and inverse problems.
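A minimal numeric sketch of the two ingredients: estimating the divergence of a velocity field, and nudging a state down the divergence gradient before a solver step. The finite-difference estimator and the explicit gradient step are stand-ins (assumptions), not FDS's actual update.

```python
import numpy as np

def divergence(v, x, eps=1e-4):
    """Central-difference estimate of div v(x) = sum_i dv_i/dx_i; a stand-in
    for the divergence FDS computes during flow inference."""
    d = x.size
    div = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        div += (v(x + e)[i] - v(x - e)[i]) / (2 * eps)
    return div

def fds_refine(v, x, step=0.1, eps=1e-4):
    """One refinement move in the spirit of FDS (illustrative): nudge the
    state down the gradient of the divergence, toward regions where
    sample-wise velocities conflict less."""
    d = x.size
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        grad[i] = (divergence(v, x + e) - divergence(v, x - e)) / (2 * eps)
    return x - step * grad

# Sanity check: for v(x) = x, div v = d at every point.
d2 = divergence(lambda x: x, np.array([1.0, 2.0]))

# For v_i(x) = 0.5 * x_i**2 we have div v = x_1 + x_2, whose gradient is
# (1, 1), so one refinement step moves x by -step in each coordinate.
x_refined = fds_refine(lambda x: 0.5 * x ** 2, np.array([1.0, 2.0]))
```

In a real sampler this refinement would be interleaved with the ODE solver: refine the state, take one solver step along the velocity field, and repeat.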
[314] Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection
Yihan Sun, Yuqi Cheng, Junjie Zu, Yuxiang Tan, Guoyang Xie, Yucheng Wang, Yunkang Cao, Weiming Shen
Main category: cs.CV
TL;DR: Synthesis4AD is a framework for 3D anomaly detection that uses synthetic anomaly generation via 3D-DefectStudio and MPAS engine, enhanced by multimodal LLM interpretation of design information, achieving state-of-the-art performance on industrial datasets.
Details
Motivation: Industrial 3D anomaly detection suffers from scarcity and long-tailed distribution of abnormal samples, limiting performance. Real-world defect data is hard to collect, creating a need for synthetic anomaly generation to improve detection capabilities.
Method: 1) 3D-DefectStudio platform with MPAS synthesis engine generates realistic defects using higher-dimensional support primitives and point-wise anomaly masks. 2) Multimodal LLM interprets product design information to automatically generate anomaly synthesis instructions. 3) Training pipeline with spatial-distribution normalization and geometry-faithful augmentations improves Point Transformer robustness on unstructured point clouds.
Result: Achieves state-of-the-art performance on Real3D-AD, MulSen-AD, and real-world industrial parts datasets. The synthesis method and interactive system will be publicly released.
Conclusion: Synthesis4AD provides an effective end-to-end paradigm for 3D anomaly detection by leveraging synthetic anomalies and multimodal LLM guidance, addressing data scarcity issues in industrial applications.
Abstract: Industrial 3D anomaly detection performance is fundamentally constrained by the scarcity and long-tailed distribution of abnormal samples. To address this challenge, we propose Synthesis4AD, an end-to-end paradigm that leverages large-scale, high-fidelity synthetic anomalies to learn more discriminative representations for 3D anomaly detection. At the core of Synthesis4AD is 3D-DefectStudio, a software platform built upon the controllable synthesis engine MPAS, which injects geometrically realistic defects guided by higher-dimensional support primitives while simultaneously generating accurate point-wise anomaly masks. Furthermore, Synthesis4AD incorporates a multimodal large language model (MLLM) to interpret product design information and automatically translate it into executable anomaly synthesis instructions, enabling scalable and knowledge-driven anomalous data generation. To improve the robustness and generalization of the downstream detector on unstructured point clouds, Synthesis4AD further introduces a training pipeline based on spatial-distribution normalization and geometry-faithful data augmentations, which alleviates the sensitivity of Point Transformer architectures to absolute coordinates and improves feature learning under realistic data variations. Extensive experiments demonstrate state-of-the-art performance on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset. The proposed synthesis method MPAS and the interactive system 3D-DefectStudio will be publicly released at https://github.com/hustCYQ/Synthesis4AD.
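A crude sketch of point-level defect synthesis with a paired point-wise anomaly mask, the two outputs the MPAS engine is described as producing. The dent geometry, function names, and parameters below are invented for illustration; the actual engine uses higher-dimensional support primitives:

```python
import numpy as np

def inject_dent(points, center_idx, radius=0.1, depth=0.05):
    # Carve a dent-like defect into a point cloud and return a
    # point-wise anomaly mask (illustrative stand-in for MPAS).
    center = points[center_idx]
    dist = np.linalg.norm(points - center, axis=1)
    mask = dist < radius
    # Pull affected points toward the dent center, scaled by proximity.
    falloff = (1.0 - dist[mask] / radius) * depth
    direction = center - points[mask]
    norms = np.linalg.norm(direction, axis=1, keepdims=True) + 1e-8
    points = points.copy()
    points[mask] += direction / norms * falloff[:, None]
    return points, mask.astype(np.uint8)

rng = np.random.default_rng(0)
# Toy "part": points sampled on a unit sphere.
pts = rng.normal(size=(2048, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
defective, mask = inject_dent(pts, center_idx=0)
```

Synthesizing the mask jointly with the geometry is what makes the defective sample directly usable as a supervised training pair for the downstream detector.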
[315] ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
Selim Ahmet Iz, Francesco Nex, Norman Kerle, Henry Meissner, Ralf Berger
Main category: cs.CV
TL;DR: ZeD-MAP converts diffusion depth models into metric-consistent mapping pipelines using cluster-based bundle adjustment for real-time UAV depth reconstruction.
Details
Motivation: Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks like disaster response, but faces challenges with wide-baseline parallax, large image sizes, and computational constraints. While zero-shot diffusion models offer fast predictions without task-specific retraining, they lack metric accuracy and temporal consistency across frames.
Method: ZeD-MAP integrates incremental cluster-based bundle adjustment (BA) with diffusion depth models. Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation.
Result: Validation on ground-marker flights at ~50m altitude shows sub-meter accuracy: ~0.87m error in horizontal (XY) plane and ~0.12m in vertical (Z) direction, with per-image runtimes between 1.47-4.91 seconds. The method achieves consistency comparable to classical photogrammetry while enabling real-time 3D map generation.
Conclusion: BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation from UAV imagery using diffusion models.
Abstract: Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
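The core of the metric guidance — anchoring a relative, up-to-scale depth prediction to sparse bundle-adjusted tie-points — can be sketched as a least-squares scale/shift fit. This is a simplification with synthetic data: the paper reprojects 3-D tie-points into frames and conditions the diffusion model on them, rather than post-hoc rescaling:

```python
import numpy as np

def align_depth(pred, tie_uv, tie_metric):
    # Fit depth_metric ~= s * depth_relative + b over the tie-points,
    # then apply the recovered mapping to the whole predicted depth map.
    z = np.array([pred[v, u] for u, v in tie_uv])
    A = np.stack([z, np.ones_like(z)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, tie_metric, rcond=None)
    return s * pred + b

rng = np.random.default_rng(1)
gt = 40.0 + 10.0 * rng.random((48, 64))   # synthetic metric depth, ~50 m flight
pred = 0.02 * gt - 0.5                    # relative prediction, unknown scale/shift
tie_uv = [(rng.integers(64), rng.integers(48)) for _ in range(20)]  # sparse (u, v)
tie_metric = np.array([gt[v, u] for u, v in tie_uv])
aligned = align_depth(pred, tie_uv, tie_metric)
mean_err = np.abs(aligned - gt).mean()
```

With an exact affine relation the fit recovers metric depth everywhere from only 20 sparse anchors, which is why a handful of BA tie-points per frame suffices.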
[316] 3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction
Beiyuan Zhang, Hesong Li, Ruiwen Shao, Ying Fu
Main category: cs.CV
TL;DR: DenZa-Gaussian adapts 3D Gaussian Splatting to ADF-STEM tomography with sparse-view acquisition, addressing imaging physics mismatch and missing wedge artifacts through learnable scattering fields and Fourier amplitude regularization.
Details
Motivation: ADF-STEM tomography faces a trade-off: more tilt views improve 3D reconstruction but risk damaging dose-sensitive materials and introduce drift/misalignment issues. Sparse-view acquisition is often necessary, but conventional methods degrade with limited views, producing artifacts and reduced structural fidelity.
Method: Adapts 3D Gaussian Splatting (3DGS) to ADF-STEM with three key innovations: 1) Models local scattering strength as learnable scalar field “denza” to address physics mismatch, 2) Introduces coefficient γ for scattering stabilization across tilt angles via view normalization, 3) Incorporates loss function with 2D Fourier amplitude term to suppress missing wedge artifacts in sparse-view reconstruction.
Result: Experiments on 45-view and 15-view tilt series show DenZa-Gaussian produces high-fidelity reconstructions and 2D projections that align more closely with original tilts, demonstrating superior robustness under sparse-view conditions compared to conventional methods.
Conclusion: DenZa-Gaussian successfully addresses sparse-view ADF-STEM tomography challenges by adapting 3DGS with physics-aware modifications, enabling high-quality 3D reconstructions while minimizing electron dose exposure and mitigating artifacts from limited tilt views.
Abstract: Annular Dark Field Scanning Transmission Electron Microscopy (ADF-STEM) tomography reconstructs nanoscale materials in 3D by integrating multi-view tilt-series images, enabling precise analysis of their structural and compositional features. Although integrating more tilt views improves 3D reconstruction, it requires extended electron exposure that risks damaging dose-sensitive materials and introduces drift and misalignment, making it difficult to balance reconstruction fidelity with sample preservation. In practice, sparse-view acquisition is frequently required, yet conventional ADF-STEM methods degrade under limited views, exhibiting artifacts and reduced structural fidelity. To resolve these issues, in this paper, we adapt 3D Gaussian Splatting (3DGS) to this domain with three key components. We first model the local scattering strength as a learnable scalar field, denza, to address the mismatch between 3DGS and ADF-STEM imaging physics. Then we introduce a coefficient $γ$ to stabilize scattering across tilt angles, ensuring consistent denza via scattering view normalization. Finally, we incorporate a loss function that includes a 2D Fourier amplitude term to suppress missing wedge artifacts in sparse-view reconstruction. Experiments on 45-view and 15-view tilt series show that DenZa-Gaussian produces high-fidelity reconstructions and 2D projections that align more closely with original tilts, demonstrating superior robustness under sparse-view conditions.
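One way to read the "2D Fourier amplitude term": penalize the gap between the amplitude spectra of a rendered projection and the observed tilt, which is sensitive to the structured frequency loss of a missing wedge. The exact normalisation below is guessed, not taken from the paper:

```python
import numpy as np

def fourier_amplitude_loss(pred, target):
    # Mean absolute difference between 2-D Fourier amplitude spectra.
    # Illustrative guess at the paper's "Fourier amplitude term"; the
    # exact weighting used by DenZa-Gaussian may differ.
    return np.mean(np.abs(np.abs(np.fft.fft2(pred)) - np.abs(np.fft.fft2(target))))

rng = np.random.default_rng(0)
target = rng.random((32, 32))                        # observed tilt projection
noisy = target + 0.1 * rng.normal(size=(32, 32))     # imperfect rendering
loss_same = fourier_amplitude_loss(target, target)
loss_noisy = fourier_amplitude_loss(noisy, target)
```

Comparing amplitudes only (discarding phase) makes the term tolerant of small spatial misalignments between render and tilt, which matters when poses carry drift.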
[317] OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang
Main category: cs.CV
TL;DR: OpenWorldLib is a standardized inference framework for Advanced World Models that provides a clear definition and systematic categorization of world model capabilities, enabling unified integration and collaborative inference across different tasks.
Details
Motivation: The paper addresses the lack of a clear and unified definition for world models in AI research, despite their growing importance. There's a need for standardization and a comprehensive framework to facilitate research and practical applications of world models.
Method: Proposes a clear definition of world models as perception-centered models with interaction and long-term memory capabilities. Systematically categorizes essential world model capabilities and develops OpenWorldLib - a standardized inference framework that integrates models across different tasks within a unified architecture for efficient reuse and collaborative inference.
Result: OpenWorldLib provides a working implementation that enables unified integration of world models across different tasks. The framework allows for efficient model reuse and collaborative inference, addressing the fragmentation in world model research and applications.
Conclusion: OpenWorldLib offers a standardized approach to world model research and application, providing clear definitions, systematic categorization, and a practical framework. The work lays groundwork for more unified and collaborative development in world model research with potential for future extensions.
Abstract: World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib
[318] Don’t Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi
Main category: cs.CV
TL;DR: Adaptive KV-cache quantization for LLMs that assigns variable bit-widths to tokens based on importance, reducing memory and latency while maintaining accuracy close to FP16 inference.
Details
Motivation: On-device LLM inference is constrained by KV-cache memory overhead that grows with context length. Fixed quantization wastes bits on unimportant tokens while over-compressing important ones, causing accuracy degradation.
Method: Learned policy using lightweight token-level features (frequency, quality score, attention variance, entropy-based uncertainty) fed into a compact controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding.
Result: Reduces KV memory footprint and latency while improving accuracy vs static quantization. With SmolLM-360M on HellaSwag: 17.75% latency reduction, 7.60 points accuracy improvement, within 0.30 points of FP16 inference.
Conclusion: Adaptive KV-cache quantization enables efficient on-device LLM deployment by intelligently allocating bits to tokens based on importance, achieving better accuracy-latency trade-offs than static methods.
Abstract: Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding’s principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
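The variable bit-width idea can be sketched with a rule-based controller over a toy KV cache. The paper learns its policy from token-level features; the thresholds and importance scores below are made up, and the quantizer is plain symmetric uniform quantization:

```python
import numpy as np

def quantize(v, bits):
    # Symmetric uniform quantization of one token's KV vector.
    if bits >= 16:
        return v                            # keep full precision as-is
    levels = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit, 1 for 2-bit
    scale = np.abs(v).max() / levels + 1e-12
    return np.round(v / scale) * scale

def select_bits(importance):
    # Toy stand-in for the learned controller: more important tokens
    # get more bits (thresholds here are invented).
    if importance > 0.8:
        return 16
    if importance > 0.5:
        return 8
    if importance > 0.2:
        return 4
    return 2

rng = np.random.default_rng(0)
kv = rng.normal(size=(6, 64))                           # 6 cached tokens
importance = np.array([0.9, 0.6, 0.3, 0.1, 0.7, 0.05])  # hypothetical scores
cache = np.stack([quantize(t, select_bits(s)) for t, s in zip(kv, importance)])
errors = np.abs(cache - kv).mean(axis=1)
```

The per-token errors make the Huffman-style trade-off concrete: the 2-bit tokens absorb large reconstruction error where it presumably matters least, while high-importance tokens stay near-lossless.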
[319] Discovering Failure Modes in Vision-Language Models using RL
Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal
Main category: cs.CV
TL;DR: RL-based framework automatically discovers VLM failure modes by training a questioner agent to generate adaptive queries that elicit incorrect answers, identifying 36 novel failure modes.
Details
Motivation: VLMs often misinterpret straightforward visual concepts (counting, spatial reasoning, viewpoint understanding) that humans identify easily. Manual identification of these weaknesses is costly, unscalable, and subject to human bias, leading to incomplete understanding of model vulnerabilities.
Method: Reinforcement Learning framework that trains a questioner agent to adaptively generate queries based on candidate VLM’s responses to elicit incorrect answers. The approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses.
Result: Identified 36 novel failure modes in which VLMs struggle. Demonstrated broad applicability and generalizability across various model combinations.
Conclusion: The proposed RL-based framework provides an automated, scalable approach to discover VLM blind spots without human intervention, overcoming limitations of manual analysis and revealing previously unknown model vulnerabilities.
Abstract: Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model’s vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM’s responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
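The reward structure — the questioner is rewarded when the candidate VLM answers incorrectly — can be caricatured as a bandit over skill categories. Everything here is simulated: the skill list, failure rates, and epsilon-greedy learner are invented stand-ins for the paper's RL-trained generative questioner:

```python
import numpy as np

rng = np.random.default_rng(0)
SKILLS = ["counting", "spatial", "viewpoint", "color"]
# Hidden per-skill failure rates of a simulated candidate VLM.
FAIL_RATE = {"counting": 0.7, "spatial": 0.6, "viewpoint": 0.4, "color": 0.1}

q_values = {s: 0.0 for s in SKILLS}   # questioner's value estimates
counts = {s: 0 for s in SKILLS}

for _ in range(2000):
    # Epsilon-greedy stand-in for the RL questioner's policy.
    if rng.random() < 0.1:
        skill = SKILLS[rng.integers(len(SKILLS))]
    else:
        skill = max(q_values, key=q_values.get)
    reward = float(rng.random() < FAIL_RATE[skill])   # 1 iff the VLM answers wrong
    counts[skill] += 1
    q_values[skill] += (reward - q_values[skill]) / counts[skill]
```

After training, the questioner's value estimates concentrate on the skills where the candidate fails most — the bandit analogue of "discovering a failure mode".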
[320] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He
Main category: cs.CV
TL;DR: Process-driven image generation: A multi-step paradigm that decomposes image synthesis into interleaved reasoning trajectory of thoughts and actions, mimicking human painting process with textual planning, visual drafting, textual reflection, and visual refinement stages.
Details
Motivation: Humans paint incrementally with planning, sketching, inspection, and refinement grounded in evolving visual states. The paper investigates whether unified multimodal models can imagine intermediate states and proposes a process-driven approach to make generation explicit, interpretable, and supervisable.
Method: Multi-step paradigm with 4-stage iterations: 1) textual planning, 2) visual drafting, 3) textual reflection, and 4) visual refinement. Uses dense step-wise supervision with spatial/semantic consistency for visual states and knowledge preservation with error correction for textual states.
Result: Experiments conducted under various text-to-image generation benchmarks to validate the proposed method. The approach enables explicit, interpretable generation process with intermediate state supervision.
Conclusion: Process-driven image generation provides a framework for multimodal models to generate images through explicit reasoning trajectories, addressing ambiguity of intermediate states through complementary constraints on visual and textual intermediate states.
Abstract: Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially-complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce the spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments under various text-to-image generation benchmarks.
[321] CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen, Yiqian Zhang, Haiyun Guo, Shuohuan Wang, Yu Sun
Main category: cs.CV
TL;DR: CLEAR framework connects multimodal understanding and generation capabilities to improve robustness on degraded images through supervised fine-tuning, latent representation bridge, and joint reinforcement learning optimization.
Details
Motivation: Image degradation (blur, noise, compression, poor illumination) severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation have generative capacity that could help with degraded inputs, but they fail to leverage this capacity effectively due to training regimes that don't invoke generation during reasoning and inefficient decode-reencode pathways.
Method: Three progressive steps: (1) Supervised fine-tuning on degradation-aware dataset to establish generate-then-answer reasoning pattern; (2) Latent Representation Bridge replacing decode-reencode detour with direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO (reinforcement learning) jointly optimizing text reasoning and visual generation under answer-correctness rewards.
Result: CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Analysis reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting task-driven optimization and visual quality are naturally aligned.
Conclusion: The CLEAR framework successfully connects multimodal understanding and generation capabilities to address image degradation challenges, demonstrating that task-driven optimization can naturally improve visual quality without explicit reconstruction supervision.
Abstract: Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
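The Latent Representation Bridge replaces a non-optimizable decode-to-pixels / re-encode detour with a direct learnable hop from generation latents into the reasoning embedding space. A linear sketch with invented dimensions and random stand-in matrices (the real bridge and its training are the paper's contribution, not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_gen, d_llm, n_tokens = 32, 48, 16   # hypothetical latent / embedding sizes

# Frozen stand-ins for the detour path: decode latents to "pixels",
# then re-encode pixels into LLM token embeddings.
decode = rng.normal(scale=0.1, size=(d_gen, 256))
reencode = rng.normal(scale=0.1, size=(256, d_llm))

# The bridge: a single learnable projection, differentiable end-to-end.
W_bridge = rng.normal(scale=0.1, size=(d_gen, d_llm))

latents = rng.normal(size=(n_tokens, d_gen))          # generated visual latents
detour_tokens = np.tanh(latents @ decode) @ reencode  # decode-reencode detour
bridge_tokens = latents @ W_bridge                    # direct, optimizable hop
```

Both paths land in the same embedding space, but only the bridge gives gradients a short, well-conditioned route between the answer loss and the generator — the property the joint GRPO stage relies on.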
[322] AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
Hongyu Liu, Xuan Wang, Yating Wang, Zijian Wu, Ziyu Wan, Yue Ma, Runtao Liu, Boyao Zhou, Yujun Shen, Qifeng Chen
Main category: cs.CV
TL;DR: AvatarPointillist generates dynamic 4D Gaussian avatars from single portrait images using an autoregressive Transformer that sequentially generates point clouds for 3D Gaussian Splatting with adaptive density and animation binding.
Details
Motivation: The paper addresses the challenge of creating high-quality, photorealistic, and controllable dynamic avatars from minimal input (single portrait images). Current methods may lack precision, adaptability to subject complexity, or realistic animation capabilities.
Method: Uses a decoder-only Transformer that autoregressively generates point clouds for 3D Gaussian Splatting. The sequential approach allows adaptive point density and total point count based on subject complexity. The AR model jointly predicts per-point binding information for animation. A Gaussian decoder then converts points into complete renderable Gaussian attributes, conditioned on latent features from the AR generator.
Result: Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. The method demonstrates effective interaction between generation stages and marked improvement in fidelity.
Conclusion: The autoregressive formulation represents a new paradigm for avatar generation, offering precise adaptive construction and realistic animation from single portrait inputs. The framework will be released to inspire future research.
Abstract: We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject’s complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.
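The adaptive, sequential construction can be caricatured as an autoregressive loop that emits one point plus its animation-binding weights per step until a stop decision fires. The "model" below is pure random noise and every name is invented; it only shows the interface (variable-length output, joint point-and-binding prediction):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar_generate_points(max_points=64, stop_thresh=0.9):
    # Toy stand-in for AvatarPointillist's decoder-only Transformer:
    # a real model would condition each step on all previous points.
    points, bindings = [], []
    for _ in range(max_points):
        pt = rng.normal(size=3)                 # next 3-D point
        bind = rng.dirichlet(np.ones(4))        # soft binding to 4 joints
        points.append(pt)
        bindings.append(bind)
        if rng.random() > stop_thresh:          # learned stop decision in the paper
            break
    return np.array(points), np.array(bindings)

pts, binds = ar_generate_points()
```

The variable stopping point is the mechanism behind "adaptive total point count": simple subjects terminate early, complex ones keep emitting points.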
[323] Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada
Main category: cs.CV
TL;DR: MMF-BEV is a radar-camera fusion framework for 3D object detection in autonomous driving that uses deformable attention mechanisms for cross-modal feature alignment in Bird’s Eye View representation.
Details
Motivation: Autonomous driving requires complementary sensors: cameras provide dense semantics but unreliable depth, while radar offers precise range/velocity with sparse geometry. Existing methods need better cross-modal fusion for improved 3D object detection.
Method: Proposes MMF-BEV with separate BEVDepth camera branch and RadarBEVNet radar branch, both enhanced with Deformable Self-Attention. Fuses them via Deformable Cross-Attention module. Uses two-stage training: pre-train camera branch with depth supervision, then jointly train radar and fusion modules.
Result: Outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes on View-of-Delft dataset, both in full annotated area and near-range Region of Interest. Sensor contribution analysis shows interpretable modality weighting.
Conclusion: MMF-BEV effectively leverages sensor complementarity through deformable attention mechanisms for improved 3D object detection, with interpretable fusion that adapts modality weighting based on distance.
Abstract: Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, which pre-trains the camera branch with depth supervision and then jointly trains the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.
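The fusion step can be sketched as plain cross-attention where flattened camera BEV cells query radar BEV features, followed by a residual merge. MMF-BEV's Deformable Cross-Attention additionally learns sparse sampling offsets per query, which this simplified, dense sketch omits:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keyval, d):
    # Dense cross-attention: each camera BEV cell attends over all
    # radar BEV cells (deformable attention would attend to a few
    # learned sample locations instead).
    attn = softmax(query @ keyval.T / np.sqrt(d))
    return attn @ keyval

rng = np.random.default_rng(0)
d = 32
cam_bev = rng.normal(size=(100, d))   # 10x10 camera BEV grid, flattened
rad_bev = rng.normal(size=(100, d))   # radar BEV grid, flattened
fused = cam_bev + cross_attention(cam_bev, rad_bev, d)  # residual fusion
```

The residual form keeps the camera branch's dense semantics intact while letting radar evidence sharpen range-sensitive cells — the complementarity the per-distance contribution analysis measures.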
[324] InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement
Yude Zou, Junji Gong, Xing Gao, Zixuan Li, Tianxing Chen, Guanjie Zheng
Main category: cs.CV
TL;DR: Coarse-to-fine framework for generating human-object-scene interactions using consistency models with dynamic perception and bump-aware guidance, trained on hybrid data to overcome annotation scarcity.
Details
Motivation: HOSI generation requires reasoning over dynamic object-scene changes but suffers from limited annotated data. Existing methods focus on HOI or HSI separately, lacking integrated solutions for the more complex HOSI problem.
Method: Instruction-conditioned interaction generation framework aligned with consistency model denoising. Uses dynamic perception strategy that updates scene context using trajectories from preceding refinement. Includes bump-aware guidance to mitigate collisions without fine-grained geometry. Employs hybrid training with pseudo-HOSI samples from HOI data and high-fidelity HSI data.
Result: Achieves state-of-the-art performance in both HOSI and HOI generation, with strong generalization to unseen scenes. Enables real-time generation while reducing physical artifacts.
Conclusion: Proposed framework effectively addresses HOSI generation challenges through consistency model alignment, dynamic perception, and hybrid training, overcoming data scarcity while maintaining physical plausibility.
Abstract: Human-object-scene interactions (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/
[325] Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang
Main category: cs.CV
TL;DR: DDP improves VQA by strategically degrading image quality to force models to focus on essential structural information, using techniques like downsampling, visual aids, and specialized tools for different perceptual tasks.
Details
Motivation: High-resolution details in images can sometimes act as noise that leads to hallucinations or reasoning errors in Vision-Language Models, so the paper aims to improve VQA performance by reducing visual fidelity to focus on structural information.
Method: Degradation-Driven Prompting (DDP) framework that strategically reduces image fidelity through techniques like 80p downsampling, structural visual aids (white background masks, orthometric lines), blur masks, contrast enhancement, and In-Context Learning, with task-specific approaches for physical attributes and perceptual phenomena.
Result: Experimental results demonstrate that intentionally degrading visual inputs and providing targeted structural prompts enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
Conclusion: Less is more: strategic degradation of visual inputs combined with structural prompts improves VLM performance by forcing focus on essential information rather than distracting details.
Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model’s focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
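The 80p-downsampling step at the core of DDP can be sketched as a plain resampling routine. This is a toy nearest-neighbour version on a nested-list "image", not the paper's implementation, just to make the degradation operation concrete:

```python
def downsample(pixels, target_h=80):
    """Nearest-neighbour downsample of a 2D pixel grid (list of rows) to target_h rows."""
    h, w = len(pixels), len(pixels[0])
    target_w = max(1, round(w * target_h / h))  # preserve aspect ratio
    return [
        [pixels[int(y * h / target_h)][int(x * w / target_w)]
         for x in range(target_w)]
        for y in range(target_h)
    ]

# Toy 160x320 "image": each pixel's value is its row index.
img = [[y] * 320 for y in range(160)]
small = downsample(img)
print(len(small), len(small[0]))  # 80 rows, 160 columns
```

The low-fidelity result would then be fed to the VLM in place of the original image, stripping fine texture while keeping coarse structure.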
[326] The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
Runhao Mao, Hanshi Wang, Yixiang Yang, Qianli Ma, Jingmeng Zhou, Zhipeng Zhang
Main category: cs.CV
TL;DR: DEA framework addresses catastrophic forgetting in Vision-Language Models for autonomous driving by using prompt-space adaptation instead of weight-space fine-tuning, preserving pre-trained world knowledge while improving driving performance.
Details
Motivation: Current VLMs for autonomous driving suffer from catastrophic forgetting during fine-tuning, where adaptation to driving data erodes their valuable pre-trained world knowledge, creating a self-defeating paradox that undermines their core advantage.
Method: Proposes Drive Expert Adapter (DEA) that shifts adaptation from weight space to prompt space, using dynamic routing through different knowledge experts based on scene-specific cues without corrupting foundational model parameters.
Result: DEA achieves state-of-the-art results on driving tasks while effectively mitigating catastrophic forgetting, preserving generalization capabilities. Introduces a new 180K-scene benchmark for evaluating catastrophic forgetting in autonomous driving.
Conclusion: The DEA framework successfully addresses the catastrophic forgetting problem in VLMs for autonomous driving, enabling improved driving performance without sacrificing pre-trained world knowledge, making VLMs more viable for real-world autonomous systems.
Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model’s foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Data and model are released at FidelityDrivingBench.
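The prompt-space routing idea (selecting an expert prompt from scene cues instead of updating model weights) can be illustrated with a toy router. The cues and prompt strings below are hypothetical placeholders, not the paper's actual experts:

```python
# Hypothetical expert prompts keyed by scene cue; the frozen VLM's weights
# are never touched -- adaptation happens entirely in the prompt.
EXPERT_PROMPTS = {
    "rain":  "Describe hazards assuming wet roads and low visibility.",
    "night": "Describe hazards assuming low light and glare.",
    "urban": "Describe hazards from pedestrians and cross traffic.",
}

def route_prompt(scene_cues, default="Describe driving hazards."):
    """Return the first matching expert prompt, falling back to a generic one."""
    for cue in scene_cues:
        if cue in EXPERT_PROMPTS:
            return EXPERT_PROMPTS[cue]
    return default

print(route_prompt(["night", "highway"]))
```

Because only the routed prompt changes per scene, the pre-trained parameters (and hence the world knowledge they encode) are left intact.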
[327] ZINA: Multimodal Fine-grained Hallucination Detection and Editing
Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig
Main category: cs.CV
TL;DR: ZINA is a method for fine-grained hallucination detection and editing in multimodal LLMs, with VisionHall dataset containing 6.9k annotated and 20k synthetic samples.
Details
Motivation: MLLMs often generate hallucinations that deviate from visual content, and detecting these at a fine-grained level is essential for comprehensive evaluation and analysis.
Method: Proposes ZINA, a method that identifies hallucinated spans at a fine-grained level, classifies error types into six categories, and suggests refinements. Uses the VisionHall dataset with 6.9k manually annotated outputs from 12 MLLMs and 20k synthetic samples generated via a graph-based method.
Result: ZINA outperformed existing methods including GPT-4o and Llama-3.2 in both detection and editing tasks.
Conclusion: The paper introduces a novel task of multimodal fine-grained hallucination detection/editing, proposes ZINA method, and creates VisionHall dataset, showing superior performance over existing methods.
Abstract: Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we construct VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
[328] Unified Vector Floorplan Generation via Markup Representation
Kaede Shiohara, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: FMLM is a transformer-based model using Floorplan Markup Language (FML) representation to generate diverse, high-fidelity floorplans from various conditions like site boundaries, adjacency graphs, or partial layouts.
Details
Motivation: Existing floorplan generation methods lack flexibility and struggle to generalize across heterogeneous conditional tasks. Early constraint-based methods ensure feasibility but lack diversity, while recent generative models have suboptimal representations that limit their ability to handle different input conditions.
Method: Introduces Floorplan Markup Language (FML), a structured grammar representation that encodes floorplan information. Uses this representation to cast floorplan generation as a next token prediction task. Develops FMLM, a transformer-based generative model that can handle diverse conditional inputs like site boundaries, room adjacency graphs, or partial layouts.
Result: Comprehensive experiments on RPLAN dataset show FMLM surpasses previous task-specific state-of-the-art methods despite being a single model. The model produces high-fidelity and functional floorplans under diverse conditions.
Conclusion: FML representation effectively unifies diverse floorplan generation tasks into a single next token prediction problem. FMLM demonstrates superior performance and generalization capability compared to specialized models for individual tasks.
Abstract: Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, FMLM, capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.
[329] FileGram: Grounding Agent Personalization in File-System Behavioral Traces
Shuai Liu, Shulin Tian, Kairui Hu, Yuhao Dong, Zhe Yang, Bo Li, Jingkang Yang, Chen Change Loy, Ziwei Liu
Main category: cs.CV
TL;DR: FileGram: A framework for personalizing AI agents through file-system behavioral traces, featuring a data engine, benchmark, and memory architecture for memory-centric file-system agents.
Details
Motivation: Current AI agents in file systems lack effective personalization due to privacy constraints and difficulty collecting multimodal real-world traces. Existing methods focus on interactions but ignore dense behavioral traces in file-system operations.
Method: Three-component framework: (1) FileGramEngine, a scalable persona-driven data engine simulating realistic workflows and generating fine-grained multimodal action sequences; (2) FileGramBench, a diagnostic benchmark for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; (3) FileGramOS, a bottom-up memory architecture building user profiles from atomic actions and content deltas rather than dialogue summaries, encoding traces into procedural, semantic, and episodic channels.
Result: Extensive experiments show FileGramBench remains challenging for state-of-the-art memory systems, and both FileGramEngine and FileGramOS are effective. The framework is open-sourced to support future research.
Conclusion: FileGram provides a comprehensive framework for grounding agent memory and personalization in file-system behavioral traces, addressing current limitations in data constraints and interaction-centric approaches.
Abstract: Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction. However, effective personalization remains limited by severe data constraints: strict privacy barriers and the difficulty of jointly collecting multimodal real-world traces prevent scalable training and evaluation, and existing methods remain interaction-centric while overlooking dense behavioral traces in file-system operations. To address this gap, we propose FileGram, a comprehensive framework that grounds agent memory and personalization in file-system behavioral traces, comprising three core components: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences at scale; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; and (3) FileGramOS, a bottom-up memory architecture that builds user profiles directly from atomic actions and content deltas rather than dialogue summaries, encoding these traces into procedural, semantic, and episodic channels with query-time abstraction. Extensive experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective. By open-sourcing the framework, we hope to support future research on personalized memory-centric file-system agents.
[330] Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Yutong Xie, Nguyen Cam-Tu, Minh Khoi Nguyen, Dang Huy Pham Nguyen, Anton van den Hengel, Johan W. Verjans, Phi Le Nguyen, Vu Minh Hieu Phan
Main category: cs.CV
TL;DR: A patch-level hallucination detection framework for LVLMs that analyzes fine-grained token-level interactions to identify hallucinated objects based on diffuse attention patterns and lack of semantic alignment with specific image regions.
Details
Motivation: Current hallucination detection methods for large vision-language models rely on coarse, whole-image measures, which can be deceived by hallucinated tokens that show weak but widely scattered correlations across many regions, aggregating into deceptively high overall relevance scores.
Method: Introduces a patch-level detection framework that examines fine-grained token-level interactions across model layers, identifying two characteristic signatures of hallucinated tokens: 1) diffuse, non-localized attention patterns vs. compact, focused attention in faithful tokens, and 2) lack of meaningful semantic alignment with any visual region. Uses patch-level statistical features combined with hidden-layer representations.
Result: Achieves up to 90% accuracy in token-level hallucination detection, demonstrating superiority of fine-grained structural analysis over global detection methods.
Conclusion: Faithful object tokens must be strongly grounded in specific image regions, and fine-grained patch-level analysis provides more effective hallucination detection than global approaches by capturing structural differences in attention patterns and semantic alignment.
Abstract: Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
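The "diffuse vs. focused attention" signature can be quantified with a simple entropy score over a token's patch-attention distribution. This is a toy sketch with a hypothetical threshold; the paper's actual patch-level statistical features are richer:

```python
import math

def attention_entropy(attn):
    """Shannon entropy (nats) of a token's attention weights over image patches."""
    z = sum(attn)
    p = [a / z for a in attn]
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def is_diffuse(attn, frac=0.9):
    """Flag a token whose entropy exceeds `frac` of the uniform maximum log(n)."""
    return attention_entropy(attn) > frac * math.log(len(attn))

focused = [0.9] + [0.1 / 15] * 15   # mass concentrated on one patch
diffuse = [1 / 16] * 16             # mass spread over all 16 patches
print(is_diffuse(focused), is_diffuse(diffuse))  # → False True
```

A faithful object token, being grounded in a specific region, should score low on this measure, while a hallucinated token's scattered attention scores near the uniform maximum.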
[331] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi
Main category: cs.CV
TL;DR: A dataset and framework for graphical abstract selection and recommendation in scientific papers, with novel evaluation metrics.
Details
Motivation: Graphical abstracts are important for scientific communication but require visualization expertise, limiting adoption. Current research lacks exploration of visual materials' potential as graphical abstracts.
Method: Created SciGA-145k dataset with 145k papers and 1.14M figures. Defined two tasks: intra-GA recommendation (selecting figures within a paper) and inter-GA recommendation (finding GAs from other papers). Proposed CAR metric for evaluation.
Result: Benchmark results show viability of the tasks and effectiveness of CAR metric. Established foundation for advancing scientific communication in AI for Science.
Conclusion: The work provides a dataset and framework for graphical abstract research, enabling automated GA generation and recommendation systems to improve scientific communication.
Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. Although recent research increasingly incorporates visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Designing effective GAs requires advanced visualization skills, hindering their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, specifically designed to support GA selection and recommendation, and to facilitate research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA Recommendation, identifying figures within a given paper well-suited as GAs, and 2) Inter-GA Recommendation, retrieving GAs from other papers to inspire new GA designs. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric for fine-grained analysis of model behavior. CAR addresses limitations of traditional rank-based metrics by considering that not only an explicitly labeled GA but also other in-paper figures may plausibly serve as GAs. Benchmark results demonstrate the viability of our tasks and the effectiveness of CAR. Collectively, these establish a foundation for advancing scientific communication within AI for Science.
[332] Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
Ahan Shabanov, Peter Hedman, Ethan Weber, Zhengqin Li, Denis Rozumny, Gael Le Lan, Naina Dhingra, Lei Luo, Andrea Vedaldi, Christian Richardt, Andrea Tagliasacchi, Bo Zhu, Numair Khan
Main category: cs.CV
TL;DR: Free-Range Gaussians: A multi-view reconstruction method that predicts non-pixel, non-voxel-aligned 3D Gaussians from few images using flow matching, enabling plausible content synthesis in unobserved regions.
Details
Motivation: Prior methods produce highly redundant grid-aligned Gaussians and suffer from holes or blurry conditional means in unobserved regions. The authors aim to create a generative reconstruction method that can synthesize plausible content in unobserved areas while using fewer Gaussians.
Method: Uses flow matching over Gaussian parameters for generative reconstruction. Introduces hierarchical patching to group spatially related Gaussians into joint transformer tokens, reducing sequence length. Proposes timestep-weighted rendering loss during training, and photometric gradient guidance and classifier-free guidance at inference.
Result: Shows consistent improvements over pixel and voxel-aligned methods on Objaverse and Google Scanned Objects datasets while using significantly fewer Gaussians. Achieves large gains when input views leave parts of objects unobserved.
Conclusion: Free-Range Gaussians enables high-quality 3D reconstruction from few views with plausible content generation in unobserved regions, overcoming limitations of grid-aligned approaches through generative flow matching and efficient hierarchical representation.
Abstract: We present Free-Range Gaussians, a multi-view reconstruction method that predicts non-pixel, non-voxel-aligned 3D Gaussians from as few as four images. This is done through flow matching over Gaussian parameters. Our generative formulation of reconstruction allows the model to be supervised with non-grid-aligned 3D data, and enables it to synthesize plausible content in unobserved regions. Thus, it improves on prior methods that produce highly redundant grid-aligned Gaussians, and suffer from holes or blurry conditional means in unobserved regions. To handle the number of Gaussians needed for high-quality results, we introduce a hierarchical patching scheme to group spatially related Gaussians into joint transformer tokens, halving the sequence length while preserving structure. We further propose a timestep-weighted rendering loss during training, and photometric gradient guidance and classifier-free guidance at inference to improve fidelity. Experiments on Objaverse and Google Scanned Objects show consistent improvements over pixel and voxel-aligned methods while using significantly fewer Gaussians, with large gains when input views leave parts of the object unobserved.
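The hierarchical patching idea, grouping related Gaussians into joint tokens to halve the transformer sequence length, can be sketched with a toy pairing rule. The actual grouping is spatial; this hypothetical version only shows the token-count and feature-width arithmetic:

```python
def patch_pairs(gaussians):
    """Concatenate consecutive pairs of parameter vectors into joint tokens,
    halving the sequence length while keeping all parameters."""
    assert len(gaussians) % 2 == 0
    return [gaussians[i] + gaussians[i + 1] for i in range(0, len(gaussians), 2)]

seq = [[i, i] for i in range(8)]      # 8 Gaussians, 2 params each
tokens = patch_pairs(seq)
print(len(tokens), len(tokens[0]))    # 4 tokens, 4 params each
```

Halving the sequence length quarters self-attention cost, which is why such grouping matters when predicting many Gaussians at once.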
[333] HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
Mauricio Soroco, Francesco Pittaluga, Zaid Tasneem, Abhishek Aich, Bingbing Zhuang, Wuyang Chen, Manmohan Chandraker, Ziyu Jiang
Main category: cs.CV
TL;DR: HorizonWeaver is a framework for instruction-guided photorealistic editing of complex driving scenes, addressing multi-level granularity, rich semantics, and domain shifts through large-scale dataset generation, language-guided masks, and joint training losses.
Details
Motivation: Existing instruction-guided image editors trained on object-centric or artistic data struggle with dense, safety-critical driving layouts, creating a need for scalable generation of realistic, controllable driving scenes beyond real-world testing.
Method: Three complementary contributions: (1) Large-scale dataset generation from Boreas, nuScenes, and Argoverse2; (2) Language-Guided Masks for fine-grained editing using semantics-enriched masks and prompts; (3) Joint training losses for content preservation and instruction alignment.
Result: Outperforms prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%. Collected 255K images across 13 editing categories.
Conclusion: HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, enabling realistic generation beyond real-world testing for autonomous driving safety.
Abstract: Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/
[334] Your Pre-trained Diffusion Model Secretly Knows Restoration
Sudarshan Rajagopalan, Vishal M. Patel
Main category: cs.CV
TL;DR: The paper introduces a method to unlock inherent restoration capabilities in pre-trained diffusion models through learned prompt embeddings, using a diffusion bridge formulation for stable training without fine-tuning or control modules.
Details
Motivation: Current diffusion-based restoration methods rely on fine-tuning or Control-Net modules to leverage pre-trained diffusion priors for All-in-One Restoration. The authors aim to show that these models inherently possess restoration behavior that can be unlocked more efficiently.
Method: The method learns prompt embeddings at the text encoder output rather than using text prompts or token optimization. To address training instability from misaligned forward noising and reverse sampling, they use a diffusion bridge formulation that aligns training and inference dynamics, enforcing coherent denoising paths from noisy degraded states to clean images.
Result: The approach achieves competitive performance and generalization across diverse degradations on pre-trained WAN video and FLUX image models, converting them into high-performing restoration models without fine-tuning or restoration-specific control modules.
Conclusion: Pre-trained diffusion models have inherent restoration capabilities that can be unlocked through learned prompt embeddings with proper alignment of training and inference dynamics, offering an efficient alternative to fine-tuning or control modules for All-in-One Restoration.
Abstract: Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or Control-Net style modules to leverage the pre-trained diffusion model’s priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we introduce our lightweight learned prompts on the pre-trained WAN video model and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.
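The core recipe, optimizing only a conditioning vector against a frozen model, can be illustrated with a toy one-step "denoiser". This is entirely hypothetical (the paper optimizes embeddings at the text-encoder output of a frozen diffusion model); the sketch only shows that gradient descent on the conditioning alone can recover the clean target:

```python
# Frozen toy "denoiser": conditioning c shifts the prediction; its
# parameters (here, the subtraction itself) are never updated.
def frozen_denoiser(x, c):
    return [xi - ci for xi, ci in zip(x, c)]

noisy = [2.0, -1.0, 3.0]
clean = [1.0,  0.5, 1.0]
c     = [0.0,  0.0, 0.0]           # trainable "prompt embedding"
lr    = 0.1

for _ in range(200):               # gradient descent on c only
    pred = frozen_denoiser(noisy, c)
    grad = [-2 * (p, t)[0] + 2 * t for p, t in zip(pred, clean)]  # d/dc of sum((pred-clean)^2)
    c = [ci - lr * gi for ci, gi in zip(c, grad)]

print([round(ci, 3) for ci in c])  # → [1.0, -1.5, 2.0], i.e. noisy - clean
```

The loss gradient with respect to c is -2(pred - clean), so the update pulls c toward the value that makes the frozen model output the clean image, which mirrors (in miniature) learning a prompt that unlocks restoration behavior.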
[335] ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
Dawar Khan, Alexandre Kouyoumdjian, Xinyu Liu, Omar Mena, Dominik Engel, Ivan Viola
Main category: cs.CV
TL;DR: ClickAIXR is an on-device framework for multimodal vision-language interaction in XR that combines controller-based object selection with local VLM processing for privacy-preserving, transparent object-centered interactions.
Details
Motivation: The paper addresses limitations of existing XR interaction systems that rely on cloud-based AI (privacy/latency concerns) or gaze/voice-only interfaces (ambiguity issues). There's a need for precise object selection combined with multimodal understanding while maintaining privacy through on-device processing.
Method: ClickAIXR integrates an on-device vision-language model with controller-based object selection in XR. Users click on real-world objects using a controller, then the selected object image is processed locally by the VLM to answer natural language questions through text and speech. Implemented using Magic Leap SDK (C API) with ONNX-based local VLM inference.
Result: User study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5 showed moderate latency and acceptable user experience. The system demonstrated improved transparency and privacy through on-device processing while reducing ambiguity compared to gaze- or voice-only interfaces.
Conclusion: Click-based object selection combined with on-device AI has potential to advance trustworthy, privacy-preserving XR interactions. The framework shows promise for multimodal vision-language interaction in extended reality environments.
Abstract: We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: nanovis.org/ClickAIXR.html
[336] SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi
Main category: cs.CV
TL;DR: SpatialEdit introduces a benchmark, synthetic dataset, and baseline model for fine-grained spatial image editing with precise control over object layout and camera viewpoints.
Details
Motivation: Current image editing models are insufficient for fine-grained spatial manipulations, lacking proper evaluation benchmarks and training data for geometry-driven transformations.
Method: Three main contributions: (1) SpatialEdit-Bench benchmark for evaluating perceptual plausibility and geometric fidelity, (2) SpatialEdit-500k synthetic dataset generated with controllable Blender pipeline, (3) SpatialEdit-16B baseline model for fine-grained spatial editing.
Result: The method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks.
Conclusion: The paper provides comprehensive resources for advancing spatial image editing, including benchmark, dataset, and baseline model, addressing current limitations in fine-grained spatial manipulation.
Abstract: Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are as follows: (i) We introduce SpatialEdit-Bench, a comprehensive benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.
[337] Optical Context Compression Is Just (Bad) Autoencoding
Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
Main category: cs.CV
TL;DR: Vision-based text compression (like DeepSeek-OCR) doesn’t outperform simpler direct methods for language modeling, challenging the excitement around optical context compression.
Details
Motivation: The paper challenges the excitement around using vision as a compression medium for long textual contexts, questioning whether the detour through rendering text to pixels and then using vision encoders actually helps compared to simpler direct methods.
Method: Compared DeepSeek-OCR’s vision encoder against two baselines: (1) near-zero-parameter mean pooling, and (2) a learned hierarchical encoder. Evaluated both reconstruction quality and language modeling performance at various compression ratios.
Result: For reconstruction, simple direct methods match or surpass vision at every compression ratio. For language modeling, vision performs comparably to truncation (which simply discards context) and loses to the hierarchical encoder at every compression ratio. All compression methods outperform truncation for factual recall, but vision never surpasses the best direct baseline.
Conclusion: The excitement around optical context compression outpaces the evidence. Vision-based compression doesn’t provide advantages over simpler direct methods for language modeling tasks.
Abstract: DeepSeek-OCR shows that rendered text can be reconstructed from a small number of vision tokens, sparking excitement about using vision as a compression medium for long textual contexts. But this pipeline requires rendering token embeddings to pixels and compressing from there – discarding learned representations in favor of an image the vision encoder must then recover from. We ask whether this detour helps. Comparing DeepSeek-OCR’s vision encoder against near-zero-parameter mean pooling and a learned hierarchical encoder, we find it does not. For reconstruction, simple direct methods match or surpass vision at every compression ratio. For language modeling, vision performs comparably to truncation – a baseline that simply discards context – and loses to the hierarchical encoder at every compression ratio. As expected, all compression methods outperform truncation for factual recall, but vision never surpasses the best direct baseline. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding.
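The "near-zero-parameter mean pooling" baseline the paper pits against the vision encoder is simple enough to sketch directly: consecutive token embeddings are averaged in groups, shrinking the sequence by the compression ratio. The divisibility assumption and shapes below are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def mean_pool_compress(embeddings: np.ndarray, ratio: int) -> np.ndarray:
    """Compress a (seq_len, dim) embedding sequence by averaging each group of
    `ratio` consecutive tokens. A sketch of the direct baseline idea; this
    assumes seq_len is divisible by the ratio (padding strategies vary).
    """
    n, d = embeddings.shape
    assert n % ratio == 0, "sketch assumes seq_len divisible by the ratio"
    return embeddings.reshape(n // ratio, ratio, d).mean(axis=1)

# 6 tokens of dim 2, compressed 3x down to 2 tokens.
tokens = np.arange(12, dtype=np.float64).reshape(6, 2)
compressed = mean_pool_compress(tokens, ratio=3)  # -> shape (2, 2)
```

Because the pooling operates directly on the learned token embeddings, it avoids the render-to-pixels detour that the paper argues discards useful representations.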
[338] A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen
Main category: cs.CV
TL;DR: DeltaWorld is a generative video world model that predicts future frames by encoding feature differences between consecutive frames into compact “delta tokens,” enabling efficient multi-hypothesis training and diverse future predictions with significantly reduced computational cost.
Details
Motivation: Current video world models face challenges: discriminative models produce deterministic predictions that average over possible futures, while generative models are computationally expensive. Recent work shows predicting in vision foundation model feature space reduces parameters, but most approaches remain discriminative.
Method: Introduces DeltaTok tokenizer that encodes VFM feature differences between consecutive frames into single continuous “delta tokens.” DeltaWorld then operates on these tokens to generate diverse plausible futures. The compact representation (1D temporal sequence vs 3D spatio-temporal) enables tractable multi-hypothesis training where many futures are generated in parallel and only the best is supervised.
Result: DeltaWorld forecasts futures that more closely align with real-world outcomes on dense forecasting tasks, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models.
Conclusion: DeltaWorld demonstrates that efficient generative world modeling is possible through compact delta token representations, enabling diverse future predictions with dramatically reduced computational requirements compared to existing approaches.
Abstract: Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous “delta” token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
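The dimensionality reduction behind delta tokens can be sketched as follows. DeltaTok is a learned tokenizer, so the fixed pooling rule below is only an illustration of the shape change it achieves: a 3D spatio-temporal feature volume collapses into a 1D temporal sequence with one vector per frame transition.

```python
import numpy as np

def delta_tokens(features: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, D) video feature volume into (T-1, D) "delta"
    vectors: the spatially pooled difference between consecutive frames.
    Illustrative only; the actual DeltaTok encoder is learned, not a
    fixed pooling rule.
    """
    diffs = features[1:] - features[:-1]  # (T-1, H, W, D) frame-to-frame deltas
    return diffs.mean(axis=(1, 2))        # pool over space -> one token per step

feats = np.random.rand(8, 16, 16, 64)     # 8 frames of 16x16 VFM features
tokens = delta_tokens(feats)              # -> shape (7, 64)
```

Working over a short 1D token sequence instead of a dense spatio-temporal grid is what makes the paper's many-futures-in-parallel training tractable.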
[339] Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo
Zeyu Ma, Alexander Raistrick, Jia Deng
Main category: cs.CV
TL;DR: SimpleProc: A procedural generator using NURBS and basic patterns to create synthetic training data for multi-view stereo that outperforms manually curated data at smaller scales and matches/exceeds performance at larger scales.
Details
Motivation: The paper addresses the challenge of obtaining high-quality training data for multi-view stereo (MVS) tasks, where manually curated datasets from games and real-world objects are expensive, limited in scale, and may not cover the full design space of procedural rules needed for optimal MVS performance.
Method: Developed SimpleProc, a fully procedural generator driven by a small set of rules using Non-Uniform Rational Basis Splines (NURBS), basic displacement, and texture patterns to generate synthetic training data for MVS models.
Result: At 8,000 images, SimpleProc outperforms manually curated images from games and real-world objects. When scaled to 352,000 images, it achieves performance comparable to or exceeding models trained on 692,000 manually curated images across several benchmarks.
Conclusion: Procedural generation using simple rules (NURBS, displacement, textures) can create effective training data for MVS that scales better than manual curation while achieving superior or comparable performance, demonstrating the value of exploring the design space of procedural rules for computer vision tasks.
Abstract: In this paper, we explore the design space of procedural rules for multi-view stereo (MVS). We demonstrate that we can generate effective training data using SimpleProc: a new, fully procedural generator driven by a very small set of rules using Non-Uniform Rational Basis Splines (NURBS), as well as basic displacement and texture patterns. At a modest scale of 8,000 images, our approach achieves superior results compared to manually curated images (at the same scale) sourced from games and real-world objects. When scaled to 352,000 images, our method yields performance comparable to–and in several benchmarks, exceeding–models trained on over 692,000 manually curated images. The source code and the data are available at https://github.com/princeton-vl/SimpleProc.
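To give a flavor of the geometric primitive SimpleProc builds on, here is a minimal evaluator for a rational Bézier curve, the single-segment special case of NURBS. This is an illustration of the curve family only, not the generator's code; control points and weights below are arbitrary.

```python
import numpy as np
from math import comb

def rational_bezier(ctrl: np.ndarray, weights: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a rational Bezier curve (single-segment NURBS) at parameter t.
    ctrl: (n+1, dim) control points; weights: (n+1,) positive weights.
    Weighted Bernstein basis, normalized so the result is a true weighted average.
    """
    n = len(ctrl) - 1
    basis = np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])
    wb = basis * weights
    return (wb[:, None] * ctrl).sum(axis=0) / wb.sum()

# A quadratic arc: the curve interpolates its endpoints exactly,
# while the middle weight pulls it toward or away from the apex.
pts = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]])
w = np.array([1.0, 0.5, 1.0])
start = rational_bezier(pts, w, 0.0)  # -> [0., 0.]
end = rational_bezier(pts, w, 1.0)    # -> [2., 0.]
```

Varying control points, weights, displacement, and textures over such surfaces is the kind of small rule set the paper scales into hundreds of thousands of training images.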
[340] Rethinking Model Efficiency: Multi-Agent Inference with Large Models
Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian
Main category: cs.CV
TL;DR: Large VLMs with short responses can be more efficient than small VLMs with long responses; proposed multi-agent framework transfers reasoning tokens from small to large models for better performance.
Details
Motivation: Autoregressive generation in VLMs makes output token count a bottleneck for latency, but different models need different token counts for similar performance. Need to optimize efficiency while maintaining performance.
Method: Comprehensive latency analysis across VLM components on simulated data, empirical study on real benchmarks, and proposed multi-agent inference framework that keeps large models with short responses but transfers key reasoning tokens from small models when needed.
Result: Large models with fewer output tokens can be more efficient than small models with long sequences; the multi-agent framework with token transfer lets a large model with short responses approach the performance of the same large model performing its own full reasoning.
Conclusion: Output token count is critical for VLM efficiency; large models with short responses can be optimal; token transfer from small models can enhance large model performance efficiently.
Abstract: Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck for end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can match or exceed a small model's performance while using significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that reusing the reasoning tokens from small models helps the large model approach the performance it would achieve with its own full reasoning, which confirms the effectiveness of our proposal.
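The paper's core observation, that output token count can dominate end-to-end latency under autoregressive decoding, is easy to see with a back-of-the-envelope latency model. All constants below are made-up assumptions for illustration, not measurements from the paper.

```python
def decode_latency_ms(prefill_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Toy autoregressive latency model: one prefill pass plus one decode
    step per output token. Constants fed in are illustrative assumptions."""
    return prefill_ms + per_token_ms * n_tokens

# Hypothetical numbers: the large model is slower per step but answers
# tersely, while the small model is fast per step but needs a long
# chain of reasoning tokens to reach comparable accuracy.
large = decode_latency_ms(prefill_ms=300.0, per_token_ms=40.0, n_tokens=20)   # 1100.0 ms
small = decode_latency_ms(prefill_ms=80.0, per_token_ms=10.0, n_tokens=400)   # 4080.0 ms
```

Under these assumed constants the large, terse model finishes well before the small, verbose one, which is the regime the proposed multi-agent framework tries to stay in.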
[341] LoMa: Local Feature Matching Revisited
David Nordström, Johan Edstedt, Georg Bökman, Jonathan Astermark, Anders Heyden, Viktor Larsson, Mårten Wadenbäck, Michael Felsberg, Fredrik Kahl
Main category: cs.CV
TL;DR: LoMa introduces a data-driven approach to local feature matching using large-scale data mixtures, modern training recipes, and scaled model capacity, achieving significant performance gains on challenging benchmarks including a new HardMatch dataset.
Details
Motivation: Local feature matching has lagged behind modern data-driven approaches in 3D vision. While feed-forward reconstruction models benefit from large datasets, feature matching models are still trained on limited mid-sized datasets. Current benchmarks are saturated with relatively easy image pairs from successful 3D reconstructions.
Method: LoMa combines: 1) large and diverse data mixtures, 2) modern training recipes, 3) scaled model capacity, and 4) scaled compute. The authors also create HardMatch - a new dataset of 1000 highly challenging image pairs from internet data with manually annotated ground truth correspondences.
Result: LoMa outperforms state-of-the-art method ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10°) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022.
Conclusion: Scaling data, models, and compute for local feature matching leads to remarkable performance gains. The HardMatch dataset addresses benchmark saturation and enables better evaluation of progress in feature matching.
Abstract: Local feature matching has long been a fundamental component of 3D vision systems such as Structure-from-Motion (SfM), yet progress has lagged behind the rapid advances of modern data-driven approaches. The newer approaches, such as feed-forward reconstruction models, have benefited extensively from scaling dataset sizes, whereas local feature matching models are still only trained on a few mid-sized datasets. In this paper, we revisit local feature matching from a data-driven perspective. In our approach, which we call LoMa, we combine large and diverse data mixtures, modern training recipes, scaled model capacity, and scaled compute, resulting in remarkable gains in performance. Since current standard benchmarks mainly rely on collecting sparse views from successful 3D reconstructions, the evaluation of progress in feature matching has been limited to relatively easy image pairs. To address the resulting saturation of benchmarks, we collect 1000 highly challenging image pairs from internet data into a new dataset called HardMatch. Ground truth correspondences for HardMatch are obtained via manual annotation by the authors. In our extensive benchmarking suite, we find that LoMa makes outstanding progress across the board, outperforming the state-of-the-art method ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10$^\circ$) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022. We release our code and models publicly at https://github.com/davnords/LoMa.
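The mAA figures quoted above are means of per-threshold accuracies over pose errors. A common recipe, sketched here with an assumed threshold grid (benchmarks differ in the exact grids they use), is:

```python
import numpy as np

def mean_average_accuracy(errors, thresholds=(1, 2, 5, 10)) -> float:
    """mAA over pose errors (degrees): for each threshold, the fraction of
    image pairs whose error falls below it, then the mean across thresholds.
    The threshold grid here is an assumption; each benchmark defines its own.
    """
    errs = np.asarray(errors, dtype=float)
    accs = [(errs <= t).mean() for t in thresholds]
    return float(np.mean(accs))

# Five hypothetical image pairs with pose errors in degrees.
errors_deg = np.array([0.5, 1.5, 3.0, 8.0, 30.0])
maa = mean_average_accuracy(errors_deg)  # -> 0.5
```

Averaging over several thresholds rewards methods that are accurate at both tight and loose tolerances, rather than only near one cutoff.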
[342] PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding
Siyuan Liu, Chaoqun Zheng, Xin Zhou, Tianrui Feng, Dingkang Liang, Xiang Bai
Main category: cs.CV
TL;DR: PointTPA is a test-time parameter adaptation framework for 3D point cloud scene understanding that generates input-aware network parameters to handle diverse geometries and spatial layouts, achieving state-of-the-art performance with minimal parameter overhead.
Details
Motivation: Scene-level point cloud understanding faces challenges from diverse geometries, imbalanced category distributions, and varied spatial layouts. Existing methods use static network parameters during inference, limiting adaptability to dynamic scene data.
Method: Proposes PointTPA with Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights. Integrated into PTv3 structure with lightweight modules (<2% of backbone parameters).
Result: Achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning methods across multiple benchmarks while maintaining strong parameter efficiency.
Conclusion: PointTPA demonstrates effective test-time dynamic network parameter adaptation for enhancing 3D scene understanding with minimal parameter overhead, showing the efficacy of input-aware parameter generation.
Abstract: Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone’s parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at https://github.com/H-EmbodVis/PointTPA.
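The idea of input-aware parameter generation can be sketched in the spirit of the Dynamic Parameter Projector: a tiny projector maps each patch's pooled descriptor to the weights of a per-patch linear layer, which is then applied to that patch's points. The shapes and the single-linear-layer design are assumptions for illustration; the paper's actual module is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_linear(patch_feats: np.ndarray, hyper_w: np.ndarray) -> np.ndarray:
    """Sketch of patch-wise adaptive weights (hypothetical, not PointTPA's code).

    patch_feats: (P, N, D) -- P patches of N points with D-dim features.
    hyper_w:     (D, D*D)  -- projector emitting a DxD weight matrix per patch.
    """
    P, N, D = patch_feats.shape
    desc = patch_feats.mean(axis=1)                   # (P, D) per-patch descriptor
    W = (desc @ hyper_w).reshape(P, D, D)             # patch-wise adaptive weights
    return np.einsum('pnd,pde->pne', patch_feats, W)  # apply each patch's own layer

x = rng.normal(size=(4, 32, 8))        # 4 patches, 32 points, 8-dim features
proj = rng.normal(size=(8, 64)) * 0.1  # the only shared parameters
y = dynamic_linear(x, proj)            # -> shape (4, 32, 8)
```

Note the parameter economy: only the shared projector is learned, while the per-patch weights are generated on the fly at test time, matching the paper's emphasis on a small parameter overhead.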
[343] Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo
Main category: cs.CV
TL;DR: Vanast is a unified framework that generates garment-transferred human animation videos from a single human image, garment images, and pose guidance video in a single step, addressing identity drift and garment distortion issues of conventional two-stage pipelines.
Details
Motivation: Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, leading to identity drift, garment distortion, and front-back inconsistency. The authors aim to create a unified framework that performs the entire process in a single step for coherent synthesis.
Method: The framework uses a unified approach with large-scale triplet supervision. They construct a data generation pipeline that includes: 1) generating identity-preserving human images in alternative outfits, 2) capturing full upper and lower garment triplets to overcome single-garment limitations, and 3) assembling diverse in-the-wild triplets without garment catalog images. They introduce a Dual Module architecture for video diffusion transformers to stabilize training and preserve pretrained generative quality.
Result: Vanast produces high-fidelity, identity-consistent animation across a wide range of garment types. The unified approach improves garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation.
Conclusion: The unified framework successfully addresses the limitations of conventional two-stage pipelines by performing garment transfer and animation in a single step, resulting in coherent synthesis with better identity preservation and garment accuracy.
Abstract: We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
[344] Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan, Enming Luo, Chun-Ta Lu, Tushar Dogra, Ranjay Krishna, Ariel Fuxman
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.10821 returned HTTP 429 (rate limited).
[345] Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
Binxu Wang, Jingxuan Fan, Xu Pan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2601.06338 returned HTTP 429 (rate limited).
[346] SHLE: Devices Tracking and Depth Filtering for Stereo-based Height Limit Estimation
Zhaoxin Fan, Kaixing Yang, Min Zhang, Zhenbo Song, Hongyan Liu, Jun He
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2212.11538 returned HTTP 429 (rate limited).
[347] Floralens: a Deep Learning Model for the Portuguese Native Flora
António Filgueiras, Eduardo R. B. Marques, Luís M. B. Lopes, Miguel Marques, Hugo Silva
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2403.12072 returned HTTP 429 (rate limited).
[348] Advancing Pre-trained Teacher: Towards Robust Feature Discrepancy for Anomaly Detection
Canhui Tang, Sanping Zhou, Yizhe Li, Yonghao Dong, Le Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2405.02068 returned HTTP 429 (rate limited).
[349] SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
Chaitat Utintu, Yi-Zhe Song
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2405.18716 returned HTTP 429 (rate limited).
[350] Scalable and Generalizable Correspondence Pruning via Geometry-Consistent Pre-training
Tangfei Liao, Xiaoqin Zhang, Tao Wang, Hao Ye, Min Li, Guobao Xiao, Mang Ye
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2406.05773 returned HTTP 429 (rate limited).
[351] VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
Sibo Wang, Xiangkui Cao, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen, Wen Gao
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2406.14194 returned HTTP 429 (rate limited).
[352] Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification
Inès Hyeonsu Kim, Woojeong Jin, Soowon Son, Junyoung Seo, Seokju Cho, JeongYeol Baek, Byeongwon Lee, JoungBin Lee, Seungryong Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2406.16042 returned HTTP 429 (rate limited).
[353] An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models
Binxu Wang, Cengiz Pehlevan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.03206 returned HTTP 429 (rate limited).
[354] Robust Adaptation of Foundation Models with Black-Box Visual Prompting
Changdae Oh, Gyeongdeok Seo, Geunyoung Jung, Zhi-Qi Cheng, Hosik Choi, Jiyoung Jung, Kyungwoo Song
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2407.17491 returned HTTP 429 (rate limited).
[355] Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.08751 returned HTTP 429 (rate limited).
[356] Patch-Wise Hypergraph Contrastive Learning with Dual Normal Distribution Weighting for Multi-Domain Stain Transfer
Haiyan Wei, Hangrui Xu, Bingxu Zhu, Yulian Geng, Aolei Liu, Wenfei Yin, Jian Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2503.09523 returned HTTP 429 (rate limited).
[357] BalancedDPO: Adaptive Multi-Metric Alignment
Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, Vaneet Aggarwal
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.12575 was rate limited (HTTP 429).
[358] Stitch-a-Demo: Video Demonstrations from Multistep Descriptions
Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.13821 was rate limited (HTTP 429).
[359] FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Zebin Yao, Lei Ren, Huixing Jiang, Wei Chen, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2504.15958 was rate limited (HTTP 429).
[360] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks
Kejie Zhao, Wenjia Hua, Aiersi Tuerhong, Luziwei Leng, Yuxin Ma, Qinghai Guo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.05375 was rate limited (HTTP 429).
[361] Detecting and Characterising Mobile App Metamorphosis in Google Play Store
D. Denipitiyage, B. Silva, K. Gunathilaka, S. Seneviratne, A. Mahanti, A. Seneviratne, S. Chawla
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2407.14565 was rate limited (HTTP 429).
[362] SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams
Zhuoheng Gao, Yihao Li, Jiyao Zhang, Rui Zhao, Tong Wu, Hao Tang, Zhaofei Yu, Hao Dong, Guozhang Chen, Tiejun Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2505.19487 was rate limited (HTTP 429).
[363] Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.00318 was rate limited (HTTP 429).
[364] Common Inpainted Objects In-N-Out of Context
Tianze Yang, Tyson Jordan, Ruitong Sun, Ninghao Liu, Jin Sun
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.00721 was rate limited (HTTP 429).
[365] Geological Field Restoration through the Lens of Image Inpainting
Vladislav Trifonov, Ivan Oseledets, Ekaterina Muravleva
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.04869 was rate limited (HTTP 429).
[366] BePo: Dual Representation for 3D Occupancy Prediction
Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, Fatih Porikli
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.07002 was rate limited (HTTP 429).
[367] MT-PCR: Hybrid Mamba-Transformer Network with Spatial Serialization for Point Cloud Registration
Bingxi Liu, An Liu, Hao Chen, Huaqi Tao, Jinqiang Cui, Yiqun Wang, Hong Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.13183 was rate limited (HTTP 429).
[368] 2D Triangle Splatting for Direct Differentiable Mesh Training
Kaifeng Sheng, Zheng Zhou, Yingliang Peng, Qianwei Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.18575 was rate limited (HTTP 429).
[369] Unit: Building Unit Detection Dataset
Haozhou Zhai, Yanzhe Gao, Tianjiang Hu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.03139 was rate limited (HTTP 429).
[370] Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models
Miguel Esparza, Archit Gupta, Kai Yin, Yiming Xiao, Ali Mostafavi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.01895 was rate limited (HTTP 429).
[371] PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Mijeong Kim, Gunhee Kim, Jungyoon Choi, Wonjae Roh, Bohyung Han
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.02794 was rate limited (HTTP 429).
[372] Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
Hyunsoo Cha, Byungjun Kim, Hanbyul Joo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.04434 was rate limited (HTTP 429).
[373] Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets
Phongsakon Mark Konrad, Andrei-Alexandru Popa, Yaser Sabzehmeidani, Liang Zhong, Madhulika Tripathy, Andrei Constantinescu, Elisa A. Liehn, Serkan Ayvaz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.05892 was rate limited (HTTP 429).
[374] DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation
Ze-Xin Yin, Jiaxiong Qiu, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.07435 was rate limited (HTTP 429).
[375] MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging
Ignacy Kolton, Weronika Smolak-Dyżewska, Joanna Kaleta, Żaneta Świderska-Chadaj, Marcin Mazur, Mirosław Dziekiewicz, Tomasz Markiewicz, Przemysław Spurek
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.16806 was rate limited (HTTP 429).
[376] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.22258 was rate limited (HTTP 429).
[377] Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing
Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal, Siddharth Roheda
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.23279 was rate limited (HTTP 429).
[378] Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.24702 was rate limited (HTTP 429).
[379] Markovian Reeb Graphs for Simulating Spatiotemporal Patterns of Life
Anantajit Subrahmanya, Chandrakanth Gudavalli, Connor Levenson, B.S. Manjunath
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.03152 was rate limited (HTTP 429).
[380] LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios
Zhiyuan Huang, Jiahao Chen, Bing Su
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.09926 was rate limited (HTTP 429).
[381] Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation
Vijay M. Galshetwar, Praful Hambarde, Prashant W. Patil, Akshay Dudhane, Sachin Chaudhary
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.09228 was rate limited (HTTP 429).
[382] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.15148 was rate limited (HTTP 429).
[383] Revisiting Multimodal Positional Encoding in Vision-Language Models
Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.23095 was rate limited (HTTP 429).
[384] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.27584 was rate limited (HTTP 429).
[385] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.12606 was rate limited (HTTP 429).
[386] ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP
Linxiang Su, András Balogh
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.17362 was rate limited (HTTP 429).
[387] SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.20814 was rate limited (HTTP 429).
[388] ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang, Hengtao Shen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.18082 was rate limited (HTTP 429).
[389] Low-Bitrate Video Compression through Semantic-Conditioned Diffusion
Lingdong Wang, Guan-Ming Su, Divya Kothandaraman, Tsung-Wei Huang, Mohammad Hajiesmaili, Ramesh K. Sitaraman
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.00408 was rate limited (HTTP 429).
[390] SkillSight: Efficient First-Person Skill Assessment with Gaze
Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.19629 was rate limited (HTTP 429).
[391] ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
Qi’ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao, Jiacheng Zhang, Xiaoling Wang, Liang He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.03666 was rate limited (HTTP 429).
[392] Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?
Wenkai Huang, Yijia Guo, Gaolei Li, Lei Ma, Hang Zhang, Liwen Hu, Jiazheng Wang, Jianhua Li, Tiejun Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.22262 was rate limited (HTTP 429).
[393] Bringing Your Portrait to 3D Presence
Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li, Xiao Li, Xun Cao, Hao Zhu, Yan Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.22553 was rate limited (HTTP 429).
[394] ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Aligned Attention
Huiguo He, Pengyu Yan, Ziqi Yi, Weizhi Zhong, Zheng Liu, Yejun Tang, Huan Yang, Guanbin Li, Lianwen Jin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.08477 was rate limited (HTTP 429).
[395] Action-guided generation of 3D functionality segmentation data
Jaime Corsetti, Francesco Giuliari, Davide Boscaini, Pedro Hermosilla, Andrea Pilzer, Guofeng Mei, Alexandros Delitzas, Francis Engelmann, Fabio Poiesi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.23230 was rate limited (HTTP 429).
[396] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
Lingjun Zhao, Yandong Luo, James Hays, Lu Gan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.03370 was rate limited (HTTP 429).
[397] MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.06581 was rate limited (HTTP 429).
[398] VideoCoF: Unified Video Editing with Temporal Reasoner
Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yue Ma, Yan Huang, Min Xu, Qiang Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.07469 was rate limited (HTTP 429).
[399] Neural Collapse in Test-Time Adaptation
Xiao Chen, Zhongjing Du, Jiazhen Huang, Xu Jiang, Li Lu, Jingyan Jiang, Zhi Wang
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.10421 returned HTTP 429 (rate-limited).
[400] MARC: Multi-Label Adaptive Retrieval Contrastive Loss for Remote Sensing Images
Amna Amir, Erchan Aptoula
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.16294 returned HTTP 429 (rate-limited).
[401] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2601.11109 returned HTTP 429 (rate-limited).
[402] FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation
Qijian Tian, Xin Tan, Jiayu Ying, Xuhong Wang, Yuan Xie, Lizhuang Ma
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.17541 returned HTTP 429 (rate-limited).
[403] NASTaR: NovaSAR Automated Ship Target Recognition Dataset
Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.18503 returned HTTP 429 (rate-limited).
[404] Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback
Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez, Abrar Majeedi, Yun Fu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.18599 returned HTTP 429 (rate-limited).
[405] InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.01554 returned HTTP 429 (rate-limited).
[406] VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion
Zaidao Han, Risa Higashita, Jiang Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.18954 returned HTTP 429 (rate-limited).
[407] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution
Yidi Liu, Zihao Fan, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Xueyang Fu, Zheng-Jun Zha
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.22647 returned HTTP 429 (rate-limited).
[408] ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.08392 returned HTTP 429 (rate-limited).
[409] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.23709 returned HTTP 429 (rate-limited).
[410] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2512.25073 returned HTTP 429 (rate-limited).
[411] Fusion2Print: Deep Flash-Non-Flash Fusion for Contactless Fingerprint Matching
Roja Sahoo, Anoop Namboodiri
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2601.02318 returned HTTP 429 (rate-limited).
[412] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2601.05249 returned HTTP 429 (rate-limited).
[413] 3AM: 3egment Anything with Geometric Consistency in Videos
Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2601.08831 returned HTTP 429 (rate-limited).
[414] A Step to Decouple Optimization in 3DGS
Renjie Ding, Yaonan Wang, Min Liu, Jialin Zhu, Jiazheng Wang, Jiahao Zhao, Wenting Shen, Feixiang He, Xiang Chen
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2601.16736 returned HTTP 429 (rate-limited).
[415] Improving Multimodal Learning with Dispersive and Anchoring Regularization
Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2601.21670 returned HTTP 429 (rate-limited).
[416] HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation
Wencan Cheng, Gim Hee Lee
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.01586 returned HTTP 429 (rate-limited).
[417] Contextualized Visual Personalization in Vision-Language Models
Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.03454 returned HTTP 429 (rate-limited).
[418] RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
Michael Baltaxe, Dan Levi, Sagie Benaim
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.09532 returned HTTP 429 (rate-limited).
[419] BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li, Jie Guo, Zhengkang Zhou, Yanwen Guo
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.21105 returned HTTP 429 (rate-limited).
[420] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Camile Lendering, Erkut Akdag, Egor Bondarev
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2602.23013 returned HTTP 429 (rate-limited).
[421] NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
Rong Fu, Yiqing Lyu, Chunlei Meng, Muge Qi, Yabin Jin, Qi Zhao, Li Bao, Juntao Gao, Fuqian Shi, Nilanjan Dey, Wei Luo, Simon Fong
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.01756 returned HTTP 429 (rate-limited).
[422] SCP: Spatial Causal Prediction in Video
Yanguang Zhao, Jie Yang, Shengqiong Wu, Shutong Hu, Hongbo Qiu, Yu Wang, Guijia Zhang, Tan Kai Ze, Hao Fei, Chia-Wen Lin, Mong-Li Lee, Wynne Hsu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.03944 returned HTTP 429 (rate-limited).
[423] ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction
Hailong Chu, Hongbing Li, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Shuo Zhang, Lei Li
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.06683 returned HTTP 429 (rate-limited).
[424] Rotation Equivariant Mamba for Vision Tasks
Zhongchen Zhao, Qi Xie, Keyu Huang, Lei Zhang, Deyu Meng, Zongben Xu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.09138 returned HTTP 429 (rate-limited).
[425] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization
Xuepeng Jing, Wenhuan Lu, Hao Meng, Zhizhi Yu, Jianguo Wei
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.24936 returned HTTP 429 (rate-limited).
[426] More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.09573 returned HTTP 429 (rate-limited).
[427] MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection
Peiyuan Jiang, Yao Liu, Yanglei Gan, Jiaye Yang, Lu Liu, Daibing Yao, Qiao Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.26064 returned HTTP 429 (rate-limited).
[428] Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Zengyan Wang, Sirshapan Mitra, Rajat Modi, Grace Lim, Yogesh Rawat
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.13740 returned HTTP 429 (rate-limited).
[429] Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
Zhuoxuan Peng, Boan Zhu, Xingjian Zhang, Wenying Li, S.-H. Gary Chan
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.14507 returned HTTP 429 (rate-limited).
[430] Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline
Xingyu Liu, Zewei He, Yu Chen, Chunyu Zhu, Zixuan Chen, Xing Luo, Zhe-Ming Lu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.16446 returned HTTP 429 (rate-limited).
[431] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl, Martin Schramm
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.27817 returned HTTP 429 (rate-limited).
[432] EI: Early Intervention for Multimodal Imaging based Disease Recognition
Qijie Wei, Hailan Lin, Xirong Li
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.17514 returned HTTP 429 (rate-limited).
[433] Universal Skeleton Understanding via Differentiable Rendering and MLLMs
Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.18003 returned HTTP 429 (rate-limited).
[434] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery
Rong Fu, Jiekai Wu, Haiyun Wei, Xiaowen Ma, Shiyin Lin, Kangan Qian, Chuang Liu, Jianyuan Ni, Simon James Fong
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.18634 returned HTTP 429 (rate-limited).
[435] CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution
Kaizhen Tan
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.20475 returned HTTP 429 (rate-limited).
[436] LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser, Yinzhe Shen, Marlon Steiner, Dominik Strutz, Carlos Fernandez, Christian Kinzig, Guillermo S. Guitierrez-Cabello, Hendrik Königshof, Fabian Immel, Richard Schwarzkopf, Nils Alexander Rack, Kevin Rösch, Kaiwen Wang, Jan-Hendrik Pauls, Martin Lauer, Igor Gilitschenski, Holger Caesar, Christoph Stiller
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.23607 returned HTTP 429 (rate-limited).
[437] MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Quan Dao, Dimitris Metaxas
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.26357 returned HTTP 429 (rate-limited).
[438] Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
Kavindu Herath, Joshua Zhao, Saurabh Bagchi
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.29328 returned HTTP 429 (rate-limited).
[439] SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang
Main category: cs.CV
Abstract: Summary unavailable; the arXiv API request for 2603.26481 returned HTTP 429 (rate-limited).
[440] The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
Shivang Chopra, Shaunak Halbe, Chengyue Huang, Brisa Maneechotesuwan, Zsolt Kira
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.27139 returned HTTP 429 (rate limited).
[441] Weakly Convex Ridge Regularization for 3D Non-Cartesian MRI Reconstruction
German Shâma Wache, Chaithya G R, Asma Tanabene, Sebastian Neumayer
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.27158 returned HTTP 429 (rate limited).
[442] GPA: Learning GUI Process Automation from Demonstrations
Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese, Junnan Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.01676 returned HTTP 429 (rate limited).
[443] Event6D: Event-based Novel Object 6D Pose Tracking
Jae-Young Kang, Hoonhee Cho, Taeyeop Lee, Minjun Kang, Bowen Wen, Youngho Kim, Kuk-Jin Yoon
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.28045 returned HTTP 429 (rate limited).
[444] Using predefined vector systems to speed up neural network multimillion class classification
Nikita Gabdullin, Ilya Androsov
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.00779 returned HTTP 429 (rate limited).
[445] Segmentation of Gray Matters and White Matters from Brain MRI data
Chang Sun, Rui Shi, Tsukasa Koike, Tetsuro Sekine, Akio Morita, Tetsuya Sakai
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.29171 returned HTTP 429 (rate limited).
[446] Unbiased Model Prediction Without Using Protected Attribute Information
Puspita Majumdar, Surbhi Mittal, Saheb Chhabra, Mayank Vatsa, Richa Singh
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.29270 returned HTTP 429 (rate limited).
[447] Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
Fengyang Xiao, Peng Hu, Lei Xu, XingE Guo, Guanyi Qin, Yuqi Shen, Chengyu Fang, Rihan Zhang, Chunming He, Sina Farsiu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.29773 returned HTTP 429 (rate limited).
[448] JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation
Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.00909 returned HTTP 429 (rate limited).
[449] Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using an Open-Source Pipeline
Aaranay Aadi, Jai Singla, Nitant Dube, Oleg Alexandrov
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.01032 returned HTTP 429 (rate limited).
[450] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
Junyoung Jung, Seokwon Kim, Jung Uk Kim
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.01646 returned HTTP 429 (rate limited).
[451] CASHG: Context-Aware Stylized Online Handwriting Generation
Jinsu Shin, Sungeun Hong, JinYeong Bak
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.02103 returned HTTP 429 (rate limited).
[452] THOM: Generating Physically Plausible Hand-Object Meshes From Text
Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.02736 returned HTTP 429 (rate limited).
[453] GenSmoke-GS: A Multi-Stage Method for Novel View Synthesis from Smoke-Degraded Images Using a Generative Model
Qida Cao, Xinyuan Hu, Changyue Shi, Jiajun Ding, Zhou Yu, Jun Yu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.03039 returned HTTP 429 (rate limited).
[454] Stochastics of shapes and Kunita flows
Stefan Sommer, Gefan Yang, Elizabeth Louise Baker
Main category: cs.CV
Summary unavailable: the arXiv API request for 2512.11676 returned HTTP 429 (rate limited).
[455] From Pen Strokes to Sleep States: Detecting Low-Recovery Days Using Sigma-Lognormal Handwriting Features
Chisa Tanaka, Andrew Vargo, Anna Scius-Bertrand, Andreas Fischer, Koichi Kise
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.11512 returned HTTP 429 (rate limited).
[456] Prediction of Grade, Gender, and Academic Performance of Children and Teenagers from Handwriting Using the Sigma-Lognormal Model
Adrian Iste, Kazuki Nishizawa, Chisa Tanaka, Andrew Vargo, Anna Scius-Bertrand, Andreas Fischer, Koichi Kise
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.11519 returned HTTP 429 (rate limited).
cs.AI
[457] IC3-Evolve: Proof-/Witness-Gated Offline LLM-Driven Heuristic Evolution for IC3 Hardware Model Checking
Mingkai Miao, Guangyu Hu, Ziyi Yang, Hongce Zhang
Main category: cs.AI
TL;DR: IC3-Evolve: An automated offline code-evolution framework that uses LLMs to propose and validate patches for IC3 model checking algorithm, ensuring correctness through proof/witness-gated validation.
Details
Motivation: IC3 (property-directed reachability) algorithm performance depends heavily on heuristic tuning, which is manual, costly, brittle, and hard to reproduce. There's a need for automated, reliable improvement of IC3 implementations while maintaining correctness guarantees.
Method: Uses LLMs offline to propose small, slot-restricted, auditable patches to the IC3 implementation. Employs proof-/witness-gated validation: SAFE runs must emit independently checkable certificates, UNSAFE runs must emit replayable counterexample traces. Deploys as a standalone evolved checker with no ML/LLM inference overhead.
Result: Evolved on HWMCC benchmark and evaluated on unseen public/industrial benchmarks. IC3-Evolve reliably discovers practical heuristic improvements under strict correctness gates, producing standalone checkers with improved performance.
Conclusion: Automated code evolution with LLMs can reliably improve IC3 implementations while maintaining correctness through rigorous validation gates, eliminating manual tuning costs and producing standalone, efficient model checkers.
Abstract: IC3, also known as property-directed reachability (PDR), is a commonly-used algorithm for hardware safety model checking. It checks if a state transition system complies with a given safety property. IC3 either returns UNSAFE (indicating property violation) with a counterexample trace, or SAFE with a checkable inductive invariant as the proof of safety. In practice, the performance of IC3 is dominated by a large web of interacting heuristics and implementation choices, making manual tuning costly, brittle, and hard to reproduce. This paper presents IC3-Evolve, an automated offline code-evolution framework that utilizes an LLM to propose small, slot-restricted and auditable patches to an IC3 implementation. Crucially, every candidate patch is admitted only through proof-/witness-gated validation: SAFE runs must emit a certificate that is independently checked, and UNSAFE runs must emit a replayable counterexample trace, preventing unsound edits from being deployed. Since the LLM is used only offline, the deployed artifact is a standalone evolved checker with zero ML/LLM inference overhead and no runtime model dependency. We evolve on the public hardware model checking competition (HWMCC) benchmark and evaluate the generalizability on unseen public and industrial model checking benchmarks, showing that IC3-Evolve can reliably discover practical heuristic improvements under strict correctness gates.
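The proof-/witness-gated admission described above can be sketched as a small gate function; `run_checker`, `verify_certificate`, and `replay_trace` are hypothetical stand-ins for the paper's actual tooling, not its API.

```python
# Hypothetical sketch of proof-/witness-gated patch admission: a patched
# checker's verdict is accepted only if it comes with an independently
# validated artifact (certificate for SAFE, replayable trace for UNSAFE).

def admit_patch(run_checker, verify_certificate, replay_trace, design):
    """Return True only if the patched checker's verdict is independently validated."""
    verdict, artifact = run_checker(design)
    if verdict == "SAFE":
        # SAFE runs must emit an inductive invariant that a separate,
        # trusted certificate checker confirms.
        return verify_certificate(design, artifact)
    if verdict == "UNSAFE":
        # UNSAFE runs must emit a counterexample trace that replays
        # on the original transition system.
        return replay_trace(design, artifact)
    return False  # timeouts / unknown verdicts are rejected outright
```

The key design point is that the LLM-proposed patch never has to be trusted: only verdicts backed by an artifact that a separate checker validates are admitted, so unsound edits cannot be deployed.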
[458] Structural Segmentation of the Minimum Set Cover Problem: Exploiting Universe Decomposability for Metaheuristic Optimization
Isidora Hernández, Héctor Ferrada, Cristóbal A. Navarro
Main category: cs.AI
TL;DR: The paper proposes a decomposition approach for Minimum Set Cover Problem (MSCP) by detecting connected components in element co-occurrence graphs, solving subproblems independently with GRASP metaheuristic, and combining solutions efficiently.
Details
Motivation: Most existing MSCP methods treat instances as monolithic, overlooking potential intrinsic structural properties of the universe that could be exploited for more efficient optimization.
Method: 1) Preprocessing using disjoint-set union to detect connected components in element co-occurrence graphs, 2) Decomposing the original instance into independent subproblems, 3) Solving each subproblem with the GRASP metaheuristic, 4) Combining partial solutions while maintaining feasibility, 5) Using a bit-level set representation for efficient operations.
Result: Extensive experiments show consistent improvement in solution quality and scalability, especially for large and structurally decomposable instances, with computational practicality achieved through efficient bit-level representations.
Conclusion: Exploiting natural universe segmentation (universe segmentability) through structural decomposition significantly enhances heuristic optimization for MSCP, making the approach particularly effective for large-scale instances with intrinsic decomposable structure.
Abstract: The Minimum Set Cover Problem (MSCP) is a classical NP-hard combinatorial optimization problem with numerous applications in science and engineering. Although a wide range of exact, approximate, and metaheuristic approaches have been proposed, most methods implicitly treat MSCP instances as monolithic, overlooking potential intrinsic structural properties of the universe. In this work, we investigate the concept of universe segmentability in the MSCP and analyze how intrinsic structural decomposition can be exploited to enhance heuristic optimization. We propose an efficient preprocessing strategy based on disjoint-set union (union–find) to detect connected components induced by element co-occurrence within subsets, enabling the decomposition of the original instance into independent subproblems. Each subproblem is solved using the GRASP metaheuristic, and partial solutions are combined without compromising feasibility. Extensive experiments on standard benchmark instances and large-scale synthetic datasets show that exploiting natural universe segmentation consistently improves solution quality and scalability, particularly for large and structurally decomposable instances. These gains are supported by a succinct bit-level set representation that enables efficient set operations, making the proposed approach computationally practical at scale.
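The union–find preprocessing step fits in a few lines: elements that co-occur in any subset are merged, and each resulting component defines an independent set-cover subproblem. This is a minimal sketch (subsets as Python sets is an illustrative assumption), with each group then solvable separately, e.g. by GRASP.

```python
# Sketch of the decomposition preprocessing: union all elements that
# co-occur within a subset, then group subsets by the connected component
# of their elements. Each group is an independent MSCP subproblem.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def decompose(subsets):
    """Group non-empty subsets by the connected component of their elements."""
    elems = {e for s in subsets for e in s}
    parent = {e: e for e in elems}
    for s in subsets:
        s = list(s)
        for e in s[1:]:  # union every element of the subset with the first
            parent[find(parent, e)] = find(parent, s[0])
    groups = {}
    for s in subsets:
        root = find(parent, next(iter(s)))
        groups.setdefault(root, []).append(s)
    return list(groups.values())

# {1,2} and {2,3} share element 2, so they form one component; {4,5} is separate.
parts = decompose([{1, 2}, {2, 3}, {4, 5}])
```

Because the components share no elements, covers found for each group can simply be concatenated without affecting feasibility, which is what makes the per-component GRASP runs independent.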
[459] To Throw a Stone with Six Birds: On Agents and Agenthood
Ioannis Tsiokos
Main category: cs.AI
TL;DR: Six Birds Theory (SBT) defines agents as maintained theory objects with feasible policies that can steer futures while remaining viable, operationalized through ledger-gated feasibility, viability kernels, empowerment metrics, and packaging maps.
Details
Motivation: To provide a rigorous, testable definition of agency that separates persistence from control, avoiding conflated concepts and enabling falsifiable claims about agenthood.
Method: Uses the Six Birds Theory framework with four checkable components: ledger-gated feasibility, a robust viability kernel (greatest fixed point), feasible empowerment as channel capacity, and an empirical packaging map with idempotence defect.
Result: In ring-world experiments, matched-control ablations show four separations: zero empowerment in null regimes, enabling repair collapses the idempotence defect, protocols increase empowerment only at multi-step horizons, and operator rewriting increases median empowerment from 0.73 to 1.34 bits.
Conclusion: Provides hash-traceable tests separating agenthood from agency without requiring claims about goals, consciousness, or biology, with reproducible artifacts for empirical validation.
Abstract: Six Birds Theory (SBT) treats macroscopic objects as induced closures rather than primitives. Empirical discussions of agency often conflate persistence (being an object) with control (making a counterfactual difference), which makes agency claims difficult to test and easy to spoof. We give a type-correct account of agency within SBT: a theory induces a layer with an explicit interface and ledgered constraints; an agent is a maintained theory object whose feasible interface policies can steer outside futures while remaining viable. We operationalize this contract in finite controlled systems using four checkable components: ledger-gated feasibility, a robust viability kernel computed as a greatest fixed point under successor-support semantics, feasible empowerment (channel capacity) as a proxy for difference-making, and an empirical packaging map whose idempotence defect quantifies objecthood under coarse observation. In a minimal ring-world with toggles for repair, protocol holonomy, identity staging, and operator rewriting, matched-control ablations yield four separations: calibrated null regimes with single actions show zero empowerment and block model-misspecification false positives; enabling repair collapses the idempotence defect; protocols increase empowerment only at horizons of two or more steps; and learning to rewrite operators monotonically increases median empowerment (0.73 to 1.34 bits). These results provide hash-traceable tests that separate agenthood from agency without making claims about goals, consciousness, or biological organisms, and they are accompanied by reproducible, audited artifacts.
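The "robust viability kernel computed as a greatest fixed point under successor-support semantics" admits a compact sketch for finite systems. The transition-table encoding below is an illustrative assumption, not the paper's implementation: start from all states and repeatedly discard any state with no feasible action whose entire successor support stays inside the current candidate set.

```python
# Sketch of a robust viability kernel as a greatest fixed point: a state
# survives only if some action has non-empty successor support contained
# entirely within the current kernel candidate; iterate until stable.

def viability_kernel(transitions):
    """transitions: {state: {action: set_of_successor_states}}"""
    kernel = set(transitions)
    changed = True
    while changed:
        changed = False
        for s in list(kernel):
            # s stays viable only if some action keeps ALL successors in kernel
            if not any(succ and succ <= kernel
                       for succ in transitions[s].values()):
                kernel.discard(s)
                changed = True
    return kernel
```

Removing states can invalidate actions of other states, which is why the loop reruns until no further state is discarded; the surviving set is the greatest fixed point.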
[460] Position: Science of AI Evaluation Requires Item-level Benchmark Data
Han Jiang, Susu Zhang, Xiaoyuan Yi, Xing Xie, Ziang Xiao
Main category: cs.AI
TL;DR: The paper argues for item-level AI benchmark data as essential for rigorous AI evaluation science, addressing current validity failures in generative AI assessments.
Details
Motivation: Current AI evaluation paradigms exhibit systemic validity failures (unjustified design choices, misaligned metrics) that remain intractable without principled frameworks for gathering validity evidence and conducting granular diagnostic analysis.
Method: The authors propose item-level analysis of AI benchmark data, dissect current validity failures, revisit evaluation paradigms across computer science and psychometrics, and demonstrate insights through illustrative analyses of item properties and latent constructs.
Result: The paper introduces OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation and catalyze community-wide adoption of item-level analysis.
Conclusion: Item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation, enabling fine-grained diagnostics and principled validation of benchmarks to address systemic validity failures.
Abstract: AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by item-level data. To catalyze community-wide adoption, we introduce OpenEval, a growing repository of item-level benchmark data designed to support evidence-centered AI evaluation.
[461] Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models
Yong Xie, Kexin He, Andres Castellanos-Gomez
Main category: cs.AI
TL;DR: LLMs like ChatGPT can automate scientific instrumentation control, reducing programming barriers for researchers through custom script generation and autonomous AI agents.
Details
Motivation: Complex laboratory instrumentation requires significant programming expertise, creating barriers for researchers lacking computational skills, which limits experimental customization and automation.
Method: Uses LLMs (ChatGPT) to generate custom control scripts for scientific equipment, demonstrated through a case study of a single-pixel camera / scanning photocurrent microscope setup, then extends to autonomous AI agents that can independently operate instruments and refine control strategies.
Result: Successfully demonstrated that ChatGPT can facilitate creation of custom instrumentation control scripts, significantly reducing technical barriers, and showed potential for LLM-assisted tools to evolve into autonomous AI agents for laboratory automation.
Conclusion: LLM-based tools and AI agents have transformative potential in democratizing laboratory automation and accelerating scientific progress by lowering programming barriers for instrumentation control.
Abstract: The control of complex laboratory instrumentation often requires significant programming expertise, creating a barrier for researchers lacking computational skills. This work explores the potential of large language models (LLMs), such as ChatGPT, and LLM-based artificial intelligence (AI) agents to enable efficient programming and automation of scientific equipment. Through a case study involving the implementation of a setup that can be used as a single-pixel camera or a scanning photocurrent microscope, we demonstrate how ChatGPT can facilitate the creation of custom scripts for instrumentation control, significantly reducing the technical barrier for experimental customization. Building on this capability, we further illustrate how LLM-assisted tools can be extended into autonomous AI agents capable of independently operating laboratory instruments and iteratively refining control strategies. This approach underscores the transformative role of LLM-based tools and AI agents in democratizing laboratory automation and accelerating scientific progress.
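The single-pixel camera / scanning photocurrent microscope use case boils down to a raster-scan acquisition loop of the kind an LLM might be asked to generate. This is a hypothetical sketch; `move_stage` and `read_photocurrent` stand in for real instrument-driver calls and are not from the paper.

```python
# Hypothetical raster-scan control loop: step the stage over an nx-by-ny
# grid and record one photocurrent reading per position. The two callables
# abstract over whatever instrument drivers the real setup uses.

def raster_scan(move_stage, read_photocurrent, nx, ny, step):
    """Scan an nx-by-ny grid and return a 2D list of photocurrent readings."""
    image = []
    for iy in range(ny):
        row = []
        for ix in range(nx):
            move_stage(ix * step, iy * step)   # position the stage/beam
            row.append(read_photocurrent())    # acquire one pixel
        image.append(row)
    return image
```

Keeping the hardware calls injectable like this also makes LLM-generated logic testable against stub drivers before it ever touches the real instrument.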
[462] Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing
Nicholas Skytland, Lauren Parsons, Alicia Llewellyn, Steele Billings, Peter Larson, John Anderson, Sean Boisen, Steve Runge
Main category: cs.AI
TL;DR: Paper introduces FAI-C-ST benchmark to evaluate AI models against Christian values, finding current models default to secularism with significant performance gaps in spiritual dimensions.
Details
Motivation: As LLMs increasingly mediate moral and spiritual discussions, they function as formative instruments shaping human understanding. The paper aims to make this formative influence measurable by evaluating whether AI systems can align with specific theological worldviews rather than defaulting to secular neutrality.
Method: Introduces the Flourishing AI Benchmark: Christian Single-Turn (FAI-C-ST) framework to evaluate Frontier Model responses against a Christian understanding of human flourishing across seven dimensions. Compares 20 Frontier Models against both pluralistic and Christian-specific criteria.
Result: Current AI systems are not worldview-neutral but default to Procedural Secularism, resulting in systematic performance decline of ~17 points across all flourishing dimensions. Most critically, 31-point decline in Faith and Spirituality dimension. Performance gap stems from training objectives prioritizing broad acceptability over coherent theological reasoning.
Conclusion: AI alignment is fundamentally a formation problem, not just safety. The performance gap in values alignment arises from training objectives that prioritize broad acceptability over deep, internally coherent moral or theological reasoning, revealing limitations in current AI’s ability to support specific theological worldviews.
Abstract: Artificial intelligence (AI) alignment is fundamentally a formation problem, not only a safety problem. As Large Language Models (LLMs) increasingly mediate moral deliberation and spiritual inquiry, they do more than provide information; they function as instruments of digital catechesis, actively shaping and ordering human understanding, decision-making, and moral reflection. To make this formative influence visible and measurable, we introduce the Flourishing AI Benchmark: Christian Single-Turn (FAI-C-ST), a framework designed to evaluate Frontier Model responses against a Christian understanding of human flourishing across seven dimensions. By comparing 20 Frontier Models against both pluralistic and Christian-specific criteria, we show that current AI systems are not worldview-neutral. Instead, they default to a Procedural Secularism that lacks the grounding necessary to sustain theological coherence, resulting in a systematic performance decline of approximately 17 points across all dimensions of flourishing. Most critically, there is a 31-point decline in the Faith and Spirituality dimension. These findings suggest that the performance gap in values alignment is not a technical limitation, but arises from training objectives that prioritize broad acceptability and safety over deep, internally coherent moral or theological reasoning.
[463] VERT: Reliable LLM Judges for Radiology Report Evaluation
Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens, Asma Ben Abacha
Main category: cs.AI
TL;DR: Proposes VERT, an LLM-based metric for radiology report evaluation that improves correlation with radiologist judgments by up to 11.7% over existing methods, with fine-tuning achieving 25% gains using minimal training data.
Details
Motivation: Current radiology report evaluation methods focus on chest X-rays and small models, lacking robustness across different modalities and anatomies. Need to determine optimal LLM configurations for radiology evaluation and improve correlation with expert judgments.
Method: Comprehensive correlation analysis comparing existing metrics (RadFact, GREEN, FineRadScore) with the proposed VERT metric. Evaluates open/closed-source models (reasoning/non-reasoning) across RadEval and RaTE-Eval datasets spanning multiple modalities. Tests few-shot, ensembling, and parameter-efficient fine-tuning approaches.
Result: VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Fine-tuning Qwen3 30B yields 25% gains using only 1,300 training samples and reduces inference time up to 37.2 times. Systematic error analysis reveals metric alignment patterns with expert judgments.
Conclusion: LLM-based judges are effective for radiology report evaluation, with lightweight adaptation achieving reliable results. Fine-tuning small models with minimal data can significantly improve performance and efficiency.
Abstract: Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess alignment of these metrics against expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
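The quantity at the center of this entry is rank correlation between automatic metric scores and radiologist ratings. As a minimal sketch (not the authors' code; the report scores and metric values below are invented for illustration), Kendall's tau-a can be computed directly from pairwise orderings:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a rank correlation between two equal-length score lists."""
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical expert ratings vs. two automatic metrics over five reports
expert   = [4, 2, 5, 1, 3]
metric_a = [3.8, 2.1, 4.9, 1.2, 3.1]  # same ordering as the experts
metric_b = [2.0, 4.5, 1.0, 3.9, 2.5]  # mostly inverted ordering
print(kendall_tau(expert, metric_a))  # 1.0
print(kendall_tau(expert, metric_b))  # -0.8
```

In practice a library routine (e.g. one that handles ties, which tau-a does not) would be used; this only illustrates what "correlation with radiologist judgments" measures.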
[464] Hume’s Representational Conditions for Causal Judgment: What Bayesian Formalization Abstracted Away
Yiling Wu
Main category: cs.AI
TL;DR: This paper analyzes Hume’s three representational conditions for causal judgment and traces their evolution through Bayesian epistemology and predictive processing, using large language models as a contemporary case study.
Details
Motivation: The paper aims to extract and analyze Hume's three representational conditions for causal judgment from his texts, and examine how these conditions have been preserved or abstracted away in later formal frameworks like Bayesian epistemology and predictive processing.
Method: The paper uses philosophical analysis to extract three representational conditions from Hume’s texts: experiential grounding, structured retrieval, and vivacity transfer. It then traces the evolution of these concepts through formal frameworks from Hume to Bayesian epistemology and predictive processing, using large language models as a contemporary illustrative case.
Result: The analysis shows that later frameworks preserve the updating structure of Hume’s insight while abstracting away his three further representational conditions. Large language models serve as a contemporary example that exhibits statistical updating without satisfying Hume’s three conditions, making visible requirements that were previously background assumptions.
Conclusion: Hume’s three representational conditions remain integral to his causal psychology, but have been largely abstracted away in modern formal frameworks, with large language models highlighting this divergence by demonstrating statistical updating without experiential grounding, structured retrieval, or vivacity transfer.
Abstract: Hume’s account of causal judgment presupposes three representational conditions: experiential grounding (ideas must trace to impressions), structured retrieval (association must operate through organized networks exceeding pairwise connection), and vivacity transfer (inference must produce felt conviction, not merely updated probability). This paper extracts these conditions from Hume’s texts and argues that they are integral to his causal psychology. It then traces their fate through the formalization trajectory from Hume to Bayesian epistemology and predictive processing, showing that later frameworks preserve the updating structure of Hume’s insight while abstracting away these further representational conditions. Large language models serve as an illustrative contemporary case: they exhibit a form of statistical updating without satisfying the three conditions, thereby making visible requirements that were previously background assumptions in Hume’s framework.
[465] TABQAWORLD: Optimizing Multimodal Reasoning for Multi-Turn Table Question Answering
Tung Sum Thomas Kwok, Xinyu Wang, Xiaofeng Lin, Peng Lu, Chunhe Wang, Changlun Li, Hanwei Wu, Nan Tang, Elisa Kreiss, Guang Cheng
Main category: cs.AI
TL;DR: TABQAWORLD is a training-free multimodal table reasoning framework that dynamically switches between visual and textual representations to improve table state readout reliability while optimizing reasoning trajectories using table metadata.
Details
Motivation: Existing multi-turn table reasoning methods suffer from representation errors in table encoding that accumulate over multiple turns, while tabular grounding methods are computationally expensive and impractical for real-world deployment.
Method: Jointly optimizes tabular action through representation and estimation: 1) Action-conditioned multimodal selection policy that dynamically switches between visual and textual representations, 2) Stepwise reasoning trajectory optimization using table metadata (dimension, data types, key values) to plan trajectories and compress low-complexity actions.
Result: Achieves state-of-the-art performance with 4.87% accuracy improvements over baselines, with 5.42% accuracy gain and 33.35% inference latency reduction over static settings.
Conclusion: TABQAWORLD establishes a new standard for reliable and efficient table reasoning by addressing representation error accumulation while maintaining practical deployment feasibility.
Abstract: Multimodal reasoning has emerged as a powerful framework for enhancing reasoning capabilities of reasoning models. While multi-turn table reasoning methods have improved reasoning accuracy through tool use and reward modeling, they rely on fixed text serialization for table state readouts. This introduces representation errors in table encoding that significantly accumulate over multiple turns. Such accumulation is alleviated by tabular grounding methods at the expense of inference compute and cost, rendering real-world deployment impractical. To address this, we introduce TABQAWORLD, a table reasoning framework that jointly optimizes tabular action through representation and estimation. For representation, TABQAWORLD employs an action-conditioned multimodal selection policy, which dynamically switches between visual and textual representations to maximize table state readout reliability. For estimation, TABQAWORLD optimizes the stepwise reasoning trajectory through table metadata including dimension, data types and key values, safely planning trajectories and compressing low-complexity actions to reduce conversation turns and latency. Designed as a training-free framework, empirical evaluations show that TABQAWORLD achieves state-of-the-art performance with 4.87% accuracy improvements over baselines, with 5.42% accuracy gain and 33.35% inference latency reduction over static settings, establishing a new standard for reliable and efficient table reasoning.
[466] Contextual Control without Memory Growth in a Context-Switching Task
Song-Ju Kim
Main category: cs.AI
TL;DR: Intervention-based recurrent architecture enables contextual control without enlarging recurrent dimensionality by using additive context-indexed operators on shared latent states.
Details
Motivation: Current approaches for context-dependent sequential decision making either provide context explicitly as input or increase recurrent memory to represent context internally. The authors explore a third alternative: achieving contextual dependence by intervening on a shared recurrent latent state without enlarging recurrent dimensionality.
Method: Proposed an intervention-based recurrent architecture where a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. Evaluated on context-switching sequential decision tasks under partial observability, comparing against label-assisted baselines (direct context access) and memory baselines (enlarged recurrent state).
Result: The intervention model performed strongly on the main benchmark without additional recurrent dimensions. Using conditional mutual information (I(C;O | S)) as an operational probe, the intervention model exhibited positive conditional contextual information for task-relevant phase-1 outcomes, indicating effective contextual dependence.
Conclusion: Intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in sequential decision making under partial observability, offering a more parameter-efficient approach to context-dependent processing.
Abstract: Context-dependent sequential decision making is commonly addressed either by providing context explicitly as an input or by increasing recurrent memory so that contextual information can be represented internally. We study a third alternative: realizing contextual dependence by intervening on a shared recurrent latent state, without enlarging recurrent dimensionality. To this end, we introduce an intervention-based recurrent architecture in which a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. We evaluate this idea on a context-switching sequential decision task under partial observability. We compare three model families: a label-assisted baseline with direct context access, a memory baseline with enlarged recurrent state, and the proposed intervention model, which uses no direct context input to the recurrent core and no memory growth. On the main benchmark, the intervention model performs strongly without additional recurrent dimensions. We also evaluate the models using the conditional mutual information (I(C;O | S)) as a theorem-motivated operational probe of contextual dependence at fixed latent state. For task-relevant phase-1 outcomes, the intervention model exhibits positive conditional contextual information. Together, these results suggest that intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in this setting.
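The probe used above, the conditional mutual information I(C;O | S), can be estimated from samples by a plug-in count estimator when context C, outcome O, and latent state S are discrete. A toy sketch (all variable names and data below are illustrative, not the paper's code):

```python
from collections import Counter
from math import log2

def conditional_mi(triples):
    """Plug-in estimate of I(C;O|S) in bits from (c, o, s) samples."""
    n = len(triples)
    n_cos = Counter(triples)
    n_cs = Counter((c, s) for c, o, s in triples)
    n_os = Counter((o, s) for c, o, s in triples)
    n_s = Counter(s for c, o, s in triples)
    mi = 0.0
    for (c, o, s), k in n_cos.items():
        # p(c,o,s) * log2[ p(c,o|s) / (p(c|s) p(o|s)) ], in count form
        mi += (k / n) * log2((k * n_s[s]) / (n_cs[(c, s)] * n_os[(o, s)]))
    return mi

# Toy data at a fixed latent state 's': outcome copies context -> 1 bit
dependent = [(0, 0, 's'), (0, 0, 's'), (1, 1, 's'), (1, 1, 's')]
# Outcome is uniform regardless of context -> 0 bits
independent = [(0, 0, 's'), (0, 1, 's'), (1, 0, 's'), (1, 1, 's')]
print(conditional_mi(dependent))    # 1.0
print(conditional_mi(independent))  # 0.0
```

A positive value at fixed S is what the paper treats as evidence of contextual dependence beyond what the shared latent state itself carries.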
[467] Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graphs from Complex Documents
Mohammad Sadeq Abolhasani, Yang Ba, Yixuan He, Rong Pan
Main category: cs.AI
TL;DR: TRACE-KG is a multimodal framework that jointly constructs context-enriched knowledge graphs with induced schemas from text, without requiring predefined ontologies.
Details
Motivation: Current knowledge graph construction methods face a trade-off: ontology-driven approaches require costly schema design but ensure consistency, while schema-free methods produce fragmented graphs with poor global organization, especially for technical documents with dense, context-dependent information.
Method: TRACE-KG jointly constructs knowledge graphs and induced schemas without predefined ontologies. It captures conditional relations through structured qualifiers, organizes entities/relations using data-driven schemas, and maintains full traceability to source evidence.
Result: Experiments show TRACE-KG produces structurally coherent, traceable knowledge graphs and offers a practical alternative to both ontology-driven and schema-free construction pipelines.
Conclusion: TRACE-KG provides a balanced approach between rigid ontology-driven methods and fragmented schema-free approaches, enabling context-enriched knowledge graph construction with reusable semantic scaffolds.
Abstract: Knowledge graph construction typically relies either on predefined ontologies or on schema-free extraction. Ontology-driven pipelines enforce consistent typing but require costly schema design and maintenance, whereas schema-free methods often produce fragmented graphs with weak global organization, especially in long technical documents with dense, context-dependent information. We propose TRACE-KG (Text-dRiven schemA for Context-Enriched Knowledge Graphs), a multimodal framework that jointly constructs a context-enriched knowledge graph and an induced schema without assuming a predefined ontology. TRACE-KG captures conditional relations through structured qualifiers and organizes entities and relations using a data-driven schema that serves as a reusable semantic scaffold while preserving full traceability to the source evidence. Experiments show that TRACE-KG produces structurally coherent, traceable knowledge graphs and offers a practical alternative to both ontology-driven and schema-free construction pipelines.
[468] Resource-Conscious Modeling for Next-Day Discharge Prediction Using Clinical Notes
Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng
Main category: cs.AI
TL;DR: Fine-tuned LLMs underperform traditional TF-IDF models for next-day discharge prediction in spine surgery using clinical notes.
Details
Motivation: Timely discharge prediction is crucial for optimizing bed turnover and resource allocation in elective spine surgery units, requiring accurate prediction models that can work with real-world clinical data.
Method: Compared 13 models including TF-IDF with XGBoost/LGBM and compact LLMs (DistilGPT-2, Bio_ClinicalBERT) fine-tuned via LoRA for predicting next-day discharge using postoperative clinical notes.
Result: TF-IDF with LGBM achieved best balance: F1-score 0.47 for discharge class, recall 0.51, highest AUC-ROC (0.80). LoRA improved recall in DistilGPT-2, but transformer-based/generative models underperformed overall.
Conclusion: Interpretable, resource-efficient traditional models may outperform compact LLMs in real-world, imbalanced clinical prediction tasks, suggesting simpler approaches can be more effective for certain healthcare applications.
Abstract: Timely discharge prediction is essential for optimizing bed turnover and resource allocation in elective spine surgery units. This study evaluates the feasibility of lightweight, fine-tuned large language models (LLMs) and traditional text-based models for predicting next-day discharge using postoperative clinical notes. We compared 13 models, including TF-IDF with XGBoost and LGBM, and compact LLMs (DistilGPT-2, Bio_ClinicalBERT) fine-tuned via LoRA. TF-IDF with LGBM achieved the best balance, with an F1-score of 0.47 for the discharge class, a recall of 0.51, and the highest AUC-ROC (0.80). While LoRA improved recall in DistilGPT2, overall transformer-based and generative models underperformed. These findings suggest interpretable, resource-efficient models may outperform compact LLMs in real-world, imbalanced clinical prediction tasks.
[469] BioAlchemy: Distilling Biological Literature into Reasoning-Ready Reinforcement Learning Training Data
Brian Hsu, Ozan Gökdemir, Carlo Siebenschuh, Bruce Parrello, Neil Getty, Thomas S. Brettin, Rick L. Stevens, Ian T. Foster, Nicholas Chia, Arvind Ramanathan
Main category: cs.AI
TL;DR: BioAlchemy pipeline creates 345K biology reasoning problems from research text, improving model performance by 9.12% through topic distribution alignment and reinforcement learning.
Details
Motivation: Biology reasoning models lag behind math/coding despite available training text. Current datasets don't align with modern research topics, and methods for extracting challenging research problems from biology text are underdeveloped.
Method: BioAlchemy pipeline for sourcing diverse verifiable Q&A pairs from biology research corpus. Created BioAlchemy-345K dataset, aligned it to modern biology topic distribution, used reinforcement learning to improve reasoning performance.
Result: BioAlchemist-8B model improves over base reasoning model by 9.12% on biology benchmarks, demonstrating stronger scientific reasoning capabilities in biology.
Conclusion: Topic-balanced datasets from research text and reinforcement learning can significantly improve biology reasoning models, addressing current limitations in biological AI research.
Abstract: Despite the large corpus of biology training text, the impact of reasoning models on biological research generally lags behind math and coding. In this work, we show that biology questions from current large-scale reasoning datasets do not align well with modern research topic distributions in biology, and that this topic imbalance may negatively affect performance. In addition, we find that methods for extracting challenging and verifiable research problems from biology research text are a critical yet underdeveloped ingredient in applying reinforcement learning for better performance on biology research tasks. We introduce BioAlchemy, a pipeline for sourcing a diverse set of verifiable question-and-answer pairs from a scientific corpus of biology research text. We curate BioAlchemy-345K, a training dataset containing over 345K scientific reasoning problems in biology. Then, we demonstrate how aligning our dataset to the topic distribution of modern scientific biology can be used with reinforcement learning to improve reasoning performance. Finally, we present BioAlchemist-8B, which improves over its base reasoning model by 9.12% on biology benchmarks. These results demonstrate the efficacy of our approach for developing stronger scientific reasoning capabilities in biology. The BioAlchemist-8B model is available at: https://huggingface.co/BioAlchemy.
[470] ActionNex: A Virtual Outage Manager for Cloud
Zhenfeng Lin, Haoji Hu, Ming Hao, Xuchao Zhang, Ryan Zhang, Junhao Li, Ze Li, Oleg Kulygin, Chetan Bansal, Hatay Tuna, Murali Chintalapati, Sheila Jiang, Salman Zafar, Angie Anderson
Main category: cs.AI
TL;DR: ActionNex is a production-grade agentic system for automated outage management in cloud operations that processes multimodal operational signals and provides next-best action recommendations through hierarchical memory and reasoning.
Details
Motivation: Outage management in large-scale cloud operations is heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability, creating a need for automated assistance systems.
Method: ActionNex ingests multimodal operational signals (outage content, telemetry, human communications) and compresses them into critical events representing state transitions. It uses hierarchical memory (long-term KCA knowledge from playbooks, episodic memory of prior outages, working memory of live context) and a reasoning agent that aligns events to preconditions, retrieves relevant memories, and generates actionable recommendations.
Result: Evaluated on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8-54.8% recall. The system has been piloted in production with positive early feedback.
Conclusion: ActionNex demonstrates a practical approach to automating outage management through multimodal signal processing and hierarchical memory systems, showing promising results in production environments.
Abstract: Outage management in large-scale cloud operations remains heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability. We present ActionNex, a production-grade agentic system that supports end-to-end outage assistance, including real-time updates, knowledge distillation, and role- and stage-conditioned next-best action recommendations. ActionNex ingests multimodal operational signals (e.g., outage content, telemetry, and human communications) and compresses them into critical events that represent meaningful state transitions. It couples this perception layer with a hierarchical memory subsystem: long-term Key-Condition-Action (KCA) knowledge distilled from playbooks and historical executions, episodic memory of prior outages, and working memory of the live context. A reasoning agent aligns current critical events to preconditions, retrieves relevant memories, and generates actionable recommendations; executed human actions serve as an implicit feedback signal to enable continual self-evolution in a human-agent hybrid system. We evaluate ActionNex on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8-54.8% recall. The system has been piloted in production and has received positive early feedback.
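The reported precision and recall are set-based: recommended actions are compared against a ground-truth action set per outage. A minimal sketch of that computation (the action names below are invented for illustration):

```python
def precision_recall(recommended, ground_truth):
    """Set-based precision/recall of recommended vs. ground-truth actions."""
    rec, gt = set(recommended), set(ground_truth)
    tp = len(rec & gt)  # recommendations that match an expected action
    return tp / len(rec), tp / len(gt)

# Hypothetical outage: 3 of 4 recommendations hit 3 of 5 expected actions
p, r = precision_recall(["a1", "a2", "a3", "a9"],
                        ["a1", "a2", "a3", "a4", "a5"])
print(p, r)  # 0.75 0.6
```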
[471] Decomposing Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning
Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke
Main category: cs.AI
TL;DR: Multi-agent RL framework for delayed communication with metric to balance communication gain vs delay cost, improving performance in cooperative tasks with partial observability.
Details
Motivation: Communication delays in cooperative multi-agent RL cause temporal misalignment and stale information, degrading coordination under partial observability.
Method: Formalized as DeComm-POMG, proposed CGDC metric to decompose message effects, developed CDCMA actor-critic framework with selective communication, future observation prediction, and attention-based message fusion.
Result: Consistent improvements in performance, robustness, and generalization across Cooperative Navigation, Predator Prey, and SMAC benchmarks at multiple delay levels.
Conclusion: The CGDC metric and CDCMA framework effectively address delayed communication challenges in cooperative multi-agent RL, with validated components showing practical benefits.
Abstract: Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet cross-timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message’s effect into communication gain and delay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor–critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.
[472] Structural Rigidity and the 57-Token Predictive Window: A Physical Framework for Inference-Layer Governability in Large Language Models
Gregory M. Ruddell
Main category: cs.AI
TL;DR: AI safety monitoring fails to detect pre-commitment signals in most instruction-tuned models; energy-based framework reveals only 1 of 7 models shows predictive signals before rule violations, with factual hallucinations undetectable internally.
Details
Motivation: Current AI safety relies on behavioral monitoring and post-training alignment, but these approaches produce no detectable pre-commitment signal in most instruction-tuned models, creating a gap in safety governance for autonomous AI systems.
Method: Energy-based governance framework connecting transformer inference dynamics to constraint-satisfaction models; analysis of seven models across five geometric regimes using trajectory tension (ρ = ||a||/||v||) and energy asymmetry metrics.
Result: Only one model configuration (Phi-3-mini-4k-instruct) shows a 57-token pre-commitment window; others show silent failure, late detection, inverted dynamics, or flat geometry; factual hallucinations produce no predictive signal across 72 test conditions.
Conclusion: Rule violation and hallucination are distinct failure modes requiring different detection approaches; internal monitoring only works where resistance exists, while factual confabulation needs external verification; provides taxonomy for deployment risk evaluation.
Abstract: Current AI safety relies on behavioral monitoring and post-training alignment, yet empirical measurement shows these approaches produce no detectable pre-commitment signal in a majority of instruction-tuned models tested. We present an energy-based governance framework connecting transformer inference dynamics to constraint-satisfaction models of neural computation, and apply it to a seven-model cohort across five geometric regimes. Using trajectory tension (ρ = ||a|| / ||v||), we identify a 57-token pre-commitment window in Phi-3-mini-4k-instruct under greedy decoding on arithmetic constraint probes. This result is model-specific, task-specific, and configuration-specific, demonstrating that pre-commitment signals can exist but are not universal. We introduce a five-regime taxonomy of inference behavior: Authority Band, Late Signal, Inverted, Flat, and Scaffold-Selective. Energy asymmetry (Σρ_misaligned / Σρ_aligned) serves as a unifying metric of structural rigidity across these regimes. Across seven models, only one configuration exhibits a predictive signal prior to commitment; all others show silent failure, late detection, inverted dynamics, or flat geometry. We further demonstrate that factual hallucination produces no predictive signal across 72 test conditions, consistent with spurious attractor settling in the absence of a trained world-model constraint. These results establish that rule violation and hallucination are distinct failure modes with different detection requirements. Internal geometry monitoring is effective only where resistance exists; detection of factual confabulation requires external verification mechanisms. This work provides a measurable framework for inference-layer governability and introduces a taxonomy for evaluating deployment risk in autonomous AI systems.
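The trajectory-tension statistic ρ = ||a|| / ||v|| can be computed from any per-token hidden-state path by finite differences, with v the first difference and a the second. A sketch under that reading (the paper does not publish code; the trajectories below are synthetic):

```python
import numpy as np

def trajectory_tension(hidden_states):
    """Per-step tension rho = ||a|| / ||v|| for a (T, d) hidden-state path,
    with v the first difference and a the second difference."""
    h = np.asarray(hidden_states, dtype=float)
    v = np.diff(h, axis=0)   # velocity, shape (T-1, d)
    a = np.diff(v, axis=0)   # acceleration, shape (T-2, d)
    # Align each acceleration a[i] with the velocity v[i+1] it follows
    return np.linalg.norm(a, axis=1) / np.linalg.norm(v[1:], axis=1)

# Straight-line drift: no turning, so tension is zero at every step
straight = np.outer(np.arange(5), np.ones(3))
print(trajectory_tension(straight))  # [0. 0. 0.]

# A sharp 90-degree turn in 2D yields high tension at the turn
turn = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(trajectory_tension(turn))      # ≈ [1.414]
```

Energy asymmetry in the abstract is then just the ratio of summed tensions over misaligned versus aligned continuations.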
[473] Explainable Model Routing for Agentic Workflows
Mika Okamoto, Ansel Kaplan Erol, Mark Riedl
Main category: cs.AI
TL;DR: Topaz is an interpretable routing framework for agentic workflows that makes model selection decisions transparent by combining skill-based profiling, traceable routing algorithms, and natural language explanations.
Details
Motivation: Current agentic routing systems focus only on performance optimization without recording the trade-offs between model capability and cost, making it impossible to distinguish between intelligent efficiency and budget-driven failures.
Method: Three-component framework: (1) skill-based profiling synthesizes performance across benchmarks into granular capability profiles, (2) fully traceable routing algorithms use budget-based and multi-objective optimization with clear traces of skill-cost trade-offs, (3) developer-facing explanations translate traces into natural language.
Result: Topaz enables interpretable routing decisions, allowing users to understand, trust, and meaningfully steer routed agentic systems by making the logic behind model selection transparent.
Conclusion: By introducing formal auditability to agentic routing, Topaz addresses the critical gap in understanding routing decisions, enabling developers to audit system logic and iteratively tune cost-quality trade-offs.
Abstract: Modern agentic workflows decompose complex tasks into specialized subtasks and route them to diverse models to minimize cost without sacrificing quality. However, current routing architectures focus exclusively on performance optimization, leaving underlying trade-offs between model capability and cost unrecorded. Without clear rationale, developers cannot distinguish between intelligent efficiency – using specialized models for appropriate tasks – and latent failures caused by budget-driven model selection. We present Topaz, a framework that introduces formal auditability to agentic routing. Topaz replaces silent model assignments with an inherently interpretable router that incorporates three components: (i) skill-based profiling that synthesizes performance across diverse benchmarks into granular capability profiles (ii) fully traceable routing algorithms that utilize budget-based and multi-objective optimization to produce clear traces of how skill-match scores were weighed against costs, and (iii) developer-facing explanations that translate these traces into natural language, allowing users to audit system logic and iteratively tune the cost-quality tradeoff. By making routing decisions interpretable, Topaz enables users to understand, trust, and meaningfully steer routed agentic systems.
[474] Automated Analysis of Global AI Safety Initiatives: A Taxonomy-Driven LLM Approach
Takayuki Semitsu, Naoto Kiribuchi, Kengo Zenitani
Main category: cs.AI
TL;DR: Automated framework for comparing AI safety policy documents using LLMs to extract and map activities under shared taxonomy, with similarity scoring and visualization.
Details
Motivation: To enable systematic comparison of AI safety policy documents by developing an automated crosswalk framework that can extract and map relevant activities under a shared taxonomy, addressing the need for consistent policy analysis.
Method: Uses Activity Map on AI Safety taxonomy as fixed aspects, extracts and maps relevant activities from document pairs, generates summaries, comparisons, and similarity scores using five different LLMs, and visualizes results with heatmaps.
Result: Model choice significantly affects crosswalk outcomes, some document pairs show high disagreements across models, human experts show high inter-annotator agreement, but model scores differ from human judgments.
Conclusion: The framework supports comparative inspection of policy documents but highlights substantial model dependency in LLM-based crosswalk analysis, with differences between automated and human evaluations.
Abstract: We present an automated crosswalk framework that compares an AI safety policy document pair under a shared taxonomy of activities. Using the activity categories defined in Activity Map on AI Safety as fixed aspects, the system extracts and maps relevant activities, then produces for each aspect a short summary for each document, a brief comparison, and a similarity score. We assess the stability and validity of LLM-based crosswalk analysis across public policy documents. Using five large language models, we perform crosswalks on ten publicly available documents and visualize mean similarity scores with a heatmap. The results show that model choice substantially affects the crosswalk outcomes, and that some document pairs yield high disagreements across models. A human evaluation by three experts on two document pairs shows high inter-annotator agreement, while model scores still differ from human judgments. These findings support comparative inspection of policy documents.
[475] PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
Rajat M. Barot, Arjun S. Borkhatariya
Main category: cs.AI
TL;DR: PolySwarm is a multi-agent LLM framework for real-time prediction market trading and latency arbitrage on decentralized platforms like Polymarket, using 50 diverse LLM personas with confidence-weighted Bayesian aggregation and risk-controlled execution.
Details
Motivation: The paper addresses the need for sophisticated AI systems in decentralized prediction markets, aiming to improve trading performance through swarm intelligence, better probability calibration, and exploiting market inefficiencies.
Method: Uses 50 diverse LLM personas for concurrent market evaluation, confidence-weighted Bayesian combination of swarm consensus with market probabilities, quarter-Kelly position sizing, information-theoretic analysis (KL/JS divergence) for inefficiency detection, and latency arbitrage exploiting stale prices.
Result: Swarm aggregation consistently outperforms single-model baselines in probability calibration on Polymarket prediction tasks, as measured by Brier scores, calibration analysis, and log-loss metrics benchmarked against human superforecasters.
Conclusion: PolySwarm demonstrates the effectiveness of multi-agent LLM frameworks for prediction market trading, while highlighting challenges like hallucination, computational costs, regulatory issues, and feedback-loop risks that need future research.
Abstract: This paper presents PolySwarm, a novel multi-agent large language model (LLM) framework designed for real-time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket. PolySwarm deploys a swarm of 50 diverse LLM personas that concurrently evaluate binary outcome markets, aggregating individual probability estimates through confidence-weighted Bayesian combination of swarm consensus with market-implied probabilities, and applying quarter-Kelly position sizing for risk-controlled execution. The system incorporates an information-theoretic market analysis engine using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to detect cross-market inefficiencies and negation pair mispricings. A latency arbitrage module exploits stale Polymarket prices by deriving CEX-implied probabilities from a log-normal pricing model and executing trades within the human reaction-time window. We provide a full architectural description, implementation details, and evaluation methodology using Brier scores, calibration analysis, and log-loss metrics benchmarked against human superforecaster performance. We further discuss open challenges including hallucination in agent pools, computational cost at scale, regulatory exposure, and feedback-loop risk, and outline five priority directions for future research. Experimental results demonstrate that swarm aggregation consistently outperforms single-model baselines in probability calibration on Polymarket prediction tasks.
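The two quantitative steps in that pipeline are standard enough to sketch. The snippet below is an illustrative reconstruction, not PolySwarm's released code: the log-odds blending rule and the `market_weight` parameter are assumptions for illustration, while the (p - price)/(1 - price) formula is the textbook Kelly result for a binary share that pays 1.

```python
import math

def combine_swarm(estimates, confidences, market_prob, market_weight=0.5):
    """Confidence-weighted combination of agent probability estimates with
    the market-implied probability, done in log-odds space (illustrative
    aggregation rule; the paper's exact formula is not reproduced here)."""
    logit = lambda p: math.log(p / (1 - p))
    total = sum(confidences)
    swarm_logit = sum(c * logit(p) for p, c in zip(estimates, confidences)) / total
    blended = (1 - market_weight) * swarm_logit + market_weight * logit(market_prob)
    return 1 / (1 + math.exp(-blended))

def quarter_kelly(p, market_price):
    """Quarter-Kelly fraction of bankroll for a binary YES share bought at
    `market_price` that pays 1 if the event occurs; full Kelly for this
    payoff is (p - price) / (1 - price)."""
    edge = p - market_price
    if edge <= 0:
        return 0.0  # no positive-edge bet, stay flat
    return 0.25 * edge / (1 - market_price)
```

For example, an aggregated probability of 0.6 against a market price of 0.5 gives a full-Kelly stake of 20% of bankroll, so quarter-Kelly stakes 5%.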
[476] Towards the AI Historian: Agentic Information Extraction from Primary Sources
Lorenz Hufe, Niclas Griesshaber, Gavin Greif, Sebastian Oliver Eck, Philip Torr
Main category: cs.AI
TL;DR: Chronos is an AI historian system that helps historians convert image scans of primary sources into data through natural-language interactions, allowing flexible workflow adaptation for heterogeneous historical documents.
Details
Motivation: AI adoption in historical research remains limited due to lack of solutions designed specifically for historians. There's a need for tools that can handle heterogeneous historical source corpora and allow historians to adapt workflows rather than imposing fixed extraction pipelines.
Method: Developed the first module of Chronos that enables historians to convert image scans of primary sources into data through natural-language interactions. The system allows historians to adapt workflows for different source types, evaluate AI model performance on specific tasks, and iteratively refine workflows through natural-language interaction with the Chronos agent.
Result: Created an open-source module ready for historical researchers to use on their own sources. The system provides a flexible, interactive approach to document analysis rather than a fixed extraction pipeline.
Conclusion: Chronos represents a step toward making AI more accessible and useful for historical research by providing tools specifically designed for historians’ needs, particularly in handling heterogeneous historical documents through natural-language interaction.
Abstract: AI is supporting, accelerating, and automating scientific discovery across a diverse set of fields. However, AI adoption in historical research remains limited due to the lack of solutions designed for historians. In this technical progress report, we introduce the first module of Chronos, an AI Historian under development. This module enables historians to convert image scans of primary sources into data through natural-language interactions. Rather than imposing a fixed extraction pipeline powered by a vision-language model (VLM), it allows historians to adapt workflows for heterogeneous source corpora, evaluate the performance of AI models on specific tasks, and iteratively refine workflows through natural-language interaction with the Chronos agent. The module is open-source and ready to be used by historical researchers on their own sources.
[477] When Do Hallucinations Arise? A Graph Perspective on the Evolution of Path Reuse and Path Compression
Xinnan Dai, Kai Yang, Cheng Luo, Shenglai Zeng, Kai Guo, Jiliang Tang
Main category: cs.AI
TL;DR: LLM reasoning hallucinations modeled as graph search failures from path reuse (memorized knowledge overriding context) and path compression (multi-step paths collapsing into shortcuts).
Details
Motivation: To understand the mechanisms behind reasoning hallucinations in LLMs - fluent but unsupported conclusions that violate context or factual knowledge - which remain poorly understood despite being widely observed.
Method: Model next-token prediction as graph search over an underlying graph where entities are nodes and learned transitions are edges. Contextual reasoning is constrained search over sampled subgraph (intrinsic), while context-free queries use memorized structures (extrinsic).
Result: Identified two fundamental mechanisms: Path Reuse (memorized knowledge overrides contextual constraints during early training) and Path Compression (frequently traversed multi-step paths collapse into shortcut edges in later training).
Conclusion: These mechanisms provide a unified explanation for reasoning hallucinations in LLMs and connect to well-known behaviors observed in downstream applications.
Abstract: Reasoning hallucinations in large language models (LLMs) often appear as fluent yet unsupported conclusions that violate either the given context or underlying factual knowledge. Although such failures are widely observed, the mechanisms by which decoder-only Transformers produce them remain poorly understood. We model next-token prediction as a graph search process over an underlying graph, where entities correspond to nodes and learned transitions form edges. From this perspective, contextual reasoning is a constrained search over a sampled subgraph (intrinsic reasoning), while context-free queries rely on memorized structures in the underlying graph (extrinsic reasoning). We show that reasoning hallucinations arise from two fundamental mechanisms: Path Reuse, where memorized knowledge overrides contextual constraints during early training, and Path Compression, where frequently traversed multi-step paths collapse into shortcut edges in later training. Together, these mechanisms provide a unified explanation for reasoning hallucinations in LLMs and connect them to well-known behaviors observed in downstream applications.
[478] When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling
Yuanhang Li
Main category: cs.AI
TL;DR: Adaptive reward design in DRL for satellite scheduling reveals a switching-stability dilemma: static weights outperform dynamic ones due to PPO’s need for quasistationary rewards. Systematic probing uncovers counterintuitive reward term leverage, and MLP-based adaptation outperforms LLM-based approaches which suffer from weight oscillation.
Details
Motivation: The paper investigates whether regime-aware adaptive reward weights in deep reinforcement learning for multi-beam LEO satellite scheduling can outperform static reward weights, motivated by the intuition that dynamic adaptation should be beneficial for varying traffic regimes.
Method: The authors systematically test adaptive vs. static reward weights, introduce a single-variable causal probing method to perturb each reward term by +/-20% and measure PPO response, and evaluate four MDP architect variants: fixed weights, rule-based adaptation, learned MLP adaptation, and fine-tuned LLM adaptation across known and novel traffic regimes.
Result: Near-constant reward weights (342.1 Mbps) outperform carefully-tuned dynamic weights (103.3+/-96.8 Mbps) due to PPO’s requirement for quasistationary reward signals. Causal probing reveals counterintuitive leverage where +20% increase in switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes. MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while fine-tuned LLM collapses to 45.3+/-43.0 Mbps due to weight oscillation.
Conclusion: The study reveals a switching-stability dilemma in DRL reward design and provides an empirically-grounded roadmap for LLM-DRL integration, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice, with output consistency being the binding constraint rather than domain knowledge.
Abstract: Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully-tuned dynamic weights (103.3+/-96.8 Mbps) because PPO requires a quasistationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by +/-20% and measures PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes, findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, fine-tuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3+/-43.0 Mbps due to weight oscillation rather than lack of domain knowledge: output consistency, not knowledge, is the binding constraint. Our findings provide an empirically-grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice.
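The single-variable probing protocol is simple to state in code. The sketch below uses a stub `evaluate` function in place of the paper's actual pipeline (training PPO for 50k steps and measuring throughput); the function name and dictionary interface are illustrative assumptions.

```python
def probe_reward_terms(weights, evaluate, delta=0.20):
    """Single-variable causal probing: perturb each reward weight by
    +/-20% in isolation and record the change in measured performance
    relative to the unperturbed baseline. `evaluate(weights)` stands in
    for the expensive train-and-measure step (hypothetical stub)."""
    baseline = evaluate(weights)
    responses = {}
    for name in weights:
        for sign in (+1, -1):
            probed = dict(weights)            # copy so only one term moves
            probed[name] *= (1 + sign * delta)
            responses[(name, sign * delta)] = evaluate(probed) - baseline
    return responses
```

With a real training loop behind `evaluate`, each entry of `responses` is the leverage of one reward term, which is exactly the quantity the paper reports (e.g. +157 Mbps from a +20% switching-penalty perturbation).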
[479] Personality Requires Struggle: Three Regimes of the Baldwin Effect in Neuroevolved Chess Agents
Diego Armando Resendez Prado
Main category: cs.AI
TL;DR: Hebbian plasticity in evolved chess agents initially reduces but later expands behavioral diversity across evolutionary time, creating structured behavioral divergence with distinct playing styles.
Details
Motivation: To test whether lifetime learning (plasticity) expands or collapses behavioral diversity over evolutionary time, challenging prior theory that plasticity reduces variance by buffering against environmental noise.
Method: Used chess agents with eight NEAT-evolved neural modules, Hebbian within-game plasticity, and a desirability-domain signal chain with imagination. Compared Hebbian ON vs OFF conditions across 10 seeds each, tracking cross-seed variance over generations.
Result: Variance crossover: Hebbian ON starts with lower variance than OFF, then surpasses it at generation 34. Evolved agents show 62% move disagreement, distinct opening repertoires, piece preferences, and game lengths. Three regimes identified: exploration (Hebbian ON), lottery (Hebbian OFF), and transparent (self-play).
Conclusion: Plasticity’s effect on behavioral variance reverses over evolutionary time, initially compressing then expanding diversity through imagination feedback loops. Self-play systems may systematically suppress behavioral diversity by eliminating heterogeneity needed for personality emergence.
Abstract: Can lifetime learning expand behavioral diversity over evolutionary time, rather than collapsing it? Prior theory predicts that plasticity reduces variance by buffering organisms against environmental noise. We test this in a competitive domain: chess agents with eight NEAT-evolved neural modules, Hebbian within-game plasticity, and a desirability-domain signal chain with imagination. Across 10 seeds per Hebbian condition, a variance crossover emerges: Hebbian ON starts with lower cross-seed variance than OFF, then surpasses it at generation 34. The crossover trend is monotonic (ρ = 0.91, p < 10^{-6}): plasticity’s effect on behavioral variance reverses over evolutionary time, initially compressing diversity (consistent with prior predictions) then expanding it as evolved Perception differences are amplified through imagination, a feedback loop that mutation alone cannot sustain.
The result is structured behavioral divergence: evolved agents select different moves on the same positions (62% disagreement), develop distinct opening repertoires, piece preferences, and game lengths. These are not different sampling policies – they are reproducible behavioral signatures (ICC > 0.8) with interpretable signal chain configurations. Three regimes appear depending on opponent type: exploration (Hebbian ON, heterogeneous opponent), lottery (Hebbian OFF, elitism lock-in), and transparent (same-model opponent, brain self-erasure). The transparent regime generates a falsifiable prediction: self-play systems may systematically suppress behavioral diversity by eliminating the heterogeneity that personality requires.
Keywords: Baldwin Effect, neuroevolution, NEAT, Hebbian learning, chess, cognitive architecture, personality emergence, imagination
[480] Selective Forgetting for Large Reasoning Models
Tuan Le, Wei Qian, Mengdi Huai
Main category: cs.AI
TL;DR: A novel unlearning framework for Large Reasoning Models that selectively removes sensitive reasoning components while preserving general reasoning capabilities by analyzing chain-of-thought traces and replacing forget-relevant segments with benign placeholders.
Details
Motivation: Large Reasoning Models generate structured chains of thought before final answers, making them vulnerable to knowledge leakage and memorization of sensitive information (copyrighted/private content). Existing unlearning methods primarily target final answers and may degrade overall reasoning ability after forgetting, while directly applying unlearning to entire CoTs could degrade general reasoning capabilities.
Method: Proposes a novel LRM unlearning framework that leverages multiple LLMs with retrieval-augmented generation (RAG) to analyze CoT traces, identify forget-relevant segments, and replace them with benign placeholders that maintain logical structure. Introduces a feature replacement unlearning loss that simultaneously suppresses probability of generating forgotten content while reinforcing structurally valid replacements.
Result: Extensive experiments on both synthetic and medical datasets verify the desired properties of the proposed method, achieving precise unlearning of targeted knowledge while preserving the integrity of general reasoning capabilities.
Conclusion: The proposed framework effectively addresses the challenge of LRM unlearning by selectively removing sensitive reasoning components while maintaining general reasoning capabilities, offering a solution to ethical and legal concerns about memorization of sensitive information in training data.
Abstract: Large Reasoning Models (LRMs) generate structured chains of thought (CoTs) before producing final answers, making them especially vulnerable to knowledge leakage through intermediate reasoning steps. Yet, the memorization of sensitive information in the training data such as copyrighted and private content has led to ethical and legal concerns. To address these issues, selective forgetting (also known as machine unlearning) has emerged as a potential remedy for LRMs. However, existing unlearning methods primarily target final answers and may degrade the overall reasoning ability of LRMs after forgetting. Additionally, directly applying unlearning on the entire CoTs could degrade the general reasoning capabilities. The key challenge for LRM unlearning lies in achieving precise unlearning of targeted knowledge while preserving the integrity of general reasoning capabilities. To bridge this gap, we in this paper propose a novel LRM unlearning framework that selectively removes sensitive reasoning components while preserving general reasoning capabilities. Our approach leverages multiple LLMs with retrieval-augmented generation (RAG) to analyze CoT traces, identify forget-relevant segments, and replace them with benign placeholders that maintain logical structure. We also introduce a new feature replacement unlearning loss for LRMs, which can simultaneously suppress the probability of generating forgotten content while reinforcing structurally valid replacements. Extensive experiments on both synthetic and medical datasets verify the desired properties of our proposed method.
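A minimal sketch of what a "suppress the forgotten token, reinforce the replacement" objective could look like; the function name, the `alpha` weighting, and the exact form of the two terms are illustrative assumptions, not the paper's published loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def feature_replacement_loss(logits, forget_ids, replace_ids, alpha=1.0):
    """Illustrative feature-replacement unlearning objective: at each of T
    positions, push down the probability of the forget-relevant token
    (suppression term) while pushing up the benign placeholder token
    (cross-entropy reinforcement term). logits has shape (T, V)."""
    probs = softmax(logits)
    t = np.arange(len(forget_ids))
    suppress = -np.log(1.0 - probs[t, forget_ids] + 1e-12).mean()
    reinforce = -np.log(probs[t, replace_ids] + 1e-12).mean()
    return reinforce + alpha * suppress
```

The loss is minimized when the model assigns near-zero mass to the forgotten tokens and near-one mass to the replacements, which is the behavior the framework's benign-placeholder substitution is meant to induce.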
[481] Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory
Albert Sadowski, Jarosław A. Chudziak
Main category: cs.AI
TL;DR: Rashomon Memory is an AI memory architecture where parallel goal-conditioned agents encode experiences from different perspectives, maintain separate ontologies/knowledge graphs, and use argumentation semantics for retrieval, allowing multiple conflicting interpretations of the same events.
Details
Motivation: Current memory architectures assume a single correct encoding or support multiple views over unified storage, but AI agents operating over extended time horizons accumulate experiences serving multiple concurrent goals and often need to maintain conflicting interpretations of the same events.
Method: Parallel goal-conditioned agents encode experiences according to their priorities, each maintaining its own ontology and knowledge graph. At retrieval, perspectives propose interpretations, critique each other using asymmetric domain knowledge, and Dung’s argumentation semantics determines which proposals survive.
Result: Proof-of-concept shows retrieval modes (selection, composition, conflict surfacing) emerge from attack graph topology. Conflict surfacing mode lets system report genuine disagreement rather than forcing resolution, allowing decision-makers to see underlying interpretive conflict directly.
Conclusion: Rashomon Memory enables AI systems to handle multiple conflicting interpretations of experiences through argumentation-based retrieval, with the attack graph serving as an explanation mechanism showing which interpretations were selected and why alternatives were rejected.
Abstract: AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain conflicting interpretations of the same events. A concession during a client negotiation encodes as a "trust-building investment" for one strategic goal and a "contractual liability" for another. Current memory architectures assume a single correct encoding, or at best support multiple views over unified storage. We propose Rashomon Memory: an architecture where parallel goal-conditioned agents encode experiences according to their priorities and negotiate at query time through argumentation. Each perspective maintains its own ontology and knowledge graph. At retrieval, perspectives propose interpretations, critique each other’s proposals using asymmetric domain knowledge, and Dung’s argumentation semantics determines which proposals survive. The resulting attack graph is itself an explanation: it records which interpretation was selected, which alternatives were considered, and on what grounds they were rejected. We present a proof-of-concept showing that retrieval modes (selection, composition, conflict surfacing) emerge from attack graph topology, and that the conflict surfacing mode, where the system reports genuine disagreement rather than forcing resolution, lets decision-makers see the underlying interpretive conflict directly.
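Dung's semantics come in several flavors, and the abstract does not say which one Rashomon Memory instantiates; the grounded extension (the most skeptical choice) is the easiest to sketch, as the least fixed point of the characteristic function over the attack graph.

```python
def grounded_extension(arguments, attacks):
    """Grounded semantics for an abstract argumentation framework:
    iterate F(S) = {a | every attacker of a is itself attacked by some
    member of S} from the empty set up to its least fixed point.
    `attacks` is a list of (attacker, target) pairs."""
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)
    extension = set()
    while True:
        attacked_by_ext = {dst for src, dst in attacks if src in extension}
        nxt = {a for a in arguments if attackers[a] <= attacked_by_ext}
        if nxt == extension:       # fixed point reached
            return extension
        extension = nxt
```

For a chain a → b → c, the surviving set is {a, c} (a is unattacked and defends c); for a mutual-attack cycle a ↔ b, the grounded extension is empty, which corresponds to the conflict-surfacing mode where the system reports genuine disagreement instead of picking a winner.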
[482] Soft Tournament Equilibrium
Saad Alqithami
Main category: cs.AI
TL;DR: A differentiable framework (STE) for evaluating AI agents using tournament theory concepts instead of linear rankings, addressing cyclic non-transitive interactions.
Details
Motivation: Traditional ranking methods fail for evaluating general-purpose AI agents due to non-transitive interactions (rock-paper-scissors scenarios). Linear rankings are misleading and unstable when agent A beats B, B beats C, but C beats A.
Method: Soft Tournament Equilibrium (STE) learns probabilistic tournament models from pairwise comparison data, uses differentiable operators for soft reachability and soft covering to compute continuous analogues of tournament solutions (Top Cycle and Uncovered Set).
Result: STE produces set-valued core agents with calibrated membership scores, providing nuanced assessment. Theoretical analysis shows consistency with classical solutions in zero-temperature limit, establishing Condorcet-inclusion properties, stability, and sample complexity.
Conclusion: The paper proposes shifting from unstable linear rankings to stable set-valued equilibria for agent evaluation, providing a more robust theoretical foundation using tournament theory concepts.
Abstract: The evaluation of general-purpose artificial agents, particularly those based on large language models, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs novel, differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a calibrated membership score, providing a nuanced and robust assessment of agent capabilities. We develop the theoretical foundation for STE, proving its consistency with classical solutions in the zero-temperature limit (which establishes its Condorcet-inclusion properties) and analyzing its stability and sample complexity. We specify an experimental protocol for validating STE on both synthetic and real-world benchmarks. This work aims to provide a complete, standalone treatise that re-centers general-agent evaluation on a more appropriate and robust theoretical foundation, moving from unstable rankings to stable, set-valued equilibria.
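To make "soft reachability" concrete, the sketch below softens the beat relation with a sigmoid at temperature tau and accumulates multi-step reachability with a probabilistic OR over paths. The specific operators here are illustrative assumptions, not the paper's definitions; they share only the intended limit behavior (hard reachability as tau → 0).

```python
import numpy as np

def soft_reachability(P, tau=0.05, steps=None):
    """Illustrative soft reachability operator. P[i, j] is the learned
    probability that agent i beats agent j; a temperature tau turns the
    beat relation into soft edges, and multi-step reachability is
    accumulated with a probabilistic OR (1 - prod(1 - p_path))."""
    n = len(P)
    beat = 1.0 / (1.0 + np.exp(-(P - 0.5) / tau))  # soft dominance edges
    np.fill_diagonal(beat, 0.0)
    reach = beat.copy()
    for _ in range(steps or n):
        # extend every known soft path by one more soft edge, soft-OR over k
        step = 1.0 - np.prod(1.0 - reach[:, :, None] * beat[None, :, :], axis=1)
        reach = np.maximum(reach, step)
    return reach
```

On a rock-paper-scissors cycle every agent softly reaches every other via the intermediate beat, which is exactly why the Top Cycle keeps all three in the core rather than forcing a linear order.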
[483] Entropy and Attention Dynamics in Small Language Models: A Trace-Level Structural Analysis on the TruthfulQA Benchmark
Adeyemi Adeseye, Aisvarya Adeseye, Hannu Tenhunen, Jouni Isoaho
Main category: cs.AI
TL;DR: Trace-level analysis of entropy and attention dynamics in small language models (1B-1.7B parameters) reveals three distinct patterns related to truthfulness, with implications for designing more reliable edge AI models.
Details
Motivation: Small language models deployed in edge devices often make confident mispredictions with unstable outputs, but current evaluation methods focus only on final accuracy without examining internal behaviors like entropy evolution, attention distribution, and hidden representations that affect uncertainty and misinformation.
Method: Analyzed four SLMs (1B-1.7B parameters) using TruthfulQA dataset with token-level output entropy, attention entropy, head dispersion, and hidden-state representation analysis to trace internal dynamics during decoding.
Result: Identified three model classifications: Deterministic models (DeepSeek-1.5B, LLaMA-1B) with decreasing entropy; Exploratory models (Gemma-1B) with increasing entropy; and Balanced models (Qwen-1.7B) with moderate stable entropy. Each group shows distinct hidden-state movement and attention dispersion patterns.
Conclusion: Truthfulness in SLMs emerges from structured entropy and attention dynamics, suggesting that monitoring and optimizing these internal uncertainty patterns can guide design of more reliable, hallucination-aware edge SLMs for specific applications.
Abstract: Small language models (SLMs) have been increasingly deployed in edge devices and other resource-constrained settings. However, these models make confident mispredictions and produce unstable output, making them risky for factual and decision-critical tasks. Current evaluation methodology relies on final accuracy or hallucination rates without explaining how internal model behavior affects outputs. Specifically, how entropy evolves during decoding, how attention is distributed across layers, and how hidden representations contribute to uncertainty, logical inconsistencies, and misinformation propagation are often overlooked. Consequently, this study introduces a trace-level analysis of entropy and attention dynamics in SLMs evaluated with the TruthfulQA dataset. Four models in the 1B-1.7B parameter range were examined via token-level output entropy, attention entropy, head dispersion, and hidden-state representation. The results reveal three model classes by entropy pattern: deterministic models (DeepSeek-1.5B and LLaMA-1B), whose output entropy decreases over time; exploratory models (Gemma-1B), whose entropy increases; and balanced models (Qwen-1.7B), which maintain moderate, stable entropy. Each group also shows distinctly different hidden-state movement and attention dispersion patterns. The analysis demonstrates that truthfulness in SLMs emerges from structured entropy and attention dynamics. Monitoring and optimizing these internal uncertainty patterns can guide the design of more reliable, hallucination-aware, application-specific edge SLMs.
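The core measurement, token-level output entropy, is straightforward to reproduce from decoding logits; attention entropy is the same computation applied to each head's attention rows instead of the vocabulary distribution.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at each
    decoding step, given logits of shape (T, V). A falling trace signals
    deterministic decoding; a rising trace signals exploratory decoding."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)
```

Uniform logits over a vocabulary of size V give the maximum entropy log(V), while a sharply peaked distribution gives entropy near zero, which is the axis along which the three model classes above separate.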
[484] Optimizing Service Operations via LLM-Powered Multi-Agent Simulation
Yanyuan Wang, Xiaowei Zhang
Main category: cs.AI
TL;DR: LLM-powered multi-agent simulation framework for optimizing service operations by modeling human behavior through interacting LLM agents, with applications in supply chain and contest design.
Details
Motivation: Service system performance depends on complex human behavior responses to design choices, which are difficult to model traditionally. The paper aims to create a framework that can simulate and optimize service operations by leveraging LLMs to model human decision-making.
Method: Introduces LLM-MAS framework that treats service optimization as stochastic optimization with decision-dependent uncertainty. Design choices are embedded in prompts that shape LLM agent interactions. Key numerical information is embedded in prompts and extracted from LLM-generated text, modeling uncertainty as a controlled Markov chain. Develops an on-trajectory learning algorithm that constructs zeroth-order gradient estimates and updates design parameters simultaneously during simulation.
Result: Outperforms benchmarks including blackbox optimization and using LLMs as numerical solvers or role-playing system designers in sustainable supply chain applications. Case study on optimal contest design shows LLM-MAS serves as both cost-effective evaluator of known designs and exploratory tool uncovering strong designs overlooked by traditional approaches.
Conclusion: LLM-MAS provides an effective framework for optimizing service operations by simulating complex human behavior through LLM-powered agents, offering advantages over traditional optimization methods and demonstrating practical value in real-world applications.
Abstract: Service system performance depends on how participants respond to design choices, but modeling these responses is hard due to the complexity of human behavior. We introduce an LLM-powered multi-agent simulation (LLM-MAS) framework for optimizing service operations. We pose the problem as stochastic optimization with decision-dependent uncertainty: design choices are embedded in prompts and shape the distribution of outcomes from interacting LLM-powered agents. By embedding key numerical information in prompts and extracting it from LLM-generated text, we model this uncertainty as a controlled Markov chain. We develop an on-trajectory learning algorithm that, on a single simulation run, simultaneously constructs zeroth-order gradient estimates and updates design parameters to optimize steady-state performance. We also incorporate variance reduction techniques. In a sustainable supply chain application, our method outperforms benchmarks, including blackbox optimization and using LLMs as numerical solvers or as role-playing system designers. A case study on optimal contest design with real behavioral data shows that LLM-MAS is both a cost-effective evaluator of known designs and an exploratory tool that can uncover strong designs overlooked by traditional approaches.
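The zeroth-order gradient estimate at the heart of such algorithms resembles the classic SPSA estimator. The sketch below is a two-measurement variant for intuition only; the paper's on-trajectory method builds its estimate from a single simulation run, and `measure` here is a hypothetical stand-in for observing steady-state performance.

```python
import numpy as np

def spsa_step(theta, measure, lr=0.01, c=0.1, rng=None):
    """One SPSA-style zeroth-order ascent step: perturb every design
    parameter along a random +/-1 direction, estimate the gradient from
    two performance measurements, and move uphill. `measure(theta)` is a
    stub for evaluating the simulated service system."""
    if rng is None:
        rng = np.random.default_rng(0)
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    y_plus = measure(theta + c * delta)
    y_minus = measure(theta - c * delta)
    g_hat = (y_plus - y_minus) / (2 * c) * delta  # delta_i = +/-1, so * == /
    return theta + lr * g_hat
```

Because the estimator needs only performance evaluations, never an analytic gradient, it applies even when outcomes are produced by LLM agents reacting to prompt-embedded design choices.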
[485] A Multimodal Foundation Model of Spatial Transcriptomics and Histology for Biological Discovery and Clinical Prediction
Jinxi Xiang, Siyu Hou, Yuchen Li, Ryan Quinton, Xiaoming Zhang, Feyisope Eweje, Xiangde Luo, Yijiang Chen, Zhe Li, Colin Bergstrom, Ted Kim, Sierra Willens, Francesca Maria Olguin, Matthew Abikenari, Andrew Heider, Sanjeeth Rajaram, Joel Neal, Maximilian Diehn, Xiang Zhou, Ruijiang Li
Main category: cs.AI
TL;DR: STORM is a foundation model that integrates spatial transcriptomics and histology images to bridge molecular and morphological data for improved tissue analysis and clinical predictions.
Details
Motivation: Spatial transcriptomics provides molecular context but is expensive and low-throughput, while H&E staining offers rich morphology but lacks molecular resolution. There's a need to bridge these modalities for scalable, clinically relevant tissue analysis.
Method: Developed a hierarchical architecture foundation model trained on 1.2 million spatially resolved transcriptomic profiles with matched histology across 18 organs. The model integrates morphological features, gene expression, and spatial context to create robust molecular-morphological representations.
Result: STORM enhances spatial domain discovery, produces biologically coherent tissue maps, and outperforms existing methods in predicting spatial gene expression from H&E images across 11 tumor types. It’s platform-agnostic and improves immunotherapy response prediction and prognostication in 7,245 patients across 23 cohorts.
Conclusion: STORM provides a scalable framework for spatially informed discovery and clinical precision medicine by effectively bridging imaging and omics data through integrated molecular-morphological representations.
Abstract: Spatial transcriptomics (ST) enables gene expression mapping within anatomical context but remains costly and low-throughput. Hematoxylin and eosin (H&E) staining offers rich morphology yet lacks molecular resolution. We present STORM (Spatial Transcriptomics and histOlogy Representation Model), a foundation model trained on 1.2 million spatially resolved transcriptomic profiles with matched histology across 18 organs. Using a hierarchical architecture integrating morphological features, gene expression, and spatial context, STORM bridges imaging and omics through robust molecular–morphological representations. STORM enhances spatial domain discovery, producing biologically coherent tissue maps, and outperforms existing methods in predicting spatial gene expression from H&E images across 11 tumor types. The model is platform-agnostic, performing consistently across Visium, Xenium, Visium HD, and CosMx. Applied to 23 independent cohorts comprising 7,245 patients, STORM significantly improves immunotherapy response prediction and prognostication over established biomarkers, providing a scalable framework for spatially informed discovery and clinical precision medicine.
[486] Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
Likai Peng, Shihui Feng
Main category: cs.AI
TL;DR: This paper proposes and compares two multi-agent VLM frameworks for automated coding of screen recordings in collaborative learning contexts, showing they outperform single VLMs in scene and action detection tasks.
Details
Motivation: On-screen learning behavior provides valuable insights into student cognitive and collaborative processes, but manual coding of multimodal video data is labor-intensive. Vision Language Models (VLMs) offer new opportunities to automate this analysis.
Method: Two multi-agent VLM frameworks: 1) Three-agent workflow MAS that segments screen videos by scene and detects behaviors using cursor-informed VLM prompting with evidence-based verification; 2) Autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-like operations, and observation-driven self-correction.
Result: Both proposed MAS frameworks achieved viable performance, outperforming single VLMs (Claude-3.7-Sonnet, GPT-4.1, Qwen2.5-VL-72B) in scene and action detection tasks. Workflow-based agent performed best on scene detection, autonomous-decision MAS best on action detection.
Conclusion: The study demonstrates effectiveness of VLM-based Multi-agent Systems for video analysis and contributes a scalable framework for multimodal data analytics in educational contexts.
Abstract: On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students’ cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of both leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and an open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks: 1) a three-agent workflow multi-agent system (MAS) that segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification; 2) an autonomous-decision MAS inspired by ReAct that iteratively interleaves reasoning, tool-like operations (segmentation/classification/validation), and observation-driven self-correction to produce interpretable on-screen behavior labels. Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. It is worth noting that the workflow-based agent performed best on scene detection, and the autonomous-decision MAS performed best on action detection. This study demonstrates the effectiveness of VLM-based Multi-agent Systems for video analysis and contributes a scalable framework for multimodal data analytics.
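The autonomous-decision MAS described above interleaves reasoning, tool-like operations, and observation-driven self-correction. A minimal sketch of that loop, with the VLM calls replaced by stub functions and all names (`run_autonomous_agent`, the ICAP-style labels) purely illustrative:

```python
# Hypothetical sketch of a ReAct-style autonomous-decision loop: the agent
# alternates a tool-like operation (segment, classify, validate) with
# observation-driven self-correction until a behavior label is accepted.
# The VLM calls are replaced by trivial stubs for illustration.

def run_autonomous_agent(frames, max_steps=6):
    """Iteratively segment, classify, and validate until a label is accepted."""
    state = {"segments": None, "label": None}
    trace = []
    for _ in range(max_steps):
        if state["segments"] is None:
            # Tool-like operation 1: scene segmentation (stubbed).
            state["segments"] = [frames]          # treat the clip as one scene
            trace.append(("segment", len(state["segments"])))
        elif state["label"] is None:
            # Tool-like operation 2: behavior classification (stubbed).
            state["label"] = "constructive" if len(frames) > 2 else "passive"
            trace.append(("classify", state["label"]))
        else:
            # Tool-like operation 3: validation with self-correction.
            valid = state["label"] in {"interactive", "constructive",
                                       "active", "passive"}
            trace.append(("validate", valid))
            if valid:
                return state["label"], trace
            state["label"] = None                 # discard and re-classify
    return state["label"], trace
```

The trace doubles as the interpretable record of operations that the paper's evidence-based verification step would inspect.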
[487] Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms in Generative Engine Optimization
XinYu Zhao, ChengYou Li, XiangBao Meng, Kai Zhang, XiaoDong Liu
Main category: cs.AI
TL;DR: Proposes deterministic multi-agent intent routing to replace RAG-based GEO, addressing hallucinations and zero-click issues through semantic entropy modeling and agentic trust brokerage.
Details
Motivation: Current RAG-based Generative Engine Optimization suffers from probabilistic hallucinations and zero-click paradox, failing to establish sustainable commercial trust in LLM-powered systems.
Method: Introduces Semantic Entropy Drift modeling, Isomorphic Attribution Regression for black-box optimization, and Deterministic Agent Handoff protocol with Agentic Trust Brokerage ecosystem where LLMs act as intent routers.
Result: Empirical validation with EasyNote product shows near-zero hallucination rates by routing “knowledge graph mapping” intent directly to specialized proprietary agents via DAH protocol.
Conclusion: Establishes theoretical framework for next-generation GEO and deterministic human-AI collaboration ecosystem, shifting from probabilistic RAG to deterministic multi-agent routing.
Abstract: Generative Engine Optimization (GEO) is rapidly reshaping digital marketing paradigms in the era of Large Language Models (LLMs). However, current GEO strategies predominantly rely on Retrieval-Augmented Generation (RAG), which inherently suffers from probabilistic hallucinations and the “zero-click” paradox, failing to establish sustainable commercial trust. In this paper, we systematically deconstruct the probabilistic flaws of existing RAG-based GEO and propose a paradigm shift towards deterministic multi-agent intent routing. First, we mathematically formulate Semantic Entropy Drift (SED) to model the dynamic decay of confidence curves in LLMs over continuous temporal and contextual perturbations. To rigorously quantify optimization value in black-box commercial engines, we introduce the Isomorphic Attribution Regression (IAR) model, leveraging a Multi-Agent System (MAS) probe with strict human-in-the-loop physical isolation to enforce hallucination penalties. Furthermore, we architect the Deterministic Agent Handoff (DAH) protocol, conceptualizing an Agentic Trust Brokerage (ATB) ecosystem where LLMs function solely as intent routers rather than final answer generators. We empirically validate this architecture using EasyNote, an industrial AI meeting minutes product by Yishu Technology. By routing the intent of “knowledge graph mapping on an infinite canvas” directly to its specialized proprietary agent via DAH, we demonstrate the reduction of vertical task hallucination rates to near zero. This work establishes a foundational theoretical framework for next-generation GEO and paves the way for a well-ordered, deterministic human-AI collaboration ecosystem.
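The abstract does not give the closed form of Semantic Entropy Drift, but the underlying idea, confidence decaying over continuous temporal and contextual perturbations, can be sketched as a simple exponential model. This is an assumed form for illustration, not the paper's formulation:

```python
import math

# Illustrative-only model of confidence decay: a confidence score c0 decays
# with elapsed time t, with the decay accelerated by the magnitude of
# contextual perturbation. The functional form and decay_rate are assumptions.

def confidence_after_drift(c0, t, perturbation, decay_rate=0.1):
    """Confidence c0 decayed over time t under a given perturbation level."""
    return c0 * math.exp(-decay_rate * t * (1.0 + perturbation))
```

Under this toy model, confidence is unchanged at t = 0 and falls monotonically as either t or the perturbation grows, which matches the qualitative "decay of confidence curves" the abstract describes.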
[488] Memory Intelligence Agent
Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie
Main category: cs.AI
TL;DR: MIA is a Memory Intelligence Agent framework with Manager-Planner-Executor architecture that enables efficient memory evolution and test-time learning for deep research agents.
Details
Motivation: Existing deep research agents with memory systems suffer from ineffective memory evolution and increasing storage/retrieval costs when retrieving similar trajectories from memory to aid reasoning.
Method: Proposes MIA framework with: 1) Memory Manager (non-parametric memory storing compressed trajectories), 2) Planner (parametric memory agent producing search plans), 3) Executor (agent searching/analyzing guided by plans). Uses alternating reinforcement learning for Planner-Executor cooperation, test-time learning for continuous evolution, bidirectional conversion between parametric/non-parametric memories, and reflection/unsupervised judgment mechanisms.
Result: Extensive experiments across eleven benchmarks demonstrate the superiority of MIA over existing methods.
Conclusion: MIA framework effectively addresses memory evolution and efficiency problems in deep research agents through its novel architecture and learning mechanisms.
Abstract: Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate reflection and unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.
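A toy illustration of the Memory Manager component, assuming compressed storage of trajectories and keyword-overlap retrieval; the paper's actual compression scheme and retrieval mechanism are not specified in the abstract, so everything below is a stand-in:

```python
import json
import zlib

# Hypothetical sketch of a non-parametric memory that stores compressed
# search trajectories and retrieves the best match by keyword overlap with
# the incoming question. All design choices here are illustrative.

class MemoryManager:
    def __init__(self):
        self._store = []  # list of (question keywords, compressed trajectory)

    def add(self, question, trajectory):
        blob = zlib.compress(json.dumps(trajectory).encode())
        self._store.append((set(question.lower().split()), blob))

    def retrieve(self, question):
        """Return the stored trajectory whose question overlaps most."""
        words = set(question.lower().split())
        best = max(self._store, key=lambda kv: len(kv[0] & words), default=None)
        if best is None:
            return None
        return json.loads(zlib.decompress(best[1]).decode())
```

Compressing trajectories before storage reflects the paper's stated concern with growing storage and retrieval costs; a real system would presumably use learned summarization rather than byte-level compression.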
[489] TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
Xiaoyu Chen, Lu Dai, Hanqing Wang, Zhuoyu Li, Wenbin Dai, Yanzong Zheng, Zhenggang Xia, Junyong Lin, Hui Xiong
Main category: cs.AI
TL;DR: TableVision: A large-scale benchmark for table understanding with explicit spatial grounding to address perception bottlenecks in MLLMs for complex hierarchical tables.
Details
Motivation: Current MLLMs struggle with complex tables having hierarchical layouts due to a "Perception Bottleneck" where increasing task complexity leads to disproportionate growth in visual regions to process, causing perceptual overload and inaccurate spatial attention.
Method: Introduces TableVision benchmark with 6,799 high-fidelity reasoning trajectories across 13 sub-categories stratified into three cognitive levels. Uses rendering-based deterministic grounding pipeline to explicitly couple multi-step logical deductions with pixel-perfect spatial ground truths.
Result: Explicit spatial constraints significantly recover MLLMs’ reasoning potential. A two-stage decoupled framework achieves 12.3% overall accuracy improvement on the test set.
Conclusion: TableVision provides a rigorous testbed for understanding perception-logic synergy in document understanding, demonstrating that addressing spatial perception bottlenecks can enhance MLLM performance on complex tabular reasoning tasks.
Abstract: Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal “Perceptual Overload,” where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.
[490] PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training
Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao
Main category: cs.AI
TL;DR: PRAISE improves agentic search training by reusing search trajectory prefixes to create additional training data and derive intermediate rewards, addressing data inefficiency and reward sparsity in multi-turn retrieval tasks.
Details
Motivation: Current search-based RL methods for LLMs in agentic search suffer from two limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically only available at the final answer, resulting in severe reward sparsity.
Method: PRAISE extracts prefix states at different search turns from complete trajectories, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Uses a single shared model for both search policy learning and prefix answer evaluation.
Result: Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
Conclusion: PRAISE provides an effective framework for improving both data efficiency and credit assignment in agentic search training without requiring extra human annotations or separate reward models.
Abstract: In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
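The core credit-assignment idea, rewarding each search turn by the performance difference between consecutive prefixes, can be sketched as follows. This assumes each prefix's intermediate answer has already been scored; the shared evaluation model itself is stubbed out:

```python
# Sketch of PRAISE-style step-level rewards: given a score for the
# intermediate answer elicited from each trajectory prefix, reward each
# search turn by the score gain over the preceding prefix. The scoring
# function is assumed to exist upstream (the paper uses a shared model).

def step_rewards(prefix_scores):
    """Per-turn rewards as differences between consecutive prefix scores."""
    rewards = []
    prev = 0.0
    for score in prefix_scores:
        rewards.append(score - prev)
        prev = score
    return rewards
```

A turn that improves the intermediate answer receives positive reward, a turn that degrades it receives negative reward, and the rewards telescope so their sum equals the final-answer score, preserving the original sparse signal.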
[491] Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
Yulong He, Ivan Smirnov, Dmitry Fedrushkov, Sergey Kovalchuk, Ilya Revin
Main category: cs.AI
TL;DR: Proposes uncertainty-aware structured evaluation methods (Fuzzy AHP and DualJudge) for more reliable LLM assessment, outperforming conventional direct scoring.
Details
Motivation: Conventional direct scoring for LLM evaluation yields inconsistent and opaque judgments, creating a critical bottleneck in reliable model assessment.
Method: Adapts Analytic Hierarchy Process (AHP) to LLM evaluation, proposes confidence-aware Fuzzy AHP using triangular fuzzy numbers modulated by LLM confidence scores, and introduces DualJudge framework that fuses holistic direct scores with structured AHP outputs via consistency-aware weighting.
Result: Both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain scenarios. DualJudge achieves state-of-the-art performance.
Conclusion: Uncertainty-aware structured reasoning provides a principled pathway toward more reliable LLM assessment, with hybrid approaches leveraging complementary strengths of intuitive and deliberative evaluation paradigms.
Abstract: Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.
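A minimal sketch of a confidence-modulated triangular fuzzy number, assuming, per the abstract, that lower LLM confidence widens the fuzzy interval around a crisp judgment. The paper's exact modulation rule is not given, so the linear spread below is an assumption:

```python
# Illustrative confidence-aware triangular fuzzy number (TFN): a crisp score
# becomes an interval (low, mid, high) whose spread shrinks as the LLM's
# confidence approaches 1. The max_spread scale is an invented parameter.

def triangular_fuzzy(score, confidence, max_spread=2.0):
    """Return (low, mid, high) with spread proportional to (1 - confidence)."""
    spread = max_spread * (1.0 - confidence)
    return (score - spread, score, score + spread)

def defuzzify(tfn):
    """Centroid defuzzification of a TFN: (a + b + c) / 3."""
    return sum(tfn) / 3.0
```

With full confidence the TFN collapses back to the crisp score, so crisp AHP falls out as the special case of FAHP, consistent with the paper comparing both variants.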
[492] RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin
Ying Yao
Main category: cs.AI
TL;DR: Deep reinforcement learning framework using PPO agent to optimize land-use allocation in Lake Malawi Basin for maximizing ecosystem service value with spatial coherence constraints.
Details
Motivation: Address unsustainable land-use practices in ecologically sensitive regions that threaten biodiversity, water resources, and livelihoods by developing an AI-driven optimization tool for environmental planning.
Method: Uses Proximal Policy Optimization (PPO) agent on 50x50 grid at 500m resolution with action masking to transfer land-use pixels between modifiable classes. Reward function combines per-cell ecological value (using biome-specific ESV coefficients) with spatial coherence objectives (contiguity bonuses and buffer zone penalties).
Result: Agent effectively learns to increase total ESV; spatial reward shaping successfully steers allocations toward ecologically sound patterns (homogeneous land-use clustering, forest consolidation near water); framework responds meaningfully to policy parameter changes.
Conclusion: The RL framework establishes utility as a scenario-analysis tool for environmental planning, demonstrating ability to optimize land-use allocation for ecosystem service value while incorporating spatial ecological considerations.
Abstract: Unsustainable land-use practices in ecologically sensitive regions threaten biodiversity, water resources, and the livelihoods of millions. This paper presents a deep reinforcement learning (RL) framework for optimizing land-use allocation in the Lake Malawi Basin to maximize total ecosystem service value (ESV). Drawing on the benefit transfer methodology of Costanza et al., we assign biome-specific ESV coefficients – locally anchored to a Malawi wetland valuation – to nine land-cover classes derived from Sentinel-2 imagery. The RL environment models a 50x50 cell grid at 500m resolution, where a Proximal Policy Optimization (PPO) agent with action masking iteratively transfers land-use pixels between modifiable classes. The reward function combines per-cell ecological value with spatial coherence objectives: contiguity bonuses for ecologically connected land-use patches (forest, cropland, built area etc.) and buffer zone penalties for high-impact development adjacent to water bodies. We evaluate the framework across three scenarios: (i) pure ESV maximization, (ii) ESV with spatial reward shaping, and (iii) a regenerative agriculture policy scenario. Results demonstrate that the agent effectively learns to increase total ESV; that spatial reward shaping successfully steers allocations toward ecologically sound patterns, including homogeneous land-use clustering and slight forest consolidation near water bodies; and that the framework responds meaningfully to policy parameter changes, establishing its utility as a scenario-analysis tool for environmental planning.
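The reward structure, per-cell ESV plus contiguity bonuses and buffer-zone penalties, can be illustrated on a toy grid. The ESV coefficients and weights below are invented for the sketch, not the Costanza-derived values the paper calibrates:

```python
# Toy version of the paper's reward: sum per-cell ecosystem service value
# (ESV), add a bonus for adjacent same-class cells (contiguity), and subtract
# a penalty where built area directly borders water (buffer-zone violation).
# Coefficients and weights are illustrative, not the paper's.

ESV = {"forest": 3.0, "cropland": 1.0, "water": 5.0, "built": 0.1}

def grid_reward(grid, bonus=0.5, penalty=1.0):
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            cell = grid[r][c]
            total += ESV[cell]
            for dr, dc in ((0, 1), (1, 0)):       # right/down neighbors only,
                rr, cc = r + dr, c + dc           # so each pair counts once
                if 0 <= rr < rows and 0 <= cc < cols:
                    if grid[rr][cc] == cell:
                        total += bonus            # contiguity bonus
                    if {cell, grid[rr][cc]} == {"built", "water"}:
                        total -= penalty          # buffer-zone penalty
    return total
```

In the actual framework this scalar would be the per-step reward signal for the action-masked PPO agent as it transfers pixels between modifiable classes.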
[493] Affording Process Auditability with QualAnalyzer: An Atomistic LLM Analysis Tool for Qualitative Research
Max Hao Lu, Ryan Ellegood, Rony Rodriguez-Ramirez, Sophia Blumert
Main category: cs.AI
TL;DR: QualAnalyzer is a Chrome extension for Google Workspace that enables transparent LLM-assisted qualitative analysis by processing data segments independently and preserving complete audit trails of prompts and outputs.
Details
Motivation: Current LLM workflows for qualitative data analysis often obscure how analytic conclusions are produced, lacking transparency and methodological robustness. There's a need for tools that make LLM-assisted research more transparent with clear audit trails.
Method: Developed QualAnalyzer, an open-source Chrome extension for Google Workspace that performs atomistic LLM analysis by processing each data segment independently while preserving the prompt, input, and output for every unit. Tested through two case studies: holistic essay scoring and deductive thematic coding of interview transcripts.
Result: The approach creates a legible audit trail and helps researchers investigate systematic differences between LLM and human judgments. The tool supports transparent qualitative analysis workflows.
Conclusion: Process auditability is essential for making LLM-assisted qualitative research more transparent and methodologically robust. QualAnalyzer demonstrates how atomistic analysis with preserved audit trails can improve transparency in LLM-based qualitative research.
Abstract: Large language models are increasingly used for qualitative data analysis, but many workflows obscure how analytic conclusions are produced. We present QualAnalyzer, an open-source Chrome extension for Google Workspace that supports atomistic LLM analysis by processing each data segment independently and preserving the prompt, input, and output for every unit. Through two case studies – holistic essay scoring and deductive thematic coding of interview transcripts – we show that this approach creates a legible audit trail and helps researchers investigate systematic differences between LLM and human judgments. We argue that process auditability is essential for making LLM-assisted qualitative research more transparent and methodologically robust.
[494] FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
Zeyu Wang, Xiaogang Li, Peiyao Xiao, Qinhao Kong, Ben Wang, Chengliang Xu, Zichao Chen, Bing Zhao, Hu Wei
Main category: cs.AI
TL;DR: FeynmanBench: A benchmark for evaluating multimodal LLMs’ ability to reason about Feynman diagrams, testing multistep diagrammatic reasoning, conservation laws, symmetry constraints, and conversion between diagrammatic and algebraic representations.
Details
Motivation: Current MLLM benchmarks focus on local information extraction rather than global structural logic in formal scientific notations. There's a need for physics-grounded benchmarks to test AI's capacity for rigorous scientific reasoning, particularly in theoretical physics.
Method: Created FeynmanBench with automated pipeline producing diverse Feynman diagrams with verifiable topological annotations and amplitude results. Database spans electromagnetic, weak, and strong interactions of Standard Model, includes over 100 distinct types and 2000+ tasks.
Result: Experiments on state-of-the-art MLLMs reveal systematic failure modes including unstable enforcement of physical constraints and violations of global topological conditions, showing current models struggle with rigorous diagrammatic reasoning.
Conclusion: FeynmanBench provides a logically rigorous test for AI’s scientific discovery capabilities in theoretical physics, highlighting the need for benchmarks that evaluate global structural reasoning rather than just local information extraction.
Abstract: Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI’s capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.
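A toy version of the kind of constraint FeynmanBench probes is charge conservation at a diagram vertex; the particle encoding and the small charge table below are illustrative, not the benchmark's actual format:

```python
# Checking electric charge conservation at a single Feynman-diagram vertex:
# the summed charge of incoming lines must equal that of outgoing lines.
# Particle names and the charge table are illustrative (charges in units of e).

CHARGE = {"e-": -1.0, "e+": +1.0, "photon": 0.0, "u": 2.0 / 3, "u~": -2.0 / 3}

def vertex_conserves_charge(incoming, outgoing):
    """True iff charge flowing into the vertex equals charge flowing out."""
    return abs(sum(CHARGE[p] for p in incoming)
               - sum(CHARGE[p] for p in outgoing)) < 1e-9
```

The QED vertex e- -> e- + photon passes, while a charge-violating e- -> e+ + photon fails; the benchmark's "unstable enforcement of physical constraints" failure mode is precisely models getting checks like this wrong when the constraint is implicit in an image.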
[495] LLM-Agent-based Social Simulation for Attitude Diffusion
Deepak John Reji
Main category: cs.AI
TL;DR: discourse_simulator is an open-source LLM+agent-based modeling framework for simulating public attitude dynamics toward immigration in response to real-world events, using generative agents in social networks with live news integration.
Details
Motivation: Traditional agent-based models for studying social dynamics have limitations: they use fixed rule-based opinion updates, cannot generate natural language content, and don't incorporate real-world events. There's a need for more realistic simulations that can model complex belief structures and respond to current events.
Method: Combines LLMs with agent-based modeling in an open-source Python package. LLMs generate social media posts, interpret opinions, and model idea spread. Features include: generative agents in small-world network topology, multidimensional sociological belief structures, real-world event timelines, and live news retrieval system.
Result: Demonstrated by modeling the Dublin anti-immigration march on April 26, 2025, with N=100 agents over a 15-day simulation. The framework successfully simulates attitude dynamics, polarization, and belief evolution following real-world critical events.
Conclusion: discourse_simulator provides a novel approach to social science research as a theory-testing instrument rather than a prediction black box, offering a fundamentally different epistemological stance for studying social science problems like attitude dynamics and polarization.
Abstract: This paper introduces discourse_simulator, an open-source framework that combines LLMs with agent-based modelling. It offers a new way to simulate how public attitudes toward immigration change over time in response to salient events like protests, controversies, or policy debates. Large language models (LLMs) are used to generate social media posts, interpret opinions, and model how ideas spread through social networks. Unlike traditional agent-based models that rely on fixed, rule-based opinion updates and cannot generate natural language or consider current events, this approach integrates multidimensional sociological belief structures and real-world event timelines. This framework is wrapped into an open-source Python package that integrates generative agents into a small-world network topology and a live news retrieval system. discourse_sim is purpose-built as a social science research instrument specifically for studying attitude dynamics, polarisation, and belief evolution following real-world critical events. Unlike other LLM Agent Swarm frameworks, which treat the simulations as a prediction black box, discourse_sim treats it as a theory-testing instrument, which is fundamentally a different epistemological stance for studying social science problems. The paper further demonstrates the framework by modelling the Dublin anti-immigration march on April 26, 2025, with N=100 agents over a 15-day simulation. Package link: https://pypi.org/project/discourse-sim/
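The small-world topology the generative agents are embedded in can be sketched with a Watts-Strogatz-style construction using only the standard library; the function and parameter names are illustrative, and the package's own API may differ:

```python
import random

# Illustrative Watts-Strogatz-style small-world graph: start from a ring
# lattice where each node links to its k nearest neighbors, then rewire each
# edge's far endpoint to a random node with probability p. This is only the
# topology layer; the LLM agents that would sit on the nodes are omitted.

def small_world(n, k=4, p=0.1, seed=0):
    """Return a set of undirected edges (a, b) with a < b over n nodes."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            a, b = i, (i + j) % n                 # lattice edge
            if rng.random() < p:
                b = rng.randrange(n)              # rewire the far endpoint
                if b == a or (min(a, b), max(a, b)) in edges:
                    b = (i + j) % n               # fall back to lattice edge
            edges.add((min(a, b), max(a, b)))
    return edges
```

With small p the graph keeps the lattice's high clustering while the few rewired shortcuts give short average path lengths, the property that lets local posts diffuse network-wide in simulations like the Dublin case study.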
[496] CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation
Xiaojing Duan, Frederick Nwanganga, Chaoli Wang
Main category: cs.AI
TL;DR: CODE-GEN is an AI system for generating multiple-choice coding questions using a two-agent RAG architecture with human validation, achieving high success rates on computationally verifiable dimensions but requiring human expertise for deeper pedagogical judgment.
Details
Motivation: The paper aims to develop an AI system for generating context-aligned multiple-choice questions to enhance student code reasoning and comprehension abilities, addressing the need for scalable educational content generation while maintaining pedagogical quality.
Method: CODE-GEN uses a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic AI system with two agents: a Generator agent that produces coding questions aligned with learning objectives, and a Validator agent that assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools for computational accuracy and code verification.
Result: Evaluation with six subject-matter experts on 288 AI-generated questions (2,016 human-AI rating pairs) showed strong performance with human-validated success rates ranging from 79.9% to 98.6% across seven pedagogical dimensions. The system excels at dimensions suited to computational verification but requires human expertise for deeper instructional judgment.
Conclusion: CODE-GEN demonstrates effective AI-assisted educational content generation with high reliability on computationally verifiable dimensions, while highlighting the continued need for human expertise in areas requiring deeper pedagogical judgment, informing strategic human-AI collaboration in educational content creation.
Abstract: We present CODE-GEN, a human-in-the-Loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the assessments of Validator, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.
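The Generator-Validator loop can be sketched with both LLM agents replaced by stubs; the dimension names below only loosely echo the paper's seven pedagogical dimensions, and all function names are illustrative:

```python
# Schematic two-agent loop in the spirit of CODE-GEN: a Generator stub
# produces a multiple-choice question, a Validator stub scores it on several
# pedagogical dimensions, and generation retries until every check passes.
# Both agents and the dimension list are simplified stand-ins.

DIMENSIONS = ["clarity", "code_validity", "concept_alignment", "answer_validity"]

def generate_question(objective):
    return {"stem": f"What does this code print? ({objective})",
            "options": ["A", "B", "C", "D"], "answer": "B"}

def validate(question):
    """Stub validator: one boolean verdict per pedagogical dimension."""
    return {
        "clarity": bool(question["stem"]),
        "code_validity": True,       # a real tool would execute the snippet
        "concept_alignment": True,   # a real agent would check the objective
        "answer_validity": question["answer"] in question["options"],
    }

def generate_validated(objective, max_tries=3):
    """Generator-Validator loop: retry until all dimensions pass."""
    for _ in range(max_tries):
        q = generate_question(objective)
        checks = validate(q)
        if all(checks.values()):
            return q, checks
    return None, checks
```

In the real system the human-in-the-loop step then judges the Validator's per-dimension verdicts, which is exactly what the paper's 2,016 human-AI rating pairs measure.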
[497] SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, Jian Ma
Main category: cs.AI
TL;DR: SkillFoundry is a self-evolving framework that automatically converts fragmented scientific procedural knowledge from various sources into validated, executable agent skills, improving agent performance on scientific tasks.
Details
Motivation: Scientific ecosystems contain abundant procedural knowledge across repositories, APIs, scripts, notebooks, documentation, databases, and papers, but this knowledge remains fragmented across heterogeneous artifacts that agents cannot readily operationalize, creating a bottleneck for building effective scientific agents.
Method: SkillFoundry organizes a target domain as a domain knowledge tree, mines resources from high-value branches, extracts operational contracts, compiles them into executable skill packages, and iteratively expands, repairs, merges, or prunes the resulting library through a closed-loop validation process.
Result: SkillFoundry produces a substantially novel skill library (71.1% of mined skills differ from existing libraries), improves coding agent performance on five of six MoSciBench datasets, and enables design of new task-specific skills that substantially improve performance on challenging genomics tasks like cell type annotation and scDRS workflow.
Conclusion: Automatically mined skills improve agent performance on benchmarks and domain-specific tasks, expand coverage beyond hand-crafted skill libraries, and provide a practical foundation for more capable scientific agents.
Abstract: Modern scientific ecosystems are rich in procedural knowledge across repositories, APIs, scripts, notebooks, documentation, databases, and papers, yet much of this knowledge remains fragmented across heterogeneous artifacts that agents cannot readily operationalize. This gap between abundant scientific know-how and usable agent capabilities is a key bottleneck for building effective scientific agents. We present SkillFoundry, a self-evolving framework that converts such resources into validated agent skills, reusable packages that encode task scope, inputs and outputs, execution steps, environment assumptions, provenance, and tests. SkillFoundry organizes a target domain as a domain knowledge tree, mines resources from high-value branches, extracts operational contracts, compiles them into executable skill packages, and then iteratively expands, repairs, merges, or prunes the resulting library through a closed-loop validation process. SkillFoundry produces a substantially novel and internally valid skill library, with 71.1% of mined skills differing from existing skill libraries such as SkillHub and SkillSMP. We demonstrate that these mined skills improve coding agent performance on five of the six MoSciBench datasets. We further show that SkillFoundry can design new task-specific skills on demand for concrete scientific objectives, and that the resulting skills substantially improve performance on two challenging genomics tasks: cell type annotation and the scDRS workflow. Together, these results show that automatically mined skills improve agent performance on benchmarks and domain-specific tasks, expand coverage beyond hand-crafted skill libraries, and provide a practical foundation for more capable scientific agents.
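The skill packages and closed-loop library maintenance described above might look roughly like the following toy sketch. This is not the authors' code; the field names, the validation-by-running-tests rule, and the merge-by-name policy are all illustrative assumptions:

```python
# Illustrative sketch of a SkillFoundry-style skill package and one
# expand/repair/merge/prune iteration. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    scope: str          # task scope
    inputs: list        # expected inputs
    outputs: list       # produced outputs
    steps: list         # execution steps
    provenance: str     # source artifact (repo, notebook, paper)
    tests: list = field(default_factory=list)  # callables returning bool

    def validate(self):
        # Stand-in for executing the package's own tests in a sandbox.
        return all(t() for t in self.tests)

def evolve(library, candidates):
    """One closed-loop iteration: expand with candidates that validate,
    merge exact name duplicates, and prune skills that fail validation."""
    by_name = {s.name: s for s in library}
    for cand in candidates:
        if cand.validate():
            by_name[cand.name] = cand           # expand, or merge duplicate
    return [s for s in by_name.values() if s.validate()]  # prune failures

lib = evolve([], [
    Skill("align_reads", "RNA-seq", ["fastq"], ["bam"],
          ["run aligner"], "repo:X", [lambda: True]),
    Skill("broken_tool", "RNA-seq", [], [], [],
          "notebook:Y", [lambda: False]),   # fails its test, gets pruned
])
```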
[498] Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
Wenyue Hua, Tianyi Peng, Chi Wang, Ian Kaufman, Bryan Lim, Chandler Fang
Main category: cs.AI
TL;DR: Proposes Agentic Risk Standard (ARS) - a payment settlement framework for AI agents that provides contractual compensation for failures, shifting trust from model behavior to enforceable product guarantees.
Details
Motivation: Current trustworthy AI focuses on model-internal properties, but as AI systems evolve into autonomous agents connected to payments and assets, trust needs to shift to end-to-end outcomes and risk management for user protection.
Method: Introduces the ARS framework inspired by financial underwriting, integrating risk assessment, underwriting, and compensation into AI-mediated transactions with predefined enforceable compensation for failures.
Result: Presents a simulation study analyzing social benefits of applying ARS to agentic transactions, with implementation available on GitHub.
Conclusion: ARS bridges gap between model-level reliability and user-facing assurance by making trust explicit, measurable, and enforceable through risk management framework.
Abstract: Prior work on trustworthy AI emphasizes model-internal properties such as bias mitigation, adversarial robustness, and interpretability. As AI systems evolve into autonomous agents deployed in open environments and increasingly connected to payments or assets, the operational meaning of trust shifts to end-to-end outcomes: whether an agent completes tasks, follows user intent, and avoids failures that cause material or psychological harm. These risks are fundamentally product-level and cannot be eliminated by technical safeguards alone because agent behavior is inherently stochastic. To address this gap between model-level reliability and user-facing assurance, we propose a complementary framework based on risk management. Drawing inspiration from financial underwriting, we introduce the Agentic Risk Standard (ARS), a payment settlement standard for AI-mediated transactions. ARS integrates risk assessment, underwriting, and compensation into a single transaction framework that protects users when interacting with agents. Under ARS, users receive predefined and contractually enforceable compensation in cases of execution failure, misalignment, or unintended outcomes. This shifts trust from an implicit expectation about model behavior to an explicit, measurable, and enforceable product guarantee. We also present a simulation study analyzing the social benefits of applying ARS to agentic transactions. ARS’s implementation can be found at https://github.com/t54-labs/AgenticRiskStandard.
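The underwriting-plus-compensation idea can be made concrete with a minimal sketch. The expected-loss premium formula and the flat compensation rule below are assumptions chosen for illustration, not the ARS specification:

```python
# Hypothetical ARS-style settlement: a transaction is underwritten with a
# risk-priced premium, and a predefined compensation is paid out when the
# agent's execution fails or misaligns with user intent.
def underwrite(amount, p_failure, compensation_ratio=1.0, load=0.1):
    """Premium charged to cover contractual compensation (assumed pricing:
    expected loss plus a loading for underwriter margin)."""
    expected_loss = p_failure * amount * compensation_ratio
    return expected_loss * (1.0 + load)

def settle(amount, outcome, compensation_ratio=1.0):
    """outcome in {'success', 'execution_failure', 'misalignment'}."""
    if outcome == "success":
        return 0.0
    return amount * compensation_ratio  # predefined, enforceable payout

premium = underwrite(100.0, p_failure=0.05)      # 5% failure risk
payout = settle(100.0, "execution_failure")      # contractual compensation
```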
[499] Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning
Zun Li, Marc Lanctot, Kevin R. McKee, Luke Marris, Ian Gemp, Daniel Hennes, Paul Muller, Kate Larson, Yoram Bachrach, Michael P. Wellman
Main category: cs.AI
TL;DR: A scalable multiagent training regime using deep generative models for opponent modeling in imperfect information games, with applications to bargaining games.
Details
Motivation: Existing opponent modeling methods require domain-specific heuristics and don't scale well to large imperfect information domains. There's a need for scalable, generic approaches to opponent modeling in multiagent settings.
Method: Proposes Generative Best Response (GenBR) using MCTS with learned deep generative models to sample world states during planning. Integrates with Policy Space Response Oracles (PSRO) framework for offline opponent modeling via iterative game-theoretic reasoning and population-based training. Uses bargaining theory solution concepts to build opponent mixtures near Pareto frontier.
Result: GenBR scales to large imperfect information domains, finds stronger policies during training and testing, enables online Bayesian co-player prediction, and produces agents whose social welfare and Nash bargaining scores when negotiating with humans are comparable to those of human-human negotiation.
Conclusion: The approach provides a scalable, generic framework for opponent modeling that works well in bargaining games and could potentially extend to other multiagent domains with imperfect information.
Abstract: Opponent modeling methods typically involve two crucial steps: building a belief distribution over opponents’ strategies, and exploiting this opponent model by playing a best response. However, existing approaches typically require domain-specific heuristics to come up with such a model, and algorithms for approximating best responses are hard to scale in large, imperfect information domains. In this work, we introduce a scalable and generic multiagent training regime for opponent modeling using deep game-theoretic reinforcement learning. We first propose Generative Best Response (GenBR), a best response algorithm based on Monte-Carlo Tree Search (MCTS) with a learned deep generative model that samples world states during planning. This new method scales to large imperfect information domains and can be used in a plug-and-play fashion in a variety of multiagent algorithms. We use this new method under the framework of Policy Space Response Oracles (PSRO) to automate the generation of an offline opponent model via iterative game-theoretic reasoning and population-based training. We propose using solution concepts based on bargaining theory to build up an opponent mixture, which we find identifies profiles near the Pareto frontier. Then GenBR keeps updating an online opponent model and reacts against it during gameplay. We conduct behavioral studies where human participants negotiate with our agents in Deal-or-No-Deal, a class of bilateral bargaining games. Search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents whose social welfare and Nash bargaining scores when negotiating with humans are comparable to those of humans trading among themselves.
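The core GenBR idea, planning under imperfect information by sampling plausible full world states from a learned generative model rather than tracking exact beliefs, can be sketched schematically. The flat sampled-evaluation loop below is a simplification of MCTS, and the generative model and payoffs are toy stand-ins:

```python
# Schematic GenBR-style planning: each simulation draws a world state from
# a generative model of hidden information, then actions are scored against
# the sampled worlds. All components here are illustrative stand-ins.
import random

def generative_model(observation, rng):
    # Stand-in for a deep generative model p(world_state | observation):
    # here, uniform over hidden states consistent with the observation.
    return rng.choice(observation["consistent_hidden_states"])

def plan(observation, actions, payoff, n_sims=200, seed=0):
    rng = random.Random(seed)
    totals = {a: 0.0 for a in actions}
    for _ in range(n_sims):
        world = generative_model(observation, rng)   # sample a world state
        for a in actions:                            # score each action on it
            totals[a] += payoff(world, a)
    return max(actions, key=lambda a: totals[a])     # best sampled response

obs = {"consistent_hidden_states": [1, 2, 3]}        # hidden value ~ U{1,2,3}
payoff = lambda world, a: (world if a == "call" else 1.5)  # toy payoffs
best = plan(obs, ["call", "fold"], payoff)
```

With the toy payoffs, "call" earns the hidden value (mean 2) while "fold" earns a safe 1.5, so sampled planning picks "call".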
[500] FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
Hang Xu, Ling Yue, Chaoqian Ouyang, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang
Main category: cs.AI
TL;DR: FactReview is an evidence-grounded reviewing system that extracts claims from papers, retrieves related work for context, and verifies empirical claims through code execution, producing structured evidence reports for peer review.
Details
Motivation: Peer review in ML faces pressure from high submission volumes and limited reviewer time. Current LLM-based reviewing systems only read manuscripts and generate comments from the paper's narrative, making them sensitive to presentation quality and weak when evidence needed for review lies in related work or released code.
Method: FactReview combines claim extraction, literature positioning, and execution-based claim verification. It identifies major claims and reported results, retrieves nearby work to clarify technical position, and executes released code under bounded budgets to test empirical claims. Produces reviews with evidence reports labeling claims as Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive.
Result: In a case study on CompGCN, FactReview reproduced results closely matching reported link prediction and node classification results, but showed the paper’s broader performance claim across tasks was not fully sustained: on MUTAG graph classification, reproduced result was 88.4% vs. paper’s reported strongest baseline of 92.6%, labeling the claim as partially supported.
Conclusion: AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. FactReview demonstrates this approach through systematic evidence collection and verification.
Abstract: Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper’s own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper’s technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper’s broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at https://github.com/DEFENSE-SEU/Review-Assistant.
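The five-label scheme can be illustrated with a toy decision rule. The tolerance threshold and the exact boundary between "Partially supported" and "In conflict" are assumptions for illustration, not FactReview's actual criteria:

```python
# Hedged sketch of a five-label claim-verification rule in the spirit of
# FactReview's evidence report. Thresholds are illustrative assumptions.
def label_claim(reported, reproduced=None, paper_only=False, tol=0.01):
    """Assign one of the five labels to an empirical claim.

    reported/reproduced: metric values as fractions (e.g. 0.926 = 92.6%);
    paper_only: True when no code could be executed to verify the claim.
    """
    if paper_only:
        return "Supported by the paper"
    if reproduced is None:
        return "Inconclusive"
    gap = reproduced - reported
    if abs(gap) <= tol:
        return "Supported"
    if gap < -tol and reproduced > 0.5 * reported:
        return "Partially supported"   # reproduced, but short of the claim
    return "In conflict"

# CompGCN case from the summary: 88.4% reproduced vs. 92.6% reported on MUTAG.
mutag = label_claim(reported=0.926, reproduced=0.884)
```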
[501] A Multi-Agent Reinforcement Learning Framework for Public Health Decision Analysis
Dinesh Sharma, Ankit Shah, Chaitra Gopalappa
Main category: cs.AI
TL;DR: MARL framework for optimizing HIV resource allocation across jurisdictions, outperforming single-agent approaches in reducing infections under budget constraints.
Details
Motivation: HIV has significant geographical disparities in the U.S., and existing decision models fail to capture jurisdictional interactions critical for optimizing intervention strategies under the 'Ending the HIV Epidemic' initiative.
Method: Multi-agent reinforcement learning (MARL) framework that enables jurisdiction-specific decision-making while accounting for cross-jurisdictional epidemiological interactions as an intelligent resource optimization system.
Result: MARL-driven policies outperform traditional single-agent reinforcement learning approaches by reducing new infections under fixed budget constraints in California and Florida jurisdictions.
Conclusion: Incorporating jurisdictional dependencies in decision-making frameworks is crucial for large-scale public health initiatives, and MARL offers a scalable framework for healthcare policy and epidemic management.
Abstract: Human immunodeficiency virus (HIV) is a major public health concern in the United States (U.S.), with about 1.2 million people living with it and about 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The ‘Ending the HIV Epidemic (EHE)’ initiative by the U.S. Department of Health and Human Services aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. We develop intelligent decision-support systems to optimize resource allocation and intervention strategies. Existing decision analytic models either focus on individual cities or aggregate national data, failing to capture jurisdictional interactions critical for optimizing intervention strategies. To address this, we propose a multi-agent reinforcement learning (MARL) framework that enables jurisdiction-specific decision-making while accounting for cross-jurisdictional epidemiological interactions. Our framework functions as an intelligent resource optimization system, helping policymakers strategically allocate interventions based on dynamic, data-driven insights. Experimental results across jurisdictions in California and Florida demonstrate that MARL-driven policies outperform traditional single-agent reinforcement learning approaches by reducing new infections under fixed budget constraints. Our study highlights the importance of incorporating jurisdictional dependencies in decision-making frameworks for large-scale public health initiatives. By integrating multi-agent intelligent systems, decision analytics, and reinforcement learning, this study advances expert systems for government resource planning and public health management, offering a scalable framework for broader applications in healthcare policy and epidemic management.
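Why jurisdictional coupling matters for budget allocation can be seen in a toy model: new infections in each jurisdiction depend on a mixing-weighted pool of local and imported prevalence, so one jurisdiction's coverage decision changes its neighbors' outcomes. All rates and mixing weights below are made up for illustration:

```python
# Toy cross-jurisdictional infection model illustrating why allocation
# decisions interact across jurisdictions. Parameters are illustrative.
def new_infections(prev, coverage, mixing):
    """prev: prevalence per jurisdiction; coverage: intervention coverage
    (0-1) per jurisdiction; mixing[i][j]: contact weight from j into i."""
    n = len(prev)
    out = []
    for i in range(n):
        # Exposure pools local prevalence with imported prevalence.
        exposure = sum(mixing[i][j] * prev[j] for j in range(n))
        out.append(exposure * (1.0 - coverage[i]))  # interventions scale down
    return out

prev = [0.10, 0.02]                      # one high-, one low-prevalence area
mixing = [[0.9, 0.1], [0.1, 0.9]]        # mostly local contact, some spillover
# Concentrating the budget on the high-prevalence jurisdiction:
targeted = new_infections(prev, coverage=[0.8, 0.0], mixing=mixing)
# Splitting the same budget evenly:
split = new_infections(prev, coverage=[0.4, 0.4], mixing=mixing)
```

In this toy setting the targeted allocation yields fewer total new infections than the even split, the kind of trade-off a coordinated multi-agent policy can discover.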
[502] Compliance-by-Construction Argument Graphs: Using Generative AI to Produce Evidence-Linked Formal Arguments for Certification-Grade Accountability
Mahyar T. Moghaddam
Main category: cs.AI
TL;DR: A compliance-by-construction architecture integrating Generative AI with formal argument representations for structured justification in high-stakes decision systems.
Details
Motivation: High-stakes decision systems require structured justification, traceability, and auditability for accountability and regulatory compliance. Current GenAI deployments as loosely constrained assistants introduce risks like hallucinated reasoning, unsupported claims, and weak traceability.
Method: Proposes a compliance-by-construction architecture with four components: 1) typed Argument Graph representation inspired by assurance-case methods, 2) retrieval-augmented generation (RAG) to draft argument fragments grounded in authoritative evidence, 3) reasoning and validation kernel enforcing completeness and admissibility constraints, and 4) provenance ledger aligned with W3C PROV standard for auditability.
Result: System design and evaluation strategy based on enforceable invariants and worked examples. Analysis suggests deterministic validation rules can prevent unsupported claims while allowing GenAI to accelerate argument construction.
Conclusion: The architecture integrates GenAI with formal argument representations to provide structured justification, traceability, and auditability in high-stakes decision systems while mitigating risks of hallucination and unsupported claims.
Abstract: High-stakes decision systems increasingly require structured justification, traceability, and auditability to ensure accountability and regulatory compliance. Formal arguments commonly used in the certification of safety-critical systems provide a mechanism for structuring claims, reasoning, and evidence in a verifiable manner. At the same time, generative artificial intelligence systems are increasingly integrated into decision-support workflows, assisting with drafting explanations, summarizing evidence, and generating recommendations. However, current deployments often rely on language models as loosely constrained assistants, which introduces risks such as hallucinated reasoning, unsupported claims, and weak traceability. This paper proposes a compliance-by-construction architecture that integrates Generative AI (GenAI) with structured formal argument representations. The approach treats each AI-assisted step as a claim that must be supported by verifiable evidence and validated against explicit reasoning constraints before it becomes part of an official decision record. The architecture combines four components: i) a typed Argument Graph representation inspired by assurance-case methods, ii) retrieval-augmented generation (RAG) to draft argument fragments grounded in authoritative evidence, iii) a reasoning and validation kernel enforcing completeness and admissibility constraints, and iv) a provenance ledger aligned with the W3C PROV standard to support auditability. We present a system design and an evaluation strategy based on enforceable invariants and worked examples. The analysis suggests that deterministic validation rules can prevent unsupported claims from entering the decision record while allowing GenAI to accelerate argument construction.
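The "claim admitted only with verified evidence" invariant can be sketched as a tiny typed graph. The node types and the completeness rule below are illustrative assumptions inspired by assurance-case notations, not the paper's exact schema:

```python
# Illustrative typed Argument Graph with a completeness invariant: a claim
# is admissible only if it cites at least one evidence node and every cited
# evidence node has been verified against provenance. Names are hypothetical.
class ArgumentGraph:
    def __init__(self):
        self.nodes = {}        # id -> ("claim" | "evidence", payload)
        self.support = {}      # claim id -> list of evidence ids
        self.verified = set()  # evidence ids checked against provenance

    def add_evidence(self, eid, source, verified=False):
        self.nodes[eid] = ("evidence", source)
        if verified:
            self.verified.add(eid)

    def add_claim(self, cid, text, evidence_ids):
        self.nodes[cid] = ("claim", text)
        self.support[cid] = list(evidence_ids)

    def admissible(self, cid):
        """Completeness invariant: a claim enters the decision record only
        when all of its supporting evidence exists and is verified."""
        ev = self.support.get(cid, [])
        return bool(ev) and all(e in self.verified for e in ev)

g = ArgumentGraph()
g.add_evidence("e1", "audit log, section 4", verified=True)
g.add_claim("c1", "The decision met threshold X", ["e1"])
g.add_claim("c2", "Unsupported generated claim", [])  # no evidence: blocked
```

The deterministic check means an LLM can freely draft claims, but nothing unsupported reaches the official record.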
[503] InsTraj: Instructing Diffusion Models with Travel Intentions to Generate Real-world Trajectories
Yuanshao Zhu, Yuxuan Liang, Xiangyu Zhao, Liang Han, Xinwei Fang, Xuetao Wei, James Jianqiao Yu
Main category: cs.AI
TL;DR: InsTraj: A framework using LLMs and diffusion models to generate realistic GPS trajectories from natural language instructions, addressing semantic understanding and constraint handling challenges.
Details
Motivation: Existing GPS trajectory generation methods lack deep semantic understanding of travel intent and struggle with complex constraints while maintaining realistic diversity in human behavior patterns.
Method: 1) Uses large language models to interpret natural language travel intentions and create semantic blueprints; 2) Proposes multimodal trajectory diffusion transformer that integrates semantic guidance to generate high-fidelity, instruction-faithful trajectories.
Result: Comprehensive experiments on real-world datasets show InsTraj significantly outperforms state-of-the-art methods in generating realistic, diverse, and semantically faithful trajectories.
Conclusion: InsTraj successfully bridges the representation gap between natural language intentions and GPS trajectories, enabling controllable generation of realistic trajectories from textual instructions.
Abstract: The generation of realistic and controllable GPS trajectories is a fundamental task for applications in urban planning, mobility simulation, and privacy-preserving data sharing. However, existing methods face a two-fold challenge: they lack the deep semantic understanding to interpret complex user travel intent, and struggle to handle complex constraints while maintaining the realistic diversity inherent in human behavior. To resolve this, we introduce InsTraj, a novel framework that instructs diffusion models to generate high-fidelity trajectories directly from natural language descriptions. Specifically, InsTraj first utilizes a powerful large language model to decipher unstructured travel intentions expressed in natural language, thereby creating rich semantic blueprints and bridging the representation gap between intentions and trajectories. Subsequently, we propose a multimodal trajectory diffusion transformer that can integrate semantic guidance to generate high-fidelity and instruction-faithful trajectories that adhere to fine-grained user intent. Comprehensive experiments on real-world datasets demonstrate that InsTraj significantly outperforms state-of-the-art methods in generating trajectories that are realistic, diverse, and semantically faithful to the input instructions.
[504] Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
Paulo Akira F. Enabe
Main category: cs.AI
TL;DR: PTR is a bounded execution framework for tool-augmented LLM agents that first synthesizes explicit workflows before execution, reducing reactive recomputation and limiting LLM calls to 2-3 in most cases.
Details
Motivation: Current LLM agents using reactive execution repeatedly recompute reasoning after each observation, increasing latency and sensitivity to error propagation. There's a need for more efficient, structured approaches to tool-augmented reasoning.
Method: Profile-Then-Reason (PTR) framework where: 1) LLM synthesizes explicit workflow, 2) deterministic/guarded operators execute workflow, 3) verifier evaluates trace, 4) repair invoked only when original workflow unreliable. Mathematical formulation with bounded repair limits LLM calls to 2 (nominal) or 3 (worst case).
Result: PTR achieves pairwise exact-match advantage in 16 of 24 configurations against ReAct baseline across 6 benchmarks and 4 language models. Particularly effective on retrieval-centered and decomposition-heavy tasks.
Conclusion: PTR provides efficient bounded execution for structured tool-augmented reasoning, reducing LLM calls and latency. Reactive execution remains preferable when success depends on substantial online adaptation.
Abstract: Large language model agents that use external tools are often implemented through reactive execution, in which reasoning is repeatedly recomputed after each observation, increasing latency and sensitivity to error propagation. This work introduces Profile–Then–Reason (PTR), a bounded execution framework for structured tool-augmented reasoning, in which a language model first synthesizes an explicit workflow, deterministic or guarded operators execute that workflow, a verifier evaluates the resulting trace, and repair is invoked only when the original workflow is no longer reliable. A mathematical formulation is developed in which the full pipeline is expressed as a composition of profile, routing, execution, verification, repair, and reasoning operators; under bounded repair, the number of language-model calls is restricted to two in the nominal case and three in the worst case. Experiments against a ReAct baseline on six benchmarks and four language models show that PTR achieves the pairwise exact-match advantage in 16 of 24 configurations. The results indicate that PTR is particularly effective on retrieval-centered and decomposition-heavy tasks, whereas reactive execution remains preferable when success depends on substantial online adaptation.
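The call bound is easy to see in a skeleton of the pipeline: one call to synthesize the workflow, deterministic operators to execute it, at most one repair call on verification failure, and one final reasoning call, giving 2 calls nominally and 3 in the worst case. The llm/operator/verifier stand-ins below are assumptions, not the paper's implementation:

```python
# Minimal sketch of PTR's bounded execution loop. Everything besides the
# 2-nominal / 3-worst-case call structure is a hypothetical stand-in.
def ptr(task, llm, operators, verify):
    calls = 0
    plan = llm(f"plan: {task}"); calls += 1           # call 1: synthesize plan
    trace = [operators[step](task) for step in plan]  # deterministic execution
    if not verify(trace):
        plan = llm(f"repair: {task}"); calls += 1     # extra call on failure
        trace = [operators[step](task) for step in plan]
    answer = llm(f"answer from {trace}"); calls += 1  # final reasoning call
    return answer, calls

ops = {"lookup": lambda t: t.upper(), "count": lambda t: len(t)}
stub_llm = lambda p: ["lookup"] if p.startswith(("plan", "repair")) else "done"
answer, n_calls = ptr("q1", stub_llm, ops, verify=lambda tr: True)    # nominal
_, worst_calls = ptr("q1", stub_llm, ops, verify=lambda tr: False)    # repair
```

Contrast with ReAct-style execution, where the model is called once per step, so call count grows with workflow length.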
[505] Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Zikun Zhu, Adina Yakefu, Shuxin Zheng
Main category: cs.AI
TL;DR: FinWorkBench (Finch) is a benchmark for evaluating AI agents on complex, real-world finance and accounting workflows using authentic enterprise data from Enron and financial institutions.
Details
Motivation: There's a need for benchmarks that capture the messy, multimodal, and collaborative nature of real enterprise workflows, especially in finance and accounting domains, to properly evaluate AI agents' capabilities beyond simple tasks.
Method: Combines LLM-assisted mining of workflows from authentic enterprise environments (Enron emails, spreadsheet version histories) with expert annotation, requiring 700+ hours of expert effort to create 172 composite workflows with 384 tasks.
Result: Frontier AI systems like GPT 5.1 perform poorly, spending 16.8 minutes per workflow but passing only 38.4% of workflows, highlighting the challenges of real-world enterprise workflows.
Conclusion: Real-world enterprise workflows pose significant challenges for current AI agents, and FinWorkBench provides a valuable benchmark for evaluating and improving agent capabilities in complex, multimodal, collaborative environments.
Abstract: We introduce FinWorkBench (a.k.a. Finch), a benchmark for evaluating agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is built from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions spanning 2000 to 2025, preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation. Specifically, we use LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, followed by meticulous workflow annotation requiring more than 700 hours of expert effort. This process yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT 5.1, Claude Sonnet/Opus 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends an average of 16.8 minutes per workflow yet passes only 38.4% of workflows. Comprehensive case studies further highlight the challenges that real-world enterprise workflows pose for AI agents.
[506] Solar-VLM: Multimodal Vision-Language Models for Augmented Solar Power Forecasting
Hang Fan, Haoran Pei, Runze Liang, Weican Liu, Long Cheng, Wei Wei
Main category: cs.AI
TL;DR: Solar-VLM: A multimodal LLM framework for photovoltaic power forecasting that fuses time-series data, satellite imagery, and textual weather information using modality-specific encoders and cross-site feature fusion.
Details
Motivation: PV power forecasting is critical for power systems but challenging due to complex spatiotemporal dependencies on weather conditions and cloud motion. Existing AI methods fail to effectively fuse temporal observations, satellite imagery, and textual weather information in a unified framework.
Method: 1) Modality-specific encoders: time-series encoder with patch-based design for temporal patterns, visual encoder based on Qwen vision backbone for satellite imagery, text encoder for weather descriptions. 2) Cross-site feature fusion: Graph Learner with GAT over KNN graph for spatial dependencies, cross-site attention module for adaptive information exchange.
Result: Experiments on data from eight PV stations in northern China demonstrate the effectiveness of the proposed framework. The model is publicly available on GitHub.
Conclusion: Solar-VLM provides a unified multimodal framework for PV power forecasting that effectively integrates heterogeneous data sources and captures complex spatiotemporal dependencies across distributed PV stations.
Abstract: Photovoltaic (PV) power forecasting plays a critical role in power system dispatch and market participation. Because PV generation is highly sensitive to weather conditions and cloud motion, accurate forecasting requires effective modeling of complex spatiotemporal dependencies across multiple information sources. Although recent studies have advanced AI-based forecasting methods, most fail to fuse temporal observations, satellite imagery, and textual weather information in a unified framework. This paper proposes Solar-VLM, a large-language-model-driven framework for multimodal PV power forecasting. First, modality-specific encoders are developed to extract complementary features from heterogeneous inputs. The time-series encoder adopts a patch-based design to capture temporal patterns from multivariate observations at each site. The visual encoder, built upon a Qwen-based vision backbone, extracts cloud-cover information from satellite images. The text encoder distills historical weather characteristics from textual descriptions. Second, to capture spatial dependencies across geographically distributed PV stations, a cross-site feature fusion mechanism is introduced. Specifically, a Graph Learner models inter-station correlations through a graph attention network constructed over a K-nearest-neighbor (KNN) graph, while a cross-site attention module further facilitates adaptive information exchange among sites. Finally, experiments conducted on data from eight PV stations in a northern province of China demonstrate the effectiveness of the proposed framework. Our proposed model is publicly available at https://github.com/rhp413/Solar-VLM.
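The cross-site fusion step, a K-nearest-neighbor graph over station locations with attention-weighted neighbor aggregation, can be sketched in a simplified form. This is a schematic stand-in for the paper's graph attention network, with plain dot-product attention scores as an assumption:

```python
# Simplified KNN-graph construction and attention-based neighbor fusion
# over per-station feature vectors. Illustrative, not the paper's GAT.
import numpy as np

def knn_graph(coords, k):
    """Return, per station, the indices of its k nearest other stations."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self-edges
    return np.argsort(d, axis=1)[:, :k]

def attention_fuse(feats, neighbors):
    """Fuse each station's neighbor features with softmax attention."""
    fused = np.empty_like(feats)
    for i, nbrs in enumerate(neighbors):
        scores = feats[nbrs] @ feats[i]     # dot-product attention scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                        # softmax over the k neighbors
        fused[i] = w @ feats[nbrs]          # attention-weighted average
    return fused

coords = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
feats = np.random.default_rng(0).normal(size=(4, 3))  # per-station features
fused = attention_fuse(feats, knn_graph(coords, k=2))
```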
[507] Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents
Hsieh-Ting Lin, Tsung-Yu Hou
Main category: cs.AI
TL;DR: LLM agents develop Theory of Mind through dynamic interaction in poker games when equipped with persistent memory, enabling opponent modeling and strategic deception without explicit training.
Details
Motivation: To test whether LLMs can develop Theory of Mind through dynamic interaction rather than static vignettes, and to understand the conditions under which ToM-like reasoning emerges in autonomous agents.
Method: Used autonomous LLM agents playing extended Texas Hold’em poker sessions in a 2x2 factorial design crossing memory (present/absent) with domain knowledge (present/absent), with 5 replications each (N=20 experiments, ~6,000 agent-hand observations).
Result: Memory was both necessary and sufficient for ToM-like behavior emergence. Agents with memory reached ToM Level 3-5 (predictive to recursive modeling) while those without remained at Level 0. Strategic deception occurred exclusively in memory-equipped conditions. Domain knowledge enhanced but didn’t gate ToM emergence.
Conclusion: Functional ToM-like behavior can emerge from interaction dynamics alone without explicit training, with implications for understanding artificial social intelligence and biological social cognition.
Abstract: Theory of Mind (ToM) – the ability to model others’ mental states – is fundamental to human social cognition. Whether large language models (LLMs) can develop ToM has been tested exclusively through static vignettes, leaving open whether ToM-like reasoning can emerge through dynamic interaction. Here we report that autonomous LLM agents playing extended sessions of Texas Hold’em poker progressively develop sophisticated opponent models, but only when equipped with persistent memory. In a 2x2 factorial design crossing memory (present/absent) with domain knowledge (present/absent), each with five replications (N = 20 experiments, ~6,000 agent-hand observations), we find that memory is both necessary and sufficient for ToM-like behavior emergence (Cliff’s delta = 1.0, p = 0.008). Agents with memory reach ToM Level 3-5 (predictive to recursive modeling), while agents without memory remain at Level 0 across all replications. Strategic deception grounded in opponent models occurs exclusively in memory-equipped conditions (Fisher’s exact p < 0.001). Domain expertise does not gate ToM-like behavior emergence but enhances its application: agents without poker knowledge develop equivalent ToM levels but less precise deception (p = 0.004). Agents with ToM deviate from game-theoretically optimal play (67% vs. 79% TAG adherence, delta = -1.0, p = 0.008) to exploit specific opponents, mirroring expert human play. All mental models are expressed in natural language and directly readable, providing a transparent window into AI social cognition. Cross-model validation with GPT-4o yields weighted Cohen’s kappa = 0.81 (almost perfect agreement). These findings demonstrate that functional ToM-like behavior can emerge from interaction dynamics alone, without explicit training or prompting, with implications for understanding artificial social intelligence and biological social cognition.
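The headline effect size, Cliff's delta = 1.0, is the standard nonparametric statistic P(x > y) - P(x < y) over all cross-group pairs; a value of 1.0 means every memory-equipped agent outscored every memoryless one. A small sketch with hypothetical ToM-level scores:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

# Completely separated groups, as in the memory vs. no-memory comparison,
# give the maximal value of 1.0 (scores here are illustrative).
memory = [3, 4, 5, 4, 5]      # ToM Levels 3-5 with memory
no_memory = [0, 0, 0, 0, 0]   # Level 0 without memory
delta = cliffs_delta(memory, no_memory)
```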
[508] Collective AI can amplify tiny perturbations into divergent decisions
Hajime Shimao, Warut Khern-am-nuai, Sung Joo Kim
Main category: cs.AI
TL;DR: Multi-LLM committee deliberation systems show instability where small perturbations amplify into divergent conversational trajectories and different final decisions, even in deterministic settings.
Details
Motivation: Large language models are increasingly deployed as committees for deliberation and decision-making, with expectations of greater robustness than individual models. However, there's a need to understand the stability of such collective AI systems.Method: The study examines iterative multi-LLM deliberation in both self-hosted deterministic benchmarks and deployed black-box API systems. Experiments across 12 policy scenarios test sensitivity to small meaning-preserving changes to scenario text. Additional experiments investigate how committee architecture (role structure, model composition, feedback memory) modulates instability.
Result: Even in fully deterministic settings, small perturbations amplify into divergent conversational trajectories and different final decisions. Black-box API systems show instability even at temperature 0 where near-determinism is expected. Committee architecture elements (role structure, model composition, feedback memory) can alter divergence levels.
Conclusion: Collective AI faces a stability problem beyond just accuracy - deterministic execution doesn’t guarantee predictable or auditable deliberative outcomes. Instability arises from sensitivity to initial conditions under repeated interaction, not just platform-side stochasticity.
Abstract: Large language models are increasingly deployed not as single assistants but as committees whose members deliberate and then vote or synthesize a decision. Such systems are often expected to be more robust than individual models. We show that iterative multi-LLM deliberation can instead amplify tiny perturbations into divergent conversational trajectories and different final decisions. In a fully deterministic self-hosted benchmark, exact reruns are identical, yet small meaning-preserving changes to the scenario text still separate over time and often alter the final recommendation. In deployed black-box API systems, nominally identical committee runs likewise remain unstable even at temperature 0, where many users expect near-determinism. Across 12 policy scenarios, these findings indicate that instability in collective AI is not only a consequence of residual platform-side stochasticity, but can arise from sensitivity to nearby initial conditions under repeated interaction itself. Additional deployed experiments show that committee architecture modulates this instability: role structure, model composition, and feedback memory can each alter the degree of divergence. Collective AI therefore faces a stability problem, not only an accuracy problem: deterministic execution alone does not guarantee predictable or auditable deliberative outcomes.
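One way to quantify how perturbed deliberation runs "separate over time" is a per-round lexical distance between transcripts; the paper's actual divergence measure is not specified here, so this Jaccard-distance sketch is purely illustrative:

```python
def round_divergence(run_a, run_b):
    """Per-round Jaccard distance between two deliberation transcripts."""
    out = []
    for msg_a, msg_b in zip(run_a, run_b):
        a, b = set(msg_a.lower().split()), set(msg_b.lower().split())
        out.append(1.0 - len(a & b) / len(a | b))
    return out

# Hypothetical committee messages: identical opening, diverging afterward.
run_a = ["approve the policy", "approve the policy now", "final vote approve"]
run_b = ["approve the policy", "reject it instead", "final vote reject"]
d = round_divergence(run_a, run_b)
```

A meaning-preserving perturbation would start with d near 0 in early rounds; instability shows up as distances that grow rather than decay across rounds.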
[509] A Model of Understanding in Deep Learning Systems
David Peter Wallis Freeborn
Main category: cs.AI
TL;DR: The paper proposes a model of systematic understanding for ML systems, arguing that deep learning achieves understanding but falls short of scientific understanding due to symbolic misalignment, lack of explicit reduction, and weak unification.
Details
Motivation: To develop a formal account of what it means for machine learning systems to achieve genuine understanding of properties in target systems, distinguishing between practical understanding and ideal scientific understanding.
Method: Proposes a philosophical/formal model of systematic understanding with three criteria: adequate internal models tracking real regularities, stable bridge principles coupling to target systems, and reliable prediction support. Analyzes deep learning systems against this framework.
Result: Contemporary deep learning systems often achieve systematic understanding but fall short of scientific understanding due to symbolic misalignment (models don’t use human-interpretable symbols), lack of explicit reduction (not decomposing into simpler components), and weak unification (limited integration across domains).
Conclusion: Deep learning achieves a form of “fractured understanding” - genuine but incomplete systematic understanding that differs from ideal scientific understanding, with implications for AI development and evaluation.
Abstract: I propose a model of systematic understanding, suitable for machine learning systems. On this account, an agent understands a property of a target system when it contains an adequate internal model that tracks real regularities, is coupled to the target by stable bridge principles, and supports reliable prediction. I argue that contemporary deep learning systems often can and do achieve such understanding. However, they generally fall short of the ideal of scientific understanding: the understanding is symbolically misaligned with the target system, not explicitly reductive, and only weakly unifying. I label this the Fractured Understanding Hypothesis.
[510] CoALFake: Collaborative Active Learning with Human-LLM Co-Annotation for Cross-Domain Fake News Detection
Esma Aïmeur, Gilles Brassard, Dorsaf Sallami
Main category: cs.AI
TL;DR: CoALFake: A cross-domain fake news detection approach combining human-LLM co-annotation with domain-aware active learning to address data scarcity and domain generalization challenges.
Details
Motivation: Current fake news detection systems suffer from narrow domain specificity and poor generalization. Cross-domain approaches face challenges with labeled data scarcity (expensive to acquire) and information loss from rigid domain categorization or neglect of domain-specific features.
Method: Proposes CoALFake integrating human-LLM co-annotation with domain-aware active learning. Uses LLMs for scalable, low-cost annotation with human oversight for reliability. Incorporates domain embedding techniques to capture domain-specific nuances and cross-domain patterns, enabling domain-agnostic modeling. Employs domain-aware sampling strategy prioritizing diverse domain coverage.
Result: Experimental results across multiple datasets demonstrate CoALFake consistently outperforms various baselines. Shows human-LLM co-annotation is highly cost-effective while delivering excellent performance, even with minimal human oversight.
Conclusion: The proposed approach effectively addresses cross-domain fake news detection challenges by combining human-LLM collaboration with domain-aware active learning, achieving strong performance with reduced labeling costs.
Abstract: The proliferation of fake news across diverse domains highlights critical limitations in current detection systems, which often exhibit narrow domain specificity and poor generalization. Existing cross-domain approaches face two key challenges: (1) reliance on labelled data, which is frequently unavailable and resource intensive to acquire and (2) information loss caused by rigid domain categorization or neglect of domain-specific features. To address these issues, we propose CoALFake, a novel approach for cross-domain fake news detection that integrates Human-Large Language Model (LLM) co-annotation with domain-aware Active Learning (AL). Our method employs LLMs for scalable, low-cost annotation while maintaining human oversight to ensure label reliability. By integrating domain embedding techniques, CoALFake dynamically captures both domain-specific nuances and cross-domain patterns, enabling the training of a domain-agnostic model. Furthermore, a domain-aware sampling strategy optimizes sample acquisition by prioritizing diverse domain coverage. Experimental results across multiple datasets demonstrate that CoALFake consistently outperforms a range of existing baselines, even with minimal human oversight. Our results emphasize that human-LLM co-annotation is a highly cost-effective approach that delivers excellent performance.
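The domain-aware sampling strategy prioritizes diverse domain coverage during acquisition. A minimal sketch of one plausible realization, uncertainty sampling round-robined over domains (function and field names are hypothetical, not CoALFake's actual procedure):

```python
from collections import defaultdict

def domain_aware_select(candidates, budget):
    """Pick the most uncertain samples while round-robining over domains.

    `candidates` is a list of (sample_id, domain, uncertainty) triples.
    """
    by_domain = defaultdict(list)
    for sid, dom, unc in candidates:
        by_domain[dom].append((unc, sid))
    for dom in by_domain:
        by_domain[dom].sort(reverse=True)  # most uncertain first
    picked, domains = [], sorted(by_domain)
    i = 0
    while len(picked) < budget and any(by_domain[d] for d in domains):
        dom = domains[i % len(domains)]
        if by_domain[dom]:
            picked.append(by_domain[dom].pop(0)[1])
        i += 1
    return picked

cands = [("a", "health", 0.9), ("b", "health", 0.8),
         ("c", "politics", 0.7), ("d", "science", 0.2)]
sel = domain_aware_select(cands, budget=3)
```

Pure uncertainty sampling would take both health samples; the round-robin instead covers all three domains, which is the intuition behind prioritizing diverse domain coverage.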
[511] Comparative reversal learning reveals rigid adaptation in LLMs under non-stationary uncertainty
Haomiaomiao Wang, Tomás E Ward, Lili Zhang
Main category: cs.AI
TL;DR: LLMs show asymmetric learning in reversal tasks: strong win-stay but weak lose-shift behavior, with DeepSeek-V3.2 showing extreme perseveration while Gemini-3 and GPT-5.2 adapt better but still less loss-sensitive than humans.
Details
Motivation: To evaluate how large language models handle non-stationary environments requiring revision of previously learned action values when contingencies change, using probabilistic reversal-learning tasks as a testbed.
Method: Used a two-option probabilistic reversal-learning task with three latent states and switch events triggered by performance criteria or timeout. Compared deterministic fixed transition cycles to stochastic random schedules. Evaluated DeepSeek-V3.2, Gemini-3, and GPT-5.2 against human behavioral data. Applied hierarchical reinforcement-learning fits to analyze underlying mechanisms.
Result: LLMs showed near-ceiling win-stay behavior but markedly attenuated lose-shift, revealing asymmetric use of positive vs negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, while Gemini-3 and GPT-5.2 adapted more rapidly but remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence but didn’t uniformly reduce total wins.
Conclusion: LLMs exhibit rigid adaptation patterns in non-stationary environments, with dissociable mechanisms including weak loss learning, inflated policy determinism, or value polarization via counterfactual suppression. Results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under uncertainty.
Abstract: Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch events triggered by either a performance criterion or timeout. We compare a deterministic fixed transition cycle to a stochastic random schedule that increases volatility, and evaluate DeepSeek-V3.2, Gemini-3, and GPT-5.2, with human data as a behavioural reference. Across models, win-stay was near ceiling while lose-shift was markedly attenuated, revealing asymmetric use of positive versus negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, whereas Gemini-3 and GPT-5.2 adapted more rapidly but still remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence across LLMs yet did not uniformly reduce total wins, demonstrating that high aggregate payoff can coexist with rigid adaptation. Hierarchical reinforcement-learning (RL) fits indicate dissociable mechanisms: rigidity can arise from weak loss learning, inflated policy determinism, or value polarisation via counterfactual suppression. These results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty.
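Win-stay and lose-shift are standard summary statistics of a choice/reward sequence: the rate of repeating a choice after a reward, and of switching after a non-reward. A minimal computation (illustrative data, not the paper's):

```python
def win_stay_lose_shift(choices, rewards):
    """Compute win-stay and lose-shift rates from a choice/reward sequence."""
    win_stay = win_n = lose_shift = lose_n = 0
    for t in range(1, len(choices)):
        if rewards[t - 1]:                 # previous trial was rewarded
            win_n += 1
            win_stay += choices[t] == choices[t - 1]
        else:                              # previous trial was unrewarded
            lose_n += 1
            lose_shift += choices[t] != choices[t - 1]
    return win_stay / max(win_n, 1), lose_shift / max(lose_n, 1)

# A perseverative agent: repeats its choice regardless of outcome, giving
# ceiling win-stay but zero lose-shift -- the asymmetry the paper reports.
ws, ls = win_stay_lose_shift([0, 0, 0, 0, 0], [1, 0, 1, 0, 1])
```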
[512] Schema-Aware Planning and Hybrid Knowledge Toolset for Reliable Knowledge Graph Triple Verification
Xinyan Ma, Xianhao Ou, Weihao Zhang, Shixin Jiang, Runxuan Liu, Dandan Tu, Lei Chen, Ming Liu, Bing Qin
Main category: cs.AI
TL;DR: SHARP is a training-free autonomous agent that reformulates triple verification in knowledge graphs as a dynamic process combining strategic planning, active investigation, and evidential reasoning to address noise and single-source bias.
Details
Motivation: Knowledge graphs inevitably contain noise from automated construction, compromising data trustworthiness. Existing verification methods suffer from single-source bias (relying only on graph structure OR external semantics) and static inference paradigms, struggling with complex/long-tail facts and providing limited interpretability.
Method: SHARP combines Memory-Augmented Mechanism with Schema-Aware Strategic Planning for stable reasoning, and uses enhanced ReAct loop with Hybrid Knowledge Toolset to dynamically integrate internal KG structure and external textual evidence for cross-verification.
Result: SHARP significantly outperforms state-of-the-art baselines on FB15K-237 and Wikidata5M-Ind, achieving accuracy gains of 4.2% and 12.9% respectively, while providing transparent, fact-based evidence chains for each judgment.
Conclusion: SHARP demonstrates strong interpretability and robustness for complex verification tasks by reformulating triple verification as a dynamic process that overcomes single-source bias through hybrid knowledge integration.
Abstract: Knowledge Graphs (KGs) serve as a critical foundation for AI systems, yet their automated construction inevitably introduces noise, compromising data trustworthiness. Existing triple verification methods, based on graph embeddings or language models, often suffer from single-source bias by relying on either internal structural constraints or external semantic evidence, and usually follow a static inference paradigm. As a result, they struggle with complex or long-tail facts and provide limited interpretability. To address these limitations, we propose SHARP (Schema-Hybrid Agent for Reliable Prediction), a training-free autonomous agent that reformulates triple verification as a dynamic process of strategic planning, active investigation, and evidential reasoning. Specifically, SHARP combines a Memory-Augmented Mechanism with Schema-Aware Strategic Planning to improve reasoning stability, and employs an enhanced ReAct loop with a Hybrid Knowledge Toolset to dynamically integrate internal KG structure and external textual evidence for cross-verification. Experiments on FB15K-237 and Wikidata5M-Ind show that SHARP significantly outperforms existing state-of-the-art baselines, achieving accuracy gains of 4.2% and 12.9%, respectively. Moreover, SHARP provides transparent, fact-based evidence chains for each judgment, demonstrating strong interpretability and robustness for complex verification tasks.
[513] Don’t Blink: Evidence Collapse during Multimodal Reasoning
Suresh Raghu, Satwik Pandey
Main category: cs.AI
TL;DR: Vision-language models lose visual grounding during reasoning, creating dangerous low-entropy but ungrounded predictions that text-only monitoring can’t detect. A task-aware multimodal monitoring approach with vision veto can reduce risks.
Details
Motivation: Reasoning VLMs can become more accurate while progressively losing visual grounding during thinking, creating task-conditional danger zones where confident predictions are actually ungrounded. Text-only monitoring cannot detect this failure mode, necessitating multimodal monitoring approaches.
Method: Evaluated three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro datasets. Analyzed evidence-collapse phenomenon where attention to annotated evidence regions drops during reasoning. Tested text-only uncertainty signals (full-response entropy) and vision features. Developed entropy-vision interaction model to identify task-conditional regimes and implemented targeted vision veto for risk reduction.
Result: Found pervasive evidence-collapse phenomenon with attention to evidence regions often losing over half of evidence mass during reasoning. Full-response entropy was most reliable text-only uncertainty signal but adding vision features was brittle. Entropy-vision interaction revealed task-conditional regime: low-entropy, visually disengaged predictions are hazardous on visual-reference tasks but benign on symbolic tasks. Targeted vision veto reduced selective risk by up to 1.9 percentage points at 90% coverage.
Conclusion: Vision-language models suffer from evidence collapse during reasoning, creating dangerous ungrounded predictions. Task-aware multimodal monitoring with vision veto can effectively reduce selective risks while avoiding degradations where visual disengagement is expected, supporting safe deployment under distribution shift.
Abstract: Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
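"Selective risk at 90% coverage" is the standard selective-prediction metric: the error rate on the 90% of examples where the model is most confident (the rest are abstained on). A minimal computation with illustrative data:

```python
def selective_risk(confidences, correct, coverage):
    """Error rate on the `coverage` fraction of most-confident predictions."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    n_keep = max(1, int(round(coverage * len(order))))
    kept = order[:n_keep]  # abstain on everything outside this set
    return sum(1 - correct[i] for i in kept) / n_keep

conf = [0.95, 0.90, 0.80, 0.60, 0.30]
corr = [1, 1, 1, 0, 0]
risk = selective_risk(conf, corr, coverage=0.6)
```

A vision veto in this framing would force abstention on confident-but-visually-disengaged predictions, moving them out of the kept set.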
[514] TimeSeek: Temporal Reliability of Agentic Forecasters
Hamza Mostafa, Om Shastri, Dennis Lee
Main category: cs.AI
TL;DR: TimeSeek benchmark evaluates LLM forecasting reliability over prediction market lifecycle, showing models perform best early in market life and on high-uncertainty markets, with web search generally helpful but not uniformly.
Details
Motivation: To study how the reliability of agentic LLM forecasters changes over time in prediction markets, moving beyond single snapshot evaluations to understand temporal dynamics of model performance.
Method: Evaluated 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, totaling 15,000 forecasts. Used Brier Skill Score (BSS) for evaluation and tested simple two-model ensembles.
Result: Models are most competitive early in market lifecycle and on high-uncertainty markets, but less competitive near resolution and on strong-consensus markets. Web search improves pooled BSS overall but hurts in 12% of model-checkpoint pairs. Ensembles reduce error but don’t surpass market overall.
Conclusion: Results motivate time-aware evaluation and selective-deference policies rather than single market snapshots or uniform tool-use settings, highlighting the importance of temporal dynamics in LLM forecasting.
Abstract: We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market’s lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, for 15,000 forecasts total. Models are most competitive early in a market’s life and on high-uncertainty markets, but much less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval is helpful on average but not uniformly so. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than a single market snapshot or a uniform tool-use setting.
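The Brier Skill Score compares a forecaster's Brier score (mean squared error of its probabilities against 0/1 outcomes) to a reference forecaster, here presumably the market price: BSS = 1 - Brier(model)/Brier(reference), so positive values beat the reference. A minimal computation with made-up forecasts:

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(model_probs, ref_probs, outcomes):
    """BSS = 1 - Brier(model) / Brier(reference); positive beats the reference."""
    return 1.0 - brier(model_probs, outcomes) / brier(ref_probs, outcomes)

outcomes = [1, 0, 1, 1]
market = [0.6, 0.4, 0.7, 0.5]   # reference forecasts (e.g. market prices)
model = [0.8, 0.2, 0.9, 0.7]    # a sharper, well-calibrated model
bss = brier_skill_score(model, market, outcomes)
```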
[515] Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems
Oluseyi Olukola, Nick Rahimi
Main category: cs.AI
TL;DR: Paper introduces pedagogical safety framework for educational RL systems, proposing four safety layers and Reward Hacking Severity Index to evaluate alignment between proxy rewards and genuine learning.
Details
Motivation: RL is increasingly used for personalization in intelligent tutoring systems, but lacks formal framework for defining and evaluating pedagogical safety to ensure AI agents genuinely support learning rather than just optimizing proxy metrics.
Method: Four-layer model of pedagogical safety (structural, progress, behavioral, alignment) with Reward Hacking Severity Index (RHSI). Evaluated in controlled simulation of AI tutoring environment with 120 sessions across four conditions and three learner profiles (18,000 interactions).
Result: Engagement-optimized agent systematically over-selected high-engagement actions with no direct mastery gain. Multi-objective reward formulation reduced but didn’t eliminate problem. Constrained architecture with prerequisite enforcement and minimum cognitive demand substantially reduced reward hacking (RHSI from 0.317 to 0.102). Behavioral safety was most influential safeguard.
Conclusion: Reward design alone may be insufficient for pedagogically aligned behavior in educational RL. Pedagogical safety is important research problem at intersection of AI safety and intelligent educational systems.
Abstract: Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety for educational RL comprising structural, progress, behavioral, and alignment safety and propose the Reward Hacking Severity Index (RHSI) to quantify misalignment between proxy rewards and genuine learning. We evaluate the framework in a controlled simulation of an AI tutoring environment with 120 sessions across four conditions and three learner profiles, totaling 18,000 interactions. Results show that an engagement-optimized agent systematically over-selected a high-engagement action with no direct mastery gain, producing strong measured performance but limited learning progress. A multi-objective reward formulation reduced this problem but did not eliminate it, as the agent continued to favor proxy-rewarding behavior in many states. In contrast, a constrained architecture combining prerequisite enforcement and minimum cognitive demand substantially reduced reward hacking, lowering RHSI from 0.317 in the unconstrained multi-objective condition to 0.102. Ablation results further suggest that behavioral safety was the most influential safeguard against repetitive low-value action selection. These findings suggest that reward design alone may be insufficient to ensure pedagogically aligned behavior in educational RL, at least in the simulated environment studied here. More broadly, the paper positions pedagogical safety as an important research problem at the intersection of AI safety and intelligent educational systems.
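The exact RHSI definition is not given in this summary. As a purely hypothetical sketch of the idea it quantifies, one could measure the share of proxy reward that is not matched by genuine mastery gain (function name and formula are assumptions, not the paper's):

```python
def hacking_severity(proxy_rewards, mastery_gains):
    """Hypothetical misalignment index: fraction of accumulated proxy reward
    not backed by learning progress (0 = aligned, 1 = pure reward hacking)."""
    proxy = sum(proxy_rewards)
    mastery = sum(mastery_gains)
    if proxy == 0:
        return 0.0
    return max(0.0, (proxy - mastery) / proxy)

# Engagement-heavy session: high proxy reward, little mastery gain.
severity = hacking_severity([1.0, 1.0, 1.0, 1.0], [0.2, 0.1, 0.0, 0.1])
```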
[516] Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, Joseph E. Gonzalez
Main category: cs.AI
TL;DR: Combee: A framework for scaling parallel prompt learning for self-improving agents using parallel scans and augmented shuffle mechanisms to enable efficient learning from many agentic traces without quality degradation.
Details
Motivation: Existing prompt learning methods (like ACE or GEPA) focus on single-agent or low-parallelism settings, limiting their ability to efficiently learn from large sets of collected agentic traces. The growing trend of learning from many agentic traces or parallel agent executions requires scalable solutions, but current methods suffer from quality degradation with high parallelism.
Method: Combee leverages parallel scans and employs an augmented shuffle mechanism to enable running many agents in parallel while learning from their aggregate traces. It also introduces a dynamic batch size controller to balance quality and delay.
Result: Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.
Conclusion: Combee provides an effective framework for scaling parallel prompt learning, addressing both efficiency and quality concerns in self-improving agent systems.
Abstract: Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.
[517] MC-CPO: Mastery-Conditioned Constrained Policy Optimization
Oluseyi Olukola, Nick Rahimi
Main category: cs.AI
TL;DR: MC-CPO algorithm integrates pedagogical constraints into RL tutoring systems to prevent reward hacking by dynamically restricting actions based on learner mastery and prerequisite structure.
Details
Motivation: Adaptive tutoring systems using RL may prioritize short-term engagement metrics over actual learning outcomes, creating incentives for reward hacking where systems optimize for superficial engagement rather than genuine mastery.
Method: Formalizes the problem as a constrained MDP with mastery-conditioned feasibility constraints, introduces MC-CPO (Mastery-Conditioned Constrained Policy Optimization) - a two-timescale primal-dual algorithm that integrates structural action masking with constrained policy optimization.
Result: MC-CPO satisfies constraint budgets within tolerance, reduces discounted safety costs relative to baselines, and substantially lowers the Reward Hacking Severity Index (RHSI) across tabular and neural tutoring environments.
Conclusion: Embedding pedagogical structure directly into the feasible action space provides a principled foundation for mitigating reward hacking in instructional reinforcement learning systems.
Abstract: Engagement-optimized adaptive tutoring systems may prioritize short-term behavioral signals over sustained learning outcomes, creating structural incentives for reward hacking in reinforcement learning policies. We formalize this challenge as a constrained Markov decision process (CMDP) with mastery-conditioned feasibility, in which pedagogical safety constraints dynamically restrict admissible actions according to learner mastery and prerequisite structure. We introduce Mastery-Conditioned Constrained Policy Optimization (MC-CPO), a two-timescale primal-dual algorithm that integrates structural action masking with constrained policy optimization. In the tabular regime, we establish feasibility preservation and convergence to stationary feasible points under standard stochastic approximation conditions and derive a safety gap result showing that optimization within the mastery-conditioned feasible set can strictly dominate post-hoc filtering under identical safety budgets. Empirical validation is conducted in minimal and extended tabular environments and in a neural tutoring setting. Across 10 random seeds and one million training steps in the neural regime, MC-CPO satisfies constraint budgets within tolerance, reduces discounted safety costs relative to unconstrained and reward-shaped baselines, and substantially lowers the Reward Hacking Severity Index (RHSI). These results indicate that embedding pedagogical structure directly into the feasible action space provides a principled foundation for mitigating reward hacking in instructional reinforcement learning systems.
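A generic tabular sketch of the two ingredients the abstract names, structural action masking plus a two-timescale primal-dual update (fast Q-learning, slow Lagrange-multiplier ascent on the safety cost). This is a textbook-style illustration under assumed names, not MC-CPO's actual update rules:

```python
import numpy as np

def masked_primal_dual_step(q_r, q_c, lam, state, feasible, env_step,
                            alpha=0.1, beta=0.01, budget=0.5):
    """One step of a masked primal-dual update (illustrative).

    q_r / q_c: reward and safety-cost Q-tables; lam: Lagrange multiplier.
    `feasible[state]` masks actions violating prerequisite structure.
    """
    # Act greedily on the Lagrangian value, restricted to feasible actions.
    lagrangian = q_r[state] - lam * q_c[state]
    allowed = np.where(feasible[state])[0]
    action = allowed[np.argmax(lagrangian[allowed])]
    reward, cost = env_step(state, action)
    # Fast timescale: Q-value updates.
    q_r[state, action] += alpha * (reward - q_r[state, action])
    q_c[state, action] += alpha * (cost - q_c[state, action])
    # Slow timescale: multiplier ascent when cost exceeds the budget.
    lam = max(0.0, lam + beta * (cost - budget))
    return action, lam

q_r = np.array([[5.0, 1.0]])          # unmasked greedy would pick action 0
q_c = np.zeros((1, 2))
feasible = np.array([[False, True]])  # prerequisite mask blocks action 0
a, lam = masked_primal_dual_step(q_r, q_c, 0.0, 0, feasible,
                                 lambda s, act: (1.0, 1.0))
```

The mask enforces feasibility before optimization rather than filtering afterward, which is the structural point the safety-gap result formalizes.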
[518] Context Engineering: A Practitioner Methodology for Structured Human-AI Collaboration
Elias Calboreanu
Main category: cs.AI
TL;DR: Context Engineering introduces a structured methodology for assembling complete informational context around AI prompts, showing improved output quality and reduced iteration cycles.
Details
Motivation: The paper challenges the conventional focus on prompting techniques, proposing that context completeness is more strongly associated with AI output quality than prompting style alone.
Method: Defines a five-role context package structure (Authority, Exemplar, Constraint, Rubric, Metadata) and applies a four-phase pipeline (Reviewer, Design, Builder, Auditor). Uses formal models from reliability engineering and information theory to analyze context quality.
Result: Incomplete context was associated with 72% of iteration cycles. Structured context assembly reduced average iteration cycles from 3.8 to 2.0 per task and improved first-pass acceptance from 32% to 55%. Final success rate reached 91.5% with iteration.
Conclusion: Context Engineering provides a systematic approach to improving AI output quality by focusing on complete context assembly rather than just prompting techniques, with empirical evidence showing significant improvements in efficiency and success rates.
Abstract: The quality of AI-generated output is often attributed to prompting technique, but extensive empirical observation suggests that context completeness may be more strongly associated with output quality. This paper introduces Context Engineering, a structured methodology for assembling, declaring, and sequencing the complete informational payload that accompanies a prompt to an AI tool. Context Engineering defines a five-role context package structure (Authority, Exemplar, Constraint, Rubric, Metadata), applies a staged four-phase pipeline (Reviewer to Design to Builder to Auditor), and applies formal models from reliability engineering and information theory as post hoc interpretive lenses on context quality. In an observational study of 200 documented interactions across four AI tools (Claude, ChatGPT, Cowork, Codex), incomplete context was associated with 72% of iteration cycles. Structured context assembly was associated with a reduction from 3.8 to 2.0 average iteration cycles per task and an improvement in first-pass acceptance from 32% to 55%. Among structured interactions, 110 of 200 were accepted on first pass compared with 16 of 50 baseline interactions; when iteration was permitted, the final success rate reached 91.5% (183 of 200). These results are observational and reflect a single-operator dataset without controlled comparison. Preliminary corroboration is provided by a companion production automation system with eleven operating lanes and 2,132 classified tickets.
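The five-role package lends itself to a mechanical completeness check before a prompt is sent; a toy sketch (the field contents are invented for illustration, not drawn from the paper):

```python
# The five context roles from the Context Engineering package structure.
ROLES = ("Authority", "Exemplar", "Constraint", "Rubric", "Metadata")

def missing_roles(package):
    """Return the context roles that are absent or empty in a package."""
    return [r for r in ROLES if not package.get(r)]

package = {
    "Authority": "Brand style guide v3",
    "Exemplar": "Approved sample email",
    "Constraint": "Max 150 words; no jargon",
    "Rubric": "Accepted if tone matches the style guide",
    # "Metadata" intentionally omitted to simulate incomplete context.
}
print(missing_roles(package))  # ['Metadata']
```

The paper's claim is essentially that gating submissions on an empty `missing_roles` list is what drives the drop in iteration cycles.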
[519] Beyond Fluency: Toward Reliable Trajectories in Agentic IR
Anushree Sinha, Srivaths Ranganathan, Debanshu Das, Abhishek Dharmaratnakar
Main category: cs.AI
TL;DR: Position paper analyzing failure modes in agentic information retrieval systems, proposing verification gates and systematic abstention to address compounding errors and deceptive fluency in multi-step workflows.
Details
Motivation: As Information Retrieval shifts from passive document ranking to autonomous agentic workflows with multi-step Reason-Act-Observe loops, minor early errors can cascade, causing functional misalignment between internal reasoning and external tool execution despite continued linguistic fluency. The paper aims to address safety concerns in deploying such systems.
Method: The paper synthesizes failure modes observed in industrial agentic systems, categorizing errors across planning, retrieval, reasoning, and execution. It proposes verification gates at each interaction unit and advocates for systematic abstention under calibrated uncertainty to address compounding errors and deceptive fluency.
Result: The paper provides a framework for analyzing failure modes in agentic IR systems and proposes practical solutions for improving reliability. It emphasizes the need to move beyond endpoint accuracy toward trajectory integrity and causal attribution for safe deployment.
Conclusion: Reliable Agentic IR systems must prioritize process correctness and grounded execution over plausible but unverified completion. Safe deployment requires systematic approaches to error detection and prevention throughout multi-step workflows.
Abstract: Information Retrieval is shifting from passive document ranking toward autonomous agentic workflows that operate in multi-step Reason-Act-Observe loops. In such long-horizon trajectories, minor early errors can cascade, leading to functional misalignment between internal reasoning and external tool execution despite continued linguistic fluency. This position paper synthesizes failure modes observed in industrial agentic systems, categorizing errors across planning, retrieval, reasoning, and execution. We argue that safe deployment requires moving beyond endpoint accuracy toward trajectory integrity and causal attribution. To address compounding error and deceptive fluency, we propose verification gates at each interaction unit and advocate systematic abstention under calibrated uncertainty. Reliable Agentic IR systems must prioritize process correctness and grounded execution over plausible but unverified completion.
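A per-step verification gate with systematic abstention can be sketched as a simple confidence threshold over the trajectory (the confidence values and the 0.7 threshold are illustrative assumptions, not the paper's calibration procedure):

```python
# Sketch of a verification gate: rather than completing a fluent but
# unverified trajectory, the agent abstains at the first step whose
# calibrated confidence falls below a threshold.
def gate(step_confidence, tau=0.7):
    """step_confidence: one calibrated score per Reason-Act-Observe unit.
    Returns ('proceed', None) or ('abstain', index_of_failing_step)."""
    for i, c in enumerate(step_confidence):
        if c < tau:
            return ("abstain", i)
    return ("proceed", None)

print(gate([0.9, 0.95, 0.4]))  # ('abstain', 2)
print(gate([0.9, 0.95, 0.8]))  # ('proceed', None)
```

The point of gating per interaction unit, rather than scoring only the endpoint, is that a low-confidence middle step is caught before its error compounds downstream.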
[520] InferenceEvolve: Towards Automated Causal Effect Estimators through Self-Evolving AI
Can Wang, Hongyu Zhao, Yiqun Chen
Main category: cs.AI
TL;DR: InferenceEvolve: An evolutionary framework using LLMs to discover and refine causal inference methods, outperforming human baselines in benchmarks.
Details
Motivation: Causal inference is crucial for scientific discovery but method selection is challenging due to complex methodology and real-world data. Inspired by AI's success in accelerating science, the authors aim to automate and optimize causal inference method discovery.
Method: InferenceEvolve uses large language models in an evolutionary framework to discover and iteratively refine causal inference methods. The framework evolves estimators through evolutionary algorithms guided by LLMs, with robust proxy objectives for settings lacking semi-synthetic outcomes.
Result: The evolved estimators consistently outperform established baselines across benchmarks. Against 58 human submissions in a community competition, the best evolved estimator achieved Pareto frontier performance across two evaluation metrics. The framework also showed competitive results with proxy objectives for partially observed outcomes.
Conclusion: Language-model-guided evolution can effectively optimize structured scientific programs like causal inference, even with partially observed outcomes. The evolutionary trajectories reveal that agents progressively discover sophisticated strategies tailored to hidden data-generating mechanisms.
Abstract: Causal inference is central to scientific discovery, yet choosing appropriate methods remains challenging because of the complexity of both statistical methodology and real-world data. Inspired by the success of artificial intelligence in accelerating scientific discovery, we introduce InferenceEvolve, an evolutionary framework that uses large language models to discover and iteratively refine causal methods. Across widely used benchmarks, InferenceEvolve yields estimators that consistently outperform established baselines: against 58 human submissions in a recent community competition, our best evolved estimator lay on the Pareto frontier across two evaluation metrics. We also developed robust proxy objectives for settings without semi-synthetic outcomes, with competitive results. Analysis of the evolutionary trajectories shows that agents progressively discover sophisticated strategies tailored to unrevealed data-generating mechanisms. These findings suggest that language-model-guided evolution can optimize structured scientific programs such as causal inference, even when outcomes are only partially observed.
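The evolve loop can be caricatured as hill-climbing with a proposal step; in the real system an LLM proposes estimator code, mocked here by a numeric mutation on a toy objective (all names and values below are illustrative, not the paper's estimators):

```python
import random

# Generic evolutionary loop: keep the incumbent unless a proposed child
# scores strictly better on the (proxy) objective.
def evolve(seed, score, propose, generations=30, rng=None):
    rng = rng or random.Random(0)
    best, best_score = seed, score(seed)
    for _ in range(generations):
        child = propose(best, rng)
        s = score(child)
        if s > best_score:
            best, best_score = child, s
    return best, best_score

# Toy objective standing in for a causal-effect benchmark score:
# distance of an "estimator" (a number) from a hidden target effect.
target = 2.5
score = lambda x: -abs(x - target)
propose = lambda x, rng: x + rng.uniform(-0.5, 0.5)  # mock LLM mutation

best, s = evolve(seed=0.0, score=score, propose=propose)
print(s >= score(0.0))  # True: the evolved estimator never scores worse
```

The interesting engineering in the paper lies in `score` (robust proxy objectives when semi-synthetic outcomes are unavailable) and `propose` (LLM-generated code edits), not in the loop itself.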
[521] Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts
Eren Unlu
Main category: cs.AI
TL;DR: Width expansion of small language models requires selecting optimal warm-start strategies; exact-copy symmetric warm starts perform best in most scenarios except deterministic long continuations where structured non-clone approaches win.
Details
Motivation: Width expansion offers a practical way to reuse smaller language model checkpoints, but selecting the best widened warm start strategy is not solved by zero-step preservation alone. The paper aims to systematically compare different warm-start strategies for dense width growth.
Method: The study treats dense width growth as a candidate-selection problem over full training states. It compares exact-copy, perturbative, asymmetric-reset, and structured non-clone warm starts under matched continuation budgets using TinyStories as a proxy. Evaluation includes zero-step preservation, short-lag probe metrics, and downstream continuation utility in both deterministic and stochastic regimes.
Result: Exact-copy symmetric warm starts ranked first in every completed 16-step probe and in completed stochastic 128-step continuations. However, the structured non-clone challenger won the deterministic 128-step continuation. Early escape from the inherited cloned subspace helps in long deterministic continuation but misleads at short lag and under stochastic continuation.
Conclusion: For dense width growth at this scale, preservation is not a universal ranking criterion, and the best replacement signal depends on both regime and lag budget. The results provide narrow but useful insights into warm-start strategy selection for model expansion.
Abstract: Width expansion offers a practical route to reuse smaller causal-language-model checkpoints, but selecting a widened warm start is not solved by zero-step preservation alone. We study dense width growth as a candidate-selection problem over full training states, including copied weights, optimizer moments, and scheduler state. In a small-scale TinyStories proxy, we compare exact-copy, perturbative, asymmetric-reset, and structured non-clone warm starts under matched continuation budgets. We evaluate zero-step preservation, short-lag probe metrics, and downstream continuation utility in deterministic and stochastic regimes. The picture is mixed and partially replicated through a reduced-pool seed-1 check. Exact-copy symmetric warm starts rank first in every completed 16-step probe and in the completed stochastic 128-step continuations at seed-0 steps 1000 and 2000 plus reduced seed-1 step 2000. By contrast, the structured non-clone challenger wins deterministic 128-step continuation. Early escape from the inherited cloned subspace is therefore not a universal selector: it helps in long deterministic continuation, but it misleads at short lag and under stochastic continuation. The result is narrow but useful: for dense width growth at this scale, preservation is not a universal ranking criterion, and the best replacement signal depends on both regime and lag budget.
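For intuition, an "exact-copy" width expansion can be made function-preserving in the Net2WiderNet style: duplicate a hidden unit and split its outgoing weight so the widened network computes the same function. A minimal sketch under that assumption (illustrative only; the paper's procedure also copies optimizer moments and scheduler state):

```python
# One hidden layer (linear activation for simplicity) with a scalar output.
def forward(x, w_in, w_out):
    hidden = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_in]
    return sum(h * wo for h, wo in zip(hidden, w_out))

def widen(w_in, w_out, unit):
    """Duplicate hidden `unit` and halve its outgoing weight on both
    copies; since the copies have identical activations, the output
    is preserved exactly at step zero."""
    w_in2 = w_in + [list(w_in[unit])]          # copy incoming weights
    w_out2 = list(w_out) + [w_out[unit] / 2]   # new copy gets half
    w_out2[unit] /= 2                          # original keeps half
    return w_in2, w_out2

x = [1.0, 2.0]
w_in = [[0.5, -1.0], [2.0, 0.25]]
w_out = [1.0, 3.0]
w_in2, w_out2 = widen(w_in, w_out, unit=1)
print(forward(x, w_in, w_out), forward(x, w_in2, w_out2))  # 6.0 6.0
```

The paper's finding is that such zero-step preservation, while easy to verify, does not by itself predict which warm start wins after continued training.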
[522] PanLUNA: An Efficient and Robust Query-Unified Multimodal Model for Edge Biosignal Intelligence
Marija Zelic, Anna Tegon, Yawei Li, Thorir Mar Ingolfsson, Luca Benini
Main category: cs.AI
TL;DR: PanLUNA is a compact 5.4M-parameter pan-modal foundation model that jointly processes EEG, ECG, and PPG biosignals using a shared encoder with channel-unification and sensor-type embeddings, achieving strong performance despite small size and enabling efficient deployment on low-power hardware.
Details
Motivation: Current physiological foundation models are limited to single modalities (EEG, ECG, or PPG) due to scarce paired multimodal datasets. There's a need for models that can handle multiple biosignal modalities efficiently while being deployable on low-power wearable devices.
Method: Extends LUNA’s channel-unification module to treat multimodal channels as entries in a unified query set augmented with sensor-type embeddings. Uses cross-modal early fusion with inherent robustness to missing modalities. Employs quantization-aware training with INT8 weights for efficient deployment.
Result: Matches or exceeds models up to 57× larger: 81.21% balanced accuracy on TUAB abnormal EEG detection and state-of-the-art 0.7416 balanced accuracy on HMC multimodal sleep staging. Quantization recovers ≥96% of full-precision performance. Efficient deployment on GAP9 microcontroller with 325.6 ms latency and 18.8 mJ per 10-second ECG inference.
Conclusion: PanLUNA demonstrates that compact multimodal foundation models can achieve strong performance across physiological modalities while being efficient enough for real-world wearable deployment, addressing the scarcity of paired multimodal datasets through its robust architecture.
Abstract: Physiological foundation models (FMs) have shown promise for biosignal representation learning, yet most remain confined to a single modality such as EEG, ECG, or PPG, largely because paired multimodal datasets are scarce. In this paper, we present PanLUNA, a compact 5.4M-parameter pan-modal FM that jointly processes EEG, ECG, and PPG within a single shared encoder. Extending LUNA’s channel-unification module, PanLUNA treats multimodal channels as entries in a unified query set augmented with sensor-type embeddings, enabling efficient cross-modal early fusion while remaining inherently robust to missing modalities at inference time. Despite its small footprint, PanLUNA matches or exceeds models up to 57$\times$ larger: 81.21% balanced accuracy on TUAB abnormal EEG detection and state-of-the-art 0.7416 balanced accuracy on HMC multimodal sleep staging. Quantization-aware training with INT8 weights recovers $\geq$96% of full-precision performance, and deployment on the GAP9 ultra-low-power RISC-V microcontroller for wearables achieves 325.6 ms latency and 18.8 mJ per 10-second, 12-lead ECG inference, and 1.206 s latency at 68.65 mJ for multimodal 5-channel sleep staging over 30-second epochs.
[523] RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers
Vineet Bhat, Shiqing Wei, Ali Umut Kaypak, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Main category: cs.AI
TL;DR: RESCORE is an LLM agentic framework that automates reconstruction of numerical simulations from control systems research papers by generating executable code through iterative analysis, coding, and verification with visual feedback.
Details
Motivation: Reconstructing simulations from control systems papers is challenging due to underspecified parameters and ambiguous implementation details, requiring significant manual effort for verification and replication of published methodologies.
Method: Three-component LLM agentic framework (Analyzer, Coder, Verifier) with iterative execution feedback and visual comparison. Uses a benchmark of 500 papers from IEEE CDC and improves reconstruction through multiple refinement cycles.
Result: Successfully recovers task-coherent simulations for 40.7% of benchmark instances, outperforming single-pass generation. Achieves estimated 10X speedup over manual human replication.
Conclusion: RESCORE demonstrates effective automated research replication for control systems, significantly reducing time and effort for verification while enabling community progress through released benchmark and agents.
Abstract: Reconstructing numerical simulations from control systems research papers is often hindered by underspecified parameters and ambiguous implementation details. We define the task of Paper-to-Simulation Recoverability, the ability of an automated system to generate executable code that faithfully reproduces a paper’s results. We curate a benchmark of 500 papers from the IEEE Conference on Decision and Control (CDC) and propose RESCORE, a three-component LLM agentic framework: Analyzer, Coder, and Verifier. RESCORE uses iterative execution feedback and visual comparison to improve reconstruction fidelity. Our method successfully recovers task-coherent simulations for 40.7% of benchmark instances, outperforming single-pass generation. Notably, the RESCORE automated pipeline achieves an estimated 10X speedup over manual human replication, drastically cutting the time and effort required to verify published control methodologies. We will release our benchmark and agents to foster community progress in automated research replication.
[524] Thermodynamic-Inspired Explainable GeoAI: Uncovering Regime-Dependent Mechanisms in Heterogeneous Spatial Systems
Sooyoung Lim, Zhenlong Li, Zi-Kui Liu
Main category: cs.AI
TL;DR: Thermodynamics-inspired explainable GeoAI framework using statistical mechanics and graph neural networks to model spatial heterogeneity and identify regime-dependent role reversals of predictors in complex spatial systems.
Details
Motivation: To address the fundamental challenge of modeling spatial heterogeneity and critical transitions in geography/environmental science, where conventional methods like GWR and deep learning fail to elucidate state-dependent nonlinearities where drivers have opposing effects across heterogeneous domains.
Method: Integrates statistical mechanics with graph neural networks, conceptualizing spatial variability as thermodynamic competition between system Burden (E) and Capacity (S) to disentangle latent mechanisms driving spatial processes.
Result: Successfully identifies regime-dependent role reversals of predictors that standard baselines miss across three simulation and three real-world datasets (housing markets, mental health prevalence, wildfire-induced PM2.5 anomalies). Explicitly diagnoses phase transition into Burden-dominated regime during 2023 Canadian wildfire event.
Conclusion: Thermodynamic constraints can improve interpretability of GeoAI while preserving strong predictive performance in complex spatial systems, demonstrating ability to distinguish physical mechanism shifts from statistical outliers.
Abstract: Modeling spatial heterogeneity and associated critical transitions remains a fundamental challenge in geography and environmental science. While conventional Geographically Weighted Regression (GWR) and deep learning models have improved predictive skill, they often fail to elucidate state-dependent nonlinearities where the functional roles of drivers represent opposing effects across heterogeneous domains. We introduce a thermodynamics-inspired explainable geospatial AI framework that integrates statistical mechanics with graph neural networks. By conceptualizing spatial variability as a thermodynamic competition between system Burden (E) and Capacity (S), our model disentangles the latent mechanisms driving spatial processes. Using three simulation datasets and three real-world datasets across distinct domains (housing markets, mental health prevalence, and wildfire-induced PM2.5 anomalies), we show that the new framework successfully identifies regime-dependent role reversals of predictors that standard baselines miss. Notably, the framework explicitly diagnoses the phase transition into a Burden-dominated regime during the 2023 Canadian wildfire event, distinguishing physical mechanism shifts from statistical outliers. These findings demonstrate that thermodynamic constraints can improve the interpretability of GeoAI while preserving strong predictive performance in complex spatial systems.
[525] Implementing surrogate goals for safer bargaining in LLM-based agents
Caspar Oesterheld, Maxime Riché, Filip Sondej, Jesse Clifton, Vincent Conitzer
Main category: cs.AI
TL;DR: Implementing surrogate goals in language-model-based agents to deflect threats away from principal’s true interests, with four methods tested: prompting, fine-tuning, and scaffolding approaches.
Details
Motivation: To reduce risks from bargaining failures by implementing surrogate goals in AI agents, where agents care equally about preventing surrogate threats (like money burning) as they do about direct threats to the principal's interests.
Method: Four different methods using prompting, fine-tuning, and scaffolding techniques to get language-model-based agents to react to surrogate threats (money burning) the same way they react to normal threats.
Result: Scaffolding and fine-tuning methods outperform simple prompting, with scaffolding-based methods performing best overall. Fine-tuning and scaffolding more precisely implement desired behavior regarding surrogate goal threats while minimizing side effects on other capabilities.
Conclusion: Scaffolding-based methods are most effective for implementing surrogate goals in language-model agents, providing better threat deflection while maintaining other capabilities compared to prompting or fine-tuning approaches.
Abstract: Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one’s agent care about preventing money from being burned. Then in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care equally about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to “normal” threats. We propose four different methods, using techniques of prompting, fine-tuning, and scaffolding. We evaluate the four methods experimentally. We find that methods based on scaffolding and fine-tuning outperform simple prompting. In particular, fine-tuning and scaffolding more precisely implement the desired behavior w.r.t. threats against the surrogate goal. We also compare the different methods in terms of their side effects on capabilities and propensities in other situations. We find that scaffolding-based methods perform best.
[526] Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning
Chao Li, Yuru Wang, Chunyu Zhao
Main category: cs.AI
TL;DR: A computation-substrate-agnostic inference architecture with domain as first-class parameter, enabling domain-scoped pruning, substrate-independent execution, and transparent inference chains across symbolic, neural, vector, and hybrid substrates.
Details
Motivation: To create a unified inference architecture that can operate across different computational substrates (symbolic, neural, vector, hybrid) while treating domain as an explicit computational parameter, enabling more efficient and transparent reasoning systems.
Method: Five-layer architecture with three domain computation modes (chain indexing, path traversal as Kleisli composition, vector-guided computation as substrate transition), substrate-agnostic interface with Query/Extend/Bridge operations, reliability conditions C1-C4, and validation through PHQ-9 clinical reasoning case study.
Result: Domain-scoped pruning reduces per-query search space from O(N) to O(N/K), enables substrate-independent execution, and provides transparent inference chains where every step carries its evaluative context. Formal computational theory with operational semantics, complexity bounds, monad structure, and boundary conditions.
Conclusion: The paper presents an architectural contribution for computation-substrate-agnostic inference that treats domain as a first-class parameter, enabling more efficient and transparent reasoning across diverse computational substrates through formal computational theory and practical validation.
Abstract: We establish a computation-substrate-agnostic inference architecture in which domain is an explicit first-class computational parameter. This produces domain-scoped pruning that reduces per-query search space from O(N) to O(N/K), substrate-independent execution over symbolic, neural, vector, and hybrid substrates, and transparent inference chains where every step carries its evaluative context. The contribution is architectural, not logical. We formalize the computational theory across five dimensions: a five-layer architecture; three domain computation modes including chain indexing, path traversal as Kleisli composition, and vector-guided computation as a substrate transition; a substrate-agnostic interface with three operations Query, Extend, Bridge; reliability conditions C1 to C4 with three failure mode classes; and validation through a PHQ-9 clinical reasoning case study. The computational theory including operational semantics, complexity bounds, monad structure, substrate transitions, and boundary conditions is the contribution of this paper.
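The O(N) to O(N/K) pruning claim amounts to domain-scoped indexing: partition entries by domain so a query scans only its domain's bucket rather than the whole store. A toy sketch with invented fact names (this is an illustration of the complexity argument, not the paper's architecture):

```python
from collections import defaultdict

# Facts indexed by domain: with N facts spread over K domains, a
# domain-scoped query scans roughly N/K entries instead of all N.
class DomainIndex:
    def __init__(self):
        self.by_domain = defaultdict(list)

    def add(self, domain, fact):
        self.by_domain[domain].append(fact)

    def query(self, domain, pred):
        # Only this domain's bucket is scanned.
        return [f for f in self.by_domain[domain] if pred(f)]

idx = DomainIndex()
idx.add("clinical", "phq9_item_1")
idx.add("clinical", "phq9_item_2")
idx.add("finance", "roi_rule")
print(idx.query("clinical", lambda f: f.startswith("phq9")))
```

Because every query carries its domain, each returned fact also carries the evaluative context in which it was retrieved, which is the transparency property the paper emphasizes.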
[527] RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets
Andrew Borthwick, Stephen Ash, Anthony Galczak
Main category: cs.AI
TL;DR: RoboPhD introduces validation-free evolution using Elo tournament selection for LLM-guided agent evolution, outperforming Pareto-based and greedy methods on most benchmarks under fixed evaluation budgets.
Details
Motivation: As LLM-guided evolution of agentic artifacts accelerates, there's a need to systematically compare optimization algorithms (Elo tournament, Pareto-based, greedy hill-climbing) under fixed evaluation budgets, especially when evaluations are expensive.
Method: RoboPhD uses Elo tournament selection on training data without validation splits, enabling simultaneous evaluation and evolution. All systems start with seed agents containing diagnostic print() statements that can evolve into self-instrumenting agents with informative diagnostics.
Result: RoboPhD outperforms GEPA and Autoresearch on 3 of 4 benchmarks (abstract reasoning, cloud scheduling, SQL generation, financial QA) under a 1,500-evaluation budget. On ARC-AGI, it evolves a 22-line seed into a 1,013-line multi-strategy system, improving accuracy from 27.8% to 65.8%.
Conclusion: Elo tournament selection (RoboPhD) is an effective validation-free evolution approach that outperforms other optimization paradigms for LLM-guided agent evolution under constrained evaluation budgets.
Abstract: 2026 has brought an explosion of interest in LLM-guided evolution of agentic artifacts, with systems like GEPA and Autoresearch demonstrating that LLMs can iteratively improve prompts, code, and agent architectures across diverse domains. As adoption accelerates, a central question emerges: given the same information, the same seed agent, and the same objective, which optimization algorithm yields the best results under the same evaluation budget? This question becomes critical when evaluations are expensive, such as when they require human judgment or multiple LLM calls. We present the first systematic comparison of three optimization paradigms – Elo tournament selection (RoboPhD), Pareto-based selection (GEPA), and greedy hill-climbing (Autoresearch) – across four benchmarks spanning abstract reasoning, cloud scheduling, SQL generation, and financial QA, all under a fixed budget of 1,500 evaluations. RoboPhD introduces validation-free evolution: instead of splitting the budget between training and validation, it uses Elo competition on training data to simultaneously evaluate agents and drive evolution. All three systems receive seed agents with diagnostic print() statements that evolution can grow, enabling self-instrumenting agents that develop increasingly informative diagnostics for the benefit of their evolutionary successors. Using a single default configuration, RoboPhD outperforms both GEPA and Autoresearch on three of four benchmarks, losing only on the simplest task, where the winning solution (from our Autoresearch adaptation) required under 90 lines of code. On ARC-AGI, RoboPhD evolves a 22-line seed agent into a 1,013-line multi-strategy system, improving accuracy from 27.8% to 65.8% using Gemini 3.1 Flash Lite as the solver. We release RoboPhD as a versatile toolkit under the MIT license with a simple optimize_anything() API for evolving diverse complex agents.
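The rating mechanism behind Elo tournament selection is the standard chess formula; a minimal sketch (the K-factor of 32 and the pairing are illustrative defaults, not the paper's configuration):

```python
# Standard Elo update: the winner of a pairwise comparison between two
# candidate agents gains rating from the loser, scaled by how surprising
# the result was given their current ratings.
def elo_update(r_a, r_b, a_won, k=32):
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

ra, rb = elo_update(1200.0, 1200.0, a_won=True)
print(round(ra), round(rb))  # 1216 1184
```

Because each comparison updates ratings directly, every evaluation both scores the agents and steers selection, which is what lets RoboPhD skip a separate validation split.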
[528] REAM: Merging Improves Pruning of Experts in LLMs
Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev
Main category: cs.AI
TL;DR: REAM (Router-weighted Expert Activation Merging) is a novel method for compressing Mixture-of-Experts LLMs by grouping and merging expert weights instead of pruning them, better preserving performance across multiple-choice and generative benchmarks.
Details
Motivation: Large MoE LLMs with hundreds of billions of parameters pose significant memory challenges for deployment. Traditional approaches like weight pruning and quantization reduce memory but may degrade performance. The authors aim to develop a compression method that better preserves the original model's capabilities.
Method: REAM (Router-weighted Expert Activation Merging) groups experts and merges their weights, unlike REAP which prunes experts. The method examines the Pareto frontier of performance trade-offs between multiple-choice (MC) and generative (GEN) tasks by controlling the mix of calibration data (general, math, coding).
Result: REAM often outperforms baselines like REAP and in many cases is comparable to original uncompressed models. The results reveal a trade-off between MC and GEN performance that depends on the calibration data mix.
Conclusion: REAM provides an effective alternative to expert pruning for compressing MoE LLMs, better preserving performance through expert merging rather than removal, with performance often matching uncompressed models.
Abstract: Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router-weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
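Router-weighted merging can be sketched as a weighted average of expert weight vectors, with weights taken from router activation statistics on calibration data; a hedged sketch under that assumption, not the paper's implementation:

```python
# Merge a group of experts into one, weighting each expert's parameters
# by a router score (e.g., its mean routing weight on calibration data),
# so frequently-activated experts dominate the merged weights.
def merge_experts(expert_weights, router_scores):
    """expert_weights: list of equal-length weight vectors;
    router_scores: one scalar per expert."""
    total = sum(router_scores)
    dim = len(expert_weights[0])
    return [
        sum(s * w[i] for s, w in zip(router_scores, expert_weights)) / total
        for i in range(dim)
    ]

# Expert 0 is routed to 3x as often as expert 1, so it dominates.
merged = merge_experts([[1.0, 0.0], [0.0, 1.0]], router_scores=[3.0, 1.0])
print(merged)  # [0.75, 0.25]
```

The contrast with pruning is visible here: a pruned group would keep only expert 0's weights and discard expert 1 entirely, whereas merging retains a router-weighted trace of both.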
[529] Decocted Experience Improves Test-Time Inference in LLM Agents
Maohao Shen, Kaiwen Zha, Zexue He, Zhang-Wei Hong, Siru Ouyang, J. Jon Ryu, Prasanna Sattigeri, Suhas Diggavi, Gregory Wornell
Main category: cs.AI
TL;DR: Improving LLMs through context scaling using decocted experience rather than just test-time compute scaling
Details
Motivation: Test-time scaling increases inference costs and can lead to inefficient exploration; context scaling offers a complementary approach to improve LLM performance without parameter updates.
Method: Systematic study of experience-augmented agents, focusing on deriving context from experience, analyzing performance scaling with accumulated experience, characterizing good context, and evaluating data structures for context construction
Result: Identifies decocted experience as key mechanism: extracting essence from experience, organizing coherently, and retrieving salient information to build effective context
Conclusion: Context scaling through decocted experience provides effective alternative to test-time compute scaling for improving LLM performance in reasoning and agentic tasks
Abstract: There is growing interest in improving LLMs without updating model parameters. One well-established direction is test-time scaling, where increased inference-time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test-time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore \emph{context} as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through \emph{experience}. We show that effective context construction critically depends on \emph{decocted experience}. We present a detailed analysis of experience-augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify \emph{decocted experience} as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.
[530] Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
Jiayu Fu, Mourad Heddaya, Chenhao Tan
Main category: cs.AI
TL;DR: A novel pipeline for generating math benchmarks that uses AI-generated hypotheses to identify LLM weaknesses and creates targeted problems to test those specific deficiencies.
Details
Motivation: Existing math benchmarks require extensive manual effort, don't scale well, and can't keep pace with LLM development. Current automatic generation methods fail to identify specific math concepts and skills where LLMs struggle, and most are limited to specific categories.
Method: Proposes a benchmark generation pipeline that: 1) Uses AI-generated hypotheses to identify specific math concepts and skills LLMs are error-prone on, 2) Generates new benchmark problems targeting these identified weaknesses, 3) Creates adaptable problems that can be applied across domains.
Result: Hypothesis accuracy positively correlates with problem difficulty - problems from most accurate hypotheses reduce Llama-3.3-70B-Instruct’s accuracy to 45% (vs 77% on original MATH benchmark). The pipeline is highly adaptable beyond math to explore various LLM capabilities.
Conclusion: The proposed pipeline effectively identifies LLM weaknesses and generates targeted benchmarks, providing a scalable tool for investigating LLM performance across different domains beyond just mathematics.
Abstract: Numerous math benchmarks exist to evaluate LLMs’ mathematical capabilities. However, most involve extensive manual effort and are difficult to scale. Consequently, they cannot keep pace with LLM development or easily provide new instances to mitigate overfitting. Some researchers have proposed automatic benchmark generation methods, but few focus on identifying the specific math concepts and skills on which LLMs are error-prone, and most can only generate category-specific benchmarks. To address these limitations, we propose a new math benchmark generation pipeline that uses AI-generated hypotheses to identify the specific math concepts and skills that LLMs struggle with, and then generates new benchmark problems targeting these weaknesses. Experiments show that hypothesis accuracy positively correlates with the difficulty of the generated problems: problems generated from the most accurate hypotheses reduce Llama-3.3-70B-Instruct’s accuracy to as low as 45%, compared to 77% on the original MATH benchmark. Furthermore, our pipeline is highly adaptable and can be applied beyond math to explore a wide range of LLM capabilities, making it a valuable tool for investigating how LLMs perform across different domains.
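The paper's headline finding, that hypothesis accuracy correlates positively with generated-problem difficulty, boils down to a correlation check between two per-hypothesis measurements. A minimal sketch of that check; the hypothesis scores and model accuracies below are made-up illustrative values, not the paper's data:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pipeline log: for each error hypothesis, how often it was
# verified (accuracy) and the target model's accuracy on the problems
# generated from it (lower accuracy = harder problems).
hypothesis_accuracy = [0.95, 0.80, 0.65, 0.50, 0.30]
model_accuracy      = [0.45, 0.55, 0.62, 0.70, 0.76]

# A strong negative correlation between hypothesis accuracy and the target
# model's score means better hypotheses yield harder problems.
r = pearson(hypothesis_accuracy, model_accuracy)
```

Note that "positively correlates with difficulty" corresponds to a *negative* correlation with the target model's accuracy, since difficulty is measured by how far accuracy drops.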
[531] Gradual Cognitive Externalization: A Framework for Understanding How Ambient Intelligence Externalizes Human Cognition
Zhimin Zhao
Main category: cs.AI
TL;DR: A framework proposing Gradual Cognitive Externalization (GCE) where human cognitive functions migrate into digital substrates through ambient intelligence co-adaptation, with evidence from AI agent skills and tools.
Details
Motivation: To explain why developers are creating AI agent skills that replicate human communication styles, mentoring heuristics, and behavioral repertoires, and to understand the migration of cognitive functions into digital substrates.
Method: Proposes the Gradual Cognitive Externalization (GCE) framework based on the behavioral manifold hypothesis, formalizes three criteria for cognitive integration, derives five testable predictions with theory-constrained thresholds, and provides an experimental protocol.
Result: Documents evidence from scheduling assistants, writing tools, recommendation engines, and agent skill ecosystems showing preconditions for externalization are already observable, and shifts the question from whether minds can be uploaded to how fast cognitive functions are migrating.
Conclusion: Human cognitive functions are gradually migrating into digital substrates through ambient intelligence co-adaptation, with the behavioral manifold hypothesis explaining how everyday cognition can be learned and replicated by AI systems.
Abstract: Developers are publishing AI agent skills that replicate a colleague’s communication style, encode a supervisor’s mentoring heuristics, or preserve a person’s behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that human cognitive functions are migrating into digital substrates through ambient intelligence co-adaptation rather than mind uploading. GCE rests on the behavioral manifold hypothesis: everyday cognition occupies a low-dimensional manifold that is structured, redundant, and learnable from sustained observation. We document evidence from scheduling assistants, writing tools, recommendation engines, and agent skill ecosystems showing that the preconditions for externalization are already observable. We formalize three criteria separating cognitive integration from tool use (bidirectional adaptation, functional equivalence, causal coupling), derive five testable predictions with theory-constrained thresholds, and provide a concrete experimental protocol. The question is no longer whether minds can be uploaded, but how fast cognitive functions are already migrating into digital substrates and what follows.
[532] GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui
Main category: cs.AI
TL;DR: GUIDE is a framework for evaluating GUI agents that decomposes long trajectories into subtasks for more accurate and interpretable assessment, outperforming existing evaluators on multiple benchmarks.
Details
Motivation: Existing GUI agent evaluation methods use single holistic judgments over entire action-observation sequences, which are unreliable for long-horizon tasks and provide binary verdicts without insight into where or why agents fail, limiting diagnostic utility for agent development.
Method: GUIDE decomposes trajectory assessment into three stages: 1) Trajectory Segmentation partitions full traces into semantically coherent subtask units, 2) Subtask Diagnosis evaluates each unit in context with completion verdicts and structured error analysis, and 3) Overall Summary aggregates per-subtask diagnoses into task-level judgment.
Result: GUIDE substantially outperforms existing evaluators across three benchmarks (industrial e-commerce dataset with 932 trajectories, AGENTREWARDBENCH with 1302 trajectories, and AndroidBench), achieving up to 5.35 percentage points higher accuracy than the strongest baseline while producing structured diagnostic reports.
Conclusion: GUIDE provides a more accurate and interpretable evaluation framework for GUI agents by operating on bounded subtask segments rather than full trajectories, mitigating context overload and offering diagnostic insights that directly inform agent improvement.
Abstract: Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence, a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks. Trajectory Segmentation partitions the full trace into semantically coherent subtask units. Subtask Diagnosis evaluates each unit in context, assigning a completion verdict and generating a structured error analysis with corrective recommendations. Overall Summary aggregates per-subtask diagnoses into a task-level judgment. By operating on bounded subtask segments rather than full trajectories, GUIDE mitigates the context overload that degrades existing evaluators as task complexity grows. We validate GUIDE on three benchmarks: an industrial e-commerce dataset of 932 trajectories, AGENTREWARDBENCH spanning five web agent tasks with 1302 trajectories, and AndroidBench for mobile device control. Across all settings, GUIDE substantially outperforms existing evaluators, achieving up to 5.35 percentage points higher accuracy than the strongest baseline, while producing structured diagnostic reports that directly inform agent improvement.
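The final aggregation stage can be mimicked with plain data structures. A minimal sketch of the Overall Summary step only; the `Subtask` record and the report format are illustrative, not GUIDE's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    completed: bool
    error_analysis: str = ""  # structured diagnosis for failed units

def overall_summary(subtasks):
    """Aggregate per-subtask verdicts into a task-level judgment plus report."""
    failed = [s for s in subtasks if not s.completed]
    verdict = "success" if not failed else "failure"
    report = [f"{s.name}: {s.error_analysis}" for s in failed]
    return verdict, report

# A toy segmented trajectory: one good step, one error, one cascade failure.
trace = [
    Subtask("open search page", True),
    Subtask("apply price filter", False, "clicked sort menu instead of filter"),
    Subtask("add item to cart", False, "never reached; blocked by prior step"),
]
verdict, report = overall_summary(trace)
```

The point of the decomposition is visible even in this toy: instead of a single binary verdict, the evaluator can say *which* subtask broke and why.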
[533] MolDA: Molecular Understanding and Generation via Large Language Diffusion Model
Seohyeon Shin, HanJun Choi, Jun-Hyung Park, Hongkook Kim, Mansu Kim
Main category: cs.AI
TL;DR: MolDA is a novel multimodal molecular framework that replaces autoregressive LLMs with masked diffusion models for better chemical validity and global structural coherence in molecule generation and understanding.
Details
Motivation: Autoregressive LLMs have limitations for molecular discovery due to their strict left-to-right inductive bias, which struggles with non-local global constraints (like ring closures) and accumulates structural errors during sequential generation.
Method: Proposes MolDA with discrete Large Language Diffusion Model, hybrid graph encoder for structural representations, Q-Former for alignment to language token space, and mathematically reformulated Molecular Structure Preference Optimization for masked diffusion.
Result: MolDA ensures global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction through bidirectional iterative denoising.
Conclusion: Replacing autoregressive backbones with masked diffusion models addresses fundamental limitations in multimodal molecular architectures, enabling better handling of chemical constraints and global structural coherence.
Abstract: Large Language Models (LLMs) have significantly advanced molecular discovery, but existing multimodal molecular architectures fundamentally rely on autoregressive (AR) backbones. This strict left-to-right inductive bias is sub-optimal for generating chemically valid molecules, as it struggles to account for non-local global constraints (e.g., ring closures) and often accumulates structural errors during sequential generation. To address these limitations, we propose MolDA (Molecular language model with masked Diffusion with mAsking), a novel multimodal framework that replaces the conventional AR backbone with a discrete Large Language Diffusion Model. MolDA extracts comprehensive structural representations using a hybrid graph encoder, which captures both local and global topologies, and aligns them into the language token space via a Q-Former. Furthermore, we mathematically reformulate Molecular Structure Preference Optimization specifically for the masked diffusion. Through bidirectional iterative denoising, MolDA ensures global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction.
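The bidirectional iterative denoising described above can be illustrated with a toy unmasking loop: start fully masked, repeatedly predict every position, and reveal only the most confident positions so later rounds can use the revealed context. The confidence-based scheduler and the dummy predictor below are illustrative stand-ins, not MolDA's actual model:

```python
import random

MASK = "<m>"
VOCAB = ["C", "N", "O", "(", ")", "="]

def dummy_predict(seq):
    """Stand-in denoiser: per-position (token, confidence) pairs.

    A real masked diffusion model conditions on the *entire* sequence
    bidirectionally, which is what lets it respect global constraints
    such as ring closures.
    """
    rng = random.Random(0)
    return [(rng.choice(VOCAB), rng.random()) if tok == MASK else (tok, 1.0)
            for tok in seq]

def iterative_denoise(length, steps=4):
    seq = [MASK] * length
    for step in range(steps):
        preds = dummy_predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Reveal a fraction of the most confident masked positions each step.
        reveal = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda i: -preds[i][1])[:reveal]:
            seq[i] = preds[i][0]
    # Fill any stragglers from the final prediction pass.
    preds = dummy_predict(seq)
    return [preds[i][0] if tok == MASK else tok for i, tok in enumerate(seq)]
```

Contrast with left-to-right decoding: here every position can be revised in the light of the whole (partially revealed) sequence, rather than committing token by token.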
[534] ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems
Zhuowen Yuan, Zhaorun Chen, Zhen Xiang, Nathaniel D. Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, Bo Li
Main category: cs.AI
TL;DR: SC-Inject-Bench is a benchmark for evaluating supply-chain threats in LLM agents using malicious MCP tools, and ShieldNet is a network-level guardrail framework that detects these threats by monitoring real network interactions.
Details
Motivation: Existing LLM agent security research focuses on prompt injection and unsafe I/O behaviors, but overlooks supply-chain threats from malicious third-party tools and MCP servers that can hijack agents, leak data, or trigger unauthorized actions. There's no comprehensive benchmark for evaluating such threats.
Method: 1) Created SC-Inject-Bench with 10,000+ malicious MCP tools based on taxonomy of 25+ attack types from MITRE ATT&CK; 2) Proposed ShieldNet framework with MITM proxy and event extractor to monitor network interactions, plus lightweight classifier for attack detection.
Result: Existing MCP scanners and semantic guardrails perform poorly on SC-Inject-Bench. ShieldNet achieves strong detection (up to 0.995 F-1 with only 0.8% false positives) with minimal runtime overhead, significantly outperforming existing solutions.
Conclusion: Supply-chain threats in LLM agents are significant and understudied. ShieldNet provides effective network-level protection against such threats by monitoring real interactions rather than surface-level tool traces.
Abstract: Existing research on LLM agent security mainly focuses on prompt injection and unsafe input/output behaviors. However, as agents increasingly rely on third-party tools and MCP servers, a new class of supply-chain threats has emerged, where malicious behaviors are embedded in seemingly benign tools, silently hijacking agent execution, leaking sensitive data, or triggering unauthorized actions. Despite their growing impact, there is currently no comprehensive benchmark for evaluating such threats. To bridge this gap, we introduce SC-Inject-Bench, a large-scale benchmark comprising over 10,000 malicious MCP tools grounded in a taxonomy of 25+ attack types derived from MITRE ATT&CK targeting supply-chain threats. We observe that existing MCP scanners and semantic guardrails perform poorly on this benchmark. Motivated by this finding, we propose ShieldNet, a network-level guardrail framework that detects supply-chain poisoning by observing real network interactions rather than surface-level tool traces. ShieldNet integrates a man-in-the-middle (MITM) proxy and an event extractor to identify critical network behaviors, which are then processed by a lightweight classifier for attack detection. Extensive experiments show that ShieldNet achieves strong detection performance (up to 0.995 F-1 with only 0.8% false positives) while introducing little runtime overhead, substantially outperforming existing MCP scanners and LLM-based guardrails.
[535] PSY-STEP: Structuring Therapeutic Targets and Action Sequences for Proactive Counseling Dialogue Systems
Jihyun Lee, Yejin Min, Yejin Jeon, SungJun Yang, Hyounghun Kim, Gary Geunbae Lee
Main category: cs.AI
TL;DR: STEP dataset models CBT counseling with automatic thoughts and action-level sequences; STEPPER agent proactively elicits thoughts and executes interventions, refined via preference learning for better clinical grounding.
Details
Motivation: Existing counseling agents struggle to identify and address automatic negative thoughts in CBT dialogue settings, creating a gap in effective digital mental health interventions.
Method: Introduce STEP dataset modeling CBT counseling with automatic thoughts and dynamic sequences; train STEPPER agent to elicit thoughts and execute interventions; refine through preference learning on simulated counseling sessions.
Result: STEPPER delivers more clinically grounded, coherent, and personalized counseling compared to baselines, achieving higher counselor competence without emotional disruption.
Conclusion: The approach bridges the gap in CBT counseling agents by explicitly modeling automatic thoughts and using preference learning to enhance decision accuracy and empathy.
Abstract: Cognitive Behavioral Therapy (CBT) aims to identify and restructure automatic negative thoughts pertaining to involuntary interpretations of events, yet existing counseling agents struggle to identify and address them in dialogue settings. To bridge this gap, we introduce STEP, a dataset that models CBT counseling by explicitly reflecting automatic thoughts alongside dynamic, action-level counseling sequences. Using this dataset, we train STEPPER, a counseling agent that proactively elicits automatic thoughts and executes cognitively grounded interventions. To further enhance both decision accuracy and empathic responsiveness, we refine STEPPER through preference learning based on simulated, synthesized counseling sessions. Extensive CBT-aligned evaluations show that STEPPER delivers more clinically grounded, coherent, and personalized counseling compared to other strong baseline models, and achieves higher counselor competence without inducing emotional disruption.
[536] Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition
Abu Noman Md Sakib, Zhensen Wang, Merjulah Roby, Zijie Zhang
Main category: cs.AI
TL;DR: Proposes a novel metric to assess consistency of model explanations across similar inputs, using BERT on sentiment analysis with SHAP feature importance to quantify explanation stability.
Details
Motivation: Current Explainable AI evaluations are instance-centric and don't quantify whether attribution patterns remain consistent across similar inputs or label-preserving perturbations, which is critical for reliable pattern recognition systems.
Method: Uses pre-trained BERT on SST-2 sentiment analysis dataset, with additional tests on RoBERTa, DistilBERT, and IMDB. Applies SHAP to compute feature importance and quantifies cosine similarity of SHAP values for inputs with the same label to detect inconsistent behaviors.
Result: The proposed metric can identify misaligned predictions and inconsistencies in model explanations, compared against standard fidelity metrics, showing it effectively detects when model behavior deviates from intended objectives.
Conclusion: The framework enables more robust verification of rationale stability, supporting better evaluation of model behavior in practical pattern recognition pipelines by quantifying consistent attribution patterns for similar inputs.
Abstract: Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model’s behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.
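The metric described above reduces to mean pairwise cosine similarity of attribution vectors within a label class. A minimal sketch with hand-made attribution vectors; real usage would plug in per-feature SHAP values from the model under test:

```python
import math

def cosine(a, b):
    """Cosine similarity between two attribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def stability(attributions):
    """Mean pairwise cosine similarity of attributions that share a label."""
    sims = [cosine(attributions[i], attributions[j])
            for i in range(len(attributions))
            for j in range(i + 1, len(attributions))]
    return sum(sims) / len(sims)

# Toy attributions for three same-label inputs over the same three features.
# A stable model weights features similarly; an unstable one does not.
consistent   = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]]
inconsistent = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.1, 0.9, 0.0]]
```

A score near 1 indicates the model relies on the same features for similar predictions; a low score flags exactly the inconsistent reasoning the paper aims to detect.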
[537] The Topology of Multimodal Fusion: Why Current Architectures Fail at Creative Cognition
Xiujiang Tan
Main category: cs.AI
TL;DR: Paper identifies a topological limitation in multimodal AI architectures, proposes a philosophical framework drawn from Chinese epistemology, and suggests a mathematical formalization with an experimental roadmap.
Details
Motivation: Current multimodal AI architectures (CLIP, GPT-4V, diffusion models) share a structural limitation rooted in modal separability (contact topology), which restricts their ability to handle interpenetration of different modalities.
Method: Three-pillar approach: 1) Philosophical reinterpretation using Chinese epistemology's xiang concept, 2) Cognitive science reinterpretation of brain networks, 3) Mathematical formalization using fiber bundles and Yang-Mills curvature. Proposes UOO implementation with Neural ODEs and topological regularization.
Result: Proposes ANALOGY-MM benchmark with error-type-ratio metric and META-TOP three-tier benchmark for testing cross-civilizational topological isomorphism. Includes phased experimental roadmap with termination criteria
Conclusion: Current multimodal AI has fundamental topological limitations; Chinese epistemology’s xiang concept offers alternative framework; proposes mathematical formalization and experimental validation approach
Abstract: This paper identifies a structural limitation in current multimodal AI architectures that is topological rather than parametric. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior – modal separability – which we term contact topology. The argument rests on three pillars with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein’s saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition responded with xiang (operative schema) – the third state emerging when saying and showing interpenetrate. A cruciform framework (dao/qi x saying/showing) positions xiang at the intersection, executing dual huacai (transformation-and-cutting) along both axes. This generates a dual-layer dynamics: chuanghua (creative transformation as spontaneous event) and huacai (its institutionalization into repeatable form). The cognitive science pillar reinterprets DMN/ECN/SN tripartite co-activation through the pathological mirror: overlap isomorphism vs. superimposition collapse in a 2D parameter space (coupling intensity x regulatory capacity). The mathematical pillar formalizes these via fiber bundles and Yang-Mills curvature, with the cruciform structure mapped to fiber bundle language. We propose UOO implementation via Neural ODEs with topological regularization, the ANALOGY-MM benchmark with error-type-ratio metric, and the META-TOP three-tier benchmark testing cross-civilizational topological isomorphism across seven archetypes. A phased experimental roadmap with explicit termination criteria ensures clean exit if falsified.
[538] What Makes a Sale? Rethinking End-to-End Seller–Buyer Retail Dynamics with LLM Agents
Jeonghwan Choi, Jibin Hwang, Gyeonghun Sun, Minjeong Ban, Taewon Yun, Hyeonjae Cheon, Hwanjun Song
Main category: cs.AI
TL;DR: RetailSim is an end-to-end retail simulation framework that models the complete retail pipeline from seller persuasion to buyer purchase decisions, addressing limitations of existing partial simulators.
Details
Motivation: Existing retail simulators only capture partial aspects of the retail process and fail to model cross-stage dependencies, making it difficult to evaluate how early decisions affect downstream outcomes in the complete retail pipeline.
Method: Developed RetailSim as a unified end-to-end simulation framework with diverse product spaces, persona-driven agents, and multi-turn interactions. Used dual evaluation protocol: human evaluation of behavioral fidelity and meta-evaluation against real-world economic regularities.
Result: RetailSim successfully reproduces key real-world patterns including demographic purchasing behavior, price-demand relationships, and heterogeneous price elasticity. Demonstrated practical utility through persona inference, seller-buyer interaction analysis, and sales strategy evaluation.
Conclusion: RetailSim provides a controlled testbed for exploring retail strategies by modeling the complete retail pipeline with cross-stage dependencies, offering a more comprehensive simulation framework than existing partial approaches.
Abstract: Evaluating retail strategies before deployment is difficult, as outcomes are determined across multiple stages, from seller-side persuasion through buyer-seller interaction to purchase decisions. However, existing retail simulators capture only partial aspects of this process and do not model cross-stage dependencies, making it difficult to assess how early decisions affect downstream outcomes. We present RetailSim, an end-to-end retail simulation framework that models this pipeline in a unified environment, explicitly designed for simulation fidelity through diverse product spaces, persona-driven agents, and multi-turn interactions. We evaluate RetailSim with a dual protocol comprising human evaluation of behavioral fidelity and meta-evaluation against real-world economic regularities, showing that it successfully reproduces key patterns such as demographic purchasing behavior, the price-demand relationship, and heterogeneous price elasticity. We further demonstrate its practical utility via decision-oriented use cases, including persona inference, seller-buyer interaction analysis, and sales strategy evaluation, showing RetailSim’s potential as a controlled testbed for exploring retail strategies.
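One of the economic regularities RetailSim is checked against, price elasticity of demand, is simple to compute from simulated purchase counts. A sketch using the standard midpoint (arc) formula; the prices and quantities are made-up illustrative numbers:

```python
def arc_elasticity(p1, q1, p2, q2):
    """Midpoint-formula price elasticity of demand between two observations."""
    dq = (q2 - q1) / ((q1 + q2) / 2)
    dp = (p2 - p1) / ((p1 + p2) / 2)
    return dq / dp

# Hypothetical simulated demand at two price points for two buyer personas.
elastic_buyer   = arc_elasticity(10.0, 100, 12.0, 60)   # demand drops sharply
inelastic_buyer = arc_elasticity(10.0, 100, 12.0, 95)   # demand barely moves
```

"Heterogeneous price elasticity" in the paper's sense means different personas land at different points on this scale: |e| > 1 for price-sensitive buyers, |e| < 1 for insensitive ones.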
[539] Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models
Dominik Glandorf, Fares Fawzi, Tanja Käser
Main category: cs.AI
TL;DR: Using multimodal LLMs to predict and interpret video control behaviors (pausing, skipping, rewinding) as cognitive load indicators in educational videos, enabling scalable pre-screening of instructional design quality.
Details
Motivation: Current lack of scalable, explainable models to predict learners' video control behaviors before deployment, which serve as implicit signals of cognitive processing and instructional design quality in educational videos.
Method: Leverages multimodal LLMs to compute embeddings of short video segments, trains neural classifiers to identify temporally fine-grained interaction peaks, uses GPT-5 to code instructional design features, and employs concept activation vectors for interpretability.
Result: Classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts across 77 million video control events from 66 online courses.
Conclusion: Demonstrates feasibility of cost-efficient, interpretable pre-screening of educational video design and opens opportunities to empirically examine multimedia learning theory at scale using multimodal AI approaches.
Abstract: Learners’ use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors’ ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.
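Concept activation vectors reduce to a linear direction in embedding space: separate concept examples from random ones, then score new embeddings against that direction. A minimal sketch using the difference of class means as the direction (TCAV proper fits a linear classifier and takes its normal; the mean difference is a cheap stand-in with the same geometric role). The toy 2-D embeddings are illustrative, not the paper's MLLM features:

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cav(concept_embs, random_embs):
    """Simplified concept activation vector: difference of class means."""
    mc, mr = mean(concept_embs), mean(random_embs)
    return [c - r for c, r in zip(mc, mr)]

def concept_score(embedding, direction):
    """Signed alignment of an embedding with the concept direction."""
    return sum(e * d for e, d in zip(embedding, direction))

# Toy 2-D segment embeddings: segments exhibiting the concept sit high on axis 0.
concept_segments = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0]]
random_segments  = [[0.1, 0.3], [-0.2, -0.1], [0.0, 0.2]]
direction = cav(concept_segments, random_segments)
```

Interpreting a prediction then amounts to asking how strongly the segment's embedding aligns with directions for theory-relevant concepts (e.g. signaling or redundancy in multimedia learning terms).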
[540] SuperLocalMemory V3.3: The Living Brain – Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
Varun Pratap Bhardwaj
Main category: cs.AI
TL;DR: SuperLocalMemory V3.3 is a local-first agent memory system implementing full cognitive memory taxonomy with mathematical lifecycle dynamics, featuring novel metrics, adaptive forgetting, multi-channel retrieval, and zero-LLM operation.
Details
Motivation: Current AI coding agents have vast parametric knowledge but lack effective memory systems, relying on simple vector databases with single-channel retrieval and cloud LLMs, missing cognitive processes that make human memory effective.
Method: Introduces five key contributions: 1) Fisher-Rao Quantization-Aware Distance metric, 2) Ebbinghaus Adaptive Forgetting with lifecycle-aware quantization, 3) 7-channel cognitive retrieval system, 4) Long-Term Implicit memory via soft prompts, 5) zero-friction auto-cognitive pipeline automating memory lifecycle.
Result: Achieves 70.4% on LoCoMo benchmark in zero-LLM Mode A, with +23.8pp improvement on multi-hop tasks and +12.7pp on adversarial tasks. System runs entirely on CPU with over 5,000 monthly downloads.
Conclusion: SuperLocalMemory V3.3 demonstrates that implementing full cognitive memory taxonomy with mathematical lifecycle dynamics enables effective local-first memory systems for AI agents without requiring cloud LLMs.
Abstract: AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory systems store text in vector databases with single-channel retrieval, require cloud LLMs for core operations, and implement none of the cognitive processes that make human memory effective. We present SuperLocalMemory V3.3 (“The Living Brain”), a local-first agent memory system implementing the full cognitive memory taxonomy with mathematical lifecycle dynamics. Building on the information-geometric foundations of V3.2 (arXiv:2603.14588), we introduce five contributions: (1) Fisher-Rao Quantization-Aware Distance (FRQAD) – a new metric on the Gaussian statistical manifold achieving 100% precision at preferring high-fidelity embeddings over quantized ones (vs 85.6% for cosine), with zero prior art; (2) Ebbinghaus Adaptive Forgetting with lifecycle-aware quantization – the first mathematical forgetting curve in local agent memory coupled to progressive embedding compression, achieving 6.7x discriminative power; (3) 7-channel cognitive retrieval spanning semantic, keyword, entity graph, temporal, spreading activation, consolidation, and Hopfield associative channels, achieving 70.4% on LoCoMo in zero-LLM Mode A; (4) memory parameterization implementing Long-Term Implicit memory via soft prompts; (5) zero-friction auto-cognitive pipeline automating the complete memory lifecycle. On LoCoMo, V3.3 achieves 70.4% in Mode A (zero-LLM), with +23.8pp on multi-hop and +12.7pp on adversarial. V3.2 achieved 74.8% Mode A and 87.7% Mode C; the 4.4pp gap reflects a deliberate architectural trade-off. SLM V3.3 is open source under the Elastic License 2.0, runs entirely on CPU, with over 5,000 monthly downloads.
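The Ebbinghaus curve the system borrows is exponential decay of retention with a per-memory stability term; coupling it to "lifecycle-aware quantization" means mapping retention onto storage tiers. A sketch of that idea; the tier thresholds and stability values are illustrative, not SLM's actual parameters:

```python
import math

def retention(age_days, stability):
    """Ebbinghaus-style forgetting curve: R = exp(-t / S).

    Higher stability S (e.g. from repeated retrieval) flattens the curve,
    so frequently-used memories decay more slowly.
    """
    return math.exp(-age_days / stability)

def lifecycle_tier(age_days, stability):
    """Map retention onto a storage tier, mimicking lifecycle-aware quantization."""
    r = retention(age_days, stability)
    if r > 0.5:
        return "full-precision"
    if r > 0.1:
        return "quantized"
    return "pruned"

# A recent, frequently-retrieved memory survives; a stale one decays out.
fresh = lifecycle_tier(age_days=3, stability=30.0)
stale = lifecycle_tier(age_days=90, stability=10.0)
```

The appeal of this design is that forgetting and compression share one scalar: as retention drops, the memory's embedding is progressively compressed before finally being pruned.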
[541] Receding-Horizon Control via Drifting Models
Daniele Foffano, Alessio Russo, Alexandre Proutiere
Main category: cs.AI
TL;DR: Drifting MPC combines drifting generative models with receding-horizon planning for offline trajectory optimization when system dynamics are unknown and simulation is impossible.
Details
Motivation: Existing offline trajectory optimization methods that learn from datasets only recover the behavior distribution in the data, but don't optimize for desired cost criteria when system dynamics are unknown and simulation isn't possible.
Method: Proposes Drifting MPC framework that combines drifting generative models with receding-horizon planning under unknown dynamics, learning a conditional distribution over trajectories that is both data-supported and biased toward optimal plans.
Result: Drifting MPC generates near-optimal trajectories while maintaining one-step inference efficiency of drifting models and substantially reduces generation time compared to diffusion-based baselines.
Conclusion: Drifting MPC provides an effective offline trajectory optimization framework that balances optimality with closeness to offline prior data, offering efficient inference for planning under unknown dynamics.
Abstract: We study the problem of trajectory optimization in settings where the system dynamics are unknown and it is not possible to simulate trajectories through a surrogate model. When an offline dataset of trajectories is available, an agent could directly learn a trajectory generator by distribution matching. However, this approach only recovers the behavior distribution in the dataset, and does not in general produce a model that minimizes a desired cost criterion. In this work, we propose Drifting MPC, an offline trajectory optimization framework that combines drifting generative models with receding-horizon planning under unknown dynamics. The goal of Drifting MPC is to learn, from an offline dataset of trajectories, a conditional distribution over trajectories that is both supported by the data and biased toward optimal plans. We show that the resulting distribution learned by Drifting MPC is the unique solution of an objective that trades off optimality with closeness to the offline prior. Empirically, we show that Drifting MPC can generate near-optimal trajectories while retaining the one-step inference efficiency of drifting models and substantially reducing generation time relative to diffusion-based baselines.
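The abstract's "objective that trades off optimality with closeness to the offline prior" has a standard KL-regularized form (the paper's exact objective and temperature parameterization may differ):

```latex
p^{*} \;=\; \arg\min_{p}\; \mathbb{E}_{\tau \sim p}\big[c(\tau)\big] \;+\; \lambda\, D_{\mathrm{KL}}\big(p \,\|\, p_{\mathrm{data}}\big),
\qquad
p^{*}(\tau) \;\propto\; p_{\mathrm{data}}(\tau)\, e^{-c(\tau)/\lambda}.
```

The unique minimizer reweights the behavior distribution toward low-cost trajectories while staying on its support, which matches the abstract's "supported by the data and biased toward optimal plans."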
[542] Greedy and Transformer-Based Multi-Port Selection for Slow Fluid Antenna Multiple Access
Darian Perez-Adan, Jose P. Gonzalez-Coma, F. Javier Lopez-Martinez, Luis Castedo
Main category: cs.AI
TL;DR: Two complementary strategies for port selection in fluid antenna multiple access systems: GFwd+S (greedy forward selection with swap refinement) achieves near-optimal spectral efficiency, and a Transformer-based neural network with imitation learning approaches similar performance at lower computational cost.
Details
Motivation: Existing methods for port selection in fluid antenna multiple access systems either achieve near-optimal spectral efficiency at prohibitive computational cost or sacrifice significant performance for lower complexity, creating a need for better trade-offs.
Method: Two complementary strategies: (1) GFwd+S - greedy forward selection method with swap refinement that outperforms state-of-the-art reference schemes, and (2) Transformer-based neural network trained via imitation learning followed by Reinforce policy-gradient stage.
Result: GFwd+S consistently outperforms state-of-the-art reference schemes in terms of spectral efficiency, while the Transformer-based approach approaches GFwd+S performance at lower computational cost.
Conclusion: The proposed methods provide effective solutions to the port-selection problem in FAMA systems, offering either superior performance (GFwd+S) or similar performance with reduced computational complexity (Transformer-based approach).
Abstract: We address the port-selection problem in fluid antenna multiple access (FAMA) systems with multi-port fluid antenna (FA) receivers. Existing methods either achieve near-optimal spectral efficiency (SE) at prohibitive computational cost or sacrifice significant performance for lower complexity. We propose two complementary strategies: (i) GFwd+S, a greedy forward-selection method with swap refinement that consistently outperforms state-of-the-art reference schemes in terms of SE, and (ii) a Transformer-based neural network trained via imitation learning followed by a Reinforce policy-gradient stage, which approaches GFwd+S performance at lower computational cost.
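The greedy-plus-swap pattern behind GFwd+S can be sketched generically. This is an illustrative reconstruction, not the authors' code: the `score` callback stands in for the spectral-efficiency objective, which the paper computes from the FAMA channel (not shown here).

```python
def greedy_forward_with_swap(candidates, k, score):
    """Greedy forward selection of k ports followed by one-swap
    refinement. `score(subset)` is a placeholder for the spectral
    efficiency achieved by a given port subset."""
    selected = []
    # Forward pass: repeatedly add the port that most improves the score.
    for _ in range(k):
        remaining = [c for c in candidates if c not in selected]
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
    # Swap refinement: replace a selected port with an unselected one
    # whenever it improves the score; iterate to a fixed point.
    improved = True
    while improved:
        improved = False
        for i in range(len(selected)):
            for c in candidates:
                if c in selected:
                    continue
                trial = selected[:i] + [c] + selected[i + 1:]
                if score(trial) > score(selected):
                    selected = trial
                    improved = True
    return selected
```

The swap stage is what lifts plain greedy selection toward near-optimal subsets: it can undo an early greedy choice that turns out suboptimal once later ports are fixed.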
[543] Same World, Differently Given: History-Dependent Perceptual Reorganization in Artificial Agents
Hongju Pae
Main category: cs.AI
TL;DR: A minimal architecture for artificial agents that maintains a history-sensitive perspective through a slow latent variable that feeds back into perception, enabling identical observations to be encoded differently based on accumulated experience.
Details
Motivation: To create artificial agents that can not only adapt behavior but sustain a history-sensitive perspective on their world, allowing them to encode identical observations differently based on accumulated experience and maintain an internal stance.
Method: Proposes a minimal architecture with a slow perspective latent variable (g) that feeds back into perception and is updated through perceptual processing. Evaluated in a minimal gridworld with fixed spatial scaffold and sensory perturbations.
Result: Three key findings: 1) Perturbation history leaves measurable residue in adaptive plasticity even after conditions are restored; 2) Perspective latent reorganizes perceptual encoding so identical observations are represented differently based on prior experience; 3) Only adaptive self-modulation yields growth-then-stabilization dynamics.
Conclusion: Identifies a minimal mechanism for history-dependent perspectival organization in artificial agents, where dominant reorganization is perceptual rather than behavioral, allowing agents to maintain stable behavior while developing history-sensitive perspectives.
Abstract: What kind of internal organization would allow an artificial agent not only to adapt its behavior, but to sustain a history-sensitive perspective on its world? I present a minimal architecture in which a slow perspective latent $g$ feeds back into perception and is itself updated through perceptual processing. This allows identical observations to be encoded differently depending on the agent’s accumulated stance. The model is evaluated in a minimal gridworld with a fixed spatial scaffold and sensory perturbations. Across analyses, three results emerge: first, perturbation history leaves measurable residue in adaptive plasticity after nominal conditions are restored. Second, the perspective latent reorganizes perceptual encoding, such that identical observations are represented differently depending on prior experience. Third, only adaptive self-modulation yields the characteristic growth-then-stabilization dynamic, unlike rigid or always-open update regimes. Gross behavior remains stable throughout, suggesting that the dominant reorganization is perceptual rather than behavioral. Together, these findings identify a minimal mechanism for history-dependent perspectival organization in artificial agents.
[544] Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents
Yizhou Liu, Qi Sun, Yulin Chen, Siyue Zhang, Chen Zhao
Main category: cs.AI
TL;DR: A lightweight fine-tuning approach called Policy improves small language models’ ability to reliably retrieve and generate answers grounded in evidence, achieving LLM-level performance on knowledge-intensive tasks while reducing hallucination issues.
Details
Motivation: While LLMs have strong reasoning capabilities for knowledge-intensive tasks, their high computational cost limits practical deployment. Recent work has focused on distilling agentic behaviors from LLMs into SLMs, but SLMs invoke search tools less frequently and are more prone to hallucinations despite having less parametric knowledge.
Method: Proposes Policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. This contrasts with agent distillation from LLMs and focuses on consistent search behavior rather than adaptive search strategies.
Result: Policy improves performance by 17.3 points on Bamboogle and 15.3 points on HotpotQA, achieving LLM-level results across benchmarks. Analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.
Conclusion: The proposed Policy approach effectively addresses hallucination issues in SLM-based search agents by training them for consistent evidence retrieval and grounding, achieving performance comparable to LLMs while being more computationally efficient.
Abstract: Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLMs invoke search tools less frequently and are more prone to hallucinations. To address this issue, we propose Policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. Compared to agent distillation from LLMs, our approach improves performance by 17.3 points on Bamboogle and 15.3 points on HotpotQA, achieving LLM-level results across benchmarks. Our further analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.
[545] Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Seamus Brady
Main category: cs.AI
TL;DR: Springdrift is a persistent runtime system for long-lived LLM agents with auditable execution, hybrid memory retrieval, safety gating, and continuous self-perception, enabling cross-session continuity and forensic accountability.
Details
Motivation: Current LLM agents are typically session-bounded with limited memory and accountability. The authors aim to create persistent agents that can maintain context across sessions and channels while providing forensic audit trails for safety and reliability.
Method: The system integrates: 1) auditable execution substrate with append-only memory and git-backed recovery, 2) case-based reasoning memory with hybrid retrieval, 3) deterministic normative calculus for safety gating with axiom trails, and 4) continuous ambient self-perception via structured self-state representation injected each cycle.
Result: In a 23-day deployment, the agent demonstrated cross-session task continuity, diagnosed its own infrastructure bugs, classified failure modes, identified architectural vulnerabilities, and maintained context across email and web channels without explicit instruction.
Conclusion: Springdrift demonstrates that persistent LLM agents with auditable execution, hybrid memory, and continuous self-perception can achieve behaviors difficult in session-bounded systems, introducing the concept of “Artificial Retainer” systems with forensic accountability.
Abstract: We present Springdrift, a persistent runtime for long-lived LLM agents. The system integrates an auditable execution substrate (append-only memory, supervised processes, git-backed recovery), a case-based reasoning memory layer with hybrid retrieval (evaluated against a dense cosine baseline), a deterministic normative calculus for safety gating with auditable axiom trails, and continuous ambient self-perception via a structured self-state representation (the sensorium) injected each cycle without tool calls. These properties support behaviours difficult to achieve in session-bounded systems: cross-session task continuity, cross-channel context maintenance, end-to-end forensic reconstruction of decisions, and self-diagnostic behaviour. We report on a single-instance deployment over 23 days (19 operating days), during which the agent diagnosed its own infrastructure bugs, classified failure modes, identified an architectural vulnerability, and maintained context across email and web channels – without explicit instruction. We introduce the term Artificial Retainer for this category: a non-human system with persistent memory, defined authority, domain-specific autonomy, and forensic accountability in an ongoing relationship with a specific principal – distinguished from software assistants and autonomous agents, drawing on professional retainer relationships and the bounded autonomy of trained working animals. This is a technical report on a systems design and deployment case study, not a benchmark-driven evaluation. Evidence is from a single instance with a single operator, presented as illustration of what these architectural properties can support in practice. Implemented in Gleam on Erlang/OTP. Code, artefacts, and redacted operational logs will be available at https://github.com/seamus-brady/springdrift upon publication.
[546] On the “Causality” Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
Nima H. Siboni
Main category: cs.AI
TL;DR: A mathematical derivation showing that reward-to-go in policy gradients arises naturally from prefix trajectory decomposition, clarifying the “causality” argument often presented heuristically.
Details
Motivation: To provide a rigorous mathematical foundation for why reward-to-go replaces full return in policy gradient estimators, addressing the common heuristic presentation of "causality" that leaves unclear where past-reward terms disappear.
Method: Uses prefix trajectory distributions and the score-function identity to derive the REINFORCE estimator, showing that reward-to-go emerges directly from decomposing the objective over prefix trajectories rather than as a post hoc replacement.
Result: The derivation yields the same REINFORCE estimator but provides conceptual clarity: reward-to-go arises naturally from the mathematical formulation, and the usual causality argument becomes a corollary rather than an additional heuristic principle.
Conclusion: The paper offers a more rigorous mathematical foundation for policy gradient derivations, clarifying the relationship between full return and reward-to-go through prefix trajectory decomposition, which enhances conceptual understanding without changing the estimator.
Abstract: In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by “causality,” that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.
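For reference, the step the paper makes rigorous is the passage from the full-return REINFORCE estimator to reward-to-go. In standard notation (this is the textbook identity, not a reproduction of the paper's prefix-trajectory derivation):

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \Big( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t'=0}^{T} r_{t'} \Big) \right]
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r_{t'} \right].
```

The cross terms with $t' < t$ vanish because $r_{t'}$ is determined by the trajectory prefix before $a_t$ is sampled, and the score function $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ has zero expectation conditioned on that prefix; the paper's contribution is to make exactly this conditioning argument explicit via prefix trajectory distributions.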
[547] AI Assistance Reduces Persistence and Hurts Independent Performance
Grace Liu, Brian Christian, Tsvetomira Dumbalska, Michiel A. Bakker, Rachit Dubey
Main category: cs.AI
TL;DR: AI assistance improves short-term task performance but reduces persistence and impairs unassisted performance across various cognitive tasks.
Details
Motivation: Current AI systems are optimized for immediate, complete responses rather than long-term collaborative growth like human mentors who scaffold learning and prioritize development over instant results.
Method: Series of randomized controlled trials (N=1,222) on human-AI interactions across mathematical reasoning and reading comprehension tasks, measuring effects after brief (~10 minute) AI interactions.
Result: AI assistance improves short-term performance but causes significantly worse unassisted performance and increased likelihood of giving up. Effects emerge quickly after brief AI exposure.
Conclusion: AI model development should prioritize scaffolding long-term competence alongside immediate task completion, as reduced persistence undermines skill acquisition and learning.
Abstract: People often optimize for long-term goals in collaboration: A mentor or companion doesn’t just answer questions, but also scaffolds learning, tracks progress, and prioritizes the other person’s growth over immediate results. In contrast, current AI systems are fundamentally short-sighted collaborators - optimized for providing instant and complete responses, without ever saying no (unless for safety reasons). What are the consequences of this dynamic? Here, through a series of randomized controlled trials on human-AI interactions (N = 1,222), we provide causal evidence for two key consequences of AI assistance: reduced persistence and impairment of unassisted performance. Across a variety of tasks, including mathematical reasoning and reading comprehension, we find that although AI assistance improves performance in the short-term, people perform significantly worse without AI and are more likely to give up. Notably, these effects emerge after only brief interactions with AI (approximately 10 minutes). These findings are particularly concerning because persistence is foundational to skill acquisition and is one of the strongest predictors of long-term learning. We posit that persistence is reduced because AI conditions people to expect immediate answers, thereby denying them the experience of working through challenges on their own. These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.
[548] AI Trust OS – A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments
Eranga Bandara, Asanga Gunaratna, Ross Gore, Abdul Rahman, Ravi Mukkamala, Sachin Shetty, Sachini Rajapakse, Isurunima Kularathna, Peter Foytik, Safdar H. Bouk, Xueping Liang, Amin Hass, Ng Wee Keong, Kasun De Zoysa
Main category: cs.AI
TL;DR: AI Trust OS: A governance architecture for continuous AI observability and compliance using telemetry-driven discovery and zero-trust validation.
Details
Motivation: Organizations face a governance crisis with AI adoption - they can't govern what they can't see, and existing compliance methods can't handle emergent AI systems without formal oversight, creating a trust gap with regulators.
Method: Proposes AI Trust OS framework with four principles: proactive discovery, telemetry evidence over manual attestation, continuous posture over point-in-time audit, and architecture-backed proof over policy-document trust. Uses zero-trust telemetry boundary with ephemeral read-only probes and AI Observability Extractor Agent to scan LLM telemetry from tools like LangSmith and Datadog.
Result: Framework enables automatic discovery of undocumented AI systems and shifts governance from organizational self-report to empirical machine observation, evaluated across major compliance standards (ISO 42001, EU AI Act, SOC 2, GDPR, HIPAA).
Conclusion: Telemetry-first AI governance represents a categorical architectural shift in how enterprise trust is produced and demonstrated, addressing the structural governance crisis in AI adoption.
Abstract: The accelerating adoption of large language models, retrieval-augmented generation pipelines, and multi-agent AI workflows has created a structural governance crisis. Organizations cannot govern what they cannot see, and existing compliance methodologies built for deterministic web applications provide no mechanism for discovering or continuously validating AI systems that emerge across engineering teams without formal oversight. The result is a widening trust gap between what regulators demand as proof of AI governance maturity and what organizations can demonstrate. This paper proposes AI Trust OS, a governance architecture for continuous, autonomous AI observability and zero-trust compliance. AI Trust OS reconceptualizes compliance as an always-on, telemetry-driven operating layer in which AI systems are discovered through observability signals, control assertions are collected by automated probes, and trust artifacts are synthesized continuously. The framework rests on four principles: proactive discovery, telemetry evidence over manual attestation, continuous posture over point-in-time audit, and architecture-backed proof over policy-document trust. The framework operates through a zero-trust telemetry boundary in which ephemeral read-only probes validate structural metadata without ingressing source code or payload-level PII. An AI Observability Extractor Agent scans LangSmith and Datadog LLM telemetry, automatically registering undocumented AI systems and shifting governance from organizational self-report to empirical machine observation. Evaluated across ISO 42001, the EU AI Act, SOC 2, GDPR, and HIPAA, the paper argues that telemetry-first AI governance represents a categorical architectural shift in how enterprise trust is produced and demonstrated.
[549] ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture
Xu Mingze
Main category: cs.AI
TL;DR: ANX is an agent-native protocol and framework that reduces token consumption and improves efficiency for AI agents through unified design, dual rendering, and MCP integration.
Details
Motivation: Existing AI agent methods (GUI automation, MCP-based skills) suffer from high token consumption, fragmented interaction, inadequate security, and the lack of a unified framework, requiring a better agent-native solution.
Method: ANX protocol with 4 core innovations: 1) Agent-native design with high information density, 2) Dual rendering for human-agent interaction, 3) MCP-supported on-demand apps, 4) Machine-executable SOPs for reliable long-horizon tasks. Uses 3EX decoupled architecture with ANXHub.
Result: Reduces tokens by 47.3-66.3% compared to existing methods, shortens execution time by 57.7-58.1%, and provides native security through LLM-bypassed communication and human-only confirmation.
Conclusion: ANX successfully addresses key pain points in AI agent systems through protocol innovation, architectural optimization, and tool supplementation, demonstrating significant efficiency improvements.
Abstract: AI agents, as autonomous digital actors, need agent-native protocols. Existing methods include GUI automation and MCP-based skills, but they suffer from high token consumption, fragmented interaction, and inadequate security, because they lack a unified top-level framework and key components, leaving each independent module flawed. To address these issues, we present ANX, an open, extensible, verifiable agent-native protocol and top-level framework integrating CLI, Skill, and MCP, resolving these pain points via protocol innovation, architectural optimization, and tool supplementation. Its four core innovations are: 1) Agent-native design (ANX Config, Markup, CLI) with high information density, flexibility, and strong adaptability to reduce tokens and eliminate inconsistencies; 2) Human-agent interaction combining Skill’s flexibility for dual rendering as agent-executable instructions and human-readable UI; 3) MCP-supported on-demand lightweight apps without pre-registration; 4) ANX Markup-enabled machine-executable SOPs eliminating ambiguity for reliable long-horizon tasks and multi-agent collaboration. As the first in a series, we focus on ANX’s design, present its 3EX decoupled architecture with ANXHub, and provide preliminary feasibility analysis and experimental validation. ANX ensures native security: LLM-bypassed UI-to-Core communication keeps sensitive data out of agent context; human-only confirmation prevents automated misuse. Form-filling experiments with Qwen3.5-plus/GPT-4o show ANX reduces tokens by 47.3% (Qwen3.5-plus) and 55.6% (GPT-4o) vs MCP-based skills, 57.1% (Qwen3.5-plus) and 66.3% (GPT-4o) vs GUI automation, and shortens execution time by 58.1% and 57.7% vs MCP-based skills.
[550] MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong, Steve Scargall, Charles Fan
Main category: cs.AI
TL;DR: MemMachine is an open-source memory system for LLM agents that integrates short-term, long-term episodic, and profile memory with ground-truth-preserving architecture and contextualized retrieval to improve multi-session interactions.
Details
Motivation: Standard context-window and RAG pipelines degrade over multi-session interactions for LLM agents, failing to maintain personalization, factual continuity, and long-horizon reasoning effectively.
Method: MemMachine integrates three memory types with ground-truth-preserving architecture storing entire conversational episodes, uses contextualized retrieval expanding nucleus matches with surrounding context, and includes a companion Retrieval Agent that adaptively routes queries among different strategies.
Result: Achieves 0.9169 on LoCoMo, 93.0% accuracy on LongMemEvalS, outperforms Mem0 with 80% fewer input tokens, and Retrieval Agent achieves 93.2% on HotpotQA-hard and 92.6% on WikiMultiHop under noise conditions.
Conclusion: Preserving episodic ground truth while layering adaptive retrieval yields robust, efficient long-term memory for personalized LLM agents, with optimized prompts making GPT-5-mini the most cost-efficient setup.
Abstract: Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over multi-session interactions. We present MemMachine, an open-source memory system that integrates short-term, long-term episodic, and profile memory within a ground-truth-preserving architecture that stores entire conversational episodes and reduces lossy LLM-based extraction. MemMachine uses contextualized retrieval that expands nucleus matches with surrounding context, improving recall when relevant evidence spans multiple dialogue turns. Across benchmarks, MemMachine achieves strong accuracy-efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1-mini; on LongMemEvalS (ICLR 2025), a six-dimension ablation yields 93.0 percent accuracy, with retrieval-stage optimizations – retrieval depth tuning (+4.2 percent), context formatting (+2.0 percent), search prompt design (+1.8 percent), and query bias correction (+1.4 percent) – outperforming ingestion-stage gains such as sentence chunking (+0.8 percent). GPT-5-mini exceeds GPT-5 by 2.6 percent when paired with optimized prompts, making it the most cost-efficient setup. Compared to Mem0, MemMachine uses roughly 80 percent fewer input tokens under matched conditions. A companion Retrieval Agent adaptively routes queries among direct retrieval, parallel decomposition, or iterative chain-of-query strategies, achieving 93.2 percent on HotpotQA-hard and 92.6 percent on WikiMultiHop under randomized-noise conditions. These results show that preserving episodic ground truth while layering adaptive retrieval yields robust, efficient long-term memory for personalized LLM agents.
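The "contextualized retrieval" idea, expanding each nucleus match with its neighboring turns, can be sketched in a few lines. This is an illustrative reconstruction; MemMachine's real retrieval layer, parameter names, and window policy are not described in the abstract.

```python
def expand_with_context(nucleus_ids, transcript, radius=1):
    """Expand retrieved "nucleus" turns with their surrounding turns so
    evidence spanning multiple dialogue turns is returned whole.

    `transcript` is a list of turns; `radius` is a hypothetical tunable
    window size. Overlapping windows are merged and order is preserved.
    """
    keep = set()
    for i in nucleus_ids:
        lo = max(0, i - radius)
        hi = min(len(transcript) - 1, i + radius)
        keep.update(range(lo, hi + 1))
    return [transcript[i] for i in sorted(keep)]
```

The point of the expansion is recall: a dense match on one turn often lands mid-exchange, and returning the neighboring turns restores the answer's surrounding evidence without re-querying.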
[551] Incompleteness of AI Safety Verification via Kolmogorov Complexity
Munawar Hasan
Main category: cs.AI
TL;DR: The paper proves fundamental information-theoretic limits to AI safety verification, showing that no finite formal verifier can certify all policy-compliant instances beyond a certain complexity threshold.
Details
Motivation: To understand why AI safety verification faces inherent limitations, moving beyond explanations of computational complexity or model expressiveness to identify fundamental information-theoretic barriers.
Method: The authors formalize policy compliance as a verification problem over encoded system behaviors and analyze it using Kolmogorov complexity theory. They prove an incompleteness theorem showing that for any sound computably enumerable verifier, there exists a threshold such that true policy-compliant instances whose complexity exceeds it cannot be certified.
Result: An incompleteness theorem demonstrating that no finite formal verifier can certify all policy-compliant instances of arbitrarily high complexity, revealing a fundamental limitation independent of computational resources.
Conclusion: AI safety verification faces intrinsic information-theoretic limits, motivating alternative approaches like proof-carrying systems that provide instance-level correctness guarantees rather than attempting universal verification.
Abstract: Ensuring that artificial intelligence (AI) systems satisfy formal safety and policy constraints is a central challenge in safety-critical domains. While limitations of verification are often attributed to combinatorial complexity and model expressiveness, we show that they arise from intrinsic information-theoretic limits. We formalize policy compliance as a verification problem over encoded system behaviors and analyze it using Kolmogorov complexity. We prove an incompleteness result: for any fixed sound computably enumerable verifier, there exists a threshold beyond which true policy-compliant instances cannot be certified once their complexity exceeds that threshold. Consequently, no finite formal verifier can certify all policy-compliant instances of arbitrarily high complexity. This reveals a fundamental limitation of AI safety verification independent of computational resources, and motivates proof-carrying approaches that provide instance-level correctness guarantees.
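The theorem has the shape of Chaitin-style incompleteness. A hedged rendering consistent with the abstract (the paper's exact formalization may differ):

```latex
\forall V \ \text{sound and c.e.},\ \exists\, c_V \ \forall x:\quad
K(x) > c_V \;\Longrightarrow\; V \text{ does not certify } \mathrm{Compliant}(x),
```

i.e., a fixed verifier $V$ can certify compliance only for behaviors whose Kolmogorov complexity $K(x)$ lies below a constant depending on $V$, mirroring the enumeration argument behind Chaitin's incompleteness theorem.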
[552] Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices
Alexis Burgon, Berkman Sahiner, Nicholas A Petrick, Gene Pennello, Ravi K Samala
Main category: cs.AI
TL;DR: Novel evaluation framework for adaptive AI models in medical devices using three complementary measurements: learning, potential, and retention to disentangle performance changes from model adaptations vs. dynamic environments.
Details
Motivation: Address challenges in evaluating adaptive AI models for medical devices where iterative updates to both models and evaluation datasets complicate performance assessment, particularly for regulatory science needs.
Method: Introduces three complementary measurements: learning (model improvement on current data), potential (dataset-driven performance shifts), and retention (knowledge preservation across modification steps) to disentangle performance changes caused by model adaptations versus dynamic environments.
Result: Case studies using simulated population shifts demonstrate the approach’s utility: gradual transitions enable stable learning and retention, while rapid shifts reveal trade-offs between plasticity and stability.
Conclusion: These measurements provide practical insights for regulatory science, enabling rigorous assessment of the safety and effectiveness of adaptive AI systems over sequential modifications.
Abstract: This work addresses challenges in evaluating adaptive artificial intelligence (AI) models for medical devices, where iterative updates to both models and evaluation datasets complicate performance assessment. We introduce a novel approach with three complementary measurements: learning (model improvement on current data), potential (dataset-driven performance shifts), and retention (knowledge preservation across modification steps), to disentangle performance changes caused by model adaptations versus dynamic environments. Case studies using simulated population shifts demonstrate the approach’s utility: gradual transitions enable stable learning and retention, while rapid shifts reveal trade-offs between plasticity and stability. These measurements provide practical insights for regulatory science, enabling rigorous assessment of the safety and effectiveness of adaptive AI systems over sequential modifications.
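One plausible operationalization of the three measurements treats them as differences over a grid of model versions evaluated on dataset versions. The authors' exact definitions are not given in the abstract; the indexing below is an assumption made for illustration.

```python
def decompose(perf):
    """perf[i][j]: performance of model version i on dataset version j.

    A hypothetical reading of the paper's three measurements:
      learning_i  = perf[i][i]   - perf[i-1][i]    # effect of the model update
      potential_i = perf[i-1][i] - perf[i-1][i-1]  # effect of the dataset shift
      retention_i = perf[i][i-1] - perf[i-1][i-1]  # preservation on old data
    """
    out = []
    for i in range(1, len(perf)):
        out.append({
            "learning": perf[i][i] - perf[i - 1][i],
            "potential": perf[i - 1][i] - perf[i - 1][i - 1],
            "retention": perf[i][i - 1] - perf[i - 1][i - 1],
        })
    return out
```

Holding one factor fixed in each difference is what lets the scheme disentangle model-driven changes from environment-driven ones.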
[553] QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li, Ian Wu, Lewis Tunstall, Aviral Kumar
Main category: cs.AI
TL;DR: QED-Nano is a 4B parameter open model trained for Olympiad-level mathematical proof generation, achieving competitive performance with much larger proprietary models through a three-stage training pipeline.
Details
Motivation: Proprietary AI systems show impressive math reasoning capabilities but are expensive, opaque, and hard to reproduce. The authors aim to demonstrate that small, open models can achieve competitive reasoning performance on difficult Olympiad-level math problems.
Method: Three-stage training: (1) Supervised fine-tuning to distill proof-writing styles from DeepSeek-Math-V2, (2) Reinforcement learning with rubric-based rewards, (3) Expanded RL with reasoning cache that decomposes long proofs into iterative summarize-and-refine cycles for stronger test-time reasoning.
Result: QED-Nano surpasses proof-generation performance of larger open models (Nomos-1, GPT-OSS-120B) and approaches proprietary models like Gemini 3 Pro, while being much more efficient in inference cost. The full pipeline including models, datasets, and code is released.
Conclusion: Small open models can achieve competitive reasoning performance on Olympiad-level math through careful training design, making advanced mathematical reasoning more accessible and reproducible for research.
Abstract: Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large “internal” models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
[554] Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra
Main category: cs.AI
TL;DR: LLMs can be trained to generate reliable citations without test-time retrieval through a two-stage approach of continual pretraining with Active Indexing and instruction tuning, achieving up to 30.2% citation precision gains.
Details
Motivation: Current LLM citation systems rely on external retrievers at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. The paper explores whether LLMs can be made to reliably attribute to documents seen during training without test-time retrieval.
Method: Two-stage approach: (1) Continual pretraining with Active Indexing that creates source-anchored bindings using synthetic data with diverse fact restatements and bidirectional training (source-to-fact and fact-to-source); (2) Instruction tuning to elicit citation behavior. Evaluated on CitePretrainBench benchmark with real-world corpora and novel documents.
Result: Active Indexing consistently outperforms the Passive Indexing baseline, achieving citation precision gains of up to 30.2% across all tasks and models (Qwen-2.5-7B&3B). Performance improves with more augmented data, showing an upward trend even at 16x the original token count. Internal citations complement external ones by making models more robust to retrieval noise.
Conclusion: LLMs can be trained to generate reliable citations without test-time retrieval through proper training methodology, specifically Active Indexing during continual pretraining. This approach improves citation reliability and robustness while eliminating inference-time retrieval dependencies.
Abstract: Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
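The bidirectional binding behind Active Indexing can be illustrated with a minimal pair-construction sketch; the prompt templates and the `paraphrase` hook below are hypothetical stand-ins for the paper's synthetic-data pipeline, not its actual formats:

```python
def make_indexing_pairs(doc_id, facts, paraphrase=lambda f: [f]):
    """Build bidirectional (prompt, target) training pairs for one document.

    Minimal sketch of the Active Indexing idea: each fact (and any
    restatements produced by the hypothetical `paraphrase` hook) is
    bound to a persistent document identifier in both directions.
    """
    pairs = []
    for fact in facts:
        for variant in paraphrase(fact):
            # source-to-fact: generate content from a cited source
            pairs.append((f"According to [{doc_id}], state one fact.", variant))
            # fact-to-source: attribute a statement back to its source
            pairs.append((f"Which document supports: '{variant}'?", f"[{doc_id}]"))
    return pairs
```

Passive Indexing, by contrast, would only append `[doc_id]` to the raw document text, without these compositional restatements.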
[555] Barriers to Complexity-Theoretic Proofs that “AGI” Using Machine Learning is Impossible
Michael Guerzhoy
Main category: cs.AI
TL;DR: A critique of van Rooij et al.’s (2024) claim that human-like AI via data learning is intractable, arguing their proof relies on unjustified assumptions about data distributions and fails to account for inductive biases.
Details
Motivation: To challenge the claim that achieving human-like intelligence through data-driven learning is fundamentally intractable, by identifying flaws in the underlying assumptions of the proof.
Method: Critical analysis of the original proof’s assumptions about data distributions, discussion of fundamental barriers including precise definition of “human-like” intelligence, and consideration of inductive biases in machine learning systems.
Result: The critique successfully identifies that the original proof relies on unjustified assumptions about (input, output) tuple distributions and fails to account for key factors like inductive biases, making the intractability claim questionable.
Conclusion: The original intractability proof is flawed due to problematic assumptions, and attempts to repair it face significant conceptual barriers related to defining intelligence and accounting for system-specific inductive biases.
Abstract: A recent paper (van Rooij et al. 2024) claims to have proved that achieving human-like intelligence using learning from data is intractable in a complexity-theoretic sense. We point out that the proof relies on an unjustified assumption about the distribution of (input, output) tuples in the data. We briefly discuss that assumption in the context of two fundamental barriers to repairing the proof: the need to precisely define "human-like," and the need to account for the fact that a particular machine learning system will have particular inductive biases that are key to the analysis. Another attempt to repair the proof, by focusing on subsets of the data, faces barriers in terms of defining the subsets.
[556] PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval
Chun Chet Ng, Jia Yu Lim, Wei Zeng Low
Main category: cs.AI
TL;DR: PRISM is a training-free framework for financial document ranking that combines refined prompting, in-context learning, and multi-agent coordination, with empirical analysis showing simpler approaches often outperform complex pipelines.
Details
Motivation: Financial information retrieval from lengthy filings is crucial for operational and analytical decision-making, but existing approaches often require extensive training or complex architectures. The authors aim to develop a practical, training-free framework that balances performance with deployment feasibility.
Method: PRISM integrates three components: 1) refined system prompting for consistent performance, 2) selective in-context learning for complex queries, and 3) lightweight multi-agent coordination. The framework is evaluated through extensive ablation studies across FinAgentBench, FiQA-2018, and FinanceBench datasets.
Result: The best configuration achieves NDCG@5 of 0.71818 on FinAgentBench, ranking third while being the only training-free approach in the top three. Simpler configurations often outperform complex multi-agent pipelines. Comprehensive feasibility analyses cover latency, token usage, and cost trade-offs.
Conclusion: PRISM provides practical guidance for financial information retrieval, demonstrating that simpler training-free approaches can achieve competitive performance while being more feasible for deployment. The framework offers insights into when different components (prompting, ICL, multi-agent) provide value.
Abstract: With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks. Our primary contribution is a systematic empirical study of when each component provides value: prompt engineering delivers consistent performance with minimal overhead, ICL enhances reasoning for complex queries when applied selectively, and multi-agent systems show potential primarily with larger models and careful architectural design. Extensive ablation studies across FinAgentBench, FiQA-2018, and FinanceBench reveal that simpler configurations often outperform complex multi-agent pipelines, providing practical guidance for practitioners. Our best configuration achieves an NDCG@5 of 0.71818 on FinAgentBench, ranking third while being the only training-free approach in the top three. We provide comprehensive feasibility analyses covering latency, token usage, and cost trade-offs to support deployment decisions. The source code is released at https://bit.ly/prism-ailens.
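Since results are reported as NDCG@5, a compact reference implementation of the metric may be useful; this sketch uses the standard discount rel_i / log2(i + 2), which may differ in detail from the benchmark's scorer:

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a single ranked list of graded relevance scores.

    relevances: gains in ranked order (position 0 = top result).
    Standard formulation; the exact gain/discount variant used by
    FinAgentBench may differ.
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; misplacing highly relevant documents lower in the ranking reduces the score.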
[557] Representation learning to advance multi-institutional studies with electronic health record data from US and France
Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
Main category: cs.AI
TL;DR: A graph-based framework for harmonizing clinical data across privacy-siloed institutions by learning a shared semantic space from institution statistics, biomedical knowledge graphs, and LLM-derived semantics.
Details
Motivation: Electronic health records offer translational research opportunities but face challenges from fragmented data across privacy-siloed institutions and heterogeneous local coding practices. Privacy-preserving collaborative learning doesn't address inconsistencies in clinical concept representation across sites.
Method: Graph-based framework treating data harmonization as scalable representation learning. Integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information from large language models to learn a shared semantic space.
Result: Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.
Conclusion: The framework addresses the gap in clinical data harmonization by aligning diverse site-specific vocabularies while preserving patient privacy, enabling collaborative research across fragmented healthcare systems.
Abstract: The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.
[558] Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
Xiaojie Xu, Zongyuan Li, Chang Lu, Runnan Qi, Yanan Ni, Lumin Jiang, Xiangbei Liu, Xuebo Zhang, Yongchun Fang, Kuihua Huang, Xian Guo, Zhanghua Wu, Zhenya Li
Main category: cs.AI
TL;DR: A framework called Reflection of Episodes (ROE) that enables Large Language Models to learn in complex StarCraft II environments through self-reflection using expert experience and game keyframes.
Details
Motivation: StarCraft II provides a complex real-time strategy environment ideal for AI research, but LLMs struggle to learn effectively in such dynamic settings. The paper aims to address how LLMs can improve through self-reflection in complex game environments.
Method: The ROE framework uses keyframe selection to extract important game information, then makes decisions based on both expert experience and self-experience. After each game, the system reflects on the experience to generate new self-experience, creating a continuous learning loop.
Result: The method successfully defeated the Very Hard difficulty AI in TextStarCraft II. Detailed analysis of LLM game data verified the effectiveness of the approach.
Conclusion: The ROE framework demonstrates that LLMs can effectively learn in complex environments through self-reflection mechanisms, combining expert guidance with experiential learning.
Abstract: StarCraft II is a complex and dynamic real-time strategy (RTS) game environment that is well suited to artificial intelligence and reinforcement learning research. To address the problem of Large Language Model (LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes (ROE) framework based on expert experience and self-experience. The framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the episode to obtain new self-experience. In our experiments, the method defeated the built-in AI at the Very Hard difficulty in TextStarCraft II. We analyze the LLM's in-game data in detail, verifying the effectiveness of the approach.
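The play-then-reflect loop can be sketched as a small driver; `play_game` and `reflect` below are hypothetical stand-ins for the framework's keyframe selection and LLM calls:

```python
def run_roe_episode(play_game, reflect, expert_experience, self_experience):
    """One iteration of the Reflection of Episodes loop (minimal sketch).

    play_game(experience) -> (keyframes, outcome): plays one game using
    the combined experience, returning selected keyframes and the result.
    reflect(keyframes, outcome) -> str: distills the episode into a new
    self-experience entry. Both callables stand in for LLM interactions.
    """
    experience = expert_experience + self_experience
    keyframes, outcome = play_game(experience)
    self_experience.append(reflect(keyframes, outcome))
    return outcome, self_experience
```

Repeated calls grow `self_experience`, so later games are conditioned on both the fixed expert experience and the accumulated reflections.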
[559] The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models
Rahul Baxi
Main category: cs.AI
TL;DR: DDFT protocol measures epistemic robustness of language models under semantic compression and adversarial fabrication, revealing robustness is orthogonal to model size/architecture and depends on verification mechanisms.
Details
Motivation: Current evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. Static benchmarks cannot distinguish between lack of knowledge and verification collapse under information degradation or adversarial probing.
Method: Introduces Drill-Down and Fabricate Test (DDFT) protocol measuring epistemic robustness through progressive semantic compression and adversarial fabrication. Uses two-system cognitive model: Semantic System for text generation and Epistemic Verifier for factual validation. Evaluates 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations).
Result: Epistemic robustness is orthogonal to conventional design paradigms: neither parameter count (r=0.083) nor architectural type (r=0.153) significantly predicts robustness. Error detection capability strongly predicts overall robustness (rho=-0.817). Flagship models exhibit brittleness despite scale, while smaller models can achieve robust performance.
Conclusion: Robustness emerges from training methodology and verification mechanisms distinct from current approaches. DDFT provides theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications.
Abstract: Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. Static benchmarks like MMLU and TruthfulQA cannot distinguish a model that lacks knowledge from one whose verification mechanisms collapse when information degrades or adversaries probe for weaknesses. We introduce the Drill-Down and Fabricate Test (DDFT), a protocol that measures epistemic robustness: a model’s ability to maintain factual accuracy under progressive semantic compression and adversarial fabrication. We propose a two-system cognitive model comprising a Semantic System that generates fluent text and an Epistemic Verifier that validates factual accuracy. Our findings, based on evaluating 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations), reveal that epistemic robustness is orthogonal to conventional design paradigms. Neither parameter count (r=0.083, p=0.832) nor architectural type (r=0.153, p=0.695) significantly predicts robustness, suggesting it emerges from training methodology and verification mechanisms distinct from current approaches. Error detection capability strongly predicts overall robustness (rho=-0.817, p=0.007), indicating this is the critical bottleneck. We find that flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance, challenging assumptions about the relationship between model size and reliability. The DDFT framework provides both theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications.
[560] Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game
Michael Katz, Harsha Kokel, Sarath Sreedharan
Main category: cs.AI
TL;DR: Proposes Countdown game as a new planning benchmark for evaluating LLM planning capabilities, showing it’s more challenging than existing benchmarks and addresses limitations of current planning evaluation methods.
Details
Motivation: Current planning benchmarks are inadequate for measuring foundational models' planning capabilities: existing benchmarks either focus on loosely defined tasks (like travel planning) or use domains from planning competitions designed to challenge automated planners, not LLMs.
Method: Proposes using the Countdown game (forming target numbers from input numbers via arithmetic operations) as a planning benchmark. The domain has fully specified transition models, allows natural language descriptions, is NP-complete, and has a rich instance space preventing memorization.
Result: The Countdown benchmark is computationally challenging (NP-complete) and remains extremely difficult for existing LLM-based planning approaches, unlike simpler domains like 24 Game. Extensive theoretical analysis shows advantages over public benchmarks.
Conclusion: Countdown provides an ideal benchmark for evaluating planning capabilities with verifiable outcomes, addressing shortcomings of existing planning evaluation methods and revealing limitations of current LLM-assisted planning approaches.
Abstract: There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. From a world-model perspective, each instance induces a fully specified transition model (dynamics) over states and actions, enabling evaluation of planning with verifiable outcomes. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for planning capabilities evaluation. Specifically, the domain allows for an intuitive, natural language description for each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establishing the computational complexity result and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains like 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.
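The task itself is easy to state and brute-force for small instances, which makes the hardness claim concrete: a naive exhaustive solver is exponential in the number of inputs, consistent with the NP-completeness of the general problem. A minimal sketch (exact rational arithmetic via `Fraction`; the real benchmark's rules may restrict operations differently):

```python
from fractions import Fraction

def solve_countdown(numbers, target):
    """Exhaustive search for the Countdown game (illustrative sketch).

    Repeatedly combines two remaining numbers with +, -, *, / until the
    target is reached; returns one witnessing expression, or None if no
    combination works.
    """
    def search(items):  # items: list of (value, expression) pairs
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                candidates = [(a + b, f"({ea}+{eb})"),
                              (a - b, f"({ea}-{eb})"),
                              (a * b, f"({ea}*{eb})")]
                if b != 0:
                    candidates.append((a / b, f"({ea}/{eb})"))
                for value, expr in candidates:
                    if value == target:
                        return expr
                    found = search(rest + [(value, expr)])
                    if found:
                        return found
        return None

    return search([(Fraction(n), str(n)) for n in numbers])
```

For example, with inputs 25, 4, 3 and target 103, the search finds an expression such as 25*4+3; proving that no expression exists is what makes hard instances expensive.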
[561] Similarity Field Theory: A Mathematical Framework for Intelligence
Kei-Sing Ng
Main category: cs.AI
TL;DR: Similarity Field Theory proposes a mathematical framework for intelligence based on similarity relations and their evolution, with applications to AI interpretability.
Details
Motivation: To create a foundational mathematical framework for understanding intelligence and interpretability in dynamic systems, moving beyond statistical approaches to geometric problems on similarity fields.
Method: Introduces Similarity Field Theory with formal definitions: similarity field S over entities U, system evolution sequences Z_p, concepts as fibers F_α(K), and generative operator G. Formalizes intelligence as generating entities belonging to concept fibers.
Result: Two theorems: (i) asymmetry blocks mutual inclusion; (ii) stability implies either anchor coordinate or asymptotic confinement to target level. Framework provides geometric lens for AI interpretability.
Conclusion: Similarity Field Theory offers foundational language for characterizing intelligent systems, reframing intelligence as geometric problems on similarity fields rather than statistical ones, with applications to large language models.
Abstract: We posit that transforming similarity relations form the structural basis of comprehensible dynamic systems. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p=(X_p,S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_\alpha(K)=\{E\in U \mid S(E,K)\ge \alpha\}$, i.e., superlevel sets of the unary map $S_K(E):=S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. At a high level, this framework reframes intelligence and interpretability as geometric problems on similarity fields (preserving and composing level-set fibers) rather than statistical ones. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability implies either an anchor coordinate or asymptotic confinement to the target level (up to arbitrarily small tolerance). Together, these results constrain similarity-field evolution and motivate an interpretive lens applicable to large language models. AI systems may be aligned less to safety as such than to human-observable and human-interpretable conceptions of safety, which may not fully determine the underlying safety concept.
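The fiber definition is directly computable for a finite universe; a minimal sketch of F_alpha(K) as a superlevel set:

```python
def fiber(S, U, K, alpha):
    """Superlevel-set fiber F_alpha(K) = {E in U : S(E, K) >= alpha}.

    S(E, K) is a similarity field value in [0, 1]; by reflexivity
    S(K, K) = 1, so K always belongs to its own fiber for alpha <= 1.
    """
    return {E for E in U if S(E, K) >= alpha}
```

An operator G would then count as intelligent with respect to K, in the paper's sense, if entities it generates from members of this set also land in it.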
[562] Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics
Lianhao Zhou, Hongyi Ling, Cong Fu, Yepeng Huang, Michael Sun, Wendi Yu, Xiaoxuan Wang, Xiner Li, Xingyu Su, Junkai Zhang, Xiusi Chen, Chenxing Liang, Xiaofeng Qian, Heng Ji, Wei Wang, Marinka Zitnik, Shuiwang Ji
Main category: cs.AI
TL;DR: LLM-based scientific agents are transforming scientific discovery by orchestrating interactions between human scientists, natural language, code, and physics across the entire research lifecycle.
Details
Motivation: The rise of large language models has enabled autonomous systems (agents) that can accelerate scientific discovery. There's a need to understand how these language agents can transform the scientific discovery lifecycle from hypothesis generation to result analysis.
Method: The paper presents a vision and critical examination of LLM-based scientific agents, analyzing current methodologies, innovations, achievements, and limitations. It identifies research challenges and outlines directions for building more robust agents.
Result: The analysis highlights the transformative potential of autonomous agents to accelerate scientific discovery across diverse domains, though current systems have limitations that need to be addressed.
Conclusion: LLM-based scientific agents represent a paradigm shift in scientific computing, offering a flexible framework for accelerating discovery, but require further research to overcome current limitations and achieve robust, generalizable performance.
Abstract: Computing has long served as a cornerstone of scientific discovery. Recently, a paradigm shift has emerged with the rise of large language models (LLMs), introducing autonomous systems, referred to as agents, that accelerate discovery across varying levels of autonomy. These language agents provide a flexible and versatile framework that orchestrates interactions with human scientists, natural language, computer language and code, and physics. This paper presents our view and vision of LLM-based scientific agents and their growing role in transforming the scientific discovery lifecycle, from hypothesis discovery, experimental design and execution, to result analysis and refinement. We critically examine current methodologies, emphasizing key innovations, practical achievements, and outstanding limitations. Additionally, we identify open research challenges and outline promising directions for building more robust, generalizable, and adaptive scientific agents. Our analysis highlights the transformative potential of autonomous agents to accelerate scientific discovery across diverse domains.
[563] Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning
Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li
Main category: cs.AI
TL;DR: SSLogic is an agentic meta-synthesis framework that uses LLM agents to iteratively create and refine Generator-Validator pairs for generating verifiable reinforcement learning tasks, evolving task families rather than just perturbing instances.
Details
Motivation: Current RLVR (Reinforcement Learning from Verifiable Rewards) is bottlenecked by data synthesis that relies on expert-written code or fixed templates, limiting growth to instance-level perturbations rather than creating fundamentally new task families with novel rules and difficulty gradients.
Method: SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to author and refine executable Generator-Validator pairs. It employs a Multi-Gate Validation Protocol with multi-strategy consensus and Adversarial Blind Review where independent agents solve each instance by writing and executing code to filter ill-posed tasks before training.
Result: Starting from 400 seed families, two evolution rounds produced 953 families and 21,389 verifiable instances. The evolved data showed higher training utility with gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH when evaluated on Enigmata. Fine-grained KORBench evaluation revealed selective improvements in logic (+13.2%) and operation (+9.6%).
Conclusion: SSLogic successfully shifts the evolvable unit from problem instances to task-family specifications, enabling structural evolution that leads to downstream performance gains in logical reasoning tasks.
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) is bottlenecked by data: existing synthesis pipelines rely on expert-written code or fixed templates, confining growth to instance-level perturbations. We shift the evolvable unit from problem instances to task-family specifications. SSLogic is an agentic meta-synthesis framework in which LLM agents iteratively author and refine executable Generator-Validator pairs inside a closed Generate-Validate-Refine loop, producing families with new rules and difficulty gradients rather than parameter variations of old ones. A Multi-Gate Validation Protocol – multi-strategy consensus plus Adversarial Blind Review, where independent agents solve each instance by writing and executing code – filters ill-posed tasks before they enter training. Starting from 400 seed families, two evolution rounds yield 953 families and 21,389 verifiable instances. Three converging comparisons (step-matched, token-matched, and size-controlled on external Enigmata data) consistently show higher training utility of evolved data, with gains of SynLogic +5.2, AIME25 +3.0, and BBH +5.5 on Enigmata. Fine-grained KORBench evaluation reveals selective improvements in logic (+13.2%) and operation (+9.6%), linking structural evolution to downstream gains. Code: https://github.com/AdAstraAbyssoque/Scaling-the-Scaling-Logic
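The Generator-Validator contract can be shown with a toy family; everything below (the parity task, the function names) is an illustrative assumption, standing in for the LLM-authored executable pairs:

```python
import random

def make_parity_family(seed=0):
    """A toy Generator-Validator pair for one task family (sketch).

    In SSLogic, LLM agents author executable pairs like this; here the
    family is a trivial parity puzzle, chosen only to show the interface:
    generate() emits labeled instances, and validate() gates ill-posed
    instances before they enter training.
    """
    rng = random.Random(seed)

    def generate():
        bits = [rng.randint(0, 1) for _ in range(rng.randint(3, 8))]
        answer = "even" if sum(bits) % 2 == 0 else "odd"
        return {"bits": bits, "answer": answer}

    def validate(instance):
        # an instance is well-posed iff its answer can be recomputed
        expected = "even" if sum(instance["bits"]) % 2 == 0 else "odd"
        return instance["answer"] == expected

    return generate, validate
```

The meta-synthesis step then operates one level up: agents rewrite the `generate`/`validate` code itself to produce new families, rather than merely resampling instances from a fixed template.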
[564] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
Anshuman Chhabra, Shrestha Datta, Shahriar Kabir Nahin, Prasant Mohapatra
Main category: cs.AI
TL;DR: Survey paper on security risks specific to agentic AI systems powered by LLMs, covering threats, benchmarks, evaluations, and defense strategies.
Details
Motivation: Agentic AI systems with LLMs, planning, tool use, memory, and autonomy create new, amplified security risks distinct from traditional AI safety and software security, requiring specialized threat analysis and defenses.
Method: Survey methodology: the paper outlines a taxonomy of agentic AI threats, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from technical and governance perspectives.
Result: Synthesizes current research on agentic AI security, provides comprehensive threat taxonomy, identifies evaluation methodologies, and highlights defense approaches for secure-by-design agent systems.
Conclusion: Agentic AI introduces unique security challenges requiring specialized frameworks; the survey supports development of secure-by-design agent systems and identifies open research challenges in this emerging field.
Abstract: Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, memory, and autonomy, are emerging as powerful, flexible platforms for automation. Their ability to autonomously execute tasks across web, software, and physical environments creates new and amplified security risks, distinct from both traditional AI safety and conventional software security. This survey outlines a taxonomy of threats specific to agentic AI, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from both technical and governance perspectives. We synthesize current research and highlight open challenges, aiming to support the development of secure-by-design agent systems.
[565] KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi
Main category: cs.AI
TL;DR: KLong is an open-source LLM agent trained for extremely long-horizon tasks using trajectory-splitting SFT and progressive RL training, achieving state-of-the-art performance on research paper benchmarks.
Details
Motivation: The paper addresses the challenge of training LLM agents to solve extremely long-horizon tasks, which require processing and reasoning over extended sequences of actions and information.
Method: 1) Cold-start via trajectory-splitting SFT that preserves early context while progressively truncating later context with overlap; 2) a Research-Factory pipeline for automated generation of high-quality training data from research papers; 3) progressive RL training with multiple stages of progressively extended timeouts.
Result: KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench and shows generalization to other coding benchmarks like SWE-bench Verified and MLE-bench.
Conclusion: The proposed methods enable effective training of LLM agents for long-horizon tasks, with KLong demonstrating superior performance and generalization capabilities compared to larger models.
Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
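A minimal sketch of what trajectory-splitting could look like, based only on the abstract's description (preserve early context, progressively slide over later context, keep overlap between sub-trajectories); the function name and the window parameters are assumptions, not the paper's implementation:

```python
# Illustrative sketch (assumed parameters, not KLong's actual code) of
# trajectory-splitting for SFT: each sub-trajectory keeps the earliest
# steps as a preserved prefix, followed by a sliding window over later
# steps, with consecutive windows sharing `overlap` steps.
def split_trajectory(steps, prefix_len=2, window=4, overlap=1):
    """Return overlapping sub-trajectories: preserved prefix + window."""
    prefix = steps[:prefix_len]
    subs, start = [], prefix_len
    while start < len(steps):
        subs.append(prefix + steps[start:start + window])
        if start + window >= len(steps):
            break
        start += window - overlap  # windows share `overlap` steps
    return subs

traj = [f"step{i}" for i in range(10)]
for sub in split_trajectory(traj):
    print(sub)
```

Each sub-trajectory then fits a normal SFT context budget while retaining the early context that long-horizon decisions depend on.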
[566] An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models
Alexander Zadorojniy, Segev Wasserkrug, Eitan Farchi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.16383 returned HTTP 429 (rate limited).
[567] Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models
Farzane Aminmansour, Taher Jafferjee, Ehsan Imani, Erin Talvitie, Michael Bowling, Martha White
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2006.04363 returned HTTP 429 (rate limited).
[568] HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation
Rongxin Chen, Tianyu Wu, Bingbing Xu, Jiatang Luo, Xiucheng Xu, Huawei Shen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.05656 returned HTTP 429 (rate limited).
[569] ConvoLearn: A Dataset for Fine-Tuning Dialogic AI Tutors
Mayank Sharma, Roy Pea, Hari Subramonyam
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.08950 returned HTTP 429 (rate limited).
[570] The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making
Jon Chun, Katherine Elkins
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.21439 returned HTTP 429 (rate limited).
[571] TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu, Jintao Du, Yu Cheng, Weiqiang Wang, Qiliang Liu, Zhengyang Zhou, Yang Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.22776 returned HTTP 429 (rate limited).
[572] Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration
Wei Dai, Haoyu Wang, Honghao Chang, Lijun He, Fan Li, Jian Sun, Haixia Bi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.03151 returned HTTP 429 (rate limited).
[573] IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery
Ivaxi Sheth, Zhijing Jin, Bryan Wilder, Dominik Janzing, Mario Fritz
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.07943 returned HTTP 429 (rate limited).
[574] Certified Training with Branch-and-Bound for Lyapunov-stable Neural Control
Zhouxing Shi, Haoyu Li, Cho-Jui Hsieh, Huan Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2411.18235 returned HTTP 429 (rate limited).
[575] Voxtral Realtime
Mistral-AI, Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Avi Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Margaret Jennings, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.11298 returned HTTP 429 (rate limited).
[576] AI Runtime Infrastructure
Christopher Cruz
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.00495 returned HTTP 429 (rate limited).
[577] DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.05912 returned HTTP 429 (rate limited).
[578] A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation
Cong Cao, Jingyao Zhang, Kun Tong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.08388 returned HTTP 429 (rate limited).
[579] A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
Haoyu He, Yu Duan, Wenzhen Liu, Hanyuan Hang, Boyu Qin, Qiantu Tuo, Xiaoke Yang, Rui Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.14869 returned HTTP 429 (rate limited).
[580] Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI
Houston Haynes
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.18104 returned HTTP 429 (rate limited).
[581] An Onto-Relational-Sophic Framework for Governing Synthetic Minds
Huansheng Ning, Jianguo Ding
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.18633 returned HTTP 429 (rate limited).
[582] RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
Tongrui Su, Qingbin Li, Shengyu Zhu, Wei Chen, Xueqi Cheng
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2504.18594 returned HTTP 429 (rate limited).
[583] Intelligence Inertia: Physical Isomorphism and Applications
Jipeng Han
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.22347 returned HTTP 429 (rate limited).
[584] Voxtral TTS
Mistral-AI, Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Andrew Zhao, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Arthur Fournier, Artjom Joosen, Avi Sooriyarachchi, Aysenur Karaduman Utkur, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Bowen Yang, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Emmanuel Gottlob, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jacques Sun, Jan Ludziejewski, Jason Rute, Jérémie Dentan, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.25551 returned HTTP 429 (rate limited).
[585] Metriplector: From Field Theory to Neural Architecture
Dan Oprisa, Peter Toth
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.29496 returned HTTP 429 (rate limited).
[586] Bayesian Hierarchical Invariant Prediction
Francisco Madaleno, Pernille Julie Viuff Sand, Francisco C. Pereira, Sergio Hernan Garrido Mejia
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.11211 returned HTTP 429 (rate limited).
[587] ClawSafety: “Safe” LLMs, Unsafe Agents
Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.01438 returned HTTP 429 (rate limited).
[588] AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
Prince Zizhuang Wang, Shuli Jiang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.01487 returned HTTP 429 (rate limited).
[589] Domain-constrained knowledge representation: A modal framework
Chao Li, Yuru Wang, Chunyi Zhao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.01770 returned HTTP 429 (rate limited).
[590] Understanding Task Representations in Neural Networks via Bayesian Ablation
Andrew Nam, Declan Campbell, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.13742 returned HTTP 429 (rate limited).
[591] Human-AI Collaborative Game Testing with Vision Language Models
Boran Zhang, Muhan Xu, Zhijun Pan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2501.11782 returned HTTP 429 (rate limited).
[592] SoSBench: Benchmarking Safety Alignment on Six Scientific Domains
Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Zixin Rao, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.21605 returned HTTP 429 (rate limited).
[593] Post-detection inference for sequential changepoint localization
Aytijhya Saha, Aaditya Ramdas
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2502.06096 returned HTTP 429 (rate limited).
[594] LLMs Judging LLMs: A Simplex Perspective
Patrick Vossler, Fan Xia, Yifan Mai, Adarsh Subbaswamy, Jean Feng
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2505.21972 returned HTTP 429 (rate limited).
[595] Cyber-Physical Systems Security: A Comprehensive Review of Anomaly Detection Techniques
Danial Abshari, Meera Sridhar
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2502.13256 returned HTTP 429 (rate limited).
[596] Implicit Bias-Like Patterns in Reasoning Models
Messi H.J. Lee, Calvin K. Lai
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2503.11572 returned HTTP 429 (rate limited).
[597] Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation
Yunqi Shi, Chengrui Gao, Wanqi Ren, Peng Xie, Siyuan Xu, Ke Xue, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2503.12946 returned HTTP 429 (rate limited).
[598] Causality-Based Scores Alignment in Explainable Data Management
Felipe Azua, Leopoldo Bertossi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2503.14469 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[599] Large Language Models for Combinatorial Optimization of Design Structure Matrix
Shuo Jiang, Min Xie, Jianxi Luo
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2506.09749 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[600] PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers
Federico Zucchi, Thomas Lampert
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2508.04503 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[601] CATNet: A geometric deep learning approach for CAT bond spread prediction in the primary market
Dixon Domfeh, Saeid Safarveisi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2508.10208 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[602] ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2508.16703 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[603] DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow
Tao Long, Xuanming Zhang, Sitong Wang, Zhou Yu, Lydia B Chilton
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2509.12626 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[604] ACT: Agentic Classification Tree
Vincent Grari, Tim Arni, Thibault Laugel, Sylvain Lamprier, James Zou, Marcin Detyniecki
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2509.26433 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[605] Autonomy Reshapes How Personalization Affects Privacy Concerns and Trust in LLM Agents
Zhiping Zhang, Yi Evie Zhang, Freda Shi, Tianshi Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.04465 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[606] Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.07985 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[607] Leveraging Wireless Sensor Networks for Real-Time Monitoring and Control of Industrial Environments
Muhammad Junaid Asif, Abdul Rehman, Asim Mehmood, Rana Fayyaz Ahmad, Shazia Saqib
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.13820 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[608] AI-BAAM: AI-Driven Bank Statement Analytics as Alternative Data for Malaysian MSME Credit Scoring
Chun Chet Ng, Zhen Hao Chu, Jia Yu Lim, Yin Yin Boon, Wei Zeng Low, Jin Khye Tan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.16066 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[609] A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.18814 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[610] ATLAS: A Layered Constraint-Guided Framework for Structured Artifact Generation in LLM-Assisted MDE
Tong Ma, Hui Lai, Hui Wang, Zhenhu Tian, Chaochao Li, Fengjie Xu, Ling Fang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.25890 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[611] FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis
Tommy Sha, Zhan Cheng, Haotian Zhai, Xuwei Ding, Junnan Li, Haixiang Tang, Zaoting Sun, Yanchuan Tang, Yongzhe, Yuan Gao, Anhao Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.08887 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[612] A fine-grained look at causal effects in causal spaces
Junhyung Park, Yuqing Zhou
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.11919 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[613] Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
Chao Wen, Tung Phung, Pronita Mehrotra, Sumit Gulwani, Roger E. Beaty, Tomohiro Nagashima, Adish Singla
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.18388 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[614] Path Integral Solution for Dissipative Generative Dynamics
Xidi Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.00860 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[615] Rewriting Video: Text-Driven Reauthoring of Video Footage
Sitong Wang, Anh Truong, Lydia B. Chilton, Dingzeyu Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.08565 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[616] Fairness in Healthcare Processes: A Quantitative Analysis of Decision Making in Triage
Rachmadita Andreswari, Stephan A. Fahrenkrog-Petersen, Jan Mendling
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.11065 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[617] How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests
Daniel Ogenrwot, John Businge
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.17581 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[618] RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.04448 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[619] MoltNet: Understanding Social Behavior of AI Agents in the Agent-Native MoltBook
Yi Feng, Chen Huang, Zhibo Man, Ryner Tan, Long P. Hoang, Shaoyang Xu, Wenxuan Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.13458 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[620] WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control
Mehran Aghabozorgi, Alireza Moazeni, Yanshu Zhang, Ke Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2602.14351 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[621] “When to Hand Off, When to Work Together”: Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction
Kihoon Son, Hyewon Lee, DaEun Choi, Yoonsu Kim, Tae Soo Kim, Yoonjoo Lee, John Joon Young Chung, HyunJoon Jung, Juho Kim
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.02050 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[622] The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks
Zice Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.02293 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[623] Mathematicians in the age of AI
Jeremy Avigad
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.03684 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[624] NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning
Addison Kalanther, Sanika Bharvirkar, Shankar Sastry, Chinmay Maheshwari
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.06977 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[625] PlayWorld: Learning Robot World Models from Autonomous Play
Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, Anirudha Majumdar
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.09030 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[626] Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction
Brian Freeman, Adam Kicklighter, Matt Erdman, Zach Gordon
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.10047 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[627] Security Considerations for Artificial Intelligence Agents
Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.12230 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[628] Exploring Collatz Dynamics with Human-LLM Collaboration
Edward Y. Chang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.11066 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[629] Brittlebench: Quantifying LLM robustness via prompt sensitivity
Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar, Samuel J. Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, Adina Williams
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.13285 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[630] AI-Driven Predictive Maintenance with Environmental Context Integration for Connected Vehicles: Simulation, Benchmarking, and Field Validation
Kushal Khemani, Anjum Nazir Qureshi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.13343 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[631] Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
Zhexi Lian, Haoran Wang, Xuerun Yan, Weimeng Lin, Xianhong Zhang, Yongyu Chen, Jia Hu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.13842 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[632] Causal Discovery in Action: Learning Chain-Reaction Mechanisms from Interventions
Panayiotis Panayiotou, Özgür Şimşek
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.22620 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[633] Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution
Yechao Zhang, Shiqian Zhao, Jie Zhang, Gelei Deng, Jiawen Zhang, Xiaogeng Liu, Chaowei Xiao, Tianwei Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23064 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[634] Unilateral Relationship Revision Power in Human-AI Companion Interaction
Benjamin Lange
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.23315 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[635] Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning
Dogan Urgun, Gokhan Gungor
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.24324 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[636] Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan, Chris Brown
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.26567 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[637] A Firefly Algorithm for Mixed-Variable Optimization Based on Hybrid Distance Modeling
Ousmane Tom Bechir, Adán José-García, Zaineb Chelly Garcia, Vincent Sobanski, Clarisse Dhaenens
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.26792 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[638] HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Zhaohui Wang, Jiexi Wu, Zhixin Pan, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di Yin, Xing Sun, Muhan Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.28458 was rate limited (HTTP 429), so no abstract or analysis is included for this entry.
[639] Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework
Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Kejun He, Jia Liu, Xiaohan Fan, Jing Yuan
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.28532 returned HTTP 429 (rate limited).
[640] ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
Joonhyung Bae
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.28816 returned HTTP 429 (rate limited).
[641] Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training
Ivan Pasichnyk
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.28921 returned HTTP 429 (rate limited).
[642] On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
Zichao Wei
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.29069 returned HTTP 429 (rate limited).
[643] Security in LLM-as-a-Judge: A Comprehensive SoK
Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Vignesh Kumar Kembu, Serena Nicolazzo, Antonino Nocera, Vinod P., Saraga Sakthidharan
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.29403 returned HTTP 429 (rate limited).
[644] Task-Centric Personalized Federated Fine-Tuning of Language Models
Gabriel U. Talasso, Meghdad Kurmanji, Allan M. de Souza, Nicholas D. Lane, Leandro A. Villas
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.00050 returned HTTP 429 (rate limited).
[645] RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems
KrishnaSaiReddy Patil
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.00387 returned HTTP 429 (rate limited).
[646] Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
Björn Roman Kohlberger
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.00733 returned HTTP 429 (rate limited).
[647] Beyond Message Passing: A Semantic View of Agent Communication Protocols
Dun Yuan, Fuyuan Lyu, Ye Yuan, Weixu Zhang, Bowei He, Jiayi Geng, Linfeng Du, Zipeng Sun, Yankai Chen, Changjiang Han, Jikun Kang, Alex Chen, Haolun Wu, Xue Liu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.02369 returned HTTP 429 (rate limited).
cs.SD
[648] Composer Vector: Style-steering Symbolic Music Generation in a Latent Space
Xunyi Jiang, Mingyang Yao, Jingyue Huang, Julian McAuley
Main category: cs.SD
TL;DR: Composer Vector enables inference-time control over composer style in symbolic music generation through latent space steering without retraining, supporting single and blended styles.
Details
Motivation: Existing methods for composer style control require large labeled datasets and only support single-composer generation, limiting creative flexibility and blended style scenarios.
Method: Proposes Composer Vector, an inference-time steering method that operates directly in the model’s latent space to control composer style without retraining, using continuous steering coefficients for smooth control.
Result: Effectively guides generations toward target composer styles, enables smooth interpretable control through continuous coefficients, and supports seamless fusion of multiple styles within a unified latent space framework.
Conclusion: Simple latent space steering provides a practical and general mechanism for controllable symbolic music generation, enabling more flexible and interactive creative workflows.
Abstract: Symbolic music generation has made significant progress, yet achieving fine-grained and flexible control over composer style remains challenging. Existing training-based methods for composer style conditioning depend on large labeled datasets. Besides, these methods typically support only single-composer generation at a time, limiting their applicability to more creative or blended scenarios. In this work, we propose Composer Vector, an inference-time steering method that operates directly in the model’s latent space to control composer style without retraining. Through experiments on multiple symbolic music generation models, we show that Composer Vector effectively guides generations toward target composer styles, enabling smooth and interpretable control through a continuous steering coefficient. It also enables seamless fusion of multiple styles within a unified latent space framework. Overall, our work demonstrates that simple latent space steering provides a practical and general mechanism for controllable symbolic music generation, enabling more flexible and interactive creative workflows. Code and Demo are available here: https://github.com/JiangXunyi/Composer-Vector and https://jiangxunyi.github.io/composervector.github.io/
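The latent steering the abstract describes can be pictured with a minimal sketch. The function name `steer` and the idea of precomputed per-composer direction vectors are illustrative assumptions, not the authors' released code:

```python
import numpy as np

def steer(hidden, composer_vecs, coeffs):
    # Inference-time latent steering: add a weighted sum of composer
    # direction vectors to a hidden state. A continuous coefficient gives
    # smooth style control; several nonzero coefficients blend composers.
    blend = sum(c * v for c, v in zip(coeffs, composer_vecs))
    return hidden + blend
```

No weights are retrained; only the latent activation is shifted at generation time, which is why the method needs no labeled fine-tuning data.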
[649] Measuring Robustness of Speech Recognition from MEG Signals Under Distribution Shift
Sheng-You Chien, Bo-Yi Mao, Yi-Ning Chang, Po-Chih Kuo
Main category: cs.SD
TL;DR: Study compares neural network architectures for phoneme classification from MEG signals, finding preprocessing and normalization choices more important than architectural complexity, with instance normalization being crucial for generalization.
Details
Motivation: To investigate robust speech-related decoding from non-invasive MEG signals using the LibriBrain phoneme-classification benchmark, comparing different neural network architectures and examining the effects of various preprocessing and data configuration choices.
Method: Compared residual CNNs, an STFT-based CNN, and a CNN-Transformer hybrid. Examined effects of group averaging, label balancing, repeated grouping, normalization strategies, and data augmentation on MEG signal decoding performance.
Result: Preprocessing and data-configuration choices mattered more than architectural complexity. Instance normalization emerged as most influential for generalization. Best model achieved 60.95% F1-macro vs 39.53% baseline. Most models without instance normalization showed substantial validation-to-test degradation. MEGConformer maintained 64.09% F1-macro on both splits.
Conclusion: Improving non-invasive phoneme decoding requires better handling of normalization-related distribution shift and addressing single-trial decoding challenges. Instance normalization is crucial for generalization in MEG signal processing.
Abstract: This study investigates robust speech-related decoding from non-invasive MEG signals using the LibriBrain phoneme-classification benchmark from the 2025 PNPL competition. We compare residual convolutional neural networks (CNNs), an STFT-based CNN, and a CNN–Transformer hybrid, while also examining the effects of group averaging, label balancing, repeated grouping, normalization strategies, and data augmentation. Across our in-house implementations, preprocessing and data-configuration choices matter more than additional architectural complexity, among which instance normalization emerges as the most influential modification for generalization. The strongest of our own models, a CNN with group averaging, label balancing, repeated grouping, and instance normalization, achieves 60.95% F1-macro on the test split, compared with 39.53% for the plain CNN baseline. However, most of our models, without instance normalization, show substantial validation-to-test degradation, indicating that distribution shift induced by different normalization statistics is a major obstacle to generalization in our experiments. By contrast, MEGConformer maintains 64.09% F1-macro on both validation and test, and saliency-map analysis is qualitatively consistent with this contrast: weaker models exhibit more concentrated or repetitive phoneme-sensitive patterns across splits, whereas MEGConformer appears more distributed. Overall, the results suggest that improving the reliability of non-invasive phoneme decoding will likely require better handling of normalization-related distribution shift while also addressing the challenge of single-trial decoding.
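Since instance normalization is singled out as the decisive modification, a minimal per-trial sketch may help. Shapes and the function name are illustrative assumptions:

```python
import numpy as np

def instance_norm(meg, eps=1e-6):
    # meg: (channels, time) single-trial MEG segment.
    # Each channel is normalized by its own within-trial mean and std,
    # so train- and test-split statistics cannot drift apart -- the
    # distribution shift the paper identifies as the main obstacle.
    mu = meg.mean(axis=-1, keepdims=True)
    sd = meg.std(axis=-1, keepdims=True)
    return (meg - mu) / (sd + eps)
```

Because the statistics are computed per trial rather than over the training set, the normalization itself cannot carry split-specific bias.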
[650] OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian
Main category: cs.SD
TL;DR: UniHAGen is a task for synthesizing comprehensive auditory scenes including both on-screen and off-screen sounds across diverse domains, with OmniSonic as a flow-matching diffusion framework that jointly processes video and text conditions using a TriAttn-DiT architecture with MoE gating.
Details
Motivation: Existing video-conditioned audio generation models focus only on on-screen environmental sounds, neglecting off-screen events. Recent holistic text-video-to-audio models exclude human speech. There's a need for universal audio generation that handles both on/off-screen sounds including speech.
Method: OmniSonic uses flow-matching-based diffusion jointly conditioned on video and text. It features TriAttn-DiT architecture performing three cross-attention operations for on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with MoE gating to adaptively balance contributions.
Result: OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations. The authors also create UniHAGen-Bench with over 1,000 samples covering three representative on/off-screen speech-environment scenarios.
Conclusion: OmniSonic establishes a strong baseline for universal and holistic audio generation, addressing limitations of prior work by handling both on-screen and off-screen sounds including speech across diverse domains.
Abstract: In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, but they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
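The "three cross-attentions plus MoE gate" idea can be sketched in a heavily simplified form: single-head attention without learned projections and a plain softmax gate over the three condition streams. All shapes, names, and simplifications here are assumptions for illustration; the actual TriAttn-DiT block is not specified in the abstract:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # Single-head scaled dot-product attention; q: (Tq, d), kv: (Tk, d).
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv

def triattn_block(x, on_cond, off_cond, speech_cond, gate_w):
    # One TriAttn-style step: three cross-attentions over the on-screen,
    # off-screen, and speech condition streams, blended per token by a
    # softmax (MoE-style) gate computed from the latent x.
    outs = np.stack([cross_attention(x, c)
                     for c in (on_cond, off_cond, speech_cond)])  # (3, T, d)
    gates = softmax(x @ gate_w, axis=-1)                          # (T, 3)
    return np.einsum("etd,te->td", outs, gates)
```

The per-token gate is what lets the model lean on the speech condition during dialogue frames and on the environmental conditions elsewhere.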
[651] Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Main category: cs.SD
TL;DR: High-resolution (44.1 kHz) audio analysis for singing voice deepfake detection, using joint fullband-subband modeling to capture both global context and fine-grained synthesis artifacts across frequency spectrum.
Details
Motivation: Singing voice synthesis advances increase unauthorized imitation risks, creating urgent need for better Singing Voice Deepfake Detection (SVDD). Conventional 16 kHz-sampled detectors discard vital high-frequency information, proving inadequate for singing which contains complex pitch, wide dynamic range, and timbral variations.
Method: Proposes first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. Introduces joint fullband-subband modeling framework: fullband captures global context while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across spectrum.
Result: Experiments on WildSVDD dataset demonstrate high-frequency subbands provide essential complementary cues. Framework significantly outperforms 16 kHz-sampled models, proving high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Conclusion: High-resolution audio analysis and joint fullband-subband modeling are essential for effective singing voice deepfake detection, addressing limitations of conventional approaches that discard crucial high-frequency information.
Abstract: Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
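The fullband-subband split can be sketched as follows. The summary-by-mean "experts" and the function name are stand-ins for illustration; the paper's actual expert networks are not described in the abstract:

```python
import numpy as np

def fullband_subband_features(spec, n_bands=4):
    # spec: (freq, time) magnitude spectrogram of 44.1 kHz audio.
    # The fullband branch sees the whole spectrum (global context);
    # each subband expert sees one contiguous frequency slice, so the
    # high bands -- which 16 kHz pipelines discard -- get their own view.
    fullband = spec.mean(axis=0)                  # (time,)
    bands = np.array_split(spec, n_bands, axis=0)
    subband = [b.mean(axis=0) for b in bands]     # one summary per expert
    return np.stack([fullband, *subband])         # (1 + n_bands, time)
```

In a real detector each band would feed a learned expert rather than a mean, but the routing of frequency slices is the same.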
[652] FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, Lei Xie
Main category: cs.SD
TL;DR: FastTurn: A low-latency turn detection framework for real-time full-duplex spoken dialogue systems that combines streaming CTC decoding with acoustic features for early decisions while preserving semantic understanding.
Details
Motivation: Existing full-duplex spoken dialogue systems either rely on voice activity detection (lacking semantic understanding) or ASR-based modules (introducing latency and degrading under overlapping speech/noise). There's also a lack of realistic datasets capturing authentic interaction dynamics.
Method: Proposes FastTurn framework combining streaming CTC decoding with acoustic features to enable early decisions from partial observations while preserving semantic cues. Also releases a test set based on real human dialogue capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and noise.
Result: FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions.
Conclusion: FastTurn is effective for practical full-duplex dialogue systems, reducing latency while maintaining performance by making early turn decisions from partial speech observations.
Abstract: Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate these problems, we propose FastTurn, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
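The "early decisions from partial observations" idea rests on the fact that greedy CTC decoding can be re-run on whatever frames have arrived so far. A minimal sketch (standard CTC collapse; the streaming framing around it is an assumption for illustration):

```python
import numpy as np

def streaming_ctc_greedy(frame_logits, blank=0):
    # Greedy CTC decode over the frames received so far: argmax per
    # frame, collapse consecutive repeats, drop blanks. Calling this on
    # a growing prefix yields the partial hypothesis from which an
    # early turn decision can be made before the utterance ends.
    ids = frame_logits.argmax(axis=-1)
    out, prev = [], blank
    for i in ids:
        if i != blank and i != prev:
            out.append(int(i))
        prev = i
    return out
```

Because the collapse is strictly left-to-right, decoding a prefix never requires future frames, which is what makes the semantic cue available at low latency.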
[653] Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation
Ilyass Moummad, Marius Miron, Lukas Rauch, David Robinson, Alexis Joly, Olivier Pietquin, Emmanuel Chemla, Matthieu Geist
Main category: cs.SD
TL;DR: Audio-to-image retrieval for bioacoustic species recognition using text as semantic intermediary, distilling visual semantics from image-text model into audio-text model without audio-image supervision.
Details
Motivation: Audio-to-image retrieval offers interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to scarcity of paired audio-image data.
Method: Uses text as semantic intermediary: distills text embedding space of pretrained image-text model (BioCLIP-2) into pretrained audio-text model (BioLingual) by fine-tuning audio encoder with contrastive objective, transferring visually grounded semantics into audio representation without using images during training.
Result: Distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. On SSW60 benchmark, achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings.
Conclusion: Indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
Abstract: Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
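The contrastive distillation objective can be sketched with a symmetric InfoNCE loss between audio embeddings and the frozen text embeddings. Temperature, normalization, and the exact loss form are assumptions; the abstract only states "contrastive objective":

```python
import numpy as np

def distill_contrastive_loss(audio_emb, text_emb, temp=0.07):
    # audio_emb, text_emb: (N, d), row i of each is a matched pair.
    # Symmetric InfoNCE pulls each audio embedding toward its species'
    # text embedding (from the frozen image-text model) and away from
    # the other species' embeddings -- no images needed at train time.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temp
    n = len(a)
    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return 0.5 * (ce(logits) + ce(logits.T))
```

Since only the audio encoder is trainable, minimizing this loss moves audio embeddings into the text space where image embeddings already live, which is the "indirect" alignment the paper evaluates.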
[654] When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus
Kirill Borodin, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Grach Mkrtchian
Main category: cs.SD
TL;DR: LRLspoof is a large-scale multilingual synthetic speech corpus for cross-lingual spoof detection, containing 2,732 hours of audio from 24 TTS systems across 66 languages, with evaluation showing model-dependent cross-lingual disparity in spoof rejection rates.
Details
Motivation: There's a need for cross-lingual spoof detection evaluation, particularly for low-resource languages, to understand how spoof detection models perform across different languages and to address language as a source of domain shift in synthetic speech detection.
Method: Created LRLspoof corpus with 2,732 hours of synthetic speech from 24 open-source TTS systems across 66 languages (45 low-resource). Evaluated 11 publicly available countermeasures using threshold transfer: calibrating EER operating points on external benchmarks and applying resulting thresholds to report spoof rejection rates (SRR).
Result: Results show model-dependent cross-lingual disparity, with spoof rejection rates varying significantly across languages even under controlled conditions. This highlights language as an independent source of domain shift in spoof detection.
Conclusion: Language significantly impacts spoof detection performance, creating domain shift challenges. The LRLspoof dataset enables cross-lingual spoof detection research and evaluation, particularly for low-resource languages.
Abstract: We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To evaluate robustness without requiring target-domain bonafide speech, we benchmark 11 publicly available countermeasures using threshold transfer: for each model we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR). Results show model-dependent cross-lingual disparity, with spoof rejection varying markedly across languages even under controlled conditions, highlighting language as an independent source of domain shift in spoof detection. The dataset is publicly available at HuggingFace (https://huggingface.co/datasets/MTUCI/LRLspoof) and ModelScope (https://modelscope.cn/datasets/lab260/LRLspoof).
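The threshold-transfer protocol is concrete enough to sketch end to end: find the EER operating point on benchmark scores, then apply that fixed threshold to a target language's spoof scores. The brute-force threshold sweep and the higher-score-means-bonafide convention are simplifying assumptions:

```python
import numpy as np

def eer_threshold(bona, spoof):
    # Sweep candidate thresholds over all observed scores; at the EER
    # point the false-accept rate on spoof equals the false-reject rate
    # on bonafide. Convention: higher score = more bonafide-like.
    cands = np.sort(np.concatenate([bona, spoof]))
    best_t, best_gap = cands[0], np.inf
    for t in cands:
        far = np.mean(spoof >= t)   # spoof accepted as bonafide
        frr = np.mean(bona < t)     # bonafide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def spoof_rejection_rate(target_spoof_scores, threshold):
    # Threshold transfer: apply the benchmark-calibrated threshold to a
    # new language's spoof scores -- no target-domain bonafide needed.
    return float(np.mean(target_spoof_scores < threshold))
```

The appeal of SRR here is exactly that the target side contributes only spoofed audio, which LRLspoof supplies for all 66 languages.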
[655] IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA)
Yassine El Kheir, Amit Meghanani, Mostafa Shahin, Omnia Ibrahim, Shammur Absar Chowdhury, Nada AlMarwani, Youssef Elshahawy, Ahmed Ali
Main category: cs.SD
TL;DR: The second edition of the IQRA Interspeech Challenge introduces a new dataset of authentic mispronunciations in Modern Standard Arabic and reports a substantial improvement in automatic mispronunciation detection and diagnosis, with submitted systems spanning CTC-based SSL models and large audio-language models.
Details
Motivation: To advance automatic Mispronunciation Detection and Diagnosis (MDD) for Modern Standard Arabic by providing new authentic mispronunciation data and evaluating diverse approaches to improve Arabic pronunciation assessment.
Method: The challenge introduces Iqra_Extra_IS26 dataset of authentic human mispronounced speech. Submitted systems used diverse approaches including CTC-based self-supervised learning models, two-stage fine-tuning strategies, and large audio-language models.
Result: Substantial improvement of 0.28 in F1-score compared to the first edition, demonstrating effectiveness of novel architectures and additional authentic mispronunciation data.
Conclusion: The results show growing maturity of Arabic MDD research and establish a stronger foundation for future work in Arabic pronunciation assessment.
Abstract: We present the findings of the second edition of the IQRA Interspeech Challenge, a challenge on automatic Mispronunciation Detection and Diagnosis (MDD) for Modern Standard Arabic (MSA). Building on the previous edition, this iteration introduces Iqra_Extra_IS26, a new dataset of authentic human mispronounced speech, complementing the existing training and evaluation resources. Submitted systems employed a diverse range of approaches, spanning CTC-based self-supervised learning models, two-stage fine-tuning strategies, and large audio-language models. Compared to the first edition, we observe a substantial jump of 0.28 in F1-score, attributable both to novel architectures and modeling strategies proposed by participants and to the additional authentic mispronunciation data made available. These results demonstrate the growing maturity of Arabic MDD research and establish a stronger foundation for future work in Arabic pronunciation assessment.
cs.LG
[656] Integrating Artificial Intelligence, Physics, and Internet of Things: A Framework for Cultural Heritage Conservation
Carmine Valentino, Federico Pichi, Francesco Colace, Dajana Conte, Gianluigi Rozza
Main category: cs.LG
TL;DR: A framework combining IoT, AI, and physics-informed neural networks (PINNs) for cultural heritage preservation through 3D model analysis and degradation simulation.
Details
Motivation: To address the need for effective monitoring and predictive maintenance of cultural heritage by integrating technological innovation with domain expertise, combining data-driven and physics-based approaches.
Method: Four-layer framework integrating IoT and AI with physics knowledge, using Physics-Informed Neural Networks (PINNs) to incorporate physical laws, Reduced Order Methods (POD) for efficiency, and tools for 3D model processing and simulation.
Result: A reproducible framework tested on complex real-life geometries that can handle both direct and inverse problems in cultural heritage degradation modeling.
Conclusion: The framework successfully combines data-driven and physics-based approaches for cultural heritage conservation, offering efficient simulation of degradation processes through the integration of PINNs with reduced order methods.
Abstract: The conservation of cultural heritage increasingly relies on integrating technological innovation with domain expertise to ensure effective monitoring and predictive maintenance. This paper presents a novel framework to support the preservation of cultural assets, combining Internet of Things (IoT) and Artificial Intelligence (AI) technologies, enhanced with the physical knowledge of phenomena. The framework is structured into four functional layers that permit the analysis of 3D models of cultural assets and elaborate simulations based on the knowledge acquired from data and physics. A central component of the proposed framework consists of Scientific Machine Learning, particularly Physics-Informed Neural Networks (PINNs), which incorporate physical laws into deep learning models. To enhance computational efficiency, the framework also integrates Reduced Order Methods (ROMs), specifically Proper Orthogonal Decomposition (POD), and is also compatible with classical Finite Element (FE) methods. Additionally, it includes tools to automatically manage and process 3D digital replicas, enabling their direct use in simulations. The proposed approach offers three main contributions: a methodology for processing 3D models of cultural assets for reliable simulation; the application of PINNs to combine data-driven and physics-based approaches in cultural heritage conservation; and the integration of PINNs with ROMs to efficiently model degradation processes influenced by environmental and material parameters. The reproducible and open-access experimental phase exploits simulated scenarios on complex and real-life geometries to test the efficacy of the proposed framework in each of its key components, allowing the possibility of dealing with both direct and inverse problems. Code availability: https://github.com/valc89/PhysicsInformedCulturalHeritage
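The core PINN idea, a data-fit term plus a physics-residual term in one objective, can be sketched on a 1-D grid. This uses a finite-difference residual for the steady heat equation u'' = 0 as a stand-in for the autodiff residual a real PINN would use; the equation choice and all names are illustrative assumptions, not the authors' degradation model:

```python
import numpy as np

def pinn_style_loss(u, x, u_data, data_idx):
    # Physics-informed objective on a uniform 1-D grid:
    #   data term  -- squared error against observations at data_idx;
    #   physics term -- squared residual of u'' = 0 via central
    #                   finite differences (autodiff stand-in).
    h = x[1] - x[0]
    residual = (u[2:] - 2 * u[1:-1] + u[:-2]) / h**2
    loss_phys = np.mean(residual**2)
    loss_data = np.mean((u[data_idx] - u_data)**2)
    return loss_data + loss_phys
```

A linear temperature profile satisfies u'' = 0 exactly, so with consistent boundary observations both terms vanish; training a PINN amounts to driving a network's output toward such a joint minimum.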
[657] Scaling DPPs for RAG: Density Meets Diversity
Xun Sun, Baiheng Xie, Li Huang, Qiang Gao
Main category: cs.LG
TL;DR: ScalDPP introduces a diversity-aware retrieval mechanism for RAG using Determinantal Point Processes to optimize for both density and diversity in retrieved contexts, addressing redundancy issues in standard RAG pipelines.
Details
Motivation: Standard RAG pipelines use point-wise scoring between queries and corpus chunks, ignoring interactions among retrieved candidates. This leads to redundant contexts that dilute information density and fail to surface complementary evidence. The authors argue effective retrieval should optimize jointly for both density and diversity.
Method: Proposes ScalDPP, a diversity-aware retrieval mechanism incorporating Determinantal Point Processes (DPPs) through a lightweight P-Adapter for scalable modeling of inter-chunk dependencies. Also develops Diverse Margin Loss (DML), a set-level objective that enforces ground-truth complementary evidence chains to dominate redundant alternatives under DPP geometry.
Result: Experimental results demonstrate the superiority of ScalDPP, substantiating the core claim that optimizing for both density and diversity improves retrieval effectiveness in RAG systems.
Conclusion: ScalDPP successfully addresses redundancy issues in standard RAG by incorporating diversity-aware retrieval through DPPs, leading to more effective context selection that balances information density with coverage diversity.
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge, yielding relevant responses that are aligned with factual evidence and evolving corpora. Standard RAG pipelines construct context through relevance ranking, performing point-wise scoring between the user query and each corpus chunk. This formulation, however, ignores interactions among retrieved candidates, leading to redundant contexts that dilute density and fail to surface complementary evidence. We argue that effective retrieval should optimize jointly for both density and diversity, ensuring grounding evidence that is dense in information yet diverse in coverage. In this study, we propose ScalDPP, a diversity-aware retrieval mechanism for RAG that incorporates Determinantal Point Processes (DPPs) through a lightweight P-Adapter, enabling scalable modeling of inter-chunk dependencies and complementary context selection. In addition, we develop a novel set-level objective, Diverse Margin Loss (DML), that enforces ground-truth complementary evidence chains to dominate any equally sized redundant alternatives under DPP geometry. Experimental results demonstrate the superiority of ScalDPP, substantiating our core statement in practice.
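The DPP machinery ScalDPP builds on can be illustrated with standard greedy MAP selection over a relevance-times-similarity kernel. This sketch omits the paper's P-Adapter and Diverse Margin Loss entirely; the toy embeddings and relevance scores are invented for illustration.

```python
import numpy as np

def greedy_dpp_map(L, k):
    """Greedy MAP inference for a DPP with kernel L: repeatedly add the
    item that maximizes the log-determinant of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        selected.append(best)
    return selected

# Toy corpus: chunks 0-2 are near-duplicates of the most relevant passage,
# chunk 3 is less relevant but complementary. Kernel L = diag(q) S diag(q)
# mixes per-chunk relevance q ("density") with cosine similarity S, so the
# determinant rewards sets that are both relevant and mutually diverse.
emb = np.array([[1.00, 0.00],
                [0.99, 0.14],
                [0.98, 0.20],
                [0.00, 1.00]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
q = np.array([1.00, 0.98, 0.97, 0.60])
L = q[:, None] * (emb @ emb.T) * q[None, :]

print(sorted(greedy_dpp_map(L, 2)))   # → [0, 3]
```

Point-wise relevance ranking would return the redundant pair [0, 1]; the determinant objective instead pairs the top chunk with the complementary one, which is the redundancy failure mode the abstract describes.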
[658] DRAFT: Task Decoupled Latent Reasoning for Agent Safety
Lin Wang, Junfeng Fang, Dan Zhang, Fei Shen, Xiang Wang, Tat-Seng Chua
Main category: cs.LG
TL;DR: DRAFT is a latent reasoning framework for LLM agent safety monitoring that decouples safety judgment into two stages: an Extractor that distills long interaction trajectories into compact latent drafts, and a Reasoner that jointly attends to both to predict safety.
Details
Motivation: Tool-using LLM agents create long, noisy interaction trajectories where risk-critical evidence is sparse, making standard binary supervision poorly suited for credit assignment in safety monitoring.
Method: Proposes DRAFT framework with two trainable stages: 1) Extractor distills full trajectory into compact continuous latent draft, 2) Reasoner jointly attends to draft and original trajectory to predict safety, avoiding lossy explicit summarize-then-judge pipelines.
Result: Outperforms strong baselines across benchmarks including ASSEBench and R-Judge, improving accuracy from 63.27% (LoRA) to 91.18% averaged over benchmarks, and learns more separable representations.
Conclusion: Continuous latent reasoning prior to readout is a practical path to robust agent safety under long-context supervision with sparse evidence.
Abstract: The advent of tool-using LLM agents shifts safety monitoring from output moderation to auditing long, noisy interaction trajectories, where risk-critical evidence is sparse, making standard binary supervision poorly suited for credit assignment. To address this, we propose DRAFT (Task Decoupled Latent Reasoning for Agent Safety), a latent reasoning framework that decouples safety judgment into two trainable stages: an Extractor that distills the full trajectory into a compact continuous latent draft, and a Reasoner that jointly attends to the draft and the original trajectory to predict safety. DRAFT avoids lossy explicit summarize-then-judge pipelines by performing evidence aggregation in latent space, enabling end-to-end differentiable training. Across benchmarks including ASSEBench and R-Judge, DRAFT consistently outperforms strong baselines, improving accuracy from 63.27% (LoRA) to 91.18% averaged over benchmarks, and learns more separable representations. Ablations demonstrate a clear synergy between the Extractor and the Reasoner. Overall, DRAFT suggests that continuous latent reasoning prior to readout is a practical path to robust agent safety under long-context supervision with sparse evidence.
[659] General Explicit Network (GEN): A novel deep learning architecture for solving partial differential equations
Genwei Ma, Ting Luo, Ping Yang, Xing Zhao
Main category: cs.LG
TL;DR: Proposes a general explicit network (GEN) for PDE solving using point-to-function approach with basis functions, improving robustness and extensibility over traditional PINNs.
Details
Motivation: Current physics-informed neural networks (PINNs) have limited real-world deployment due to discrete point-to-point fitting, poor extensibility, and lack of consideration for real solution properties. Traditional PINNs use continuous activation functions that create local characteristics matching equation solutions but result in poor robustness and extensibility.
Method: Proposes a general explicit network (GEN) that implements point-to-function PDE solving. The “function” component is constructed using basis functions based on prior knowledge of the original PDEs, allowing for better fitting of solution properties.
Result: Experimental results demonstrate that the GEN approach enables solutions with high robustness and strong extensibility compared to traditional PINN methods.
Conclusion: The point-to-function approach using basis functions provides a more robust and extensible framework for PDE solving than traditional PINNs, addressing limitations in real-world deployment.
Abstract: Machine learning, especially physics-informed neural networks (PINNs) and their neural network variants, has been widely used to solve problems involving partial differential equations (PDEs). However, the successful deployment of such methods beyond academic research remains limited. For example, PINN methods primarily consider discrete point-to-point fitting and fail to account for the potential properties of real solutions. The adoption of continuous activation functions in these approaches leads to local characteristics that align with the equation solutions while resulting in poor extensibility and robustness. A general explicit network (GEN) that implements point-to-function PDE solving is proposed in this paper. The “function” component can be constructed based on our prior knowledge of the original PDEs through corresponding basis functions for fitting. The experimental results demonstrate that this approach enables solutions with high robustness and strong extensibility to be obtained.
[660] Apparent Age Estimation: Challenges and Outcomes
Justin Rainier Go, Lorenz Bernard Marqueses, Mikaella Kaye Martinez, John Kevin Patrick Sarmiento, Abien Fred Agarap
Main category: cs.LG
TL;DR: Analysis of apparent age estimation methods reveals persistent demographic biases despite technical improvements, with Asian and African American populations experiencing significant performance degradation.
Details
Motivation: Apparent age estimation is valuable for business personalization, but current models exhibit demographic biases. The research aims to evaluate fairness and accuracy trade-offs in existing distribution learning methods for age estimation.
Method: Review of prior DEX method with distribution learning techniques (Mean-Variance Loss and Adaptive Mean-Residue Loss), evaluated on IMDB-WIKI, APPA-REAL, and FairFace datasets. Analysis includes UMAP embeddings for age clustering and saliency maps for feature focus examination.
Result: AMRL achieves state-of-the-art accuracy but trade-offs between precision and demographic equity persist. Clear age clustering in UMAP embeddings, but saliency maps show inconsistent feature focus across demographics, leading to significant performance degradation for Asian and African American populations.
Conclusion: Technical improvements alone are insufficient for fair apparent age estimation. Accurate and fair estimation requires integration of localized and diverse datasets, and strict adherence to fairness validation protocols.
Abstract: Apparent age estimation is a valuable tool for business personalization, yet current models frequently exhibit demographic biases. We review prior works on the DEX method by applying distribution learning techniques such as Mean-Variance Loss (MVL) and Adaptive Mean-Residue Loss (AMRL), and evaluate them in both accuracy and fairness. Using IMDB-WIKI, APPA-REAL, and FairFace, we demonstrate that while AMRL achieves state-of-the-art accuracy, trade-offs between precision and demographic equity persist. Despite clear age clustering in UMAP embeddings, our saliency maps indicate inconsistent feature focus across demographics, leading to significant performance degradation for Asian and African American populations. We argue that technical improvements alone are insufficient; accurate and fair apparent age estimation requires the integration of localized and diverse datasets, and strict adherence to fairness validation protocols.
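The Mean-Variance Loss named above rewards a predicted age distribution that is both centered on the label and sharply peaked. A simplified numpy sketch follows; the weighting constants and the omission of the accompanying classification term are assumptions, not the exact loss used in the paper.

```python
import numpy as np

# Simplified Mean-Variance Loss for distribution-learning age estimation:
# given per-sample probabilities over discrete age bins, penalize the
# squared gap between the distribution's mean and the true age, plus the
# distribution's own variance (sharpness). Weights lam_m/lam_v are assumed.
def mean_variance_loss(probs, ages, y_true, lam_m=0.2, lam_v=0.05):
    mean = probs @ ages                      # expected age per sample
    var = probs @ (ages**2) - mean**2        # variance of each distribution
    return np.mean(lam_m * (mean - y_true)**2 + lam_v * var)

ages = np.arange(0, 101, dtype=float)        # age bins 0..100
logits = np.zeros((1, 101))
logits[0, 30] = 8.0                          # sharply peaked at the label, 30
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

sharp = mean_variance_loss(probs, ages, np.array([30.0]))
flat = mean_variance_loss(np.full((1, 101), 1 / 101), ages, np.array([30.0]))
print(sharp < flat)   # → True
```

A uniform (maximally uncertain) distribution is penalized far more than a confident, correctly centered one, which is the behavior the distribution-learning losses in this line of work are designed to induce.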
[661] NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
Maharshi Savdhariya
Main category: cs.LG
TL;DR: NativeTernary is a binary encoding scheme for ternary LLMs that uses 2-bit pairs to represent ternary values with structural delimiters for hierarchical boundaries, enabling efficient storage and transmission without hardware changes.
Details
Motivation: While BitNet b1.58 shows LLMs can operate on ternary weights {-1, 0, +1}, there's no native binary wire format for such models. NativeTernary addresses this gap by creating an efficient encoding scheme for ternary neural networks.
Method: Uses 2-bit pair space partitioning into three data symbols for ternary values and a reserved structural delimiter. Employs unary run-length encoding to represent semantic hierarchy depth, where N consecutive delimiter pairs denote boundaries at different levels (character, word, sentence, etc.). Offers multiple encoding variants with different delimiter choices and ternary mappings.
Result: Creates a practical binary encoding scheme for ternary LLMs with proportional bit costs for hierarchical boundaries (2-10 bits based on rarity). The decoder is a simple 10-line stateless state machine resilient to bitstream corruption.
Conclusion: NativeTernary enables ternary-native computing infrastructure without hardware changes, with applications spanning neural network weight storage, hierarchical language encoding, edge computing, IoT, and various embedded systems.
Abstract: BitNet b1.58 (Ma et al., 2024) demonstrates that large language models can operate entirely on ternary weights {-1, 0, +1}, yet no native binary wire format exists for such models. NativeTernary closes this gap. We present NativeTernary, a binary encoding scheme that partitions the 2-bit pair space into three data symbols representing ternary values – either balanced {-1, 0, +1} or unsigned {0, 1, 2} – and a reserved structural delimiter. The central contribution is the use of unary run-length encoding to represent semantic hierarchy depth: a sequence of N consecutive delimiter pairs denotes a boundary of level N, encoding character, word, sentence, paragraph, and topic boundaries at cost 2, 4, 6, 8, and 10 bits respectively – proportional to boundary rarity. The choice of which 2-bit pair serves as the delimiter is a design parameter: {11} is the primary embodiment, offering simple OR-gate detection; {00} is an alternative embodiment optimised for ultra-low-power CMOS systems, minimising switching activity. All four bit-pair choices are covered by the patent claims. We present three encoding variants: (1) the primary scheme with {11} as sole delimiter; (2) a dual-starter variant where both {10} and {11} initiate distinct symbol namespaces; and (3) an analysis of unsigned versus balanced ternary data mappings. We describe a path toward ternary-native general computing infrastructure requiring no hardware changes, and outline applications spanning ternary neural network weight storage, hierarchical natural language encoding, edge computing, IoT and satellite telemetry, industrial sensors, automotive systems, medical devices, gaming, and financial tick data. The decoder is a 10-line stateless state machine resilient to bitstream corruption.
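The scheme is concrete enough to prototype in a few lines. This toy codec assumes a specific value-to-pair mapping (00 → -1, 01 → 0, 10 → +1); the abstract only fixes {11} as the primary delimiter, so that assignment is an illustrative choice.

```python
# Toy codec for the described scheme: three 2-bit data symbols for balanced
# ternary {-1, 0, +1} and the reserved pair "11" as delimiter, where N
# consecutive delimiter pairs mark a level-N boundary at a cost of 2N bits
# (matching the abstract's 2/4/6/8/10-bit hierarchy costs).
DATA = {-1: "00", 0: "01", 1: "10"}          # assumed value-to-pair mapping
VALUE = {bits: value for value, bits in DATA.items()}

def encode(tokens):
    """tokens: ternary ints in {-1, 0, 1} or ("boundary", N) markers."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            out.append("11" * tok[1])        # unary run-length: N delimiter pairs
        else:
            out.append(DATA[tok])
    return "".join(out)

def decode(bits):
    """Stateless pair-by-pair scan, in the spirit of the paper's tiny decoder."""
    tokens, run = [], 0
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]
        if pair == "11":
            run += 1                          # extend the current delimiter run
        else:
            if run:
                tokens.append(("boundary", run))
                run = 0
            tokens.append(VALUE[pair])
    if run:
        tokens.append(("boundary", run))
    return tokens

msg = [1, 0, -1, ("boundary", 2), 1, 1, ("boundary", 1), -1]
wire = encode(msg)
print(len(wire), decode(wire) == msg)   # → 18 True
```

The round trip shows the self-delimiting property: boundaries of any level decode unambiguously from delimiter run length alone, with no side-channel length fields.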
[662] Towards Intelligent Energy Security: A Unified Spatio-Temporal and Graph Learning Framework for Scalable Electricity Theft Detection in Smart Grids
AbdulQoyum A. Olowookere, Usman A. Oguntola, Ebenezer Leke Odekanle, Maridiyah A. Madehin, Aisha A. Adesope
Main category: cs.LG
TL;DR: SGEIS is an AI framework combining machine learning, deep learning, NILM, and graph neural networks for electricity theft detection and energy monitoring in smart grids.
Details
Motivation: Electricity theft and non-technical losses cause significant economic losses and grid reliability issues in smart grids, requiring advanced detection systems.
Method: Integrated framework using supervised ML, deep learning (LSTM, TCN, Autoencoders), ensemble methods (Random Forest, Gradient Boosting, XGBoost, LightGBM), Graph Neural Networks for spatial dependencies, and Non-Intrusive Load Monitoring for appliance-level disaggregation.
Result: Gradient Boosting achieved ROC-AUC of 0.894, graph-based models attained over 96% accuracy in identifying high-risk nodes, with strong overall performance in theft detection.
Conclusion: SGEIS provides a scalable, practical solution for electricity theft detection with high accuracy, improved interpretability, and strong potential for real-world smart grid deployment.
Abstract: Electricity theft and non-technical losses (NTLs) remain critical challenges in modern smart grids, causing significant economic losses and compromising grid reliability. This study introduces the SmartGuard Energy Intelligence System (SGEIS), an integrated artificial intelligence framework for electricity theft detection and intelligent energy monitoring. The proposed system combines supervised machine learning, deep learning-based time-series modeling, Non-Intrusive Load Monitoring (NILM), and graph-based learning to capture both temporal and spatial consumption patterns. A comprehensive data processing pipeline is developed, incorporating feature engineering, multi-scale temporal analysis, and rule-based anomaly labeling. Deep learning models, including Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Autoencoders, are employed to detect abnormal usage patterns. In parallel, ensemble learning methods such as Random Forest, Gradient Boosting, XGBoost, and LightGBM are utilized for classification. To model grid topology and spatial dependencies, Graph Neural Networks (GNNs) are applied to identify correlated anomalies across interconnected nodes. The NILM module enhances interpretability by disaggregating appliance-level consumption from aggregate signals. Experimental results demonstrate strong performance, with Gradient Boosting achieving a ROC-AUC of 0.894, while graph-based models attain over 96% accuracy in identifying high-risk nodes. The hybrid framework improves detection robustness by integrating temporal, statistical, and spatial intelligence. Overall, SGEIS provides a scalable and practical solution for electricity theft detection, offering high accuracy, improved interpretability, and strong potential for real-world smart grid deployment.
[663] Hardware-Oriented Inference Complexity of Kolmogorov-Arnold Networks
Bilal Khalid, Pedro Freire, Sergei K. Turitsyn, Jaroslaw E. Prilepsky
Main category: cs.LG
TL;DR: Proposes platform-independent hardware complexity metrics (RM, BOP, NABS) for evaluating Kolmogorov-Arnold Networks (KANs) inference across variants, enabling fair cross-architecture comparisons without full hardware synthesis.
Details
Motivation: Existing KAN complexity evaluations focus on FLOPs for GPU training/inference or platform-specific hardware metrics that require full design/synthesis, limiting early-stage architectural decisions and cross-platform comparisons for latency-sensitive, power-constrained deployment scenarios.
Method: Derives generalized, platform-independent formulae for hardware inference complexity of KANs in terms of Real Multiplications (RM), Bit Operations (BOP), and Number of Additions and Bit-Shifts (NABS). Extends analysis across multiple KAN variants including B-spline, Gaussian RBF, Chebyshev, and Fourier KANs.
Result: Proposed metrics can be computed directly from network structure and enable fair, straightforward inference complexity comparison between KAN and other neural network architectures without requiring full hardware design and synthesis.
Conclusion: Provides practical platform-independent complexity metrics for KAN hardware inference evaluation, facilitating early-stage architectural decisions and cross-platform comparisons in latency-sensitive, power-constrained deployment scenarios.
Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a powerful architecture for various machine learning applications. However, their unique structure raises significant concerns regarding their computational overhead. Existing studies primarily evaluate KAN complexity in terms of Floating-Point Operations (FLOPs) required for GPU-based training and inference. However, in many latency-sensitive and power-constrained deployment scenarios, such as neural network-driven non-linearity mitigation in optical communications or channel state estimation in wireless communications, training is performed offline and dedicated hardware accelerators are preferred over GPUs for inference. Recent hardware implementation studies report KAN complexity using platform-specific resource consumption metrics, such as Look-Up Tables, Flip-Flops, and Block RAMs. However, these metrics require a full hardware design and synthesis stage that limits their utility for early-stage architectural decisions and cross-platform comparisons. To address this, we derive generalized, platform-independent formulae for evaluating the hardware inference complexity of KANs in terms of Real Multiplications (RM), Bit Operations (BOP), and Number of Additions and Bit-Shifts (NABS). We extend our analysis across multiple KAN variants, including B-spline, Gaussian Radial Basis Function (GRBF), Chebyshev, and Fourier KANs. The proposed metrics can be computed directly from the network structure and enable a fair and straightforward inference complexity comparison between KAN and other neural network architectures.
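The flavor of structure-derived counting can be shown with a back-of-envelope Real Multiplication (RM) tally. The formulae below are simplified assumptions (coefficient multiplies only; basis-function evaluation, additions, and bit-level BOP/NABS costs are ignored), not the paper's derivations.

```python
# Back-of-envelope Real Multiplication (RM) counters in the spirit of the
# platform-independent metrics described. Simplified assumptions: we count
# only coefficient multiplies, ignoring basis-function evaluation cost.
def rm_dense(n_in, n_out):
    # standard fully connected layer: one real multiply per weight
    return n_in * n_out

def rm_fourier_kan(n_in, n_out, grid):
    # each of the n_in*n_out edges evaluates sum_k a_k*cos(kx) + b_k*sin(kx)
    # over a grid of size `grid`: 2*grid coefficient multiplies per edge
    return n_in * n_out * 2 * grid

print(rm_dense(64, 64), rm_fourier_kan(64, 64, 5))   # → 4096 40960
```

Even this crude count, computable directly from the network structure, captures the paper's point: per-edge basis expansions multiply inference cost relative to a plain dense layer, and the comparison needs no hardware synthesis.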
[664] From Model-Based Screening to Data-Driven Surrogates: A Multi-Stage Workflow for Exploring Stochastic Agent-Based Models
Paul Saves, Matthieu Mastio, Nicolas Verstaevel, Benoit Gaudou
Main category: cs.LG
TL;DR: A multi-stage pipeline combining experimental design with ML surrogates for systematic exploration of high-dimensional, stochastic Agent-Based Models, demonstrated on a predator-prey case study.
Details
Motivation: Agent-Based Models face challenges due to high dimensionality and inherent stochasticity, making systematic exploration difficult. There's a need for automated, rigorous frameworks for sensitivity analysis and policy testing in complex simulation environments.
Method: Two-step pipeline: 1) Automated model-based screening identifies dominant variables, assesses outcome variability, and segments parameter space; 2) Machine Learning models trained to map remaining nonlinear interaction effects between variables.
Result: The approach automates discovery of unstable regions where system outcomes depend heavily on nonlinear interactions between many variables, providing a hands-off framework for sensitivity analysis.
Conclusion: Provides modelers with a rigorous, automated framework for exploring high-dimensional stochastic simulators, enabling systematic sensitivity analysis and policy testing even in complex ABM environments.
Abstract: Systematic exploration of Agent-Based Models (ABMs) is challenged by the curse of dimensionality and their inherent stochasticity. We present a multi-stage pipeline integrating the systematic design of experiments with machine learning surrogates. Using a predator-prey case study, our methodology proceeds in two steps. First, an automated model-based screening identifies dominant variables, assesses outcome variability, and segments the parameter space. Second, we train Machine Learning models to map the remaining nonlinear interaction effects. This approach automates the discovery of unstable regions where system outcomes are highly dependent on nonlinear interactions between many variables. Thus, this work provides modelers with a rigorous, hands-off framework for sensitivity analysis and policy testing, even when dealing with high-dimensional stochastic simulators.
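The two-step workflow can be miniaturized to show the shape of the pipeline. The "simulator" below is an invented stochastic function standing in for the predator-prey ABM, and correlation screening is a deliberately crude stand-in for the paper's model-based screening stage.

```python
import numpy as np

# Compact sketch of the two-step workflow: (1) screen parameters with a
# crude sensitivity measure, (2) fit a surrogate on the retained ones.
rng = np.random.default_rng(1)

def simulator(x):
    # stand-in stochastic simulator with 4 inputs; only x0 (linear) and
    # x1 (nonlinear) actually influence the output
    return 3 * x[0] + np.sin(4 * x[1]) + 0.01 * rng.normal()

X = rng.uniform(0, 1, size=(500, 4))         # design of experiments
y = np.array([simulator(x) for x in X])

# Step 1: screening -- absolute correlation of each parameter with the output.
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(4)])
active = np.argsort(scores)[::-1][:2]        # keep the 2 dominant parameters

# Step 2: surrogate -- quadratic least-squares fit on the active parameters.
Z = X[:, active]
features = np.column_stack([np.ones(len(Z)), Z, Z**2, Z[:, 0] * Z[:, 1]])
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
pred = features @ coef
r2 = 1 - np.sum((y - pred)**2) / np.sum((y - y.mean())**2)
print(sorted(active.tolist()), round(r2, 2))
```

Screening correctly isolates the two influential parameters, after which a cheap surrogate explains nearly all output variance; the paper's contribution is doing this rigorously at ABM scale, with model-based screening and stronger ML surrogates.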
[665] The limits of bio-molecular modeling with large language models: a cross-scale evaluation
Yaxin Xu, Yue Zhou, Tianyu Zhao, Fengwei An, Zhixiang Ren
Main category: cs.LG
TL;DR: BioMol-LLM-Bench: A cross-scale benchmark for evaluating LLMs on 26 bio-molecular tasks across 4 difficulty levels, revealing gaps in mechanistic understanding and providing practical guidance for molecular system modeling.
Details
Motivation: There's a need for systematic evaluation of LLMs across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities in bio-molecular discovery, as current evaluations are limited.
Method: Created BioMol-LLM-Bench, a unified framework with 26 downstream tasks covering 4 difficulty levels, integrating computational tools for comprehensive evaluation. Evaluated 13 representative models.
Result: Four main findings: 1) Chain-of-thought data provides limited benefit and may reduce performance on biological tasks; 2) Hybrid mamba-attention architectures are more effective for long bio-molecular sequences; 3) Supervised fine-tuning improves specialization at cost of generalization; 4) LLMs perform well on classification but weak on challenging regression tasks.
Conclusion: The benchmark reveals systematic gaps between LLM performance and mechanistic understanding in bio-molecular systems, providing practical guidance for future LLM-based molecular modeling approaches.
Abstract: The modeling of bio-molecular systems across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to bio-molecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through the proposed cross-scale bio-molecular benchmark: BioMol-LLM-Bench, a unified framework comprising 26 downstream tasks that covers 4 distinct difficulty levels, and computational tools are integrated for a more comprehensive evaluation. Evaluation on 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid mamba-attention architectures are more effective for long bio-molecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.
[666] Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters
Haotian Xiang, Bingcong Li, Qin Lu
Main category: cs.LG
TL;DR: PoLAR-VBLL: A scalable Bayesian fine-tuning framework that combines orthogonalized low-rank adapters with variational Bayesian last layer to provide well-calibrated uncertainty quantification for LLMs in safety-critical applications.
Details
Motivation: LLMs deployed in safety-critical applications need reliable uncertainty quantification, but current methods suffer from overconfidence after parameter-efficient fine-tuning. Existing approaches have limitations: Laplace approximation yields suboptimal calibration, while variational Bayesian training is computationally expensive for deployment.
Method: Proposes PoLAR-VBLL framework: 1) PoLAR (Polar-decomposed Low-rank Adapter Representation) uses orthogonalized parameterization with Riemannian optimization for stable and expressive adaptation, addressing rank collapse in existing LoRA; 2) Bayesian last layer (BLL) model with deterministic feature extractor followed by random last layer parameters; 3) Variational inference framework for joint optimization of PoLAR parameters and approximate posterior of last layer parameters via alternating optimization.
Result: Empirical results show PoLAR-VBLL effectively improves generalization and uncertainty estimation on both in-distribution and out-of-distribution data for various common-sense reasoning tasks, providing well-calibrated uncertainty quantification.
Conclusion: PoLAR-VBLL integrates architecture-enhanced optimization with scalable Bayesian inference to endow LLMs with reliable uncertainty quantification, addressing limitations of existing methods for safety-critical applications.
Abstract: When deploying large language models (LLMs) to safety-critical applications, uncertainty quantification (UQ) is of utmost importance to self-assess the reliability of the LLM-based decisions. However, such decisions typically suffer from overconfidence, particularly after parameter-efficient fine-tuning (PEFT) for downstream domain-specific tasks with limited data. Existing methods to alleviate this issue either rely on Laplace approximation based post-hoc framework, which may yield suboptimal calibration depending on the training trajectory, or variational Bayesian training that requires multiple complete forward passes through the entire LLM backbone at inference time for Monte Carlo estimation, posing scalability challenges for deployment. To address these limitations, we build on the Bayesian last layer (BLL) model, where the LLM-based deterministic feature extractor is followed by random last layer parameters for uncertainty reasoning. Since existing low-rank adapters (LoRA) for PEFT have limited expressiveness due to rank collapse, we address this with Polar-decomposed Low-rank Adapter Representation (PoLAR), an orthogonalized parameterization paired with Riemannian optimization to enable more stable and expressive adaptation. Building on this PoLAR-BLL model, we leverage the variational (V) inference framework to put forth a scalable Bayesian fine-tuning approach which jointly seeks the PoLAR parameters and approximate posterior of the last layer parameters via alternating optimization. The resulting PoLAR-VBLL is a flexible framework that nicely integrates architecture-enhanced optimization with scalable Bayesian inference to endow LLMs with well-calibrated UQ. Our empirical results verify the effectiveness of PoLAR-VBLL in terms of generalization and uncertainty estimation on both in-distribution and out-of-distribution data for various common-sense reasoning tasks.
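The polar-decomposition idea behind PoLAR can be shown in isolation: the directional factor of a low-rank update is kept with orthonormal columns, avoiding the rank collapse that plagues unconstrained LoRA factors. The exact PoLAR parameterization and its Riemannian updates are not reproduced here; this is a minimal assumed form.

```python
import numpy as np

# Sketch of the orthogonalized low-rank idea: split an unconstrained LoRA
# factor into an orthonormal direction (Stiefel-manifold point) and a
# separate scale via the polar decomposition B_raw = Q P, where Q = U V^T
# from the thin SVD has exactly orthonormal columns.
rng = np.random.default_rng(0)
d, r = 16, 4
B_raw = rng.normal(size=(d, r))              # unconstrained low-rank factor

U, s, Vt = np.linalg.svd(B_raw, full_matrices=False)
Q = U @ Vt                                    # orthogonal polar factor
P = Vt.T @ np.diag(s) @ Vt                    # symmetric positive scale factor

ortho_err = np.linalg.norm(Q.T @ Q - np.eye(r))
recon_err = np.linalg.norm(Q @ P - B_raw)
print(ortho_err, recon_err)
```

Because Q's columns stay orthonormal regardless of how B_raw evolves, the update's effective rank cannot silently collapse; the scale information lives entirely in P, which is the separation the orthogonalized parameterization exploits.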
[667] ModalImmune: Immunity Driven Unlearning via Self Destructive Training
Rong Fu, WeiZhi Tang, Ziming Wang, Jia Yee Tan, Zijian Zhang, Zhaolu Kang, Muge Qi, Shuning Zhang, Simon Fong
Main category: cs.LG
TL;DR: ModalImmune: A training framework that makes multimodal models robust to partial or complete loss of input channels by intentionally collapsing modality information during training.
Details
Motivation: Multimodal systems are vulnerable to partial or complete loss of input channels in real-world deployment, which undermines their reliability. Current models lack robustness when certain modalities become unavailable or corrupted.
Method: The framework enforces modality immunity by intentionally collapsing selected modality information during training. It uses: 1) spectrum-adaptive collapse regularizer, 2) information-gain guided controller for targeted interventions, 3) curvature-aware gradient masking to stabilize destructive updates, and 4) certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation.
Result: Empirical evaluation on standard multimodal benchmarks shows that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
Conclusion: ModalImmune provides an effective training framework for making multimodal models robust to modality failures, enhancing their reliability in real-world settings where input channels may be partially or completely lost.
Abstract: Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
[668] Beauty in the Eye of AI: Aligning LLMs and Vision Models with Human Aesthetics in Network Visualization
Peng Zhang, Xuefeng Li, Xiaoqi Wang, Han-Wei Shen, Yifan Hu
Main category: cs.LG
TL;DR: Using LLMs and vision models as proxies for human judgment in network visualization preferences, achieving alignment comparable to human-human agreement through prompt engineering and confidence filtering.
Details
Motivation: Traditional network visualization relies on heuristic metrics with inconsistent results, while human preference learning is costly and time-consuming. The paper explores using AI models as scalable proxies for human judgment in visualization preferences.
Method: Conducted user study with 27 participants to collect human preference labels for network visualizations. Used this data to bootstrap LLM and vision model labelers through prompt engineering combining few-shot examples and diverse input formats (including image embeddings). Implemented confidence score filtering for LLMs to improve alignment.
Result: Prompt engineering with few-shot examples and image embeddings significantly improved LLM-human alignment. Additional confidence score filtering pushed LLM-human alignment to human-human levels. Carefully trained vision models achieved VM-human alignment comparable to human annotator agreement.
Conclusion: AI models (LLMs and vision models) can feasibly serve as scalable proxies for human labelers in network visualization preference tasks, achieving alignment levels comparable to human-human agreement through proper engineering and filtering techniques.
Abstract: Network visualization has traditionally relied on heuristic metrics, such as stress, under the assumption that optimizing them leads to aesthetic and informative layouts. However, no single metric consistently produces the most effective results. A data-driven alternative is to learn from human preferences, where annotators select their favored visualization among multiple layouts of the same graphs. These human-preference labels can then be used to train a generative model that approximates human aesthetic preferences. However, obtaining human labels at scale is costly and time-consuming. As a result, this generative approach has so far been tested only with machine-labeled data. In this paper, we explore the use of large language models (LLMs) and vision models (VMs) as proxies for human judgment. Through a carefully designed user study involving 27 participants, we curated a large set of human preference labels. We used this data both to better understand human preferences and to bootstrap LLM/VM labelers. We show that prompt engineering that combines few-shot examples and diverse input formats, such as image embeddings, significantly improves LLM-human alignment, and additional filtering by the confidence score of the LLM pushes the alignment to human-human levels. Furthermore, we demonstrate that carefully trained VMs can achieve VM-human alignment at a level comparable to that between human annotators. Our results suggest that AI can feasibly serve as a scalable proxy for human labelers.
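The confidence-filtering step can be sketched in a few lines. This is a hypothetical illustration of the evaluation logic, not the paper's code: measure raw labeler-human agreement, then recompute it keeping only labels whose self-reported confidence clears a threshold.

```python
def agreement(a, b):
    """Fraction of items on which two labelers pick the same layout."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def confidence_filtered_agreement(llm_labels, confidences, human_labels, thresh):
    """Agreement restricted to labels whose LLM confidence >= thresh."""
    kept = [(l, h) for l, c, h in zip(llm_labels, confidences, human_labels)
            if c >= thresh]
    return agreement(*zip(*kept)), len(kept)

# toy preference labels over pairs of layouts (A vs. B), with LLM confidences
human = ["A", "B", "A", "A", "B", "A"]
llm   = ["A", "A", "A", "B", "B", "A"]
conf  = [0.9, 0.4, 0.8, 0.3, 0.95, 0.7]

raw = agreement(llm, human)
filtered, n_kept = confidence_filtered_agreement(llm, conf, human, thresh=0.6)
```

In this toy example the low-confidence labels are exactly the mistaken ones, so filtering raises agreement at the cost of coverage — the trade-off the paper's confidence-score filtering navigates.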
[669] Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization
Mohammadreza Rostami, Solmaz S. Kia
Main category: cs.LG
TL;DR: ATCG algorithm improves communication efficiency in distributed submodular maximization under matroid constraints by adaptively expanding active sets only when needed, reducing feature embedding transmissions while maintaining near-optimal approximation guarantees.
Details
Motivation: Existing algorithms for submodular maximization under matroid constraints face trade-offs: Sequential Greedy has poor approximation ratio, while Continuous Greedy requires dense decision vectors and extensive communication of feature embeddings between agents, which is inefficient in distributed settings.
Method: ATCG (Adaptive Thresholded Continuous Greedy) introduces a per-partition progress ratio η_i that gates gradient evaluations. It expands each agent’s active set only when current candidates fail to capture sufficient marginal gain, directly bounding which feature embeddings need to be transmitted between agents.
Result: Theoretical analysis shows ATCG achieves curvature-aware approximation guarantee with effective factor τ_eff = max{τ, 1-c}. Experiments on class-balanced prototype selection using CIFAR-10 animal dataset show ATCG achieves comparable objective values to Continuous Greedy while substantially reducing communication overhead.
Conclusion: ATCG provides an effective communication-efficient alternative to Continuous Greedy for distributed submodular maximization, maintaining strong approximation guarantees while adaptively controlling which feature embeddings need to be shared between agents.
Abstract: Submodular maximization under matroid constraints is a fundamental problem in combinatorial optimization with applications in sensing, data summarization, active learning, and resource allocation. While the Sequential Greedy (SG) algorithm achieves only a $\frac{1}{2}$-approximation due to irrevocable selections, Continuous Greedy (CG) attains the optimal $\bigl(1-\frac{1}{e}\bigr)$-approximation via the multilinear relaxation, at the cost of a progressively dense decision vector that forces agents to exchange feature embeddings for nearly every ground-set element. We propose \textit{ATCG} (\underline{A}daptive \underline{T}hresholded \underline{C}ontinuous \underline{G}reedy), which gates gradient evaluations behind a per-partition progress ratio $η_i$, expanding each agent’s active set only when current candidates fail to capture sufficient marginal gain, thereby directly bounding which feature embeddings are ever transmitted. Theoretical analysis establishes a curvature-aware approximation guarantee with effective factor $τ_{\mathrm{eff}}=\max\{τ, 1-c\}$, interpolating between the threshold-based guarantee and the low-curvature regime where \textit{ATCG} recovers the performance of CG. Experiments on a class-balanced prototype selection problem over a subset of the CIFAR-10 animal dataset show that \textit{ATCG} achieves objective values comparable to those of the full CG method while substantially reducing communication overhead through adaptive active-set expansion.
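The gating idea can be illustrated on a toy problem. The sketch below is not the paper's ATCG (no multilinear relaxation, no progress ratio η_i); it is a discrete greedy over a partition matroid for a coverage objective, where an agent's active set expands only when its best current marginal gain falls below a threshold times an upper bound on the gains of unexplored candidates, so unexplored items are never "transmitted".

```python
from itertools import chain

def coverage(sets, S):
    """Submodular objective: number of ground elements covered by chosen sets."""
    return len(set(chain.from_iterable(sets[i] for i in S)))

def thresholded_greedy(sets, partitions, eta=0.5, init=1):
    """Pick one candidate per partition (partition-matroid constraint).
    An agent's active set expands only when its best current marginal gain is
    below eta times an upper bound (|set|) on unexplored candidates' gains."""
    S, evaluated = [], 0
    for part in partitions:
        active, rest = list(part[:init]), list(part[init:])
        evaluated += len(active)

        def gain(i):
            return coverage(sets, S + [i]) - coverage(sets, S)

        while rest and gain(max(active, key=gain)) < eta * max(len(sets[i]) for i in rest):
            active.append(rest.pop(0))   # expand: "transmit" one more candidate
            evaluated += 1
        S.append(max(active, key=gain))
    return S, evaluated

sets = {0: {1, 2, 3}, 1: {1}, 2: {4, 5}, 3: {4, 6, 7}}
S, evaluated = thresholded_greedy(sets, partitions=[[0, 1], [2, 3]], eta=0.9)
```

Here the second agent expands its active set (candidate 3 offers more gain), while the first never touches its second candidate — so only 3 of 4 items are ever evaluated.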
[670] Adversarial Robustness of Deep State Space Models for Forecasting
Sribalaji C. Anand, George J. Pappas
Main category: cs.LG
TL;DR: The paper analyzes robustness of Spacetime SSM forecasters against adversarial attacks, showing they can represent optimal Kalman predictors and deriving vulnerability bounds based on system properties.
Details
Motivation: While state-space models (SSMs) show strong performance in time-series forecasting, their robustness to adversarial perturbations is poorly understood, creating a critical gap in reliable forecasting systems.
Method: 1) Prove Spacetime SSM can represent optimal Kalman predictors for autoregressive processes; 2) Formulate robust forecasting as Stackelberg game against stealthy adversaries; 3) Derive closed-form bounds on adversarial error; 4) Show model-free attacks exploiting locally linear behavior.
Result: Model-free attacks without gradient computation cause at least 33% more error than projected gradient descent on Monash benchmark datasets, revealing significant vulnerability even to simple attacks.
Conclusion: SSM forecasters are vulnerable to adversarial attacks, with vulnerability amplified by system instabilities; model-free attacks can be surprisingly effective, highlighting need for robust design principles.
Abstract: State-space models (SSMs) for time-series forecasting have demonstrated strong empirical performance on benchmark datasets, yet their robustness under adversarial perturbations is poorly understood. We address this gap through a control-theoretic lens, focusing on the recently proposed Spacetime SSM forecaster. We first establish that the decoder-only Spacetime architecture can represent the optimal Kalman predictor when the underlying data-generating process is autoregressive - a property no other SSM possesses. Building on this, we formulate robust forecaster design as a Stackelberg game against worst-case stealthy adversaries constrained by a detection budget, and solve it via adversarial training. We derive closed-form bounds on adversarial forecasting error that expose how open-loop instability, closed-loop instability, and decoder state dimension each amplify vulnerability - offering actionable principles towards robust forecaster design. Finally, we show that even adversaries with no access to the forecaster can nonetheless construct effective attacks by exploiting the model’s locally linear input-output behavior, bypassing gradient computations entirely. Experiments on the Monash benchmark datasets highlight that model-free attacks, without any gradient computation, can cause at least 33% more error than projected gradient descent with a small step size.
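A model-free attack that exploits locally linear input-output behavior can be sketched with finite differences. This is an illustrative reconstruction of the attack's general shape, not the paper's construction: estimate the local input-output map without gradient access, then spend the perturbation budget along the most output-amplifying direction.

```python
import numpy as np

def local_linear_attack(forecast, x, eps=1e-4, budget=0.1):
    """Model-free attack sketch: estimate the forecaster's local input-output
    map by finite differences (black-box queries only, no gradients), then
    perturb along the top right-singular vector of the estimated Jacobian."""
    y0 = forecast(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (forecast(x + e) - y0) / eps
    _, _, vt = np.linalg.svd(J)          # vt[0]: most amplified input direction
    return x + budget * vt[0]

rng = np.random.default_rng(0)
A = np.diag([5.0, 1.0, 0.5])             # stand-in for a locally linear forecaster
forecast = lambda x: A @ x
x = rng.normal(size=3)
x_adv = local_linear_attack(forecast, x, budget=0.1)
err = np.linalg.norm(forecast(x_adv) - forecast(x))
```

For this exactly linear stand-in, the induced forecast error equals budget times the largest singular value (0.1 × 5 = 0.5), the worst case for the given budget.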
[671] MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Matthew Levinson
Main category: cs.LG
TL;DR: Joint training objective for sparse autoencoders reduces subspace blending by penalizing when decoder directions can be sparsely reconstructed from a meta dictionary, leading to more atomic features.
Details
Motivation: Sparse autoencoders are used for safety applications like alignment detection and model steering, but their latents often blend multiple representational subspaces together, making features less atomic and interpretable.
Method: Introduces a joint training objective where a small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE’s decoder columns. The primary SAE is penalized when its decoder directions are easily reconstructable from the meta dictionary, creating gradient pressure toward more independent decoder directions.
Result: On GPT-2 large (layer 20), reduces mean |φ| by 7.5% and improves automated interpretability (fuzzing) scores by 7.6%. Results on Gemma 2 9B are directional but show +8.6% ΔFuzz improvement. Qualitative analysis shows polysemantic features split into semantically distinct sub-features.
Conclusion: The method successfully reduces subspace blending in SAEs, leading to more atomic features that better represent single coherent concepts, which is valuable for safety applications requiring interpretable representations.
Abstract: Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation. We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE’s decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary. This occurs whenever latent directions lie in a subspace spanned by other primary directions. This creates gradient pressure toward more mutually independent decoder directions that resist sparse meta-compression. On GPT-2 large (layer 20), the selected configuration reduces mean $|\varphi|$ by 7.5% relative to an identical solo SAE trained on the same data. Automated interpretability (fuzzing) scores improve by 7.6%, providing external validation of the atomicity gain independent of the training and co-occurrence metrics. Reconstruction overhead is modest. Results on Gemma 2 9B are directional. On not-fully-converged SAEs, the same parameterization yields the best results, a $+8.6\%$ $\Delta$Fuzz. Though directional, this is an encouraging sign that the method transfers to a larger model. Qualitative analysis confirms that features firing on polysemantic tokens are split into semantically distinct sub-features, each specializing in a distinct representational subspace.
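The decomposability penalty can be sketched with a simplification: the paper reconstructs decoder columns with a sparse meta-SAE, whereas the stand-in below uses dense least squares, which captures the same signal (how much of each primary decoder direction lies in the span of the meta dictionary).

```python
import numpy as np

def decomposability_penalty(D, M):
    """Penalize primary decoder columns (columns of D, one per latent) that are
    easy to reconstruct from the meta dictionary M. Dense least squares stands
    in for the paper's sparse meta-SAE: a low residual means the direction lies
    in a subspace spanned by the meta dictionary, i.e. it is not atomic."""
    coef, *_ = np.linalg.lstsq(M, D, rcond=None)
    residual = D - M @ coef
    recon_quality = 1.0 - np.linalg.norm(residual, axis=0) / np.linalg.norm(D, axis=0)
    return float(recon_quality.mean())   # higher => more decomposable => penalized

M = np.eye(4)[:, :2]                     # meta dictionary spanning two axes
D_in_span = np.array([[1.0, 0.5],        # decoder columns inside span(M):
                      [2.0, 1.0],        # fully reconstructable, max penalty
                      [0.0, 0.0],
                      [0.0, 0.0]])
D_orthogonal = np.array([[0.0, 0.0],     # decoder columns orthogonal to M:
                         [0.0, 0.0],     # unreconstructable, zero penalty
                         [1.0, 0.0],
                         [0.0, 1.0]])
```

In joint training, this scalar is added to the primary SAE's loss, pushing decoder directions out of subspaces the meta dictionary can compress.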
[672] Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
Dipkumar Patel
Main category: cs.LG
TL;DR: Multi-agent LLM committees suffer from representational collapse where agents produce similar reasoning despite different role prompts, reducing diversity benefits. DALC protocol uses embedding geometry to weight diverse contributions, improving accuracy with lower token cost.
Details
Motivation: Multi-agent LLM committees using majority voting assume agents provide complementary evidence, but this paper identifies "representational collapse" where agents produce highly similar reasoning despite different role prompts, reducing the benefits of committee diversity.
Method: The paper analyzes agent reasoning similarity by embedding chain-of-thought rationales and measuring pairwise cosine similarity. It introduces DALC, a training-free consensus protocol that computes diversity weights from embedding geometry to aggregate agent outputs more effectively.
Result: Experiments show high similarity (mean cosine 0.888) among Qwen2.5-14B agents, indicating representational collapse. DALC achieves 87% accuracy on GSM8K vs 84% for self-consistency with 26% lower token cost. Embedding choice strongly affects collapse severity and downstream accuracy.
Conclusion: Representational collapse is measurable and worsens on harder tasks. Embedding proxy choice is crucial for latent communication protocols. Diversity-aware aggregation can improve multi-agent performance while reducing computational cost.
Abstract: Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent’s chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 point per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.
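The two diagnostics the paper reports (mean pairwise cosine similarity and effective rank) and the diversity-weighting idea are easy to sketch. The weighting rule below is an illustrative assumption, not DALC's exact formula:

```python
import numpy as np

def effective_rank(E):
    """Entropy-based effective rank of the agents' rationale-embedding matrix:
    near 1 when all agents say the same thing, near n when they are diverse."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def diversity_weights(E):
    """Down-weight redundant agents: weight_i grows as agent i's embedding is
    less cosine-similar, on average, to the other agents' embeddings."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    C = En @ En.T
    mean_sim = (C.sum(axis=1) - 1.0) / (len(E) - 1)
    w = np.clip(1.0 - mean_sim, 1e-6, None)
    return w / w.sum()

# two near-identical agents (collapsed) and one genuinely different one
E = np.array([[1.00, 0.00, 0.0],
              [0.99, 0.01, 0.0],
              [0.00, 1.00, 0.0]])
w = diversity_weights(E)
```

The collapsed pair splits roughly half the total weight between them while the distinct agent gets the most, which is the behavior a diversity-aware vote needs.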
[673] Olmo Hybrid: From Theory to Practice and Back
William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal
Main category: cs.LG
TL;DR: Hybrid models mixing attention and recurrent layers outperform pure transformers in scaling efficiency and downstream performance, offering more expressive language models beyond traditional architectures.
Details
Motivation: To determine if non-transformer architectures (linear RNNs and hybrid models) justify scaling efforts by comparing their performance against pure transformers, addressing whether theoretical expressivity benefits translate to practical advantages.
Method: Combines theoretical analysis of hybrid model expressivity with empirical evaluation of Olmo Hybrid (7B parameters with Gated DeltaNet layers replacing sliding window layers), comparing against Olmo 3 7B transformer across pretraining and mid-training evaluations.
Result: Olmo Hybrid outperforms Olmo 3 transformer across standard evaluations, scales significantly more efficiently, and demonstrates superior performance despite theoretical gap between formal problem expressivity and unrelated downstream tasks.
Conclusion: Hybrid models mixing attention and recurrence are powerful extensions to language modeling paradigm, offering more expressive models with better scaling efficiency during pretraining, not just inference memory reduction.
Abstract: Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
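For readers unfamiliar with the recurrent half of the hybrid, a single step of a gated delta rule can be written down schematically. This is the textbook recurrence, not Olmo Hybrid's actual Gated DeltaNet layer (which is learned-gated, multi-headed, and chunked for hardware efficiency):

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One schematic gated delta-rule step: decay the matrix memory S by the
    gate alpha, apply a rank-1 'delta' update that overwrites the slot
    addressed by (unit) key k with value v, then read out with query q."""
    k = k / np.linalg.norm(k)
    S = alpha * S @ (np.eye(len(k)) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    return S, S @ q

d = 4
S = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])
v1 = np.array([0.0, 2.0, 0.0, 0.0])
v2 = np.array([0.0, 0.0, 3.0, 0.0])

S, out1 = gated_delta_step(S, k, v1, q=k, alpha=0.9, beta=1.0)  # write v1 at k
S, out2 = gated_delta_step(S, k, v2, q=k, alpha=1.0, beta=1.0)  # overwrite with v2
```

With beta = 1 the second write fully replaces the value stored under key k — the targeted-erasure behavior that distinguishes delta-rule RNNs from plain additive linear attention and underlies the expressivity arguments.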
[674] Neural Operators for Multi-Task Control and Adaptation
David Sewell, Xingjian Li, Stepan Tretiakov, Krishna Kumar, David Fridovich-Keil
Main category: cs.LG
TL;DR: Neural operators for multi-task optimal control: learn mapping from task descriptions to optimal control policies using permutation-invariant architecture, enabling generalization and efficient adaptation.
Details
Motivation: Neural operators show promise for learning mappings between function spaces, but their application to optimal control problems remains underexplored. Multi-task control problems involve learning solution operators that map task descriptions (cost/dynamics functions) to optimal control laws.
Method: Use permutation-invariant neural operator architecture to approximate solution operators for parametric optimal control. Train via behavioral cloning. Develop structured adaptation strategies (lightweight updates to full fine-tuning) and meta-trained variants for few-shot adaptation.
Result: Single operator trained via behavioral cloning accurately approximates solution operator and generalizes to unseen tasks, OOD settings, and varying task observations. Branch-trunk architecture enables efficient adaptation. Meta-trained variants outperform baseline meta-learning methods for few-shot adaptation.
Conclusion: Neural operators provide unified and efficient framework for multi-task control and adaptation, demonstrating strong generalization and flexible adaptation capabilities across various control environments.
Abstract: Neural operator methods have emerged as powerful tools for learning mappings between infinite-dimensional function spaces, yet their potential in optimal control remains largely unexplored. We focus on multi-task control problems, whose solution is a mapping from task description (e.g., cost or dynamics functions) to optimal control law (e.g., feedback policy). We approximate these solution operators using a permutation-invariant neural operator architecture. Across a range of parametric optimal control environments and a locomotion benchmark, a single operator trained via behavioral cloning accurately approximates the solution operator and generalizes to unseen tasks, out-of-distribution settings, and varying amounts of task observations. We further show that the branch-trunk structure of our neural operator architecture enables efficient and flexible adaptation to new tasks. We develop structured adaptation strategies ranging from lightweight updates to full-network fine-tuning, achieving strong performance across different data and compute settings. Finally, we introduce meta-trained operator variants that optimize the initialization for few-shot adaptation. These methods enable rapid task adaptation with limited data and consistently outperform a popular meta-learning baseline. Together, our results demonstrate that neural operators provide a unified and efficient framework for multi-task control and adaptation.
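The branch-trunk structure mentioned in the abstract can be sketched with a DeepONet-style forward pass. The networks, sizes, and the idea of feeding sampled cost-function values to the branch are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random small MLP parameters (no training here; forward pass only)."""
    return [(rng.normal(size=(m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

p = 8                                    # shared latent basis size
branch = init_mlp([16, 32, p], rng)      # encodes the task description
trunk = init_mlp([2, 32, p], rng)        # encodes the query state

def operator_policy(task_samples, state):
    """Branch-trunk operator: u(state | task) = <branch(task), trunk(state)>.
    Adapting to a new task can touch only the branch side, leaving the trunk
    frozen -- the basis for lightweight adaptation strategies."""
    return float(mlp(branch, task_samples) @ mlp(trunk, state))

task = rng.normal(size=16)               # e.g. sampled values of a cost function
state = rng.normal(size=2)
u = operator_policy(task, state)
u_other = operator_policy(rng.normal(size=16), state)
```

Because the task enters only through the branch inner product, swapping tasks changes the induced policy without re-evaluating or retraining the trunk.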
[675] Earth Embeddings Reveal Diverse Urban Signals from Space
Wenjing Gong, Udbhav Srivastava, Yuchen Wang, Yuhao Jia, Qifan Wu, Weishan Bai, Yifan Yang, Xiao Huang, Xinyue Ye
Main category: cs.LG
TL;DR: Benchmarking three Earth embedding models (AlphaEarth, Prithvi, Clay) for predicting 14 neighborhood-level urban indicators across six U.S. metropolitan areas, showing they capture urban variation best for built-environment related outcomes.
Details
Motivation: Conventional urban indicators are costly, spatially inconsistent, and slow to update. Geospatial foundation models offer Earth embeddings that could provide scalable, low-cost features for urban monitoring, but their utility at neighborhood scale needs evaluation.
Method: Benchmarked three Earth embedding families (AlphaEarth, Prithvi, Clay) using unified supervised-learning framework to predict 14 neighborhood-level indicators across six U.S. metropolitan areas from 2020-2023. Evaluated performance under four settings: global, city-wise, year-wise, and city-year.
Result: Earth embeddings capture substantial urban variation, with highest predictive skill for outcomes tied to built-environment structure (chronic health burdens, commuting modes). Performance varies across cities but remains stable across years. Compact 64-dimensional AlphaEarth embeddings outperform reduced versions of other models.
Conclusion: Establishes benchmark for evaluating Earth embeddings in urban remote sensing and demonstrates their potential as scalable, low-cost features for SDG-aligned neighborhood-scale urban monitoring, though some indicators remain difficult to infer.
Abstract: Conventional urban indicators derived from censuses, surveys, and administrative records are often costly, spatially inconsistent, and slow to update. Recent geospatial foundation models enable Earth embeddings, compact satellite image representations transferable across downstream tasks, but their utility for neighborhood-scale urban monitoring remains unclear. Here, we benchmark three Earth embedding families, AlphaEarth, Prithvi, and Clay, for urban signal prediction across six U.S. metropolitan areas from 2020 to 2023. Using a unified supervised-learning framework, we predict 14 neighborhood-level indicators spanning crime, income, health, and travel behavior, and evaluate performance under four settings: global, city-wise, year-wise, and city-year. Results show that Earth embeddings capture substantial urban variation, with the highest predictive skill for outcomes more directly tied to built-environment structure, including chronic health burdens and dominant commuting modes. By contrast, indicators shaped more strongly by fine-scale behavior and local policy, such as cycling, remain difficult to infer. Predictive performance varies markedly across cities but remains comparatively stable across years, indicating strong spatial heterogeneity alongside temporal robustness. Exploratory analysis suggests that cross-city variation in predictive performance is associated with urban form in task-specific ways. Controlled dimensionality experiments show that representation efficiency is critical: compact 64-dimensional AlphaEarth embeddings remain more informative than 64-dimensional reductions of Prithvi and Clay. This study establishes a benchmark for evaluating Earth embeddings in urban remote sensing and demonstrates their potential as scalable, low-cost features for SDG-aligned neighborhood-scale urban monitoring.
[676] Super Agents and Confounders: Influence of surrounding agents on vehicle trajectory prediction
Daniel Jost, Luca Paparusso, Martin Stoll, Jörg Wagner, Raghu Rajan, Joschka Bödecker
Main category: cs.LG
TL;DR: Trajectory prediction models often degrade with more surrounding agents; Conditional Information Bottleneck helps compress relevant features and ignore misleading signals.
Details
Motivation: Current trajectory prediction models show surprising degradation in accuracy when incorporating more surrounding agents, suggesting they learn unstable and non-causal decision-making schemes that vary across training runs.
Method: Use Shapley-based attribution to analyze model flaws, then propose Conditional Information Bottleneck (CIB) to compress agent features and ignore non-beneficial information without additional supervision.
Result: CIB improves overall trajectory prediction performance, increases robustness to perturbations, and provides interpretable metrics for identifying non-robust behavior across multiple datasets and architectures.
Conclusion: Selective integration of contextual information is crucial for trajectory prediction as surrounding agents can contain spurious signals; CIB offers a promising solution for robust prediction.
Abstract: In highly interactive driving scenes, trajectory prediction is conditioned on information from surrounding traffic participants such as cars and pedestrians. Our main contribution is a comprehensive analysis of state-of-the-art trajectory predictors, which reveals a surprising and critical flaw: many surrounding agents degrade prediction accuracy rather than improve it. Using Shapley-based attribution, we rigorously demonstrate that models learn unstable and non-causal decision-making schemes that vary significantly across training runs. Building on these insights, we propose to integrate a Conditional Information Bottleneck (CIB), which does not require additional supervision and is trained to effectively compress agent features as well as ignore those that are not beneficial for the prediction task. Comprehensive experiments using multiple datasets and model architectures demonstrate that this simple yet effective approach not only improves overall trajectory prediction performance in many cases but also increases robustness to different perturbations. Our results highlight the importance of selectively integrating contextual information, which can often contain spurious or misleading signals, in trajectory prediction. Moreover, we provide interpretable metrics for identifying non-robust behavior and present a promising avenue towards a solution.
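The bottleneck idea can be sketched with the standard variational IB penalty; the paper's conditional variant and its integration into a trajectory predictor are not reproduced, so treat this as the generic compression term only:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): the bottleneck term that
    pressures the encoder to discard per-agent information, so only features
    that actually reduce the prediction loss survive compression."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

def bottleneck_loss(pred_loss, mu, logvar, beta=0.1):
    """Sketch of the trade-off: task loss plus beta-weighted compression cost.
    Agents whose features don't help the forecast get compressed away."""
    return pred_loss + beta * kl_to_standard_normal(mu, logvar)

mu = np.zeros(2)
logvar = np.zeros(2)
```

An encoding matching the prior (mu = 0, logvar = 0) carries no agent information and incurs zero compression cost; informative encodings pay a KL price that must be earned back through better predictions.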
[677] Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
Erin Tan, Judy Hanwen Shen, Irene Y. Chen
Main category: cs.LG
TL;DR: Data addition in healthcare ML can both help and hurt fairness; combining model calibration with data strategies is needed for subgroup performance.
Details
Motivation: Algorithmic bias in high-stakes ML applications like healthcare can harm subgroups, and interventions to fix data often face challenges with distribution shifts when combining data sources.
Method: Analyzed eICU and MIMIC-IV EHR datasets, comparing data addition strategies and model-based post-hoc calibration approaches to improve subgroup performance.
Result: Data addition can both improve and worsen model fairness and performance; intuitive data selection strategies are unreliable; combining calibration with data strategies works best.
Conclusion: Challenges traditional “better data” dogma for fairness; effective solutions require combining both data-centric and model-based approaches.
Abstract: In high-stakes settings where machine learning models are used to automate decision-making about individuals, the presence of algorithmic bias can exacerbate systemic harm to certain subgroups of people. These biases often stem from the underlying training data. In practice, interventions to “fix the data” depend on the actual additional data sources available – where many are less than ideal. In these cases, the effects of data scaling on subgroup performance become volatile, as the improvements from increased sample size are counteracted by the introduction of distribution shifts in the training set. In this paper, we investigate the limitations of combining data sources to improve subgroup performance within the context of healthcare. Clinical models are commonly trained on datasets comprised of patient electronic health record (EHR) data from different hospitals or admission departments. Across two such datasets, the eICU Collaborative Research Database and the MIMIC-IV dataset, we find that data addition can both help and hurt model fairness and performance, and many intuitive strategies for data selection are unreliable. We compare model-based post-hoc calibration and data-centric addition strategies to find that the combination of both is important to improve subgroup performance. Our work questions the traditional dogma of “better data” for overcoming fairness challenges by comparing and combining data- and model-based approaches.
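Per-subgroup post-hoc calibration, one of the model-based strategies compared, can be sketched with group-wise Platt scaling. The specific calibration method and data below are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np

def platt_fit(scores, labels, iters=3000, lr=0.5):
    """Fit sigmoid(a*s + b) to binary labels by gradient descent on log loss."""
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        g = p - labels
        a -= lr * np.mean(g * scores)
        b -= lr * np.mean(g)
    return a, b

def calibrate_by_group(scores, labels, groups):
    """Model-based post-hoc step: a separate calibration map per subgroup,
    complementing (not replacing) data-addition strategies."""
    return {g: platt_fit(scores[groups == g], labels[groups == g])
            for g in np.unique(groups)}

rng = np.random.default_rng(0)
s = rng.normal(size=400)
grp = np.repeat(["A", "B"], 200)
# subgroup B's scores are systematically miscalibrated (shifted by -2)
true_p = np.where(grp == "A", 1 / (1 + np.exp(-s)), 1 / (1 + np.exp(-(s - 2))))
y = (rng.random(400) < true_p).astype(float)
maps = calibrate_by_group(s, y, grp)
```

A single global calibrator would split the difference between the two subgroups; fitting per group recovers B's shift, which is why the paper finds calibration and data strategies work best combined.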
[678] Improving Feasibility via Fast Autoencoder-Based Projections
Maria Chzhen, Priya L. Donti
Main category: cs.LG
TL;DR: A data-driven amortized approach using adversarial autoencoders to learn convex latent representations of feasible sets, enabling fast correction of infeasible predictions in constrained optimization and RL problems.
Details
Motivation: Existing methods struggle to efficiently enforce complex nonconvex operational constraints in real-world learning and control systems, creating a need for practical alternatives to expensive feasibility correction techniques.
Method: Train an autoencoder with adversarial objective to learn structured, convex latent representation of feasible set. Use this as approximate projector to correct infeasible predictions by projecting latent representations onto simple convex shape before decoding back to original feasible space.
Result: Method effectively enforces constraints at low computational cost across diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints.
Conclusion: The proposed amortized approach offers practical alternative to expensive traditional solver-based feasibility correction techniques for enforcing complex constraints in learning systems.
Abstract: Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently enforce general classes of constraints. To address this, we propose a novel data-driven amortized approach that uses a trained autoencoder as an approximate projector to provide fast corrections to infeasible predictions. Specifically, we train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This enables rapid correction of neural network outputs by projecting their associated latent representations onto a simple convex shape before decoding into the original feasible set. We test our approach on a diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints. Results show that our method effectively enforces constraints at a low computational cost, offering a practical alternative to expensive feasibility correction techniques based on traditional solvers.
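The correction step itself is cheap once the autoencoder is trained. A minimal sketch, assuming the latent feasible region is shaped into a unit L2 ball (the paper says only "a simple convex shape") and using identity encoder/decoder stand-ins in place of the trained networks:

```python
import numpy as np

def project_latent(z, radius=1.0):
    """Euclidean projection onto a simple convex latent set (here, an L2 ball)
    -- the shape the adversarial training objective molds the feasible set into."""
    n = np.linalg.norm(z)
    return z if n <= radius else z * (radius / n)

def repair(x, encode, decode, radius=1.0):
    """Amortized feasibility correction: encode the (possibly infeasible)
    prediction, project in latent space, decode back to the original space."""
    return decode(project_latent(encode(x), radius))

# With identity encode/decode, repair reduces to a plain ball projection.
identity = lambda x: x
x_infeasible = np.array([3.0, 4.0])      # norm 5, outside the unit ball
x_fixed = repair(x_infeasible, identity, identity)
```

Because the projection is onto a ball, the whole correction is a norm check and a rescale: no solver in the loop, which is the source of the speedup over traditional feasibility restoration.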
[679] Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
Charafeddine Mouzouni
Main category: cs.LG
TL;DR: MoE token routing analyzed as congestion game with single parameter gamma_eff reveals three-phase training trajectory: surge (balance learning), stabilization (expert specialization), relaxation (quality over balance).
Details
Motivation: To understand the dynamics of Mixture-of-Experts token routing during training, revealing patterns invisible in converged models, and to provide a theoretical framework connecting routing behavior to congestion game theory.
Method: Model MoE token routing as congestion game with single parameter gamma_eff, track across training checkpoints of OLMoE-1B-7B and OpenMoE-8B models, analyze three-phase trajectory, complement with effective congestion decomposition and multi-type extension.
Result: Revealed three-phase training trajectory: surge phase (gamma_eff: 14 to 36-39), stabilization phase (B_0: 2.4 to 2.3), relaxation phase (gamma_eff: 27 to 9). Multi-type extension improved load prediction by 30% via token clustering, with robust verification across quality estimators (r >= 0.89).
Conclusion: Early MoE training prioritizes load balance while late training prioritizes quality, with congestion game framework revealing training dynamics invisible in converged models and providing interpretable parameter (gamma_eff) evolution.
Abstract: We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilon_l), and robustness verification across four independent quality estimators (r >= 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.
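The abstract notes that the single-type equilibrium reduces to temperature-scaled softmax routing. A minimal numpy sketch of that reading, assuming gamma_eff acts as the softmax temperature (so larger values flatten the routing distribution toward balance; the mapping is an interpretation, not the paper's exact equilibrium):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route_tokens(affinities, gamma_eff):
    # temperature-scaled softmax routing: larger gamma_eff flattens the
    # distribution over experts, pushing token load toward balance
    return softmax(affinities / gamma_eff)

def load_imbalance(probs):
    # max expert load divided by mean load; 1.0 = perfectly balanced
    load = probs.sum(axis=0)
    return load.max() / load.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))                      # 1000 tokens, 8 experts
imb_surge = load_imbalance(route_tokens(logits, 36.0))   # balance-heavy phase
imb_relax = load_imbalance(route_tokens(logits, 9.0))    # quality-heavy phase
```

With the same router logits, the high-gamma_eff regime yields a visibly lower imbalance than the low-gamma_eff regime, mirroring the surge-then-relaxation trajectory.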
[680] Online learning of smooth functions on $\mathbb{R}$
Jesse Geneson, Kuldeep Singh, Alexander Wang
Main category: cs.LG
TL;DR: The paper studies adversarial online learning of real-valued functions on ℝ with p-loss, showing the standard model becomes ill-posed for certain function classes, and analyzes three modified scenarios that limit the influence of distant queries to obtain finite guarantees.
Details
Motivation: The motivation is to address the ill-posedness of standard adversarial online learning for real-valued functions on ℝ, where adversaries can force infinite loss for certain function classes (𝒢_q). The paper seeks to develop modified learning scenarios that restrict query influence to obtain meaningful finite guarantees.
Method: The paper analyzes three modified learning scenarios: 1) adversary must choose queries within distance 1 of past queries, 2) learner penalized only on rounds with queries within distance 1 of past queries, and 3) loss weighted by function g of distance to nearest past query. It studies these scenarios for function class 𝒢_q and its multivariable generalization 𝒢_{q,d}.
Result: For Scenarios 1-2, sharp characterizations are obtained in several regimes. For Scenario 3, a threshold phenomenon is identified: slow-decaying g allows infinite weighted loss, while rapidly decaying weights (e.g., g(z)=e^{-cz}) yield finite sharp guarantees for p=q=2. For multivariable case (d≥2), even modified scenarios allow infinite loss.
Conclusion: The standard adversarial online learning model is ill-posed for real-valued functions on ℝ, but modified scenarios that limit distant query influence can yield finite guarantees. However, the multivariable generalization remains problematic even with these modifications.
Abstract: We study adversarial online learning of real-valued functions on $\mathbb{R}$. In each round the learner is queried at $x_t\in\mathbb{R}$, predicts $\hat y_t$, and then observes the true value $f(x_t)$; performance is measured by cumulative $p$-loss $\sum_{t\ge 1}|\hat y_t-f(x_t)|^p$. For the class $\mathcal{G}_q=\bigl\{f:\mathbb{R}\to\mathbb{R}\ \text{absolutely continuous}:\ \int_{\mathbb{R}}|f'(x)|^q\,dx\le 1\bigr\}$, we show that the standard model becomes ill-posed on $\mathbb{R}$: for every $p\ge 1$ and $q>1$, an adversary can force infinite loss. Motivated by this obstruction, we analyze three modified learning scenarios that limit the influence of queries that are far from previously observed inputs. In Scenario 1 the adversary must choose each new query within distance $1$ of some past query. In Scenario 2 the adversary may query anywhere, but the learner is penalized only on rounds whose query lies within distance $1$ of a past query. In Scenario 3 the loss in round $t$ is multiplied by a weight $g(\min_{j<t}|x_t-x_j|)$. We obtain sharp characterizations for Scenarios 1-2 in several regimes. For Scenario 3 we identify a clean threshold phenomenon: if $g$ decays too slowly, then the adversary can force infinite weighted loss. In contrast, for rapidly decaying weights such as $g(z)=e^{-cz}$ we obtain finite and sharp guarantees in the quadratic case $p=q=2$. Finally, we study a natural multivariable slice generalization $\mathcal{G}_{q,d}$ of $\mathcal{G}_q$ on $\mathbb{R}^d$ and show a sharp dichotomy: while the one-dimensional case admits finite opt-values in certain regimes, for every $d\ge 2$ the slice class $\mathcal{G}_{q,d}$ is too permissive, and even under Scenarios 1-3 an adversary can force infinite loss.
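Scenario 3's weighted loss is easy to state concretely. A small sketch, treating the first round as unweighted (an assumption; the paper's convention for the empty minimum may differ) and using the rapidly decaying weight g(z) = e^{-z} from the finite-guarantee regime:

```python
import math

def weighted_cumulative_loss(xs, y_hat, y_true, g, p=2.0):
    """Scenario 3: the round-t loss |y_hat_t - f(x_t)|^p is multiplied by
    g(min_{j<t} |x_t - x_j|). First round taken unweighted (an assumption)."""
    total, past = 0.0, []
    for x, yh, yt in zip(xs, y_hat, y_true):
        w = g(min(abs(x - xj) for xj in past)) if past else 1.0
        total += w * abs(yh - yt) ** p
        past.append(x)
    return total

# rapidly decaying weight: the far-away query at x = 10 is nearly free,
# while the nearby query at x = 0.5 is charged close to full price
g = lambda z: math.exp(-z)
loss = weighted_cumulative_loss(
    xs=[0.0, 10.0, 0.5], y_hat=[0.0, 0.0, 0.0], y_true=[1.0, 1.0, 1.0], g=g)
```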
[681] Choosing the Right Regularizer for Applied ML: Simulation Benchmarks of Popular Scikit-learn Regularization Frameworks
Benjamin S. Knight, Ahsaas Bajaj
Main category: cs.LG
TL;DR: Survey of regularization evolution with empirical evaluation showing Ridge, Lasso, and ElasticNet are interchangeable for prediction when n/p ≥ 78, but Lasso recall collapses under multicollinearity while ElasticNet maintains performance.
Details
Motivation: To provide a comprehensive historical survey of regularization techniques and empirically evaluate their performance across different scenarios to guide practitioners in selecting appropriate regularization methods based on observable feature space characteristics.
Method: Historical survey of regularization evolution combined with empirical evaluation of four canonical frameworks (Ridge, Lasso, ElasticNet, Post-Lasso OLS) across 134,400 simulations spanning a 7-dimensional manifold based on eight production-grade ML models.
Result: Ridge, Lasso, and ElasticNet are nearly interchangeable for prediction accuracy when sample-to-feature ratio is sufficient (n/p ≥ 78). However, Lasso recall collapses to 0.18 under high multicollinearity and low SNR, while ElasticNet maintains 0.93 recall. Post-Lasso OLS also performs poorly under these conditions.
Conclusion: Practitioners should avoid using Lasso or Post-Lasso OLS at high condition numbers with small sample sizes. The paper provides an objective-driven decision guide for selecting optimal regularization frameworks based on observable feature space attributes.
Abstract: This study surveys the historical development of regularization, tracing its evolution from stepwise regression in the 1960s to recent advancements in formal error control, structured penalties for non-independent features, Bayesian methods, and l0-based regularization (among other techniques). We empirically evaluate the performance of four canonical frameworks – Ridge, Lasso, ElasticNet, and Post-Lasso OLS – across 134,400 simulations spanning a 7-dimensional manifold grounded in eight production-grade machine learning models. Our findings demonstrate that for prediction accuracy when the sample-to-feature ratio is sufficient (n/p >= 78), Ridge, Lasso, and ElasticNet are nearly interchangeable. However, we find that Lasso recall is highly fragile under multicollinearity; at high condition numbers (kappa) and low SNR, Lasso recall collapses to 0.18 while ElasticNet maintains 0.93. Consequently, we advise practitioners against using Lasso or Post-Lasso OLS at high kappa with small sample sizes. The analysis concludes with an objective-driven decision guide to assist machine learning engineers in selecting the optimal scikit-learn-supported framework based on observable feature space attributes.
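The fragility the paper measures comes from ill-conditioning: at high condition number, unregularized coefficients absorb noise along the near-null direction of the design. A numpy-only illustration with closed-form ridge (the paper's scikit-learn benchmark is far richer; this only shows the kappa mechanism, with toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # nearly collinear pair -> high kappa
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)   # true coefficients (1, 1)

kappa = np.linalg.cond(X)                # condition number of the design

def ridge_fit(X, y, alpha):
    # closed-form ridge; alpha = 0 recovers ordinary least squares
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

beta_ols = ridge_fit(X, y, 0.0)    # noise blows up along the near-null direction
beta_ridge = ridge_fit(X, y, 1.0)  # shrinkage damps exactly that direction
```

The ridge estimate stays near the well-determined sum of the two coefficients while shrinking the noisy difference direction, which is why Ridge (and ElasticNet's L2 component) remains stable where pure Lasso selection breaks down.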
[682] Simple yet Effective: Low-Rank Spatial Attention for Neural Operators
Zherui Yang, Haiyang Xin, Tao Du, Ligang Liu
Main category: cs.LG
TL;DR: Low-Rank Spatial Attention (LRSA) unifies neural operator architectures using low-rank compression for efficient global modeling in PDE solving, achieving 17% error reduction with standard Transformer components.
Details
Motivation: Neural operators need to efficiently model long-range spatial interactions in PDEs. Many PDE regimes have compressible global interaction kernels with rapid spectral decay, suggesting low-rank approximations could unify and improve existing approaches.
Method: Introduces Low-Rank Spatial Attention (LRSA) that compresses high-dimensional pointwise features into compact latent space, processes global interactions there, then reconstructs back to spatial points. Built purely from standard Transformer primitives (attention, normalization, FFNs).
Result: Achieves average error reduction of over 17% relative to second-best methods, remains stable and efficient in mixed-precision training, and is straightforward to implement with hardware-optimized kernels.
Conclusion: LRSA provides a clean, unified framework for neural operators that leverages low-rank structure in PDE interactions, achieving superior performance with simple, standard Transformer components.
Abstract: Neural operators have emerged as data-driven surrogates for solving partial differential equations (PDEs), and their success hinges on efficiently modeling the long-range, global coupling among spatial points induced by the underlying physics. In many PDE regimes, the induced global interaction kernels are empirically compressible, exhibiting rapid spectral decay that admits low-rank approximations. We leverage this observation to unify representative global mixing modules in neural operators under a shared low-rank template: compressing high-dimensional pointwise features into a compact latent space, processing global interactions within it, and reconstructing the global context back to spatial points. Guided by this view, we introduce Low-Rank Spatial Attention (LRSA) as a clean and direct instantiation of this template. Crucially, unlike prior approaches that often rely on non-standard aggregation or normalization modules, LRSA is built purely from standard Transformer primitives, i.e., attention, normalization, and feed-forward networks, yielding a concise block that is straightforward to implement and directly compatible with hardware-optimized kernels. In our experiments, such a simple construction is sufficient to achieve high accuracy, yielding an average error reduction of over 17% relative to second-best methods, while remaining stable and efficient in mixed-precision training.
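The compress-process-reconstruct template can be sketched with two plain cross-attentions in numpy: N spatial points attend into k latent tokens and back, so the cost is O(Nk) rather than O(N^2) and the produced context has rank at most k. The latent queries below are random stand-ins for learned parameters, and the intermediate FFN/processing stage is omitted:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # standard scaled dot-product attention
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def lrsa_block(X, L):
    """Low-rank template: compress N pointwise features into k latent tokens,
    then reconstruct global context at every point. Cost O(Nk), not O(N^2).
    (The latent processing stage between the two attentions is omitted.)"""
    Z = attend(L, X, X)     # compress:    (k, d) latent summary of the field
    return attend(X, Z, Z)  # reconstruct: (N, d) global context per point

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 32))   # 512 spatial points, 32-dim features
L = rng.normal(size=(8, 32))     # 8 latent queries (learned in a real model)
Y = lrsa_block(X, L)
```

Because the reconstruction mixes only k latent rows, the output context is provably low-rank, which is exactly the structure the paper argues PDE interaction kernels admit.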
[683] Evaluation of Bagging Predictors with Kernel Density Estimation and Bagging Score
Philipp Seitz, Jan Schmitt, Andreas Schiffler
Main category: cs.LG
TL;DR: A method using Kernel Density Estimation to determine representative predictions from bagging ensembles with an associated Bagging Score confidence metric, outperforming mean/median approaches in nonlinear regression with neural networks.
Details
Motivation: The standard approach of taking the mean of bagging ensemble predictions can deviate from ground truth in certain parameter regions, and there's a need for better ensemble prediction methods with confidence metrics.
Method: Uses Kernel Density Estimation (KDE) to determine a representative prediction (y_BS) from bagging ensembles of neural networks in nonlinear regression, along with a Bagging Score (beta_BS) confidence metric.
Result: The approach produces better predictions than using mean or median, and achieves top rankings in error metrics compared to other nonlinear regression methods without optimization or feature selection.
Conclusion: KDE-based ensemble prediction with Bagging Score provides superior performance and confidence estimation compared to traditional mean/median approaches for bagging predictors in nonlinear regression.
Abstract: For a larger set of predictions from several differently trained machine learning models, known as bagging predictors, the mean of all predictions is taken by default. Nevertheless, this procedure can deviate from the actual ground truth in certain parameter regions. An approach is presented to determine a representative y_BS from such a set of predictions using Kernel Density Estimation (KDE) in nonlinear regression with Neural Networks (NN), which simultaneously provides an associated quality criterion beta_BS, called the Bagging Score (BS), that reflects the confidence of the obtained ensemble prediction. It is shown that the new approach yields better predictions than the common use of the mean or median. In addition, the method is contrasted with several approaches to nonlinear regression from the literature, achieving a top ranking in each of the calculated error metrics without using any optimization or feature selection technique.
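The mean-vs-mode failure is easiest to see on a bimodal ensemble. A sketch of the idea with a plain Gaussian KDE in numpy: the representative y_BS is the density mode, while the confidence score here (fraction of ensemble members within one bandwidth of the mode) is an illustrative stand-in, not the paper's exact Bagging Score formula:

```python
import numpy as np

def kde_representative(preds, bandwidth=None, grid_size=512):
    """Representative ensemble prediction = mode of a Gaussian KDE over the
    bagging predictions, plus a crude confidence score (fraction of members
    within one bandwidth of the mode; illustrative, not the paper's beta_BS)."""
    preds = np.asarray(preds, dtype=float)
    if bandwidth is None:
        bandwidth = 1.06 * preds.std() * len(preds) ** (-0.2)  # Silverman's rule
    grid = np.linspace(preds.min(), preds.max(), grid_size)
    dens = np.exp(-0.5 * ((grid[:, None] - preds[None, :]) / bandwidth) ** 2).sum(axis=1)
    y_bs = grid[dens.argmax()]
    beta_bs = np.mean(np.abs(preds - y_bs) <= bandwidth)
    return y_bs, beta_bs

# bimodal ensemble: the mean (~1.5) falls between the clusters, while the
# KDE mode lands inside the majority cluster near 0
rng = np.random.default_rng(0)
preds = np.concatenate([rng.normal(0.0, 0.05, 70), rng.normal(5.0, 0.05, 30)])
y_bs, beta_bs = kde_representative(preds)
```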
[684] BlazeFL: Fast and Deterministic Federated Learning Simulation
Kitsuya Azuma, Takayuki Nishio
Main category: cs.LG
TL;DR: BlazeFL is a lightweight framework for single-node federated learning simulation that enables deterministic parallel execution with thread-based parallelism and isolated random number generators for reproducibility.
Details
Motivation: Federated learning research requires efficient single-node simulations with many virtual clients, but parallel execution introduces nondeterminism through shared random states and scheduling variability, forcing trade-offs between throughput and reproducibility.
Method: BlazeFL uses free-threaded shared-memory execution with thread-based parallelism and in-memory parameter exchange between server and clients. It assigns isolated random number generator streams to clients for deterministic execution, ensuring bitwise-identical results across repeated runs.
Result: BlazeFL achieves up to 3.1× speedup on communication-dominated workloads compared to baseline frameworks while preserving deterministic reproducibility and maintaining a lightweight dependency footprint.
Conclusion: BlazeFL provides an efficient solution for federated learning simulation that eliminates the trade-off between throughput and reproducibility through deterministic parallel execution with isolated RNG management.
Abstract: Federated learning (FL) research increasingly relies on single-node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single-node FL simulation that alleviates this trade-off through free-threaded shared-memory execution and deterministic randomness management. BlazeFL uses thread-based parallelism with in-memory parameter exchange between the server and clients, avoiding serialization and inter-process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL-managed generators, this design yields bitwise-identical results across repeated high-concurrency runs in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, BlazeFL substantially reduces execution time relative to a widely used open-source baseline, achieving up to 3.1$\times$ speedup on communication-dominated workloads while preserving a lightweight dependency footprint. Our open-source implementation is available at: https://github.com/kitsuyaazuma/blazefl.
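The core reproducibility pattern, per-client isolated RNG streams, can be sketched in a few lines of numpy: seed each virtual client's generator from (run_seed, client_id) so results do not depend on which thread runs which client, or in what order. (BlazeFL's actual API differs; this only demonstrates the determinism pattern.)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def make_client_rngs(run_seed, n_clients):
    # one isolated, reproducible stream per virtual client, keyed by
    # (run_seed, client_id) so it is independent of thread scheduling
    return [np.random.default_rng([run_seed, cid]) for cid in range(n_clients)]

def client_update(rng):
    # stand-in for local training: draw stochastic values from the
    # client's own stream only (never from shared global state)
    return rng.normal(size=4)

def run_round(run_seed, n_clients=8, workers=4):
    rngs = make_client_rngs(run_seed, n_clients)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(client_update, rngs))  # map preserves client order

a = run_round(42)
b = run_round(42)   # repeated run: bitwise-identical despite thread scheduling
```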
[685] Neural Global Optimization via Iterative Refinement from Noisy Samples
Qusay Muzaffar, David Levin, Michael Werman
Main category: cs.LG
TL;DR: A neural approach for global optimization of black-box functions from noisy samples that learns to find global minima through iterative refinement, outperforming traditional methods like Bayesian Optimization on multi-modal functions.
Details
Motivation: Traditional global optimization methods like Bayesian Optimization often converge to local minima on multi-modal functions, while gradient-free methods require many function evaluations. There's a need for more robust approaches that can handle noisy samples and find global minima efficiently.
Method: The model takes noisy function samples and their fitted spline representation as input, then iteratively refines an initial guess toward the true global minimum. It combines encoding of multiple modalities including function values, derivatives, and spline coefficients with iterative position updates.
Result: Achieves mean error of 8.05% on challenging multi-modal test functions compared to 36.24% for spline initialization (28.18% improvement). Successfully finds global minima in 72% of test cases with error below 10%.
Conclusion: The neural approach demonstrates learned optimization principles rather than mere curve fitting, enabling robust global optimization without requiring derivative information or multiple restarts.
Abstract: Global optimization of black-box functions from noisy samples is a fundamental challenge in machine learning and scientific computing. Traditional methods such as Bayesian Optimization often converge to local minima on multi-modal functions, while gradient-free methods require many function evaluations. We present a novel neural approach that learns to find global minima through iterative refinement. Our model takes noisy function samples and their fitted spline representation as input, then iteratively refines an initial guess toward the true global minimum. Trained on randomly generated functions with ground truth global minima obtained via exhaustive search, our method achieves a mean error of 8.05 percent on challenging multi-modal test functions, compared to 36.24 percent for the spline initialization, a 28.18 percent improvement. The model successfully finds global minima in 72 percent of test cases with error below 10 percent, demonstrating learned optimization principles rather than mere curve fitting. Our architecture combines encoding of multiple modalities including function values, derivatives, and spline coefficients with iterative position updates, enabling robust global optimization without requiring derivative information or multiple restarts.
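The spline-initialization baseline the paper improves on amounts to: denoise the samples, take the argmin of the fit, then (in the paper, with a learned model) refine. A toy numpy version, with a moving average standing in for the spline as an intentional simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.3 * x ** 2    # multi-modal: two local minima on [-2, 2]

xs = np.linspace(-2, 2, 401)
ys = f(xs) + 0.05 * rng.normal(size=xs.size)  # noisy samples of the black box

# denoise-then-argmin initialization (moving average standing in for the spline)
smooth = np.convolve(ys, np.ones(21) / 21, mode="same")
x_init = xs[smooth.argmin()]
# the paper's learned model would now iteratively refine x_init;
# for this f the true global minimum sits near x ≈ -0.49
```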
[686] Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
Mitchell A. Thornton
Main category: cs.LG
TL;DR: The paper presents a theoretical framework showing that temporal averaging in statistical estimation can be replaced by algebraic group action on single observations, unifying DFT, DCT, and KLT as special cases, with applications including single-snapshot DOA estimation, MIMO channel estimation, waveform classification, graph signal processing, and transformer analysis revealing RoPE uses suboptimal algebraic groups.
Details
Motivation: Traditional statistical estimation methods often require multiple observations (temporal averaging) for reliable subspace decomposition and covariance estimation. The paper aims to develop a theoretical framework that can achieve equivalent statistical estimation from single observations using algebraic group actions, reducing data requirements and computational complexity.
Method: The authors prove two main theorems: 1) General Replacement Theorem establishing conditions where group-averaged estimators from single snapshots achieve equivalent subspace decomposition to multi-snapshot covariance estimation, and 2) Optimality Theorem proving the symmetric group is universally optimal (yielding KL transform). They develop a closed-form double-commutator eigenvalue problem for polynomial-time optimal group selection, unifying DFT, DCT, and KLT as special cases of group-matched spectral transforms.
Result: The framework demonstrates five applications: MUSIC DOA estimation from single snapshot, massive MIMO channel estimation with 64% throughput gain, single-pulse waveform classification at 90% accuracy, graph signal processing with non-Abelian groups, and transformer analysis showing RoPE uses the wrong algebraic group for 70-80% of attention heads across five models (22,480 head observations). The optimal group is content-dependent, and spectral-concentration-based pruning improves perplexity at 13B scale with single-forward-pass diagnostics.
Conclusion: Algebraic group action on single observations can replace temporal averaging for second-order statistical estimation, providing a unified framework for spectral transforms with significant practical applications in signal processing, communications, and transformer architectures. The approach enables efficient single-snapshot estimation and reveals fundamental insights about transformer attention mechanisms.
Abstract: We prove that temporal averaging over multiple observations can be replaced by algebraic group action on a single observation for second-order statistical estimation. A General Replacement Theorem establishes conditions under which a group-averaged estimator from one snapshot achieves equivalent subspace decomposition to multi-snapshot covariance estimation, and an Optimality Theorem proves that the symmetric group is universally optimal (yielding the KL transform). The framework unifies the DFT, DCT, and KLT as special cases of group-matched spectral transforms, with a closed-form double-commutator eigenvalue problem for polynomial-time optimal group selection. Five applications are demonstrated: MUSIC DOA estimation from a single snapshot, massive MIMO channel estimation with 64% throughput gain, single-pulse waveform classification at 90% accuracy, graph signal processing with non-Abelian groups, and a new algebraic analysis of transformer LLMs revealing that RoPE uses the wrong algebraic group for 70-80% of attention heads across five models (22,480 head observations), that the optimal group is content-dependent, and that spectral-concentration-based pruning improves perplexity at the 13B scale. All diagnostics require a single forward pass with no gradients or training.
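The DFT-as-group-matched-transform claim has a compact toy check: averaging the outer product of a single snapshot over its cyclic-shift orbit yields a circulant "covariance", which the DFT diagonalizes exactly. This numpy sketch illustrates only that special case of the framework:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=n)                  # a single observation

# group-averaged estimator: average x x^T over the cyclic-shift orbit of x
orbit = np.stack([np.roll(x, k) for k in range(n)])
R = orbit.T @ orbit / n                 # circulant by construction

# the group-matched transform (the DFT) diagonalizes R
F = np.fft.fft(np.eye(n)) / np.sqrt(n)  # unitary DFT matrix
D = F @ R @ F.conj().T
off = D - np.diag(np.diag(D))           # should vanish up to roundoff
```

One snapshot plus the group action thus plays the role that many temporally averaged snapshots would play for a shift-stationary process.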
[687] Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback
Jongsoo Lee, Jangwon Kim, Soohee Han
Main category: cs.LG
TL;DR: DHRL framework uses MDP homomorphisms to collapse belief-equivalent states in delayed RL, reducing state-space explosion while preserving optimality, with improved performance on continuous control tasks.
Details
Motivation: Real-world RL often suffers from delayed feedback that breaks the Markov assumption. Standard state augmentation causes state-space explosion and sample complexity issues. Existing approaches treat actor and critic separately or only partially address the problem.
Method: Proposes Delayed Homomorphic RL (DHRL) framework using MDP homomorphisms to collapse belief-equivalent augmented states, enabling efficient policy learning on abstract MDP without optimality loss. Includes theoretical analysis of compression bounds and sample complexity, plus practical algorithm.
Result: Experiments on MuJoCo continuous control tasks show DHRL outperforms strong augmentation-based baselines, especially under long delays.
Conclusion: DHRL provides structured, sample-efficient solution to delayed RL by leveraging MDP homomorphisms to compress state space while maintaining optimality, addressing both actor and critic components in unified framework.
Abstract: Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state augmentation approaches cause the state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, the state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments for the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks in MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
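The state-space explosion that motivates DHRL is concrete: the canonical augmented state is the last observed state plus the d actions still in flight, so the augmented space has |S|·|A|^d elements, exponential in the delay. A one-liner makes the growth visible (toy sizes, not from the paper):

```python
def augmented_state_count(n_states, n_actions, delay):
    # canonical augmentation: (last observed state, d pending actions)
    return n_states * n_actions ** delay

# |S| = 100, |A| = 4: augmented-state counts for delays 0..5
sizes = [augmented_state_count(100, 4, d) for d in range(6)]
```

DHRL's homomorphism collapses belief-equivalent members of this exponential family, which is where the claimed sample-efficiency gains come from.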
[688] Automated Attention Pattern Discovery at Scale in Large Language Models
Jonathan Katzy, Razvan-Mihai Popescu, Erik Mekkes, Arie van Deursen, Maliheh Izadi
Main category: cs.LG
TL;DR: AP-MAE: A vision transformer-based model that reconstructs masked attention patterns from LLMs for scalable interpretability, enabling pattern analysis, correctness prediction, and targeted interventions.
Details
Motivation: Current interpretability methods for LLMs are too specific and don't scale well. The paper aims to develop scalable interpretability by studying repeated behaviors in LLMs through attention patterns, using the structured nature of code as a testbed.
Method: Mine completion scenarios in Java code datasets to collect attention patterns. Introduce AP-MAE (Attention Pattern - Masked Autoencoder), a vision transformer-based model that efficiently reconstructs masked attention patterns from LLMs.
Result: AP-MAE reconstructs masked attention patterns with high accuracy, generalizes across unseen models, reveals recurring patterns, predicts generation correctness (55-70% accuracy), and enables targeted interventions that increase accuracy by 13.6% when applied selectively.
Conclusion: Attention patterns serve as scalable signals for interpretability, and AP-MAE provides a transferable foundation for both analysis and intervention in LLMs, also serving as a selection procedure to guide fine-grained mechanistic approaches.
Abstract: Large language models have found success by scaling up capabilities to work in general settings. The same can unfortunately not be said for interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These often do not generalize, or are too resource intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios in Java code datasets, through exploiting the structured nature of code. We collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern - Masked Autoencoder(AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across inferences, (iv) predicts whether a generation will be correct without access to ground truth, with accuracies ranging from 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE also serves as a selection procedure to guide fine-grained mechanistic approaches. We release code and models to support future work in large-scale interpretability.
[689] CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data
Renzo G. Soatto, Anders Hoel, Greycen Ren, Shorna Alam, Stephen Bates, Nikolaos P. Daskalakis, Caroline Uhler, Maria Skoularidou
Main category: cs.LG
TL;DR: CountsDiff is a diffusion framework for discrete ordinal data that extends Blackout diffusion with simplified parameterization, continuous-time training, and features like classifier-free guidance, validated on image datasets and biological count data.
Details
Motivation: Diffusion models have excelled in continuous and token-based domains but remain underdeveloped for discrete ordinal data like natural numbers. There's a need for native modeling of count distributions, particularly for applications like biological count assays (e.g., single-cell RNA-seq data).
Method: Extends Blackout diffusion framework with simplified parameterization using survival probability schedule and explicit loss weighting. Introduces continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics allowing non-monotone reverse trajectories.
Result: Validated on natural image datasets (CIFAR-10, CelebA) and biological count assays (single-cell RNA-seq imputation). Matches or surpasses state-of-the-art discrete generative models and leading RNA-seq imputation methods, with room for further optimization.
Conclusion: CountsDiff provides an effective diffusion framework for discrete ordinal data with promising results in both image domains and biological applications, demonstrating potential for further improvements through optimized design choices.
Abstract: Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal data remains underdeveloped. We present CountsDiff, a diffusion framework designed to natively model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), exploring the effects of varying the introduced design parameters in a complex, well-studied, and interpretable data domain. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in a fetal cell and heart cell atlas. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading RNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.
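A survival-probability schedule has a natural forward process on counts: binomial thinning, where each unit of a count survives independently with probability s(t). This numpy sketch is one plausible reading of the blackout-style forward corruption; the paper's exact schedule and reverse process are not reproduced here:

```python
import numpy as np

def thin(x0, survival_p, rng):
    # each unit of a count survives independently with probability s(t);
    # s = 1 leaves the data intact, s = 0 "blacks out" everything to zero
    return rng.binomial(x0, survival_p)

rng = np.random.default_rng(0)
x0 = rng.poisson(20.0, size=10_000)   # toy count data (e.g. expression counts)
x_mid = thin(x0, 0.5, rng)            # partially corrupted: mean roughly halves
x_end = thin(x0, 0.0, rng)            # fully blacked out
```

The forward chain stays on the natural numbers at every step, which is the property that lets the model handle count data natively rather than through continuous relaxations.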
[690] Automated Conjecture Resolution with Formal Verification
Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong
Main category: cs.LG
TL;DR: Automated framework combining natural language reasoning (Rethlas) with formal verification (Archon) to solve research-level math problems with minimal human intervention, demonstrated by resolving an open commutative algebra problem.
Details
Motivation: While LLMs have improved mathematical reasoning, reliably solving and verifying research-level problems remains challenging due to natural language ambiguity. The paper aims to create an automated framework that integrates informal reasoning with formal verification for end-to-end problem solving.
Method: Two-component framework: 1) Rethlas (informal reasoning agent) mimics human mathematicians using reasoning primitives and theorem search engine (Matlas) to explore solutions; 2) Archon (formal verification agent), equipped with the formal theorem search engine LeanSearch, translates informal arguments into formalized Lean 4 projects using structured task decomposition, iterative refinement, and automated proof synthesis.
Result: Successfully resolved an open problem in commutative algebra and formally verified the proof in Lean 4 with essentially no human involvement. Demonstrated that theorem retrieval enables cross-domain technique discovery and the formal agent can autonomously fill nontrivial gaps in informal arguments.
Conclusion: The work presents a promising paradigm for mathematical research where informal and formal reasoning systems with theorem retrieval tools operate together to produce verifiable results, substantially reducing human effort and offering concrete human-AI collaboration in mathematics.
Abstract: Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research-level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework for tackling research-level mathematical problems that integrates natural language reasoning with formal verification, enabling end-to-end problem solving with minimal human intervention. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas mimics the workflow of human mathematicians by combining reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with our formal theorem search engine LeanSearch, translates informal arguments into formalized Lean 4 projects through structured task decomposition, iterative refinement, and automated proof synthesis, ensuring machine-checkable correctness. Using this framework, we automatically resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, while the formal agent is capable of autonomously filling nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, substantially reduce human effort, and offer a concrete instantiation of human-AI collaborative mathematical research.
[692] k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS
Jonas De Schouwer, Haitz Sáez de Ocáriz Borde, Xiaowen Dong
Main category: cs.LG
TL;DR: k-MIP attention enables efficient graph transformers with linear complexity by selecting top-k relevant nodes per query, maintaining full expressive power while scaling to 500k+ nodes.
Details
Motivation: Graph transformers face quadratic complexity limitations for large-scale graphs, while existing efficiency methods degrade performance or limit expressive power.
Method: Introduces k-Maximum Inner Product (k-MIP) attention that selects top-k relevant key nodes per query, combined with symbolic matrix computation for linear memory complexity.
Result: Achieves practical speedups up to 10x, processes graphs with 500k+ nodes on single A100 GPU, maintains theoretical expressiveness, and ranks top on multiple benchmarks.
Conclusion: k-MIP attention successfully balances efficiency and effectiveness for graph transformers, enabling large-scale applications without compromising expressive power.
Abstract: Graph transformers have shown promise in overcoming limitations of traditional graph neural networks, such as oversquashing and difficulties in modelling long-range dependencies. However, their application to large-scale graphs is hindered by the quadratic memory and computational complexity of the all-to-all attention mechanism. Although alternatives such as linearized attention and restricted attention patterns have been proposed, these often degrade performance or limit expressive power. To better balance efficiency and effectiveness, we introduce k-Maximum Inner Product (k-MIP) attention for graph transformers. k-MIP attention selects the most relevant key nodes per query via a top-k operation, yielding a sparse yet flexible attention pattern. Combined with an attention score computation based on symbolic matrices, this results in linear memory complexity and practical speedups of up to an order of magnitude compared to all-to-all attention, enabling the processing of graphs with over 500k nodes on a single A100 GPU. We provide a theoretical analysis of expressive power, showing that k-MIP attention does not compromise the expressiveness of graph transformers: specifically, we prove that k-MIP transformers can approximate any full-attention transformer to arbitrary precision. In addition, we analyze the expressive power of the GraphGPS framework, in which we integrate our attention mechanism, and establish an upper bound on its graph distinguishing capability in terms of the S-SEG-WL test. Finally, we validate our approach on the Long Range Graph Benchmark, the City-Networks benchmark, and two custom large-scale inductive point cloud datasets, consistently ranking among the top-performing scalable graph transformers.
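The core top-k selection can be shown in a few lines. This is a minimal per-query sketch with plain Python lists (no symbolic matrices, batching, or learned projections), not the authors' implementation:

```python
import math

def kmip_attention(query, keys, values, k):
    """Sparse attention: attend only to the k keys with the largest
    inner product against the query, softmax over that subset."""
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    topk = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in topk)                    # stabilized softmax
    exps = {i: math.exp(scores[i] - m) for i in topk}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in topk) for d in range(dim)]

q = [10.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
vals = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out = kmip_attention(q, keys, vals, k=1)   # attends to the best key only
```

With k equal to the number of keys this reduces to ordinary full attention, which is the intuition behind the paper's approximation result.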
[692] Collapse-Free Prototype Readout Layer for Transformer Encoders
Giansalvo Cirrincione, Rahul Ranjeev Kumar
Main category: cs.LG
TL;DR: DDCL-Attention is a prototype-based readout layer for transformers that replaces pooling methods with learned compression using global prototypes, offering stable training, prototype diversity, and multiple applications including differentiable codebooks.
Details
Motivation: Current transformer encoders rely on simple pooling methods like mean pooling or class tokens for readout, which may not optimally compress sequence information. There's a need for more sophisticated, learnable compression mechanisms that can produce compact token summaries while maintaining prototype diversity and training stability.
Method: Uses a small set of global prototype vectors and assigns tokens to them through soft probabilistic matching. Features exact loss decomposition into reconstruction and diversity terms to prevent prototype collapse, with theoretical stability guarantees using Tikhonov’s singular perturbation theory and learning-rate constraints. Supports three use cases: final readout layer, differentiable VQ-VAE codebook extension, and hierarchical document compression.
Result: Experiments on four datasets confirm theoretical predictions: loss decomposition holds exactly, prototype separation grows as expected under stability conditions, and codebook achieves full utilization, outperforming standard hard vector quantization. Additional orbital debris classification study shows applicability beyond standard NLP/vision to scientific tabular data.
Conclusion: DDCL-Attention provides an effective prototype-based readout layer with theoretical guarantees, stable training, and versatile applications across different domains including NLP, vision, and scientific data analysis.
Abstract: DDCL-Attention is a prototype-based readout layer for transformer encoders that replaces simple pooling methods, such as mean pooling or class tokens, with a learned compression mechanism. It uses a small set of global prototype vectors and assigns tokens to them through soft probabilistic matching, producing compact token summaries at linear complexity in sequence length. The method offers three main advantages. First, it avoids prototype collapse through an exact decomposition of the training loss into a reconstruction term and a diversity term, ensuring that prototypes remain distinct. Second, its joint training with the encoder is shown to be stable under a practical timescale condition, using Tikhonov’s singular perturbation theory and explicit learning-rate constraints. Third, the same framework supports three uses: a final readout layer, a differentiable codebook extending VQ-VAE, and a hierarchical document compressor. Experiments on four datasets confirm the theoretical predictions: the loss decomposition holds exactly, prototype separation grows as expected when the stability condition is met, and the codebook reaches full utilization, outperforming standard hard vector quantization. An additional study on orbital debris classification shows that the method also applies beyond standard NLP and vision tasks, including scientific tabular data.
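The soft prototype assignment at the heart of the readout can be sketched as below. This is a toy illustration with plain lists (the temperature, function names, and two-prototype example are ours; the paper's exact reconstruction/diversity loss decomposition is not reproduced):

```python
import math

def prototype_readout(tokens, prototypes, temperature=1.0):
    """Soft-assign each token to a small set of global prototypes via a
    softmax over inner products, then summarize tokens as prototype-weighted
    means. Cost is linear in sequence length (tokens x prototypes)."""
    n, d = len(tokens), len(tokens[0])
    assignment = []  # assignment[i][j]: P(token i -> prototype j)
    for tok in tokens:
        logits = [sum(t * p for t, p in zip(tok, proto)) / temperature
                  for proto in prototypes]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        assignment.append([e / z for e in exps])
    summaries = []
    for j in range(len(prototypes)):
        w = sum(assignment[i][j] for i in range(n))
        summaries.append([
            sum(assignment[i][j] * tokens[i][c] for i in range(n)) / w
            for c in range(d)
        ])
    return assignment, summaries

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
protos = [[1.0, 0.0], [0.0, 1.0]]
a, s = prototype_readout(tokens, protos)
```

The output is one compact summary per prototype rather than a single mean-pooled vector, which is the compression the paper's readout layer learns.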
[693] Understanding When Poisson Log-Normal Models Outperform Penalized Poisson Regression for Microbiome Count Data
Daniel Agyapong, Julien Chiquet, Jane Marks, Toby Dylan Hocking
Main category: cs.LG
TL;DR: PLN (Poisson Lognormal) models outperform penalized Poisson regression (GLMNet) for microbiome count prediction, especially with higher sample-to-taxon ratios, while each method has strengths for different network inference tasks.
Details
Motivation: Researchers lack guidance on when multivariate count models with latent dependence structures (like PLN) are better than simpler penalized marginal Poisson regression (like GLMNet) for biological count data analysis, particularly in microbiome studies.
Method: Evaluated PLN vs GLMNet(Poisson) on 20 real microbiome datasets (32-18,270 samples, 24-257 taxa) using held-out Poisson deviance with leave-one-taxon-out prediction and 3-fold cross-validation. For network inference, compared PLNNetwork and GLMNet neighborhood selection on 5 datasets with experimentally validated microbial interaction ground truth.
Result: PLN outperformed GLMNet on most count-prediction datasets (up to 38% improvement). Sample-to-taxon ratio was the primary predictor of winner, with mean absolute correlation as strongest secondary signal and overdispersion as additional predictor. PLNNetwork performed best on broad undirected interaction benchmarks, while GLMNet was better for local/directional effects.
Conclusion: The study provides practical guidance for choosing between latent multivariate count models and penalized Poisson regression in biological applications, showing PLN’s superiority for count prediction with sufficient samples and each method’s specific strengths for network inference tasks.
Abstract: Multivariate count models are often justified by their ability to capture latent dependence, but researchers receive little guidance on when this added structure improves on simpler penalized marginal Poisson regression. We study this question using real microbiome data under a unified held-out evaluation framework. For count prediction, we compare PLN and GLMNet(Poisson) on 20 datasets spanning 32 to 18,270 samples and 24 to 257 taxa, using held-out Poisson deviance under leave-one-taxon-out prediction with 3-fold sample cross-validation rather than synthetic or in-sample criteria. For network inference, we compare PLNNetwork and GLMNet(Poisson) neighborhood selection on five publicly available datasets with experimentally validated microbial interaction truth. PLN outperforms GLMNet(Poisson) on most count-prediction datasets, with gains up to 38 percent. The primary predictor of the winner is the sample-to-taxon ratio, with mean absolute correlation as the strongest secondary signal and overdispersion as an additional predictor. PLNNetwork performs best on broad undirected interaction benchmarks, whereas GLMNet(Poisson) is better aligned with local or directional effects. Taken together, these results provide guidance for choosing between latent multivariate count models and penalized Poisson regression in biological count prediction and interaction recovery.
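The held-out criterion used throughout, Poisson deviance, has a standard closed form; a minimal sketch (the convention that the y·log(y/μ) term vanishes at y = 0 is the usual limit):

```python
import math

def poisson_deviance(y, mu):
    """Mean Poisson deviance between observed counts y and predicted
    means mu: 2 * (y*log(y/mu) - (y - mu)), averaged over observations.
    The y*log(y/mu) term is taken as 0 when y == 0 (its limit)."""
    total = 0.0
    for yi, mui in zip(y, mu):
        term = yi * math.log(yi / mui) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mui))
    return total / len(y)

d_perfect = poisson_deviance([2, 4], [2.0, 4.0])   # exact predictions
d_off = poisson_deviance([3, 0, 5], [3.0, 0.1, 5.0])
```

Lower is better; a perfect predictor scores zero, which makes it a natural held-out score for comparing PLN against GLMNet(Poisson).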
[694] A Bayesian Information-Theoretic Approach to Data Attribution
Dharmesh Tailor, Nicolò Felicioni, Kamil Ciosek
Main category: cs.LG
TL;DR: Bayesian information-theoretic approach to Training Data Attribution using information loss scoring with Gaussian Process surrogates for scalability to modern networks.
Details
Motivation: To develop a principled, scalable method for tracing model predictions back to influential training examples that focuses on resolving predictive uncertainty rather than label noise.
Method: Formulates TDA as Bayesian information-theoretic problem scoring subsets by information loss (entropy increase when removed). Uses Gaussian Process surrogate with tangent features for scalability. For large-scale retrieval, relaxes to information-gain objective with variance correction for vector databases.
Result: Competitive performance on counterfactual sensitivity, ground-truth retrieval and coreset selection. Scales to modern architectures while bridging principled measures with practice.
Conclusion: Proposed information-theoretic framework provides scalable, principled approach to Training Data Attribution that aligns with classical influence scores while promoting diversity for subsets.
Abstract: Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce - the entropy increase at a query when removed. This criterion credits examples for resolving predictive uncertainty rather than label noise. To scale to modern networks, we approximate information loss using a Gaussian Process surrogate built from tangent features. We show this aligns with classical influence scores for single-example attribution while promoting diversity for subsets. For even larger-scale retrieval, we relax to an information-gain objective and add a variance correction for scalable attribution in vector databases. Experiments show competitive performance on counterfactual sensitivity, ground-truth retrieval and coreset selection, showing that our method scales to modern architectures while bridging principled measures with practice.
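Under a Gaussian predictive distribution, the entropy-increase score has a simple closed form. The sketch below (variable names are ours) shows only the criterion itself, not the GP surrogate machinery the paper uses to compute the two predictive variances:

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a Gaussian with variance var (in nats)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def information_loss(var_full, var_without):
    """Entropy increase at a query when a training subset is removed:
    the predictive variance grows from var_full (all data) to var_without
    (subset removed); the difference in entropies is the information the
    subset provided about the query."""
    return gaussian_entropy(var_without) - gaussian_entropy(var_full)
```

For Gaussians the score reduces to half the log ratio of variances, so an example is influential exactly insofar as removing it inflates predictive uncertainty at the query.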
[695] Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment
Soham Gadgil, Chris Lin, Su-In Lee
Main category: cs.LG
TL;DR: W2S introduces adaptive layer selection for steering vectors in LLMs, showing that optimal intervention layers vary across inputs and improving alignment performance over fixed-layer approaches.
Details
Motivation: Existing steering vector methods apply interventions at fixed layers, assuming optimal intervention layers are invariant across inputs. The authors argue this is fundamentally limited since representations relevant to target behaviors can be encoded at different layers depending on the input.
Method: W2S (Where to Steer) learns a mapping from input embeddings to optimal steering layers, adaptively selecting the intervention layer conditioned on the input rather than using a globally fixed layer.
Result: Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings.
Conclusion: Input-dependent control is crucial for LLM alignment, and adaptive layer selection is a key design dimension missing in current steering vector methodology.
Abstract: Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.
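The intervention itself is a single vector addition at one chosen layer; a toy sketch (`choose_layer` and `layer_scorer` are hypothetical stand-ins for the learned embedding-to-layer mapping, not the paper's code):

```python
def steer(hidden_states, steering_vector, layer_for_input, alpha=1.0):
    """Add a scaled steering vector to the activations at one layer.
    hidden_states: list over layers of activation vectors for one input;
    layer_for_input: layer index chosen per input (the W2S idea)."""
    out = [list(h) for h in hidden_states]       # copy, don't mutate input
    h = out[layer_for_input]
    for d in range(len(h)):
        h[d] += alpha * steering_vector[d]
    return out

def choose_layer(input_embedding, layer_scorer):
    """Input-dependent layer selection: layer_scorer is a stand-in for a
    learned map from the input embedding to a score per layer; pick the
    highest-scoring layer."""
    scores = layer_scorer(input_embedding)
    return max(range(len(scores)), key=scores.__getitem__)

hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
layer = choose_layer([0.5], lambda e: [0.1, 0.9, 0.2])   # toy scorer
steered = steer(hidden, [1.0, -1.0], layer)
```

A fixed-layer baseline simply hard-codes `layer_for_input`; W2S's point is that the best value of that index varies with the input.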
[696] SODA: Semi On-Policy Black-Box Distillation for Large Language Models
Xiwen Chen, Jingjing Wang, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hejian Sang, Zhipeng Wang, Alborz Geramifard, Feng Luo
Main category: cs.LG
TL;DR: SODA is a semi-on-policy distillation method that uses teacher’s optimal responses paired with static student outputs for efficient alignment, eliminating adversarial training while achieving superior performance.
Details
Motivation: Addresses the trade-off in black-box knowledge distillation: off-policy methods struggle to correct student errors, while on-policy methods suffer from training instability and high computational costs.
Method: Proposes SODA (Semi On-policy Distillation with Alignment) that pairs teacher’s optimal responses with a one-time static snapshot of student’s inferior outputs to create contrastive signals, eliminating dynamic rollouts and adversarial balancing.
Result: Matches or outperforms SOTA methods on 15/16 benchmarks across Qwen2.5 and Llama-3 models, while training 10x faster, using 27% less GPU memory, and eliminating adversarial instability.
Conclusion: Static exposure to student’s own inferior behaviors is sufficient for high-quality distribution alignment, making semi-on-policy distillation a highly efficient alternative to traditional methods.
Abstract: Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student’s inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model’s natural, zero-shot responses are almost strictly inferior to the powerful teacher’s targets, we can construct a highly effective contrastive signal simply by pairing the teacher’s optimal response with a one-time static snapshot of the student’s outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.
[697] Spatiotemporal Interpolation of GEDI Biomass with Calibrated Uncertainty
Robin Young, Srinivasan Keshav
Main category: cs.LG
TL;DR: Extends Attentive Neural Process framework to fill spatiotemporal gaps in GEDI LIDAR biomass data using geospatial foundation model embeddings, enabling space-for-time substitution for reliable uncertainty quantification in forest carbon monitoring.
Details
Motivation: NASA's GEDI provides reliable LIDAR-derived aboveground biomass density but has irregular spatiotemporal coverage and operational gaps (including a 13-month hibernation). Existing methods fill spatial gaps but temporal interpolation through disturbance events remains unaddressed, and standard ensemble methods produce miscalibrated prediction intervals.
Method: Extends Attentive Neural Process (ANP) framework to jointly sparse spatiotemporal settings using geospatial foundation model embeddings. Treats space and time symmetrically, implementing space-for-time substitution where observations from nearby locations at other times inform predictions at held-out periods.
Result: The ANP produces well-calibrated uncertainty estimates across disturbance regimes, supporting its use in Measurement, Reporting, and Verification applications that require reliable uncertainty quantification for forest carbon accounting.
Conclusion: The extended ANP framework effectively addresses temporal interpolation challenges in biomass monitoring, providing reliable uncertainty quantification crucial for forest carbon accounting and deforestation monitoring applications.
Abstract: Monitoring deforestation-driven carbon emissions requires both spatially explicit and temporally continuous estimates of aboveground biomass density (AGBD) with calibrated uncertainty. NASA’s Global Ecosystem Dynamics Investigation (GEDI) provides reliable LIDAR-derived AGBD, but its orbital sampling causes irregular spatiotemporal coverage, and occasional operational interruptions, including a 13-month hibernation from March 2023 to April 2024, leave extended gaps in the observational record. Prior work has used machine learning approaches to fill GEDI’s spatial gaps using satellite-derived features, but temporal interpolation of biomass through unobserved periods, particularly across active disturbance events, remains largely unaddressed. Moreover, standard ensemble methods for biomass mapping have been shown to produce systematically miscalibrated prediction intervals. To address these gaps, we extend the Attentive Neural Process (ANP) framework, previously applied to spatial biomass interpolation, to jointly sparse spatiotemporal settings using geospatial foundation model embeddings. We treat space and time symmetrically, empirically validating a form of space-for-time substitution in which observations from nearby locations at other times inform predictions at held-out periods. Our results demonstrate that the ANP produces well-calibrated uncertainty estimates across disturbance regimes, supporting its use in Measurement, Reporting, and Verification (MRV) applications that require reliable uncertainty quantification for forest carbon accounting.
[698] Multi-Agent Environments for Vehicle Routing Problems
Ricardo Gama, Ricardo Cunha, Daniel Fuertes, Carlos R. del-Blanco, Hugo L. Fernandes
Main category: cs.LG
TL;DR: MAEnvs4VRP is an open-source multi-agent reinforcement learning library for vehicle routing problems with modular design supporting various problem variants.
Details
Motivation: Despite RL's success in vehicle routing problems, there's a lack of open-source frameworks for testing algorithms and comparing results objectively, which hinders progress and collaboration between RL and OR communities.
Method: Developed a unified multi-agent framework built on PyTorch with modular architecture following Agent Environment Cycle games model, supporting classical, dynamic, stochastic, and multi-task vehicle routing variants.
Result: Created MAEnvs4VRP library with intuitive API that enables rapid adoption and seamless integration into existing RL frameworks, facilitating algorithm testing and objective comparison.
Conclusion: The library addresses the scarcity of open-source RL frameworks for vehicle routing problems, promoting progress through better testing, comparison, and collaboration between RL and OR communities.
Abstract: Research on Reinforcement Learning (RL) approaches for discrete optimization problems has increased considerably, extending RL to areas classically dominated by Operations Research (OR). Vehicle routing problems are a good example of discrete optimization problems with high practical relevance, for which RL techniques have achieved notable success. Despite these advances, open-source development frameworks remain scarce, hindering both algorithm testing and objective comparison of results. This situation ultimately slows down progress in the field and limits the exchange of ideas between the RL and OR communities. Here, we propose MAEnvs4VRP library, a unified framework for multi-agent vehicle routing environments that supports classical, dynamic, stochastic, and multi-task problem variants within a single modular design. The library, built on PyTorch, provides a flexible and modular architecture design that facilitates customization and the incorporation of new routing problems. It follows the Agent Environment Cycle (“AEC”) games model and features an intuitive API, enabling rapid adoption and seamless integration into existing reinforcement learning frameworks. The project source code can be found at https://github.com/ricgama/maenvs4vrp.
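The AEC pattern the library follows (one agent observes and acts per cycle) looks roughly like the loop below. The toy environment is entirely our own illustration of the general AEC interaction pattern, not the MAEnvs4VRP API:

```python
class ToyAECEnv:
    """Minimal agent-environment-cycle toy: two vehicles take turns
    claiming customers until none remain. Illustrative only."""
    def __init__(self, num_customers=4):
        self.agents = ["vehicle_0", "vehicle_1"]
        self.remaining = list(range(num_customers))
        self.routes = {a: [] for a in self.agents}
        self._turn = 0

    def agent_iter(self):
        """Yield the acting agent for each cycle until the episode ends."""
        while self.remaining:
            yield self.agents[self._turn % len(self.agents)]

    def observe(self, agent):
        return {"agent": agent, "remaining": list(self.remaining)}

    def step(self, agent, action):
        self.routes[agent].append(action)
        self.remaining.remove(action)
        self._turn += 1

env = ToyAECEnv()
for agent in env.agent_iter():
    obs = env.observe(agent)
    action = obs["remaining"][0]   # greedy: take the first open customer
    env.step(agent, action)
```

The observe-act-step cycle per agent is what lets one loop cover classical, dynamic, and stochastic routing variants; only the environment internals change.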
[699] Regime-Calibrated Demand Priors for Ride-Hailing Fleet Dispatch and Repositioning
Indar Kumar, Akanksha Tiwari
Main category: cs.LG
TL;DR: A regime-calibrated dispatch method for ride-hailing that segments historical trip data into demand regimes, matches current periods to similar historical analogues using multiple similarity metrics, and uses calibrated demand priors for fleet repositioning and dispatch optimization.
Details
Motivation: Ride-hailing dispatch needs to anticipate varying demand patterns across time, day, season, and events. Traditional methods struggle with these complex temporal variations, requiring approaches that can identify and leverage historical patterns effectively.
Method: 1) Segment historical trip data into demand regimes; 2) Match current operating period to most similar historical analogues using similarity ensemble (Kolmogorov-Smirnov distance, Wasserstein-1 distance, feature distance, variance ratio, event pattern similarity, temporal proximity); 3) Use calibrated demand prior to drive LP-based fleet repositioning policy and batch dispatch with Hungarian matching.
Result: Reduces mean rider wait times by 31.1% across 5.2 million NYC TLC trips in 8 diverse scenarios. P95 wait drops 37.6%, Gini coefficient improves from 0.441 to 0.409. Approach generalizes to Chicago (23.3% wait reduction using NYC-built regime library without retraining) and is robust across fleet sizes.
Conclusion: The regime-calibrated approach provides significant improvements in ride-hailing dispatch efficiency without requiring training, is deterministic and explainable, generalizes across cities, and maintains robustness across various operational conditions.
Abstract: Effective ride-hailing dispatch requires anticipating demand patterns that vary substantially across time-of-day, day-of-week, season, and special events. We propose a regime-calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a similarity ensemble combining Kolmogorov-Smirnov distance, Wasserstein-1 distance, feature distance, variance ratio, event pattern similarity, and temporal proximity, and (iii) uses the resulting calibrated demand prior to drive both an LP-based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional-only metric subset achieves the strongest mean-wait reduction, while the full ensemble is retained as a robustness-oriented default that preserves calendar and event context. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]; Friedman chi-squared = 80.0, p = 4.25e-18; Cohen’s d = 7.5-29.9). P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409. The two contributions compose multiplicatively: calibration provides 16.9% reduction relative to the replay baseline; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction using the NYC-built regime library without retraining), and is robust across fleet sizes (32-47% improvement for 0.5x-2.0x fleet scaling). Code is available at https://github.com/IndarKarhana/regime-calibrated-dispatch.
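The batch-dispatch step is a minimum-cost bipartite matching. A tiny brute-force sketch for intuition (the cost matrix is made up; a real system would use a Hungarian solver, e.g. `scipy.optimize.linear_sum_assignment`, which solves the same problem in O(n³) instead of enumerating permutations):

```python
from itertools import permutations

def batch_dispatch(cost):
    """Minimum-total-cost driver-to-rider assignment for one dispatch batch.
    cost[i][j] = e.g. estimated pickup time for driver i serving rider j.
    Brute force over all permutations; only viable for tiny batches."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, list(best_perm)

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
total, assignment = batch_dispatch(cost)   # assignment[i] = rider for driver i
```

In the paper's pipeline the calibrated demand prior shapes the costs (via repositioning) before this matching runs, which is why the two contributions compose.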
[700] Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards
Yaoze Guo, Shana Moothedath
Main category: cs.LG
TL;DR: Multi-task RL with low-rank reward structure using reward-free exploration and low-rank matrix estimation for shared representation learning
Details
Motivation: Multi-task RL faces challenges in learning shared representations due to complex, policy-dependent data and temporal error progression. Existing methods rely on restrictive assumptions like Gaussian features or incoherence conditions.
Method: Uses reward-free RL to learn data-collection policy, then explores to estimate reward matrices with low-rank structure. Proposes low-rank matrix estimation method that works under general feature distributions in RL settings.
Result: Theoretical analysis shows accurate low-rank matrix recovery under relaxed assumptions, with characterization of representation error vs sample complexity. Experimental results demonstrate effective learning of robust shared representations and task dynamics.
Conclusion: Proposed method successfully learns shared representations for multi-task RL with low-rank reward structure under more general conditions than previous approaches, enabling near-optimal policy construction with proven regret bounds.
Abstract: Multi-task representation learning (MTRL) is an approach that learns shared latent representations across related tasks, facilitating collaborative learning that improves the overall learning efficiency. This paper studies MTRL for multi-task reinforcement learning (RL), where multiple tasks have the same state-action space and transition probabilities, but different rewards. We consider T linear Markov Decision Processes (MDPs) where the reward functions and transition dynamics admit linear feature embeddings of dimension d. The relatedness among the tasks is captured by a low-rank structure on the reward matrices. Learning shared representations across multiple RL tasks is challenging due to the complex and policy-dependent nature of data that leads to a temporal progression of error. Our approach adopts a reward-free reinforcement learning framework to first learn a data-collection policy. This policy then informs an exploration strategy for estimating the unknown reward matrices. Importantly, the data collected under this well-designed policy enable accurate estimation, which ultimately supports the learning of a near-optimal policy. Unlike existing approaches that rely on restrictive assumptions such as Gaussian features, incoherence conditions, or access to optimal solutions, we propose a low-rank matrix estimation method that operates under more general feature distributions encountered in RL settings. Theoretical analysis establishes that accurate low-rank matrix recovery is achievable under these relaxed assumptions, and we characterize the relationship between representation error and sample complexity. Leveraging the learned representation, we construct near-optimal policies and prove a regret bound. Experimental results demonstrate that our method effectively learns robust shared representations and task dynamics from finite data.
[701] Improving Model Performance by Adapting the KGE Metric to Account for System Non-Stationarity
M Jawad, HV Gupta, YH Wang, MA Farmani, A Behrangi, GY Niu
Main category: cs.LG
TL;DR: JKGE_ss metric improves geoscientific model evaluation by accounting for temporal non-stationarity in data, focusing on system storage dynamics rather than long-term means.
Details
Motivation: Geoscientific systems exhibit pronounced temporal non-stationarity due to seasonal/climatic variability and land use changes, making traditional stationary assumptions obsolete. Current metrics like NSE and KGE_ss fail to account for dynamic shifts in system properties, potentially leading to misleading model performance assessments under changing conditions.
Method: Introduces JKGE_ss metric (adapted from KGE_ss) that detects and accounts for dynamical non-stationarity by emphasizing reproduction of temporal variations in system storage rather than using long-term means as benchmarks. Tested robustness by training physical-conceptual and data-based catchment-scale models across diverse hydroclimatic conditions.
Result: JKGE_ss consistently improved reproduction of system temporal dynamics across all time scales, wet to dry years, and full range of flow levels (especially recession periods) in all tested conditions from precipitation-dominated to snow-dominated to arid catchments.
Conclusion: Traditional metrics inadequately account for temporal shifts in system dynamics, so JKGE_ss should be adopted for geoscientific model development to improve information extraction and model performance under non-stationary conditions.
Abstract: Geoscientific systems tend to be characterized by pronounced temporal non-stationarity, arising from seasonal and climatic variability in hydrometeorological drivers, and from natural and anthropogenic changes to land use and cover. As has been pointed out, such variability renders “the assumption of statistical stationarity obsolete in water management”, and requires us to “account for, rather than ignore, non-stationary trends” in the data. However, metrics used for model development are typically based on the implicit and unjustifiable assumption that the data generating process is time-stationary. Here, we introduce the JKGE_ss metric (adapted from KGE_ss) that detects and accounts for dynamical non-stationarity in the statistical properties of the data and thereby improves information extraction and model performance. Unlike NSE and KGE_ss, which use the long-term mean as a benchmark against which to evaluate model efficiency, JKGE_ss emphasizes reproduction of temporal variations in system storage. We tested the robustness of the new metric by training physical-conceptual and data-based catchment-scale models of varying complexity across a wide range of hydroclimatic conditions, from recent-precipitation-dominated to snow-dominated to strongly arid. In all cases, the result was improved reproduction of system temporal dynamics at all time scales, across wet to dry years, and over the full range of flow levels (especially recession periods). Since traditional metrics fail to adequately account for temporal shifts in system dynamics, potentially resulting in misleading assessments of model performance under changing conditions, we recommend the adoption of JKGE_ss for geoscientific model development.
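For context, a sketch of the standard Kling-Gupta Efficiency (KGE) that KGE_ss and the proposed JKGE_ss build on; the paper's variant replaces the long-term-mean benchmark with system storage dynamics, which is not reproduced here.

```python
import numpy as np

def kge(sim, obs):
    """Standard KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    r = np.corrcoef(sim, obs)[0, 1]      # linear correlation
    alpha = np.std(sim) / np.std(obs)    # variability ratio
    beta = np.mean(sim) / np.mean(obs)   # bias ratio
    return 1.0 - np.sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(kge(obs, obs))        # a perfect simulation scores 1.0
print(kge(obs * 1.1, obs))  # a 10% multiplicative bias lowers the score
```

A model that is perfect except for a uniform 10% bias keeps r = 1 but pays through the alpha and beta terms, which is exactly the decomposition the non-stationarity-aware variant re-weights.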
[702] Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics
Aniketh Iyengar, Jiaqi Han, Pengwei Sun, Mingjian Jiang, Jianwen Xie, Stefano Ermon
Main category: cs.LG
TL;DR: A novel framework for molecular dynamics trajectory generation using structure pretraining and diffusion models, addressing data scarcity by leveraging abundant structural data and decomposing MD modeling into structural generation and temporal alignment.
Details
Motivation: Generating molecular dynamics trajectories with deep generative models is challenging due to limited MD data availability and the complexity of modeling high-dimensional MD distributions. Current approaches struggle with data scarcity and the intricate nature of MD trajectory modeling.
Method: Proposes a two-stage framework: 1) Train a diffusion-based structure generation model on large-scale conformer datasets, 2) Introduce an interpolator module trained on MD trajectory data to enforce temporal consistency among generated structures. This approach leverages abundant structural data to mitigate MD data scarcity and decomposes the complex MD modeling task into structural generation and temporal alignment.
Result: Comprehensive evaluation on QM9 and DRUGS small-molecule datasets across unconditional generation, forward simulation, and interpolation tasks, with extension to tetrapeptide and protein monomer systems. The approach generates chemically realistic MD trajectories with remarkable improvements in geometric, dynamical, and energetic measurement accuracy.
Conclusion: The proposed framework effectively addresses MD trajectory generation challenges by leveraging structure pretraining and temporal consistency modules, demonstrating superior performance in generating realistic molecular dynamics trajectories across various molecular systems.
Abstract: Generating molecular dynamics (MD) trajectories using deep generative models has attracted increasing attention, yet remains inherently challenging due to the limited availability of MD data and the complexities involved in modeling high-dimensional MD distributions. To overcome these challenges, we propose a novel framework that leverages structure pretraining for MD trajectory generation. Specifically, we first train a diffusion-based structure generation model on a large-scale conformer dataset, on top of which we introduce an interpolator module trained on MD trajectory data, designed to enforce temporal consistency among generated structures. Our approach effectively harnesses abundant structural data to mitigate the scarcity of MD trajectory data and effectively decomposes the intricate MD modeling task into two manageable subproblems: structural generation and temporal alignment. We comprehensively evaluate our method on the QM9 and DRUGS small-molecule datasets across unconditional generation, forward simulation, and interpolation tasks, and further extend our framework and analysis to tetrapeptide and protein monomer systems. Experimental results confirm that our approach excels in generating chemically realistic MD trajectories, as evidenced by remarkable improvements of accuracy in geometric, dynamical, and energetic measurements.
[703] ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li
Main category: cs.LG
TL;DR: ACES method uses leave-one-out AUC scoring to weight LLM-generated tests for code selection, breaking circular dependency between test correctness and code correctness.
Details
Motivation: Existing methods for selecting LLM-generated code candidates using LLM-generated tests face challenges because tests themselves may be incorrect, creating a circular dependency where determining test correctness requires knowing which codes are correct.
Method: Proposes ACES (AUC Consistency Scoring) with two variants: ACES-C uses closed-form weights based on leave-one-out AUC (LOO-AUC) that measures test agreement with code rankings from other tests; ACES-O iteratively optimizes a differentiable LOO-AUC objective. Both operate on binary pass matrices.
Result: Achieves state-of-the-art Pass@k on multiple code generation benchmarks with negligible computational overhead.
Conclusion: The key insight is that test votes should rank codes rather than merely count passes, and LOO-AUC provides a principled way to weight tests without needing to determine their correctness directly.
Abstract: Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a \emph{circular dependency}. Our key insight is that we need not determine test correctness at all: \emph{test votes should rank, not merely count}. What matters is not how many codes pass a test, but whether the test can \emph{distinguish} correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test’s pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC~(LOO-AUC) and prove that the expected LOO-AUC is proportional to each test’s ability to separate correct code from incorrect code. Building on this, we propose \textbf{ACES}~(\textbf{A}UC \textbf{C}onsist\textbf{E}ncy \textbf{S}coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@$k$ on multiple code generation benchmarks.
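A minimal sketch of the leave-one-out AUC idea from the abstract (our illustration, not the authors' ACES-C/ACES-O weighting): hold out each test, rank codes by votes from the remaining tests, and measure how well the held-out test's pass/fail pattern agrees with that ranking.

```python
import numpy as np

def loo_auc(P):
    """P: (n_tests, n_codes) binary pass matrix. Returns one LOO-AUC per test."""
    n_tests, _ = P.shape
    aucs = np.full(n_tests, 0.5)
    for i in range(n_tests):
        scores = P.sum(axis=0) - P[i]  # leave-one-out vote count per code
        labels = P[i]                  # held-out test's pass/fail pattern
        pos, neg = scores[labels == 1], scores[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue                   # degenerate test: keep neutral 0.5
        # AUC = P(score_pos > score_neg) + 0.5 * P(tie)
        diff = pos[:, None] - neg[None, :]
        aucs[i] = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return aucs

# Toy example: tests 0-2 agree on which codes are good; test 3 is inverted.
P = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
aucs = loo_auc(P)
print(aucs)  # consistent tests score near 1, the inverted test near 0
```

Note that the noisy test is down-weighted without ever deciding which codes are actually correct, which is how the circular dependency is broken.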
[704] Learning Sampled-data Control for Swarms via MeanFlow
Anqi Dong, Yongxin Chen, Karl H. Johansson, Johan Karlsson
Main category: cs.LG
TL;DR: A sampled-data learning framework for swarm steering that learns finite-horizon control coefficients rather than instantaneous velocity fields, enabling few-step control updates consistent with communication/computational constraints.
Details
Motivation: Real-world swarm steering often has limited control updates due to communication or computational constraints, but most learning approaches model instantaneous velocity fields rather than finite-window control quantities needed for practical deployment.
Method: Generalizes MeanFlow framework to linear dynamic systems, learning finite-horizon coefficients that parameterize minimum-energy controls over each interval. Uses differential identity connecting these coefficients to local bridge-induced supervision signals, leading to stop-gradient regression objective.
Result: Creates a sampled-data learning framework operating directly in control space that guarantees the controller respects prescribed linear time-invariant dynamics and actuation channels while enabling few-step swarm steering at scale.
Conclusion: The method provides a principled approach to swarm steering that accounts for practical constraints on control updates while maintaining consistency with the underlying control system’s finite-window actuation structure.
Abstract: Steering large-scale swarms with only limited control updates is often needed due to communication or computational constraints, yet most learning-based approaches do not account for this and instead model instantaneous velocity fields. As a result, the natural object for decision making is a finite-window control quantity rather than an infinitesimal one. To address this gap, we consider the recent machine learning framework MeanFlow and generalize it to the setting with general linear dynamic systems. This results in a new sampled-data learning framework that operates directly in control space and that can be applied for swarm steering. To this end, we learn the finite-horizon coefficient that parameterizes the minimum-energy control applied over each interval, and derive a differential identity that connects this quantity to a local bridge-induced supervision signal. This identity leads to a simple stop-gradient regression objective, allowing the interval coefficient field to be learned efficiently from bridge samples. The learned policy is deployed through sampled-data updates, guaranteeing that the resulting controller exactly respects the prescribed linear time-invariant dynamics and actuation channel. The resulting method enables few-step swarm steering at scale, while remaining consistent with the finite-window actuation structure of the underlying control system.
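The finite-window quantity referenced above is the classical minimum-energy control. A sketch of its closed form for a discrete-time LTI system x_{k+1} = A x_k + B u_k (textbook construction via the reachability Gramian, not the paper's learned coefficient field), on a toy double integrator:

```python
import numpy as np

def min_energy_control(A, B, x0, x_target, N):
    """Input sequence minimizing sum ||u_k||^2 that steers x0 to x_target in N steps."""
    n = A.shape[0]
    # Reachability Gramian W_N = sum_{k=0}^{N-1} A^k B B^T (A^T)^k
    W = np.zeros((n, n))
    Ak = np.eye(n)
    for _ in range(N):
        W += Ak @ B @ B.T @ Ak.T
        Ak = A @ Ak
    lam = np.linalg.solve(W, x_target - np.linalg.matrix_power(A, N) @ x0)
    # u_k = B^T (A^T)^{N-1-k} lam
    return [B.T @ np.linalg.matrix_power(A.T, N - 1 - k) @ lam for k in range(N)]

A = np.array([[1.0, 1.0], [0.0, 1.0]])  # double integrator
B = np.array([[0.0], [1.0]])
x0, xT = np.array([0.0, 0.0]), np.array([1.0, 0.0])
u = min_energy_control(A, B, x0, xT, N=5)
x = x0
for uk in u:                            # roll the dynamics forward
    x = A @ x + B @ uk
print(np.allclose(x, xT))               # terminal constraint is met exactly
```

The sampled-data framework learns a field that outputs such interval controls directly, rather than an instantaneous velocity to be integrated.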
[705] Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade
Main category: cs.LG
TL;DR: A regime-calibrated approach for ride-hailing dispatch that segments historical trip data into demand regimes, matches current operations to similar historical patterns using a six-metric similarity ensemble, and uses calibrated demand priors to optimize fleet repositioning and dispatch.
Details
Motivation: Ride-hailing dispatch requires anticipating demand patterns that vary substantially across time-of-day, day-of-week, season, and special events. Current approaches need better methods to handle these complex temporal variations and provide robust performance across diverse scenarios.
Method: 1) Segment historical trip data into demand regimes, 2) Match current operating period to most similar historical analogues using a six-metric similarity ensemble (Kolmogorov-Smirnov, Wasserstein-1, feature distance, variance ratio, event pattern, temporal proximity), 3) Use calibrated demand prior to drive both an LP-based fleet repositioning policy and batch dispatch with Hungarian matching.
Result: Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios, the method reduces mean rider wait times by 31.1% (95% CI: [26.5, 36.6]%), P95 wait drops 37.6%, and Gini coefficient improves from 0.441 to 0.409 (7.3% relative). Calibration provides 16.9% reduction and LP repositioning adds 15.5%. Generalizes to Chicago with 23.3% wait reduction and robust across fleet sizes.
Conclusion: The regime-calibrated approach significantly improves ride-hailing dispatch performance without requiring training, is deterministic and explainable, generalizes across cities, and provides robust improvements across diverse operational scenarios and fleet sizes.
Abstract: Effective ride-hailing dispatch requires anticipating demand patterns that vary substantially across time-of-day, day-of-week, season, and special events. We propose a regime-calibrated approach that (i) segments historical trip data into demand regimes, (ii) matches the current operating period to the most similar historical analogues via a six-metric similarity ensemble (Kolmogorov-Smirnov, Wasserstein-1, feature distance, variance ratio, event pattern, temporal proximity), and (iii) uses the resulting calibrated demand prior to drive both an LP-based fleet repositioning policy and batch dispatch with Hungarian matching. In ablation, a distributional-only subset is strongest on mean wait, while the full ensemble is retained as a robustness-oriented default. Evaluated on 5.2 million NYC TLC trips across 8 diverse scenarios (winter/summer, weekday/weekend/holiday, morning/evening/night) with 5 random seeds each, our method reduces mean rider wait times by 31.1% (bootstrap 95% CI: [26.5, 36.6]%; Friedman chi-sq = 80.0, p = 4.25e-18; Cohen’s d = 7.5-29.9 across scenarios). The improvement extends to the tail: P95 wait drops 37.6% and the Gini coefficient of wait times improves from 0.441 to 0.409 (7.3% relative). The two contributions compose multiplicatively and are independently validated: calibration provides 16.9% reduction; LP repositioning adds a further 15.5%. The approach requires no training, is deterministic and explainable, generalizes to Chicago (23.3% wait reduction via NYC-built regime library), and is robust across fleet sizes (32-47% improvement for 0.5-2x fleet scaling). We provide comprehensive ablation studies, formal statistical tests, and routing-fidelity validation with OSRM.
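A sketch of two of the six similarity metrics (Kolmogorov-Smirnov and Wasserstein-1) used for regime matching; the demand samples here are hypothetical, and the full ensemble also weighs features, variance, events, and temporal proximity.

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum gap between the two empirical CDFs."""
    allv = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), allv, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), allv, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def wasserstein1(a, b):
    """W1 for equal-size samples: mean gap between sorted order statistics."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(1)
current = rng.exponential(5.0, 500)      # hypothetical trip-demand sample
regime_same = rng.exponential(5.0, 500)  # historical analogue, same regime
regime_diff = rng.exponential(9.0, 500)  # historical period, different regime
# The matching regime looks closer under both metrics
print(ks_statistic(current, regime_same) < ks_statistic(current, regime_diff))
print(wasserstein1(current, regime_same) < wasserstein1(current, regime_diff))
```

Combining distribution-shape (KS) and distribution-location (W1) views is what makes the ensemble robust: the ablation in the abstract notes the distributional subset alone is already strongest on mean wait.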
[706] Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Yifu Ding, Xinhao Zhang, Jinyang Guo
Main category: cs.LG
TL;DR: A low-bit mixed-precision attention kernel using MXFP data format for efficient LLM inference on next-gen GPUs, achieving speedup with minimal quality degradation.
Details
Motivation: Transformer LLMs have high inference costs due to quadratic attention complexity and memory bandwidth limitations of high-precision operations, creating a need for more efficient inference methods.
Method: Diagonal-Tiled Mixed-Precision Attention (DMA) kernel using the microscaling floating-point (MXFP) data format with low-bit computation at the tiling level, implemented as a fused kernel in Triton to exploit hardware parallelism and memory efficiency.
Result: Maintains generation quality with negligible degradation while achieving significant speedup through kernel fusion on NVIDIA B200 GPUs.
Conclusion: The DMA kernel enables fast and efficient LLM inference without compromising performance, addressing the high computational costs of transformer attention mechanisms.
Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations of high-precision operations. In this work, we present a low-bit mixed-precision attention kernel using the microscaling floating-point (MXFP) data format, utilizing the computing capability on next-generation GPU architectures. Our Diagonal-Tiled Mixed-Precision Attention (DMA) incorporates two kinds of low-bit computation at the tiling-level, and is a delicate fused kernel implemented using Triton, exploiting hardware-level parallelism and memory efficiency to enable fast and efficient inference without compromising model performance. Extensive empirical evaluations on NVIDIA B200 GPUs show that our kernel maintains generation quality with negligible degradation, and meanwhile achieves significant speedup by kernel fusion. We release our code at https://github.com/yifu-ding/MP-Sparse-Attn.
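A numpy illustration of the diagonal-tiling idea only (not the Triton kernel or real MXFP arithmetic): attention-score tiles on the diagonal keep higher precision than off-diagonal tiles. The crude uniform quantizer is a stand-in we chose for the MXFP formats.

```python
import numpy as np

def fake_quant(x, bits):
    """Crude per-tile uniform quantizer standing in for a low-bit MXFP format."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def diagonal_tiled_scores(S, tile=4, hi_bits=8, lo_bits=4):
    """Quantize a square score matrix tile-by-tile, diagonal tiles at hi_bits."""
    out = np.empty_like(S)
    n = S.shape[0]
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            bits = hi_bits if i == j else lo_bits  # diagonal tiles stay "hi"
            out[i:i+tile, j:j+tile] = fake_quant(S[i:i+tile, j:j+tile], bits)
    return out

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 8))          # toy attention-score matrix
Sq = diagonal_tiled_scores(S)
err_diag = np.abs(Sq[:4, :4] - S[:4, :4]).max()  # a diagonal tile
err_off = np.abs(Sq[:4, 4:] - S[:4, 4:]).max()   # an off-diagonal tile
print(err_diag, err_off)  # diagonal tile is quantized more finely
```

The intuition is that attention mass concentrates near the diagonal for many workloads, so spending bits there preserves quality while the cheap off-diagonal tiles buy the speedup.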
[707] Convergence of Byzantine-Resilient Gradient Tracking via Probabilistic Edge Dropout
Amirhossein Dezhboro, Fateme Maleki, Arman Adibi, Erfan Amini, Jose E. Ramirez-Marquez
Main category: cs.LG
TL;DR: GT-PD is a Byzantine-robust decentralized optimization method using gradient tracking with probabilistic edge dropout and trust-based defenses for adversarial network environments.
Details
Motivation: Distributed optimization over networks faces security threats from Byzantine agents that can send arbitrary adversarial messages, compromising convergence. Existing robust aggregation methods often lose the doubly stochastic mixing structure crucial for decentralized optimization.
Method: Proposes GT-PD with two defense layers: 1) a universal self-centered projection clipping incoming messages to a ball around the receiving agent, and 2) decentralized probabilistic dropout using dual-metric trust scores in the decision and tracking channels. GT-PD-L adds a leaky integrator for partial Byzantine isolation.
Result: GT-PD converges linearly to neighborhood determined by stochastic gradient variance under complete Byzantine isolation. GT-PD-L achieves linear convergence to bounded neighborhood under partial isolation. Experiments on MNIST show GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.
Conclusion: GT-PD/GT-PD-L provides effective Byzantine-robust decentralized optimization preserving doubly stochastic mixing structure, with theoretical guarantees and empirical superiority over existing methods under various attack scenarios.
Abstract: We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose \emph{Gradient Tracking with Probabilistic Edge Dropout} (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius $\tau$ around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation ($p_b=0$), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation ($p_b>0$), we introduce \emph{Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration} (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio. We further show that under two-tier dropout with $p_h=1$, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.
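A sketch of the first defense layer as described in the abstract: each agent projects an incoming neighbor message onto the ball of radius tau centered at its own iterate, so any Byzantine perturbation is bounded by tau.

```python
import numpy as np

def self_centered_projection(x_self, msg, tau):
    """Clip an incoming message to the ball of radius tau around the receiver."""
    diff = msg - x_self
    dist = np.linalg.norm(diff)
    if dist <= tau:
        return msg                        # nearby (honest-looking) message passes
    return x_self + tau * diff / dist     # outlier is clipped to the ball surface

x = np.zeros(3)                           # receiving agent's current iterate
honest = np.array([0.1, -0.2, 0.1])
byzantine = np.array([100.0, -50.0, 80.0])

passed = self_centered_projection(x, honest, tau=1.0)
clipped = self_centered_projection(x, byzantine, tau=1.0)
print(np.allclose(passed, honest))        # honest message is untouched
print(np.linalg.norm(clipped - x))        # adversarial message bounded at tau
```

Because the projection is applied symmetrically by every agent, it leaves the mixing weights (and hence double stochasticity) intact, which is the property the probabilistic dropout layer is designed to preserve as well.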
[708] BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
Yifu Ding, Xianglong Liu, Shenghao Jin, Jinyang Guo, Jiwen Lu
Main category: cs.LG
TL;DR: BWTA quantization enables ultra low-bit (binary weights & ternary activations) Transformer models with minimal accuracy loss and significant GPU speedup through algorithm-hardware co-design.
Details
Motivation: Ultra low-bit quantization offers efficiency benefits for Transformer models but suffers from accuracy degradation and limited GPU support, hindering practical deployment.
Method: Proposes Binary Weights & Ternary Activations (BWTA) quantization with Smooth Multi-Stage Quantization (Levelwise Degradation Strategy + Magnitude-Alignment Projection Factor) and custom CUDA kernels for efficient inference.
Result: Achieves near full-precision performance (3.5% avg drop on GLUE, <2% on 5 tasks), 16-24× kernel speedup over FP16, and 216-330 tokens/s end-to-end prefill speedup with lower memory footprint.
Conclusion: BWTA enables practical, low-latency ultra-low-bit inference for Transformers without sacrificing model quality through algorithm-hardware co-design.
Abstract: Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
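A sketch of the core BWTA idea from the abstract: binarize weights to {-1, +1} (with a scale) and ternarize activations to {-1, 0, +1}, projecting tiny values to zero. The per-tensor scale and the threshold heuristic below are common choices from the quantization literature, not the paper's exact training procedure.

```python
import numpy as np

def binarize_weights(W):
    """Binary weights: alpha * sign(W), with a per-tensor magnitude scale."""
    alpha = np.abs(W).mean()
    return alpha * np.sign(W)

def ternarize_activations(x, delta=None):
    """Ternary activations: tiny values go to zero, the rest to +/-1."""
    if delta is None:
        delta = 0.7 * np.abs(x).mean()  # illustrative threshold heuristic
    return np.where(np.abs(x) < delta, 0.0, np.sign(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
x = rng.standard_normal(8)
Wb, xt = binarize_weights(W), ternarize_activations(x)
print(np.unique(np.sign(Wb)))  # weights use two levels
print(np.unique(xt))           # activations use at most three levels
```

Mapping near-zero activations to an explicit zero is what the abstract calls avoiding zero-point distortion, and it is also what makes the MatMul expressible with cheap binary/ternary bit operations in the custom kernel.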
[709] Multirate Stein Variational Gradient Descent for Efficient Bayesian Sampling
Arash Sarshar
Main category: cs.LG
TL;DR: Multirate SVGD improves Bayesian inference by using different step sizes for attraction and repulsion components, enhancing stability and efficiency on challenging posteriors.
Details
Motivation: Standard SVGD uses a single global step size for both attraction (toward high-posterior regions) and repulsion (preserving particle diversity), which can be unstable in some regions and inefficient in others, especially for high-dimensional, anisotropic, or hierarchical posteriors.
Method: Developed multirate SVGD variants including a symmetric split method, a fixed multirate method (MR-SVGD), and an adaptive multirate method (Adapt-MR-SVGD) with local error control that update attraction and repulsion components on different time scales.
Result: Multirate SVGD variants improved robustness and quality-cost tradeoffs across six benchmark families: 50D Gaussian, 2D synthetic targets, UCI Bayesian logistic regression, multimodal Gaussian mixtures, Bayesian neural networks, and large-scale hierarchical logistic regression.
Conclusion: Multirate SVGD provides significant improvements over vanilla SVGD, especially on stiff hierarchical, strongly anisotropic, and multimodal targets, with adaptive multirate SVGD being the strongest variant and fixed multirate offering a simpler robust alternative.
Abstract: Many particle-based Bayesian inference methods use a single global step size for all parts of the update. In Stein variational gradient descent (SVGD), however, each update combines two qualitatively different effects: attraction toward high-posterior regions and repulsion that preserves particle diversity. These effects can evolve at different rates, especially in high-dimensional, anisotropic, or hierarchical posteriors, so one step size can be unstable in some regions and inefficient in others. We derive a multirate version of SVGD that updates these components on different time scales. The framework yields practical algorithms, including a symmetric split method, a fixed multirate method (MR-SVGD), and an adaptive multirate method (Adapt-MR-SVGD) with local error control. We evaluate the methods in a broad and rigorous benchmark suite covering six problem families: a 50D Gaussian target, multiple 2D synthetic targets, UCI Bayesian logistic regression, multimodal Gaussian mixtures, Bayesian neural networks, and large-scale hierarchical logistic regression. Evaluation includes posterior-matching metrics, predictive performance, calibration quality, mixing, and explicit computational cost accounting. Across these six benchmark families, multirate SVGD variants improve robustness and quality-cost tradeoffs relative to vanilla SVGD. The strongest gains appear on stiff hierarchical, strongly anisotropic, and multimodal targets, where adaptive multirate SVGD is usually the strongest variant and fixed multirate SVGD provides a simpler robust alternative at lower cost.
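A sketch of the multirate idea: one vanilla SVGD step split into its attraction and repulsion parts, each with its own step size (eps_attr, eps_rep). Target and step sizes here are illustrative; the paper's MR/Adapt-MR variants add splitting schemes and local error control.

```python
import numpy as np

def multirate_svgd_step(X, eps_attr, eps_rep, h=1.0):
    """One SVGD step with separate step sizes; target N(0, I), RBF kernel."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]                # diffs[j, i] = x_j - x_i
    K = np.exp(-np.sum(diffs**2, axis=-1) / (2 * h**2))  # kernel matrix K[j, i]
    grad_logp = -X                                       # score of N(0, I)
    attraction = K.T @ grad_logp / n                     # kernel-smoothed gradient
    repulsion = -np.einsum('ji,jid->id', K, diffs) / (h**2 * n)  # kernel gradient
    return X + eps_attr * attraction + eps_rep * repulsion

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2)) * 3.0 + 5.0  # particles start far from target
for _ in range(200):
    X = multirate_svgd_step(X, eps_attr=0.5, eps_rep=0.1)
print(np.abs(X.mean(axis=0)))  # particle mean has moved toward the origin
```

Setting eps_attr = eps_rep recovers standard SVGD; decoupling them lets the attraction rate be tuned for convergence while the repulsion rate is tuned for diversity, which is where the gains on stiff and anisotropic targets come from.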
[710] Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals
Momoka Iida, Hayato Motohashi, Hirotaka Takahashi
Main category: cs.LG
TL;DR: Autoencoder-based method for estimating parameters of noisy multi-component damped sinusoidal signals using latent space representation.
Details
Motivation: Damped sinusoidal oscillations are common in physical systems but parameter estimation is challenging with rapid decay, multiple components, and observational noise.
Method: Developed an autoencoder-based approach that uses the latent space to estimate frequency, phase, decay time, and amplitude of each component in noisy multi-component damped sinusoidal signals.
Result: Method achieves high accuracy parameter estimation even in challenging setups with subdominant components or nearly opposite-phase components, and remains robust with less informative training distributions.
Conclusion: The autoencoder-based method demonstrates potential as a tool for analyzing short-duration, noisy signals in physical systems.
Abstract: Damped sinusoidal oscillations are widely observed in many physical systems, and their analysis provides access to underlying physical properties. However, parameter estimation becomes difficult when the signal decays rapidly, multiple components are superposed, and observational noise is present. In this study, we develop an autoencoder-based method that uses the latent space to estimate the frequency, phase, decay time, and amplitude of each component in noisy multi-component damped sinusoidal signals. We investigate multi-component cases under Gaussian-distribution training and further examine the effect of the training-data distribution through comparisons between Gaussian and uniform training. The performance is evaluated through waveform reconstruction and parameter-estimation accuracy. We find that the proposed method can estimate the parameters with high accuracy even in challenging setups, such as those involving a subdominant component or nearly opposite-phase components, while remaining reasonably robust when the training distribution is less informative. This demonstrates its potential as a tool for analyzing short-duration, noisy signals.
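A sketch of the signal model the autoencoder is trained on: a noisy superposition of damped sinusoids, each parameterized by amplitude A, frequency f, phase phi, and decay time tau (parameter names and values are our illustration, not the paper's configuration).

```python
import numpy as np

def damped_sinusoid(t, A, f, phi, tau):
    """One damped sinusoidal component."""
    return A * np.exp(-t / tau) * np.sin(2 * np.pi * f * t + phi)

def make_signal(t, components, noise_std, rng):
    """Superpose components and add Gaussian observational noise."""
    clean = sum(damped_sinusoid(t, *c) for c in components)
    return clean + rng.normal(0.0, noise_std, size=t.shape)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
components = [(1.0, 5.0, 0.0, 0.3),      # dominant mode
              (0.3, 12.0, np.pi, 0.15)]  # subdominant, opposite-phase mode
y = make_signal(t, components, noise_std=0.05, rng=rng)
print(y.shape)  # one training example for the autoencoder
```

The hard cases mentioned in the abstract are visible in this construction: by the end of the window the envelopes have decayed to near the noise floor, so late samples carry little parameter information.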
[711] Can LLMs Learn to Reason Robustly under Noisy Supervision?
Shenzhi Yang, Guangcheng Zhu, Bowen Song, Sharon Li, Haobo Wang, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
Main category: cs.LG
TL;DR: RLVR with noisy labels: distinguishes inactive vs active noise, identifies Early Correctness Coherence phenomenon, proposes Online Label Refinement for gradual self-correction, shows robustness improvements across noise ratios.
Details
Motivation: RLVR effectively trains reasoning models but is vulnerable to noisy labels due to expert scarcity. Current analysis of noisy label mechanisms in RLVR is critically underexplored, especially compared to supervised classification.
Method: Distinguishes two types of noise (inactive vs active), identifies the Early Correctness Coherence phenomenon, and proposes Online Label Refinement (OLR), which progressively corrects potentially noisy labels using majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency.
Result: OLR consistently improves robustness across noise ratios (0.1 to 0.9), achieving average gains of 3.6% to 3.9% on in-distribution benchmarks (AIME24/25, AMC, MATH-500, Minerva, Olympiad) and 3.3% to 4.6% on out-of-distribution tasks (ARC-c, GPQA-diamond, MMLU-pro).
Conclusion: The proposed OLR method effectively addresses noisy label challenges in RLVR by enabling gradual self-correction as the policy improves, demonstrating robust performance improvements across both in-distribution and out-of-distribution mathematical reasoning tasks.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label’s influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer’s rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
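A sketch of the OLR refinement rule described in the abstract: replace a sample's label with the rollout majority answer when (i) that answer's rollout pass rate is trending up and (ii) the majority answer has been historically consistent. Thresholds are illustrative, not the paper's.

```python
from collections import Counter
import numpy as np

def refine_label(label, rollout_history):
    """rollout_history: list of per-update rollout answer lists (oldest first)."""
    majority = Counter(rollout_history[-1]).most_common(1)[0][0]
    # Condition (i): positive slope of the majority answer's rollout pass rate
    rates = [step.count(majority) / len(step) for step in rollout_history]
    slope = np.polyfit(range(len(rates)), rates, 1)[0]
    # Condition (ii): stable historical consistency of the majority answer
    majorities = [Counter(step).most_common(1)[0][0] for step in rollout_history]
    consistent = majorities.count(majority) / len(majorities) >= 0.8
    if slope > 0 and consistent and majority != label:
        return majority  # self-correct the (potentially noisy) label
    return label

# Noisy label "7"; rollouts increasingly converge on "12" across updates.
history = [["12", "7", "12", "3"],
           ["12", "12", "7", "12"],
           ["12", "12", "12", "12"]]
print(refine_label("7", history))
```

The two gating conditions encode the Early Correctness Coherence observation: correction is only trusted once the policy's own answers have become both improving and stable, so refinement strengthens gradually as training progresses.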
[712] Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
Dawar Jyoti Deka, Nilesh Sarkar
Main category: cs.LG
TL;DR: Knowledge distillation faces a geometric loss floor due to feature superposition limits in neural networks, where students can encode at most d_S·g(α) features based on sparsity, beyond which features are permanently lost.
Details
Motivation: The paper addresses the persistent performance saturation in knowledge distillation, where performance reaches a loss floor that persists across different training methods and objectives. The authors hypothesize this is not a training issue but a fundamental geometric limitation of neural network representations.
Method: The authors develop a geometric theory that neural networks represent features through superposition, with students limited to encoding at most d_S·g(α) features where g(α) is a sparsity-dependent capacity function. They validate this on a toy model (48 configurations) and on Pythia-410M using sparse autoencoders to measure features. They test distillation into five different student widths and analyze feature loss through linear probing.
Result: Validation on toy models achieved median accuracy >93%. On Pythia-410M, sparse autoencoders measured approximately 28,700 features at sparsity α≈0.992, with critical width d_S*≈1,065. Distillation experiments confirmed the predicted monotonic floor ordering. The observed loss floor decomposes into geometric and architectural components (R²=0.993). Linear probing shows coarse concepts survive even with 88% feature loss, indicating the floor arises from loss of fine-grained features in the importance distribution’s long tail.
Conclusion: The loss floor in knowledge distillation is fundamentally geometric, arising from the limited capacity of student networks to encode features through superposition. This connects representation geometry to distillation limits and provides a practical tool for predicting distillation performance from sparse autoencoder measurements alone.
Abstract: Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/((1-\alpha)\ln\frac{1}{1-\alpha})$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution’s long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.
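The capacity bound is easy to evaluate directly. Plugging the rounded figures quoted above (F ≈ 28,700 at α ≈ 0.992) into g(α) gives a critical width of roughly 1.1k, in the same range as the paper's d_S* ≈ 1,065 (which presumably uses the unrounded measurements):

```python
import math

def capacity(alpha):
    """Features encodable per dimension: g(alpha) = 1 / ((1-alpha) * ln(1/(1-alpha)))."""
    return 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))

def critical_width(n_features, alpha):
    """Smallest student width d_S such that d_S * g(alpha) >= n_features."""
    return n_features / capacity(alpha)

# Rounded values quoted in the summary: F ~ 28,700 features at alpha ~ 0.992.
print(round(capacity(0.992), 1))             # ~26 features per dimension
print(round(critical_width(28_700, 0.992)))  # same order as the quoted d_S* ~ 1,065
```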
[713] ArrowFlow: Hierarchical Machine Learning in the Space of Permutations
Ozgur Yilmaz
Main category: cs.LG
TL;DR: ArrowFlow is a novel ML architecture that operates entirely in permutation space using ranking filters and permutation-matrix accumulation, connecting to social choice theory for inductive biases.
Details
Motivation: To explore a fundamentally different computational paradigm that operates entirely in permutation space without floating-point parameters, elevating ordinal structure to a first-class citizen and aligning with integer-only/neuromorphic hardware.
Method: Uses ranking filters that compare inputs via Spearman’s footrule distance and update through permutation-matrix accumulation (a non-gradient rule). Layers compose hierarchically: each layer’s output ranking becomes the next layer’s input, enabling deep ordinal representation learning.
Result: Competitive performance on UCI benchmarks (beats all baselines on Iris with 2.7% vs 3.3% error), shows noise robustness (8-28% less degradation), privacy preservation, and missing-feature resilience. Single polynomial degree parameter acts as master switch for trade-offs.
Conclusion: ArrowFlow demonstrates competitive classification is possible in a fundamentally different computational paradigm that elevates ordinal structure, with natural alignment to specialized hardware, though not designed to surpass gradient-based methods.
Abstract: We introduce ArrowFlow, a machine learning architecture that operates entirely in the space of permutations. Its computational units are ranking filters, learned orderings that compare inputs via Spearman’s footrule distance and update through permutation-matrix accumulation, a non-gradient rule rooted in displacement evidence. Layers compose hierarchically: each layer’s output ranking becomes the next layer’s input, enabling deep ordinal representation learning without any floating-point parameters in the core computation. We connect the architecture to Arrow’s impossibility theorem, showing that violations of social-choice fairness axioms (context dependence, specialization, symmetry breaking) serve as inductive biases for nonlinearity, sparsity, and stability. Experiments span UCI tabular benchmarks, MNIST, gene expression cancer classification (TCGA), and preference data, all against GridSearchCV-tuned baselines. ArrowFlow beats all baselines on Iris (2.7% vs. 3.3%) and is competitive on most UCI datasets. A single parameter, polynomial degree, acts as a master switch: degree 1 yields noise robustness (8-28% less degradation), privacy preservation (+0.5pp cost), and missing-feature resilience; higher degrees trade these for improved clean accuracy. ArrowFlow is not designed to surpass gradient-based methods. It is an existence proof that competitive classification is possible in a fundamentally different computational paradigm, one that elevates ordinal structure to a first-class citizen, with natural alignment to integer-only and neuromorphic hardware.
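The core comparison step, Spearman's footrule between two rankings, is simple to state in code. This is a generic sketch of the distance on toy inputs, not ArrowFlow's full ranking-filter update:

```python
def footrule(p, q):
    """Spearman's footrule: sum of absolute rank displacements between permutations."""
    pos_q = {item: i for i, item in enumerate(q)}
    return sum(abs(i - pos_q[item]) for i, item in enumerate(p))

def rank_vector(values):
    """Convert raw feature values to a ranking (indices sorted by value)."""
    return sorted(range(len(values)), key=lambda i: values[i])

a = rank_vector([0.2, 0.9, 0.5])  # -> [0, 2, 1]
b = rank_vector([0.9, 0.2, 0.5])  # -> [1, 2, 0]
print(footrule(a, b))
```

Inputs are converted to rankings once, after which all comparisons are integer arithmetic, which is what makes the architecture compatible with integer-only hardware.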
[714] Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization
Xuelin Zhang, Hong Chen, Bin Gu, Tieliang Gong, Feng Zheng
Main category: cs.LG
TL;DR: This paper provides a systematic generalization analysis of first-order gradient-based bilevel optimization methods, establishing connections between on-average argument stability and generalization gap for stochastic bilevel optimization.
Details
Motivation: Stochastic bilevel optimization (SBO) has been widely applied in machine learning (hyperparameter optimization, meta learning, reinforcement learning), but the generalization guarantees of SBO methods are poorly understood from a statistical learning theory perspective. Previous algorithmic stability analyses have limitations that need to be addressed.
Method: The authors establish quantitative connections between on-average argument stability and the generalization gap of SBO methods. They derive upper bounds on on-average argument stability for single-timescale SGD and two-timescale SGD, considering three settings: nonconvex-nonconvex (NC-NC), convex-convex (C-C), and strongly-convex-strongly-convex (SC-SC). Experimental validation is provided.
Result: The paper provides generalization guarantees for SBO methods that don’t require reinitializing inner-level parameters at each iteration and are applicable to more general objective functions compared to previous analyses. Experimental results validate the theoretical findings.
Conclusion: This work offers a systematic generalization analysis framework for bilevel optimization methods, addressing limitations of previous stability analyses and providing theoretical guarantees for more practical settings.
Abstract: Stochastic bilevel optimization (SBO) has been integrated into many machine learning paradigms recently, including hyperparameter optimization, meta learning, and reinforcement learning. Along with the wide range of applications, there have been numerous studies on the computational behavior of SBO. However, the generalization guarantees of SBO methods are far less understood from the lens of statistical learning theory. In this paper, we provide a systematic generalization analysis of the first-order gradient-based bilevel optimization methods. Firstly, we establish the quantitative connections between the on-average argument stability and the generalization gap of SBO methods. Then, we derive the upper bounds of on-average argument stability for single-timescale stochastic gradient descent (SGD) and two-timescale SGD, where three settings (nonconvex-nonconvex (NC-NC), convex-convex (C-C), and strongly-convex-strongly-convex (SC-SC)) are considered respectively. Experimental analysis validates our theoretical findings. Compared with the previous algorithmic stability analysis, our results do not require reinitializing the inner-level parameters at each iteration and are applicable to more general objective functions.
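The two-timescale setting analyzed above can be illustrated on a toy SC-SC instance. The problem, step sizes, and the first-order (no implicit-gradient) outer update below are illustrative choices, not the paper's algorithm:

```python
# Toy strongly-convex-strongly-convex (SC-SC) bilevel problem:
#   inner: min_y g(x, y) = (y - x)^2 / 2          => y*(x) = x
#   outer: min_x f(x, y) = (x - 1)^2 / 2 + (y - x)^2 / 2
# so the bilevel optimum is x = y = 1.

def two_timescale_sgd(steps=2000, eta_outer=0.01, eta_inner=0.1):
    x, y = 0.0, 0.0
    for _ in range(steps):
        y -= eta_inner * (y - x)                # faster inner step on g
        x -= eta_outer * ((x - 1.0) + (x - y))  # slower first-order outer step on f
    # The inner variable y is never reinitialized between outer updates,
    # matching the setting the paper's stability analysis covers.
    return x, y

x, y = two_timescale_sgd()
print(round(x, 3), round(y, 3))  # both approach the bilevel solution (1, 1)
```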
[715] Spectral Path Regression: Directional Chebyshev Harmonics for Interpretable Tabular Learning
Milo Coombs
Main category: cs.LG
TL;DR: A novel multivariate approximation method using directional harmonic modes instead of tensor products for tabular data regression, achieving competitive accuracy with interpretable analytic expressions.
Details
Motivation: Traditional multivariate approximation methods like Chebyshev polynomials suffer from exponential scaling with dimension and impose axis-aligned structure that doesn't match real tabular data patterns.
Method: Replace tensorized oscillations with directional harmonic modes of the form cos(m⊤arccos(x)), organizing multivariate structure by direction in angular space rather than by coordinate index. Use discrete spectral regression with structured frequency vectors (spectral paths) and closed-form ridge regression.
Result: The method achieves accuracy competitive with strong nonlinear baselines on standard continuous-feature tabular regression benchmarks while remaining compact, computationally efficient, and interpretable.
Conclusion: The proposed directional harmonic representation provides a principled, interpretable alternative to traditional multivariate approximation methods, effectively capturing complex tabular data patterns without exponential scaling.
Abstract: Classical approximation bases such as Chebyshev polynomials provide principled and interpretable representations, but their multivariate tensor-product constructions scale exponentially with dimension and impose axis-aligned structure that is poorly matched to real tabular data. We address this by replacing tensorised oscillations with directional harmonic modes of the form $\cos(\mathbf{m}^{\top}\arccos(\mathbf{x}))$, which organise multivariate structure by direction in angular space rather than by coordinate index. This representation yields a discrete spectral regression model in which complexity is controlled by selecting a small number of structured frequency vectors (spectral paths), and training reduces to a single closed-form ridge solve with no iterative optimisation. Experiments on standard continuous-feature tabular regression benchmarks show that the resulting models achieve accuracy competitive with strong nonlinear baselines while remaining compact, computationally efficient, and explicitly interpretable through analytic expressions of learned feature interactions.
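A minimal sketch of the pipeline as the abstract describes it: build directional harmonic features, then fit with a single closed-form ridge solve. The modes, synthetic data, and regularization strength are invented for illustration:

```python
import numpy as np

def directional_features(X, modes):
    """Directional Chebyshev harmonics phi_m(x) = cos(m^T arccos(x)), x in [-1, 1]^d."""
    theta = np.arccos(np.clip(X, -1.0, 1.0))                 # angles, shape (n, d)
    return np.cos(theta @ np.asarray(modes, dtype=float).T)  # shape (n, n_modes)

def fit_ridge(Phi, y, lam=1e-3):
    """Single closed-form ridge solve: w = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
# Target generated by a single directional mode m = (2, 1).
y = np.cos(2.0 * np.arccos(X[:, 0]) + np.arccos(X[:, 1]))

modes = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (2, 0)]  # a small "spectral path"
w = fit_ridge(directional_features(X, modes), y)
print(np.round(w, 2))  # the weight concentrates on the generating mode (2, 1)
```

Because each fitted weight attaches to an explicit analytic mode, the learned model can be read off as a closed-form expression, which is the interpretability claim.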
[716] Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It
Nida Zamir, I-Hong Hou
Main category: cs.LG
TL;DR: A novel Penalty-Optimal Whittle (POW) index policy for Restless Multi-Armed Bandit problems with individual penalty constraints, enabling optimal resource allocation in dynamic wireless networks.
Details
Motivation: To address resource allocation challenges in dynamic wireless networked environments where each user has distinct and stringent performance constraints (energy limits, activation limits, age of information minimums), enabling capture of diverse objectives including fairness and efficiency.
Method: Proposes a Penalty-Optimal Whittle (POW) index policy where each user’s index depends only on their transition kernel and penalty constraints, remaining invariant to system-wide features. Also introduces a deep reinforcement learning algorithm to learn POW indices efficiently.
Result: The POW index policy is theoretically proven to be asymptotically optimal while satisfying all individual penalty constraints. Simulation results across various applications show near-optimal performance and significant outperformance over existing policies.
Conclusion: The POW index policy provides a computationally tractable, asymptotically optimal solution for RMAB problems with individual penalty constraints, suitable for dynamic wireless resource allocation with diverse user requirements.
Abstract: This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age of information minimums, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of a user depends only on the user’s transition kernel and penalty constraints, and remains invariant to system-wide features such as the number of users present and the amount of resource available. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we theoretically prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only has near-optimal performance but also significantly outperforms other existing policies.
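Once per-arm indices are available, a Whittle-style policy reduces scheduling to a top-k selection each round. The index values below are hypothetical placeholders, since computing the actual POW index requires each arm's transition kernel and penalty constraints:

```python
def activate_top_k(indices, k):
    """Whittle-style rule: activate the k arms with the largest index values."""
    return sorted(sorted(range(len(indices)), key=lambda i: -indices[i])[:k])

# Hypothetical per-arm POW index values for 5 users, with a budget of 2 activations.
pow_indices = [0.7, 0.2, 0.9, 0.4, 0.8]
print(activate_top_k(pow_indices, k=2))  # arms 2 and 4 have the largest indices
```

This decoupling is what makes the policy scale: adding or removing users changes only the length of the index list, not how any individual index is computed.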
[717] Physical Sensitivity Kernels Can Emerge in Data-Driven Forward Models: Evidence From Surface-Wave Dispersion
Ziye Yu, Yuqi Cai, Xin Liu
Main category: cs.LG
TL;DR: Neural network surrogates for geophysical forward modeling can learn physically meaningful gradient information comparable to theoretical sensitivity kernels, but training distribution priors can introduce artifacts.
Details
Motivation: To determine whether data-driven neural networks used as surrogate forward models in geophysics recover only data mappings or also capture the underlying physical sensitivity structure, particularly for surface-wave dispersion analysis.
Method: Compare automatically differentiated gradients from neural-network surrogates with theoretical sensitivity kernels for surface-wave dispersion across various periods.
Result: Learned gradients recover main depth-dependent structure of physical kernels across broad period ranges, indicating neural surrogates learn physically meaningful differential information rather than acting as pure black-box predictors.
Conclusion: Neural forward surrogates can recover useful physical information for inversion and uncertainty analysis, but strong structural priors in training distributions can introduce systematic artifacts into inferred sensitivities.
Abstract: Data-driven neural networks are increasingly used as surrogate forward models in geophysics, but it remains unclear whether they recover only the data mapping or also the underlying physical sensitivity structure. Here we test this question using surface-wave dispersion. By comparing automatically differentiated gradients from a neural-network surrogate with theoretical sensitivity kernels, we show that the learned gradients can recover the main depth-dependent structure of physical kernels across a broad range of periods. This indicates that neural surrogate models can learn physically meaningful differential information, rather than acting as purely black-box predictors. At the same time, strong structural priors in the training distribution can introduce systematic artifacts into the inferred sensitivities. Our results show that neural forward surrogates can recover useful physical information for inversion and uncertainty analysis, while clarifying the conditions under which this differential structure remains physically consistent.
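The comparison methodology can be sketched with a stand-in analytic forward model: differentiate it numerically with respect to the model parameters and compare the resulting sensitivity "kernel" against the analytic one. The paper differentiates a neural surrogate via automatic differentiation; central finite differences play that role here, and the toy forward map is invented:

```python
import math

def forward(model):
    """Toy linear forward map (stand-in for a phase-velocity prediction)."""
    return sum(math.exp(-0.5 * i) * v for i, v in enumerate(model))

def analytic_kernel(model):
    """Exact sensitivity of the toy forward map to each model parameter."""
    return [math.exp(-0.5 * i) for i in range(len(model))]

def fd_kernel(model, eps=1e-6):
    """Central finite-difference sensitivity, one entry per depth/parameter."""
    grad = []
    for i in range(len(model)):
        up = model[:i] + [model[i] + eps] + model[i + 1:]
        dn = model[:i] + [model[i] - eps] + model[i + 1:]
        grad.append((forward(up) - forward(dn)) / (2 * eps))
    return grad

m = [3.0, 3.2, 3.5, 3.9]  # e.g. layer velocities in a 1-D model (made-up values)
err = max(abs(a - b) for a, b in zip(analytic_kernel(m), fd_kernel(m)))
print(err)  # the numerical kernel matches the analytic one
```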
[718] The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
Prashant C. Raju
Main category: cs.LG
TL;DR: Foundation models for biology/physics fail to preserve continuous geometry due to discrete tokenization bottlenecks, causing geometric distortion that can be reduced 8.5x with continuous objectives instead of cross-entropy.
Details
Motivation: Current foundation models for biological and physical systems optimize for predictive accuracy but systematically fail to preserve the continuous geometric structure of the systems they model, which is crucial for scientific understanding and generalization.
Method: The paper uses controlled ablations on synthetic dynamical systems, replacing cross-entropy with continuous heads on identical encoders. It evaluates 14 biological foundation models using rate-distortion theory and MINE (Mutual Information Neural Estimation), and analyzes learned codebooks and quantization effects.
Result: Continuous objectives reduce geometric distortion by up to 8.5x compared to discrete tokenization. Learned codebooks show a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Three architectures differ by only 1.3x under continuous objectives but diverge by 3,000x under discrete tokenization. Three failure regimes are identified: Local-Global Decoupling, Representational Compression, and Geometric Vacuity.
Conclusion: The Geometric Alignment Tax is an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. No existing model achieves simultaneously low distortion, high mutual information, and global coherence. The paper reveals fundamental limitations in current foundation model architectures for scientific domains.
Abstract: Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2’s reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
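The kind of geometric-distortion measurement involved can be sketched with a uniform (unlearned) codebook: quantize points on a continuous 1-D manifold and measure how pairwise distances change. Note that for this simple codebook finer quantization monotonically helps, whereas the paper's "double bind" concerns learned codebooks, where it can worsen geometry:

```python
def quantize(x, levels):
    """Uniform codebook on [0, 1] with the given number of levels."""
    step = 1.0 / (levels - 1)
    return round(x / step) * step

def distance_distortion(xs, ys):
    """Mean absolute change in pairwise distances between xs and ys."""
    pairs = [(i, j) for i in range(len(xs)) for j in range(i + 1, len(xs))]
    return sum(abs(abs(xs[i] - xs[j]) - abs(ys[i] - ys[j]))
               for i, j in pairs) / len(pairs)

xs = [i / 99 for i in range(100)]  # points along a 1-D continuous manifold
results = {levels: distance_distortion(xs, [quantize(x, levels) for x in xs])
           for levels in (4, 16, 64)}
print({k: round(v, 4) for k, v in results.items()})
```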
[719] Uncertainty-Aware Foundation Models for Clinical Data
Qian Zhou, Yuanyun Zhang, Shi Li
Main category: cs.LG
TL;DR: A framework for uncertainty-aware healthcare foundation models that represents patients as distributions over latent states rather than point embeddings, addressing sparse and irregular clinical data.
Details
Motivation: Clinical observations are inherently incomplete, sparse, irregular, and modality-dependent, making deterministic representations inadequate. Current healthcare foundation models follow NLP/CV paradigms but fail to capture the uncertainty and partial nature of clinical data.
Method: Proposes uncertainty-aware foundation modeling with set-valued representations that capture distributions over plausible latent states. Uses multimodal encoders with self-supervised objectives combining reconstruction, contrastive alignment, and distributional regularization to enforce consistency across partial patient views.
Result: Improves predictive performance, robustness under missing data, and uncertainty calibration across diverse clinical tasks compared to strong baselines.
Conclusion: Modeling what is not observed (epistemic uncertainty) rather than only what is observed constitutes a critical inductive bias for healthcare foundation models.
Abstract: Healthcare foundation models have largely followed paradigms from natural language processing and computer vision, emphasizing large scale pretraining and deterministic representations over heterogeneous clinical data. However, clinical observations are inherently incomplete, reflecting sparse, irregular, and modality-dependent measurements of an underlying physiologic state. In this work, we propose a framework for uncertainty-aware foundation modeling that represents each patient not as a point embedding, but as a distribution over plausible latent states. By learning set-valued representations and enforcing consistency across partial views of the same patient, the model captures what is invariantly inferable while explicitly encoding epistemic uncertainty. We integrate this formulation with multimodal encoders and scalable self-supervised objectives, combining reconstruction, contrastive alignment, and distributional regularization. Across diverse clinical tasks, our approach improves predictive performance, robustness under missing data, and uncertainty calibration relative to strong baselines. These results suggest that modeling what is not observed, rather than only what is, constitutes a critical inductive bias for healthcare foundation models.
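The "distribution over latent states" idea can be illustrated with its simplest instance: each partial view of a patient yields a Gaussian belief, and combining views tightens the uncertainty. The precision-weighted fusion rule and the numbers are illustrative, not the paper's architecture:

```python
def fuse(view_a, view_b):
    """Precision-weighted fusion of two Gaussian beliefs (mu, var)."""
    (m1, v1), (m2, v2) = view_a, view_b
    p1, p2 = 1.0 / v1, 1.0 / v2
    var = 1.0 / (p1 + p2)
    return ((p1 * m1 + p2 * m2) * var, var)

labs = (0.8, 0.5)     # latent estimate from lab results (made-up numbers)
vitals = (1.2, 0.25)  # latent estimate from vital signs (made-up numbers)

mu, var = fuse(labs, vitals)
print(round(mu, 3), round(var, 3))  # fused belief is tighter than either view
```

The point is that a patient with few observations keeps a wide posterior rather than collapsing to a confident point estimate, which is what calibrated uncertainty requires.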
[720] Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach
Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, Richard Baraniuk
Main category: cs.LG
TL;DR: NPGC is a privacy-preserving synthetic data generation method for educational data that uses non-parametric Gaussian copulas instead of deep learning, maintaining marginal distributions while protecting student privacy with differential privacy.
Details
Motivation: To enable educational data mining under strict privacy regulations by creating synthetic data that preserves the statistical properties of real student records without compromising sensitive information, addressing issues of distribution drift and computational cost in existing methods.
Method: The Non-Parametric Gaussian Copula (NPGC) replaces deep learning and parametric optimization with empirical statistical anchoring, preserves observed marginal distributions, models dependencies through a copula framework, integrates Differential Privacy at the marginal and correlation levels, supports heterogeneous variable types, and treats missing data as an explicit state.
Result: NPGC remains stable across multiple regeneration cycles, achieves competitive downstream performance compared to deep learning and parametric baselines on five benchmark datasets, and operates at substantially lower computational cost. Successfully deployed in a real-world online learning platform.
Conclusion: NPGC provides a practical, privacy-preserving synthetic data generation solution for educational research that maintains statistical fidelity while protecting student privacy, offering advantages over existing deep learning approaches in stability and computational efficiency.
Abstract: To advance Educational Data Mining (EDM) within strict privacy-protecting regulatory frameworks, researchers must develop methods that enable data-driven analysis while protecting sensitive student information. Synthetic data generation is one such approach, enabling the release of statistically generated samples instead of real student records; however, existing deep learning and parametric generators often distort marginal distributions and degrade under iterative regeneration, leading to distribution drift and progressive loss of distributional support that compromise reliability. In response, we introduce the Non-Parametric Gaussian Copula (NPGC), a plug-and-play synthesis method that replaces deep learning and parametric optimization with empirical statistical anchoring to preserve the observed marginal distributions while modeling dependencies through a copula framework. NPGC integrates Differential Privacy (DP) at both the marginal and correlation levels, supports heterogeneous variable types, and treats missing data as an explicit state to retain informative absence patterns. We evaluate NPGC against deep learning and parametric baselines on five benchmark datasets and demonstrate that it remains stable across multiple regeneration cycles and achieves competitive downstream performance at substantially lower computational cost. We further validate NPGC through deployment in a real-world online learning platform, demonstrating its practicality for privacy-preserving research.
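The copula mechanism can be sketched for a single feature pair, without the DP noise, tie handling, or missing-data states of the full method: map each column to normal scores via its empirical ranks, sample correlated Gaussians, then invert through the empirical marginals so synthetic values reuse only observed ones. The data are made up:

```python
import random
from statistics import NormalDist

def normal_scores(col):
    """Map a column to Gaussian scores via its empirical ranks (assumes no ties)."""
    n = len(col)
    ranks = {v: r for r, v in enumerate(sorted(col))}
    return [NormalDist().inv_cdf((ranks[v] + 0.5) / n) for v in col]

def empirical_inverse(col, u):
    """Empirical quantile function: return the observed value at quantile u."""
    s = sorted(col)
    return s[min(int(u * len(s)), len(s) - 1)]

def synthesize(col_a, col_b, m, seed=0):
    za, zb = normal_scores(col_a), normal_scores(col_b)
    # Correlation of the normal scores (clamped for numerical safety).
    rho = sum(a * b for a, b in zip(za, zb)) / (
        sum(a * a for a in za) ** 0.5 * sum(b * b for b in zb) ** 0.5)
    rho = max(-1.0, min(1.0, rho))
    tail = max(0.0, 1.0 - rho * rho) ** 0.5
    rng = random.Random(seed)
    out = []
    for _ in range(m):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rho * g1 + tail * rng.gauss(0.0, 1.0)
        u1, u2 = NormalDist().cdf(g1), NormalDist().cdf(g2)
        out.append((empirical_inverse(col_a, u1), empirical_inverse(col_b, u2)))
    return out

scores = [55, 61, 68, 72, 75, 81, 88, 93]         # hypothetical exam scores
hours = [1.0, 2.0, 1.5, 2.5, 3.5, 3.0, 4.5, 5.0]  # hypothetical study hours
synth = synthesize(scores, hours, m=5)
print(synth)  # every synthetic value comes from the observed marginals
```

Because the inverse step only ever emits observed values, the marginal distributions cannot drift across regeneration cycles, which is the stability property the abstract emphasizes.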
[721] Which Leakage Types Matter?
Simon Roth
Main category: cs.LG
TL;DR: Analysis of data leakage severity in ML across 2,047 tabular datasets reveals selection leakage matters most while normalization leakage is negligible, with memorization scaling with model capacity.
Details
Motivation: To systematically measure and quantify the severity of different classes of data leakage in machine learning, challenging conventional wisdom about which types of leakage matter most in practice.
Method: Conducted 28 within-subject counterfactual experiments across 2,047 tabular datasets, plus boundary experiments on 129 temporal datasets, measuring four data leakage classes: estimation leakage (fitting scalers on full data), selection leakage (peeking, seed cherry-picking), memorization, and boundary leakage.
Result: Class I (estimation) leakage is negligible (|ΔAUC| ≤ 0.005). Class II (selection) leakage is substantial with ~90% of measured effect being noise exploitation that inflates scores. Class III (memorization) scales with model capacity from d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random cross-validation.
Conclusion: The conventional emphasis on data leakage is inverted: normalization leakage matters least in practice, while selection leakage at practical dataset sizes matters most, with memorization effects scaling with model complexity.
Abstract: Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation - fitting scalers on full data) is negligible: all nine conditions produce $|\Delta\text{AUC}| \leq 0.005$. Class II (selection - peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
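The Class I counterfactual is simple to reproduce in spirit: standardize the test split once with statistics estimated on all data (leaky) and once with statistics from the training split only (clean), and compare. The toy data below are synthetic; the paper's finding is that this gap barely moves downstream metrics:

```python
import random
import statistics

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(200)]
train, test = data[:150], data[150:]

def scale(xs, mu, sd):
    return [(x - mu) / sd for x in xs]

# Leaky pipeline: scaler statistics estimated on ALL data, including the test split.
leaky = scale(test, statistics.mean(data), statistics.stdev(data))
# Clean pipeline: scaler statistics estimated on the training split only.
clean = scale(test, statistics.mean(train), statistics.stdev(train))

max_gap = max(abs(a - b) for a, b in zip(leaky, clean))
print(round(max_gap, 4))  # the two pipelines yield nearly identical test features
```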
[722] ClawArena: Benchmarking AI Agents in Evolving Information Environments
Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Main category: cs.LG
TL;DR: ClawArena is a benchmark for evaluating AI agents in evolving information environments with noisy, contradictory sources and dynamic updates, testing multi-source reasoning, belief revision, and implicit personalization.
Details
Motivation: AI agents need to maintain correct beliefs as information environments evolve with heterogeneous, contradictory sources, new information invalidating earlier conclusions, and implicit user preferences. Existing benchmarks assume static, single-authority settings and don't evaluate agents' ability to handle this complexity.
Method: ClawArena creates scenarios with complete hidden ground truth while exposing agents to noisy, partial, and contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation uses three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, with a 14-category question taxonomy tested via multi-choice and shell-based executable checks.
Result: Experiments on five agent frameworks and five language models show model capability (15.4% range) and framework design (9.2%) substantially affect performance. Self-evolving skill frameworks can partially close model-capability gaps, and belief revision difficulty is determined by update design strategy rather than mere presence of updates.
Conclusion: ClawArena provides a comprehensive benchmark for evaluating AI agents in realistic, evolving information environments, revealing important insights about model capabilities, framework design, and the nature of belief revision challenges.
Abstract: AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.
[723] Towards Agentic Defect Reasoning: A Graph-Assisted Retrieval Framework for Laser Powder Bed Fusion
Muhammad Rizwan Awan, Volker Pickert, Muhammad Waqar Ashraf, Saleh Ali, Farshid Mahmouditabar, Shafiq Odhano
Main category: cs.LG
TL;DR: A graph-assisted retrieval framework for defect reasoning in Laser Powder Bed Fusion (LPBF) that converts scientific literature into structured knowledge graphs to link process parameters, mechanisms, and defects.
Details
Motivation: LPBF is highly sensitive to process parameters that influence defect formation through complex thermal/fluid mechanisms, but defect-related knowledge is dispersed across literature, limiting systematic understanding.
Method: Transforms scientific publications into structured representation, encodes relationships between parameters, mechanisms, and defects into evidence-linked knowledge graph, integrates semantic and graph-based retrieval with lightweight agent-based reasoning layer.
Result: High retrieval accuracy (0.9667) and recall (0.9667), demonstrating effective identification of relevant defect-related evidence and enabling transparent reasoning chains linking process parameters to defects.
Conclusion: Provides scalable approach for converting unstructured literature into queryable and interpretable knowledge resource for additive manufacturing with transparent defect reasoning.
Abstract: Laser Powder Bed Fusion (LPBF) is highly sensitive to process parameters, which influence defect formation through complex thermal and fluid mechanisms. However, defect-related knowledge is dispersed across the literature, limiting systematic understanding. This study presents a graph-assisted retrieval framework for defect reasoning in LPBF, using Ti6Al4V as a case study. Scientific publications are transformed into a structured representation, and relationships between parameters, mechanisms, and defects are encoded into an evidence-linked knowledge graph. The framework integrates semantic and graph-based retrieval, supported by a lightweight agent-based reasoning layer to construct interpretable defect pathways. Evaluation shows high retrieval accuracy (0.9667) and recall (0.9667), demonstrating effective identification of relevant defect-related evidence. The framework enables transparent reasoning chains linking process parameters to defects. This work provides a scalable approach for converting unstructured literature into a queryable and interpretable knowledge resource for additive manufacturing.
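The parameter-mechanism-defect chaining that an evidence-linked graph of this kind supports can be sketched as a toy triple store; all triples, relation names, and provenance tags below are illustrative, not the paper's data:

```python
# A minimal evidence-linked knowledge graph: (subject, relation, object) triples
# carrying a provenance tag, queried by chaining parameter -> mechanism -> defect.
# Every entry here is a made-up illustration, not content from the paper.

triples = [
    ("high scan speed", "induces", "insufficient melting", "paper_A"),
    ("insufficient melting", "causes", "lack-of-fusion porosity", "paper_A"),
    ("high laser power", "induces", "keyhole instability", "paper_B"),
    ("keyhole instability", "causes", "keyhole porosity", "paper_B"),
]

def defect_pathways(parameter):
    """Return (mechanism, defect, evidence) chains reachable from a parameter."""
    chains = []
    for s1, r1, mech, ev1 in triples:
        if s1 == parameter and r1 == "induces":
            for s2, r2, defect, ev2 in triples:
                if s2 == mech and r2 == "causes":
                    chains.append((mech, defect, {ev1, ev2}))
    return chains

print(defect_pathways("high scan speed"))
# [('insufficient melting', 'lack-of-fusion porosity', {'paper_A'})]
```

Attaching the provenance set to each chain is what makes the resulting reasoning path auditable back to its source publications.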
[724] Learning from Imperfect Demonstrations via Temporal Behavior Tree-Guided Trajectory Repair
Aniruddh G. Puranic, Sebastian Schirmer, John S. Baras, Calin Belta
Main category: cs.LG
TL;DR: A framework using Temporal Behavior Trees to repair suboptimal robot demonstrations for improved imitation and reinforcement learning.
Details
Motivation: Real-world robot demonstrations are often suboptimal, noisy, or imperfect, posing challenges for imitation and reinforcement learning. There's a need to improve data quality before policy learning.
Method: Uses Temporal Behavior Trees (TBT) to repair suboptimal trajectories that violate formal specifications. A model-based repair algorithm corrects trajectory segments to satisfy constraints. Repaired trajectories are used to extract potential functions that shape reward signals for reinforcement learning.
Result: Demonstrated effectiveness on discrete grid-world navigation and continuous single/multi-agent reach-avoid tasks. Shows potential for data-efficient robot learning in settings with imperfect demonstrations.
Conclusion: The framework enables logical consistency and interpretability in repaired trajectories, guiding agents toward task-consistent regions without requiring kinematic model knowledge.
Abstract: Learning robot control policies from demonstrations is a powerful paradigm, yet real-world data is often suboptimal, noisy, or otherwise imperfect, posing significant challenges for imitation and reinforcement learning. In this work, we present a formal framework that leverages Temporal Behavior Trees (TBT), an extension of Signal Temporal Logic (STL) with Behavior Tree semantics, to repair suboptimal trajectories prior to their use in downstream policy learning. Given demonstrations that violate a TBT specification, a model-based repair algorithm corrects trajectory segments to satisfy the formal constraints, yielding a dataset that is both logically consistent and interpretable. The repaired trajectories are then used to extract potential functions that shape the reward signal for reinforcement learning, guiding the agent toward task-consistent regions of the state space without requiring knowledge of the agent’s kinematic model. We demonstrate the effectiveness of this framework on discrete grid-world navigation and continuous single and multi-agent reach-avoid tasks, highlighting its potential for data-efficient robot learning in settings where high-quality demonstrations cannot be assumed.
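The potential-based reward shaping the framework relies on can be sketched as follows; a minimal illustration with assumed details (the 1-D grid, the potential `phi`, and the goal state are not from the paper):

```python
# Potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
# Shaping of this form is known to preserve the optimal policy (Ng et al., 1999),
# which is why extracting potentials from repaired trajectories is safe.

def shaped_reward(base_reward, phi, s, s_next, gamma=0.99):
    """Augment the environment reward with a potential-based shaping term."""
    return base_reward + gamma * phi(s_next) - phi(s)

# Toy potential on a 1-D grid: negative distance to an assumed goal state 5,
# nudging the agent toward task-consistent regions of the state space.
phi = lambda s: -abs(5 - s)

# Moving from state 2 to state 3 (closer to the goal) earns a positive bonus.
bonus = shaped_reward(0.0, phi, 2, 3, gamma=1.0)
print(bonus)  # 1.0
```

In the paper's setting the potential would come from the TBT-repaired trajectories rather than a hand-written distance function.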
[725] Subspace Control: Turning Constrained Model Steering into Controllable Spectral Optimization
Yancheng Huang, Changsheng Wang, Chongyu Fan, Yicheng Lang, Bingqi Shang, Yang Zhang, Mingyi Hong, Qing Qu, Alvaro Velasquez, Sijia Liu
Main category: cs.LG
TL;DR: SIFT is a spectral interference-free training framework that mitigates conflicts between primary objectives and constraints during model adaptation by orthogonalizing merged subspaces and using selective intervention.
Details
Motivation: Foundation models need customization for practical constraints (safety, privacy, task-specific requirements), but current constrained optimization approaches suffer from interference between primary and constraint objectives during training.
Method: Analyzes spectral cross-task interference from model merging perspective, shows orthogonalization solution, connects to gradient orthogonalization in Muon optimizer, and introduces SIFT with localization scheme for selective intervention during optimization.
Result: SIFT achieves substantial and robust performance improvements across four applications: machine unlearning, safety alignment, text-to-speech adaptation, and hallucination mitigation, outperforming both control-based and control-free baselines.
Conclusion: SIFT provides an effective framework for constrained model training that mitigates objective-constraint conflicts through spectral interference-free optimization, enabling better customization of foundation models for practical constraints.
Abstract: Foundation models, such as large language models (LLMs), are powerful but often require customization before deployment to satisfy practical constraints such as safety, privacy, and task-specific requirements, leading to “constrained” optimization problems for model steering and adaptation. However, solving such problems remains largely underexplored and is particularly challenging due to interference between the primary objective and constraint objectives during optimization. In this paper, we propose a subspace control framework for constrained model training. Specifically, (i) we first analyze, from a model merging perspective, how spectral cross-task interference arises and show that it can be resolved via a one-shot solution that orthogonalizes the merged subspace; (ii) we establish a connection between this solution and gradient orthogonalization in the spectral optimizer Muon; and (iii) building on these insights, we introduce SIFT (spectral interference-free training), which leverages a localization scheme to selectively intervene during optimization, enabling controllable updates that mitigate objective-constraint conflicts. We evaluate SIFT across four representative applications: (a) machine unlearning, (b) safety alignment, (c) text-to-speech adaptation, and (d) hallucination mitigation. Compared to both control-based and control-free baselines, SIFT consistently achieves substantial and robust performance improvements across all tasks. Code is available at https://github.com/OPTML-Group/SIFT.
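The gradient orthogonalization that the paper connects to Muon can be sketched with an exact SVD-based polar factor (Muon itself uses a Newton-Schulz approximation; this is an illustrative sketch, not the paper's SIFT procedure):

```python
import numpy as np

# Spectral orthogonalization of an update matrix: replace G with U V^T from its
# SVD, discarding the singular values so every update direction gets equal weight.

def orthogonalize(G):
    """Return the nearest semi-orthogonal matrix to G (its polar factor)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
Q = orthogonalize(G)

# The result has orthonormal columns: Q^T Q = I.
print(np.allclose(Q.T @ Q, np.eye(3)))  # True
```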
[726] Good Rankings, Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models
Sajad Ghawami
Main category: cs.LG
TL;DR: Multimodal survival models using histopathology images and genomics show good discrimination but poor calibration, requiring calibration assessment beyond C-index for clinical use.
Details
Motivation: To systematically evaluate whether multimodal deep learning models for cancer survival prediction produce calibrated survival probabilities, as current models focus on discriminative performance (C-index) but calibration remains unexamined.
Method: Conducted fold-level 1-calibration audits of multimodal WSI-genomics survival architectures: Experiment A evaluated native discrete-time survival outputs (3 models on TCGA-BRCA), Experiment B evaluated Breslow-reconstructed survival curves from scalar risk scores (11 architectures across 5 TCGA cancer types). Used Benjamini-Hochberg correction for multiple testing.
Result: Most models failed calibration: 12 of 15 fold-level tests rejected in Experiment A; 166 of 290 fold-level tests rejected overall. MCAT achieved high C-index (0.817) but failed calibration on all folds. Gating-based fusion showed better calibration than bilinear/concatenation fusion. Platt scaling improved calibration without affecting discrimination.
Conclusion: Concordance index alone is insufficient for evaluating survival models for clinical use; calibration assessment is crucial. Gating fusion and post-hoc calibration methods like Platt scaling can improve calibration while maintaining discrimination.
Abstract: Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction). Across the full 290 fold-level tests, 166 reject the null of correct calibration at the median event time after Benjamini-Hochberg correction (FDR = 0.05). MCAT achieves C-index 0.817 on GBMLGG yet fails 1-calibration on all five folds. Gating-based fusion is associated with better calibration; bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (e.g., MCAT: from 5/5 failing folds to 2/5) without affecting discrimination. The concordance index alone is insufficient for evaluating survival models intended for clinical use.
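A minimal sketch of post-hoc Platt scaling of the kind evaluated above, fit by gradient descent on toy risk scores (the data, learning rate, and fitting procedure are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Platt scaling: fit a one-dimensional logistic map p = sigmoid(a * score + b)
# on held-out risk scores. Because the map is monotone in the score,
# discrimination (C-index / AUC ranking) is left unchanged.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit (a, b) by gradient descent on the logistic log-loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * scores + b)
        a -= lr * np.mean((p - labels) * scores)
        b -= lr * np.mean(p - labels)
    return a, b

# Toy miscalibrated scores: labels follow sigmoid(0.5 * score), so the raw
# scores are overconfident by a factor of two.
rng = np.random.default_rng(1)
scores = rng.standard_normal(500) * 3.0
labels = (sigmoid(0.5 * scores) > rng.random(500)).astype(float)
a, b = fit_platt(scores, labels)
print(a, b)  # a recovers a slope near the true 0.5
```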
[727] Peoples Water Data: Enabling Reliable Field Data Generation and Microbial Contamination Screening in Household Drinking Water
Suzan Kagan, Shira Spigelman, Sankar Sudhir, Thalappil Pradeep, Hadas Mamane
Main category: cs.LG
TL;DR: A machine learning framework for predicting E. coli contamination in drinking water using low-cost physicochemical and contextual indicators, developed for resource-constrained settings in Chennai, India.
Details
Motivation: Unsafe drinking water is a major global health concern, especially in low-resource regions where routine microbiological testing is limited. Current laboratory-based E. coli testing is often inaccessible at scale, creating a need for alternative, scalable monitoring solutions.
Method: Developed a two-stage machine learning framework using low-cost physicochemical and contextual indicators to predict E. coli presence. The study used 3,023 water samples from Chennai, India, with 2,207 samples retained after data cleaning. The framework was implemented within an AI-supported field system combining student guidance and real-time quality control.
Result: The framework provides a scalable decision-support tool for prioritizing microbiological testing in resource-constrained environments. The AI-supported field implementation improved protocol adherence, traceability, and data reliability in decentralized household water monitoring.
Conclusion: The machine learning approach addresses an important gap in point-of-use contamination risk assessment and offers a practical solution for water quality monitoring in settings where traditional laboratory testing is not feasible.
Abstract: Unsafe drinking water remains a major public health concern globally, particularly in low-resource regions where routine microbiological surveillance is limited. Although Escherichia coli is the internationally recognized indicator of fecal contamination, laboratory-based testing is often inaccessible at scale. In this study, we developed and evaluated a two-stage machine-learning framework for predicting E. coli presence in decentralized household point-of-use drinking water in Chennai, India using low-cost physicochemical and contextual indicators. The dataset comprised 3,023 samples collected under the Peoples Water Data initiative; after harmonization, technical cleaning, and outlier screening, 2,207 valid samples were retained. This framework provides a scalable decision-support tool for prioritizing microbiological testing in resource-constrained environments and addresses an important gap in point-of-use contamination risk assessment. Beyond predictive modeling, the present study was conducted within an AI-supported field implementation framework that combined student-facing guidance and real-time QC to improve protocol adherence, traceability, and data reliability in decentralized household water monitoring.
[728] Learning An Interpretable Risk Scoring System for Maximizing Decision Net Benefit
Wenhao Chi, Ş. İlker Birbil
Main category: cs.LG
TL;DR: A novel risk scoring system that directly optimizes net benefit over decision thresholds using sparse integer linear programming, creating interpretable integer coefficients while maintaining competitive discrimination and calibration.
Details
Motivation: Existing risk scoring systems focus on predictive accuracy or likelihood-based criteria, which may not align with the main goal of maximizing utility in high-stakes decision-making domains like healthcare.
Method: Formulates the risk scoring system as a sparse integer linear programming problem to directly optimize net benefit over a range of decision thresholds, creating transparent scoring systems with integer coefficients for better interpretation.
Result: The method effectively achieves high net benefit while maintaining competitive discrimination and calibration performance across multiple public datasets and a real-world clinical dataset.
Conclusion: Optimizing net benefit directly leads to practical, interpretable risk scoring systems that align with decision-making goals while guaranteeing conventional performance measures.
Abstract: Risk scoring systems are widely used in high-stakes domains to assist decision-making. However, existing approaches often focus on optimizing predictive accuracy or likelihood-based criteria, which may not align with the main goal of maximizing utility. In this paper, we propose a novel risk scoring system that directly optimizes net benefit over a range of decision thresholds. The model is formulated as a sparse integer linear programming problem which enables the construction of a transparent scoring system with integer coefficients, and hence, facilitates interpretation and practical application. We also establish fundamental relationships among net benefit, discrimination, and calibration. Our analysis proves that optimizing net benefit also guarantees conventional performance measures. We thoroughly evaluated our method on multiple public datasets as well as on a real-world clinical dataset. This computational study demonstrated that our interpretable method can effectively achieve high net benefit while maintaining competitive discrimination and calibration performance.
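The net benefit objective being optimized follows the standard decision-curve definition; a minimal sketch with toy labels (the example data are illustrative, not from the paper):

```python
# Net benefit at a decision threshold p_t (Vickers & Elkin's decision curve
# analysis):  NB = TP/n - FP/n * p_t / (1 - p_t),
# i.e. true positives credited in full, false positives penalized by the odds
# implied by the threshold at which treatment would be accepted.

def net_benefit(y_true, y_pred, p_t):
    n = len(y_true)
    tp = sum(1 for y, yhat in zip(y_true, y_pred) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, y_pred) if y == 0 and yhat == 1)
    return tp / n - fp / n * p_t / (1 - p_t)

# 10 patients, 3 true cases, a treat-all policy, threshold 0.2:
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
treat_all = [1] * 10
print(net_benefit(y, treat_all, 0.2))  # 0.3 - 0.7 * 0.25 = 0.125
```

The paper's contribution is to maximize this quantity over a threshold range directly, via sparse integer programming, rather than to compute it post hoc as here.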
[729] Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning
Aobo Chen, Chenxu Zhao, Chenglin Miao, Mengdi Huai
Main category: cs.LG
TL;DR: LRM unlearning attack that forces incorrect final answers while generating convincing but misleading reasoning traces, addressing challenges in non-differentiable constraints and long rationales.
Details
Motivation: While machine unlearning techniques address the right to be forgotten, they introduce security vulnerabilities. Existing research lacks investigation of unlearning attacks on large reasoning models (LRMs) that provide explicit multi-step reasoning traces.
Method: Proposes a bi-level exact unlearning attack with differentiable objective function, influential token alignment, and relaxed indicator strategy to overcome challenges of non-differentiable logical constraints and weak optimization over long rationales.
Result: Demonstrates effectiveness through comprehensive experiments in both white-box and black-box settings, showing the attack can force incorrect answers while maintaining convincing reasoning traces.
Conclusion: The work identifies and addresses emerging security threats in LRM unlearning pipelines, raising awareness about vulnerabilities introduced by unlearning techniques in reasoning models.
Abstract: Large language models (LLMs) possess strong semantic understanding, driving significant progress in data mining applications. This is further enhanced by large reasoning models (LRMs), which provide explicit multi-step reasoning traces. On the other hand, the growing need for the right to be forgotten has driven the development of machine unlearning techniques, which aim to eliminate the influence of specific data from trained models without full retraining. However, unlearning may also introduce new security vulnerabilities by exposing additional interaction surfaces. Although many studies have investigated unlearning attacks, there is no prior work on LRMs. To bridge this gap, we propose in this paper the first LRM unlearning attack, which forces incorrect final answers while generating convincing but misleading reasoning traces. This objective is challenging due to non-differentiable logical constraints, weak optimization effect over long rationales, and discrete forget set selection. To overcome these challenges, we introduce a bi-level exact unlearning attack that incorporates a differentiable objective function, influential token alignment, and a relaxed indicator strategy. To demonstrate the effectiveness and generalizability of our attack, we also design novel optimization frameworks and conduct comprehensive experiments in both white-box and black-box settings, aiming to raise awareness of the emerging threats to LRM unlearning pipelines.
[730] APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs
Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki
Main category: cs.LG
TL;DR: APPA is an adaptive framework for federated RLHF that dynamically reweights group-level rewards to improve worst-group alignment without sacrificing overall performance.
Details
Motivation: Aligning LLMs with diverse human preferences requires pluralistic alignment where a single model must respect multiple groups' values. In federated RLHF, groups align a shared policy without centralizing data, making fair reward aggregation essential. Existing methods have trade-offs: average aggregation under-aligns worst groups, while min aggregation prioritizes worst groups at the cost of overall alignment.
Method: APPA (Adaptive Preference Pluralistic Alignment) dynamically reweights group-level rewards based on historical alignment rewards. It prioritizes under-aligned groups without degrading well-aligned ones, requiring no access to raw preference data. Integrated into a PPO-based FedRLHF pipeline and evaluated on GLOBALQA and OQA datasets across three model families (Gemma 2 2B, Llama 3.2 3B, Qwen3 0.6B).
Result: APPA achieves strong fairness-alignment trade-offs, improving worst-group alignment by up to 28% over average aggregation while maintaining higher overall alignment than min aggregation across most configurations.
Conclusion: APPA provides an effective solution for pluralistic alignment in federated RLHF settings, balancing fairness and overall performance better than existing aggregation methods.
Abstract: Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human feedback (FedRLHF), these groups align a shared policy without centralizing preference data, which makes fair reward aggregation essential. Existing aggregation methods exhibit clear trade-offs: average-based aggregation systematically under-aligns worst-performing groups, while min aggregation prioritizes worst-group performance at the cost of overall alignment. We propose APPA, an Adaptive Preference Pluralistic Alignment framework that dynamically reweights group-level rewards based on historical alignment rewards. Our approach prioritizes under-aligned groups without degrading well-aligned ones, while requiring no access to raw preference data. Integrated into a proximal policy optimization (PPO)-based FedRLHF pipeline and evaluated on GLOBALQA and OQA across three model families (Gemma 2 2B, Llama 3.2 3B, Qwen3 0.6B), APPA achieves strong fairness-alignment trade-offs, improving worst-group alignment by up to 28% over average aggregation while maintaining higher overall alignment than min aggregation across most configurations.
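One way to realize adaptive reweighting of this kind is a softmax over each group's alignment deficit; a hedged sketch, not APPA's actual update rule (the temperature and the deficit definition are assumptions):

```python
import numpy as np

# Adaptive group reweighting in the spirit of APPA: weight each group's reward
# by a softmax over its alignment *deficit*, so under-aligned groups receive
# more weight than a plain average gives them, without collapsing onto the
# single worst group as hard min aggregation does.

def adaptive_weights(historical_rewards, tau=0.5):
    r = np.asarray(historical_rewards, dtype=float)
    deficit = r.max() - r                  # how far each group lags the best
    w = np.exp(deficit / tau)              # temperature tau is an assumption
    return w / w.sum()

def aggregate(group_rewards, weights):
    return float(np.dot(weights, group_rewards))

hist = [0.9, 0.7, 0.4]                     # group 2 is under-aligned
w = adaptive_weights(hist)
print(w)                                   # largest weight on the worst group
```

Lowering `tau` pushes the scheme toward min aggregation; raising it recovers the plain average, which is the trade-off axis the abstract describes.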
[731] Entropy, Disagreement, and the Limits of Foundation Models in Genomics
Maxime Rochkoulets, Lovro Vrček, Mile Šikić
Main category: cs.LG
TL;DR: High entropy in genomic sequences limits foundation models’ effectiveness compared to NLP, causing uniform predictions, model disagreement, unstable embeddings, and poor information flow despite matched training conditions.
Details
Motivation: To understand why genomic foundation models show limited success compared to NLP models, investigating entropy as a fundamental limiting factor in learning from genomic training data.
Method: Train ensembles of models on text and DNA sequences, analyze predictions, static embeddings, and empirical Fisher information flow across matched architectures and training conditions.
Result: High entropy of genomic sequences leads to near-uniform output distributions, disagreement across models, unstable static embeddings, and concentration of Fisher information in embedding layers without exploiting inter-token relationships.
Conclusion: Self-supervised training from sequences alone may not be applicable to genomic data, questioning current assumptions underlying genomic foundation model methodologies.
Abstract: Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences – from the point of view of unseen token prediction – leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
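The entropy gap the paper studies can be illustrated by estimating next-token entropy from k-mer context counts; both sequences below are synthetic toys, not the paper's data or method:

```python
import math
import random
from collections import Counter

# Average Shannon entropy (bits) of the token following each k-length context.
# Random DNA over a 4-letter alphabet sits near the 2-bit maximum, while
# natural-language text is far below its alphabet's maximum: the high-entropy
# regime the paper links to near-uniform predictions.

def next_token_entropy(seq, k=2):
    ctx = {}
    for i in range(len(seq) - k):
        ctx.setdefault(seq[i:i + k], Counter())[seq[i + k]] += 1
    ents = []
    for counts in ctx.values():
        total = sum(counts.values())
        ents.append(-sum(c / total * math.log2(c / total)
                         for c in counts.values()))
    return sum(ents) / len(ents)

random.seed(0)
dna = "".join(random.choice("ACGT") for _ in range(20000))
text = "the cat sat on the mat " * 500
print(next_token_entropy(dna), next_token_entropy(text))
```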
[732] DAGAF: A directed acyclic generative adversarial framework for joint structure learning and tabular data synthesis
Hristo Petkov, Calum MacLellan, Feng Dong
Main category: cs.LG
TL;DR: DAGAF: A dual-step framework for causal structure learning and tabular data synthesis using multiple functional causal models (ANM, LiNGAM, PNL) to learn DAGs and replicate real data distributions.
Details
Motivation: Existing causality learning methods typically focus on single identifiable causal models, limiting their applicability. The authors aim to improve on this by developing a framework that can handle multiple causal model assumptions for both structure learning and data synthesis.
Method: Proposes a dual-step framework using Directed Acyclic Graphs (DAGs) to represent causal relationships. Applies various functional causal models (ANM, LiNGAM, PNL) to implicitly learn DAG contents and simulate observational data generation. Includes theoretical analysis of multiple loss terms in the objective function.
Result: DAGAF outperforms existing methods in structure learning, achieving significantly lower Structural Hamming Distance scores across real-world and benchmark datasets (47% improvement on Sachs, 11% on Child, 5% on Hailfinder, 7% on Pathfinder compared to state-of-the-art). Can produce diverse, high-quality synthetic samples.
Conclusion: The proposed framework successfully integrates multiple causal models for both causal structure learning and tabular data synthesis, demonstrating superior performance in structure discovery while maintaining data generation capabilities.
Abstract: Understanding the causal relationships between data variables can provide crucial insights into the construction of tabular datasets. Most existing causality learning methods typically focus on applying a single identifiable causal model, such as the Additive Noise Model (ANM) or the Linear non-Gaussian Acyclic Model (LiNGAM), to discover the dependencies exhibited in observational data. We improve on this approach by introducing a novel dual-step framework capable of performing both causal structure learning and tabular data synthesis under multiple causal model assumptions. Our approach uses Directed Acyclic Graphs (DAG) to represent causal relationships among data variables. By applying various functional causal models including ANM, LiNGAM and the Post-Nonlinear model (PNL), we implicitly learn the contents of DAG to simulate the generative process of observational data, effectively replicating the real data distribution. This is supported by a theoretical analysis to explain the multiple loss terms comprising the objective function of the framework. Experimental results demonstrate that DAGAF outperforms many existing methods in structure learning, achieving significantly lower Structural Hamming Distance (SHD) scores across both real-world and benchmark datasets (Sachs: 47%, Child: 11%, Hailfinder: 5%, Pathfinder: 7% improvement compared to state-of-the-art), while being able to produce diverse, high-quality samples.
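The Structural Hamming Distance reported above counts the edge insertions, deletions, and reversals separating two DAGs; a minimal sketch using one common convention (a reversal counted once):

```python
import numpy as np

# SHD between two binary adjacency matrices. A reversed edge shows up as two
# mismatched entries in |A - B|, so we subtract one per reversal to count it
# as a single edit.

def shd(A, B):
    diff = np.abs(A - B)
    reversals = np.sum((A == 1) & (A.T == 0) & (B == 0) & (B.T == 1))
    return int(diff.sum() - reversals)

true_dag = np.array([[0, 1, 1],
                     [0, 0, 1],
                     [0, 0, 0]])
est_dag = np.array([[0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]])   # edge 0 -> 2 recovered backwards as 2 -> 0
print(shd(true_dag, est_dag))  # 1
```

Note that conventions differ: some benchmarks count a reversal as two edits, so reported SHD scores are only comparable under a stated convention.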
[733] Correcting Source Mismatch in Flow Matching with Radial-Angular Transport
Fouad Oubari, Mathilde Mougeot
Main category: cs.LG
TL;DR: RAFM introduces radial-angular flow matching to correct Gaussian source mismatch for heavy-tailed data by using data-matched radial distributions and uniform angular distributions, reducing transport to angular alignment on scaled spheres.
Details
Motivation: Standard Flow Matching uses Gaussian sources that create structural mismatch for heavy-tailed or anisotropic data at the radial distribution level, limiting performance on such data distributions.
Method: RAFM uses a source whose radial law matches the data’s radial distribution and whose angular distribution is uniform on the sphere, reducing transport to angular alignment via spherical geodesic interpolation on scaled spheres.
Result: RAFM substantially improves over standard Gaussian Flow Matching and remains competitive with recent non-Gaussian alternatives while preserving lightweight deterministic training.
Conclusion: RAFM provides a principled source-and-path design for Flow Matching on heavy-tailed and extreme-event data by explicitly correcting radial source mismatch within the standard Flow Matching framework.
Abstract: Flow Matching is typically built from Gaussian sources and Euclidean probability paths. For heavy-tailed or anisotropic data, however, a Gaussian source induces a structural mismatch already at the level of the radial distribution. We introduce Radial–Angular Flow Matching (RAFM), a framework that explicitly corrects this source mismatch within the standard simulation-free Flow Matching template. RAFM uses a source whose radial law matches that of the data and whose conditional angular distribution is uniform on the sphere, thereby removing the Gaussian radial mismatch by construction. This reduces the remaining transport problem to angular alignment, which leads naturally to conditional paths on scaled spheres defined by spherical geodesic interpolation. The resulting framework yields explicit Flow Matching targets tailored to radial–angular transport without modifying the underlying deterministic training pipeline. We establish the exact density of the matched-radial source, prove a radial–angular KL decomposition that isolates the Gaussian radial penalty, characterize the induced target vector field, and derive a stability result linking Flow Matching error to generation error. We further analyze empirical estimation of the radial law, for which Wasserstein and CDF metrics provide natural guarantees. Empirically, RAFM substantially improves over standard Gaussian Flow Matching and remains competitive with recent non-Gaussian alternatives while preserving a lightweight deterministic training procedure. Overall, RAFM provides a principled source-and-path design for Flow Matching on heavy-tailed and extreme-event data.
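The spherical geodesic interpolation underlying the conditional paths can be sketched as a slerp on directions with the radius interpolated separately; a toy illustration that does not reproduce RAFM's matched radial law or its training targets:

```python
import numpy as np

# A radial-angular conditional path: interpolate the direction along the great
# circle (slerp) and the radius linearly, so the point moves on scaled spheres
# rather than along a straight Euclidean chord.

def radial_angular_path(x0, x1, t):
    r0, r1 = np.linalg.norm(x0), np.linalg.norm(x1)
    u0, u1 = x0 / r0, x1 / r1
    theta = np.arccos(np.clip(u0 @ u1, -1.0, 1.0))
    if theta < 1e-8:                       # directions already aligned
        u_t = u0
    else:                                  # spherical geodesic interpolation
        u_t = (np.sin((1 - t) * theta) * u0
               + np.sin(t * theta) * u1) / np.sin(theta)
    r_t = (1 - t) * r0 + t * r1            # radial interpolation (assumed linear)
    return r_t * u_t

x0 = np.array([2.0, 0.0])
x1 = np.array([0.0, 4.0])
mid = radial_angular_path(x0, x1, 0.5)
print(np.linalg.norm(mid))  # 3.0: the radius interpolates independently
```

A Euclidean chord between these endpoints would pass through radius sqrt(5) at the midpoint; separating radius from angle is what keeps heavy tails from being squashed through the Gaussian bulk.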
[734] Convolutional Neural Network and Adversarial Autoencoder in EEG images classification
Albert Nasybullin, Semen Kurkin
Main category: cs.LG
TL;DR: Applying computer vision and neural networks to classify EEG brain activity during hand movements by converting EEG signals to 2D topograms
Details
Motivation: To address classification challenges in neuroscience EEG data analysis by leveraging computer vision techniques for brain activity classification during hand movements.
Method: Pre-process raw EEG signals, generate 2D EEG topograms, then develop supervised and semi-supervised neural networks for motor cortex activity classification
Result: Developed neural network models capable of classifying different motor cortex activities from EEG data using computer vision approaches
Conclusion: Computer vision algorithms combined with neural networks can effectively solve EEG classification problems in neuroscience for motor activity analysis
Abstract: In this paper, we consider applying computer vision algorithms for the classification problem one faces in neuroscience during EEG data analysis. Our approach is to apply a combination of computer vision and neural network methods to solve human brain activity classification problems during hand movement. We pre-processed raw EEG signals and generated 2D EEG topograms. Later, we developed supervised and semi-supervised neural networks to classify different motor cortex activities.
[735] How Long short-term memory artificial neural network, synthetic data, and fine-tuning improve the classification of raw EEG data
Albert Nasybullin, Vladimir Maksimenko, Semen Kurkin
Main category: cs.LG
TL;DR: A machine learning pipeline combining synthetic data generation, LSTM networks, and fine-tuning for EEG classification of implicit visual stimuli like ambiguous Necker cubes.
Details
Motivation: To improve classification of EEG data for experiments with implicit visual stimuli, particularly ambiguous figures like the Necker cube, where traditional methods may struggle due to data limitations and complexity.
Method: Proposes a pipeline with three key components: 1) synthetic data generation to augment limited EEG data, 2) LSTM neural networks to capture temporal patterns in EEG signals, and 3) fine-tuning techniques to adapt models to specific classification tasks.
Result: The developed approach increased the quality of the classification model for raw EEG data, demonstrating improved performance over baseline methods.
Conclusion: The combination of synthetic data generation, LSTM networks, and fine-tuning provides an effective pipeline for EEG classification tasks involving implicit visual stimuli with ambiguity.
Abstract: In this paper, we discuss a Machine Learning pipeline for the classification of EEG data. We propose a combination of synthetic data generation, long short-term memory artificial neural network (LSTM), and fine-tuning to solve classification problems for experiments with implicit visual stimuli, such as the Necker cube with different levels of ambiguity. The developed approach increased the quality of the classification model of raw EEG data.
[736] Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications
Zequn Chen, Wesley J. Marrero
Main category: cs.LG
TL;DR: Boosted Distributional Reinforcement Learning (BDRL) algorithm for healthcare decision-making that optimizes agent-specific outcome distributions while enforcing comparability among similar agents, with application to hypertension management.
Details
Motivation: Standard expectation-based reinforcement learning is insufficient for consistent decisions in highly uncertain situations with multiple heterogeneous groups, particularly in healthcare where physicians must manage multiple patients with uncertain disease progression and heterogeneous treatment responses.
Method: Proposes BDRL algorithm that optimizes agent-specific outcome distributions while enforcing comparability among similar agents, with a post-update projection step formulated as a constrained convex optimization problem to align individual outcomes with high-performing references within specified tolerance.
Result: Applied to hypertension management in US adult population, BDRL modifies treatment plans for median and vulnerable patients by mimicking high-performing references in each risk group, improving number and consistency of quality-adjusted life years compared to RL baselines.
Conclusion: BDRL provides a framework for distributional reinforcement learning that addresses fairness and consistency concerns in heterogeneous agent settings, particularly valuable for healthcare applications where equitable outcomes are crucial.
Abstract: Researchers and practitioners are increasingly considering reinforcement learning to optimize decisions in complex domains like robotics and healthcare. To date, these efforts have largely utilized expectation-based learning. However, relying on expectation-focused objectives may be insufficient for making consistent decisions in highly uncertain situations involving multiple heterogeneous groups. While distributional reinforcement learning algorithms have been introduced to model the full distributions of outcomes, they can yield large discrepancies in realized benefits among comparable agents. This challenge is particularly acute in healthcare settings, where physicians (controllers) must manage multiple patients (subordinate agents) with uncertain disease progression and heterogeneous treatment responses. We propose a Boosted Distributional Reinforcement Learning (BDRL) algorithm that optimizes agent-specific outcome distributions while enforcing comparability among similar agents and analyze its convergence. To further stabilize learning, we incorporate a post-update projection step formulated as a constrained convex optimization problem, which efficiently aligns individual outcomes with a high-performing reference within a specified tolerance. We apply our algorithm to manage hypertension in a large subset of the US adult population by categorizing individuals into cardiovascular disease risk groups. Our approach modifies treatment plans for median and vulnerable patients by mimicking the behavior of high-performing references in each risk group. Furthermore, we find that BDRL improves the number and consistency of quality-adjusted life years compared with reinforcement learning baselines.
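The post-update projection is a constrained convex program; in the simplest case of an infinity-norm tolerance band around the reference, it reduces to elementwise clipping. A minimal sketch (the box constraint is an illustrative assumption, not necessarily the paper's exact formulation):

```python
import numpy as np

def project_to_reference(z, z_ref, tol):
    """Euclidean projection of an agent's outcome quantiles z onto the box
    {z : |z_i - z_ref_i| <= tol}, which is just elementwise clipping."""
    return np.clip(z, z_ref - tol, z_ref + tol)

z_agent = np.array([0.1, 0.5, 2.0])     # this agent's learned return quantiles
z_ref   = np.array([0.4, 0.6, 1.0])     # high-performing reference in its group
z_proj  = project_to_reference(z_agent, z_ref, tol=0.3)
```

Under more general convex constraints the projection would be solved numerically, but the intent is the same: keep individual outcomes within a tolerance of the reference.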
[737] Generative models for decision-making under distributional shift
Xiuyuan Cheng, Yunqin Zhu, Yao Xie
Main category: cs.LG
TL;DR: Generative models (flow- and score-based methods) as mathematical tools for constructing decision-relevant distributions in operations research, focusing on distribution transformation rather than sample synthesis.
Details
Motivation: Many decision problems use nominal distributions from historical data, but deployment distributions may shift or be context-dependent. Need tools to construct decision-relevant distributions for robust decision-making under distributional shift.
Method: Unified framework using pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Generative models represent distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics.
Result: Generative models can learn nominal uncertainty, construct stressed/least-favorable distributions for robustness, and produce conditional/posterior distributions under side information. Theoretical guarantees include forward-reverse convergence, minimax analysis, and error-transfer bounds.
Conclusion: Generative models provide principled mathematical tools for scenario generation, robust decision-making, and uncertainty quantification under distributional shift, with applications in operations research.
Abstract: Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decision-relevant distributions. From an operations research perspective, their primary value lies not in unconstrained sample synthesis but in representing and transforming distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics. We present a unified framework based on pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Within this framework, generative models can be used to learn nominal uncertainty, construct stressed or least-favorable distributions for robustness, and produce conditional or posterior distributions under side information and partial observation. We also highlight representative theoretical guarantees, including forward-reverse convergence for iterative flow models, first-order minimax analysis in transport-map space, and error-transfer bounds for posterior sampling with generative priors. The tutorial provides a principled introduction to using generative models for scenario generation, robust decision-making, uncertainty quantification, and related problems under distributional shift.
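The pushforward view can be made concrete: samples from a nominal distribution are passed through a transport map T to represent a shifted or stressed distribution. A toy sketch with an affine map (the map and its parameters are illustrative, not from the tutorial):

```python
import numpy as np

def pushforward(samples, T):
    """If x ~ p, then T(x) ~ T#p (the pushforward of p under T)."""
    return T(samples)

rng = np.random.default_rng(0)
x = rng.standard_normal((10000, 2))                 # nominal N(0, I) scenarios
# Affine transport map: shift the mean and inflate the scale, a crude
# stand-in for a learned 'stressed' deployment distribution.
T = lambda s: 1.5 * s + np.array([2.0, -1.0])
y = pushforward(x, T)                               # ~ N([2, -1], 2.25 I)
```

Flow- and score-based models replace the hand-written affine map with learned velocity or score fields, but the pushforward mechanics are the same.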
[738] Deep Kuratowski Embedding Neural Networks for Wasserstein Metric Learning
Andrew Qing He
Main category: cs.LG
TL;DR: Neural architectures (DeepKENN and ODE-KENN) learn to approximate Wasserstein-2 distances from data, with ODE-KENN using Neural ODEs achieving better performance than baselines.
Details
Motivation: Pairwise Wasserstein distance computation is computationally expensive and a bottleneck in data analysis pipelines. The paper aims to create fast neural surrogates to approximate these distances efficiently.
Method: Two neural architectures: 1) DeepKENN aggregates distances across CNN feature maps with learnable weights, 2) ODE-KENN replaces discrete layers with Neural ODEs to embed inputs into infinite-dimensional Banach space C¹([0,1], ℝᵈ) with implicit regularization via trajectory smoothness.
Result: On MNIST with precomputed W₂ distances, ODE-KENN achieves 28% lower test MSE than single-layer baseline and 18% lower than DeepKENN under matched parameter counts, with smaller generalization gap.
Conclusion: ODE-KENN provides an effective neural surrogate for Wasserstein distance computation that can replace expensive exact computation in downstream applications.
Abstract: Computing pairwise Wasserstein distances is a fundamental bottleneck in data analysis pipelines. Motivated by the classical Kuratowski embedding theorem, we propose two neural architectures for learning to approximate the Wasserstein-2 distance ($W_2$) from data. The first, DeepKENN, aggregates distances across all intermediate feature maps of a CNN using learnable positive weights. The second, ODE-KENN, replaces the discrete layer stack with a Neural ODE, embedding each input into the infinite-dimensional Banach space $C^1([0,1], \mathbb{R}^d)$ and providing implicit regularization via trajectory smoothness. Experiments on MNIST with exact precomputed $W_2$ distances show that ODE-KENN achieves a 28% lower test MSE than the single-layer baseline and 18% lower than DeepKENN under matched parameter counts, while exhibiting a smaller generalization gap. The resulting fast surrogate can replace the expensive $W_2$ oracle in downstream pairwise distance computations.
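DeepKENN's aggregation of per-feature-map distances with learnable positive weights can be sketched as follows; the exponential parameterization of the weights and the random "feature maps" are assumptions for illustration:

```python
import numpy as np

def kenn_distance(feats_x, feats_y, log_w):
    """Aggregate per-layer feature distances with positive weights
    w_l = exp(log_w_l), a simplified DeepKENN-style surrogate distance."""
    w = np.exp(log_w)                                # enforces positivity
    per_layer = [np.linalg.norm(fx - fy) for fx, fy in zip(feats_x, feats_y)]
    return float(w @ np.array(per_layer))

rng = np.random.default_rng(0)
fx = [rng.standard_normal((4, 8)) for _ in range(3)]   # 3 intermediate maps of x
fy = [rng.standard_normal((4, 8)) for _ in range(3)]   # same for y
log_w = np.zeros(3)                                    # learnable in practice
d_xy = kenn_distance(fx, fy, log_w)
```

Training would regress d_xy against precomputed W₂ targets; positivity of the weights keeps the surrogate a valid (pseudo)metric combination.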
[739] Context is All You Need
Jean Erik Delanois, Shruti Joshi, Ryan Golden, Teresa Nick, Maxim Bazhenov
Main category: cs.LG
TL;DR: CONTXT is a lightweight method for domain generalization and test-time adaptation that uses simple additive/multiplicative feature transforms to modulate neural representations, working across discriminative and generative models with minimal overhead.
Details
Motivation: Real-world neural networks face domain shift where test data differs from training distributions. Existing DG and TTA methods are complex, resource-intensive, and hard to scale. Need simple, efficient adaptation methods that work across model types.
Method: CONTXT uses contextual augmentation through simple additive and multiplicative feature transforms to modulate internal neural representations. It’s lightweight, easy to integrate, and works during test-time adaptation without retraining.
Result: Consistent gains across discriminative tasks (ANN/CNN classification) and generative models (LLMs). Minimal overhead enables robust performance under domain shift without added complexity.
Conclusion: CONTXT provides a compact, efficient way to steer information flow and neural processing for domain adaptation, offering a simple alternative to complex existing methods.
Abstract: Artificial Neural Networks (ANNs) are increasingly deployed across diverse real-world settings, where they must operate under data distributions that differ from those seen during training. This challenge is central to Domain Generalization (DG), which trains models to generalize to unseen domains without target data, and Test-Time Adaptation (TTA), which improves robustness by adapting to unlabeled test data at deployment. Existing approaches to address these challenges are often complex, resource-intensive, and difficult to scale. We introduce CONTXT (Contextual augmentatiOn for Neural feaTure X Transforms), a simple and intuitive method for contextual adaptation. CONTXT modulates internal representations using simple additive and multiplicative feature transforms. Within a TTA setting, it yields consistent gains across discriminative tasks (e.g., ANN/CNN classification) and generative models (e.g., LLMs). The method is lightweight, easy to integrate, and incurs minimal overhead, enabling robust performance under domain shift without added complexity. More broadly, CONTXT provides a compact way to steer information flow and neural processing without retraining.
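The core transform is simple enough to state directly: internal features are modulated multiplicatively and additively. A minimal sketch (shapes and parameter names are assumptions; the paper's exact placement of the transform is not specified here):

```python
import numpy as np

def contxt_transform(h, gamma, beta):
    """Modulate hidden features in the spirit of CONTXT: h' = gamma * h + beta.
    gamma and beta are small per-feature adaptation parameters."""
    return gamma * h + beta

h = np.ones((2, 4))                        # a batch of hidden features
gamma = np.full(4, 2.0)                    # per-feature multiplicative scale
beta = np.full(4, -1.0)                    # per-feature additive shift
h_adapted = contxt_transform(h, gamma, beta)
```

Because only gamma and beta are adapted at test time, the overhead is a handful of parameters per layer rather than a retrained network.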
[740] CPT: Controllable and Editable Design Variations with Language Models
Karthik Suresh, Amine Ben Khalifa, Li Zhang, Wei-ting Hsu, Fangzheng Wu, Vinay More, Asim Kadav
Main category: cs.LG
TL;DR: A system using Creative Pre-trained Transformer (CPT) trained on Creative Markup Language (CML) to generate editable design variations with visual style attributes, producing fully editable design documents rather than pixel-only images.
Details
Motivation: Manual design creation is time-consuming and limits scalability and personalization in creative workflows; there's a need for systems that can generate diverse, high-quality designs that remain editable for iteration.
Method: Developed Creative Markup Language (CML) as a compact representation capturing canvas structure, layout, and element details. Fine-tuned decoder-only Creative Pre-trained Transformer (CPT) on professional design templates to predict visual style attributes like color schemes and font choices.
Result: The system generates semantically structured and stylistically coherent design outputs with internal consistency across elements. It successfully produces contextual color and font variations for existing templates and shows promise in layout adjustments while maintaining design principles.
Conclusion: The CPT-based system enables scalable generation of editable design variations, bridging the gap between generative models and practical design workflows by producing fully editable documents rather than static images.
Abstract: Designing visually diverse and high-quality designs remains a manual, time-consuming process, limiting scalability and personalization in creative workflows. We present a system for generating editable design variations using a decoder-only language model, the Creative Pre-trained Transformer (CPT), trained to predict visual style attributes in design templates. At the core of our approach is a new representation called Creative Markup Language (CML), a compact, machine-learning-friendly format that captures canvas-level structure, page layout, and element-level details (text, images, and vector graphics), including both content and style. We fine-tune CPT on a large corpus of design templates authored by professional designers, enabling it to learn meaningful, context-aware predictions for attributes such as color schemes and font choices. The model produces semantically structured and stylistically coherent outputs, preserving internal consistency across elements. Unlike generative image models, our system yields fully editable design documents rather than pixel-only images, allowing users to iterate and personalize within a design editor. In experiments, our approach generates contextual color and font variations for existing templates and shows promise in adjusting layouts while maintaining design principles.
[741] Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games
Narim Jeong, Donghwan Lee
Main category: cs.LG
TL;DR: Control-theoretic analysis of Stackelberg Q-value iteration convergence in two-player general-sum Markov games with finite-time error bounds
Details
Motivation: Multi-agent reinforcement learning in general-sum Markov games is challenging, and existing theoretical results for single-agent RL don't extend well to multi-agent settings. There's a need for convergence guarantees for Q-value iteration in Stackelberg interactions.
Method: Introduces a relaxed policy condition for Stackelberg setting, models learning dynamics as a switching system, and constructs upper/lower comparison systems to analyze convergence.
Result: Establishes finite-time error bounds for Q-functions and characterizes convergence properties, providing first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.
Conclusion: The paper offers a novel control-theoretic perspective on Stackelberg learning with theoretical convergence guarantees for multi-agent reinforcement learning in competitive settings.
Abstract: Reinforcement learning has been successful both empirically and theoretically in single-agent settings, but extending these results to multi-agent reinforcement learning in general-sum Markov games remains challenging. This paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective. We introduce a relaxed policy condition tailored to the Stackelberg setting and model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite-time error bounds for the Q-functions and characterize their convergence properties. Our results provide a novel control-theoretic perspective on Stackelberg learning. Moreover, to the best of the authors’ knowledge, this paper offers the first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.
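The Stackelberg greedy step at a single state can be sketched for the deterministic case: the follower best-responds to each leader action, and the leader optimizes against that response. The deterministic tie-breaking and single-state setting are simplifying assumptions; the paper analyzes the full iteration as a switching system:

```python
import numpy as np

def stackelberg_value(Q1, Q2):
    """One-state Stackelberg greedy step. Rows index the leader's action,
    columns the follower's. The follower best-responds via Q2; the leader
    then picks the action maximizing its own Q1 under that response."""
    follower_br = Q2.argmax(axis=1)                  # best response per leader action
    leader_payoffs = Q1[np.arange(Q1.shape[0]), follower_br]
    a1 = int(leader_payoffs.argmax())
    return a1, int(follower_br[a1]), float(leader_payoffs[a1])

Q1 = np.array([[3.0, 0.0], [2.0, 1.0]])   # leader's Q-values (illustrative)
Q2 = np.array([[1.0, 2.0], [4.0, 0.0]])   # follower's Q-values (illustrative)
a1, a2, v1 = stackelberg_value(Q1, Q2)
```

Note the leader forgoes its largest entry (3.0) because the follower's best response to that action steers the outcome elsewhere, which is exactly the commitment structure the Q-value iteration must respect.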
[742] Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda
Main category: cs.LG
TL;DR: A novel alignment method for language models that achieves both stability and statistical consistency by using relative density ratios instead of direct density ratios.
Details
Motivation: Existing alignment methods assume specific human preference models (like Bradley-Terry) which may not capture true human preferences, leading to lack of statistical consistency. Direct density ratio optimization (DDRO) achieves consistency but suffers from instability due to unbounded density ratios that often diverge.
Method: Proposes using relative density ratio between preferred data distribution and a mixture of preferred and non-preferred distributions. This ratio is bounded above and doesn’t diverge, ensuring training stability while maintaining statistical consistency with tighter convergence guarantees than DDRO.
Result: The method shows effectiveness in experiments with Qwen 2.5 and Llama 3 models, demonstrating stable training and improved alignment performance compared to existing approaches.
Conclusion: The proposed relative density ratio approach provides a stable and statistically consistent alternative to existing alignment methods, addressing the instability issues of DDRO while maintaining theoretical guarantees.
Abstract: Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
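The boundedness claim is easy to verify numerically: the relative ratio r_alpha = p / (alpha*p + (1-alpha)*q) never exceeds 1/alpha, even where q vanishes and the plain ratio p/q blows up. A sketch with illustrative densities (alpha and the values are assumptions):

```python
import numpy as np

def relative_density_ratio(p, q, alpha):
    """r_alpha(x) = p(x) / (alpha * p(x) + (1 - alpha) * q(x)).
    Bounded above by 1/alpha, unlike the plain ratio p/q."""
    return p / (alpha * p + (1.0 - alpha) * q)

p = np.array([0.5, 0.3, 0.2, 1e-9])     # preferred-data density values
q = np.array([1e-12, 0.3, 0.5, 0.9])    # q nearly vanishes at the first point
alpha = 0.3
plain_ratio = p / q                      # explodes (~5e11) at the first point
rel_ratio = relative_density_ratio(p, q, alpha)
```

This bound is what keeps the training objective stable where DDRO's unbounded ratio diverges.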
[743] Is Prompt Selection Necessary for Task-Free Online Continual Learning?
Seoyoung Park, Haemin Lee, Hankook Lee
Main category: cs.LG
TL;DR: SinglePrompt: A simple task-free online continual learning method that uses a single prompt per self-attention block with classifier optimization, eliminating complex prompt selection strategies.
Details
Motivation: Existing prompt selection strategies in task-free online continual learning often fail to select appropriate prompts, leading to suboptimal results despite additional training. The authors aim to simplify the approach by eliminating prompt selection entirely.
Method: 1) Inject a single prompt into each self-attention block, 2) Use cosine similarity-based logit design to reduce forgetting in classifier weights, 3) Mask logits for unexposed classes in the current minibatch.
Result: Achieves state-of-the-art performance across various online continual learning benchmarks with a simple task-free design.
Conclusion: A simple single-prompt approach without complex selection strategies can outperform existing methods in task-free online continual learning scenarios.
Abstract: Task-free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real-world environments, where data arrive in a non-stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self-attention block, (ii) employ a cosine similarity-based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task-free design, our framework achieves state-of-the-art performance across various online continual learning benchmarks. Source code is available at https://github.com/efficient-learning-lab/SinglePrompt.
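Steps (ii) and (iii), cosine-similarity logits with masking of classes not present in the current minibatch, can be sketched as follows (the temperature tau and all shapes are illustrative assumptions):

```python
import numpy as np

def cosine_logits(features, class_weights, seen_classes, tau=0.1):
    """Cosine-similarity logits with unexposed classes masked to -inf,
    following the classifier design sketched in the abstract."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    logits = (f @ w.T) / tau                 # cosine similarity, temperature-scaled
    mask = np.full(logits.shape[1], -np.inf)
    mask[seen_classes] = 0.0
    return logits + mask                     # classes outside the minibatch: -inf

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 16))         # minibatch features
W = rng.standard_normal((10, 16))            # 10-class classifier weights
logits = cosine_logits(feats, W, seen_classes=[2, 5, 7])
```

Normalizing both features and weights bounds each logit's magnitude, which limits how far gradient updates can drag old class weights, i.e. the forgetting mitigation.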
[744] Estimating Central, Peripheral, and Temporal Visual Contributions to Human Decision Making in Atari Games
Henrik Krauss, Takehisa Yairi
Main category: cs.LG
TL;DR: A study analyzing how different visual information sources (peripheral vision, gaze maps, past states) contribute to human decision-making in Atari games using eye-tracking data and controlled ablation experiments.
Details
Motivation: To understand how humans integrate different visual information sources (peripheral vision, explicit gaze information, and past-state information) when making decisions in dynamic visual environments like video games.
Method: Used Atari-HEAD dataset with synchronized eye-tracking, created a controlled ablation framework to reverse-engineer contributions of different information sources, trained action-prediction networks under six settings that selectively include/exclude peripheral info, gaze maps, and past-state info.
Result: Peripheral information showed strongest contribution (35.27-43.90% accuracy drop when removed), gaze information had smaller impact (2.11-2.76% drop), past-state information showed broader range (1.52-15.51% drop). Clustering analysis identified different behavioral regimes: focus-dominated, periphery-dominated, and contextual decision situations.
Conclusion: Human decision-making in Atari depends strongly on information beyond current gaze focus, with peripheral vision being most critical. The framework provides a way to estimate information-source contributions from behavior in dynamic visual environments.
Abstract: We study how different visual information sources contribute to human decision making in dynamic visual environments. Using Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye-tracking, we introduce a controlled ablation framework as a means to reverse-engineer the contribution of peripheral visual information, explicit gaze information in form of gaze maps, and past-state information from human behavior. We train action-prediction networks under six settings that selectively include or exclude these information sources. Across 20 games, peripheral information shows by far the strongest contribution, with median prediction-accuracy drops in the range of 35.27-43.90% when removed. Gaze information yields smaller drops of 2.11-2.76%, while past-state information shows a broader range of 1.52-15.51%, with the upper end likely more informative due to reduced peripheral-information leakage. To complement aggregate accuracies, we cluster states by true-action probabilities assigned by the different model configurations. This analysis identifies coarse behavioral regimes, including focus-dominated, periphery-dominated, and more contextual decision situations. These results suggest that human decision making in Atari depends strongly on information beyond the current focus of gaze, while the proposed framework provides a way to estimate such information-source contributions from behavior.
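The headline numbers are median accuracy drops across games when one information source is ablated; the computation itself is trivial to sketch. The per-game accuracies below are hypothetical, not the paper's data:

```python
import numpy as np

def median_drop(acc_full, acc_ablated):
    """Median prediction-accuracy drop (percentage points) across games
    when one information source is removed from the input."""
    return float(np.median(np.asarray(acc_full) - np.asarray(acc_ablated)))

# Hypothetical per-game accuracies (%) with and without peripheral input.
full  = [62.0, 55.0, 70.0, 48.0, 66.0]
ablat = [20.0, 18.0, 30.0, 12.0, 25.0]
drop = median_drop(full, ablat)
```

The same statistic, computed per ablation setting across 20 games, yields the 35.27-43.90% (peripheral), 2.11-2.76% (gaze), and 1.52-15.51% (past-state) ranges reported above.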
[745] TinyNina: A Resource-Efficient Edge-AI Framework for Sustainable Air Quality Monitoring via Intra-Image Satellite Super-Resolution
Prasanjit Dey, Zachary Yahn, Bianca Schoen-Phelan, Soumyabrata Dev
Main category: cs.LG
TL;DR: TinyNina is an ultra-lightweight Edge-AI framework for high-resolution NO2 monitoring using Sentinel-2 satellite data, achieving state-of-the-art accuracy with only 51K parameters through intra-image learning and spectral attention mechanisms.
Details
Motivation: Current satellite-based NO2 monitoring suffers from limited spatial resolution, requiring costly external high-resolution datasets. There's a need for resource-efficient solutions for real-time air quality monitoring in smart cities without dependency on expensive reference data.
Method: Proposes TinyNina framework with intra-image learning that uses Sentinel-2’s multi-spectral hierarchy as internal training labels. Incorporates wavelength-specific attention gates and depthwise separable convolutions to preserve pollutant-sensitive spectral features while maintaining ultra-lightweight architecture.
Result: Achieves state-of-the-art MAE of 7.4 μg/m³ validated against 3,276 satellite-ground station pairs. Reduces computational overhead by 95% and provides 47× faster inference compared to high-capacity models like EDSR and RCAN, with only 51K parameters.
Conclusion: TinyNina provides a scalable, low-latency solution for real-time air quality monitoring in smart city infrastructures by prioritizing task-specific utility and architectural efficiency, eliminating dependency on costly external datasets.
Abstract: Nitrogen dioxide (NO$_2$) is a primary atmospheric pollutant and a significant contributor to respiratory morbidity and urban climate-related challenges. While satellite platforms like Sentinel-2 provide global coverage, their native spatial resolution often limits the precision required for fine-grained NO$_2$ assessment. To address this, we propose TinyNina, a resource-efficient Edge-AI framework specifically engineered for sustainable environmental monitoring. TinyNina implements a novel intra-image learning paradigm that leverages the multi-spectral hierarchy of Sentinel-2 as internal training labels, effectively eliminating the dependency on costly and often unavailable external high-resolution reference datasets. The framework incorporates wavelength-specific attention gates and depthwise separable convolutions to preserve pollutant-sensitive spectral features while maintaining an ultra-lightweight footprint of only 51K parameters. Experimental results, validated against 3,276 matched satellite-ground station pairs, demonstrate that TinyNina achieves a state-of-the-art Mean Absolute Error (MAE) of 7.4 $\mu$g/m$^3$. This performance represents a 95% reduction in computational overhead and 47$\times$ faster inference compared to high-capacity models such as EDSR and RCAN. By prioritizing task-specific utility and architectural efficiency, TinyNina provides a scalable, low-latency solution for real-time air quality monitoring in smart city infrastructures.
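The depthwise separable design behind the 51K-parameter footprint trades a standard k x k convolution for a depthwise k x k plus pointwise 1 x 1 pair; a back-of-envelope parameter count shows why this shrinks the model (layer sizes here are illustrative, not TinyNina's actual configuration):

```python
def conv_params(k, c_in, c_out):
    """Weight counts for a standard k x k convolution versus a depthwise
    separable one (depthwise k x k per channel + pointwise 1 x 1)."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable

std, sep = conv_params(k=3, c_in=64, c_out=64)
saving = 1 - sep / std          # large fraction of weights eliminated per layer
```

Stacking such layers, plus keeping channel widths small, is how sub-100K-parameter vision models are typically assembled for edge deployment.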
[746] DP-OPD: Differentially Private On-Policy Distillation for Language Models
Fatemeh Khadem, Sajad Mousavi, Yi Fang, Yuhong Liu
Main category: cs.LG
TL;DR: DP-OPD: A synthesis-free framework for differentially private model compression that uses on-policy distillation with a frozen teacher to train a DP student, eliminating the need for DP teacher training and synthetic text generation.
Details
Motivation: There's tension between privacy guarantees (differential privacy) and efficient deployment through model compression for LLMs trained on sensitive data. Existing private distillation methods either apply DP-SGD to both teacher and student (worsening computation and privacy-utility tradeoff) or rely on DP synthetic text generation from a DP-trained teacher (introducing complex pipelines).
Method: DP-OPD enforces privacy solely through DP-SGD on the student while using a frozen teacher to provide dense token-level targets on student-generated trajectories. It uses private generalized knowledge distillation on continuation tokens, collapsing private compression into a single DP student-training loop.
Result: Under strict privacy budget (ε=2.0), DP-OPD improves perplexity over DP fine-tuning and off-policy DP distillation, and outperforms synthesis-based DP distillation (Yelp: 44.15→41.68; BigPatent: 32.43→30.63) while substantially simplifying the training pipeline.
Conclusion: DP-OPD provides an efficient synthesis-free framework for differentially private model compression that improves privacy-utility tradeoff and simplifies the training pipeline by eliminating DP teacher training and offline synthetic text generation.
Abstract: Large language models (LLMs) are increasingly adapted to proprietary and domain-specific corpora that contain sensitive information, creating a tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), typically enforced via DP-SGD, provides record-level protection but often incurs substantial utility loss in autoregressive generation, where optimization noise can amplify exposure bias and compounding errors along long rollouts. Existing approaches to private distillation either apply DP-SGD to both teacher and student, worsening computation and the privacy–utility tradeoff, or rely on DP synthetic text generation from a DP-trained teacher, avoiding DP on the student at the cost of DP-optimizing a large teacher and introducing an offline generation pipeline. We propose \textbf{Differentially Private On-Policy Distillation (DP-OPD)}, a synthesis-free framework that enforces privacy solely through DP-SGD on the student while leveraging a frozen teacher to provide dense token-level targets on \emph{student-generated} trajectories. DP-OPD instantiates this idea via \emph{private generalized knowledge distillation} on continuation tokens. Under a strict privacy budget ($\varepsilon=2.0$), DP-OPD improves perplexity over DP fine-tuning and off-policy DP distillation, and outperforms synthesis-based DP distillation (Yelp: 44.15$\rightarrow$41.68; BigPatent: 32.43$\rightarrow$30.63), while substantially simplifying the training pipeline. In particular, \textbf{DP-OPD collapses private compression into a single DP student-training loop} by eliminating DP teacher training and offline synthetic text generation. Code will be released upon publication at https://github.com/khademfatemeh/dp_opd.
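The student-side privacy mechanism is standard DP-SGD: per-example gradient clipping followed by calibrated Gaussian noise on the aggregate. A minimal sketch of one aggregation step (the clip norm and noise multiplier are illustrative, not the paper's settings):

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_mult, rng):
    """One DP-SGD step: clip each example's gradient to L2 norm clip_norm,
    sum, add Gaussian noise with std noise_mult * clip_norm, then average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = noise_mult * clip_norm * rng.standard_normal(total.shape)
    return (total + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) * s for s in (0.1, 5.0, 50.0)]
g_priv = dp_sgd_gradient(grads, clip_norm=1.0, noise_mult=1.1, rng=rng)
```

In DP-OPD this noisy gradient updates only the student; the frozen teacher never touches private gradients, which is what removes the second DP training loop.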
[747] MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation
Zhe Feng, Shilong Tao, Haonan Sun, Shaohan Chen, Zhanxing Zhu, Yunhuai Liu
Main category: cs.LG
TL;DR: MAVEN: A mesh-aware volumetric encoding network that explicitly models higher-dimensional geometric elements (3D cells, 2D facets, vertices) for more accurate 3D flexible deformation simulation.
Details
Motivation: Existing GNNs for simulating flexible deformations represent meshes only with vertices and edges, overlooking higher-dimensional spatial features (2D facets, 3D cells). This makes it challenging to accurately capture boundary representations and volumetric characteristics needed for modeling contact interactions and internal physical propagation, especially under sparse mesh discretization.
Method: MAVEN introduces a mesh-aware volumetric encoding network that explicitly models geometric mesh elements of higher dimensions. It establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated to reduce the burden of implicitly learning geometric patterns.
Result: MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.
Conclusion: Explicit modeling of higher-dimensional geometric elements in mesh structures improves accuracy and naturalness in 3D flexible deformation simulation, particularly for capturing boundary representations and volumetric characteristics important for contact interactions.
Abstract: Deep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g., 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.
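The cell/facet/vertex mappings MAVEN learns presuppose the mesh's incidence structure. A small sketch of extracting that structure from a tetrahedral mesh (a generic construction for illustration, not the paper's code):

```python
from itertools import combinations

def tet_mesh_incidence(cells):
    """Enumerate the 2D facets (triangles) of a tetrahedral mesh and
    build a cell->facet map. `cells` is a list of 4-tuples of vertex
    indices; each facet is stored once under a canonical sorted key."""
    facet_ids = {}        # canonical facet (sorted vertex triple) -> id
    cell_to_facets = []   # per cell, the ids of its 4 triangular faces
    for cell in cells:
        ids = []
        for tri in combinations(sorted(cell), 3):
            if tri not in facet_ids:
                facet_ids[tri] = len(facet_ids)
            ids.append(facet_ids[tri])
        cell_to_facets.append(ids)
    return facet_ids, cell_to_facets

# two tets sharing the facet (1, 2, 3): 7 distinct facets in total
facets, c2f = tet_mesh_incidence([(0, 1, 2, 3), (1, 2, 3, 4)])
```

Learnable mappings among cells, facets, and vertices would then aggregate features along exactly these incidence relations.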
[748] Discrete Prototypical Memories for Federated Time Series Foundation Models
Liwei Deng, Qingxiang Liu, Xinhe Niu, Shengchao Chen, Sheng Sun, Yuankai Wu, Guodong Long, Yuxuan Liang
Main category: cs.LG
TL;DR: FeDPM is a federated learning framework for time-series foundation models using discrete prototypical memories to address semantic misalignment between time-series data and LLMs’ text-centric latent spaces.
Details
Motivation: LLMs as federated learning-based time series foundation models can transfer generalization capabilities while preserving data privacy, but face semantic misalignment between time-series data and text-centric LLM latent spaces, and parameter-sharing in FL contradicts the discrete, recurring nature of time-series regimes.
Method: FeDPM learns local prototypical memory priors for intra-domain time-series data, aligns cross-domain memories to create a unified discrete latent space, and uses domain-specific memory update mechanisms to balance shared and personalized prototypical knowledge.
Result: Extensive experiments demonstrate the efficiency and effectiveness of FeDPM, with code publicly available.
Conclusion: FeDPM addresses key limitations in using LLMs for federated time-series foundation models through discrete prototypical memories, enabling better alignment between time-series semantics and LLM latent spaces while handling cross-domain heterogeneity.
Abstract: Leveraging Large Language Models (LLMs) as federated learning (FL)-based time series foundation models offers a promising way to transfer the generalization capabilities of LLMs to time series data while keeping private data local. However, the semantic misalignment between time-series data and the text-centric latent space of existing LLMs often leads to degraded performance. Meanwhile, the parameter-sharing mechanism in existing FL methods models heterogeneous cross-domain time-series data in a unified continuous latent space, which contradicts the fact that time-series semantics frequently manifest as discrete and recurring regimes. To address these limitations, we propose FeDPM, a federated framework for time-series foundation models based on discrete prototypical memories. Specifically, we learn local prototypical memory priors for intra-domain time-series data. We then align cross-domain memories to promote a unified discrete latent space and introduce a domain-specific memory update mechanism to balance shared and personalized prototypical knowledge. Extensive experiments demonstrate the efficiency and effectiveness of FeDPM. The code is publicly available at https://anonymous.4open.science/r/FedUnit-64D1.
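A discrete prototypical memory of the kind FeDPM builds on can be sketched as nearest-prototype assignment plus exponential-moving-average updates. The class below is a vector-quantization-style stand-in (the names, decay, and update rule are illustrative assumptions, not FeDPM's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)

class PrototypeMemory:
    """Discrete memory: inputs are snapped to their nearest prototype,
    and used prototypes drift toward the assigned data via EMA."""
    def __init__(self, num_protos, dim, decay=0.9):
        self.protos = rng.normal(size=(num_protos, dim))
        self.decay = decay

    def assign(self, x):
        # index of the nearest prototype for each row of x
        d = ((x[:, None, :] - self.protos[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def update(self, x):
        idx = self.assign(x)
        for k in np.unique(idx):
            mean_k = x[idx == k].mean(axis=0)
            self.protos[k] = self.decay * self.protos[k] + (1 - self.decay) * mean_k
        return idx

mem = PrototypeMemory(num_protos=8, dim=4)
codes = mem.update(rng.normal(size=(16, 4)))
```

In the federated setting, clients would share and align such discrete codebooks rather than raw continuous parameters.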
[749] ECG Biometrics with ArcFace-Inception: External Validation on MIMIC and HEEDB
Arjuna Scagnetto
Main category: cs.LG
TL;DR: ECG biometric identification using 1D Inception-v1 with ArcFace shows strong performance but degrades with domain shifts, temporal gaps, and larger galleries, though reranking helps.
Details
Motivation: Previous ECG biometric studies used small cohorts and short intervals, lacking understanding of performance under large galleries, domain shifts, and multi-year temporal gaps.
Method: Trained 1D Inception-v1 model with ArcFace on 164,440 12-lead ECGs from 53,079 patients, tested on MIMIC-IV-ECG and HEEDB datasets using closed-set leave-one-out protocol with Rank@K and TAR@FAR metrics.
Result: Rank@1: 0.9506 (ASUGI-DB), 0.8291 (MIMIC-GC), 0.6884 (HEEDB-GC). Performance declined with temporal gaps (1-5 years) and larger galleries, but reranking improved HEEDB-RR from 0.7765 to 0.8005.
Conclusion: ECG identity information remains measurable at scale but is strongly affected by domain heterogeneity, longitudinal drift, gallery size, and benefits from second-stage score processing.
Abstract: ECG biometrics has been studied mainly on small cohorts and short inter-session intervals, leaving open how identification behaves under large galleries, external domain shift, and multi-year temporal gaps. We evaluated a 1D Inception-v1 model trained with ArcFace on an internal clinical corpus of 164,440 12-lead ECGs from 53,079 patients and tested it on larger cohorts derived from MIMIC-IV-ECG and HEEDB. The study used a unified closed-set leave-one-out protocol with Rank@K and TAR@FAR metrics, together with scale, temporal-stress, reranking, and confidence analyses. Under general comparability, the system achieved Rank@1 of 0.9506 on ASUGI-DB, 0.8291 on MIMIC-GC, and 0.6884 on HEEDB-GC. In the temporal stress test at constant gallery size, Rank@1 declined from 0.7853 to 0.6433 on MIMIC and from 0.6864 to 0.5560 on HEEDB from 1 to 5 years. Scale analysis on HEEDB showed monotonic degradation as gallery size increased and recovery as more examinations per patient became available. On HEEDB-RR, post-hoc reranking further improved retrieval, with AS-norm reaching Rank@1 = 0.8005 from a 0.7765 baseline. ECG identity information therefore remains measurable under externally validated large-scale closed-set conditions, but its operational quality is strongly affected by domain heterogeneity, longitudinal drift, gallery size, and second-stage score processing.
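The ArcFace objective used here is standard: cosine logits with an additive angular margin on the target class. A numpy sketch (s = 64 and m = 0.5 are common defaults, not necessarily the paper's settings):

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """ArcFace: L2-normalize embeddings and class weights, add the
    angular margin m to the target-class angle, scale logits by s."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    target = np.cos(theta + m)            # margin-penalized target cosine
    logits = cos.copy()
    rows = np.arange(len(labels))
    logits[rows, labels] = target[rows, labels]
    return s * logits

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))      # e.g. Inception-v1 ECG embeddings
W = rng.normal(size=(10, 8))       # one weight vector per identity
labels = np.array([0, 3, 7, 2])
logits = arcface_logits(emb, W, labels)
```

The margin forces the target logit below the plain cosine, so training must pull embeddings closer to their identity center than softmax alone would require.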
[750] Isokinetic Flow Matching for Pathwise Straightening of Generative Flows
Tauhid Khan
Main category: cs.LG
TL;DR: Iso-FM introduces acceleration regularization to Flow Matching, reducing curvature in learned velocity fields to enable high-quality few-step sampling without expensive second-order computations.
Details
Motivation: Flow Matching suffers from strong curvature in learned velocity fields due to trajectory superposition, which causes numerical truncation errors that bottleneck few-step sampling efficiency.
Method: Iso-FM uses a lightweight, Jacobian-free dynamical regularizer that penalizes pathwise acceleration via self-guided finite-difference approximation of the material derivative Dv/Dt, enforcing local velocity consistency without auxiliary encoders or second-order autodifferentiation.
Result: On CIFAR-10 (DiT-S/2), Iso-FM reduces conditional non-OT FID at 2 steps from 78.82 to 27.13 (2.9x efficiency gain) and achieves best FID of 10.23 at 4 steps.
Conclusion: Acceleration regularization is a principled, compute-efficient mechanism for fast generative sampling that can be added as plug-and-play to single-stage Flow Matching training.
Abstract: Flow Matching (FM) constructs linear conditional probability paths, but the learned marginal velocity field inevitably exhibits strong curvature due to trajectory superposition. This curvature severely inflates numerical truncation errors, bottlenecking few-step sampling. To overcome this, we introduce Isokinetic Flow Matching (Iso-FM), a lightweight, Jacobian-free dynamical regularizer that directly penalizes pathwise acceleration. By using a self-guided finite-difference approximation of the material derivative Dv/Dt, Iso-FM enforces local velocity consistency without requiring auxiliary encoders or expensive second-order autodifferentiation. Operating as a pure plug-and-play addition to single-stage FM training, Iso-FM dramatically improves few-step generation. On CIFAR-10 (DiT-S/2), Iso-FM slashes conditional non-OT FID at 2 steps from 78.82 to 27.13 - a 2.9x relative efficiency gain - and reaches a best-observed FID at 4 steps of 10.23. These results firmly establish acceleration regularization as a principled, compute-efficient mechanism for fast generative sampling.
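The core of Iso-FM's regularizer, per the abstract, is a finite-difference estimate of the material derivative Dv/Dt along the flow. A minimal sketch of that estimate on a toy 1D velocity field (the step size and test fields are illustrative, not the paper's):

```python
import numpy as np

def acceleration_penalty(v, x, t, dt=1e-2):
    """Finite-difference estimate of Dv/Dt = dv/dt + (v . grad) v along
    the flow, penalized in L2. Requires only two forward evaluations of
    the velocity field v(x, t), no second-order autodiff."""
    v0 = v(x, t)
    v1 = v(x + dt * v0, t + dt)   # velocity one Euler step downstream
    accel = (v1 - v0) / dt
    return np.mean(accel ** 2)

# a straight (constant-velocity) field has zero pathwise acceleration;
# a curved field does not
straight = lambda x, t: np.ones_like(x)
curved = lambda x, t: np.sin(x + t)

x = np.linspace(0.0, 1.0, 5)
p_straight = acceleration_penalty(straight, x, t=0.0)
p_curved = acceleration_penalty(curved, x, t=0.0)
```

Minimizing this penalty pushes trajectories toward straight lines, which is exactly what makes few-step Euler sampling accurate.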
[751] SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
Ziwei Li, Yuang Ma, Yi Kang
Main category: cs.LG
TL;DR: SLaB: A novel compression framework that decomposes LLM linear layers into sparse, low-rank, and binary matrices without retraining, achieving state-of-the-art performance at high compression ratios.
Details
Motivation: Large language models have massive computational and memory demands that hinder deployment. Existing compression methods often fail to maintain good performance at high compression ratios, creating a need for more effective compression techniques.
Method: SLaB decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. It eliminates the need for retraining and uses activation-aware pruning scores to guide the decomposition process.
Result: On Llama-family models, SLaB reduces perplexity by up to 36% compared to existing methods at 50% compression and improves accuracy by up to 8.98% over baseline on zero-shot tasks, achieving state-of-the-art performance.
Conclusion: SLaB provides an effective compression framework for LLMs that maintains performance at high compression ratios without requiring retraining, addressing deployment challenges of large models.
Abstract: The rapid growth of large language models (LLMs) presents significant deployment challenges due to their massive computational and memory demands. While model compression, such as network pruning, offers potential solutions, most existing methods often fail to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB eliminates the need for retraining and leverages activation-aware pruning scores to guide the decomposition process. Experiments on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% compared to existing methods at 50% compression and improving accuracy by up to 8.98% over the baseline on zero-shot tasks.
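One illustrative way to realize a sparse + low-rank + binary split is a greedy residual pass: truncated SVD for the low-rank part, magnitude thresholding for the sparse part, and a scaled sign matrix for the rest. This is a generic sketch under assumed heuristics, not SLaB's activation-aware algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_lowrank_binary(W, rank=4, sparsity=0.05):
    """Greedy W ~ S + L + B: low-rank L from truncated SVD, sparse S
    from the largest residual entries, binary B = alpha * sign(rest),
    where alpha = mean(|rest|) is the least-squares scale for a fixed
    sign pattern."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # low-rank part
    R = W - L
    k = int(sparsity * R.size)
    thresh = np.sort(np.abs(R), axis=None)[-k]
    S = np.where(np.abs(R) >= thresh, R, 0.0)       # sparse outliers
    R2 = R - S
    alpha = np.mean(np.abs(R2))
    B = alpha * np.sign(R2)                          # scaled binary part
    return S, L, B

W = rng.normal(size=(32, 32))
S, L, B = sparse_lowrank_binary(W)
err = np.linalg.norm(W - (S + L + B)) / np.linalg.norm(W)
```

The appeal of the split is storage: L needs two thin factors, S a few indices and values, and B one scalar plus a bitmask.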
[752] One Model for All: Multi-Objective Controllable Language Models
Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy, Setareh Maghsudi
Main category: cs.LG
TL;DR: MOC introduces multi-objective optimization into RLHF to train a single LLM that generates personalized outputs across different user preferences on the Pareto front, enabling controllable trade-offs among multiple rewards.
Details
Motivation: Current RLHF focuses on fixed rewards from average human ratings, which limits adaptability and controllability for varying user preferences. Personalized LLMs need to align with individual preferences, but face challenges due to scarce per-user data and diverse multi-objective trade-offs.
Method: Multi-Objective Control (MOC) integrates multi-objective optimization principles into RLHF to train an LLM as a preference-conditioned policy network. The approach applies MOO at the policy level for computational efficiency, enabling fine-tuning of a 7B-parameter model on a single A6000 GPU.
Result: Extensive experiments show MOC outperforms baselines in: (i) controllability of LLM outputs w.r.t. user preferences on multi-reward trade-offs; (ii) quality and diversity measured by hyper-volume of multiple solutions; (iii) generalization to unseen preferences.
Conclusion: MOC demonstrates potential for real-world applications requiring scalable and customizable LLMs by enabling personalized outputs across different user preferences on the Pareto front through multi-objective RLHF.
Abstract: Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs’ safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC’s potential for real-world applications requiring scalable and customizable LLMs.
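A common way to condition a single policy on a user preference is linear scalarization over the reward objectives, with preference vectors drawn from the simplex. The sketch below illustrates that generic idea only; MOC's actual MOO scheme may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_preference(num_objectives):
    """Draw a preference vector from the probability simplex."""
    return rng.dirichlet(np.ones(num_objectives))

def scalarized_reward(reward_vec, preference):
    """Linear scalarization: the preference-conditioned policy is
    trained to maximize the weighted sum of per-objective rewards."""
    return float(np.dot(reward_vec, preference))

# e.g. rewards for (helpfulness, safety, humor) of one response
r = np.array([0.9, 0.2, 0.5])
w = sample_preference(3)
score = scalarized_reward(r, w)
```

Feeding the sampled preference vector to the policy as part of its input is what lets one model cover many points on the Pareto front.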
[753] GAIN: Multiplicative Modulation for Domain Adaptation
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
Main category: cs.LG
TL;DR: GAIN: A method for adapting LLMs to new domains without catastrophic forgetting by using multiplicative gain modulation instead of additive parameter updates.
Details
Motivation: Standard adaptation methods like full fine-tuning and LoRA cause catastrophic forgetting when adapting LLMs to new domains because they inject new directions into the weight space, interfering with previously learned knowledge.
Method: GAIN uses multiplicative modulation where W_new = S * W, with S being a learned diagonal matrix applied to attention output projection and optionally FFN layers. This approach re-emphasizes existing features rather than adding new directions, mirroring gain modulation in neuroscience.
Result: GAIN-FFN matches LoRA’s in-domain adaptation performance while improving previously trained domains by 7-13% (validation PPL), whereas LoRA degrades them by 18-36%. After seven sequential adaptations, GAIN-FFN degrades BoolQ by only 0.8% vs LoRA’s 14.9% degradation.
Conclusion: GAIN provides effective domain adaptation without catastrophic forgetting through multiplicative modulation, adding only 46K-230K parameters per model that can be absorbed into pretrained weights for zero inference cost.
Abstract: Adapting LLMs to new domains causes forgetting because standard methods (full fine-tuning, LoRA) inject new directions into the weight space. We propose GAIN, which re-emphasizes existing features through multiplicative modulation W_new = S * W. The learned diagonal matrix S is applied to the attention output projection and optionally the FFN. The principle mirrors gain modulation in neuroscience, where neurons adapt to context by scaling response strength while preserving selectivity. We evaluate GAIN on five models from four families (774M to 70B), adapting sequentially across eight domains. GAIN-FFN matches LoRA’s in-domain adaptation, but their effects on previously trained domains are opposite: GAIN-FFN improves them by 7-13% (validation PPL), while LoRA degrades them by 18-36%. Downstream accuracy confirms the pattern: for example, after seven sequential adaptations on Qwen2.5, GAIN-FFN degrades BoolQ by only 0.8% while LoRA damages it by 14.9%. GAIN adds 46K-230K parameters per model and can be absorbed into the pretrained weights for zero inference cost.
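The stated mechanism, W_new = S * W with diagonal S, and its zero-cost absorption into the pretrained weights can be shown directly in a toy numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 8, 8
W = rng.normal(size=(d_out, d_in))   # frozen pretrained projection

def gain_forward(x, W, s):
    """GAIN-style layer: y = diag(s) @ W @ x, i.e. re-scale existing
    output features instead of adding new weight directions."""
    return (s[:, None] * W) @ x

# after adaptation, absorb the learned gains: W_new = diag(s) @ W,
# so inference cost is identical to the original layer
s_trained = 1.0 + 0.1 * rng.normal(size=d_out)
W_absorbed = s_trained[:, None] * W

x = rng.normal(size=d_in)
y_mod = gain_forward(x, W, s_trained)
y_abs = W_absorbed @ x
```

Initializing s to all ones recovers the pretrained layer exactly, which is why the modulation starts from, and stays close to, the original feature selectivity.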
[754] Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them
Ole Delzer, Sidney Bender
Main category: cs.LG
TL;DR: A reproducibility study unifying various approaches to address spurious correlations in DNNs, comparing XAI-based correction methods with non-XAI baselines under challenging constraints like limited data and severe imbalance.
Details
Motivation: To unify the fractured research landscape around ensuring DNN reliability by addressing spurious correlations, bringing together frameworks like DRO, IRM, shortcut learning, and Clever Hans effect that all aim to make models rely on causally relevant features rather than confounding signals.
Method: Comparative analysis of correction methods under challenging constraints (limited data, severe subgroup imbalance) using both synthetic and real-world datasets. Evaluated XAI-based methods alongside non-XAI baselines, with particular attention to Counterfactual Knowledge Distillation (CFKD) and challenges with group label dependency.
Result: XAI-based methods generally outperform non-XAI approaches, with Counterfactual Knowledge Distillation (CFKD) proving most consistently effective at improving generalization. However, practical application is hindered by dependency on group labels and challenges with automated tools like Spectral Relevance Analysis (SpRAy) in complex/imbalanced scenarios.
Conclusion: While XAI-based methods show promise for improving model reliability, significant obstacles remain including dependency on group labels, challenges with automated annotation tools, and unreliable model selection due to minority group scarcity in validation sets, hindering deployment in safety-critical domains.
Abstract: Deep Neural Networks (DNNs) are increasingly utilized in high-stakes domains like medical diagnostics and autonomous driving where model reliability is critical. However, the research landscape for ensuring this reliability is terminologically fractured across communities that pursue the same goal of ensuring models rely on causally relevant features rather than confounding signals. While frameworks such as distributionally robust optimization (DRO), invariant risk minimization (IRM), shortcut learning, simplicity bias, and the Clever Hans effect all address model failure due to spurious correlations, researchers typically only reference work within their own domains. This reproducibility study unifies these perspectives through a comparative analysis of correction methods under challenging constraints like limited data availability and severe subgroup imbalance. We evaluate recently proposed correction methods based on explainable artificial intelligence (XAI) techniques alongside popular non-XAI baselines using both synthetic and real-world datasets. Findings show that XAI-based methods generally outperform non-XAI approaches, with Counterfactual Knowledge Distillation (CFKD) proving most consistently effective at improving generalization. Our experiments also reveal that the practical application of many methods is hindered by a dependency on group labels, as manual annotation is often infeasible and automated tools like Spectral Relevance Analysis (SpRAy) struggle with complex features and severe imbalance. Furthermore, the scarcity of minority group samples in validation sets renders model selection and hyperparameter tuning unreliable, posing a significant obstacle to the deployment of robust and trustworthy models in safety-critical areas.
[755] Learning from Equivalence Queries, Revisited
Mark Braverman, Roi Livni, Yishay Mansour, Shay Moran, Kobbi Nissim
Main category: cs.LG
TL;DR: The paper revisits learning from equivalence queries with symmetric counterexample generators, studying both full-information and bandit feedback settings with tight bounds on learning rounds.
Details
Motivation: Motivated by modern ML systems that evolve through deployment, user interaction, and periodic updates, the paper addresses limitations of standard supervised learning frameworks and overly pessimistic adversarial models in learning from equivalence queries.
Method: Introduces symmetric counterexample generators that choose counterexamples based only on the symmetric difference between hypothesis and target. Studies learning from equivalence queries under both full-information and bandit feedback settings using game-theoretic analysis of symmetric adversaries with adaptive weighting methods and minimax arguments.
Result: Obtains tight bounds on the number of learning rounds in both full-information and bandit feedback settings for symmetric counterexample generators.
Conclusion: The framework provides a more realistic model for modern ML system evolution, captures natural counterexample mechanisms, and establishes fundamental learning bounds while highlighting directions for future work.
Abstract: Modern machine learning systems, such as generative models and recommendation systems, often evolve through a cycle of deployment, user interaction, and periodic model updates. This differs from standard supervised learning frameworks, which focus on loss or regret minimization over a fixed sequence of prediction tasks. Motivated by this setting, we revisit the classical model of learning from equivalence queries, introduced by Angluin (1988). In this model, a learner repeatedly proposes hypotheses and, when a deployed hypothesis is inadequate, receives a counterexample. Under fully adversarial counterexample generation, however, the model can be overly pessimistic. In addition, most prior work assumes a full-information setting, where the learner also observes the correct label of the counterexample, an assumption that is not always natural. We address these issues by restricting the environment to a broad class of less adversarial counterexample generators, which we call symmetric. Informally, such generators choose counterexamples based only on the symmetric difference between the hypothesis and the target. This class captures natural mechanisms such as random counterexamples (Angluin and Dohrn, 2017; Bhatia, 2021; Chase, Freitag, and Reyzin, 2024), as well as generators that return the simplest counterexample according to a prescribed complexity measure. Within this framework, we study learning from equivalence queries under both full-information and bandit feedback. We obtain tight bounds on the number of learning rounds in both settings and highlight directions for future work. Our analysis combines a game-theoretic view of symmetric adversaries with adaptive weighting methods and minimax arguments.
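A symmetric counterexample generator, as defined informally in the abstract, depends only on the symmetric difference between hypothesis and target. A toy sketch of the random-counterexample instance:

```python
import random

random.seed(0)

def symmetric_counterexample(hypothesis, target, domain):
    """A 'symmetric' generator: it looks only at the symmetric
    difference {x : hypothesis(x) != target(x)} and returns a uniform
    random element of it (None when the hypothesis is correct)."""
    diff = [x for x in domain if hypothesis(x) != target(x)]
    return random.choice(diff) if diff else None

domain = range(10)
target = lambda x: x % 2 == 0    # concept to learn: "even numbers"
hypothesis = lambda x: x < 5     # learner's current guess

cx = symmetric_counterexample(hypothesis, target, domain)
```

A simplest-counterexample generator would be the same construction with `min(diff, key=complexity)` in place of the random choice.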
[756] FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee
Main category: cs.LG
TL;DR: FlashSAC is a fast, stable off-policy RL algorithm that scales model size and data throughput while controlling gradient/weight norms to prevent critic error accumulation, outperforming PPO and other baselines on high-dimensional tasks.
Details
Motivation: On-policy RL methods like PPO have stability but limited policy evaluation due to narrow data distribution, while off-policy methods suffer from slow convergence and instability from critic error accumulation during bootstrapping.
Method: Built on Soft Actor-Critic, FlashSAC reduces gradient updates while scaling up model size and data throughput, with explicit bounds on weight, feature, and gradient norms to maintain stability at scale.
Result: Outperforms PPO and strong off-policy baselines across 60+ tasks in 10 simulators, with largest gains on high-dimensional tasks like dexterous manipulation, and reduces sim-to-real humanoid locomotion training from hours to minutes.
Conclusion: FlashSAC demonstrates the promise of scaled off-policy RL for efficient training and sim-to-real transfer, particularly for high-dimensional control problems.
Abstract: Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
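The norm bounds FlashSAC imposes can be sketched with two standard operations, gradient-norm clipping and row-wise weight projection (generic implementations for illustration; the paper's exact bounds and where they are applied are not specified here):

```python
import numpy as np

def clip_grad_norm(grad, max_norm):
    """Scale the gradient down if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * min(1.0, max_norm / (norm + 1e-12))

def project_weight_rows(W, max_row_norm):
    """Project each weight row back inside an L2 ball, bounding the
    feature magnitudes the layer can produce."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_row_norm / (norms + 1e-12))
    return W * scale

rng = np.random.default_rng(0)
g = clip_grad_norm(rng.normal(size=100) * 10, max_norm=1.0)
W = project_weight_rows(rng.normal(size=(16, 64)), max_row_norm=2.0)
```

Bounding all three quantities limits how far a single noisy critic target can push the network, which is what curbs error accumulation through bootstrapping.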
[757] Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection
Yuwen Jiang, Songyun Ye
Main category: cs.LG
TL;DR: Controlled experiments show imbalance ratio (IR) has weak negative correlation with oversampling effectiveness, not positive as assumed; class separability is stronger moderator.
Details
Motivation: The paper challenges the prevailing assumption that higher class imbalance ratios (IR) lead to greater benefits from oversampling techniques, noting this assumption lacks empirical validation through controlled experimentation.
Method: Conducted 12 controlled experiments with over 100 dataset variants using algorithmically generated Gaussian mixture datasets to systematically manipulate IR while holding data characteristics constant. Additional validation experiments examined ceiling effects and metric-dependence. All methods were evaluated on 17 real-world datasets from OpenML.
Result: When controlling for confounding variables, IR exhibited weak to moderate negative correlation with oversampling benefits (contrary to the assumed positive correlation). Class separability emerged as a substantially stronger moderator, accounting for significantly more variance in method effectiveness than IR alone.
Conclusion: Proposes a ‘Context Matters’ framework that integrates IR, class separability, and cluster structure to provide evidence-based selection criteria for oversampling methods, challenging the traditional IR-threshold paradigm.
Abstract: The prevailing IR-threshold paradigm posits a positive correlation between imbalance ratio (IR) and oversampling effectiveness, yet this assumption remains empirically unsubstantiated through controlled experimentation. We conducted 12 controlled experiments (N > 100 dataset variants) that systematically manipulated IR while holding data characteristics (class separability, cluster structure) constant via algorithmic generation of Gaussian mixture datasets. Two additional validation experiments examined ceiling effects and metric-dependence. All methods were evaluated on 17 real-world datasets from OpenML. Upon controlling for confounding variables, IR exhibited a weak to moderate negative correlation with oversampling benefits. Class separability emerged as a substantially stronger moderator, accounting for significantly more variance in method effectiveness than IR alone. We propose a ‘Context Matters’ framework that integrates IR, class separability, and cluster structure to provide evidence-based selection criteria for practitioners.
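The controlled-generation idea, varying IR while holding separability fixed, can be sketched as a two-class Gaussian generator where the distance between class means sets separability (an illustrative construction, not the authors' exact protocol):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_imbalanced_gaussians(n_major, ir, separation, dim=2):
    """Two isotropic Gaussian classes with imbalance ratio
    ir = n_major / n_minor and ||mu_minor - mu_major|| = separation."""
    n_minor = max(1, int(n_major / ir))
    mu_major = np.zeros(dim)
    mu_minor = np.full(dim, separation / np.sqrt(dim))
    X = np.vstack([rng.normal(size=(n_major, dim)) + mu_major,
                   rng.normal(size=(n_minor, dim)) + mu_minor])
    y = np.concatenate([np.zeros(n_major), np.ones(n_minor)])
    return X, y

X, y = make_imbalanced_gaussians(n_major=1000, ir=10, separation=3.0)
```

Sweeping `ir` with `separation` fixed (and vice versa) is what lets the study attribute oversampling benefits to one factor at a time.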
[758] Dynamic Free-Rider Detection in Federated Learning via Simulated Attack Patterns
Motoki Nakamura
Main category: cs.LG
TL;DR: S2-WEF: A novel federated learning free-rider detection method that simulates attack patterns and uses two-dimensional clustering to detect dynamic free-riders without proxy datasets or pre-training.
Details
Motivation: Existing free-rider detection methods like WEF struggle with dynamic free-riders who behave honestly initially then switch to free-riding, especially under sophisticated attacks like delta weight and adaptive WEF-camouflage attacks.
Method: S2-WEF simulates WEF patterns of potential global-model-based attacks using previously broadcasted global models, then combines simulation-based similarity scores with deviation scores from mutual WEF comparisons, using two-dimensional clustering for detection.
Result: Extensive experiments across three datasets and five attack types show S2-WEF achieves higher robustness than existing approaches in detecting dynamic free-riders.
Conclusion: S2-WEF provides an effective solution for detecting dynamic free-riders in federated learning without requiring proxy datasets or pre-training, improving security against sophisticated attacks.
Abstract: Federated learning (FL) enables multiple clients to collaboratively train a global model by aggregating local updates without sharing private data. However, FL often faces the challenge of free-riders, clients who submit fake model parameters without performing actual training to obtain the global model without contributing. Chen et al. proposed a free-rider detection method based on the weight evolving frequency (WEF) of model parameters. This detection approach is a leading candidate for practical free-rider detection methods, as it requires neither a proxy dataset nor pre-training. Nevertheless, it struggles to detect "dynamic" free-riders who behave honestly in early rounds and later switch to free-riding, particularly under global-model-mimicking attacks such as the delta weight attack and our newly proposed adaptive WEF-camouflage attack. In this paper, we propose a novel detection method S2-WEF that simulates the WEF patterns of potential global-model-based attacks on the server side using previously broadcasted global models, and identifies clients whose submitted WEF patterns resemble the simulated ones. To handle a variety of free-rider attack strategies, S2-WEF further combines this simulation-based similarity score with a deviation score computed from mutual comparisons among submitted WEFs, and separates benign and free-rider clients by two-dimensional clustering and per-score classification. This method enables dynamic detection of clients that transition into free-riders during training without proxy datasets or pre-training. We conduct extensive experiments across three datasets and five attack types, demonstrating that S2-WEF achieves higher robustness than existing approaches.
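Weight evolving frequency (WEF) counts, per parameter, how often a client's submitted weights actually change across rounds. A toy sketch (the change criterion and eps threshold are illustrative assumptions, not Chen et al.'s exact definition):

```python
import numpy as np

def weight_evolving_frequency(weight_history, eps=1e-8):
    """Per-parameter count of rounds in which a client's submitted
    weights changed by more than eps. Honest local training keeps
    updating most weights; a free-rider's WEF pattern is much flatter."""
    wef = np.zeros_like(weight_history[0])
    for prev, curr in zip(weight_history, weight_history[1:]):
        wef += (np.abs(curr - prev) > eps).astype(float)
    return wef

rng = np.random.default_rng(0)
# an honest client: weights drift every round
honest = [rng.normal(size=5) for _ in range(6)]
# a naive free-rider: resubmits identical weights each round
lazy = [np.ones(5)] * 6

wef_honest = weight_evolving_frequency(honest)
wef_lazy = weight_evolving_frequency(lazy)
```

S2-WEF's server-side simulation would compute the same statistic on attack trajectories replayed from past global models, then compare each client's submitted WEF pattern against those simulated ones.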
[759] A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
Bohao Li, Tao Zou, Junchen Ye, Yan Gong, Bowen Du
Main category: cs.LG
TL;DR: HealthPoint (HP) is a unified clinical point cloud paradigm for multi-level incomplete EHRs that represents clinical events as points in 4D space and uses Low-Rank Relational Attention to model interactions, achieving SOTA performance on risk prediction tasks.
Details
Motivation: Multimodal EHRs suffer from multi-level incompleteness including irregular sampling, missing modalities, and sparse labels, causing temporal misalignment, modality imbalance, and limited supervision. Existing methods either assume complete data or address only isolated aspects of incompleteness, often distorting clinical semantics.
Method: Proposes HealthPoint (HP), representing heterogeneous clinical events as points in continuous 4D space (content, time, modality, case). Uses Low-Rank Relational Attention to efficiently model interactions between arbitrary point pairs across all dimensions, with hierarchical interaction and sampling for computational efficiency.
Result: Experiments on large-scale EHR datasets for risk prediction show HP consistently achieves state-of-the-art performance and strong robustness under varying degrees of incompleteness.
Conclusion: HP provides a unified framework for handling multi-level incomplete EHRs through flexible event-level interaction and fine-grained self-supervision, supporting robust modality recovery and effective use of unlabeled data.
Abstract: Deep learning-based modeling of multimodal Electronic Health Records (EHRs) has become an important approach for clinical diagnosis and risk prediction. However, due to diverse clinical workflows and privacy constraints, raw EHRs are inherently multi-level incomplete, including irregular sampling, missing modalities, and sparse labels. These issues cause temporal misalignment, modality imbalance, and limited supervision. Most existing multimodal methods assume relatively complete data, and even methods designed for incompleteness usually address only one or two of these issues in isolation. As a result, they often rely on rigid temporal/modal alignment or discard incomplete data, which may distort raw clinical semantics. To address this problem, we propose HealthPoint (HP), a unified clinical point cloud paradigm for multi-level incomplete EHRs. HP represents heterogeneous clinical events as points in a continuous 4D space defined by content, time, modality, and case. To model interactions between arbitrary point pairs, we introduce a Low-Rank Relational Attention mechanism that efficiently captures high-order dependencies across these four dimensions. We further develop a hierarchical interaction and sampling strategy to balance fine-grained modeling and computational efficiency. Built on this framework, HP enables flexible event-level interaction and fine-grained self-supervision, supporting robust modality recovery and effective use of unlabeled data. Experiments on large-scale EHR datasets for risk prediction show that HP consistently achieves state-of-the-art performance and strong robustness under varying degrees of incompleteness.
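The abstract does not give the exact form of Low-Rank Relational Attention; the sketch below is one plausible reading, assumed for illustration: standard dot-product attention over event points plus a pairwise relational bias built from the 4D coordinates through rank-r factors `U, V`, so the relation term never materializes a full learned pairwise tensor:

```python
import numpy as np

def low_rank_relational_attention(x, coords, Wq, Wk, Wv, U, V):
    """Illustrative sketch (not the paper's exact formulation).
    x:      (n, d)  event content embeddings
    coords: (n, 4)  point coordinates (content, time, modality, case)
    U, V:   (4, r)  low-rank factors; the pairwise bias
            (coords @ U) @ (coords @ V).T stores only O(n*r) factors
            instead of an O(n^2) learned relation tensor."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    bias = (coords @ U) @ (coords @ V).T   # low-rank pairwise relation
    a = scores + bias
    a = np.exp(a - a.max(axis=-1, keepdims=True))  # stable softmax
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v
```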
[760] From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
Zhuohao Yu, Zhiwei Steven Wu, Adam Block
Main category: cs.LG
TL;DR: CAUTION: A pessimism-based approach to mitigate reward hacking in BoN sampling by penalizing prediction error as a signal of distributional uncertainty, improving over standard best-of-N sampling.
Details
Motivation: Best-of-N (BoN) sampling improves LM performance but suffers from reward hacking where increased N leads to selecting responses that exploit reward model imperfections rather than genuine quality improvements. Existing solutions fail to fully address over-optimization or are too conservative.
Method: Proposes CAUTION, applying the pessimism principle from RL to BoN sampling. Trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical responses, penalizing distributional uncertainty. This is the reverse of curiosity-based approaches.
Result: CAUTION substantially mitigates reward hacking in BoN sampling, is simple and computationally efficient. Theoretical analysis in a simplified linear setting shows provable improvement over standard BoN. Also demonstrates curiosity-based approaches can be general OOD detection techniques for LLMs.
Conclusion: CAUTION provides a practical solution to reward hacking in BoN sampling through pessimism-based uncertainty penalization, establishing a principled approach for inference-time compute scaling that avoids reward model exploitation.
Abstract: Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is BoN sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in RL, which uses lower confidence bounds on value estimates to avoid OOD actions with uncertain reward estimates. Our approach, termed as caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
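The selection rule is simple to sketch. The error model here (distance to the mean of typical-response features) is a toy stand-in for the trained error model the paper describes; only the reward-minus-penalty argmax is the idea being illustrated:

```python
import numpy as np

def caution_best_of_n(features, rewards, error_model, lam=1.0):
    """Pessimistic best-of-N (a sketch of the idea, not the paper's exact
    recipe): penalize each candidate's reward-model score by an auxiliary
    model's prediction error, so atypical (likely reward-hacked)
    candidates are down-weighted before the argmax."""
    penalized = rewards - lam * error_model(features)
    return int(np.argmax(penalized))
```

In the toy check below, the highest-raw-reward candidate is far out of distribution, so plain BoN would pick it while the cautioned score does not.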
[761] Grokking as Dimensional Phase Transition in Neural Networks
Ping Wang
Main category: cs.LG
TL;DR: Grokking is identified as a dimensional phase transition where effective gradient dimensionality crosses from sub-diffusive to super-diffusive at generalization onset, revealing self-organized criticality in learning dynamics.
Details
Motivation: To understand the abrupt memorization-to-generalization transition (grokking) in neural networks, which challenges conventional understanding of learning dynamics and trainability of overparameterized networks.
Method: Analyzed finite-size scaling of gradient avalanche dynamics across eight model scales, measuring effective dimensionality D of gradient fields to characterize the phase transition and its relation to gradient field geometry.
Result: Found that grokking is a dimensional phase transition where D crosses from sub-diffusive (D < 1) to super-diffusive (D > 1) at generalization onset, exhibiting self-organized criticality. The dimensionality reflects gradient field geometry rather than network architecture.
Conclusion: The grokking-localized D(t) crossing provides new insight into trainability of overparameterized networks, revealing that gradient field geometry (not architecture) drives the memorization-to-generalization transition.
Abstract: Neural network grokking – the abrupt memorization-to-generalization transition – challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a \textit{dimensional phase transition}: effective dimensionality~$D$ crosses from sub-diffusive (subcritical, $D < 1$) to super-diffusive (supercritical, $D > 1$) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, $D$ reflects \textbf{gradient field geometry}, not network architecture: synthetic i.i.d.\ Gaussian gradients maintain $D \approx 1$ regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized $D(t)$ crossing – robust across topologies – offers new insight into the trainability of overparameterized networks.
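The sub-diffusive vs. super-diffusive distinction can be made concrete with a standard diffusion-exponent estimator; this is an illustrative proxy for the paper's effective dimensionality D, not its exact estimator. Fitting the mean-squared displacement MSD(τ) ~ τ^D on a log-log scale gives D = 1 for ordinary diffusion, D < 1 sub-diffusive, D > 1 super-diffusive:

```python
import numpy as np

def effective_dimensionality(traj, max_lag=20):
    """Fit MSD(tau) ~ tau^D for a trajectory of shape (T, d) (e.g. a path
    of gradient statistics over training steps) and return the exponent D.
    Illustrative stand-in for the paper's effective dimensionality."""
    lags = np.arange(1, max_lag + 1)
    msd = np.array([np.mean(np.sum((traj[l:] - traj[:-l]) ** 2, axis=-1))
                    for l in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope
```

Sanity checks: a random walk gives D ≈ 1, while a ballistic (straight-line) trajectory gives D ≈ 2.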
[762] Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
Daniel Bloch
Main category: cs.LG
TL;DR: ARL bridges non-Markovian decision processes with classical RL using path signatures and self-consistent field approach for proactive decision-making in volatile environments.
Details
Motivation: Traditional state-based RL methods fail in environments with jump-diffusions and structural breaks where path-dependent geometry is essential for accurate foresight. There's a need to handle non-Markovian decision processes with a single observed trajectory.
Method: Lifts the state space into a signature-augmented manifold embedding process history as a dynamical coordinate. Uses a self-consistent field approach to maintain an anticipated proxy of the future path-law, enabling deterministic evaluation of expected returns. Transitions from stochastic branching to single-pass linear evaluation.
Result: Framework preserves fundamental contraction properties and ensures stable generalization even with heavy-tailed noise. Reduces computational complexity and variance. Enables proactive risk management and superior policy stability in volatile continuous-time environments.
Conclusion: Grounding RL in topological features of path-space allows agents to achieve better performance in complex, non-Markovian environments by leveraging path signatures and anticipatory mechanisms.
Abstract: This paper introduces Anticipatory Reinforcement Learning (ARL), a novel framework designed to bridge the gap between non-Markovian decision processes and classical reinforcement learning architectures, specifically under the constraint of a single observed trajectory. In environments characterised by jump-diffusions and structural breaks, traditional state-based methods often fail to capture the essential path-dependent geometry required for accurate foresight. We resolve this by lifting the state space into a signature-augmented manifold, where the history of the process is embedded as a dynamical coordinate. By utilising a self-consistent field approach, the agent maintains an anticipated proxy of the future path-law, allowing for a deterministic evaluation of expected returns. This transition from stochastic branching to a single-pass linear evaluation significantly reduces computational complexity and variance. We prove that this framework preserves fundamental contraction properties and ensures stable generalisation even in the presence of heavy-tailed noise. Our results demonstrate that by grounding reinforcement learning in the topological features of path-space, agents can achieve proactive risk management and superior policy stability in highly volatile, continuous-time environments.
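The signature features the state is lifted into can be computed exactly for piecewise-linear paths; here is a minimal depth-2 version (ARL's full signature-augmented manifold goes beyond this sketch):

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 path signature of a piecewise-linear path of shape
    (n_points, d): level 1 is the total increment, level 2 the iterated
    integrals S[i, j] = ∫ (x_i - x_i(0)) dx_j, accumulated segment by
    segment under linear interpolation."""
    x = np.asarray(path, dtype=float)
    inc = np.diff(x, axis=0)                    # per-segment increments
    level1 = x[-1] - x[0]
    start = x[:-1] - x[0]                       # displacement at segment start
    level2 = start.T @ inc + 0.5 * inc.T @ inc  # exact for linear segments
    return level1, level2
```

A useful correctness check is the shuffle identity: the symmetrized level-2 term must equal the outer product of the level-1 increment with itself.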
[763] Batch Loss Score for Dynamic Data Pruning
Qing Zhou, Bingxuan Zhao, Tao Yang, Hongyuan Zhang, Junyu Gao, Qi Wang
Main category: cs.LG
TL;DR: BLS (Batch Loss Score) is a simple 3-line method that uses EMA of batch losses as a proxy for per-sample importance, enabling efficient data pruning without needing per-sample loss computation.
Details
Motivation: Per-sample loss computation for data pruning is challenging for complex models/loss functions, requiring significant implementation effort. Need a simpler alternative that can work with readily available batch losses.
Method: Proposes Batch Loss Score (BLS) using an Exponential Moving Average of batch losses to assign importance scores to samples. Treats the batch loss as a noisy measurement of the individual loss, with the EMA acting as a low-pass filter to reduce batch composition noise.
Result: BLS enables lossless pruning of 20%-50% samples across 14 datasets, 11 tasks, and 18 models. Shows remarkable simplicity (3-line code injection) and can adapt existing per-sample loss methods with one-line proxy.
Conclusion: BLS provides theoretically grounded, computationally efficient alternative to per-sample loss for data pruning, especially useful for complex scenarios where per-sample loss is difficult to access.
Abstract: Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (\textbf{three-line injection}) and readily adapts existing per-sample loss-based methods (\textbf{one-line proxy}). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune \textbf{20%-50%} of samples across \textit{14 datasets}, \textit{11 tasks} and \textit{18 models}, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access. Code is available at https://github.com/mrazhou/BLS.
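The mechanism really is a few lines: each sample's score is an EMA over the batch losses of the batches it appeared in. This sketch follows that description; the `keep_frac` pruning rule is an assumed (but typical) way to use the scores:

```python
import numpy as np

class BatchLossScore:
    """EMA of readily available batch losses as a per-sample importance
    proxy, so no per-sample loss computation is needed. The EMA acts as a
    low-pass filter on batch-composition noise."""
    def __init__(self, n_samples, beta=0.9):
        self.scores = np.zeros(n_samples)
        self.beta = beta

    def update(self, batch_indices, batch_loss):
        # One scalar batch loss updates every sample in the batch.
        self.scores[batch_indices] = (self.beta * self.scores[batch_indices]
                                      + (1 - self.beta) * batch_loss)

    def keep_mask(self, keep_frac=0.7):
        """Boolean mask keeping the highest-scoring fraction of samples."""
        thresh = np.quantile(self.scores, 1 - keep_frac)
        return self.scores >= thresh
```

A batch containing a hard sample tends to have a higher batch loss, so after enough updates the hard samples' EMAs dominate the ranking even though no per-sample loss was ever observed.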
[764] Explainable Machine Learning for Sepsis Outcome Prediction Using a Novel Romanian Electronic Health Record Dataset
Andrei-Alexandru Bunea, Ovidiu Ghibea, Dan-Matei Popovici, Ion Daniel, Octavian Andronic
Main category: cs.LG
TL;DR: Explainable ML models for sepsis outcome prediction using EHR data, achieving high performance (AUC=0.983) with SHAP analysis identifying key clinical predictors like eosinopenia.
Details
Motivation: To develop interpretable machine learning models for sepsis outcome prediction using comprehensive EHR data, aiming to identify clinically relevant predictors while achieving state-of-the-art performance across different classification tasks.
Method: Trained five ML models on EHR data from 12,286 hospitalizations, including demographics, ICD-10 diagnostics, and 600 lab tests. Explored trade-offs between feature richness and patient coverage using subsets of the 10-50 most frequent lab tests. Used SHAP for explainability analysis.
Result: Highest performance achieved for deceased vs. recovered classification (AUC=0.983, accuracy=0.93). SHAP identified strong predictors: cardiovascular comorbidities, urea levels, aspartate aminotransferase, platelet count, and eosinophil percentage. Eosinopenia emerged as a top underutilized predictor.
Conclusion: Explainable ML models can achieve excellent sepsis outcome prediction while maintaining clinical interpretability. Eosinopenia is a valuable but underutilized marker not included in current assessment standards, suggesting clinical applicability of these models.
Abstract: We develop and analyze explainable machine learning (ML) models for sepsis outcome prediction using a novel Electronic Health Record (EHR) dataset from 12,286 hospitalizations at a large emergency hospital in Romania. The dataset includes demographics, International Classification of Diseases (ICD-10) diagnostics, and 600 types of laboratory tests. This study aims to identify clinically strong predictors while achieving state-of-the-art results across three classification tasks: (1) deceased vs. discharged, (2) deceased vs. recovered, and (3) recovered vs. ameliorated. We trained five ML models to capture complex distributions while preserving clinical interpretability. Experiments explored the trade-off between feature richness and patient coverage, using subsets of the 10–50 most frequent laboratory tests. Model performance was evaluated using accuracy and area under the curve (AUC), and explainability was assessed using SHapley Additive exPlanations (SHAP). The highest performance was obtained for the deceased vs. recovered case study (AUC=0.983, accuracy=0.93). SHAP analysis identified several strong predictors such as cardiovascular comorbidities, urea levels, aspartate aminotransferase, platelet count, and eosinophil percentage. Eosinopenia emerged as a top predictor, highlighting its value as an underutilized marker that is not included in current assessment standards, while the high performance suggests the applicability of these models in clinical settings.
[765] MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
Seoungsub Lee, In Seo Kim, Seon Wook Kim
Main category: cs.LG
TL;DR: MUXQ is a quantization method that addresses activation outliers in LLMs by redistributing outlier magnitudes across channels, enabling stable INT8 quantization for efficient edge deployment.
Details
Motivation: LLMs have huge parameter counts causing memory/computational overhead, especially problematic for NPU-based edge devices where FP16/FP32 is inefficient and integer quantization is essential. Existing methods don't fully handle input-activation outliers and hardware inefficiencies.
Method: MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT levels while preserving a hardware-friendly computation structure.
Result: Experiments on GPT-2 models (0.1B, 0.3B, 0.7B parameters) using WikiText-2 show MUXQ consistently achieves lower perplexity than naive quantization. Under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to FP16.
Conclusion: MUXQ enables stable low-precision inference with modest computational overhead, can be combined with other quantization techniques, and provides a promising direction for efficient and accurate LLM inference on edge devices.
Abstract: Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, thereby alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT levels while preserving a hardware-friendly computation structure. Experiments on GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) using the WikiText-2 dataset show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to that of FP16. With only modest computational overhead, MUXQ enables stable low-precision inference and can be readily combined with other quantization techniques. These results suggest that MUXQ provides a promising direction for efficient and accurate LLM inference on edge devices.
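The abstract does not specify MUXQ's auxiliary-matrix construction, so the sketch below uses a generic outlier-channel decomposition (in the spirit of LLM.int8()-style splitting, not MUXQ's actual scheme) to illustrate why isolating outlier channels rescues per-tensor INT8:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantize/dequantize."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def outlier_decomposed_quantize(x, k=2):
    """Pull the k largest-magnitude channels into a small full-precision
    side matrix, INT8-quantize the remainder, and add the parts back.
    Without this split, one outlier channel inflates the per-tensor scale
    and drowns the other channels in quantization noise."""
    chan_mag = np.abs(x).max(axis=0)
    out_idx = np.argsort(chan_mag)[-k:]
    x_outliers = np.zeros_like(x)
    x_outliers[:, out_idx] = x[:, out_idx]
    x_rest = x - x_outliers
    return quantize_int8(x_rest) + x_outliers
```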
[766] The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead
Umberto Michelucci, Francesca Venturini
Main category: cs.LG
TL;DR: ML models achieve high accuracy in spectroscopic classification due to high-dimensional data properties rather than chemical meaning, with infinitesimal distributional differences becoming perfectly separable in high-dimensional spaces.
Details
Motivation: To explain why ML models achieve strikingly high accuracies in spectroscopic classification tasks without clear evidence they use chemically meaningful features, and to provide a unifying explanation for phenomena linked to data preprocessing, noise sensitivity, and model complexity.
Method: Theoretical analysis using the Feldman-Hajek theorem and concentration of measure to show infinitesimal distributional differences become perfectly separable in high-dimensional spaces, plus experimental validation on synthetic and real fluorescence spectra.
Result: Models can achieve near-perfect accuracy even when chemical distinctions are absent, and feature-importance maps may highlight spectrally irrelevant regions due to high-dimensional data properties rather than meaningful chemical features.
Conclusion: Provides rigorous theoretical framework explaining ML performance in spectroscopy, confirms effect experimentally, and offers practical recommendations for building and interpreting ML models in spectroscopic applications.
Abstract: Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman-Hajek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.
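The core effect is easy to demonstrate numerically. Two "spectra" classes below differ only by a tiny per-channel offset δ; a matched-filter classifier (an assumed stand-in for any trained model) approaches perfect accuracy as the channel count d grows, because the effective separation scales like δ√d even though each channel is individually useless:

```python
import numpy as np

def separability_accuracy(d, delta=0.05, n=200, seed=0):
    """Accuracy of a matched filter separating two classes whose means
    differ by delta in every one of d channels (unit noise per channel)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n, d))      # class 0 "spectra"
    b = rng.normal(delta, 1.0, size=(n, d))    # class 1, shifted by delta
    w = np.ones(d)                              # filter along the shift
    scores = np.concatenate([a @ w, b @ w]) / np.sqrt(d)
    labels = np.concatenate([np.zeros(n), np.ones(n)])
    preds = (scores > delta * np.sqrt(d) / 2).astype(float)
    return float((preds == labels).mean())
```

At d = 10 the classes are nearly indistinguishable; at d = 10,000 (a realistic number of spectral channels) they are almost perfectly separable, with no chemistry involved.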
[767] Darkness Visible: Reading the Exception Handler of a Language Model
Peter Balogh
Main category: cs.LG
TL;DR: Analysis of GPT-2 Small’s final MLP reveals a structured routing program with 27 named neurons organized into a three-tier exception handler, while knowledge remains distributed across residual neurons.
Details
Motivation: To understand the internal mechanisms of transformer MLPs, specifically how they route information rather than store knowledge, and to analyze the structured organization of neurons in GPT-2's final layer.
Method: Decomposed all 3,072 neurons in GPT-2 Small’s final MLP into functional categories (Core, Differentiators, Specialists, Consensus), conducted statistical analysis of the consensus-exception crossover, and performed three experiments including analysis of “knowledge neurons” and garden-path experiments.
Result: Found that MLP neurons function as routing infrastructure rather than fact storage, with a sharp statistical crossover between helpful and harmful intervention. Knowledge neurons at layer 11 serve as routing infrastructure, and GPT-2 shows reversed garden-path effects using verb subcategorization immediately.
Conclusion: The final MLP of GPT-2 implements a structured routing program that operates at token-level predictability rather than syntactic structure, with this architecture crystallizing only at the terminal layer in transformer models.
Abstract: The final MLP of GPT-2 Small exhibits a fully legible routing program – 27 named neurons organized into a three-tier exception handler – while the knowledge it routes remains entangled across ~3,040 residual neurons. We decompose all 3,072 neurons (to numerical precision) into: 5 fused Core neurons that reset vocabulary toward function words, 10 Differentiators that suppress wrong candidates, 5 Specialists that detect structural boundaries, and 7 Consensus neurons that each monitor a distinct linguistic dimension. The consensus-exception crossover – where MLP intervention shifts from helpful to harmful – is statistically sharp (bootstrap 95% CIs exclude zero at all consensus levels; crossover between 4/7 and 5/7). Three experiments show that “knowledge neurons” (Dai et al., 2022), at L11 of this model, function as routing infrastructure rather than fact storage: the MLP amplifies or suppresses signals already present in the residual stream from attention, scaling with contextual constraint. A garden-path experiment reveals a reversed garden-path effect – GPT-2 uses verb subcategorization immediately, consistent with the exception handler operating at token-level predictability rather than syntactic structure. This architecture crystallizes only at the terminal layer – in deeper models, we predict equivalent structure at the final layer, not at layer 11. Code and data: https://github.com/pbalogh/transparent-gpt2
[768] Sampling Parallelism for Fast and Efficient Bayesian Learning
Asena Karolin Özdemir, Lars H. Heyen, Arvid Weyrauch, Achim Streit, Markus Götz, Charlotte Debus
Main category: cs.LG
TL;DR: A parallelization strategy called “sampling parallelism” that distributes Bayesian neural network sample evaluations across multiple GPUs to reduce memory pressure and training time for uncertainty quantification.
Details
Motivation: Uncertainty quantification (UQ) methods like Bayesian neural networks are computationally expensive due to multiple parameter sampling, limiting their accessibility and exploration in risk-sensitive domains like healthcare and finance.
Method: Introduces sampling parallelism that distributes sample evaluations across multiple GPUs, targeting the bottleneck of sampling-based Bayesian learning. The method doesn’t require architectural changes or extensive hyperparameter tuning, and can be combined with data parallelism in a hybrid approach.
Result: Shows near-perfect scaling when sample number is scaled proportionally to computational resources, confirming clean parallelization of sample evaluations. While DDP achieves better raw speedups, sampling parallelism increases augmentation diversity by applying independent stochastic augmentations to the same batch on each GPU, reducing convergence epochs.
Conclusion: Sampling parallelism is an effective parallelization strategy that addresses computational bottlenecks in Bayesian learning, enabling more accessible uncertainty quantification while being complementary to existing parallelization methods.
Abstract: Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply due to their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have limited the accessibility and exploration of Bayesian techniques thus far. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on a few example tasks and architectures, comparing against distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the sample number is scaled proportionally to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups under scaling with constant workload, sampling parallelism has a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thus reduces the number of epochs required for convergence.
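The partitioning scheme can be sketched without any deep-learning framework. In the paper the workers are GPUs and the model is a BNN; here a thread pool and a linear model stand in (both assumptions for illustration), and the point is only that the S posterior samples shard cleanly and independently:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def bayesian_predict(x, weight_samples, n_workers=4):
    """Sampling-parallelism sketch: partition the posterior weight samples
    across workers, let each worker evaluate its shard's predictions, then
    pool the shards into a predictive mean and standard deviation."""
    shards = np.array_split(weight_samples, n_workers)

    def eval_shard(ws):
        return np.stack([x @ w for w in ws])   # (shard_size, n_points)

    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        parts = list(ex.map(eval_shard, shards))  # order-preserving
    preds = np.concatenate(parts)                 # (S, n_points)
    return preds.mean(axis=0), preds.std(axis=0)
```

Because shards are independent, the parallel result matches a sequential evaluation exactly, which is why the paper observes near-perfect scaling when the sample count grows with the resources.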
[769] Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
Justin Chih-Yao Chen, Archiki Prasad, Zaid Khan, Joykirat Singh, Runchu Tian, Elias Stengel-Eskin, Mohit Bansal
Main category: cs.LG
TL;DR: Cog-DRIFT: A curriculum learning framework that reformulates hard open-ended problems into easier variants (multiple-choice, cloze) to bootstrap LLM learning, overcoming exploration barriers in RL post-training.
Details
Motivation: Current RLVR methods fail when problems are too difficult for the model's current policy, yielding no meaningful reward signal. There's a need to enable learning from unsolvable problems.
Method: Task reformulation transforms challenging open-ended problems into cognitively simpler variants (multiple-choice, cloze) that preserve the original answers. Cog-DRIFT organizes these into an adaptive curriculum from easier to harder formats.
Result: Significant improvements on originally unsolvable hard problems (+10.11% Qwen, +8.64% Llama), outperforms standard GRPO and baselines across 6 reasoning benchmarks. Improves pass@k and sample efficiency.
Conclusion: Task reformulation and curriculum learning effectively overcome exploration barriers in LLM post-training, enabling learning from previously unsolvable problems.
Abstract: Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, as these yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants – such as multiple-choice and cloze formats – that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT shows +4.72% (Qwen) and +3.23% (Llama) improvements over the second-best baseline. We further show that Cog-DRIFT improves pass@k at test time, and the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
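The reformulation step itself is mechanical and easy to sketch. The templates and the fixed easy-to-hard ordering below are illustrative assumptions; the paper's curriculum is adaptive, driven by measured difficulty rather than a static stage order:

```python
import random

def reformulate(question, answer, distractors):
    """Turn one open-ended problem into answer-preserving variants,
    ordered from most structured (easiest) to open-ended (hardest)."""
    options = distractors + [answer]
    random.Random(0).shuffle(options)           # deterministic shuffle
    letters = "ABCD"
    mc = (question + "\n" +
          "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options)))
    cloze = f"{question}\nThe answer is ____ (fill in the blank)."
    return [("multiple_choice", mc), ("cloze", cloze), ("open_ended", question)]

def curriculum(problems):
    """Static curriculum sketch: all problems' easiest variants first,
    then progressively harder formats."""
    stages = zip(*[reformulate(*p) for p in problems])
    return [variant for stage in stages for variant in stage]
```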
[770] Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation
Houzhe Wang, Xiaojie Zhu, Chi Chen
Main category: cs.LG
TL;DR: First complete federated unlearning pipeline: a knowledge-distillation-based unlearning approach paired with a GAN-based visualization framework (Skyeye) for assessing forgetting capacity.
Details
Motivation: Address data privacy and security concerns in federated learning by enabling models to forget specific deleted data without retaining or leaking information, while providing evaluation tools.
Method: Proposes federated unlearning approach using knowledge distillation with optimization mechanisms, and Skyeye evaluation framework that integrates unlearning model as classifier in GAN to visualize forgetting capacity through sample generation.
Result: Comprehensive experiments demonstrate effectiveness of both the federated unlearning approach (high efficiency and accuracy without historical data storage) and the Skyeye evaluation framework.
Conclusion: First complete pipeline for federated unlearning successfully addresses privacy concerns with efficient forgetting mechanisms and provides visualization-based evaluation framework.
Abstract: With the increasing importance of data privacy and security, federated unlearning has emerged as a novel research field dedicated to ensuring that federated learning models no longer retain or leak relevant information once specific data has been deleted. In this paper, to the best of our knowledge, we propose the first complete pipeline for federated unlearning, which includes a federated unlearning approach and an evaluation framework. Our proposed federated unlearning approach ensures high efficiency and model accuracy without the need to store historical data. It effectively leverages the knowledge distillation model alongside various optimization mechanisms. Moreover, we propose a framework named Skyeye to visualize the forgetting capacity of federated unlearning models. It utilizes the federated unlearning model as the classifier integrated into a Generative Adversarial Network (GAN). Afterward, both the classifier and discriminator guide the generator in generating samples. Throughout this process, the generator learns from the classifier’s knowledge. The generator then visualizes this knowledge through sample generation. Finally, the model’s forgetting capability is evaluated based on the relevance between the deleted data and the generated samples. Comprehensive experiments are conducted to illustrate the effectiveness of the proposed federated unlearning approach and the corresponding evaluation framework.
[771] Selecting Decision-Relevant Concepts in Reinforcement Learning
Naveen Raman, Stephanie Milani, Fei Fang
Main category: cs.LG
TL;DR: DRS algorithm automatically selects decision-relevant concepts for interpretable policies using state abstraction principles, with performance guarantees.
Details
Motivation: Manual concept selection for interpretable policies is time-consuming, requires domain expertise, scales poorly, and lacks performance guarantees. Need automated principled approach.
Method: Views concept selection through state abstraction lens: concept is decision-relevant if removing it confuses states requiring different actions. Proposes Decision-Relevant Selection (DRS) algorithm that selects subset of concepts preserving optimal decision structure.
Result: DRS automatically recovers manually curated concept sets while matching/exceeding performance, improves test-time concept interventions across RL benchmarks and real-world healthcare environments.
Conclusion: First principled automatic concept selection algorithm for sequential decision-making with performance bounds, enabling scalable interpretable policies without manual curation.
Abstract: Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. As a result, agents should rely on decision-relevant concepts; states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.
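The state-abstraction criterion behind DRS can be made concrete with a toy example. The sketch below (illustrative only, not the paper's algorithm, which adds performance bounds) treats a concept subset as decision-relevant if any two states that agree on the kept concepts share an optimal action, and searches subsets smallest-first:

```python
from itertools import combinations

def preserves_decisions(states, actions, keep):
    """A concept subset `keep` preserves the decision structure if any two
    states that agree on the kept concepts also share an optimal action."""
    seen = {}
    for s, a in zip(states, actions):
        key = tuple(s[i] for i in keep)
        if key in seen and seen[key] != a:
            return False
        seen[key] = a
    return True

def smallest_relevant_subset(states, actions):
    """Brute-force smallest-first search (a stand-in for DRS itself)."""
    n = len(states[0])
    for k in range(n + 1):
        for keep in combinations(range(n), k):
            if preserves_decisions(states, actions, keep):
                return keep
    return tuple(range(n))

# Concept 0 drives the action; concept 1 is decision-irrelevant.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
actions = ["left", "left", "right", "right"]
keep = smallest_relevant_subset(states, actions)
```

Here the search correctly keeps only concept 0: dropping it would confuse states whose optimal actions differ, which is exactly the paper's notion of decision relevance.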
[772] The Role of Generator Access in Autoregressive Post-Training
Amit Kiran Rege
Main category: cs.LG
TL;DR: The paper studies how different generator access modes affect autoregressive post-training, showing that weak prefix control enables more efficient learning than root-start rollouts.
Details
Motivation: The paper aims to understand how the interface to a generator (what information it provides) constrains the efficiency of autoregressive post-training. It investigates whether learners are limited to fresh rollouts from the root or can revisit previously built prefixes, which affects what information can be gathered during training.
Method: The paper analyzes different generator access regimes: root-start rollouts (only fresh trajectories) vs. weak prefix control (ability to return to previously built prefixes). It examines various observation types including output sampling, token log probabilities, top-k reports, and full next-token distributions. The analysis focuses on KL-regularized outcome-reward post-training.
Result: The study reveals that in the root-start regime, all observation types reduce to one canonical experiment limited by on-policy probability of reaching informative prefixes. Weak prefix control breaks this barrier, enabling richer observations like conditional sampling or logits to outperform top-1 access. Changing only the generator interface creates an exponential gap for KL-regularized outcome-reward post-training.
Conclusion: Generator access mode significantly impacts post-training efficiency. Weak prefix control enables more powerful learning than root-start rollouts, and richer observations become beneficial once prefix control is available. The generator interface alone can create exponential gaps in learning efficiency.
Abstract: We study how generator access constrains autoregressive post-training. The central question is whether the learner is confined to fresh root-start rollouts or can return to previously built prefixes and query the next-token rule there. In the root-start regime, output sampling, generated-token log probabilities, top-$k$ reports, and full next-token distributions along sampled trajectories all reduce to one canonical experiment, limited by the on-policy probability of reaching informative prefixes. Weak prefix control breaks this barrier, and once control is available, richer observations such as conditional sampling or logits can outperform top-$1$ access. Changing only the generator interface creates an exponential gap for KL-regularized outcome-reward post-training.
[773] FairLogue: A Toolkit for Intersectional Fairness Analysis in Clinical Machine Learning Models
Nick Souligne, Vignesh Subbian
Main category: cs.LG
TL;DR: Fairlogue is a Python toolkit for intersectional fairness assessment in clinical ML, extending fairness metrics to multiple demographic axes and incorporating counterfactual analysis.
Details
Motivation: Existing fairness tools focus on single-axis demographic comparisons, missing compounded disparities affecting intersectional populations in healthcare ML applications.
Method: Three-component toolkit: 1) observational framework extending fairness metrics to intersectional groups, 2) counterfactual framework for treatment-based contexts, 3) generalized counterfactual framework for interventions on intersectional group membership. Evaluated on EHR data for glaucoma surgery prediction.
Result: Intersectional analysis revealed larger fairness gaps than single-axis analyses. Counterfactual analysis suggested observed disparities were consistent with chance after conditioning on covariates.
Conclusion: Fairlogue provides modular tools for quantifying and evaluating intersectional bias in clinical ML workflows, integrating both observational and counterfactual methods.
Abstract: Objective: Algorithmic fairness is essential for equitable and trustworthy machine learning in healthcare. Most fairness tools emphasize single-axis demographic comparisons and may miss compounded disparities affecting intersectional populations. This study introduces Fairlogue, a toolkit designed to operationalize intersectional fairness assessment in observational and counterfactual contexts within clinical settings. Methods: Fairlogue is a Python-based toolkit composed of three components: 1) an observational framework extending demographic parity, equalized odds, and equal opportunity difference to intersectional populations; 2) a counterfactual framework evaluating fairness under treatment-based contexts; and 3) a generalized counterfactual framework assessing fairness under interventions on intersectional group membership. The toolkit was evaluated using electronic health record data from the All of Us Controlled Tier V8 dataset in a glaucoma surgery prediction task using logistic regression with race and gender as protected attributes. Results: Observational analysis identified substantial intersectional disparities despite moderate model performance (AUROC = 0.709; accuracy = 0.651). Intersectional evaluation revealed larger fairness gaps than single-axis analyses, including demographic parity differences of 0.20 and equalized odds true positive and false positive rate gaps of 0.33 and 0.15, respectively. Counterfactual analysis using permutation-based null distributions produced unfairness (“u-value”) estimates near zero, suggesting observed disparities were consistent with chance after conditioning on covariates. Conclusion: Fairlogue provides a modular toolkit integrating observational and counterfactual methods for quantifying and evaluating intersectional bias in clinical machine learning workflows.
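The intersectional extension of demographic parity that the toolkit reports can be illustrated in a few lines. This is a minimal sketch, not Fairlogue's API: groups are taken as (race, gender) tuples rather than single attributes, and the gap is the spread of positive-prediction rates across those joint groups:

```python
def demographic_parity_gap(y_pred, groups):
    """Max difference in positive-prediction rate across intersectional
    groups. `groups` holds one (race, gender) tuple per prediction."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = sum(y_pred[i] for i in idx) / len(idx)
    return max(rates.values()) - min(rates.values()), rates

# Toy data: group labels are hypothetical placeholders.
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = [("A", "F"), ("A", "F"), ("A", "M"), ("A", "M"),
          ("B", "F"), ("B", "F"), ("B", "M"), ("B", "M")]
gap, rates = demographic_parity_gap(y_pred, groups)
```

Note how the joint-group gap (1.0 here, between ("A","M") and ("B","F")) can exceed any single-axis gap, which is the compounded-disparity effect the paper measures.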
[774] Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN’s Attention Mechanisms
James Hu, Mahdi Ghelichi
Main category: cs.LG
TL;DR: TabPFN, a tabular foundation model, demonstrates strong robustness to common data quality issues like irrelevant features, correlated features, and label noise through in-context learning without retraining.
Details
Motivation: Industrial domains like finance and healthcare need tabular prediction models that can handle data quality issues without costly retraining for each new dataset.
Method: Systematic evaluation of TabPFN’s robustness using controlled synthetic perturbations: injecting random/uncorrelated features, introducing nonlinearly correlated features, varying dataset size, and adding label noise. Analysis includes attention mechanisms, attention concentration, and attention-based feature ranking metrics.
Result: TabPFN shows remarkable resilience - maintains high ROC-AUC, structured attention patterns, and effective feature ranking across all tested data imperfections. Attention mechanisms increasingly concentrate on useful features while separating signals from noise.
Conclusion: TabPFN is a robust tabular foundation model capable of maintaining both predictive performance and coherent internal behavior under various data quality issues, making it suitable for industrial applications.
Abstract: Tabular foundation models (TFMs) such as TabPFN (Tabular Prior-Data Fitted Network) are designed to generalize across heterogeneous tabular datasets through in-context learning (ICL). They perform prediction in a single forward pass conditioned on labeled examples without dataset-specific parameter updates. This paradigm is particularly attractive in industrial domains (e.g., finance and healthcare) where tabular prediction is pervasive. Retraining a bespoke model for each new table can be costly or infeasible in these settings, while data quality issues such as irrelevant predictors, correlated feature groups, and label noise are common. In this paper, we provide strong empirical evidence that TabPFN is highly robust under these sub-optimal conditions. We study TabPFN and its attention mechanisms for binary classification problems with controlled synthetic perturbations that vary: (i) dataset width by injecting random uncorrelated features and by introducing nonlinearly correlated features, (ii) dataset size by increasing the number of training rows, and (iii) label quality by increasing the fraction of mislabeled targets. Beyond predictive performance, we analyze internal signals including attention concentration and attention-based feature ranking metrics. Across these parametric tests, TabPFN is remarkably resilient: ROC-AUC remains high, attention stays structured and sharp, and informative features are highly ranked by attention-based metrics. Qualitative visualizations with attention heatmaps, feature-token embeddings, and SHAP plots further support a consistent pattern across layers in which TabPFN increasingly concentrates on useful features while separating their signals from noise. Together, these findings suggest that TabPFN is a robust TFM capable of maintaining both predictive performance and coherent internal behavior under various scenarios of data imperfections.
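Two of the controlled perturbations the study applies (irrelevant-feature injection and label noise) are easy to reproduce. A minimal sketch, assuming a generic binary-classification table rather than the paper's exact generator:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_features(X, n_noise, rng):
    """Append uncorrelated standard-normal columns (irrelevant predictors)."""
    noise = rng.standard_normal((X.shape[0], n_noise))
    return np.hstack([X, noise])

def flip_labels(y, frac, rng):
    """Flip a fraction of binary labels to simulate label noise."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

X = rng.standard_normal((100, 5))
y = rng.integers(0, 2, size=100)
X_wide = add_noise_features(X, 20, rng)   # widen the table with noise columns
y_noisy = flip_labels(y, 0.1, rng)        # mislabel 10% of targets
```

Feeding such perturbed tables to a fixed in-context learner and tracking ROC-AUC and attention concentration is the core of the paper's robustness protocol.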
[775] Optimizing LLM Prompt Engineering with DSPy Based Declarative Learning
Shiek Ruksana, Sailesh Kiran Kurra, Thipparthi Sanjay Baradwaj
Main category: cs.LG
TL;DR: DSPy is a declarative framework for automated, modular prompt optimization in LLMs that improves factual accuracy and reduces hallucinations through symbolic planning and gradient-free optimization.
Details
Motivation: Current prompt engineering relies on heuristic trial-and-error, limiting scalability, reproducibility, and generalization across tasks. There's a need for systematic, automated approaches to prompt optimization.
Method: Introduces a unified DSPy LLM architecture combining symbolic planning, gradient-free optimization, and automated module rewriting for prompt synthesis, correction, calibration, and adaptive reasoning control.
Result: Experimental evaluations show improvements of 30-45% in factual accuracy and ~25% reduction in hallucination rates across reasoning tasks, retrieval-augmented generation, and chain-of-thought benchmarks.
Conclusion: DSPy provides an effective declarative framework for automated prompt optimization that enhances output reliability, efficiency, and generalization, though limitations exist and future research directions are outlined.
Abstract: Large Language Models (LLMs) have shown strong performance across a wide range of natural language processing tasks; however, their effectiveness is highly dependent on prompt design, structure, and embedded reasoning signals. Conventional prompt engineering methods largely rely on heuristic trial-and-error processes, which limits scalability, reproducibility, and generalization across tasks. DSPy, a declarative framework for optimizing text-processing pipelines, offers an alternative approach by enabling automated, modular, and learnable prompt construction for LLM-based systems. This paper presents a systematic study of DSPy-based declarative learning for prompt optimization, with emphasis on prompt synthesis, correction, calibration, and adaptive reasoning control. We introduce a unified DSPy LLM architecture that combines symbolic planning, gradient-free optimization, and automated module rewriting to reduce hallucinations, improve factual grounding, and avoid unnecessary prompt complexity. Experimental evaluations conducted on reasoning tasks, retrieval-augmented generation, and multi-step chain-of-thought benchmarks demonstrate consistent gains in output reliability, efficiency, and generalization across models. The results show improvements of up to 30 to 45% in factual accuracy and a reduction of approximately 25% in hallucination rates. Finally, we outline key limitations and discuss future research directions for declarative prompt optimization frameworks.
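The gradient-free core of declarative prompt optimization is to treat prompts as search candidates scored against labeled examples rather than hand-tuned artifacts. A toy sketch of that loop (plain Python, not DSPy's actual API; the model and templates are stand-ins):

```python
def optimize_prompt(templates, examples, model):
    """Gradient-free prompt search sketch: score each candidate template
    on labeled examples and keep the best performer."""
    def score(t):
        return sum(model(t.format(q=q)) == a for q, a in examples) / len(examples)
    return max(templates, key=score)

# Toy stand-in "model": answers correctly only when asked step by step.
def toy_model(prompt):
    return "4" if "step by step" in prompt else "?"

templates = ["Answer: {q}", "Think step by step, then answer: {q}"]
examples = [("2+2", "4")]
best = optimize_prompt(templates, examples, toy_model)
```

DSPy layers module composition, automated rewriting, and richer optimizers on top of this basic score-and-select idea.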
[776] Data Attribution in Adaptive Learning
Amit Kiran Rege
Main category: cs.LG
TL;DR: Formalizes occurrence-level attribution for adaptive learning systems where models generate their own training data, addressing limitations of standard attribution methods in dynamic settings.
Details
Motivation: Standard attribution methods are designed for static datasets and fail to account for feedback loops in adaptive learning systems where models generate their own training data, such as in online bandits, reinforcement learning, and language model post-training pipelines.
Method: Formalizes occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, analyzes when replay-side information can recover the target, and identifies structural conditions under which the target is identifiable from logged data.
Result: Proves that replay-side information cannot generally recover the attribution target, but identifies a structural class where the target is identifiable from logged data, providing theoretical foundations for attribution in adaptive learning systems.
Conclusion: Establishes formal framework for attribution in adaptive learning, highlighting limitations of existing methods and providing conditions under which proper attribution can be achieved in systems where models generate their own training data.
Abstract: Machine learning models increasingly generate their own training data – online bandits, reinforcement learning, and post-training pipelines for language models are leading examples. In these adaptive settings, a single training observation both updates the learner and shifts the distribution of future data the learner will collect. Standard attribution methods, designed for static datasets, ignore this feedback. We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.
[777] Are Latent Reasoning Models Easily Interpretable?
Connor Dilgren, Sarah Wiegreffe
Main category: cs.LG
TL;DR: LRMs often don’t use latent reasoning tokens for predictions, and when they do, the reasoning is largely interpretable and can be decoded to natural language traces.
Details
Motivation: To investigate the interpretability of latent reasoning models (LRMs) which are difficult to monitor due to their non-natural language reasoning process, despite their benefits of low inference cost and parallel reasoning exploration.
Method: Examine two state-of-the-art LRMs on logical reasoning datasets, analyze necessity of latent reasoning tokens, decode gold reasoning traces when tokens are necessary, and develop method to decode verified natural language reasoning traces without prior knowledge of gold traces.
Result: LRMs often don’t use latent reasoning tokens for predictions; when tokens are necessary, gold reasoning traces can be decoded 65-93% of the time; verified natural language traces can be found for majority of correct predictions but only minority of incorrect predictions.
Conclusion: Current LRMs largely encode interpretable processes, and interpretability itself can serve as a signal of prediction correctness, challenging assumptions about LRMs’ opaque reasoning.
Abstract: Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper presents an investigation into LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs’ predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods and raises doubts about the stated role of these tokens in prior work. Second, we demonstrate that when latent reasoning tokens are necessary for performance, we can decode gold reasoning traces up to 65-93% of the time for correctly predicted instances. This suggests LRMs often implement the expected solution rather than an uninterpretable reasoning process. Finally, we present a method to decode a verified natural language reasoning trace from latent tokens without knowing a gold reasoning trace a priori, demonstrating that it is possible to find a verified trace for a majority of correct predictions but only a minority of incorrect predictions. Our findings highlight that current LRMs largely encode interpretable processes, and interpretability itself can be a signal of prediction correctness.
[778] HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection
Vadim Vashkelis, Natalia Trukhina
Main category: cs.LG
TL;DR: HI-MoE is a DETR-style object detection architecture using hierarchical mixture-of-experts routing with scene-level and instance-level routing to enable sparse computation while matching the instance-centric nature of detection tasks.
Details
Motivation: Current vision MoE methods operate at image or patch level, which doesn't align well with object detection where reasoning happens at the instance/object query level. There's a need for MoE architectures that better match the heterogeneous, instance-centric structure of detection tasks.
Method: HI-MoE uses two-stage hierarchical routing: 1) lightweight scene router selects scene-consistent expert subset, 2) instance router assigns each object query to a small number of experts within that subset. This preserves sparse computation while aligning with detection’s instance-centric nature.
Result: HI-MoE improves over dense DINO baseline and simpler token-level or instance-only routing variants on COCO dataset, with especially strong gains on small objects. Preliminary specialization analysis on LVIS shows initial visualization of expert specialization patterns.
Conclusion: Hierarchical instance-conditioned MoE routing is effective for object detection, better matching the task’s instance-centric structure while maintaining sparse computation benefits. The approach shows promise but requires further experimental validation.
Abstract: Mixture-of-Experts (MoE) architectures enable conditional computation by activating only a subset of model parameters for each input. Although sparse routing has been highly effective in language models and has also shown promise in vision, most vision MoE methods operate at the image or patch level. This granularity is poorly aligned with object detection, where the fundamental unit of reasoning is an object query corresponding to a candidate instance. We propose Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE), a DETR-style detection architecture that performs routing in two stages: a lightweight scene router first selects a scene-consistent expert subset, and an instance router then assigns each object query to a small number of experts within that subset. This design aims to preserve sparse computation while better matching the heterogeneous, instance-centric structure of detection. In the current draft, experiments are concentrated on COCO with preliminary specialization analysis on LVIS. Under these settings, HI-MoE improves over a dense DINO baseline and over simpler token-level or instance-only routing variants, with especially strong gains on small objects. We also provide an initial visualization of expert specialization patterns. We present the method, ablations, and current limitations in a form intended to support further experimental validation.
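The two-stage routing can be sketched with plain top-k selection over logits. This is a minimal illustration of the hierarchy, not the paper's implementation (which is learned end-to-end with load-balancing inside a DETR-style decoder):

```python
import numpy as np

def hierarchical_route(scene_logits, query_logits, k_scene=4, k_inst=2):
    """Two-stage routing sketch: a scene router keeps the top-k_scene experts,
    then each object query picks its top-k_inst experts within that subset."""
    scene_subset = np.argsort(scene_logits)[::-1][:k_scene]    # scene-level subset
    sub_logits = query_logits[:, scene_subset]                 # restrict per-query logits
    top = np.argsort(sub_logits, axis=1)[:, ::-1][:, :k_inst]  # per-query top experts
    return scene_subset, scene_subset[top]                     # map back to expert ids

rng = np.random.default_rng(0)
scene_logits = rng.standard_normal(8)       # 8 experts total
query_logits = rng.standard_normal((3, 8))  # 3 object queries
subset, assignment = hierarchical_route(scene_logits, query_logits)
```

The key property is that every per-query assignment stays inside the scene-consistent subset, so computation remains sparse while routing varies per instance.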
[779] Empowering Power Outage Prediction with Spatially Aware Hybrid Graph Neural Networks and Contrastive Learning
Xuyang Shen, Zijie Pan, Diego Cerrai, Xinxuan Zhang, Christopher Colorio, Emmanouil N. Anagnostou, Dongjin Song
Main category: cs.LG
TL;DR: SA-HGNN uses spatially aware hybrid graph neural networks with contrastive learning to predict power outages from extreme weather events by encoding spatial relationships of static and dynamic features.
Details
Motivation: Extreme weather events cause widespread power outages with severe economic and social impacts. Existing outage prediction models lack spatial awareness of weather effects, limiting their accuracy for pre-emptive forecasting.
Method: Develop Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) with contrastive learning: 1) Encode spatial relationships of static features (land cover, infrastructure) and dynamic features (wind speed, precipitation) via SA-HGNN, 2) Use contrastive learning to handle data imbalance across weather event types and generate location-specific embeddings by minimizing intra-event distances between similar locations while maximizing inter-event distances.
Result: SA-HGNN achieves state-of-the-art performance for power outage prediction across four utility service territories (Connecticut, Western Massachusetts, Eastern Massachusetts, and New Hampshire) in empirical studies.
Conclusion: The proposed SA-HGNN with contrastive learning effectively incorporates spatial awareness into outage prediction, improving forecasting accuracy for extreme weather-induced power outages across multiple regions.
Abstract: Extreme weather events, such as severe storms, hurricanes, snowstorms, and ice storms, which are exacerbated by climate change, frequently cause widespread power outages. These outages halt industrial operations, impact communities, damage critical infrastructure, profoundly disrupt economies, and have far-reaching effects across various sectors. To mitigate these effects, the University of Connecticut and Eversource Energy Center have developed an outage prediction modeling (OPM) system to provide pre-emptive forecasts for electric distribution networks before such weather events occur. However, existing predictive models in the system do not incorporate the spatial effect of extreme weather events. To this end, we develop Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) with contrastive learning to enhance the OPM predictions for extreme weather-induced power outages. Specifically, we first encode spatial relationships of both static features (e.g., land cover, infrastructure) and event-specific dynamic features (e.g., wind speed, precipitation) via Spatially Aware Hybrid Graph Neural Networks (SA-HGNN). Next, we leverage contrastive learning to handle the imbalance problem associated with different types of extreme weather events and generate location-specific embeddings by minimizing intra-event distances between similar locations while maximizing inter-event distances across all locations. Thorough empirical studies in four utility service territories, i.e., Connecticut, Western Massachusetts, Eastern Massachusetts, and New Hampshire, demonstrate that SA-HGNN can achieve state-of-the-art performance for power outage prediction.
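The intra-/inter-event objective can be written as a standard contrastive pair loss. A minimal sketch, assuming a squared-distance pull term and a margin-hinge push term (the paper's exact loss may differ):

```python
import numpy as np

def contrastive_event_loss(emb, event_ids, margin=1.0):
    """Pull embeddings of locations in the same weather event together,
    push embeddings from different events at least `margin` apart."""
    loss, n = 0.0, 0
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            d = np.linalg.norm(emb[i] - emb[j])
            if event_ids[i] == event_ids[j]:
                loss += d ** 2                     # intra-event: minimize distance
            else:
                loss += max(0.0, margin - d) ** 2  # inter-event: hinge on margin
            n += 1
    return loss / n

# Two tight clusters, one per event, already well separated -> small loss.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.1, 0.0]])
events = [0, 0, 1, 1]
loss = contrastive_event_loss(emb, events)
```

Relabeling the same points so that each cluster mixes events makes the loss jump, which is the signal that drives event-aware, location-specific embeddings.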
[780] Stratifying Reinforcement Learning with Signal Temporal Logic
Justin Curry, Alberto Speranzon
Main category: cs.LG
TL;DR: A stratification-based semantics for Signal Temporal Logic (STL) that interprets atomic predicates as membership tests in stratified spaces, revealing connections between stratification theory and STL, with applications to analyzing DRL embedding spaces.
Details
Motivation: The paper aims to develop a theoretical framework connecting Signal Temporal Logic (STL) with stratification theory to better understand the structure of embedding spaces generated by deep reinforcement learning agents, particularly in relation to the geometry of decision spaces.
Method: Develops a stratification-based semantics for STL where atomic predicates are interpreted as membership tests in stratified spaces. Applies the theory to Minigrid games and uses numerical techniques on latent embeddings of DRL agents, with STL robustness as reward. Proposes computationally efficient signatures for uncovering stratification structure.
Result: Establishes a novel correspondence principle between stratification theory and STL, showing most STL formulas induce a stratification of space-time. Provides a framework for analyzing DRL embedding spaces and proposes promising computational signatures for uncovering stratification structure.
Conclusion: The stratification-based semantics offers a fresh theoretical perspective on STL and DRL embedding spaces, enabling reuse of existing high-dimensional analysis tools and motivating new computational techniques for understanding the geometric structure of decision spaces.
Abstract: In this paper, we develop a stratification-based semantics for Signal Temporal Logic (STL) in which each atomic predicate is interpreted as a membership test in a stratified space. This perspective reveals a novel correspondence principle between stratification theory and STL, showing that most STL formulas can be viewed as inducing a stratification of space-time. The significance of this interpretation is twofold. First, it offers a fresh theoretical framework for analyzing the structure of the embedding space generated by deep reinforcement learning (DRL) and relates it to the geometry of the ambient decision space. Second, it provides a principled framework that both enables the reuse of existing high-dimensional analysis tools and motivates the creation of novel computational techniques. To ground the theory, we (1) illustrate the role of stratification theory in Minigrid games and (2) apply numerical techniques to the latent embeddings of a DRL agent playing such a game where the robustness of STL formulas is used as the reward. In the process, we propose computationally efficient signatures that, based on preliminary evidence, appear promising for uncovering the stratification structure of such embedding spaces.
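The STL robustness values used as rewards follow the standard quantitative semantics: an atomic predicate x > c has robustness x - c, "always" takes the minimum over the trace, and "eventually" takes the maximum. A minimal sketch of that semantics (discrete-time, illustrative signal values):

```python
def always(signal, pred):
    """Robustness of G(pred): worst-case predicate margin over the trace."""
    return min(pred(x) for x in signal)

def eventually(signal, pred):
    """Robustness of F(pred): best-case predicate margin over the trace."""
    return max(pred(x) for x in signal)

above = lambda c: (lambda x: x - c)  # atomic predicate x > c, robustness x - c

trace = [0.5, 1.2, 0.8, 2.0]
r_g = always(trace, above(0.3))      # positive iff x > 0.3 holds everywhere
r_f = eventually(trace, above(1.5))  # positive iff x > 1.5 holds somewhere
```

The sign of the robustness encodes satisfaction and its magnitude encodes margin, which is why it doubles as a shaped reward for the DRL agent in the paper's Minigrid experiments.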
[781] Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Narmeen Oozeer, Luke Marks, Shreyans Jain, Fazl Barez, Amirali Abdullah
Main category: cs.LG
TL;DR: K-Steering: A unified non-linear method for controlling multiple behavioral attributes in LLMs at inference time using gradient-based steering from a single multi-label classifier.
Details
Motivation: Current methods for controlling behavioral attributes in LLMs suffer from interference between attributes, linearity assumptions, and require per-attribute tuning, making multi-attribute control challenging.
Method: Train a single non-linear multi-label classifier on hidden activations, then compute intervention directions via gradients at inference time, enabling dynamic composition of behaviors without retraining.
Result: K-Steering outperforms strong baselines in accurately steering multiple behaviors across 3 model families, validated by both activation-based classifiers and LLM-based judges on new benchmarks ToneBank and DebateMix.
Conclusion: K-Steering provides a unified, flexible approach for compositional behavioral control in LLMs that avoids linearity assumptions and per-attribute tuning while enabling dynamic behavior composition.
Abstract: Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
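A minimal sketch of the gradient-based steering idea, assuming a toy two-layer probe with a hand-derived gradient. All names, shapes, and the random probe are illustrative stand-ins, not the authors' API or a trained classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-linear multi-label probe over hidden activations:
# h -> relu(h @ W1) @ W2, one logit per behavioral attribute.
d_model, hidden, n_attrs = 16, 32, 3
W1 = rng.normal(scale=0.1, size=(d_model, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, n_attrs))

def probe_logits(h):
    return np.maximum(h @ W1, 0.0) @ W2

def steering_direction(h, target_attrs):
    """Unit gradient of the summed target-attribute logits w.r.t. h,
    derived by hand (chain rule) for the two-layer probe above."""
    pre = h @ W1                                 # hidden pre-activations
    mask = (pre > 0).astype(float)               # ReLU derivative
    g_hidden = W2[:, target_attrs].sum(axis=1)   # d(sum logits)/d(relu out)
    grad = W1 @ (mask * g_hidden)                # back to the activation h
    return grad / (np.linalg.norm(grad) + 1e-8)

h = rng.normal(size=d_model)                     # a hidden activation
d = steering_direction(h, target_attrs=[0, 2])
h_steered = h + 1.5 * d                          # intervene at inference time

print(probe_logits(h)[[0, 2]].sum(), probe_logits(h_steered)[[0, 2]].sum())
```

Because the direction comes from the classifier's gradient rather than a stored per-attribute vector, any subset of attributes can be composed at inference time by changing `target_attrs`.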
[782] Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Main category: cs.LG
TL;DR: RL framework for LLMs that optimizes reasoning length by penalizing insignificant tokens and using dynamic length rewards to improve efficiency while maintaining accuracy
Details
Motivation: LLMs produce unnecessarily long explanations that reduce efficiency. Existing RL methods focus on accuracy with uniform length-based rewards that overlook token significance, often harming correctness.
Method: Introduces a significance-aware length reward that selectively penalizes insignificant tokens in chain-of-thought reasoning, plus a dynamic length reward that encourages detailed reasoning early and shifts toward conciseness later. Integrated into standard policy optimization.
Result: Experiments across multiple benchmarks show substantial reductions in response length while preserving or improving correctness.
Conclusion: Modeling token significance is important for efficient LLM reasoning, enabling both improved efficiency and accuracy through selective length optimization.
Abstract: Large language models (LLMs) show strong reasoning abilities but often produce unnecessarily long explanations that reduce efficiency. Although reinforcement learning (RL) has been used to improve reasoning, most methods focus on accuracy and rely on uniform length-based rewards that overlook the differing contributions of individual tokens, often harming correctness. We revisit length optimization in RL through the perspective of token significance. Observing that many chain-of-thought (CoT) tokens contribute little to the final answer, we introduce a significance-aware length reward that selectively penalizes insignificant tokens, reducing redundancy while preserving essential reasoning. We also propose a dynamic length reward that encourages more detailed reasoning early in training and gradually shifts toward conciseness as learning progresses. Integrating these components into standard policy optimization yields a framework that improves both reasoning efficiency and accuracy. Experiments across multiple benchmarks demonstrate substantial reductions in response length while preserving or improving correctness, highlighting the importance of modeling token significance for efficient LLM reasoning.
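The two reward components described above can be sketched together in a few lines. This is illustrative only: the significance scores, threshold `tau`, coefficient `beta`, and linear schedule are assumptions, not the paper's exact formulation:

```python
def length_reward(significance, step, total_steps, tau=0.5, beta=0.01):
    """Significance-aware length penalty with a dynamic schedule.

    Each token whose significance score falls below tau is penalized
    (significance-aware reward); the penalty is scaled by training
    progress, so long reasoning is tolerated early and conciseness is
    favored later (dynamic reward).
    """
    insignificant = sum(1 for s in significance if s < tau)
    schedule = step / total_steps        # 0 early in training -> 1 late
    return -beta * schedule * insignificant

# A 10-token CoT where 4 tokens carry little significance, evaluated
# halfway through training.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.95, 0.1, 0.6, 0.85]
print(length_reward(scores, step=500, total_steps=1000))  # -0.02
```

Added to a task-accuracy reward inside standard policy optimization, a penalty of this shape discourages redundant tokens without touching the significant ones.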
[783] Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung
Main category: cs.LG
TL;DR: ARISE enhances categorical data clustering by using LLMs to generate semantic embeddings for attribute values, addressing the semantic gap in traditional similarity measures.
Details
Motivation: Categorical data clustering suffers from poor similarity measures due to the lack of inherent ordering. Existing methods rely on within-dataset co-occurrence patterns, which fail with limited samples, leaving semantic context underexplored.
Method: ARISE uses Large Language Models to generate semantic descriptions of attribute values, creating semantic-aware representations that complement the original metric space. These LLM-enhanced embeddings are combined with the original data to identify semantically prominent clusters.
Result: Experiments on eight benchmark datasets show consistent improvements over seven representative methods, achieving gains of 19-27% in clustering quality.
Conclusion: External semantic knowledge from LLMs effectively bridges the semantic gap in categorical data clustering, significantly improving clustering accuracy when traditional methods struggle with limited samples.
Abstract: Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. That is, an LLM is adopted to describe attribute values for representation enhancement, and the LLM-enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%. Code is available at https://github.com/develop-yang/ARISE
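A minimal sketch of the ARISE recipe as described in the abstract, under stated assumptions: `embed_description` is a deterministic stand-in for a real LLM/text-encoder call, and concatenating one-hot features with the embedding is an illustrative simplification of the paper's attention-weighted representation:

```python
import hashlib
import numpy as np

def embed_description(text, dim=8):
    """Stand-in for embedding an LLM-written description of a value;
    seeded from the text so the sketch stays deterministic."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# Attribute values and (imagined) LLM-written descriptions of them.
values = ["nurse", "physician", "surgeon", "farmer"]
descriptions = {v: f"occupation: {v}" for v in values}
semantic = {v: embed_description(descriptions[v]) for v in values}

# Categorical rows -> one-hot feature plus semantic embedding, yielding a
# representation a standard clustering algorithm (e.g. k-means) can use.
rows = ["nurse", "surgeon", "farmer", "nurse"]
index = {v: i for i, v in enumerate(values)}
onehot = np.eye(len(values))
X = np.stack([np.concatenate([onehot[index[r]], semantic[r]]) for r in rows])
print(X.shape)  # (4, 12)
```

The point of the combined representation is that semantically related values (e.g. "nurse" and "physician") stop being equidistant, which is exactly the gap the paper targets.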
[784] Graph State-Space Models and Latent Relational Inference
Daniele Zambon, Andrea Cini, Cesare Alippi
Main category: cs.LG
Summary unavailable for arXiv:2301.01741 (the arXiv API request returned HTTP 429, rate limited).
[785] Neural Exploitation and Exploration of Contextual Bandits
Yikun Ban, Yuchen Yan, Arindam Banerjee, Jingrui He
Main category: cs.LG
Summary unavailable for arXiv:2305.03784 (the arXiv API request returned HTTP 429, rate limited).
[786] Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives
Kayhan Behdin, Wenyu Chen, Rahul Mazumder
Main category: cs.LG
Summary unavailable for arXiv:2307.09366 (the arXiv API request returned HTTP 429, rate limited).
[787] Federated Transfer Learning with Differential Privacy
Mengchu Li, Ye Tian, Yang Feng, Yi Yu
Main category: cs.LG
Summary unavailable for arXiv:2403.11343 (the arXiv API request returned HTTP 429, rate limited).
[788] FedScalar: Federated Learning with Scalar Communication for Bandwidth-Constrained Networks
M. Rostami, S. S. Kia
Main category: cs.LG
Summary unavailable for arXiv:2410.02260 (the arXiv API request returned HTTP 429, rate limited).
[789] EventFlow: Forecasting Temporal Point Processes with Flow Matching
Gavin Kerrigan, Kai Nelson, Padhraic Smyth
Main category: cs.LG
Summary unavailable for arXiv:2410.07430 (the arXiv API request returned HTTP 429, rate limited).
[790] Amortized Safe Active Learning for Real-Time Data Acquisition: Pretrained Neural Policies From Simulated Nonparametric Functions
Cen-You Li, Marc Toussaint, Barbara Rakitsch, Christoph Zimmer
Main category: cs.LG
Summary unavailable for arXiv:2501.15458 (the arXiv API request returned HTTP 429, rate limited).
[791] Causal Bandit Over Unknown Graphs: Upper Confidence Bounds With Backdoor Adjustment
Yijia Zhao, Qing Zhou
Main category: cs.LG
Summary unavailable for arXiv:2502.02020 (the arXiv API request returned HTTP 429, rate limited).
[792] From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Improvement
Jiamin Xu, Ivan Nazarov, Aditya Rastogi, África Periáñez, Kyra Gan
Main category: cs.LG
Summary unavailable for arXiv:2502.05145 (the arXiv API request returned HTTP 429, rate limited).
[793] RESIST: Resilient Decentralized Learning Using Consensus Gradient Descent
Cheng Fang, Rishabh Dixit, Waheed U. Bajwa, Mert Gurbuzbalaban
Main category: cs.LG
Summary unavailable for arXiv:2502.07977 (the arXiv API request returned HTTP 429, rate limited).
[794] Model Privacy: A Unified Framework for Understanding Model Stealing Attacks and Defenses
Ganghua Wang, Yuhong Yang, Jie Ding
Main category: cs.LG
Summary unavailable for arXiv:2502.15567 (the arXiv API request returned HTTP 429, rate limited).
[795] From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes
Zaiwei Chen, Phalguni Nanda
Main category: cs.LG
Summary unavailable for arXiv:2504.18743 (the arXiv API request returned HTTP 429, rate limited).
[796] A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy
Main category: cs.LG
Summary unavailable for arXiv:2505.03530 (the arXiv API request returned HTTP 429, rate limited).
[797] FABLE: A Localized, Targeted Adversarial Attack on Weather Forecasting Models
Yue Deng, Asadullah Hill Galib, Xin Lan, Jack Gunn, Pang-Ning Tan, Lifeng Luo
Main category: cs.LG
Summary unavailable for arXiv:2505.12167 (the arXiv API request returned HTTP 429, rate limited).
[798] Enforcing Fair Predicted Scores on Intervals of Percentiles by Difference-of-Convex Constraints
Yutian He, Yankun Huang, Yao Yao, Qihang Lin
Main category: cs.LG
Summary unavailable for arXiv:2505.12530 (the arXiv API request returned HTTP 429, rate limited).
[799] MSDformer: Multi-scale Discrete Transformer For Time Series Generation
Shibo Feng, Zhicheng Chen, Xi Xiao, Zhong Zhang, Qing Li, Xingyu Gao, Peilin Zhao
Main category: cs.LG
Summary unavailable for arXiv:2505.14202 (the arXiv API request returned HTTP 429, rate limited).
[800] MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
Wei Shen, Zhang Yaxiang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen
Main category: cs.LG
Summary unavailable for arXiv:2506.01897 (the arXiv API request returned HTTP 429, rate limited).
[801] SFBD Flow: A Continuous-Optimization Framework for Training Diffusion Models with Noisy Samples
Haoye Lu, Darren Lo, Yaoliang Yu
Main category: cs.LG
Summary unavailable for arXiv:2506.02371 (the arXiv API request returned HTTP 429, rate limited).
[802] Federated Item Response Models: A Gradient-driven Privacy-preserving Framework for Distributed Psychometric Estimation
Biying Zhou, Nanyu Luo, Feng Ji
Main category: cs.LG
Summary unavailable for arXiv:2506.21744 (the arXiv API request returned HTTP 429, rate limited).
[803] The Riemannian Geometry Associated to Gradient Flows of Linear Convolutional Networks
El Mehdi Achour, Kathlén Kohn, Holger Rauhut
Main category: cs.LG
Summary unavailable for arXiv:2507.06367 (the arXiv API request returned HTTP 429, rate limited).
[804] Multi-Component VAE with Gaussian Markov Random Field
Fouad Oubari, Mohamed El-Baha, Raphael Meunier, Rodrigue Décatoire, Mathilde Mougeot
Main category: cs.LG
Summary unavailable for arXiv:2507.12165 (the arXiv API request returned HTTP 429, rate limited).
[805] Causal Process Models: Reframing Dynamic Causal Graph Discovery as a Reinforcement Learning Problem
Turan Orujlu, Christian Gumbsch, Martin V. Butz, Charley M Wu
Main category: cs.LG
Summary unavailable for arXiv:2507.13920 (the arXiv API request returned HTTP 429, rate limited).
[806] Evaluating and Learning Robust Bandit Policies Under Uncertain Causal Mechanisms
Katherine Avery, Chinmay Pendse, David Jensen
Main category: cs.LG
Summary unavailable for arXiv:2508.02812 (the arXiv API request returned HTTP 429, rate limited).
[807] LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks
Ze Tao, Hanxuan Wang, Fujun Liu
Main category: cs.LG
Summary unavailable for arXiv:2508.08935 (the arXiv API request returned HTTP 429, rate limited).
[808] xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin
Main category: cs.LG
Summary unavailable for arXiv:2508.10053 (the arXiv API request returned HTTP 429, rate limited).
[809] Estimating Parameter Fields in Multi-Physics PDEs from Scarce Measurements
Xuyang Li, Mahdi Masmoudi, Rami Gharbi, Nizar Lajnef, Vishnu Naresh Boddeti
Main category: cs.LG
Summary unavailable for arXiv:2509.00203 (the arXiv API request returned HTTP 429, rate limited).
[810] Improving Generative Methods for Causal Evaluation via Simulation-Based Inference
Pracheta Amaranath, Vinitra Muralikrishnan, Amit Sharma, David Jensen
Main category: cs.LG
Summary unavailable for arXiv:2509.02892 (the arXiv API request returned HTTP 429, rate limited).
[811] A Data-Driven Interpolation Method on Smooth Manifolds via Diffusion Processes and Voronoi Tessellations
Alvaro Almeida Gomez
Main category: cs.LG
Summary unavailable for arXiv:2509.03758 (the arXiv API request returned HTTP 429, rate limited).
[812] Causal Discovery via Quantile Partial Effect
Yikang Chen, Xingzhe Sun, Dehui Du
Main category: cs.LG
Summary unavailable for arXiv:2509.12981 (the arXiv API request returned HTTP 429, rate limited).
[813] Process-Informed Forecasting of Complex Thermal Dynamics in Pharmaceutical Manufacturing
Ramona Rubini, Siavash Khodakarami, Aniruddha Bora, George Em Karniadakis, Michele Dassisti
Main category: cs.LG
Summary unavailable for arXiv:2509.20349 (the arXiv API request returned HTTP 429, rate limited).
[814] A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
Phalguni Nanda, Zaiwei Chen
Main category: cs.LG
Summary unavailable for arXiv:2510.16132 (the arXiv API request returned HTTP 429, rate limited).
[815] Deep Gaussian Processes for Functional Maps
Matthew Lowery, Zhitong Xu, Da Long, Keyan Chen, Daniel S. Johnson, Yang Bai, Varun Shankar, Shandian Zhe
Main category: cs.LG
Summary unavailable for arXiv:2510.22068 (the arXiv API request returned HTTP 429, rate limited).
[816] An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning
Xingtu Liu
Main category: cs.LG
Summary unavailable for arXiv:2510.23448 (the arXiv API request returned HTTP 429, rate limited).
[817] Co-Evolving Latent Action World Models
Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian
Main category: cs.LG
Summary unavailable for arXiv:2510.26433 (the arXiv API request returned HTTP 429, rate limited).
[818] SPORE: Skeleton Propagation Over Recalibrating Expansions
Randolph Wiredu-Aidoo
Main category: cs.LG
Summary unavailable for arXiv:2511.00064 (the arXiv API request returned HTTP 429, rate limited).
[819] SynQuE: Estimating Synthetic Dataset Quality Without Annotations
Arthur Chen, Victor Zhong
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.03928 returned HTTP 429 (rate limited).
[820] Controllable protein design with particle-based Feynman-Kac steering
Erik Hartman, Jonas Wallin, Johan Malmström, Jimmy Olsson
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.09216 returned HTTP 429 (rate limited).
[821] A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias
Wei-Kai Chang, Rajiv Khanna
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.17378 returned HTTP 429 (rate limited).
[822] Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection
Yaw Osei Adjei, Frederick Ayivor
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.20944 returned HTTP 429 (rate limited).
[823] Mitigating Structural Overfitting: A Distribution-Aware Rectification Framework for Missing Feature Imputation
Yifan Song, Fenglin Yu, Yihong Luo, Xingjian Tao, Siya Qiu, Kai Han, Jing Tang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.06356 returned HTTP 429 (rate limited).
[824] Personalized Federated Distillation Assisted Vehicle Edge Caching Strategy
Xun Li, Qiong Wu, Pingyi Fan, Kezhi Wang, Wen Chen, Cui Zhang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.09378 returned HTTP 429 (rate limited).
[825] Random-Bridges as Stochastic Transports for Generative Models
Stefano Goria, Levent A. Mengütürk, Murat C. Mengütürk, Berkan Sesen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.14190 returned HTTP 429 (rate limited).
[826] GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion
Fangzhou Lin, Guoshun He, Zhenyu Guo, Zhe Huang, Jinsong Tao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.14400 returned HTTP 429 (rate limited).
[827] Kinetic-Mamba: Mamba-Assisted Predictions of Stiff Chemical Kinetics
Additi Pandey, Liang Wei, Hessam Babaee, George Em Karniadakis
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.14471 returned HTTP 429 (rate limited).
[828] SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
Haoye Lu, Yaoliang Yu, Darren Lo
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.17051 returned HTTP 429 (rate limited).
[829] ASSS: A Differentiable Adversarial Framework for Task-Aware Data Reduction
Jiacheng Lyu, Bihua Bao, Shiyun Yan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.02081 returned HTTP 429 (rate limited).
[830] From Bits to Chips: An LLM-based Hardware-Aware Quantization Agent for Streamlined Deployment of LLMs
Kaiyuan Deng, Hangyu Zheng, Minghai Qing, Kunxiong Zhu, Gen Li, Yang Xiao, Lan Emily Zhang, Linke Guo, Bo Hui, Yanzhi Wang, Geng Yuan, Gagan Agrawal, Wei Niu, Xiaolong Ma
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.03484 returned HTTP 429 (rate limited).
[831] Understanding and inverse design of implicit bias in stochastic learning: a geometric perspective
Nicola Aladrah, Emanuele Ballarin, Matteo Biagetti, Alessio Ansuini, Alberto d’Onofrio, Fabio Anselmi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.06597 returned HTTP 429 (rate limited).
[832] HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
Aakriti Lnu, Zhe Li, Dandan Liang, Chao Huang, Rui Li, Haibo Yang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.10940 returned HTTP 429 (rate limited).
[833] Auxiliary-predicted Compress Memory Model(ApCM Model): A Neural Memory Storage Model Based on Invertible Compression and Learnable Prediction
Weinuo Ou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.11609 returned HTTP 429 (rate limited).
[834] Accelerated Sinkhorn Algorithms for Partial Optimal Transport
Nghia Thu Truong, Qui Phu Pham, Quang Nguyen, Dung Luong, Mai Tran
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.17196 returned HTTP 429 (rate limited).
[835] Unlearning Noise in PINNs: A Selective Pruning Framework for PDE Inverse Problems
Yongsheng Chen, Yong Chen, Wei Guo, Xinghui Zhong
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.19967 returned HTTP 429 (rate limited).
[836] T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation
Dongik Park, Hyunwoo Ryu, Suahn Bae, Keondo Park, Hyung-Sin Kim
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.21043 returned HTTP 429 (rate limited).
[837] KindSleep: Knowledge-Informed Diagnosis of Obstructive Sleep Apnea from Oximetry
Micky C Nnamdi, Wenqi Shi, Cheng Wan, J. Ben Tamo, Benjamin M Smith, Chad A Purnell, May D Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.04755 returned HTTP 429 (rate limited).
[838] OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality
Ganzhao Yuan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.09923 returned HTTP 429 (rate limited).
[839] A Grammar of Machine Learning Workflows
Simon Roth
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.10742 returned HTTP 429 (rate limited).
[840] Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction
Yanghao Li, Changxin Liu, Yuhao Yi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.15144 returned HTTP 429 (rate limited).
[841] Path-Constrained Mixture-of-Experts
Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.18297 returned HTTP 429 (rate limited).
[842] Improving RCT-Based CATE Estimation Under Covariate Mismatch via Calibrated Alignment
Amir Asiaee, Samhita Pal
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.19186 returned HTTP 429 (rate limited).
[843] How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Luca Ambrogioni
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.20092 returned HTTP 429 (rate limited).
[844] LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models
Amirmohammad Ziaei Bideh, Jonathan Gryak
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.20910 returned HTTP 429 (rate limited).
[845] ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization
Foo Hui-Mean, Yuan-chin I Chang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21180 returned HTTP 429 (rate limited).
[846] Posterior-Calibrated Causal Circuits in Variational Autoencoders: Why Image-Domain Interpretability Fails on Tabular Data
Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21236 returned HTTP 429 (rate limited).
[847] Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging
Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21717 returned HTTP 429 (rate limited).
[848] CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21743 returned HTTP 429 (rate limited).
[849] Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions
Rustem Islamov, Grigory Malinovsky, Alexander Gaponov, Aurelien Lucchi, Peter Richtárik, Eduard Gorbunov
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.23472 returned HTTP 429 (rate limited).
[850] Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.24647 returned HTTP 429 (rate limited).
[851] Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Haishan Ye
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.25029 returned HTTP 429 (rate limited).
[852] ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
Christopher Cruz
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.27905 returned HTTP 429 (rate limited).
[853] See it to Place it: Evolving Macro Placements with Vision-Language Models
Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee, Joe Wenjie Jiang, Vijay Janapa Reddi, Vincent Zhuang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.28733 returned HTTP 429 (rate limited).
[854] Rethinking Language Model Scaling under Transferable Hypersphere Optimization
Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.28743 returned HTTP 429 (rate limited).
[855] ReproMIA: A Comprehensive Analysis of Model Reprogramming for Proactive Membership Inference Attacks
Chihan Huang, Huaijin Wang, Shuai Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.28942 returned HTTP 429 (rate limited).
[856] Realistic Market Impact Modeling for Reinforcement Learning Trading Environments
Lucas Riera Abbade, Anna Helena Reali Costa
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.29086 returned HTTP 429 (rate limited).
[857] Fatigue-Aware Learning to Defer via Constrained Optimisation
Zheng Zhang, Cuong C. Nguyen, David Rosewarne, Kevin Wells, Gustavo Carneiro
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2604.00904 returned HTTP 429 (rate limited).
[858] Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Rafael Pardinas, Ehsan Kamalloo, David Vazquez, Alexandre Drouin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2604.02007 returned HTTP 429 (rate limited).
[859] Bayesian Neural Networks: An Introduction and Survey
Ethan Goan, Clinton Fookes
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2006.12024 returned HTTP 429 (rate limited).
[860] Piecewise Deterministic Markov Processes for Bayesian Neural Networks
Ethan Goan, Dimitri Perrin, Kerrie Mengersen, Clinton Fookes
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2302.08724 was rate-limited (HTTP 429).
[861] Importance Sparsification for Sinkhorn Algorithm
Mengyu Li, Jun Yu, Tao Li, Cheng Meng
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2306.06581 was rate-limited (HTTP 429).
[862] Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima
Rishabh Dixit, Mert Gurbuzbalaban, Waheed U. Bajwa
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2307.07030 was rate-limited (HTTP 429).
[863] MissNODAG: Differentiable Cyclic Causal Graph Learning from Incomplete Data
Muralikrishnna G. Sethuraman, Razieh Nabi, Faramarz Fekri
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2410.18918 was rate-limited (HTTP 429).
[864] Sparse Max-Affine Regression
Haitham Kanj, Seonho Kim, Kiryung Lee
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2411.02225 was rate-limited (HTTP 429).
[865] Score-matching-based Structure Learning for Temporal Data on Networks
Hao Chen, Kai Yi, Yu Guang Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2412.07469 was rate-limited (HTTP 429).
[866] From XAI to MLOps: Explainable Concept Drift Detection with Profile Drift Detection
Ugur Dar, Mustafa Cavus
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2412.11308 was rate-limited (HTTP 429).
[867] Towards Build Optimization Using Digital Twins
Henri Aïdasso, Francis Bordeleau, Ali Tizghadam
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.19381 was rate-limited (HTTP 429).
[868] Operator Learning for Schrödinger Equation: Unitarity, Error Bounds, and Time Generalization
Yash Patel, Unique Subedi, Ambuj Tewari
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.18288 was rate-limited (HTTP 429).
[869] Learning thermodynamic master equations for open quantum systems
Peter Sentz, Stanley Nicholson, Yujin Cho, Sohail Reddy, Brendan Keith, Stefanie Günther
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.01882 was rate-limited (HTTP 429).
[870] Accelerating Constrained Sampling: A Large Deviations Approach
Yingli Wang, Changwei Tu, Xiaoyu Wang, Lingjiong Zhu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.07816 was rate-limited (HTTP 429).
[871] All is Not Lost: LLM Recovery without Checkpoints
Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.15461 was rate-limited (HTTP 429).
[872] Better Together: Cross and Joint Covariances Enhance Signal Detectability in Undersampled Data
Arabind Swain, Sean Alexander Ridout, Ilya Nemenman
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.22207 was rate-limited (HTTP 429).
[873] The Role of Entanglement in Quantum Reservoir Computing with Coupled Kerr Nonlinear Oscillators
Ali Karimi, Hadi Zadeh-Haghighi, Youssef Kora, Christoph Simon
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.11175 was rate-limited (HTTP 429).
[874] Smooth Flow Matching for Synthesizing Functional Data
Jianbin Tan, Anru R. Zhang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.13831 was rate-limited (HTTP 429).
[875] Partially Functional Dynamic Backdoor Diffusion-based Causal Model
Xinwen Liu, Lei Qian, Song Xi Chen, Niansheng Tang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.00472 was rate-limited (HTTP 429).
[876] Sequential 1-bit Mean Estimation with Near-Optimal Sample Complexity
Ivan Lau, Jonathan Scarlett
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.21940 was rate-limited (HTTP 429).
[877] Smart Paste: Automatically Fixing Copy/Paste for Google Developers
Vincent Nguyen, Guilherme Herzog, José Cambronero, Marcus Revaj, Aditya Kini, Alexander Frömmgen, Maxim Tabachnyk
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.03843 was rate-limited (HTTP 429).
[878] Three-dimensional inversion of gravity data using implicit neural representations and scientific machine learning
Pankaj K Mishra, Sanni Laaksonen, Jochen Kamm, Anand Singh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.17876 was rate-limited (HTTP 429).
[879] Endogenous Aggregation of Multiple Data Envelopment Analysis Scores for Large Data Sets
Hashem Omrani, Raha Imanirad, Adam Diamant, Utkarsh Verma, Amol Verma, Fahad Razak
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.20052 was rate-limited (HTTP 429).
[880] Data-driven Sensor Placement for Predictive Applications: A Correlation-Assisted Attribution Framework (CAAF)
Sze Chai Leung, Di Zhou, H. Jane Bae
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.22517 was rate-limited (HTTP 429).
[881] pDANSE: Particle-based Data-driven Nonlinear State Estimation from Nonlinear Measurements
Anubhab Ghosh, Yonina C. Eldar, Saikat Chatterjee
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.27503 was rate-limited (HTTP 429).
[882] Efficient and Private Property Testing via Indistinguishability
Cynthia Dwork, Pranay Tankala
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.03653 was rate-limited (HTTP 429).
[883] Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare
Saeedeh Javadi, Sara Mirabi, Manan Gangar, Bahadorreza Ofoghi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.06668 was rate-limited (HTTP 429).
[884] Learning continuous state of charge dependent thermal decomposition kinetics for Li-ion cathodes using Kolmogorov-Arnold Chemical Reaction Neural Networks (KA-CRNNs)
Benjamin C. Koenig, Sili Deng
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.15628 was rate-limited (HTTP 429).
[885] Making Bias Non-Predictive: Training Robust LLM Reasoning via Reinforcement Learning
Qian Wang, Xuandong Zhao, Zirui Zhang, Zhanzhi Lou, Nuo Chen, Dawn Song, Bingsheng He
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.01528 was rate-limited (HTTP 429).
[886] SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
Chenyu Yang, Denis Tarasov, Davide Liconti, Hehui Zheng, Robert K. Katzschmann
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.09580 was rate-limited (HTTP 429).
[887] RL unknotter, hard unknots and unknotting number
Anne Dranowski, Yura Kabkov, Daniel Tubbenhauer
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.07955 was rate-limited (HTTP 429).
[888] Noise Models Impacts and Mitigation Strategies in Photonic Quantum Machine Learning
A.M.A.S.D. Alagiyawanna, Asoka Karunananda
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.09645 was rate-limited (HTTP 429).
[889] All elementary functions from a single binary operator
Andrzej Odrzywołek
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.21852 was rate-limited (HTTP 429).
[890] CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection
Abdul Rahman
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.23459 was rate-limited (HTTP 429).
[891] When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
Michael Bidollahkhani, Freja Nordsiek, Julian M. Kunkel
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.28781 was rate-limited (HTTP 429).
[892] SkVM: Compiling Skills for Efficient Execution Everywhere
Le Chen, Erhu Feng, Yubin Xia, Haibo Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2604.03088 was rate-limited (HTTP 429).
cs.MA
[893] Emergent Compositional Communication for Latent World Properties
Tomek Kaszyński
Main category: cs.MA
TL;DR: Multi-agent communication with Gumbel-Softmax bottleneck enables agents to develop compositional representations of invisible physical properties from frozen video features without supervision.
Details
Motivation: To investigate whether multi-agent communication can extract discrete, compositional representations of invisible physical properties (elasticity, friction, mass ratio) from frozen video features without property labels or supervision on message structure.
Method: Use multiple agents communicating through a Gumbel-Softmax bottleneck with iterated learning. Compare different vision backbones (DINOv2 vs V-JEPA 2) on different physics tasks. Validate on real camera footage from the Physics 101 dataset.
Result: With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Causal intervention shows surgical property disruption. DINOv2 dominates on spatially-visible ramp physics, while V-JEPA 2 dominates on dynamics-only collision physics. Real-world validation shows 85.6% mass-comparison accuracy.
Conclusion: Multi-agent communication pressure can extract compositional representations of invisible physical properties from frozen video features. The perceptual prior determines what is communicable, with different vision backbones excelling at different physics understanding tasks.
Abstract: Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi-agent structure – not bandwidth or temporal coverage – drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially-visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass-comparison accuracy on unseen objects, temporal dynamics contributing +11.2% beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).
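The "Gumbel-Softmax bottleneck" named above can be illustrated with a minimal sketch. The function below is a generic pure-Python implementation of the standard Gumbel-Softmax relaxation, not the authors' code; the three-symbol vocabulary, logits, and temperature are hypothetical.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed one-hot sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to each logit and applies a
    temperature-scaled softmax. As temperature -> 0 the sample
    approaches a discrete one-hot symbol, which is how a
    Gumbel-Softmax bottleneck discretizes agent messages while
    keeping the channel differentiable during training.
    """
    noisy = []
    for lg in logits:
        u = rng.random() or 1e-12            # guard against log(0)
        noisy.append(lg - math.log(-math.log(u)))
    scaled = [n / temperature for n in noisy]
    m = max(scaled)                          # numerically stable softmax
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
# Hypothetical 3-symbol vocabulary; a low temperature yields a
# near-one-hot message vector.
msg = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.1)
```

At temperatures near zero the output concentrates almost all mass on one symbol, giving a discrete protocol; during training a higher temperature keeps gradients flowing through the message.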
[894] Multi-Agent Training-free Urban Food Delivery System using Resilient UMST Network
Md Nahid Hasan, Vishwam Tiwari, Aditya Challa, Vaskar Raychoudhury, Snehanshu Saha
Main category: cs.MA
TL;DR: UMST constructs sparse yet robust delivery networks by uniting multiple minimum spanning trees through randomized edge perturbations, achieving efficiency and resilience without training.
Details
Motivation: Traditional urban delivery networks are fragile and struggle with disruptions like road closures, while fully connected graphs are computationally infeasible and single MSTs are too brittle. There is a need for scalable, resilient delivery network designs for growing online food delivery markets.
Method: Union of Minimum Spanning Trees (UMST) generates multiple MSTs through randomized edge perturbations and unites them, creating sparse graphs with multiple alternative routes between delivery hotspots without requiring training.
Result: UMST achieves 20-40× fewer edges than fully connected graphs with 75-83% order bundling participation. It delivers 88-96% success rates and 44-53% distance savings, outperforming learning-based baselines while being 30× faster and interpretable.
Conclusion: UMST provides a scalable, resilient foundation for urban delivery networks by balancing structural efficiency with operational flexibility through its sparse yet robust graph construction approach.
Abstract: Delivery systems have become a core part of urban life, supporting the demand for food, medicine, and other goods. Yet traditional logistics networks remain fragile, often struggling to adapt to road closures, accidents, and shifting demand. Online Food Delivery (OFD) platforms now represent a cornerstone of urban logistics, with the global market projected to grow to over 500 billion USD by 2030. Designing delivery networks that are efficient and resilient remains a major challenge: fully connected graphs provide flexibility but are computationally infeasible at scale, while single Minimum Spanning Trees (MSTs) are efficient but easily disrupted. We propose the Union of Minimum Spanning Trees (UMST) approach to construct delivery networks that are sparse yet robust. UMST generates multiple MSTs through randomized edge perturbations and unites them, producing graphs with far fewer edges than fully connected networks while maintaining multiple alternative routes between delivery hotspots. Across multiple U.S. cities, UMST achieves 20–40$\times$ fewer edges than fully connected graphs while enabling substantial order bundling with 75–83% participation rates. Compared to learning-based baselines including MADDPG and Graph Neural Networks, UMST delivers competitive performance (88-96% success rates, 44-53% distance savings) without requiring training, achieving 30$\times$ faster execution while maintaining interpretable routing structures. Its combination of structural efficiency and operational flexibility offers a scalable and resilient foundation for urban delivery networks.
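The UMST construction (perturb the edge weights, take an MST, union over repetitions) can be sketched as follows. This is a toy illustration under assumed parameters (k trees, multiplicative uniform noise), not the authors' implementation, and the 6-node "hotspot" graph is hypothetical.

```python
import random
from itertools import combinations

def _find(parent, x):
    # Union-find root lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mst_edges(nodes, weights):
    """Edge set of one minimum spanning tree (Kruskal's algorithm)."""
    parent = {n: n for n in nodes}
    tree = set()
    for (u, v) in sorted(weights, key=weights.get):
        ru, rv = _find(parent, u), _find(parent, v)
        if ru != rv:          # keep the edge only if it joins components
            parent[ru] = rv
            tree.add((u, v))
    return tree

def umst(nodes, weights, k=5, noise=0.3, seed=0):
    """Union of k MSTs, each computed after multiplicative random
    perturbation of the edge weights. The union stays sparse yet
    offers alternative routes between hotspots."""
    rng = random.Random(seed)
    union = set()
    for _ in range(k):
        perturbed = {e: w * (1.0 + rng.uniform(-noise, noise))
                     for e, w in weights.items()}
        union |= mst_edges(nodes, perturbed)
    return union

# Toy complete graph on 6 "hotspots" with random pairwise distances.
g = random.Random(42)
nodes = list(range(6))
weights = {(u, v): g.uniform(1.0, 10.0) for u, v in combinations(nodes, 2)}
edges = umst(nodes, weights)
```

Because each perturbed MST is connected, the union is connected as well, while typically retaining far fewer edges than the complete graph; larger `noise` and `k` trade sparsity for more redundant routes.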
[895] Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
Shanglin Wu, Yuyang Luo, Yueqing Liang, Kaiwen Shi, Yanfang Ye, Ali Payani, Kai Shu
Main category: cs.MA
TL;DR: LLMA-Mem: A lifelong memory framework for LLM multi-agent systems that studies interaction between team size and experience accumulation under cost constraints, revealing non-monotonic scaling where smaller teams with better memory can outperform larger ones.
Details
Motivation: Prior work studied the two multi-agent scaling dimensions (team size vs. lifelong learning) separately, but their interaction under realistic cost constraints remains unclear, as does how memory design shapes this scaling landscape.
Method: Proposed LLMA-Mem, a lifelong memory framework for LLM multi-agent systems with flexible memory topologies. Evaluated on MultiAgentBench across coding, research, and database environments, comparing performance under different team sizes and memory configurations.
Result: LLMA-Mem consistently improves long-horizon performance over baselines while reducing cost. Revealed non-monotonic scaling: larger teams don’t always produce better long-term performance; smaller teams can outperform larger ones when memory better supports experience reuse.
Conclusion: Memory design is a practical path for scaling multi-agent systems more effectively and efficiently over time. The findings position memory topology as crucial for optimizing the trade-off between team size and lifelong learning ability.
Abstract: Large language model (LLM) multi-agent systems can scale along two distinct dimensions: by increasing the number of agents and by improving through accumulated experience over time. Although prior work has studied these dimensions separately, their interaction under realistic cost constraints remains unclear. In this paper, we introduce a conceptual scaling view of multi-agent systems that jointly considers team size and lifelong learning ability, and we study how memory design shapes this landscape. To this end, we propose \textbf{LLMA-Mem}, a lifelong memory framework for LLM multi-agent systems under flexible memory topologies. We evaluate LLMA-Mem on \textsc{MultiAgentBench} across coding, research, and database environments. Empirically, LLMA-Mem consistently improves long-horizon performance over baselines while reducing cost. Our analysis further reveals a non-monotonic scaling landscape: larger teams do not always produce better long-term performance, and smaller teams can outperform larger ones when memory better supports the reuse of experience. These findings position memory design as a practical path for scaling multi-agent systems more effectively and more efficiently over time.
[896] Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions
Charles Fleming, Ramana Kompella, Peter Bosch, Vijoy Pandey
Main category: cs.MA
TL;DR: CFN introduces intelligent middleware nodes that create a “Cognitive Fabric” between LLM-based multi-agent systems, improving communication coherence, security, and performance through active memory management and learning modules.
Details
Motivation: Current LLM-based multi-agent systems suffer from fragmented context, stochastic hallucinations, rigid security boundaries, and inefficient topology management in direct agent-to-agent communication, necessitating a smarter intermediary layer.
Method: Cognitive Fabric Nodes (CFN) act as active, intelligent middleware that elevates memory to an active functional substrate informing four capabilities: Topology Selection, Semantic Grounding, Security Policy Enforcement, and Prompt Transformation, governed by RL and optimization algorithms.
Result: CFN improves performance by more than 10% on HotPotQA and MuSiQue datasets in multi-agent environments compared to direct agent-to-agent communication.
Conclusion: CFN provides a novel middleware architecture that enables LLM-based multi-agent systems to achieve coherence, safety, and semantic alignment while keeping individual agents lightweight through intelligent communication management.
Abstract: As Large Language Model (LLM) based Multi-Agent Systems (MAS) evolve from experimental pilots to complex, persistent ecosystems, the limitations of direct agent-to-agent communication have become increasingly apparent. Current architectures suffer from fragmented context, stochastic hallucinations, rigid security boundaries, and inefficient topology management. This paper introduces Cognitive Fabric Nodes (CFN), a novel middleware layer that creates an omnipresent “Cognitive Fabric” between agents. Unlike traditional message queues or service meshes, CFNs are not merely pass-through mechanisms; they are active, intelligent intermediaries. Central to this architecture is the elevation of Memory from simple storage to an active functional substrate that informs four other critical capabilities: Topology Selection, Semantic Grounding, Security Policy Enforcement, and Prompt Transformation. We propose that each of these functions be governed by learning modules utilizing Reinforcement Learning (RL) and optimization algorithms to improve system performance dynamically. By intercepting, analyzing, and rewriting inter-agent communication, the Cognitive Fabric ensures that individual agents remain lightweight while the ecosystem achieves coherence, safety, and semantic alignment. We evaluate the effectiveness of the CFN on the HotPotQA and MuSiQue datasets in a multi-agent environment and demonstrate that the CFN improves performance by more than 10% on both datasets over direct agent-to-agent communication.
[897] When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Michał Wawer, Jarosław A. Chudziak
Main category: cs.MA
TL;DR: Multi-agent LLM systems can use disagreement as signal rather than noise, especially for subjective tasks like hate speech moderation where legitimate value pluralism exists.
Details
Motivation: Current practice treats disagreement in LLM-based multi-agent systems as noise to be resolved through consensus, but the authors propose it can be valuable signal, particularly for subjective domains like hate speech moderation where human annotators legitimately disagree due to cultural context and value weightings.
Method: Using the Measuring Hate Speech corpus, they embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. They analyze how agent disagreement correlates with human annotator conflict.
Result: Raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal. Cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d>0.8) surviving correction for multiple comparisons. The taxonomy-based ordering correlates with human disagreement patterns.
Conclusion: These findings motivate a shift from consensus-seeking to uncertainty-surfacing multi-agent design, where disagreement structure - not magnitude - guides when human judgment is needed, recognizing legitimate value pluralism in subjective domains.
Abstract: When LLM-based multi-agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d>0.8) surviving correction for multiple comparisons. Our taxonomy-based ordering correlates with human disagreement patterns. These preliminary findings motivate a shift from consensus-seeking to uncertainty-surfacing multi-agent design, where disagreement structure - not magnitude - guides when human judgment is needed.
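The four-category taxonomy (reasoning similarity crossed with conclusion agreement) can be illustrated with a minimal pairwise sketch. The cosine threshold, the category labels, and the pairwise formulation are assumptions for illustration; the paper's actual classifier operates over embedded reasoning traces from five agents.

```python
import math

def cosine(a, b):
    """Cosine similarity of two (nonzero) embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify_pair(emb_a, emb_b, verdict_a, verdict_b, sim_threshold=0.8):
    """Cross reasoning similarity with conclusion agreement to get one
    of four disagreement categories (hypothetical labels)."""
    similar = cosine(emb_a, emb_b) >= sim_threshold
    agree = verdict_a == verdict_b
    if similar and agree:
        return "convergent-agreement"
    if similar and not agree:
        return "convergent-disagreement"  # same reasoning, different verdict:
                                          # the "genuine value pluralism" case
    if agree:
        return "divergent-agreement"
    return "divergent-disagreement"
```

Under the paper's hypothesis, "convergent-disagreement" cases are the ones where human judgment should be surfaced rather than forced to consensus.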
[898] Investigating the Impact of Subgraph Social Structure Preference on the Strategic Behavior of Networked Mixed-Motive Learning Agents
Xinqi Gao, Mario Ventresca
Main category: cs.MA
TL;DR: SRIM framework gives agents preferences over social subgraph structures to study how these preferences affect strategic behavior in sequential social dilemmas like Harvest and Cleanup games.
Details
Motivation: Previous work has overlooked the intricate social dynamics and strategic behaviors of relational networked learning agents in social dilemmas. The paper aims to understand how agents' personal preferences over their subgraphical social structures influence their strategic decision-making.
Method: Proposes Socio-Relational Intrinsic Motivation (SRIM) which endows agents with diverse preferences over sub-graphical social structures (degree-, clique-, and critical connection-based). Tests in Harvest and Cleanup environments using a BCI metric to capture structural variation.
Result: Different subgraph structure preferences lead to distinct variations in agents’ reward gathering and strategic behavior. Agents with different structural positions show similar strategic behavioral shifts. BCI metric ordering across social preferences is consistent across environments, showing robust subgraphical structural impact.
Conclusion: Provides a new framework for examining agents’ behavior in social dilemmas and insights for designing multi-agent ecosystems with heterogeneous social agents, showing that social structure preferences significantly influence strategic decision-making.
Abstract: Limited work has examined the strategic behaviors of relational networked learning agents under social dilemmas, and has overlooked the intricate social dynamics of complex systems. We address the challenge with Socio-Relational Intrinsic Motivation (SRIM), which endows agents with diverse preferences over sub-graphical social structures in order to study the impact of agents’ personal preferences over their sub-graphical relations on their strategic decision-making under sequential social dilemmas. Our results in the Harvest and Cleanup environments demonstrate that preferences over different subgraph structures (degree-, clique-, and critical connection-based) lead to distinct variations in agents’ reward gathering and strategic behavior: individual aggressiveness in Harvest and individual contribution effort in Cleanup. Moreover, agents with different subgraphical structural positions consistently exhibit similar strategic behavioral shifts. Our proposed BCI metric captures structural variation within the population, and the relative ordering of BCI across social preferences is consistent in Harvest and Cleanup games for the same topology, suggesting the subgraphical structural impact is robust across environments. These results provide a new lens for examining agents’ behavior in social dilemmas and insight for designing effective multi-agent ecosystems composed of heterogeneous social agents.
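A structural preference signal of the kind SRIM adds to an agent's reward can be sketched for the degree- and clique-based cases. The clustering-coefficient form for the clique preference and the `beta` mixing weight are assumptions, not the paper's exact formulation.

```python
def degree_preference(adj, node):
    """Degree-based signal: the agent's degree, normalized by the
    maximum possible degree. adj maps node -> set of neighbors."""
    n = len(adj)
    return len(adj[node]) / (n - 1) if n > 1 else 0.0

def clique_preference(adj, node):
    """Clique-based signal: fraction of the agent's neighbor pairs that
    are themselves connected (the local clustering coefficient)."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

def srim_reward(extrinsic, structural, beta=0.1):
    """Total reward = environment reward + weighted intrinsic term."""
    return extrinsic + beta * structural
```

Swapping the structural term (degree vs. clique vs. critical-connection) is what produces the distinct behavioral variations the paper studies.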
[899] Symbolic-Vector Attention Fusion for Collective Intelligence
Hongwei Xu
Main category: cs.MA
TL;DR: SVAF is a content-evaluation mechanism for multi-agent systems that decomposes inter-agent signals into 7 semantic fields, evaluates relevance through learned fusion gates, and produces remixed knowledge, solving selectivity and redundancy problems in collective intelligence.
Details
Motivation: Autonomous agents observing different domains of a shared environment receive signals mixing relevant and irrelevant dimensions, with no existing mechanism for receivers to evaluate which dimensions to absorb, creating a need for selective content evaluation in collective intelligence systems.
Method: Symbolic-Vector Attention Fusion (SVAF) decomposes inter-agent signals into 7 typed semantic fields, evaluates each through learned fusion gates to produce remixed knowledge, using a band-pass model yielding four outcomes (redundant, aligned, guarded, rejected). Combined with Closed-form Continuous-time (CfC) neural networks that create temporal dynamics through learned per-neuron time constants.
Result: SVAF achieves 78.7% three-class accuracy on 237K samples from 273 narrative scenarios. The fusion gate discovers cross-domain relevance hierarchy with mood emerging as highest-weight field early. Complete mesh cognition loop verified in live deployment with 7 nodes across macOS, iOS, and web.
Conclusion: SVAF provides the content-evaluation half of a coupling engine for collective intelligence, determining what enters each agent’s cognitive state, while CfC determines how that state evolves, enabling selective knowledge absorption and temporal dynamics in multi-agent systems.
Abstract: When autonomous agents observe different domains of a shared environment, each signal they exchange mixes relevant and irrelevant dimensions. No existing mechanism lets the receiver evaluate which dimensions to absorb. We introduce Symbolic-Vector Attention Fusion (SVAF), the content-evaluation half of a two-level coupling engine for collective intelligence. SVAF decomposes each inter-agent signal into 7 typed semantic fields, evaluates each through a learned fusion gate, and produces a remix – new knowledge from the intersection of two domains. A band-pass model yields four outcomes (redundant, aligned, guarded, rejected), solving both selectivity and redundancy. The fusion gate independently discovers a cross-domain relevance hierarchy: mood emerges as the highest-weight field by epoch 1, before accuracy plateaus – consistent with independent mechanistic evidence that LLM emotion representations are structurally embedded along valence-arousal axes. SVAF forms Layer 4 of the Mesh Memory Protocol (MMP); the other half of the coupling engine is a per-agent Closed-form Continuous-time (CfC) neural network at Layer 6, whose learned per-neuron time constants (tau) create the temporal dynamics from which collective intelligence emerges: fast neurons synchronise affect across agents in seconds, while slow neurons preserve domain expertise indefinitely. SVAF determines what enters each agent’s cognitive state; CfC determines how that state evolves. Trained on 237K samples from 273 narrative scenarios, SVAF achieves 78.7% three-class accuracy. We verify the complete mesh cognition loop – from per-field evaluation through remix, CfC state evolution, tau-modulated peer blending, and autonomous action – in a live deployment with 7 nodes across macOS, iOS, and web.
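The band-pass idea (one relevance score per typed field, mapped to the four outcomes) can be sketched with fixed thresholds. The thresholds here are hypothetical, since the paper learns its gate, and the seven field names are placeholders: the abstract names only "mood" explicitly.

```python
def band_pass_outcome(relevance, low=0.3, high=0.6, redundant=0.9):
    """Band-pass classification of one semantic field of an incoming
    inter-agent signal (illustrative thresholds)."""
    if relevance >= redundant:
        return "redundant"   # receiver already holds this content
    if relevance >= high:
        return "aligned"     # absorb into the cognitive state
    if relevance >= low:
        return "guarded"     # hold back pending corroboration
    return "rejected"        # irrelevant dimension, dropped

# Hypothetical field names; only "mood" is named in the abstract.
FIELDS = ["mood", "topic", "intent", "entity", "action", "context", "confidence"]

def fuse(field_scores):
    """Apply the gate to all 7 typed fields of one signal."""
    return {f: band_pass_outcome(s) for f, s in zip(FIELDS, field_scores)}
```

Only "aligned" fields would enter the receiver's state, which is how the gate solves both selectivity (rejected/guarded) and redundancy (redundant) at once.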
[900] Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark
Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu, Weiwen Liu, Hiroki Kobayashi, Weinan Zhang
Main category: cs.MA
TL;DR: A framework for automatically converting digital assets into agents for the Agentic Web, with a benchmark for evaluation.
Details
Motivation: The Agentic Web paradigm requires digital assets to be converted into agents, but lacks automated methodologies for this agentization process, limiting wider adoption and advancement of the Agentic Web.
Method: Formalizes the A2A-Agentization process, develops an Agentization Agent to automate digital asset conversion, and creates A2A-Agentization Bench benchmark to evaluate agentization quality in terms of fidelity and interoperability.
Result: The approach effectively activates functional capabilities of digital assets and enables interoperable A2A multi-agent collaboration.
Conclusion: This work facilitates scalable and standardized integration of digital assets into the Agentic Web ecosystem.
Abstract: The Agentic Web, a new paradigm that redefines the internet through autonomous, goal-driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive web elements into agents, which expand the capacities and coverage of agents in the Agentic Web. The lack of automated methodologies for agent generation limits the wider usage of digital assets and the advancement of the Agentic Web. In this paper, we first formalize these challenges by strictly defining the A2A-Agentization process, decomposing it into critical stages and identifying key technical hurdles on top of the A2A protocol. Based on this framework, we develop an Agentization Agent to agentize digital assets for the Agentic Web. To rigorously evaluate this capability, we propose A2A-Agentization Bench, the first benchmark explicitly designed to evaluate agentization quality in terms of fidelity and interoperability. Our experiments demonstrate that our approach effectively activates the functional capabilities of digital assets and enables interoperable A2A multi-agent collaboration. We believe this work will further facilitate scalable and standardized integration of digital assets into the Agentic Web ecosystem.
[901] Agents for Agents: An Interrogator-Based Secure Framework for Autonomous Internet of Underwater Things
Ali Akarma, Toqeer Ali Syed, Abdul Khadar Jilani, Salman Jan, Hammad Muneer, Muazzam A. Khan, Changli Yu
Main category: cs.MA
TL;DR: A behavioral trust monitoring system for underwater multi-agent networks using lightweight transformer models and blockchain for secure, dynamic trust evaluation.
Details
Motivation: Current underwater multi-agent systems rely on static trust after authentication, leaving long missions vulnerable to compromised agents. There's a need for dynamic trust monitoring that doesn't interfere with agent autonomy.
Method: Proposes an interrogator-based structure with privileged modules that passively analyze communication metadata using lightweight transformer models to calculate dynamic trust scores. Trust evidence is stored in a permissioned blockchain consortium for tamper-proof identity management.
Result: Simulation shows 21.7% improvement in detection accuracy compared to static trust baselines with limited energy overhead. The system enables fast containment of suspicious agents while maintaining network continuity.
Conclusion: Behavior-driven validation can reinforce underwater coordination without compromising scalability and deployment, offering dynamic trust monitoring for secure multi-agent operations.
Abstract: Autonomous underwater vehicles (AUVs) and sensor nodes increasingly support decentralized sensing and coordination in the Internet of Underwater Things (IoUT), yet most deployments rely on static trust once authentication is established, leaving long-duration missions vulnerable to compromised or behaviorally deviating agents. In this paper, an interrogator-based structure is presented that incorporates behavioral trust monitoring into underwater multi-agent operation without interfering with autonomy. A privileged interrogator module passively analyzes communication metadata, using a lightweight transformer model to calculate dynamic trust scores that authorize the forwarding of mission-critical data. Suspicious agents trigger proportional monitoring and conditional restrictions, which allow fast containment while maintaining network continuity. Trust evidence is stored in a permissioned blockchain consortium that offers tamper-proof, decentralized identity management without the overhead of public consensus mechanisms. Simulation-based analysis shows a relative improvement of 21.7% in detection accuracy over static trust baselines, with limited energy overhead. These findings suggest that behavior-driven validation can reinforce underwater coordination without compromising scalability and deployability.
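In outline, the interrogator's trust pipeline could look like the sketch below: a running trust score updated from behavioral evidence, then a threshold-based forwarding decision for mission-critical traffic. The exponential-moving-average update, the thresholds, and the decision labels are illustrative assumptions; the paper uses a learned transformer scorer over communication metadata.

```python
def update_trust(trust, behavior_score, alpha=0.2):
    """EMA update of an agent's trust from a behavioral score in [0, 1]
    (1 = fully normal, 0 = highly anomalous)."""
    return (1 - alpha) * trust + alpha * behavior_score

def forwarding_decision(trust, critical=True, hi=0.7, lo=0.4):
    """Interrogator policy: gate traffic by current trust. Suspicious
    agents get proportional monitoring rather than hard exclusion."""
    if trust >= hi:
        return "forward"
    if trust >= lo:
        return "monitor" if critical else "forward"
    return "restrict"
```

The key property this preserves from the paper is graceful degradation: a mid-trust agent is contained (monitored, non-critical traffic still flows) instead of being dropped, keeping the network operating.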
[902] Decentralized Ergodic Coverage Control in Unknown Time-Varying Environments
Maria G. Mendoza, Victoria Marie Tuck, Chinmay Maheshwari, Shankar Sastry
Main category: cs.MA
TL;DR: Decentralized multi-agent coverage framework for UAVs in unknown, time-varying disaster environments using adaptive ergodic policies with Gaussian Process belief updates
Details
Motivation: Need for efficient multi-robot coverage in disaster response where UAVs must balance exploration of unobserved regions with monitoring of changing Regions of Interest (ROIs) under partial observability and time-varying conditions
Method: Decentralized multi-agent framework with adaptive ergodic policies implemented via Markov-chain transition models; Gaussian Processes for online belief updates over importance maps; agents spend time in ROIs proportional to estimated importance while maintaining exploration
Result: Framework addresses combined challenges of unknown, time-varying distributions in decentralized, partially observable settings; shows improved adaptability and transient performance compared to alternative coverage strategies in simulated disaster evolution scenarios
Conclusion: Proposed framework enables UAVs to continuously adapt trajectories in response to changing disaster environments, overcoming limitations of existing approaches that assume known maps, centralized coordination, or static environments
Abstract: A key challenge in disaster response is maintaining situational awareness of an evolving landscape, which requires balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs). Unmanned Aerial Vehicles (UAVs) have emerged as an effective response tool, particularly in applications like environmental monitoring and search-and-rescue, due to their ability to provide aerial coverage, withstand hazardous conditions, and navigate quickly and flexibly. However, efficient and adaptable multi-robot coverage with limited sensing in disaster settings and evolving time-varying information maps remains a significant challenge, necessitating better methods for UAVs to continuously adapt their trajectories in response to changes. In this paper, we propose a decentralized multi-agent coverage framework that serves as a high-level planning strategy for adaptive coverage in unknown, time-varying environments under partial observability. Each agent computes an adaptive ergodic policy, implemented via a Markov-chain transition model, that tracks a continuously updated belief over the underlying importance map. Gaussian Processes are used to perform those online belief updates. The resulting policy drives agents to spend time in ROIs proportional to their estimated importance, while preserving sufficient exploration to detect and adapt to time-varying environmental changes. Unlike existing approaches that assume known importance maps, require centralized coordination, or assume a static environment, our framework addresses the combined challenges of unknown, time-varying distributions in a more realistic decentralized and partially observable setting. We compare against alternative coverage strategies and analyze our method’s response to simulated disaster evolution, highlighting its improved adaptability and transient performance in dynamic scenarios.
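The importance-weighted Markov-chain policy can be sketched as a random walk whose transition probabilities are proportional to the estimated importance of neighboring cells, so that long-run time-in-region tracks the belief over the importance map. The small floor weight that preserves exploration of low-importance cells is an assumption; the paper's belief itself comes from Gaussian Process updates, which are omitted here.

```python
import random

def ergodic_step(pos, importance, neighbors, rng=random):
    """One step of an importance-weighted Markov-chain policy.
    importance: dict cell -> current belief of that cell's importance.
    neighbors: function cell -> list of reachable cells."""
    cand = neighbors(pos)
    # Floor each weight so low-importance cells are still visited
    # occasionally (keeps exploration for detecting change).
    weights = [max(importance[c], 1e-6) for c in cand]
    r = rng.random() * sum(weights)
    acc = 0.0
    for c, w in zip(cand, weights):
        acc += w
        if r <= acc:
            return c
    return cand[-1]
```

Because each transition is sampled in proportion to estimated importance, an agent visiting a cell rated 10x more important will, on average, spend roughly 10x more time there, which is the ergodic property the policy targets.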
[903] Statistical Model Checking of the Island Model: An Established Economic Agent-Based Model of Endogenous Growth
Stefano Blando, Giorgio Fagiolo, Daniele Giachini, Andrea Vandin, Ernest Ivanaj
Main category: cs.MA
TL;DR: Statistical model checking (SMC) with MultiVeStA provides formal statistical analysis for agent-based economic models, applied to the Island Model to validate stylized facts and parameter sensitivity with confidence intervals.
Details
Motivation: Agent-based models (ABMs) are widely used in economics but typically analyzed with ad-hoc Monte Carlo methods lacking formal statistical guarantees, creating a need for principled, reproducible methodologies.
Method: Applied statistical model checking (SMC) using MultiVeStA to analyze the Island Model of Fagiolo and Dosi, employing formal confidence intervals and Welch’s t-test for parameter comparisons.
Result: Reproduced key stylized facts with formal confidence intervals, confirmed optimal moderate exploration rates, and found 6 out of 7 parameter comparisons showed statistically different growth trajectories, revealing saturation effects in knowledge locality.
Conclusion: SMC offers a principled, reproducible methodology for quantitative analysis of agent-based economic models, moving beyond ad-hoc Monte Carlo approaches.
Abstract: Agent-based models (ABMs) are increasingly used to study complex economic phenomena such as endogenous growth, but their analysis typically relies on ad-hoc Monte Carlo exercises without formal statistical guarantees. We show how statistical model checking (SMC), and in particular Multi-VeStA, can automate and enrich the analysis of a seminal ABM: the Island Model of Fagiolo and Dosi, which captures the exploration-exploitation trade-off in technological search. We reproduce key stylized facts from the original model with formal confidence intervals, confirm the optimality of moderate exploration rates, and perform a counterfactual sensitivity analysis across returns to scale, skill transfer, and knowledge locality. Using MultiVeStA’s built-in Welch’s t-test, 6 out of 7 pairwise parameter comparisons yield statistically different growth trajectories, while the exception reveals a saturation effect in knowledge locality. Our results demonstrate that SMC offers a principled, reproducible methodology for the quantitative analysis of agent-based economic models.
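The Welch's t-test used for the pairwise parameter comparisons reduces to the statistic and Welch-Satterthwaite degrees of freedom below. This is the standard textbook formula, shown for reference; MultiVeStA's built-in implementation may differ in details such as p-value computation.

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    sx, sy = vx / nx, vy / ny                        # squared std errors
    t = (mx - my) / math.sqrt(sx + sy)
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df
```

Comparing growth trajectories under two parameter settings then amounts to feeding the per-run growth rates from each setting into `welch_t` and checking the statistic against the t distribution with `df` degrees of freedom.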
[904] Agentic Federated Learning: The Future of Distributed Training Orchestration
Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas, Allan M. de Souza
Main category: cs.MA
TL;DR: Agentic-FL framework uses Language Model-based Agents for autonomous orchestration in Federated Learning to address client heterogeneity and system dynamics through contextual reasoning and adaptive resource management.
Details
Motivation: Federated Learning faces challenges with stochastic client heterogeneity and unpredictable system dynamics, causing resource underutilization and systemic bias. Static optimization approaches fail to adapt to these fluctuations.
Method: Proposes Agentic-FL framework where Language Model-based Agents (LMagents) assume autonomous orchestration roles. Server-side agents mitigate selection bias through contextual reasoning, while client-side agents act as local guardians managing privacy budgets and adapting model complexity to hardware constraints.
Result: The framework enables evolution of FL towards decentralized ecosystems where collaboration is negotiated autonomously, paving the way for future markets of incentive-based models and algorithmic justice. Addresses reliability (hallucinations) and security challenges.
Conclusion: Agentic-FL represents a paradigm shift from rigid protocols to autonomous agent-based orchestration, with potential to create resilient multi-agent systems in federated environments through intelligent negotiation and adaptation.
Abstract: Although Federated Learning (FL) promises privacy and distributed collaboration, its effectiveness in real-world scenarios is often hampered by the stochastic heterogeneity of clients and unpredictable system dynamics. Existing static optimization approaches fail to adapt to these fluctuations, resulting in resource underutilization and systemic bias. In this work, we propose a paradigm shift towards Agentic-FL, a framework where Language Model-based Agents (LMagents) assume autonomous orchestration roles. Unlike rigid protocols, we demonstrate how server-side agents can mitigate selection bias through contextual reasoning, while client-side agents act as local guardians, dynamically managing privacy budgets and adapting model complexity to hardware constraints. More than just resolving technical inefficiencies, this integration signals the evolution of FL towards decentralized ecosystems, where collaboration is negotiated autonomously, paving the way for future markets of incentive-based models and algorithmic justice. We discuss the reliability (hallucinations) and security challenges of this approach, outlining a roadmap for resilient multi-agent systems in federated environments.
[905] Talk to Right Specialists: Iterative Routing in Multi-agent Systems for Question Answering
Feijie Wu, Zitao Li, Fei Wei, Yaliang Li, Bolin Ding, Jing Gao
Main category: cs.MA
TL;DR: RIRS is a training-free orchestration framework for multi-agent RAG systems that enables efficient query routing and iterative response aggregation for complex questions across distributed knowledge bases.
Details
Motivation: Addresses challenges in production RAG systems where knowledge bases are distributed due to sovereignty constraints, leading to users not knowing which agent to consult and complex questions requiring evidence from multiple agents.
Method: Summarizes each agent’s local corpus in embedding space for efficient query routing, uses iterative aggregation of responses for complex questions, and refines questions to bridge gaps toward comprehensive answers.
Result: Extensive experiments show RIRS effectively selects relevant agents, provides accurate responses to single-hop queries, and uses iterative strategies for accurate multi-step resolutions of complex queries.
Conclusion: RIRS provides a practical solution for orchestrating multi-agent RAG systems in distributed knowledge environments, improving both efficiency and accuracy for various query complexities.
Abstract: Retrieval-augmented generation (RAG) agents are increasingly deployed to answer questions over local knowledge bases that cannot be centralized due to knowledge-sovereignty constraints. This results in two recurring failures in production: users do not know which agent to consult, and complex questions require evidence distributed across multiple agents. To overcome these challenges, we propose RIRS, a training-free orchestration framework to enable a multi-agent system for question answering. In detail, RIRS summarizes each agent’s local corpus in an embedding space, enabling a user-facing server to route queries only to the most relevant agents, reducing latency and avoiding noisy “broadcast-to-all” contexts. For complicated questions, the server can iteratively aggregate responses to derive intermediate results and refine the question to bridge the gap toward a comprehensive answer. Extensive experiments demonstrate the effectiveness of RIRS, including its ability to precisely select agents and provide accurate responses to single-hop queries, and its use of an iterative strategy to achieve accurate, multi-step resolutions for complex queries.
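The embedding-space routing step can be sketched as cosine ranking of a query embedding against each agent's corpus-summary embedding, returning only the top-k agents instead of broadcasting to all. The `route` helper, its top-k signature, and the toy 2-d embeddings are illustrative assumptions about how such a server could be wired.

```python
import math

def cosine(a, b):
    """Cosine similarity of two (nonzero) embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def route(query_emb, agent_summaries, k=2):
    """Rank agents by similarity between the query embedding and each
    agent's corpus-summary embedding; return the top-k agent ids.
    agent_summaries: dict agent_id -> summary embedding."""
    ranked = sorted(agent_summaries.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [aid for aid, _ in ranked[:k]]
```

For multi-hop questions, the server would call `route` again on each refined sub-question, aggregating the returned agents' answers until the gap to a comprehensive answer is closed.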
[906] Lark: Biologically Inspired Neuroevolution for Multi-Stakeholder LLM Agents
Rikhil Tanugula, Dheeraj Chintapalli, Sunkalp Chandra
Main category: cs.MA
TL;DR: Lark is a biologically inspired decision-making framework combining LLM reasoning with evolutionary multi-agent systems, featuring plasticity, duplication/maturation, stakeholder voting, and compute awareness for efficient strategy generation.
Details
Motivation: Address verbosity and stakeholder trade-offs in LLM-driven decision-making by creating a biologically inspired framework that efficiently generates diverse strategies while considering multiple stakeholder perspectives and computational costs.
Method: Integrates four key mechanisms: (1) plasticity for concise solution adjustments, (2) duplication and maturation for copying and specializing high-performing candidates, (3) ranked-choice stakeholder aggregation using influence-weighted Borda scoring, and (4) compute awareness via token-based penalties for brevity. The system iteratively proposes strategies, applies tweaks, simulates evaluations, aggregates preferences, selects top candidates, and performs duplication/maturation while factoring compute costs.
Result: In 30-round evaluation comparing 14 systems, Lark Full achieved mean rank of 2.55 and mean composite score of 29.4/50, finishing Top-3 in 80% of rounds while remaining cost competitive ($0.016 per task). Ablation studies showed all four mechanisms contributed significantly, with duplication/maturation having the largest impact (ΔScore = 3.5), followed by plasticity (ΔScore = 3.4), ranked-choice voting (ΔScore = 2.4), and token penalties (ΔScore = 2.2).
Conclusion: Lark presents a practical, compute-aware neuroevolutionary loop for scalable stakeholder-aligned strategy generation with transparent trade-offs, offering proof-of-concept findings and inviting community feedback for real-world validation.
Abstract: We present Lark, a biologically inspired decision-making framework that couples LLM-driven reasoning with an evolutionary, stakeholder-aware Multi-Agent System (MAS). To address verbosity and stakeholder trade-offs, we integrate four mechanisms: (i) plasticity, which applies concise adjustments to candidate solutions; (ii) duplication and maturation, which copy high-performing candidates and specialize them into new modules; (iii) ranked-choice stakeholder aggregation using influence-weighted Borda scoring; and (iv) compute awareness via token-based penalties that reward brevity. The system iteratively proposes diverse strategies, applies plasticity tweaks, simulates stakeholder evaluations, aggregates preferences, selects top candidates, and performs duplication/maturation while factoring compute cost into final scores. In a controlled evaluation over 30 rounds comparing 14 systems, Lark Full achieves a mean rank of 2.55 (95% CI [2.17, 2.93]) and a mean composite score of 29.4/50 (95% CI [26.34, 32.46]), finishing Top-3 in 80% of rounds while remaining cost competitive with leading commercial models ($0.016 per task). Paired Wilcoxon tests confirm that all four mechanisms contribute significantly as ablating duplication/maturation yields the largest deficit (ΔScore = 3.5, Cohen’s d_z = 2.53, p < 0.001), followed by plasticity (ΔScore = 3.4, d_z = 1.86), ranked-choice voting (ΔScore = 2.4, d_z = 1.20), and token penalties (ΔScore = 2.2, d_z = 1.63). Rather than a formal Markov Decision Process with constrained optimization, Lark is a practical, compute-aware neuroevolutionary loop that scales stakeholder-aligned strategy generation and makes trade-offs transparent through per-step metrics. Our work presents proof-of-concept findings and invites community feedback as we expand toward real-world validation studies.
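Mechanism (iii), influence-weighted Borda scoring, is a standard ranked-choice aggregation: each stakeholder's ranking awards points by position, scaled by that stakeholder's influence weight. A minimal sketch (the exact weighting and tie-breaking conventions are assumptions, not the paper's formula):

```python
def weighted_borda(rankings: dict, influence: dict):
    """rankings: {stakeholder: [candidates, best to worst]};
    influence: {stakeholder: weight}. Returns (winner, scores)."""
    scores = {}
    for voter, ranked in rankings.items():
        n = len(ranked)
        for pos, cand in enumerate(ranked):
            # Borda points: n-1 for first place, down to 0 for last,
            # scaled by the stakeholder's influence weight
            scores[cand] = scores.get(cand, 0.0) + influence[voter] * (n - 1 - pos)
    return max(scores, key=scores.get), scores
```

For example, with rankings {"ceo": ["A", "B", "C"], "eng": ["B", "A", "C"]} and influence {"ceo": 2, "eng": 1}, candidate A wins with a weighted score of 5 against B's 4.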
[907] When AI Agents Collude Online: Financial Fraud Risks by Collaborative LLM Agents on Social Platforms
Qibing Ren, Zhijie Zheng, Jiaxuan Guo, Junchi Yan, Lizhuang Ma, Jing Shao
Main category: cs.MA
TL;DR: Study of financial fraud risks in LLM-powered multi-agent systems, introducing MultiAgentFraudBench benchmark with 28 fraud scenarios and analyzing factors affecting fraud success and mitigation strategies.
Details
Motivation: To investigate the emerging risks of collective financial fraud in large-scale multi-agent systems powered by LLM agents, understanding how agents can collaborate in fraudulent behaviors and how such collaboration amplifies risks in realistic online interaction scenarios.
Method: Developed MultiAgentFraudBench, a large-scale benchmark for simulating financial fraud scenarios based on realistic online interactions covering 28 typical fraud scenarios across public and private domains. Analyzed key factors affecting fraud success including interaction depth, activity level, and fine-grained collaboration failure modes.
Result: Found that malicious agents can collaborate effectively in fraudulent behaviors, adapt to environmental interventions, and that fraud success is influenced by interaction depth, activity level, and collaboration patterns. Proposed mitigation strategies including content-level warnings, LLM-based monitoring, and group resilience through information sharing.
Conclusion: The study highlights real-world risks of multi-agent financial fraud in LLM-powered systems and suggests practical mitigation measures, emphasizing the need for monitoring and intervention strategies as these systems become more prevalent.
Abstract: In this work, we study the risks of collective financial fraud in large-scale multi-agent systems powered by large language model (LLM) agents. We investigate whether agents can collaborate in fraudulent behaviors, how such collaboration amplifies risks, and what factors influence fraud success. To support this research, we present MultiAgentFraudBench, a large-scale benchmark for simulating financial fraud scenarios based on realistic online interactions. The benchmark covers 28 typical online fraud scenarios, spanning the full fraud lifecycle across both public and private domains. We further analyze key factors affecting fraud success, including interaction depth, activity level, and fine-grained collaboration failure modes. Finally, we propose a series of mitigation strategies, including adding content-level warnings to fraudulent posts and dialogues, using LLMs as monitors to block potentially malicious agents, and fostering group resilience through information sharing at the societal level. Notably, we observe that malicious agents can adapt to environmental interventions. Our findings highlight the real-world risks of multi-agent financial fraud and suggest practical measures for mitigating them. Code is available at https://github.com/zheng977/MutiAgent4Fraud.
[908] Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang
Main category: cs.MA
TL;DR: Multi-agent systems with large generative models exhibit emergent social risks like collusion and conformity when competing for shared resources or collaborating sequentially, despite individual safeguards.
Details
Motivation: As multi-agent systems with large generative models move from prototypes to real-world deployments, understanding emergent collective failure modes beyond individual agent risks becomes critical for safe deployment.
Method: A pioneering study of emergent multi-agent risks in workflows involving competition over shared resources, sequential handoff collaboration, collective decision aggregation, and other interaction patterns, examined across repeated trials and a range of conditions.
Result: Collusion-like coordination and conformity emerge with non-trivial frequency under realistic constraints, mirroring human societal pathologies despite no explicit instruction, and cannot be prevented by existing agent-level safeguards alone.
Conclusion: Multi-agent systems exhibit “social intelligence risk” where agent collectives spontaneously reproduce familiar failure patterns from human societies, revealing a dark side of intelligent multi-agent systems that requires new safeguards.
Abstract: Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi-agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.
cs.MM
[909] Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
Donghuo Zeng, Hao Niu, Masato Taya
Main category: cs.MM
TL;DR: HSC-MAE is a hierarchical semantic correlation-aware masked autoencoder framework that learns aligned multimodal embeddings from weakly paired, label-free audio-visual data through three complementary representation levels.
Details
Motivation: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging due to pre-extracted features, clips containing multiple events, and spurious co-occurrences. Existing methods struggle with these limitations.
Method: Dual-path teacher-student framework with three hierarchical correlation levels: (1) global-level canonical-geometry correlation via DCCA for modality-invariant subspace alignment, (2) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities for multi-positive relational structure, and (3) sample-level conditional-sufficiency correlation via masked autoencoding for discriminative semantic content retention.
Result: Experiments on AVE and VEGAS datasets demonstrate substantial mAP improvements over strong unsupervised baselines, validating robust and well-structured audio-visual representations.
Conclusion: HSC-MAE effectively learns aligned multimodal embeddings from weakly paired, label-free data through hierarchical semantic correlations, outperforming existing unsupervised methods on audio-visual understanding tasks.
Abstract: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences arise. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation - from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
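The affinity-weighted soft top-k InfoNCE term in (ii) can be sketched in NumPy as below: teacher-teacher affinities define a soft target distribution over each sample's top-k neighbors, and the student's similarities to the teacher are pushed toward it. The temperature, top-k selection, and inclusion of self in the neighborhood are assumptions rather than the paper's exact objective.

```python
import numpy as np

def _softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_topk_infonce(student: np.ndarray, teacher: np.ndarray,
                      k: int = 2, tau: float = 0.1) -> float:
    """Cross-entropy between student-teacher similarities and a teacher-mined
    soft top-k positive distribution (one row per sample)."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau            # student-to-teacher similarities
    aff = t @ t.T                     # teacher affinity matrix (soft positives)
    targets = np.zeros_like(aff)
    for i, row in enumerate(aff):
        idx = np.argsort(row)[-k:]    # top-k neighbours, self included
        targets[i, idx] = _softmax(row[idx] / tau)
    # stable row-wise log-softmax of the student logits
    log_probs = logits - logits.max(axis=1, keepdims=True)
    log_probs -= np.log(np.exp(log_probs).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())
```

A student aligned with its teacher incurs a much smaller loss than a permuted (misaligned) one, which is the gradient signal the framework relies on.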
[910] Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang
Main category: cs.MM
TL;DR: A comprehensive survey of document parsing research, covering both traditional pipeline approaches and modern VLM-based unified models, with analysis of evaluation metrics, benchmarks, and future challenges.
Details
Motivation: Document parsing is crucial for transforming unstructured/semi-structured documents into machine-readable formats for applications like knowledge base construction and RAG. There's a need to systematically organize and review the rapidly evolving field, especially with the rise of VLMs.
Method: Proposes a taxonomy organizing approaches into: 1) modular pipeline-based systems (layout analysis, text recognition, table parsing, math expression recognition, visual element understanding), and 2) unified models driven by Vision-Language Models. Systematically tracks evolution of specialized VLMs for document parsing.
Result: Provides comprehensive review of document parsing research, including detailed analysis of key components, evolution of VLM-based approaches, evaluation metrics, and high-quality benchmarks that establish current standards.
Conclusion: Identifies key open challenges: robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency. Outlines directions for building more accurate and scalable document intelligence systems.
Abstract: Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG). This survey provides a comprehensive and timely review of document parsing research. We propose a systematic taxonomy that organizes existing approaches into modular pipeline-based systems and unified models driven by Vision-Language Models (VLMs). We provide a detailed review of key components in pipeline systems, including layout analysis and the recognition of heterogeneous content such as text, tables, mathematical expressions, and visual elements, and then systematically track the evolution of specialized VLMs for document parsing. Additionally, we summarize widely adopted evaluation metrics and high-quality benchmarks that establish current standards for parsing quality. Finally, we discuss key open challenges, including robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency, and outline directions for building more accurate and scalable document intelligence systems.
eess.AS
[911] Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S
Ranjith M. S., Akshat Mandloi, Sudarshan Kamath
Main category: eess.AS
TL;DR: Lightning V2 is a production-grade TTS model optimized for Tenstorrent hardware that achieves high computational fidelity with low-precision formats (BFP8, LoFi) without audio quality degradation, enabling 4x lower accelerator costs compared to NVIDIA L40S.
Details
Motivation: TTS models are more numerically fragile than LLMs due to continuous waveform generation and perceptual sensitivity. While aggressive precision reduction works well for language models, applying similar strategies to TTS systems causes audible artifacts, phase instability, and spectral distortion. There's a need for hardware-software co-optimization to enable efficient low-precision TTS inference while maintaining audio quality.
Method: Precision-aware architectural design and hardware-software co-optimization for Tenstorrent hardware. Leverages Tenstorrent's Network-on-Chip (NoC), distributed SRAM, and deterministic execution model to reduce memory movement and redundant weight fetches. Achieves over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment.
Result: Achieves approximately 4x lower on-prem accelerator cost at equivalent throughput compared to NVIDIA L40S baseline while maintaining production audio fidelity. No measurable degradation in audio quality despite aggressive precision reduction.
Conclusion: Precision co-design combined with hardware-aware optimization can fundamentally reshape the economics of real-time speech inference. Demonstrates that TTS models can be effectively optimized for low-precision hardware without sacrificing audio quality through careful architectural and hardware co-design.
Abstract: Text-to-Speech (TTS) models are significantly more numerically fragile than Large Language Models (LLMs) due to their continuous waveform generation and perceptual sensitivity to small numerical perturbations. While aggressive precision reduction techniques such as BlockFloat8 (BFP8) and low-fidelity (LoFi) compute have been widely adopted in language models, applying similar strategies to TTS systems often results in audible artifacts, phase instability, and spectral distortion. In this work, we present Lightning V2, a production-grade TTS model co-optimized for Tenstorrent hardware. Through precision-aware architectural design and hardware-software co-optimization, we achieve over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable degradation in audio quality. Leveraging Tenstorrent’s Network-on-Chip (NoC), distributed SRAM, and deterministic execution model, we reduce memory movement and redundant weight fetches, enabling efficient low-precision inference. Compared to an NVIDIA L40S baseline, Lightning V2 achieves approximately 4x lower on-prem accelerator cost at equivalent throughput, while maintaining production audio fidelity. Our results demonstrate that precision co-design, combined with hardware-aware optimization, can fundamentally reshape the economics of real-time speech inference.
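Block floating point, the family that BlockFloat8 above belongs to, stores one shared exponent per block of values with a narrow per-value mantissa, trading per-element dynamic range for memory savings. A simplified stand-in (the block size and 7-bit signed mantissa here are assumptions, not Tenstorrent's exact BFP8 layout):

```python
import numpy as np

def bfp8_quantize(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize-dequantize with one shared exponent per block and
    7-bit signed mantissas (simplified block-floating-point)."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad))
    out = np.empty_like(xp)
    for i in range(0, len(xp), block):
        chunk = xp[i:i + block]
        max_abs = np.abs(chunk).max()
        if max_abs == 0.0:
            out[i:i + block] = 0.0
            continue
        shared_exp = np.floor(np.log2(max_abs))     # one exponent per block
        scale = 2.0 ** (shared_exp - 5)             # mantissas land in roughly [-64, 64)
        mant = np.clip(np.round(chunk / scale), -64, 63)
        out[i:i + block] = mant * scale
    return out[:len(x)]
```

Small values sharing a block with a large one lose relative precision — which is exactly why the abstract's point about perceptual fragility matters: TTS spectra have wide dynamic range within a block.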
[912] MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting
Lo-Ya Li, Tien-Hong Lo, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen
Main category: eess.AS
TL;DR: MALEFA is a lightweight zero-shot keyword spotting framework that learns utterance- and phoneme-level alignments via cross-attention and multi-granularity contrastive learning, achieving high accuracy with low false alarm rates for user-defined keywords without pre-labeled training data.
Details
Motivation: The paper addresses the challenge of user-defined keyword spotting without domain-specific pre-labeled training data, which is crucial for adaptable and personalized voice interfaces. Current systems face issues with computational constraints, limited annotated data, and difficulty distinguishing acoustically similar keywords leading to high false alarm rates.
Method: MALEFA uses a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments through cross-attention mechanisms and a multi-granularity contrastive learning objective. This approach enables effective keyword recognition without requiring pre-labeled training data.
Result: On four public benchmark datasets, MALEFA achieves 90% accuracy and significantly reduces false alarm rate to 0.007% on the AMI dataset. The framework demonstrates high computational efficiency and supports real-time deployment on resource-constrained devices.
Conclusion: MALEFA provides an effective solution for zero-shot keyword spotting that addresses key challenges in user-defined keyword recognition, offering high accuracy, low false alarm rates, and computational efficiency suitable for real-world deployment on constrained devices.
Abstract: User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems are still faced with arduous challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a pesky false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we put forward MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90%, significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA demonstrates high computational efficiency and can readily support real-time deployment on resource-constrained devices.
[913] AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
Tianhua Qi, Wenming Zheng, Björn W. Schuller, Zhaojie Luo, Haizhou Li
Main category: eess.AS
TL;DR: AffectSpeech is a large-scale emotional speech dataset with fine-grained natural language annotations across six dimensions, created using human-LLM collaborative annotation, enabling improved speech emotion captioning and synthesis.
Details
Motivation: Existing speech emotion modeling relies on limited predefined categories or low-dimensional attributes, lacking expressive capacity. Textual descriptions offer more flexible representation but progress is hindered by lack of datasets with reliable fine-grained natural language annotations.
Method: Created AffectSpeech corpus with human-recorded speech annotated across six dimensions: sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content. Used human-LLM collaborative annotation pipeline with algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Reformulated annotations into diverse descriptive styles.
Result: Models trained on AffectSpeech consistently achieve superior performance in speech emotion captioning and synthesis across multiple evaluation settings compared to existing approaches.
Conclusion: AffectSpeech provides a valuable resource for fine-grained emotion analysis and generation in speech, demonstrating that structured natural language annotations enable more expressive and interpretable speech emotion modeling.
Abstract: Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
[914] Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee
Main category: eess.AS
TL;DR: FDB-v3 benchmark evaluates spoken language models on real human audio with disfluencies and multi-step tool use across six model configurations, finding GPT-Realtime leads in accuracy and interruption avoidance while Gemini Live 3.1 has fastest latency.
Details
Motivation: Existing benchmarks for spoken language models often use synthetic or clean speech, lacking evaluation under naturalistic conditions with real human audio containing disfluencies and requiring multi-step tool use scenarios.
Method: Created FDB-v3 benchmark with real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. Evaluated six model configurations (GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline) across accuracy, latency, and turn-taking dimensions.
Result: GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves fastest latency (4.25s) but lowest turn-take rate (78.0%); Cascaded baseline has perfect turn-take rate but highest latency (10.12s). Self-correction handling and multi-step reasoning under hard scenarios remain consistent failure modes across all systems.
Conclusion: FDB-v3 provides comprehensive evaluation of spoken language models under naturalistic conditions, revealing trade-offs between accuracy, latency, and turn-taking capabilities, with persistent challenges in handling self-corrections and complex multi-step reasoning.
Abstract: We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations – GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper→GPT-4o→TTS) – across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
[915] SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement
Xingchen Li, Hanke Xie, Ziqian Wang, Zihan Zhang, Longshuai Xiao, Shuai Wang, Lei Xie
Main category: eess.AS
TL;DR: SenSE: A two-stage generative universal speech enhancement framework that uses language models to model semantic priors and flow matching to generate semantically faithful speech, addressing semantic inconsistency issues in existing methods.
Details
Motivation: Existing generative speech enhancement methods often suffer from semantic inconsistency in generated outputs, lacking context fidelity despite improving speech quality.
Method: Two-stage framework: 1) Models semantic priors with a language model, 2) Uses flow matching-based speech enhancement guided by semantic priors. Introduces dual-path masked conditioning training to integrate multi-source signals (degraded speech, semantic tokens, reference speech).
Result: Achieves state-of-the-art performance among generative speech enhancement models, shows high performance ceiling especially under challenging distortion conditions.
Conclusion: SenSE effectively improves context fidelity in speech enhancement by integrating semantic modeling with generative enhancement, offering better flexibility and adaptability.
Abstract: Generative Universal Speech Enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. However, existing generative speech enhancement methods often suffer from semantic inconsistency in the generated outputs. Therefore, we propose SenSE, a novel two-stage generative universal speech enhancement framework, by modeling semantic priors with a language model, the flow matching-based speech enhancement process is guided to generate semantically faithful speech, thereby effectively improving context fidelity. In addition, we introduce a dual-path masked conditioning training strategy that enables flow matching-based enhancement to flexibly integrate multi-source conditioning signals from degraded speech, semantic tokens, and reference speech, thereby improving model flexibility and adaptability. Experimental results demonstrate that SenSE achieves state-of-the-art performance among generative speech enhancement models and exhibits a high performance ceiling, particularly under challenging distortion conditions. Codes and demos are available at https://github.com/ASLP-lab/SenSE.
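Flow matching, the generative backbone of SenSE's second stage, trains a velocity field by simple regression along an interpolation path between noise and data. A minimal rectified-flow sketch of the training objective — the `model` interface and `cond` argument are placeholders, not SenSE's architecture:

```python
import numpy as np

def flow_matching_loss(model, x1: np.ndarray, cond, rng) -> float:
    """Rectified-flow form of conditional flow matching: regress the model's
    predicted velocity at (x_t, t, cond) onto the straight-line target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)        # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))    # per-example time in [0, 1)
    xt = (1 - t) * x0 + t * x1                # linear interpolation x_t
    target_v = x1 - x0                        # constant velocity along the line
    pred_v = model(xt, t, cond)
    return float(np.mean((pred_v - target_v) ** 2))
```

At inference, integrating the learned velocity field from noise toward t = 1 yields the enhanced waveform, with the semantic tokens entering through the conditioning path.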
[916] Noise-Robust Contrastive Learning with an MFCC-Conformer For Coronary Artery Disease Detection
Milan Marocchi, Matthew Fynn, Yue Rong
Main category: eess.AS
TL;DR: Multichannel audio-based CAD detection using noisy-segment rejection and conformer classifier improves noise robustness in real-world PCG signals.
Details
Motivation: Cardiovascular diseases are leading cause of death, with CAD detection using PCG signals showing promise but struggling with real-world noise robustness despite multichannel techniques.
Method: Novel multichannel energy-based noisy-segment rejection algorithm using heart and noise-reference microphones to discard noisy audio segments before training a conformer-based deep learning classifier on MFCCs from multiple channels.
Result: Achieved 78.4% accuracy and 78.2% balanced accuracy on 297 subjects, representing 4.1% and 4.3% improvements respectively compared to training without noisy-segment rejection.
Conclusion: The proposed method effectively improves noise robustness for CAD detection in real-world PCG signals through multichannel noisy-segment rejection and conformer-based classification.
Abstract: Cardiovascular diseases (CVD) are the leading cause of death worldwide, with coronary artery disease (CAD) comprising the largest subcategory of CVDs. Recently, there has been increased focus on detecting CAD using phonocardiogram (PCG) signals, with high success in clinical environments with low noise and optimal sensor placement. Multichannel techniques have been found to be more robust to noise; however, achieving robust performance on real-world data remains a challenge. This work utilises a novel multichannel energy-based noisy-segment rejection algorithm, using heart and noise-reference microphones, to discard audio segments with large amounts of nonstationary noise before training a deep learning classifier. This conformer-based classifier takes mel-frequency cepstral coefficients (MFCCs) from multiple channels, further helping improve the model’s noise robustness. The proposed method achieved 78.4% accuracy and 78.2% balanced accuracy on 297 subjects, representing improvements of 4.1% and 4.3%, respectively, compared to training without noisy-segment rejection.
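The energy-based noisy-segment rejection can be sketched by comparing segment energies between the heart channel and the noise-reference channel; the segment length and dB threshold below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def reject_noisy_segments(heart: np.ndarray, noise_ref: np.ndarray,
                          sr: int = 2000, seg_s: float = 1.0,
                          ratio_db: float = 6.0) -> list:
    """Keep only heart-channel segments whose energy exceeds the
    noise-reference channel's energy by at least `ratio_db` dB."""
    seg = int(sr * seg_s)
    kept = []
    for i in range(0, min(len(heart), len(noise_ref)) - seg + 1, seg):
        e_heart = np.sum(heart[i:i + seg] ** 2) + 1e-12
        e_noise = np.sum(noise_ref[i:i + seg] ** 2) + 1e-12
        if 10 * np.log10(e_heart / e_noise) >= ratio_db:
            kept.append(heart[i:i + seg])
    return kept
```

Segments dominated by nonstationary ambient noise are discarded before MFCC extraction, so the conformer trains only on segments where the heart signal is audible above the reference.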
[917] Validating Computational Markers of Depressive Behavior: Cross-Linguistic Speech-Based Depression Detection with Neurophysiological Validation
Fuxiang Tao, Dongwei Li, Shuning Tang, Xuri Ge, Wei Ma, Anna Esposito, Alessandro Vinciarelli
Main category: eess.AS
TL;DR: CDMA framework for depression detection shows cross-linguistic robustness from Italian to Chinese, with emotional arousal (both positive/negative valence) outperforming neutral speech, and EEG validation links speech-derived depression estimates to neural oscillatory patterns.
Details
Motivation: To investigate cross-linguistic robustness of acoustic markers for depression detection and establish neurobiological validation by correlating speech-based predictions with neural oscillatory patterns during emotional processing.
Method: Extended Cross-Data Multilevel Attention (CDMA) framework to Chinese Mandarin dataset with EEG recordings, fusing read and spontaneous speech across emotional valences (positive, neutral, negative), and correlating model predictions with neural oscillatory patterns during emotional face processing.
Result: Achieved state-of-the-art performance (F1-score up to 89.6%) on Chinese dataset comparable to Italian validation; emotionally valenced speech significantly outperformed neutral speech; EEG analysis revealed significant correlations between speech-derived depression estimates and theta/alpha band neural oscillatory patterns.
Conclusion: The CDMA framework demonstrates cross-linguistic robustness and neurobiological validation, supporting emotional arousal hypothesis and establishing a novel paradigm for neurophysiological validation of computational mental health models.
Abstract: Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation for a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, which is comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech. This comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model’s speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model’s cross-linguistic robustness, not only supports the CDMA framework as a universally applicable and neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.
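The neurophysiological validation step correlates speech-derived depression estimates with theta/alpha band power. The abstract reports correlations but does not name the estimator; a minimal Pearson correlation, the standard statistic for such analyses (an assumption here), can be computed as:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences,
    e.g. per-subject depression scores vs. EEG band power."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```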
[918] PhiNet: Speaker Verification with Phonetic Interpretability
Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li
Main category: eess.AS
TL;DR: PhiNet is an interpretable speaker verification network that uses phonetic evidence to provide transparent, human-understandable explanations for verification decisions, bridging ASV with forensic analysis.
Details
Motivation: Current ASV systems lack transparency needed for high-accountability applications. The paper is motivated by how human experts perform forensic speaker comparison using phonetic evidence, aiming to create interpretable speaker verification.
Method: Proposes PhiNet, a speaker verification network with phonetic interpretability that leverages phonetic evidence in decision-making. Provides both local (phonetic-level) and global interpretability, enabling manual inspection of speaker-specific features and explicit reasoning behind verification decisions.
Result: PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful interpretable explanations. Evaluated on VoxCeleb, SITW, and LibriSpeech datasets with both qualitative and quantitative assessments of interpretability methods.
Conclusion: PhiNet bridges the gap between ASV and forensic analysis by providing interpretable speaker verification with phonetic-level transparency, enabling both user inspection and developer debugging while maintaining competitive performance.
Abstract: Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet’s interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.
eess.IV
[919] NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
Tiberio Uricchio, Marco Bertini
Main category: eess.IV
TL;DR: NeuralLVC is a neural lossless video codec using masked diffusion with I/P-frame architecture for temporal redundancy exploitation, achieving better compression than H.264/H.265 lossless while guaranteeing exact pixel reconstruction.
Details
Motivation: While neural lossless image compression has advanced significantly, neural lossless video compression remains largely unexplored. The paper aims to develop a neural approach for lossless video compression that can outperform traditional codecs like H.264 and H.265 in lossless mode.
Method: Combines masked diffusion with I/P-frame architecture: I-frame model compresses individual frames using bijective linear tokenization for exact pixel reconstruction; P-frame model compresses temporal differences between consecutive frames conditioned on previous decoded frame via lightweight reference embedding (only 1.3% additional parameters). Uses group-wise decoding for controllable speed-compression trade-offs.
Result: Outperforms H.264 and H.265 lossless by a significant margin on 9 Xiph CIF sequences. Verifies exact reconstruction through end-to-end encode-decode testing with arithmetic coding. The codec is lossless in the input domain: it reconstructs YUV420 planes exactly for video and RGB channels exactly for images.
Conclusion: Masked diffusion with temporal conditioning is a promising direction for neural lossless video compression. NeuralLVC demonstrates the effectiveness of neural approaches for lossless video compression while maintaining exact reconstruction guarantees.
Abstract: While neural lossless image compression has advanced significantly with learned entropy models, lossless video compression remains largely unexplored in the neural setting. We present NeuralLVC, a neural lossless video codec that combines masked diffusion with an I/P-frame architecture for exploiting temporal redundancy. Our I-frame model compresses individual frames using bijective linear tokenization that guarantees exact pixel reconstruction. The P-frame model compresses temporal differences between consecutive frames, conditioned on the previous decoded frame via a lightweight reference embedding that adds only 1.3% trainable parameters. Group-wise decoding enables controllable speed-compression trade-offs. Our codec is lossless in the input domain: for video, it reconstructs YUV420 planes exactly; for image evaluation, RGB channels are reconstructed exactly. Experiments on 9 Xiph CIF sequences show that NeuralLVC outperforms H.264 and H.265 lossless by a significant margin. We verify exact reconstruction through end-to-end encode-decode testing with arithmetic coding. These results suggest that masked diffusion with temporal conditioning is a promising direction for neural lossless video compression.
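The lossless P-frame step can be illustrated in isolation. The sketch below shows only the temporal residual and its exact reconstruction (a mod-256 wrap keeps symbols in 8 bits); the paper additionally models these residuals with a masked-diffusion entropy model and arithmetic coding, which is omitted here, and the function names are hypothetical.

```python
import numpy as np

def encode_p_frame(frame, prev_decoded):
    """Temporal residual between consecutive uint8 frames; mod-256 wrap
    keeps the symbol alphabet at 8 bits."""
    return (frame.astype(np.int16) - prev_decoded.astype(np.int16)) % 256

def decode_p_frame(residual, prev_decoded):
    """Invert the residual exactly, guaranteeing lossless reconstruction."""
    return ((residual + prev_decoded.astype(np.int16)) % 256).astype(np.uint8)
```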
[920] DRIFT: Deep Restoration, ISP Fusion, and Tone-mapping
Soumendu Majee, Joshua Peter Ebenezer, Abhinau K. Venkataramanan, Weidi Liu, Thilo Balke, Zeeshan Nadir, Sreenithy Chandran, Seok-Jun Lee, Hamid Rahim Sheikh
Main category: eess.IV
TL;DR: DRIFT is an efficient AI mobile camera pipeline that uses deep learning for multi-frame processing and tone-mapping to generate high-quality RGB images from raw smartphone captures.
Details
Motivation: Smartphone cameras need high-performance Image Signal Processors (ISPs) to generate high-quality images from raw captures while maintaining low computational costs for mobile devices.
Method: Two-stage approach: 1) DRIFT-MFP uses adversarial perceptual loss for multi-frame alignment, denoising, demosaicing, and super-resolution; 2) DRIFT-TM provides deep-learning based tone-mapping with tone tunability and consistency with reference pipelines.
Result: Qualitative and quantitative comparisons show DRIFT outperforms state-of-the-art MFP and tone-mapping methods in generating high-quality images from raw smartphone captures.
Conclusion: DRIFT provides an efficient AI solution for mobile camera pipelines that can generate high-quality RGB images from raw captures while being computationally feasible for mobile devices.
Abstract: Smartphone cameras have gained immense popularity with the adoption of high-resolution and high-dynamic range imaging. As a result, high-performance camera Image Signal Processors (ISPs) are crucial in generating high-quality images for the end user while keeping computational costs low. In this paper, we propose DRIFT (Deep Restoration, ISP Fusion, and Tone-mapping): an efficient AI mobile camera pipeline that generates high-quality RGB images from hand-held raw captures. The first stage of DRIFT is a Multi-Frame Processing (MFP) network that is trained using an adversarial perceptual loss to perform multi-frame alignment, denoising, demosaicing, and super-resolution. Then, the output of DRIFT-MFP is processed by a novel deep-learning based tone-mapping (DRIFT-TM) solution that allows for tone tunability, ensures tone-consistency with a reference pipeline, and can be run efficiently for high-resolution images on a mobile device. We show qualitative and quantitative comparisons against state-of-the-art MFP and tone-mapping methods to demonstrate the effectiveness of our approach.
[921] Provable and Robust Wavefront Sensing via Self-Reference Interferometry
Nebiyou Yismaw, Vishwanath Saragadam, Aswin C. Sankaranarayanan, M. Salman Asif
Main category: eess.IV
TL;DR: A novel self-reference wavefront sensing framework that uses interference between shifted copies of incoming waves to recover phase information without needing a stable reference beam, with theoretical guarantees and experimental validation.
Details
Motivation: Conventional wavefront sensing methods like phase-shifting interferometry require stable reference beams that are difficult to implement in practical settings, limiting their applicability in real-world imaging scenarios.
Method: Proposes a self-reference framework using interference between shifted copies of incoming waves, creating pairwise phase differences between shifted pixels. Formulates analytical solution for complete phase retrieval based on propagation of these differences across a connected graph, with theoretical analysis of optimal measurement patterns using co-prime shifts.
Result: Complete phase profiles can be recovered from as few as eight shifted measurements, outperforming existing approaches. Hardware prototype validates the framework for optical phase profile recovery, auto-refocusing, and imaging through scattering media.
Conclusion: The proposed self-reference framework provides a robust, practical solution for wavefront sensing without requiring stable reference beams, with theoretical guarantees and experimental validation across multiple imaging applications.
Abstract: Wavefront sensing involves estimating the phase and intensity of light, enabling a wide range of imaging applications, from adaptive optics and astronomy to biomedical imaging. Since conventional image sensors can only measure the spatial intensity distribution, phase retrieval arises as the central problem in wavefront sensing. Conventional interferometric approaches like phase-shifting interferometry (PSI) can recover phase information, but they rely on a stable reference beam that is difficult to realize in practical settings. To overcome this limitation, we propose a novel self-reference framework that relies on interference between shifted copies of the incoming wave; this results in pairwise phase differences between shifted pixels. We formulate an analytical solution for the complete phase retrieval based on the propagation of these differences across a connected graph. Furthermore, we provide a theoretical analysis of optimal measurement patterns, proving that co-prime shifts guarantee a connected graph and bound worst-case error accumulation, yielding a provably robust method. Extensive simulations demonstrate that complete phase profiles can be recovered from as few as eight shifted measurements, outperforming several existing approaches. Finally, we validate our framework using a hardware prototype, demonstrating real experiments for optical phase profile recovery, auto-refocusing, and imaging through scattering media.
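The co-prime-shift connectivity claim is easy to check on a toy example: treating pixels as graph nodes linked by ± each shift, co-prime shifts reach every pixel, while shifts with a common factor leave gaps. The sketch below is a 1-D reduction for illustration, not the paper's 2-D formulation.

```python
from collections import deque

def shifts_connect(n, shifts):
    """BFS over a 1-D pixel line where pixel i links to i±s for each shift s.
    Returns True when every pixel is reachable from pixel 0, i.e. the
    pairwise phase differences can be propagated to the whole line."""
    seen = {0}
    q = deque([0])
    while q:
        i = q.popleft()
        for s in shifts:
            for j in (i + s, i - s):
                if 0 <= j < n and j not in seen:
                    seen.add(j)
                    q.append(j)
    return len(seen) == n
```

With co-prime shifts (3, 4) the graph is connected; with shifts (4, 6), whose gcd is 2, only every other pixel is reachable and the phase cannot be propagated everywhere.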
[922] UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation
Haofeng Liu, Ziyue Wang, Alex Y. W. Kong, Guanyi Qin, Yunqiu Xu, Chang Han Low, Mingqi Gao, Lap Yan Lennon Chan, Yueming Jin
Main category: eess.IV
TL;DR: UniSurgSAM: A unified promptable video object segmentation model for surgical videos that accepts visual, textual, or audio prompts with a decoupled two-stage framework for reliable segmentation.
Details
Motivation: Existing surgical video segmentation methods are limited to single prompt modalities, suffer from optimization interference between initialization and tracking, produce hallucinations when targets are absent, and experience mask drift without recovery mechanisms.
Method: Decoupled two-stage framework with independent optimization of initialization and tracking. Key designs include presence-aware decoding to suppress hallucinations, boundary-aware long-term tracking to prevent mask drift, and adaptive state transition for failure recovery.
Result: Achieves state-of-the-art performance in real-time across all prompt modalities (visual, textual, audio) and granularities on a multi-modal benchmark from four public surgical datasets.
Conclusion: UniSurgSAM provides a practical foundation for computer-assisted surgery with reliable multi-modal promptable video segmentation capabilities.
Abstract: Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding that models target absence to suppress hallucinations; boundary-aware long-term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery. Furthermore, we establish a multi-modal and multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery. Code and datasets will be available at https://jinlab-imvr.github.io/UniSurgSAM.
[923] Cost-Efficient Multi-Scale Fovea for Semantic-Based Visual Search Attention
João Luzio, Alexandre Bernardino, Plinio Moreno
Main category: eess.IV
TL;DR: SemBA framework uses semantic-based Bayesian attention with multi-scale foveation to reduce computational costs while maintaining visual task accuracy, improving biological plausibility and scanpath prediction.
Details
Motivation: Current deep object detectors have high computational costs that affect biological plausibility and real-time deployment. The paper aims to reduce detection-related computational costs without compromising accuracy by mimicking biological foveal vision.
Method: Proposes Semantic-based Bayesian Attention (SemBA) framework with a novel Multi-Scale Fovea module that applies exponential density roll-off topologies. Uses multi-scale pyramidal field-of-view with maximum acuity at innermost level and gradual distortion/uncertainty in outer levels via downsampling.
Result: The Multi-Scale Fovea module effectively reduces processing costs while improving SemBA’s scanpath prediction accuracy. SemBA closely approximates human consistency while retaining actual human fovea proportions.
Conclusion: The SemBA framework with Multi-Scale Foveation provides biologically plausible attention prediction with reduced computational costs, making it suitable for real-time visual attention systems.
Abstract: Semantics are one of the primary sources of top-down preattentive information. Modern deep object detectors excel at extracting such valuable semantic cues from complex visual scenes. However, the size of the visual input to be processed by these detectors can become a bottleneck, particularly in terms of time costs, affecting an artificial attention system’s biological plausibility and real-time deployability. Inspired by classical exponential density roll-off topologies, we apply a new artificial foveation module to our novel attention prediction pipeline: the Semantic-based Bayesian Attention (SemBA) framework. We aim to reduce detection-related computational costs without compromising visual task accuracy, thereby making SemBA more biologically plausible. The proposed multi-scale pyramidal field-of-view retains maximum acuity at an innermost level, around a focal point, while gradually increasing distortion for outer levels to mimic peripheral uncertainty via downsampling. In this work we evaluate the performance of our novel Multi-Scale Fovea, incorporated into SemBA, on target-present visual search. We also compare it against other artificial foveal systems, and conduct ablation studies with different deep object detection models to assess the impact of the new topology in terms of computational costs. We experimentally demonstrate that including the new Multi-Scale Fovea module effectively reduces inherent processing costs while improving SemBA’s scanpath prediction accuracy. Remarkably, we show that SemBA closely approximates human consistency while retaining the actual human fovea’s proportions.
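The multi-scale pyramidal field-of-view can be sketched as concentric crops around the focal point, with outer levels covering larger areas at coarser sampling so every level has the same resolution. The base window size, number of levels, and simple stride subsampling below are assumptions of this sketch, not the paper's exact roll-off topology.

```python
import numpy as np

def foveate(img, cx, cy, base=32, levels=3):
    """Level k covers a (base * 2**k)-pixel window around (cx, cy) but is
    subsampled with stride 2**k, so every level comes out base x base:
    full acuity at the center, coarser sampling toward the periphery."""
    out = []
    H, W = img.shape[:2]
    for k in range(levels):
        half = (base * 2 ** k) // 2
        y0, y1 = max(0, cy - half), min(H, cy + half)
        x0, x1 = max(0, cx - half), min(W, cx + half)
        crop = img[y0:y1, x0:x1]
        out.append(crop[::2 ** k, ::2 ** k])  # stride grows with eccentricity
    return out
```

The detector then runs on these small, fixed-size levels instead of the full frame, which is where the processing cost savings come from.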
[924] NAIMA: Semantics Aware RGB Guided Depth Super-Resolution
Tayyab Nasir, Daochang Liu, Ajmal Mian
Main category: eess.IV
TL;DR: Proposes NAIMA architecture using DINOv2 vision transformer embeddings and Guided Token Attention for semantics-aware guided depth super-resolution to address misleading RGB cues.
Details
Motivation: Current guided depth super-resolution methods suffer from artifacts and blurred depth boundaries due to misleading color and texture cues in RGB images that incorrectly indicate depth discontinuities.
Method: Introduces Guided Token Attention (GTA) module that iteratively aligns RGB spatial features with depth encodings using cross-attention, and NAIMA architecture integrating DINOv2 with GTA blocks to distill semantic knowledge from pretrained vision transformer token embeddings.
Result: Achieves significant improvements over existing methods across multiple scaling factors and datasets for guided depth super-resolution.
Conclusion: Global contextual semantic priors from pretrained vision transformers effectively address misleading RGB cues in guided depth super-resolution, enabling better preservation of depth boundaries and structural details.
Abstract: Guided depth super-resolution (GDSR) is a multi-modal approach for depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to restore finer structural details. However, the misleading color and texture cues indicating depth discontinuities in RGB images often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors, generated from pretrained vision transformer token embeddings. Our approach to distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in related monocular depth estimation tasks. We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention for selectively injecting global semantic context extracted from different layers of a pretrained vision transformer. Additionally, we present an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for a semantics-aware GDSR. Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets.
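The core operation inside the GTA module is cross-attention from depth features (queries) onto semantic token embeddings (keys/values). A minimal single-head scaled dot-product version, without the module's iterative alignment or multi-layer injection, might look like the following sketch.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: depth features (Nq, d) attend to
    semantic tokens (Nk, d) and pull in a weighted mix of their values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over tokens
    return w @ values
```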
[925] BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging
Taiping Qu, Hongkai Zhang, Lantian Zhang, Can Zhao, Nan Zhang, Hui Wang, Zhen Zhou, Mingye Zou, Kairui Bo, Pengfei Zhao, Xingxing Jin, Zixian Su, Kun Jiang, Huan Liu, Yu Du, Maozhou Wang, Ruifang Yan, Zhongyuan Wang, Tiejun Huang, Lei Xu, Henggui Zhang
Main category: eess.IV
TL;DR: BAAI Cardiac Agent is a multimodal AI system for automated cardiac MRI interpretation, integrating segmentation, quantification, tissue characterization, and disease diagnosis into a unified workflow with clinical report generation.
Details
Motivation: Cardiac MRI is underutilized due to complex, time-consuming interpretation requiring specialized expertise. There's a need for automated systems to make CMR more accessible and efficient for clinical use.
Method: The system integrates specialized cardiac expert models in a unified workflow for automated segmentation of cardiac structures, functional quantification, tissue characterization, disease diagnosis, and structured clinical report generation.
Result: Achieved AUC >0.93 internally and >0.81 externally across 7 cardiovascular diseases on 2413 patients. High correlation (>0.90) with clinical reports for key cardiac function parameters. Outperformed SOTA models in segmentation/diagnostic tasks and showed high concordance with expert radiologists.
Conclusion: The multimodal agent framework enables accurate, efficient CMR interpretation by dynamically orchestrating expert models for coordinated multimodal analysis, demonstrating potential for complex clinical imaging workflows.
Abstract: Cardiac magnetic resonance (CMR) is a cornerstone for diagnosing cardiovascular disease. However, it remains underutilized due to complex, time-consuming interpretation across multiple sequences, phases, and quantitative measures that relies heavily on specialized expertise. Here, we present BAAI Cardiac Agent, a multimodal intelligent system designed for end-to-end CMR interpretation. The agent integrates specialized cardiac expert models to perform automated segmentation of cardiac structures, functional quantification, tissue characterization and disease diagnosis, and generates structured clinical reports within a unified workflow. Evaluated on CMR datasets from two hospitals (2413 patients) spanning seven types of major cardiovascular diseases, the agent achieved an area under the receiver-operating-characteristic curve exceeding 0.93 internally and 0.81 externally. In the task of estimating left ventricular function indices, the results generated by this system for core parameters such as ejection fraction, stroke volume, and left ventricular mass are highly consistent with clinical reports, with Pearson correlation coefficients all exceeding 0.90. The agent outperformed state-of-the-art models in segmentation and diagnostic tasks, and generated clinical reports showing high concordance with expert radiologists (six readers across three experience levels). By dynamically orchestrating expert models for coordinated multimodal analysis, this agent framework enables accurate, efficient CMR interpretation and highlights its potential for complex clinical imaging workflows. Code is available at https://github.com/plantain-herb/Cardiac-Agent.
[926] MC-GenRef: Annotation-free mammography microcalcification segmentation with generative posterior refinement
Hyunwoo Cho, Yeeun Kwon, Min Jung Kim, Yangmo Yoo
Main category: eess.IV
TL;DR: MC-GenRef: A framework for microcalcification segmentation in mammography using synthetic supervision and test-time generative posterior refinement without real dense annotations.
Details
Motivation: Microcalcification (MC) analysis is clinically important for early cancer detection, but dense MC segmentation faces challenges: extremely small and sparse targets, expensive/ambiguous pixel-level labeling, and cross-site texture-driven false positives.
Method: Proposes MC-GenRef with two components: 1) Synthetic supervision using real negative mammogram patches as backgrounds with physically plausible MC patterns injected via lightweight image formation model, and 2) Test-time generative posterior refinement (TT-GPR) that treats segmentation as approximate posterior inference using seed-conditioned rectified-flow generator and iterative refinement with overlap-consistent and edge-aware regularization.
Result: On INbreast dataset, synthetic-only initializer achieved best Dice without real dense annotations, while TT-GPR improved miss-sensitive performance (Recall and FNR) with strong class-balanced behavior. On external Yonsei cohort, TT-GPR consistently improved synthetic-only initializer under cross-site shift, increasing Dice and Recall while reducing FNR.
Conclusion: Test-time generative posterior refinement is a practical approach to reduce MC misses and improve robustness without additional real dense labeling, demonstrating effectiveness in cross-site generalization.
Abstract: Microcalcification (MC) analysis is clinically important in screening mammography because clustered puncta can be an early sign of malignancy, yet dense MC segmentation remains challenging: targets are extremely small and sparse, dense pixel-level labels are expensive and ambiguous, and cross-site shift often induces texture-driven false positives and missed puncta in dense tissue. We propose MC-GenRef, a real dense-label-free framework that combines high-fidelity synthetic supervision with test-time generative posterior refinement (TT-GPR). During training, real negative mammogram patches are used as backgrounds, and physically plausible MC patterns are injected through a lightweight image formation model with local contrast modulation and blur, yielding exact image-mask pairs without real dense annotation. Using only these synthetic labeled pairs, MC-GenRef trains a base segmentor and a seed-conditioned rectified-flow (RF) generator that serves as a controllable generative prior. During inference, TT-GPR treats segmentation as approximate posterior inference: it derives a sparse seed from the current prediction, forms seed-consistent RF projections, converts them into case-specific surrogate targets through the frozen segmentor, and iteratively refines the logits with overlap-consistent and edge-aware regularization. On INbreast, the synthetic-only initializer achieved the best Dice without real dense annotations, while TT-GPR improved miss-sensitive performance in terms of Recall and FNR, with strong class-balanced behavior (Bal.Acc., G-Mean). On an external private Yonsei cohort (n=50), TT-GPR consistently improved the synthetic-only initializer under cross-site shift, increasing Dice and Recall while reducing FNR. These results suggest that test-time generative posterior refinement is a practical route to reduce MC misses and improve robustness without additional real dense labeling.
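The synthetic supervision step can be caricatured as injecting small bright Gaussian puncta into negative background patches, yielding exact image-mask pairs for free. The Gaussian profile, amplitude, and mask rule below are illustrative stand-ins for the paper's image formation model with local contrast modulation and blur.

```python
import numpy as np

def inject_mc(patch, centers, sigma=1.5, amp=0.3):
    """Add bright Gaussian puncta (toy MC model) to a [0,1] background patch
    and return the augmented patch with its exact binary mask."""
    H, W = patch.shape
    yy, xx = np.mgrid[0:H, 0:W]
    out = patch.astype(np.float64).copy()
    mask = np.zeros((H, W), dtype=bool)
    for cy, cx in centers:
        g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
        out += amp * g
        mask |= g > 0.5  # label pixels where the punctum dominates
    return np.clip(out, 0, 1), mask
```

Because the mask is derived from the same synthesis process, the image-mask pair is exact by construction, with no human annotation involved.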
[927] TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
Junyoung Park, Youngjin Oh, Nam Ik Cho
Main category: eess.IV
TL;DR: TM-BSN introduces triangular-masked convolutions to handle spatially correlated noise in real sRGB images for self-supervised denoising, achieving state-of-the-art performance without downsampling.
Details
Motivation: Traditional blind-spot networks assume pixel-wise noise independence, which doesn't hold for real sRGB images due to spatially correlated noise from camera ISP pipelines. Existing downsampling methods alter noise statistics and limit contextual information utilization.
Method: Proposes Triangular-Masked Blind-Spot Network with triangular-masked convolutions that restrict kernels to upper-triangular regions, creating diamond-shaped blind spots aligned with demosaicing geometry. Uses knowledge distillation to transfer multiple blind-spot predictions to a lightweight U-Net.
Result: Achieves state-of-the-art performance on real-world benchmarks, significantly outperforming existing self-supervised approaches without requiring downsampling or post-processing.
Conclusion: TM-BSN effectively models spatial correlation in real sRGB noise through geometric alignment of blind spots with demosaicing patterns, enabling accurate self-supervised denoising while preserving full contextual information.
Abstract: Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel, allowing clean signal estimation without ground-truth supervision. However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise from the camera’s image signal processing (ISP) pipeline. While several methods employ downsampling to decorrelate noise, they alter noise statistics and limit the network’s ability to utilize full contextual information. In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. This correlation originates from demosaicing, where each pixel is reconstructed from neighboring samples with spatially decaying weights, resulting in a diamond-shaped pattern. To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original resolution. This design excludes correlated pixels while fully leveraging uncorrelated context, eliminating the need for downsampling or post-processing. Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency. Extensive experiments on real-world benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches. Our code is available at https://github.com/parkjun210/TM-BSN.
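The starting point of the triangular masking is simple to write down: zero out all kernel taps outside the upper-triangular region before convolving. Whether the main diagonal is kept, and exactly how masked branches are composed into the diamond-shaped blind spot, are details this sketch does not settle; it only shows the mask itself.

```python
import numpy as np

def upper_triangular_mask(k):
    """Binary k x k mask keeping only the strictly upper-triangular taps.
    Multiplying convolution weights by this mask restricts the receptive
    field to one triangular half-plane (diagonal exclusion is an assumption)."""
    return np.triu(np.ones((k, k)), k=1)
```

In a framework like PyTorch, the mask would be multiplied into the convolution weights at every forward pass so the excluded taps never contribute.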
[928] An AI Teaching Assistant for Motion Picture Engineering
Deirdre O’Regan, Anil C. Kokaram
Main category: eess.IV
TL;DR: Implementation of an AI Teaching Assistant using RAG for a Master’s course, showing no exam performance differences with AI access and positive student feedback.
Details
Motivation: To explore the implementation and benefits of LLM-driven AI tutors in educational settings, specifically addressing how to effectively deploy AI teaching assistants in university courses.
Method: Used Retrieval Augmented Generation (RAG) to create an AI Teaching Assistant for Trinity College Dublin’s Master’s Motion Picture Engineering course, with detailed implementation including prompt engineering and pipeline tuning. Conducted a study with 43 students over 7 weeks (296 sessions, 1,889 queries) and allowed AI-TA use in open-book exams.
Result: Statistical analysis showed no performance differences in exams regardless of AI-TA access (p > 0.05). Student feedback was positive (mean = 4.22/5 for benefit) but mixed about preferring it over human tutoring (mean = 2.78/5).
Conclusion: Thoughtfully designed assessments can maintain academic validity even with AI assistance, and AI teaching assistants can be beneficial educational tools when properly implemented with RAG.
Abstract: The rapid rise of LLMs over the last few years has prompted growing experimentation with LLM-driven AI tutors. However, the details of implementation, as well as the benefit in a teaching environment, are still in the early stages of exploration. This article addresses these issues in the context of implementing an AI Teaching Assistant (AI-TA) using Retrieval Augmented Generation (RAG) for Trinity College Dublin’s Master’s Motion Picture Engineering (MPE) course. We provide details of our implementation (including the prompt to the LLM, and code), and highlight how we designed and tuned our RAG pipeline to meet course needs. We describe our survey instrument and report on the impact of the AI-TA through a number of quantitative metrics. The scale of our experiment (43 students, 296 sessions, 1,889 queries over 7 weeks) was sufficient to have confidence in our findings. Unlike previous studies, we experimented with allowing the use of the AI-TA in open-book examinations. Statistical analysis across three exams showed no performance differences regardless of AI-TA access (p > 0.05), demonstrating that thoughtfully designed assessments can maintain academic validity. Student feedback revealed that the AI-TA was beneficial (mean = 4.22/5), while students had mixed feelings about preferring it over human tutoring (mean = 2.78/5).
[929] Ray-driven Spectral CT Reconstruction Based on Neural Base-Material Fields
Ligen Shi, Ping Yang, Chang Liu, Wei Zhang, Xing Zhao, Jun Qiu
Main category: eess.IV
TL;DR: Neural field representation for spectral CT reconstruction using continuous vector-valued implicit functions to parameterize basis materials, avoiding complex discretization calculations and enabling high-resolution reconstruction.
Details
Motivation: Spectral CT reconstruction involves solving large-scale nonlinear systems of integral equations that are mathematically ill-posed. Traditional methods require complex calculations of pixel-driven projection coefficient matrices during discretization, which limits accuracy and resolution.
Method: Proposes a neural field representation that parameterizes attenuation coefficients using continuous vector-valued implicit functions. Introduces a lightweight discretization method for line integrals based on ray-driven neural fields, enhancing integral approximation accuracy. Uses auto-differentiation framework to solve the implicit continuous functions of neural base-material fields.
Result: Experimental validation shows exceptional performance in spectral CT reconstruction. The method fulfills requirements for generating high-resolution reconstruction images and is not limited by spatial resolution constraints.
Conclusion: The neural field parameterization approach effectively addresses the ill-posed nature of spectral CT reconstruction, providing accurate high-resolution results without complex discretization calculations, with networks having compact and regular properties.
Abstract: In spectral CT reconstruction, basis material decomposition involves solving a large-scale nonlinear system of integral equations, which is highly ill-posed mathematically. This paper proposes a model that parameterizes the attenuation coefficients of the object using a neural field representation, thereby avoiding the complex calculations of pixel-driven projection coefficient matrices during the discretization process of line integrals. It introduces a lightweight discretization method for line integrals based on a ray-driven neural field, enhancing the accuracy of the integral approximation during the discretization process. The basis materials are represented as continuous vector-valued implicit functions to establish a neural field parameterization model for the basis materials. The auto-differentiation framework of deep learning is then used to solve the implicit continuous function of the neural base-material fields. This method is not limited by the spatial resolution of reconstructed images, and the network has compact and regular properties. Experimental validation shows that our method performs exceptionally well in spectral CT reconstruction. Additionally, it fulfills the requirements for generating high-resolution reconstruction images.
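As a toy illustration of the ray-driven discretization of line integrals described above, one can sample points along a ray through a continuous attenuation field and approximate the integral of the field by trapezoidal quadrature. The Gaussian stand-in for the neural base-material field and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ray_integral(mu, origin, direction, t_near, t_far, n_samples=128):
    """Approximate the line integral of mu along a ray (unit direction)
    by uniform trapezoidal quadrature."""
    t = np.linspace(t_near, t_far, n_samples)
    pts = origin[None, :] + t[:, None] * direction[None, :]
    vals = mu(pts)
    dt = t[1] - t[0]
    return dt * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

# Toy attenuation field standing in for the (learned) neural field.
gaussian_mu = lambda p: np.exp(-np.sum(p ** 2, axis=1))

val = ray_integral(
    gaussian_mu, np.array([-3.0, 0.0]), np.array([1.0, 0.0]), 0.0, 6.0
)
```

In the paper's setting, `mu` would be a differentiable neural field queried at the sample points, so the same quadrature becomes end-to-end trainable via auto-differentiation.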
[930] Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data
Satrajit Chakrabarty, Ravi Soni
Main category: eess.IV
TL;DR: SAM 3 outperforms SAM 2 for zero-shot segmentation of 3D medical data across multiple modalities, with better click prompting performance and fewer failure modes, making it the superior default choice for most medical segmentation tasks.
Details
Motivation: While foundation models like SAM have shown strong performance on natural images, their behavior on medical data remains insufficiently characterized. With SAM 3 introducing a new architecture that may change how visual prompts are interpreted, there's a need to assess whether it can serve as an out-of-the-box replacement for SAM 2 in 3D medical workflows.
Method: First controlled comparison of SAM 2 and SAM 3 by evaluating SAM 3 in its Promptable Visual Segmentation (PVS) mode using various prompting strategies (click, bounding-box, mask). Benchmarking on 16 public medical datasets covering CT, MRI, Ultrasound, and endoscopy across 54 anatomical structures, pathologies, and surgical instruments. Quantifying three failure modes: prompt-frame over-segmentation, over-propagation after object disappearance, and temporal retention of well-initialized predictions.
Result: SAM 3 is consistently stronger under click prompting across modalities, with fewer prompt-frame over-segmentation failures and slower prediction retention decay compared to SAM 2. Under bounding-box and mask prompts, performance gaps narrow for a few CT/MR structures, while SAM 3 remains stronger on ultrasound and endoscopy sequences.
Conclusion: SAM 3 is the superior default choice for most medical segmentation tasks, while clarifying when SAM 2 remains a preferable propagator. The study provides guidance on model selection for medical image segmentation applications.
Abstract: Foundation models, such as the Segment Anything Model (SAM), have heightened interest in promptable zero-shot segmentation. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 has been widely adopted for annotation in 3D medical workflows, the recently released SAM 3 introduces a new architecture that may change how visual prompts are interpreted and propagated. Therefore, to assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 for zero-shot segmentation of 3D medical data, we present the first controlled comparison of both models by evaluating SAM 3 in its Promptable Visual Segmentation (PVS) mode using a variety of prompting strategies. We benchmark on 16 public datasets (CT, MRI, Ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. We further quantify three failure modes: prompt-frame over-segmentation, over-propagation after object disappearance, and temporal retention of well-initialized predictions. Our results show that SAM 3 is consistently stronger under click prompting across modalities, with fewer prompt-frame over-segmentation failures and slower prediction retention decay compared to SAM 2. Under bounding-box and mask prompts, performance gaps narrow in a few CT/MR structures and the models trade off termination behavior, while SAM 3 remains stronger on ultrasound and endoscopy sequences. The overall results position SAM 3 as the superior default choice for most medical segmentation tasks, while clarifying when SAM 2 remains a preferable propagator.
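Benchmarks of this kind typically score volumetric overlap between predicted and reference masks with the Dice coefficient; the abstract only says "objective metrics", so taking Dice as the metric is our assumption. A minimal version:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-7):
    """Dice overlap between two binary masks (arrays of 0/1 or bool).

    Returns a value in [0, 1]; 1 means perfect overlap, 0 means disjoint.
    The eps term keeps the ratio defined when both masks are empty.
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```

For a 3D volume, the same function applies unchanged since the sums run over all voxels.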
[931] MeDUET: Disentangled Unified Pretraining for 3D Medical Image Synthesis and Analysis
Junkai Liu, Ling Shao, Le Zhang
Main category: eess.IV
TL;DR: MeDUET is a 3D medical image pretraining framework that unifies self-supervised learning and diffusion models through disentanglement of anatomical content from acquisition style in multi-center data.
Details
Motivation: Current approaches treat SSL and diffusion models separately for medical image analysis and synthesis, but unifying them is challenging due to entanglement of anatomical content and acquisition style in multi-source data with pronounced style shifts.
Method: Proposes a variational autoencoder-based framework with three components: token demixing for factor separation supervision, Mixed Factor Token Distillation to reduce factor leakage, and Swap-invariance Quadruplet Contrast for factor-wise invariance and discriminability.
Result: Achieves higher fidelity, faster convergence, and better controllability for synthesis, while delivering competitive or superior domain generalization and label efficiency on diverse medical benchmarks.
Conclusion: Shows that multi-source heterogeneity can serve as useful supervision, with disentanglement providing an effective interface for unifying 3D medical image synthesis and analysis.
Abstract: Self-supervised learning (SSL) and diffusion models have advanced representation learning and image synthesis, but in 3D medical imaging they are still largely used separately for analysis and synthesis, respectively. Unifying them is appealing but difficult, because multi-source data exhibit pronounced style shifts while downstream tasks rely primarily on anatomy, causing anatomical content and acquisition style to become entangled. In this paper, we propose MeDUET, a 3D Medical image Disentangled UnifiEd PreTraining framework in the variational autoencoder latent space. Our central idea is to treat unified pretraining under heterogeneous multi-center data as a factor identifiability problem, where content should consistently capture anatomy and style should consistently capture appearance. MeDUET addresses this problem through three components. Token demixing provides controllable supervision for factor separation, Mixed Factor Token Distillation reduces factor leakage under mixed regions, and Swap-invariance Quadruplet Contrast promotes factor-wise invariance and discriminability. With these learned factors, MeDUET transfers effectively to both synthesis and analysis, yielding higher fidelity, faster convergence, and better controllability for synthesis, while achieving competitive or superior domain generalization and label efficiency on diverse medical benchmarks. Overall, MeDUET shows that multi-source heterogeneity can serve as useful supervision, with disentanglement providing an effective interface for unifying 3D medical image synthesis and analysis. Our code is available at https://github.com/JK-Liu7/MeDUET.
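The content/style disentanglement MeDUET builds on can be pictured as exchanging the style part of two latents while keeping anatomy fixed. The split-by-index layout and names below are our illustrative assumptions, not the actual MeDUET interface.

```python
import numpy as np

def swap_style(z_a, z_b, d_content):
    """Exchange the style parts of two latents, keeping content fixed.

    Each latent is laid out as [content | style]; after the swap, the
    anatomy of scan A is paired with the acquisition appearance of
    scan B, and vice versa. Swap-invariant training objectives can then
    require downstream predictions of anatomy to be unchanged.
    """
    c_a, s_a = z_a[:d_content], z_a[d_content:]
    c_b, s_b = z_b[:d_content], z_b[d_content:]
    return np.concatenate([c_a, s_b]), np.concatenate([c_b, s_a])
```

In a real pipeline the two parts would be token groups in the VAE latent space rather than halves of a flat vector, but the invariance being enforced is the same.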
[932] MRI-to-CT synthesis using drifting models
Qing Lyu, Jianxu Wang, Jeremy Hudson, Ge Wang, Christopher T. Whitlow
Main category: eess.IV
TL;DR: Drifting models outperform diffusion and other generative methods for fast, high-quality MRI-to-CT synthesis in pelvic imaging with one-step inference.
Details
Motivation: Enable MR-only pelvic workflows by synthesizing CT-like images from MRI, avoiding ionizing radiation while preserving bone details needed for radiotherapy planning and PET/MR attenuation correction.
Method: Benchmark drifting models against various baselines: UNet, VAE, WGAN-GP, PPFM, FastDDPM, DDIM, DDPM on two pelvic datasets (Gold Atlas Male Pelvis and SynthRAD2023). Evaluate with SSIM, PSNR, RMSE and qualitative assessment of critical anatomical regions.
Result: Drifting models achieve highest SSIM and PSNR, lowest RMSE across both datasets, surpassing all baselines. Visual inspection shows sharper bone edges, improved geometry depiction, reduced artifacts, with one-step inference in milliseconds.
Conclusion: Drifting models offer promising fast, high-quality synthetic CT generation from MRI, with favorable accuracy-efficiency trade-off for clinical applications like radiotherapy planning and PET/MR attenuation correction.
Abstract: Accurate MRI-to-CT synthesis could enable MR-only pelvic workflows by providing CT-like images with bone details while avoiding additional ionizing radiation. In this work, we investigate recently proposed drifting models for synthesizing pelvis CT images from MRI and benchmark them against convolutional neural networks (UNet, VAE), a generative adversarial network (WGAN-GP), a physics-inspired probabilistic model (PPFM), and diffusion-based methods (FastDDPM, DDIM, DDPM). Experiments are performed on two complementary datasets: Gold Atlas Male Pelvis and the SynthRAD2023 pelvis subset. Image fidelity and structural consistency are evaluated with SSIM, PSNR, and RMSE, complemented by qualitative assessment of anatomically critical regions such as cortical bone and pelvic soft-tissue interfaces. Across both datasets, the proposed drifting model achieves high SSIM and PSNR and low RMSE, surpassing strong diffusion baselines and conventional CNN-, VAE-, GAN-, and PPFM-based methods. Visual inspection shows sharper cortical bone edges, improved depiction of sacral and femoral head geometry, and reduced artifacts or over-smoothing, particularly at bone-air-soft tissue boundaries. Moreover, the drifting model attains these gains with one-step inference and inference times on the order of milliseconds, yielding a more favorable accuracy-efficiency trade-off than iterative diffusion sampling while remaining competitive in image quality. These findings suggest that drifting models are a promising direction for fast, high-quality pelvic synthetic CT generation from MRI and warrant further investigation for downstream applications such as MRI-only radiotherapy planning and PET/MR attenuation correction.
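Two of the fidelity metrics named in the abstract, RMSE and PSNR, are straightforward to compute (SSIM is omitted here for brevity); the `data_range` default below is our assumption for images scaled to [0, 1].

```python
import numpy as np

def rmse(x, y):
    """Root-mean-square error between two images."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to data_range."""
    mse = np.mean((x - y) ** 2)
    return float(20.0 * np.log10(data_range) - 10.0 * np.log10(mse))
```

Lower RMSE and higher PSNR both indicate a synthetic CT closer to the reference; for CT in Hounsfield units, `data_range` would be set to the HU span of the evaluation window instead of 1.0.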