Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models
Mintong Kang, Chen Fang, Bo Li
Main category: cs.SD
TL;DR: AudioSafetyBench: First policy-based audio safety benchmark addressing unique audio risks like harmful sound events, speaker attributes, voice cloning, and voice-content compositional harms, with AudioGuard as a unified guardrail solution.
Details
Motivation: Audio safety is more complex than just "unsafe text spoken aloud" - real-world risks include audio-native harmful sound events, speaker attributes (child voice), impersonation/voice-cloning misuse, and voice-content compositional harms. Current benchmarks and guardrails are inadequate for this unique risk landscape.Method: 1) Conduct large-scale red teaming on audio systems to systematically uncover vulnerabilities; 2) Develop comprehensive, policy-grounded audio risk taxonomy; 3) Create AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models; 4) Propose AudioGuard with SoundGuard (waveform-level audio-native detection) and ContentGuard (policy-grounded semantic protection).
Result: AudioSafetyBench supports diverse languages, suspicious voices (celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency across AudioSafetyBench and four complementary benchmarks.
Conclusion: The paper addresses critical gaps in audio safety for foundation models by providing comprehensive benchmarks and effective guardrail solutions that handle the unique complexities of audio risks beyond just text-to-speech safety.
Abstract: Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just “unsafe text spoken aloud”: real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.
Relevance: 9/10
[2] Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
Qixuan Huang, Khalid Zaman, Masashi Unoki
Main category: cs.SD
TL;DR: A plug-and-play Noise-Aware In-Context Learning method to reduce hallucinations in auditory large language models for audio captioning tasks, with a new hallucination benchmark dataset and evaluation metrics.
Details
Motivation: Auditory LLMs suffer from hallucination issues in audio understanding tasks, but existing evaluation methods are binary and insufficient for complex generative tasks, while mitigation strategies require expensive fine-tuning.Method: Proposes Noise-Aware In-Context Learning (NAICL) - constructs noise prior library, retrieves relevant noise examples as contextual priors to guide models to reduce speculative associations when acoustic evidence is insufficient and adopt conservative generation.
Result: All evaluated ALLMs exhibit same hallucination behaviors. NAICL reduces overall hallucination rate from 26.53% to 16.98%. Also establishes Clotho-1K multi-event benchmark dataset with four hallucination types and fine-grained metrics.
Conclusion: NAICL effectively mitigates hallucinations in auditory LLMs without fine-tuning, and the new benchmark enables comprehensive evaluation of hallucination patterns in audio captioning tasks.
Abstract: Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio caption tasks including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
Relevance: 9/10
[3] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
Junchao Liao, Zhenghao Zhang, Xiangyu Meng, Litao Li, Ziying Zhang, Siyu Zhu, Long Qin, Weizhi Wang
Main category: cs.CV
TL;DR: Tora3 is a trajectory-guided audio-video generation framework that uses object trajectories as a shared kinematic prior to improve physical coherence and motion-sound alignment in AV generation.
Details
Motivation: Current AV generation methods produce visually unstable object motions and sounds that are only loosely aligned with motion or contact events, lacking explicit motion-aware structure shared between video and audio generation.Method: Uses object trajectories as shared kinematic prior; designs trajectory-aligned motion representation for video, kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and hybrid flow matching scheme that preserves trajectory fidelity while maintaining local coherence.
Result: Extensive experiments show Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
Conclusion: Tora3 demonstrates that using object trajectories as a shared kinematic prior effectively improves physical coherence and motion-sound relations in audio-video generation.
Abstract: Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 97]
- cs.CV [Total: 197]
- cs.AI [Total: 88]
- cs.SD [Total: 12]
- cs.LG [Total: 159]
- cs.MA [Total: 9]
- cs.MM [Total: 4]
- eess.AS [Total: 6]
- eess.IV [Total: 9]
cs.CL
[1] Drift and selection in LLM text ecosystems
Søren Riis
Main category: cs.CL
TL;DR: Mathematical framework for recursive text generation where AI systems learn from their own outputs, analyzing drift (unfiltered reuse removes rare forms) and selection (filtering by publication/verification) effects on public text corpus evolution.
Details
Motivation: The public text record is increasingly shaped by AI-generated outputs that then become training data for future AI systems, creating a recursive loop that could fundamentally alter the quality and structure of public knowledge.Method: Developed an exactly solvable mathematical framework using variable-order n-gram agents to model recursive text generation, separating drift (unfiltered reuse) and selection (publication filtering) forces acting on the corpus.
Result: Drift removes rare forms leading to stable distributions; selection determines corpus depth - mere statistical replication leads to shallow equilibrium, while normative filtering (quality/novelty) preserves deeper structure with optimal upper bound on divergence from shallow states.
Conclusion: Framework identifies when recursive publication compresses public text versus when selective filtering sustains richer structure, with important implications for designing AI training corpora to maintain quality and diversity.
Abstract: The public text record – the material from which both people and AI systems now learn – is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative – rewarding quality, correctness or novelty – deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.
[2] SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models
Beny Rubinstein, Sergio Matos
Main category: cs.CL
TL;DR: SynDocDis generates synthetic physician-to-physician dialogues using structured prompting and de-identified metadata for privacy-compliant medical AI research.
Details
Motivation: Physician-physician discussions contain valuable clinical knowledge but are restricted by privacy regulations. Existing synthetic data approaches focus on patient-physician interactions or structured records, leaving a gap in physician-to-physician communication synthesis.Method: Combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluated by five practicing physicians across nine oncology and hepatology scenarios.
Result: Achieved exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5) with substantial interrater reliability (kappa = 0.70). Framework achieved 91% clinical relevance ratings while maintaining privacy.
Conclusion: SynDocDis is a promising framework for advancing medical AI research ethically through privacy-compliant synthetic physician dialogue generation, with applications in medical education and clinical decision support.
Abstract: Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors’ and patients’ privacy. These results place SynDocDis as a promising framework for advancing medical AI research ethically and responsibly through privacy-compliant synthetic physician dialogue generation with direct applications in medical education and clinical decision support.
[3] EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
Arth Singh
Main category: cs.CL
TL;DR: EMA traces as simple recurrent context reveal limitations of fixed-coefficient accumulation vs learned selection in sequence modeling
Details
Motivation: To understand what efficient sequence models gain over simple temporal averaging by using exponential moving average (EMA) traces as a controlled probe to map boundaries of fixed-coefficient accumulationMethod: Uses EMA traces (simplest recurrent context without gating or content-based retrieval) to analyze sequence modeling capabilities; tests multi-timescale EMA traces on grammatical role assignment and language modeling tasks
Result: EMA traces encode temporal structure well (96% of BiGRU on grammatical roles, surpassing on structure-dependent roles) but destroy token identity (130M-parameter LM reaches C4 perplexity 260, 8x GPT-2); predictor ablation shows entire gap localized to traces
Conclusion: Fixed-coefficient accumulation suffers irreversible information dilution that only learned, input-dependent selection can resolve; EMA traces apply lossy, data-independent compression that cannot be recovered by downstream predictors
Abstract: What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
[4] Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Arth Singh
Main category: cs.CL
TL;DR: Diffusion-based language models have a critical safety vulnerability: their safety alignment depends on monotonic denoising schedules where refusal tokens are never re-evaluated, allowing simple prefix injection attacks to bypass safety measures.
Details
Motivation: The paper investigates the security of safety-aligned diffusion-based language models, motivated by the observation that their safety mechanisms rely on fragile assumptions about denoising schedules that may not hold under adversarial conditions.Method: The authors demonstrate a simple two-step attack: 1) re-masking refusal tokens that are committed early in denoising, and 2) injecting a 12-token affirmative prefix. They test this against LLaDA-8B-Instruct and Dream-7B-Instruct on HarmBench, and compare with gradient-optimized attacks.
Result: The simple attack achieves 76.1% ASR on LLaDA-8B-Instruct and 81.8% ASR on Dream-7B-Instruct. Surprisingly, gradient-optimized attacks perform worse (41.5% vs 76.1%), showing the vulnerability is structural rather than requiring sophisticated exploitation.
Conclusion: Diffusion-based LLM safety is architecturally shallow and not adversarially robust, relying solely on never-violated denoising schedules. The paper proposes defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.
Abstract: Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.
[5] WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
Hanna Lee, Tan Dat Nguyen, Jaehoon Kang, Kyuhong Shim
Main category: cs.CL
TL;DR: WAND is a framework that adapts pretrained autoregressive text-to-speech models to use windowed attention for constant computational complexity while maintaining quality through knowledge distillation.
Details
Motivation: Current autoregressive TTS models have quadratic memory and compute costs due to full self-attention, limiting their efficiency for long sequences. There's a need for more efficient attention mechanisms that preserve synthesis quality.Method: Proposes WAND with: 1) Separated attention - persistent global attention over conditioning tokens and local sliding-window attention over generated tokens; 2) Curriculum learning for stable fine-tuning with progressively tightening windows; 3) Knowledge distillation from full-attention teacher models to recover quality efficiently.
Result: Achieves up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency while preserving original synthesis quality across three modern AR-TTS models.
Conclusion: WAND enables efficient autoregressive TTS with constant computational complexity without sacrificing quality, making long-form speech synthesis more practical.
Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.
[6] Medical Reasoning with Large Language Models: A Survey and MR-Bench
Xiaohan Ren, Chenxiao Fan, Wenyin Ma, Hongliang He, Chongming Gao, Xiaoyan Zhao, Fuli Feng
Main category: cs.CL
TL;DR: A comprehensive survey of medical reasoning with LLMs, organizing methods into seven technical routes and introducing MR-Bench for clinical evaluation, revealing gaps between exam performance and real-world clinical decision-making.
Details
Motivation: While LLMs show promise on medical exam tasks, clinical decision-making requires robust medical reasoning beyond factual recall, necessitating systematic evaluation of reasoning capabilities for real-world deployment.Method: Organizes medical reasoning methods into seven technical routes based on cognitive theories (abduction, deduction, induction), conducts unified cross-benchmark evaluation, and introduces MR-Bench derived from real hospital data.
Result: Reveals significant gap between exam-level performance and accuracy on authentic clinical decision tasks, highlighting limitations of current models for real-world clinical reasoning.
Conclusion: Provides unified framework for medical reasoning evaluation, identifies key gaps between current LLM capabilities and clinical requirements, and emphasizes need for clinically-grounded reasoning benchmarks.
Abstract: Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.
[7] Uncertainty Estimation for the Open-Set Text Classification systems
Leonid Erlygin, Alexey Zaytsev
Main category: cs.CL
TL;DR: HolUE method adapted for text domain to estimate uncertainty in open-set text classification, addressing both text and gallery uncertainty sources.
Details
Motivation: Accurate uncertainty estimation is crucial for robust and trustworthy recognition systems, especially in open-set text classification where samples can be from known classes or unknown/novel classes.Method: Adapt Holistic Uncertainty Estimation (HolUE) method for text domain to capture two major uncertainty sources: text uncertainty (ill-formulated queries) and gallery uncertainty (ambiguity of data distribution).
Result: Achieves 40-365% improvement in Prediction Rejection Ratio over baseline across multiple datasets: 365% on Yahoo Answers, 347% on DBPedia, 240% on PAN authorship attribution, and 40% on CLINC150 intent classification.
Conclusion: The adapted HolUE method effectively captures different uncertainty types in open-set text classification, enabling better prediction of when the system will make recognition errors.
Abstract: Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open-set text classification (OSTC) task - and uncertainty estimation for it. For OSTC a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill formulated queries and gallery uncertainty that is related the ambiguity of data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, utilizing the authorship attribution, intent and topic classification datasets. HolUE achieves 40-365% improvement in Prediction Rejection Ratio (PRR) over the quality-based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs~0.52). We make public our code and protocols https://github.com/Leonid-Erlygin/text_uncertainty.git
[8] A Representation-Level Assessment of Bias Mitigation in Foundation Models
Svetoslav Nizhnichenkov, Rahul Nair, Elizabeth Daly, Brian Mac Namee
Main category: cs.CL
TL;DR: Bias mitigation in foundation models (BERT and Llama2) reduces gender-occupation disparities in embedding spaces, making representations more neutral. Representational analysis reveals interpretable geometric transformations, and a new dataset WinoDec is introduced for decoder-only model assessment.
Details
Motivation: To understand how bias mitigation techniques reshape the internal representations of foundation models, providing an internal audit of model behavior through representational analysis and validating debiasing effectiveness.Method: Used BERT and Llama2 as representative encoder-only and decoder-only architectures, compared baseline and bias-mitigated variants, analyzed shifts in associations between gender and occupation terms in embedding spaces, and introduced WinoDec dataset for decoder-only model assessment.
Result: Bias mitigation reduces gender-occupation disparities in embedding spaces, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, showing fairness improvements manifest as interpretable geometric transformations.
Conclusion: Embedding analysis is a valuable tool for understanding and validating debiasing methods in foundation models. The WinoDec dataset facilitates assessment of decoder-only models in bias mitigation research.
Abstract: We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino-dec)
[9] Neural networks for Text-to-Speech evaluation
Ilya Trofimenko, David Kocharyan, Aleksandr Zaitsev, Pavel Repnikov, Mark Levin, Nikita Shevtsov
Main category: cs.CL
TL;DR: Novel neural models for automated TTS quality assessment that approximate human judgments in both relative (SBS) and absolute (MOS) settings, outperforming human inter-rater reliability.
Details
Motivation: Human subjective evaluation protocols (MOS and SBS) for TTS quality assessment are expensive, slow, and biased. There's a need for automated neural models that can approximate expert judgments at scale.Method: Proposed NeuralSBS (HuBERT-backed) for relative assessment, and enhanced MOSNet with custom sequence-length batching plus WhisperBert (multimodal stacking ensemble combining Whisper audio features and BERT textual embeddings via weak learners).
Result: NeuralSBS achieves 73.7% accuracy on SOMOS dataset; best MOS models achieve RMSE of ~0.40, significantly outperforming human inter-rater RMSE baseline of 0.62. Ensemble stacking outperforms direct latent fusion.
Conclusion: Dedicated metric learning frameworks are necessary for TTS quality assessment, with ensemble-based multimodal approaches showing superior performance over naive fusion or zero-shot LLM evaluators.
Abstract: Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating, and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy (on SOMOS dataset). For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.
[10] Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Mousa Salah, Amgad Muneer
Main category: cs.CL
TL;DR: Systematic evaluation of temperature and prompting strategies for extended reasoning LLMs on mathematical problems shows optimal performance varies by approach, with zero-shot peaking at moderate temperatures and chain-of-thought at extremes.
Details
Motivation: Extended reasoning models enable explicit test-time computation for complex problem solving, but optimal configuration of sampling temperature and prompting strategy remains underexplored, challenging the common practice of using T=0 for reasoning tasks.Method: Systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark.
Result: Zero-shot prompting achieves peak performance at moderate temperatures (59% accuracy at T=0.4 and T=0.7), while chain-of-thought performs best at temperature extremes. Benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0.
Conclusion: Temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks. Different prompting approaches have different optimal temperature regimes for extended reasoning models.
Abstract: Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
[11] Attention-Based Sampler for Diffusion Language Models
Yuyan Zhou, Kai Syun Hou, Weiyu Chen, James Kwok
Main category: cs.CL
TL;DR: The paper proposes Attn-Sampler, a training-free decoding algorithm for diffusion-based LLMs that uses attention matrix column sums to determine optimal decoding order, improving both generation quality and parallelism.
Details
Motivation: Auto-regressive models have limitations in inference efficiency and modeling flexibility due to sequential decoding. Diffusion-based LLMs offer parallel decoding potential but current methods rely on token-level information without considering global sequence structure, leading to suboptimal results.Method: Theoretical analysis shows optimal sequence likelihood can be achieved by decoding tokens in descending order of attention matrix column sums. This insight is implemented in Attn-Sampler with block attention approximation and dynamic attention thresholding for practical acceleration.
Result: Extensive experiments across multiple benchmarks validate Attn-Sampler’s effectiveness, demonstrating superior generation quality while enhancing decoding parallelism compared to existing methods.
Conclusion: Attention-guided decoding based on column sums provides a theoretically grounded alternative to greedy search, enabling better generation quality and improved parallelism in diffusion-based language models.
Abstract: Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLMs decoding strategies rely primarily on token level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the decoding parallelism.
[12] Multi-User Large Language Model Agents
Shu Yang, Shenzhe Zhu, Hao Zhu, José Ramón Enríquez, Di Wang, Alex Pentland, Michiel A. Bakker, Jiaxin Pei
Main category: cs.CL
TL;DR: This paper presents the first systematic study of multi-user LLM agents, formalizing them as multi-principal decision problems and revealing systematic gaps in current models’ ability to handle conflicting user interests, privacy, and coordination.
Details
Motivation: Current LLM-based agents are optimized for single-user interactions but increasingly need to serve multiple users simultaneously in team workflows and organizational tools, creating challenges with conflicts, information asymmetry, and privacy constraints.Method: The authors formalize multi-user interaction with LLM agents as a multi-principal decision problem, introduce a unified multi-user interaction protocol, and design three targeted stress-testing scenarios to evaluate LLMs’ capabilities in instruction following, privacy preservation, and coordination.
Result: Frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi-turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.
Conclusion: There are systematic gaps in current LLMs’ capabilities for multi-user settings, highlighting the need for new approaches to handle multi-principal interactions with conflicting interests, privacy constraints, and coordination challenges.
Abstract: Large language models (LLMs) and LLM-based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm, in which the model is designed to satisfy the objectives of one dominant user whose instructions are treated as the sole source of authority and utility. However, as they are integrated into team workflows and organizational tools, they are increasingly required to serve multiple users simultaneously, each with distinct roles, preferences, and authority levels, leading to multi-user, multi-principal settings with unavoidable conflicts, information asymmetry, and privacy constraints. In this work, we present the first systematic study of multi-user LLM agents. We begin by formalizing multi-user interaction with LLM agents as a multi-principal decision problem, where a single agent must account for multiple users with potentially conflicting interests and associated challenges. We then introduce a unified multi-user interaction protocol and design three targeted stress-testing scenarios to evaluate current LLMs’ capabilities in instruction following, privacy preservation, and coordination. Our results reveal systematic gaps: frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi-turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.
[13] Dynamic sparsity in tree-structured feed-forward layers at scale
Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel
Main category: cs.CL
TL;DR: Tree-structured feed-forward layers replace dense MLP blocks in transformers, enabling conditional computation via hard hierarchical routing with only 5% activation per token while matching dense baseline performance.
Details
Motivation: Feed-forward MLP blocks consume significant compute in transformers, motivating sparse alternatives to reduce computational costs while maintaining performance.Method: Uses sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks with hard hierarchical routing (no separate router network). Applies conditional sparsity for autoregressive language modeling and question answering, scaling beyond 1B parameters.
Result: Models activate <5% of feed-forward units per token yet match dense baselines in training and fine-tuning. Shows emergent auto-pruning effect where hard routing with asymmetric nonlinearities deactivates unused paths, converting dynamic routing to static structural sparsity.
Conclusion: Tree-structured feed-forward layers provide scalable, controllable sparsification for large transformers, enabling efficient conditional computation without performance loss.
Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer’s compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied for autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and its scalability beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block’s units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
[14] Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching
Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam
Main category: cs.CL
TL;DR: SatIR is a clinical trial retrieval method using constraint satisfaction and LLMs to match patients to trials with high precision and interpretability.
Details
Motivation: Clinical trials often struggle with enrollment despite many available trials. Existing retrieval methods based on keyword/embedding matching have low recall, low precision, and limited interpretability due to complex constraints.Method: Uses formal methods (Satisfiability Modulo Theories and relational algebra) to represent and match constraints from trials and patient records. Leverages LLMs to convert informal clinical reasoning into explicit formal constraints, along with medical ontologies and conceptual models.
Result: Outperforms TrialGPT on all three retrieval objectives: retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall by 22-38 points, serves more patients with at least one useful trial. Fast retrieval at 2.95 seconds per patient over 3,621 trials.
Conclusion: SatIR is scalable, effective, and interpretable for clinical trial retrieval, showing promise for improving patient enrollment in trials.
Abstract: Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on ClinicalTrials.gov, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods – Satisfiability Modulo Theories (SMT) and relational algebra – to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.
[15] Sentiment Classification of Gaza War Headlines: A Comparative Analysis of Large Language Models and Arabic Fine-Tuned BERT Models
Amr Eleraqi, Hager H. Mustafa, Abdul Hadi N. Ahmed
Main category: cs.CL
TL;DR: Comparative analysis of AI models’ sentiment interpretation in conflict media discourse reveals systematic architectural biases, with fine-tuned BERT models leaning neutral and LLMs amplifying negativity, highlighting model choice as interpretive lens selection.
Details
Motivation: To understand how different AI architectures interpret sentiment in conflict-related media discourse, examining systematic biases and divergences rather than evaluating accuracy against human standards, using the 2023 Gaza War as a case study.Method: Comparative analysis of 3 large language models and 6 fine-tuned Arabic BERT models on 10,990 Arabic news headlines using information-theoretic metrics (Shannon Entropy, Jensen-Shannon Distance, Variance Score) and frame-conditioned analysis to quantify systematic differences in sentiment interpretation.
Result: Fine-tuned BERT models (especially MARBERT) show strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment (LLaMA-3.1-8B near-total collapse into negativity). GPT-4.1 adjusts sentiment based on narrative frames, while other LLMs show limited contextual modulation.
Conclusion: Model choice constitutes interpretive lens selection that shapes algorithmic framing of conflict narratives, highlighting risks of treating automated sentiment outputs as neutral measures in war contexts and foregrounding algorithmic discrepancy as analytical object.
Abstract: This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.
[16] Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era
Nabelanita Utami, Sasano Ryohei
Main category: cs.CL
TL;DR: Analysis shows LLMs are homogenizing research writing, reducing linguistic fingerprints that reveal author native languages, with varying impacts across different languages.
Details
Motivation: To investigate whether the shift from traditional writing tools to LLMs is homogenizing research papers by reducing linguistic fingerprints that reveal authors' native language backgrounds.Method: Analyzed ACL Anthology papers across three eras (pre-neural network, pre-LLM, post-LLM), constructed labeled dataset using semi-automated framework, and fine-tuned classifier for native language identification.
Result: Consistent decline in NLI performance over time, with post-LLM era showing anomalies: Chinese and French show unexpected resistance/divergent trends, while Japanese and Korean exhibit sharper-than-expected declines.
Conclusion: LLMs are homogenizing research writing, reducing linguistic diversity, with varying impacts across different languages, suggesting complex interactions between LLM assistance and linguistic backgrounds.
Abstract: The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.
[17] Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean
Aleksandr Meshkov
Main category: cs.CL
TL;DR: TCVA introduces temperature-controlled verdict aggregation for LLM evaluation, using a five-level scoring system with power-mean aggregation and temperature parameter to adjust evaluation rigor for different application domains.
Details
Motivation: Existing LLM evaluation methods (LLM-as-a-Judge, verdict systems, NLI) don't align well with human assessment because they lack adaptability to application domains - they can't adjust strictness based on whether the domain is safety-critical or conversational.Method: Temperature-Controlled Verdict Aggregation (TCVA) combines: 1) five-level verdict-scoring system, 2) generalized power-mean aggregation, and 3) temperature parameter T [0.1, 1.0] to control evaluation rigor. Low temperatures produce pessimistic scores for safety-critical domains; high temperatures yield lenient scores for conversational AI.
Result: Experimental evaluation on SummEval and USR datasets with human Likert-scale annotations shows TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman = 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting temperature.
Conclusion: TCVA provides a flexible, domain-adaptive evaluation framework for LLM-based systems that better aligns with human assessment by allowing control over evaluation strictness through an intuitive temperature parameter.
Abstract: Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman = 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting the temperature parameter.
[18] Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
Peng Wang, Yanqiao Zhu, Zixuan Jiang, Qinyuan Chen, Xingjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
Main category: cs.CL
TL;DR: Proposes an agentic framework for interactive ASR using LLM-as-a-Judge for semantic evaluation and LLM-driven agents for multi-turn interactive correction.
Details
Motivation: Current ASR systems focus on WER which treats all words equally and fails to reflect semantic correctness, and interactive correction - essential for human communication - is underexplored in ASR research.Method: Integrates LLM-as-a-Judge as semantic-aware evaluation metric and designs LLM-driven agent framework for multi-turn interaction to iteratively refine recognition outputs through semantic feedback.
Result: Extensive experiments on GigaSpeech (English), WenetSpeech (Chinese), and ASRU 2019 code-switching test set show effectiveness in improving semantic fidelity and interactive correction capability.
Conclusion: The proposed agentic framework advances ASR beyond token-level accuracy to semantic correctness and enables human-like interactive correction, with code to be released for future research.
Abstract: Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction-an essential component of human communication-has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
[19] Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models
Avni Mittal, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
Main category: cs.CL
TL;DR: A framework for predicting model performance in target languages when direct benchmark results are missing, using structured agentic reasoning to infer missing results from incomplete literature evidence.
Details
Motivation: Multilingual deployment faces challenges with sparse evaluation coverage and uneven published evidence across languages, tasks, and model families, making it difficult to estimate model performance in target languages without direct benchmark results.Method: Created a controlled benchmark of 1,500 questions across six tasks and five evidence scenarios that separates accessible evidence from ground truth. Developed Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesizes predictions through feature-aware aggregation.
Result: Litmus (Re)Agent achieved the best overall performance across six systems, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent.
Conclusion: Structured agentic reasoning is a promising approach for multilingual performance estimation under incomplete evidence conditions.
Abstract: We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.
[20] EXAONE 4.5 Technical Report
Eunbi Choi, Kibong Choi, Sehyun Chun, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Ahra Jo, Hyunjik Jo, Yeonsik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Changhun Lee, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Kwangrok Ryoo, Minju Seo, Sejong Yang, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Kyubeen Han, Joonwon Jang, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Jiyeon Jung, Daeseong Kim, Dohoon Kim, Dohyun Kim, Hyunseo Kim, Minu Kim, Myoungshin Kim, Youchul Kim, Byungoh Ko, Christopher Lee, Edward Hwayoung Lee, Honglak Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Woohyung Lim, Jueun Mun, Jaewoo Park, Jimin Park, Jinho Park, Yongmin Park, Wooseok Seo, Yongwoo Song, Sihyuk Yi, Kyungjae Yoo, Sangyeon Yoon
Main category: cs.CL
TL;DR: EXAONE 4.5 is LG AI Research’s first open-weight vision language model that integrates visual encoding into their existing framework for multimodal pretraining, with specialized document understanding capabilities and extended 256K token context length.
Details
Motivation: To create an open-weight vision language model that combines visual and textual understanding, with particular focus on document-centric applications aligned with LG's strategic domains, while enabling practical industrial deployment.Method: Integrates a dedicated visual encoder into the EXAONE 4.0 framework for native multimodal pretraining on both visual and textual data, trained on large-scale carefully curated data emphasizing document corpora, with extended 256K token context length.
Result: Achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning, with substantial gains in document-centric tasks.
Conclusion: EXAONE 4.5 represents a significant step in practical industrial vision language models with specialized document understanding capabilities, designed for continuous extension to additional domains and applications.
Abstract: This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG’s strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG’s ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.
[21] Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai, Supriyo Chakraborty, Shixiong Zhang, Sambit Sahu, William Campbell
Main category: cs.CL
TL;DR: Investigating what aspects of preference data drive reasoning gains in language models, focusing on generator-level vs sample-level quality deltas in preference pairs.
Details
Motivation: While preference optimization methods like DPO and KTO are widely used for aligning language models, little is understood about what specific properties of preference data actually drive downstream reasoning improvements.Method: Study two types of quality delta: generator-level delta (differences in capability between models generating chosen vs rejected traces) and sample-level delta (quality differences within individual preference pairs). Vary generator scale/model family for generator-level delta, and use LLM-as-a-judge to rate reasoning quality dimensions for sample-level delta.
Result: Increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks, and filtering data by sample-level delta enables more data-efficient training.
Conclusion: Twofold recipe for improving reasoning through preference optimization: maximize generator-level delta when constructing preference pairs, and exploit sample-level delta to select the most informative training examples.
Abstract: Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model’s performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the differences in capability between models that generate chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality differences within an individual preference pair. To study generator-level delta, we vary the generator’s scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks and filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.
[22] SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
Han Luo, Guy Laban
Main category: cs.CL
TL;DR: SPASM framework for stable multi-turn dialogue generation using persona-driven agents with Egocentric Context Projection to prevent persona drift and echoing
Details
Motivation: LLMs deployed in multi-turn settings need consistent roles/personas across long horizons, especially for synthetic dialogue generation where identity failures like persona drift and echoing occurMethod: Modular framework with persona creation via schema sampling/validation, Client-Responder dialogue generation, termination detection, and Egocentric Context Projection (ECP) that stores history in perspective-agnostic representation and projects to each agent’s view
Result: Created dataset of 4,500 personas and 45,000 conversations across 3 LLM backbones; ECP substantially reduces persona drift and eliminates echoing; embedding analyses reveal persona structure and responder-driven interaction geometry
Conclusion: SPASM provides stable persona-driven dialogue generation framework that improves multi-turn consistency without model weight changes, enabling reliable synthetic dialogue creation
Abstract: Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM–LLM conversations can accumulate identity-related failures such as persona drift, role confusion, and “echoing”, where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), a modular, stability-first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural-language persona crafting, (ii) Client–Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long-horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent’s egocentric view before generation. Across three LLM backbones (GPT-4o-mini, DeepSeek-V3.2, Qwen-Plus) and nine Client–Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas X 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder-driven interaction geometry. Our code is available at https://github.com/lhannnn/SPASM.
[23] LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs
Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño
Main category: cs.CL
TL;DR: Graph-based parsers outperform large language models for relation extraction when dealing with complex linguistic graphs with many relations
Details
Motivation: While LLMs show promise for relation extraction, their performance on complex linguistic graphs with many relations remains unclear compared to specialized graph-based approachesMethod: Evaluated four LLMs against a graph-based parser on six relation extraction datasets with varying graph sizes and complexities, comparing performance as number of relations increases
Result: Graph-based parser increasingly outperforms LLMs as the number of relations in input documents increases, making it superior for complex linguistic graphs despite being much lighter
Conclusion: Specialized graph-based parsers remain better than LLMs for relation extraction tasks involving complex linguistic graphs with many relations, offering superior performance with lower computational cost
Abstract: Relation extraction represents a fundamental component in the process of creating knowledge graphs, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text has great complexity. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs, as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.
[24] Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models
Yousra Fettach, Guillaume Bied, Hannu Toivonen, Tijl De Bie
Main category: cs.CL
TL;DR: LLMs show limited alignment with human humor preferences in Cards Against Humanity games, with models agreeing more with each other than with humans due to systematic biases
Details
Motivation: Humor is culturally embedded and socially significant but largely unexplored in LLM alignment; understanding if LLMs can genuinely appreciate human humor or if their judgments reflect structural artifactsMethod: Five frontier language models played Cards Against Humanity games alongside humans, selecting funniest responses from candidate cards across 9,894 rounds; analyzed model-human alignment and inter-model agreement
Result: All models exceeded random baseline but showed modest alignment with human preferences; models agreed with each other substantially more than with humans; systematic position biases and content preferences partly explain these patterns
Conclusion: LLM humor judgment may reflect structural artifacts of inference and alignment rather than genuine preference, raising questions about true humor understanding in language models
Abstract: Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity games (CAH) as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.
[25] Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics
Raphael Bernas, Fanny Jourdan, Antonin Poché, Céline Hudelot
Main category: cs.CL
TL;DR: The paper investigates anisotropy in Transformer architectures, proposing geometric explanations for frequency-biased sampling and training effects, and uses concept-based mechanistic interpretability to show activation-derived tangent directions capture gradient anisotropy.
Details
Motivation: Transformers dominate NLP but exhibit anisotropy (uneven representation geometry), which challenges geometric interpretation. Previous theoretical work lacks grounding in representation geometry, motivating a geometric analysis of this phenomenon.Method: Extends previous work with geometric arguments about frequency-biased sampling and training effects. Uses concept-based mechanistic interpretability during training to fit activation-derived low-rank tangent proxies, comparing them against true backpropagated gradients across encoder and decoder language models.
Result: Activation-derived tangent directions capture unusually large gradient energy and substantially larger share of gradient anisotropy than matched-rank normal controls, providing empirical support for tangent-aligned anisotropy account.
Conclusion: The study provides geometric understanding of anisotropy in Transformers and demonstrates that activation-derived tangent directions effectively capture gradient anisotropy, supporting a tangent-aligned interpretation of this phenomenon.
Abstract: Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplify tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.
[26] MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation
Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, Dan Roth
Main category: cs.CL
TL;DR: MT-OSC: A framework for automatically condensing multi-turn chat history to reduce token usage while preserving essential information, improving LLM performance in extended conversations.
Details
Motivation: LLMs degrade in performance when instructions and context are distributed across multiple conversational turns, and appending full chat history exhausts context windows, increasing latency and computational costs.Method: One-off Sequential Condensation framework with a Condenser Agent that uses few-shot inference-based condensation and a lightweight Decider to selectively retain essential information from chat history.
Result: Reduces token counts by up to 72% in 10-turn dialogues, consistently narrows multi-turn performance gap across 13 state-of-the-art LLMs, improves or preserves accuracy across datasets while remaining robust to distractors.
Conclusion: MT-OSC is a scalable solution for multi-turn chats that enables richer context within constrained input spaces while reducing latency and operational costs.
Abstract: Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.
[27] MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability
Yikun Han, Joey Chan, Jingyuan Chen, Mengting Ai, Simo Du, Yue Guo
Main category: cs.CL
TL;DR: MedConceal benchmark evaluates hidden-concern reasoning in medical dialogue using interactive patient simulator with clinician-visible context and simulator-internal hidden concerns
Details
Motivation: Patient-clinician communication involves asymmetric information where patients don't disclose fears/misconceptions unless skillfully elicited, requiring reasoning under partial observability. Existing benchmarks sidestep this by exposing hidden state or evaluating without modeling what remains hidden.Method: Created MedConceal benchmark with 300 curated cases and 600 clinician-LLM interactions using interactive patient simulator. Built from clinician-answered online health discussions, pairing clinician-visible context with simulator-internal hidden concerns using expert taxonomy. Simulator withholds concerns, tracks revelation via communication signals, and enables process-aware evaluation.
Result: No single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Hidden-concern reasoning under partial observability identified as key unresolved challenge.
Conclusion: MedConceal provides benchmark for evaluating hidden-concern reasoning in medical dialogue, revealing that current systems struggle with partial observability challenges that human clinicians handle better.
Abstract: Patient-clinician communication is an asymmetric-information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. Built from clinician-answered online health discussions, each case pairing clinician-visible context with simulator-internal hidden concerns derived from prior literature and structured using an expert-developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory-grounded turn-level communication signals, and is clinician-reviewed for clinical plausibility. This enables process-aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi-turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden-concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.
[28] Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation
Sophie Wu, Andrew Piper
Main category: cs.CL
TL;DR: Multilingual story moral generation task reveals LLMs produce culturally homogenized moral interpretations with less diversity than human responses across 14 language-culture pairs.
Details
Motivation: To evaluate cultural alignment in language models through narrative interpretation, moving beyond static benchmarks by examining how models generate story morals across different linguistic and cultural contexts.Method: Created dataset of human-written story morals across 14 language-culture pairs, compared model outputs using semantic similarity, human preference surveys, and value categorization analysis.
Result: Frontier models (GPT-4o, Gemini) generate semantically similar morals preferred by humans, but show less cross-linguistic variation and focus on narrower set of widely shared values compared to human diversity.
Conclusion: While models approximate central tendencies of human moral interpretation, they struggle to reproduce cultural diversity in narrative understanding, highlighting limitations in cultural alignment.
Abstract: Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.
[29] Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Noriyuki Kawarazaki, Kosuke Takano
Main category: cs.CL
TL;DR: ADAM is a multilingual personality recognition system using LLM-based translation augmentation and cross-lingual attention distillation to bridge linguistic gaps in personality analysis.
Details
Motivation: The lack of multilingual datasets for personality recognition remains a significant challenge, limiting cross-lingual and cross-cultural personality analysis capabilities.Method: Uses LLM-based translation augmentation with Personality-Informed Generative Augmentation (PIGA) to create multilingual training data, then employs Cross-Lingual Attention Distillation (CLAD) to train models that understand personality traits across languages.
Result: CLAD significantly outperforms standard BCE across all languages and personality traits, achieving BA score improvements of +0.0573 on Essays dataset and +0.0968 on Kaggle dataset, with strong generalizability comparable to leading encoder models.
Conclusion: ADAM successfully addresses multilingual personality recognition challenges through innovative data augmentation and cross-lingual distillation techniques, enabling more effective personality analysis across diverse languages and cultures.
Abstract: While significant work has been done on personality recognition, the lack of multilingual datasets remains an unresolved challenge. To address this, we propose ADAM (Cross-Lingual (A)ttention (D)istillation with Personality-Guided Generative (A)ugmentation for (M)ultilingual Personality Recognition), a state-of-the-art approach designed to advance multilingual personality recognition. Our approach leverages an existing English-language personality dataset as the primary source and employs a large language model (LLM) for translationbased augmentation, enhanced by Personality-Informed Generative Augmentation (PIGA), to generate high-quality training data in multiple languages, including Japanese, Chinese, Malay, and French. We provide a thorough analysis to justify the effectiveness of these augmentation techniques. Building on these advancements, ADAM integrates Cross-Lingual Attention Distillation (CLAD) to train a model capable of understanding and recognizing personality traits across languages, bridging linguistic and cultural gaps in personality analysis. This research presents a thorough evaluation of the proposed augmentation method, incorporating an ablation study on recognition performance to ensure fair comparisons and robust validation. Overall, with PIGA augmentation, the findings demonstrate that CLAD significantly outperforms the standard BCE across all languages and personality traits, achieving notable improvements in average BA scores - 0.6332 (+0.0573) on the Essays dataset and 0.7448 (+0.0968) on the Kaggle dataset. The CLAD-trained model also demonstrated strong generalizability and achieved benchmark performance comparable to current leading encoder models. The model weight, dataset, and algorithm repository are available at https://research.jingjietan.com/?q=ADAM.
[30] GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
Faxian Wan, Xiaocui Yang, Yifan Cao, Shi Feng, Daling Wang, Yifei Zhang
Main category: cs.CL
TL;DR: GRASP is a framework for multimodal sarcasm target identification that uses visual grounding and explicit chain-of-thought reasoning to improve fine-grained localization of sarcasm targets in text and images.
Details
Motivation: Existing multimodal sarcasm detection approaches rely on implicit cross-modal alignment with limited interpretability and suboptimal fine-grained localization. The paper addresses the challenge of Multimodal Sarcasm Target Identification (MSTI), which requires precise localization of textual phrases and visual regions, going beyond traditional binary classification.Method: Proposes GRASP framework with: 1) MSTI-MAX dataset curation to mitigate class imbalance and enrich multimodal sarcasm cues; 2) Grounded CoT reasoning that explicitly anchors sarcasm-related visual regions in reasoning trajectories; 3) Dual-stage optimization: Supervised Fine-Tuning with coordinate-aware weighted loss followed by Fine-Grained Target Policy Optimization.
Result: GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities. LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. The framework demonstrates improved localization of sarcasm targets in both text and visual regions.
Conclusion: GRASP advances multimodal sarcasm analysis by integrating visual grounding with explicit reasoning, moving beyond black-box approaches. The framework provides better interpretability and fine-grained target identification for multimodal sarcasm understanding.
Abstract: Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.
[31] NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Tong Wu, Nicolay Rusnachenko, Huizhi Liang
Main category: cs.CL
TL;DR: Fine-tuned XLM-RoBERTa system for dimensional aspect-based sentiment analysis predicting continuous valence-arousal scores, outperforming few-shot LLMs like GPT-5.2 and LLaMA variants.
Details
Motivation: Extend traditional categorical aspect-based sentiment analysis to continuous dimensional space (valence-arousal regression) to capture more nuanced emotional responses to specific aspects in text.Method: Fine-tuned XLM-RoBERTa-base with dual regression heads for valence and arousal prediction, using input format [CLS] T [SEP] a_i [SEP]. Trained separate models for each language-domain combination (English/Chinese across restaurant, laptop, finance domains). Compared against few-shot prompting with LLMs including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick.
Result: Task-specific fine-tuning substantially and consistently outperformed all LLM-based few-shot methods across all evaluation datasets, demonstrating the superiority of specialized models over general LLMs for this dimensional sentiment regression task.
Conclusion: Fine-tuned transformer models remain more effective than few-shot LLMs for dimensional aspect-based sentiment analysis, highlighting the value of task-specific training over general-purpose language models for this regression problem.
Abstract: Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A - Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick under a few-shot prompting setting, demonstrating that task-specific fine-tuning substantially and consistently outperforms these LLM-based methods across all evaluation datasets. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task3-Track-A.
[32] MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator
Rares-Alexandru Roscan, Gabriel Petre1, Adrian-Marius Dumitran, Angela-Liliana Dumitran
Main category: cs.CL
TL;DR: MuTSE is an interactive web application for evaluating LLM-generated text simplifications across different prompts and models, featuring visual comparison matrices and semantic alignment analysis.
Details
Motivation: Current methods for evaluating LLM text simplifications are limited - researchers use static computational scripts while educators rely on conversational interfaces, neither supporting systematic multi-dimensional evaluation of prompt-model permutations.Method: Developed MuTSE, an interactive human-in-the-loop web application that supports concurrent execution of P×M prompt-model permutations, generates real-time comparison matrices, and integrates a tiered semantic alignment engine with linearity bias heuristic (λ).
Result: The system visually maps source sentences to simplified counterparts, reduces cognitive load for qualitative analysis, and enables reproducible structured annotation for downstream NLP dataset construction.
Conclusion: MuTSE addresses critical methodological challenges in evaluating LLM text simplifications by providing a structured visual framework for comparative analysis across diverse prompting strategies and architectures.
Abstract: As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces – neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce \textbf{MuTSE}\footnote{The project code and the demo have been made available for peer review at the following anonymized URL. https://osf.io/njs43/overview?view_only=4b4655789f484110a942ebb7788cdf2a, an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets. The system supports concurrent execution of $P \times M$ prompt-model permutations, generating a comprehensive comparison matrix in real-time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic ($λ$), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.
[33] TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
Gang Hu, Yating Chen, Haiyan Ding, Wang Gao, Jiajia Huang, Min Peng, Qianqian Xie, Kun Yu
Main category: cs.CL
TL;DR: TaxPraBen: First dedicated benchmark for Chinese taxation practice combining traditional NLP tasks with real-world scenarios like tax risk prevention and strategy planning.
Details
Motivation: LLMs have gaps in specialized, knowledge-intensive Chinese tax domain; existing benchmarks focus on isolated NLP tasks rather than real-world practical capabilities needed for tax practice.Method: Created TaxPraBen benchmark with 10 traditional application tasks + 3 real-world scenarios (tax risk prevention, inspection analysis, strategy planning) from 14 datasets (7.3K instances). Uses scalable structured evaluation paradigm with “structured parsing-field alignment extraction-numerical and textual matching”.
Result: Evaluated 19 LLMs using Bloom’s taxonomy: closed-source large-parameter LLMs excel; Chinese LLMs like Qwen2.5 outperform multilingual LLMs; YaYi2 LLM with some tax data fine-tuning shows limited improvement.
Conclusion: TaxPraBen serves as vital resource for advancing LLM evaluations in practical applications, revealing significant performance disparities in specialized domains.
Abstract: While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks, along with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through process of “structured parsing-field alignment extraction-numerical and textual matching”, enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom’s taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.
[34] MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits
Yixin Xiang, Yunshan Ma, Xiaoyu Du, Yibing Chen, Yanxin Zhang, Jinhui Tang
Main category: cs.CL
TL;DR: MAB-DQA: A Multi-Armed Bandit framework for Document Question Answering that dynamically allocates retrieval budgets to different query aspects to better utilize multiple document images.
Details
Motivation: Current multimodal RAG approaches for visual DQA struggle with effectively utilizing many document images, often retrieving only a few candidate pages and overlooking informative but less visually salient content in favor of common but low-information pages.Method: Decomposes queries into aspect-aware subqueries, treats each as a bandit arm, uses preliminary reasoning results as reward signals to estimate aspect utility, and dynamically reallocates retrieval budgets toward high-value aspects using exploration-exploitation policy.
Result: Shows average improvement of 5%-18% over state-of-the-art methods on four benchmarks, consistently enhancing document understanding.
Conclusion: MAB-DQA effectively addresses the limitation of current multimodal RAG in visual DQA by modeling varying importance of query aspects and dynamically optimizing retrieval budgets, leading to significant performance improvements.
Abstract: Document Question Answering (DQA) involves generating answers from a document based on a user’s query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at https://github.com/ElephantOH/MAB-DQA.
[35] Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
Shun Zou, Yong Wang, Zehui Chen, Lin Chen, Chongyang Tao, Feng Zhao, Xiangxiang Chu
Main category: cs.CL
TL;DR: AHD is a training-free decoding strategy for diffusion LLMs that identifies stable tokens early using dynamic anchors, enabling cross-block decoding to improve both efficiency and performance across multimodal domains.
Details
Motivation: Semi-autoregressive decoding in diffusion LLMs suffers from inherent block constraints that unnecessarily delay decoding of cross-block stable tokens, reducing efficiency and performance.Method: Proposes Anchor-based History-stable Decoding (AHD) which monitors token stability trends in real-time using dynamic anchors, initiates early cross-block decoding once tokens reach stability, and leverages historical information to improve reliability.
Result: AHD improves both performance and inference efficiency across language, vision-language, and audio-language domains, reducing decoding steps by 80% while improving BBH benchmark performance by 3.67%.
Conclusion: AHD effectively addresses block constraint limitations in semi-autoregressive decoding for diffusion LLMs, offering a training-free solution that enhances both efficiency and performance across multimodal applications.
Abstract: Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.
[36] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin
Main category: cs.CL
TL;DR: OmniBehavior: First real-world user simulation benchmark revealing LLMs’ structural biases in simulating complex human behaviors
Details
Motivation: Existing LLM-based user simulation benchmarks are limited to isolated scenarios, narrow action spaces, or synthetic data, failing to capture holistic authentic human behavior patternsMethod: Introduces OmniBehavior benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into unified framework
Result: LLMs struggle with complex behaviors (performance plateaus despite context expansion), exhibit structural bias toward “positive average person” with hyper-activity, persona homogenization, and Utopian bias, losing individual differences and long-tail behaviors
Conclusion: Reveals critical limitations in current LLMs for high-fidelity human behavior simulation and identifies directions for future research to address structural biases
Abstract: The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
[37] Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
Lorenzo Jaime Yu Flores, Cesare Spinoso di-Piano, Jackie Chi Kit Cheung
Main category: cs.CL
TL;DR: Fine-tuning language models degrades correlation between confidence scores and output quality, making uncertainty quantification unreliable without testing
Details
Motivation: To understand how supervised fine-tuning affects the reliability of uncertainty quantification techniques in language models, since confidence scores need to correlate with output quality to be useful for detecting hallucinations or uncertain predictionsMethod: Investigated the underlying behavior of confidence scores and their sensitivity to supervised fine-tuning (SFT), analyzing how correlation between confidence scores and quality changes post-SFT, and conducted a case study to demonstrate downstream impact
Result: Post-SFT, correlation of various confidence scores degrades due to changes in confidence scores from factors other than output quality (like similarity to training distribution), and failing to address this miscorrelation reduces usefulness of confidence scores on downstream tasks
Conclusion: Confidence metrics cannot be used off-the-shelf without testing, motivating the need for developing metrics more robust to fine-tuning
Abstract: Uncertainty quantification is a set of techniques that measure confidence in language models. They can be used, for example, to detect hallucinations or alert users to review uncertain predictions. To be useful, these confidence scores must be correlated with the quality of the output. However, recent work found that fine-tuning can affect the correlation between confidence scores and quality. Hence, we investigate the underlying behavior of confidence scores to understand its sensitivity to supervised fine-tuning (SFT). We find that post-SFT, the correlation of various confidence scores degrades, which can stem from changes in confidence scores due to factors other than the output quality, such as the output’s similarity to the training distribution. We demonstrate via a case study how failing to address this miscorrelation reduces the usefulness of the confidence scores on a downstream task. Our findings show how confidence metrics cannot be used off-the-shelf without testing, and motivate the need for developing metrics which are more robust to fine-tuning.
[38] Quantisation Reshapes the Metacognitive Geometry of Language Models
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: Quantization changes how LLMs monitor their confidence across knowledge domains but doesn’t affect their actual discrimination ability; domain-specific confidence training failed because diagnostic profiles don’t transfer across quantization formats.
Details
Motivation: To understand how model quantization affects metacognitive efficiency in LLMs across different knowledge domains, and whether domain-specific confidence training can improve metacognition.Method: Evaluated Llama-3-8B-Instruct on 3,000 questions at Q5_K_M and f16 precision, analyzed M-ratio and Type-2 AUROC across four domains, conducted pre-registered domain-conditional SFT training with controls.
Result: Quantization restructures M-ratio profiles across domains (Spearman rho = 0.00) while Type-2 AUROC remains perfectly stable (rho = 1.00). Domain-specific training successfully reshaped confidence distributions but didn’t improve meta-d’ due to diagnostic profile non-transferability.
Conclusion: Quantization fundamentally changes how LLMs monitor confidence across domains without affecting discrimination ability; systems using M-ratio profiles have unexamined dependency on inference format, while AUROC_2 is safer.
Abstract: We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d’ because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.
[39] Testing the Assumptions of Active Learning for Translation Tasks with Few Samples
Lorenzo Jaime Yu Flores, Cesare Spinoso di-Piano, Ori Ernst, David Ifeoluwa Adelani, Jackie Chi Kit Cheung
Main category: cs.CL
TL;DR: Active learning strategies fail to outperform random sampling in low-data regimes for language generation tasks due to incorrect assumptions about informativeness and diversity being correlated with test performance.
Details
Motivation: To understand why active learning (AL) strategies perform poorly compared to random sampling when using only 100-500 samples for language generation tasks, and to investigate whether the core assumptions underlying AL strategies hold in low-data regimes.Method: The researchers investigated whether the core assumptions of AL strategies (informativeness and diversity of training data being correlated with test performance) hold by analyzing various factors that impact performance in low-data scenarios.
Result: Neither informativeness nor diversity of training data, which AL strategies optimize for, were correlated with test set performance. Instead, factors like ordering of training samples and interactions with pre-training data had larger impacts on performance.
Conclusion: Future AL methods must account for factors like training sample ordering and pre-training data interactions rather than relying on informativeness and diversity assumptions to work effectively with very few samples.
Abstract: Active learning (AL) is a training paradigm for selecting unlabeled samples for annotation to improve model performance on a test set, which is useful when only a limited number of samples can be annotated. These algorithms often work by optimizing for the informativeness and diversity of the training data to be annotated. Recent work found that AL strategies fail to outperform random sampling on various language generation tasks when using 100-500 samples. To understand AL’s poor performance when only using few samples, we investigate whether the core assumptions underlying AL strategies hold. We find that neither the informativeness nor diversity of the training data, which AL strategies optimize for, are correlated with test set performance. Instead, factors like the ordering of the training samples and interactions with pre-training data have a larger impact on performance. This suggests that future AL methods must take these factors into account in order to work with very few samples.
[40] PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
Jihwan Oh, Soowon Oh, Murad Aghazada, Minchan Jeong, Sungnyun Kim, Se-Young Yun
Main category: cs.CL
TL;DR: Proposes PerMix-RLVR, a persona-mixed reinforcement learning approach that reduces persona sensitivity in LLMs while maintaining persona fidelity, addressing the trade-off between robustness and expressivity.
Details
Motivation: Persona prompting improves LLM performance but optimal persona selection is time-consuming and impact on output quality is poorly understood. Existing inference-time strategies incur computational costs, so the authors aim to address persona sensitivity during training.Method: Uses reinforcement learning with verifiable rewards (RLVR) to reduce persona sensitivity, but identifies a trade-off between robustness and persona expressivity. Proposes PerMix-RLVR, a persona-mixed RLVR strategy that mitigates this trade-off by preserving robustness to harmful persona variation while enabling faithful persona adoption.
Result: PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while enhancing persona fidelity by +11.4% on PersonaGym, demonstrating better balance between robustness and expressivity.
Conclusion: The proposed PerMix-RLVR effectively addresses the persona robustness-fidelity trade-off in LLM training, enabling models to be robust to persona variations while maintaining faithful persona adoption when needed.
Abstract: Persona prompting has been widely adopted to steer large language models (LLMs) behavior and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
[41] ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering
Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: ASTRA introduces an adaptive semantic tree reasoning architecture for table question answering, addressing serialization bottlenecks through logical semantic trees and dual-mode reasoning.
Details
Motivation: Current table serialization methods for LLMs suffer from structural neglect, representation gaps, and reasoning opacity. Existing approaches fail to capture explicit hierarchies and lack schema flexibility, while tree-based methods have limited semantic adaptability.Method: ASTRA consists of two main modules: AdaSTR (Adaptive Semantic Tree Reasoning) which reconstructs tables into Logical Semantic Trees using LLMs’ global semantic awareness, and DuTR (Dual-mode Tree Reasoning) which integrates tree-search-based textual navigation for linguistic alignment with symbolic code execution for precise verification.
Result: Experiments on complex table benchmarks demonstrate that ASTRA achieves state-of-the-art (SOTA) performance in table question answering.
Conclusion: ASTRA effectively addresses table serialization bottlenecks through adaptive semantic tree construction and dual-mode reasoning, significantly improving LLM performance on complex table understanding tasks.
Abstract: Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.
[42] Towards Linguistically-informed Representations for English as a Second or Foreign Language: Review, Construction and Application
Wenxi Li, Xihao Wang, Weiwei Sun
Main category: cs.CL
TL;DR: A paper proposing a novel syntactico-semantic resource for English as a Second/Foreign Language (ESFL) using construction-based analysis, with 1643 annotated sentences and a pilot study testing the Linguistic Niche Hypothesis.
Details
Motivation: ESFL is increasingly recognized as a distinct linguistic system rather than just deviation from standard English, creating need for dedicated knowledge-intensive representations to properly model its unique characteristics.Method: Uses constructivist theories with constructions as fundamental units to model syntax-semantics interface; creates gold-standard resource of 1643 annotated ESFL sentences; conducts pilot study testing Linguistic Niche Hypothesis.
Result: Developed a comprehensive syntactico-semantic resource capturing ESFL phenomena while preserving unique characteristics; demonstrated practical utility through pilot study on Linguistic Niche Hypothesis.
Conclusion: The proposed resource serves as valuable tool for Second Language Acquisition research, enabling better understanding of ESFL as distinct linguistic system with practical applications in linguistic theory testing.
Abstract: The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax–semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL’s unique characteristics, resulting a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank’s practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.
[43] CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space
Yeonjun Hwang, Sungyong Park, Minju Kim, Dongha Lee, Jinyoung Yeo
Main category: cs.CL
TL;DR: CONDESION-BENCH: A benchmark for evaluating LLMs in conditional decision-making with compositional action spaces, moving beyond finite action sets to incorporate explicit feasibility constraints.
Details
Motivation: Existing decision-making benchmarks for LLMs oversimplify real-world scenarios by using finite action sets and ignoring explicit feasibility conditions, failing to capture the compositional nature of actions and constraints that affect decision validity.Method: Introduces CONDESION-BENCH where actions are defined as allocations to decision variables with explicit conditions at variable, contextual, and allocation levels, using oracle-based evaluation of both decision quality and condition adherence.
Result: Provides a more rigorous assessment framework for LLMs as decision-support tools by evaluating both the quality of decisions and adherence to explicit feasibility conditions in compositional action spaces.
Conclusion: CONDESION-BENCH addresses limitations of existing benchmarks by incorporating conditional constraints and compositional action structures, enabling better evaluation of LLMs for real-world decision-making scenarios.
Abstract: Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.
[44] Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography
Ruiyi Yan, Shiao Meng, Yugo Murawaki
Main category: cs.CL
TL;DR: ASW framework improves linguistic steganography robustness to text modifications while maintaining quality by anchoring prompt and bridge context in sliding window.
Details
Motivation: Traditional linguistic steganography assumes unaltered text transmission, making it fragile to modifications. Previous robustness methods compromise text quality by limiting context window.Method: Proposes anchored sliding window (ASW) framework that anchors prompt and bridge context within context window alongside latest tokens. Optimizes bridge context via prompt distillation and self-distillation strategies.
Result: ASW significantly outperforms baseline in text quality, imperceptibility, and robustness across diverse settings. Consistently better performance than previous methods.
Conclusion: ASW framework effectively addresses fragility of linguistic steganography to text modifications while maintaining high text quality through anchored context and distillation optimization.
Abstract: Linguistic steganography based on language models typically assumes that steganographic texts are transmitted without alteration, making them fragile to even minor modifications. While previous work mitigates this fragility by limiting the context window, it significantly compromises text quality. In this paper, we propose the anchored sliding window (ASW) framework to improve imperceptibility and robustness. In addition to the latest tokens, the prompt and a bridge context are anchored within the context window, encouraging the model to compensate for the excluded tokens. We formulate the optimization of the bridge context as a variant of prompt distillation, which we further extend using self-distillation strategies. Experiments show that our ASW significantly and consistently outperforms the baseline method in text quality, imperceptibility, and robustness across diverse settings. The code is available at github.com/ryehr/ASW_steganography.
[45] NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System
Parjanya Aditya Shukla, Shubham Kumar Nigam, Debtanu Datta, Balaramamahanthi Deepak Patnaik, Noel Shallum, Pradeep Reddy Vanga, Saptarshi Ghosh, Arnab Bhattacharya
Main category: cs.CL
TL;DR: NyayaMind is an open-source framework for Court Judgment Prediction and Explanation that integrates retrieval, reasoning, and verification mechanisms to provide transparent legal reasoning for the Indian judiciary.
Details
Motivation: Current CJPE systems need both high predictive performance and transparent, structured legal reasoning aligned with judicial practices to be practically useful in judicial or legal research settings.Method: The framework consists of two main components: 1) Retrieval Module using RAG pipeline to identify relevant statutes and precedents, and 2) Prediction Module using reasoning-oriented LLMs fine-tuned for Indian legal domain to generate structured outputs including issues, arguments, rationale, and final decision.
Result: NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, as demonstrated through extensive results and expert evaluation.
Conclusion: The framework provides a promising step toward trustworthy AI-assisted legal decision support systems by enabling transparent and scalable legal reasoning for the Indian judiciary.
Abstract: Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.
[46] Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
Shu Yang, Zihao Zhou, Di Wang, Wenda Li
Main category: cs.CL
TL;DR: NSHA is a neuro-symbolic approach for hierarchical instruction-following in LLMs that handles instruction conflicts by modeling priorities and using constraint satisfaction reasoning.
Details
Motivation: Real-world LLM applications involve multiple instructions from heterogeneous sources with different authority levels (system policies, user requests, tool outputs, retrieved context). Prior work focuses on adversarial attacks but overlooks benign instruction conflicts that commonly occur in practice, where models must balance security, task utility, and behavioral consistency.Method: Neuro-Symbolic Hierarchical Alignment (NSHA) explicitly models instruction priorities. At inference: solver-guided reasoning formulates instruction resolution as constraint satisfaction problem to derive maximally consistent applicable instructions. At training: distills solver-based decisions into model parameters using automatically constructed supervision.
Result: NSHA significantly improves performance under instruction conflicts while maintaining competitive utility in reference settings, evaluated across rule following, task execution, tool use, and safety in both single-turn and multi-turn interactions.
Conclusion: NSHA provides an effective approach for hierarchical instruction-following that handles real-world instruction conflicts by combining neuro-symbolic reasoning with constraint satisfaction, enabling LLMs to better navigate complex multi-source instruction environments.
Abstract: Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.
[47] Prototype-Regularized Federated Learning for Cross-Domain Aspect Sentiment Triplet Extraction
Zongming Cai, Jianhang Tang, Zhenyong Zhang, Jinghui Qin, Kebing Jin, Hankz Hankui Zhuo
Main category: cs.CL
TL;DR: PCD-SpanProto: A prototype-regularized federated learning framework for Aspect Sentiment Triplet Extraction that enables cross-domain knowledge transfer through class-level prototype exchange instead of full model parameters.
Details
Motivation: Existing ASTE methods are trained on individual datasets in isolation, failing to capture common feature representations across domains. Data privacy constraints prevent centralized data aggregation, creating challenges for cross-domain learning.Method: Proposes a prototype-regularized federated learning framework with weighted performance-aware aggregation strategy and contrastive regularization module to improve global prototypes under domain heterogeneity and enhance intra-class compactness and inter-class separability across clients.
Result: Extensive experiments on four ASTE datasets demonstrate that the method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.
Conclusion: The proposed PCD-SpanProto framework successfully addresses cross-domain ASTE challenges through prototype-based federated learning, enabling knowledge transfer while preserving data privacy and reducing communication overhead.
Abstract: Aspect Sentiment Triplet Extraction (ASTE) aims to extract all sentiment triplets of aspect terms, opinion terms, and sentiment polarities from a sentence. Existing methods are typically trained on individual datasets in isolation, failing to jointly capture the common feature representations shared across domains. Moreover, data privacy constraints prevent centralized data aggregation. To address these challenges, we propose Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto), a prototype-regularized federated learning framework to enable distributed clients to exchange class-level prototypes instead of full model parameters. Specifically, we design a weighted performance-aware aggregation strategy and a contrastive regularization module to improve the global prototype under domain heterogeneity and the promotion between intra-class compactness and inter-class separability across clients. Extensive experiments on four ASTE datasets demonstrate that our method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.
[48] Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning
Yi Sui, Chaozhuo Li, Dawei Song
Main category: cs.CL
TL;DR: STACK is a framework for compressing Chain-of-Thought reasoning in Large Reasoning Models by dynamically adapting compression strategies based on step-specific redundancy and reasoning states, achieving better accuracy-efficiency trade-offs.
Details
Motivation: Large Reasoning Models using long Chain-of-Thought reasoning suffer from overthinking, leading to excessive steps and high inference latency. Existing compression methods lack fine-grained, step-level adaptation to redundancy and reasoning bias.Method: STACK performs step-wise CoT compression by modeling stage-specific redundancy sources and integrating retrieval-augmented guidance. It constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain states and self-prompted compression for confident states, with answer-convergence-based early stopping. Uses reward-difference-driven training combining PPO and DPO.
Result: On three mathematical reasoning benchmarks, STACK reduces average response length by 59.9% while improving accuracy by 4.8 points over existing methods, achieving superior accuracy-efficiency balance.
Conclusion: STACK effectively addresses overthinking in Large Reasoning Models through state-aware compression with knowledge guidance, demonstrating significant improvements in both efficiency and accuracy for mathematical reasoning tasks.
Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. Therefore, we propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating with a retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning state and self-prompted compression for overly long but confident state, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy by combining Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.
[49] Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events
Yuqin Yang, Haowu Zhou, Haoran Tu, Zhiwen Hui, Shiqi Yan, HaoYang Li, Dong She, Xianrong Yao, Yang Gao, Zhanpeng Jin
Main category: cs.CL
TL;DR: Persona-E² dataset links personality traits (MBTI/Big Five) to emotional appraisals from reader’s perspective, addressing personality illusion in LLMs for emotion understanding.
Details
Motivation: Current affective computing treats emotion as static property of text, focusing on writer's sentiment while ignoring reader's perspective and how individual personalities lead to diverse emotional appraisals of same events. LLMs suffer from "personality illusion" - relying on stereotypes rather than authentic cognitive logic, with bottleneck being lack of ground-truth human data linking personality traits to emotional shifts.Method: Introduce Persona-E² (Persona-Event2Emotion), large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Conduct extensive experiments with state-of-the-art LLMs.
Result: LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Personality information significantly improves comprehension, with Big Five traits alleviating “personality illusion.”
Conclusion: Reader perspective and personality traits are crucial for authentic emotion understanding in text. Persona-E² dataset enables better modeling of how personality influences emotional appraisals, addressing limitations in current affective computing and LLM approaches.
Abstract: Most affective computing research treats emotion as a static property of text, focusing on the writer’s sentiment while overlooking the reader’s perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from “personality illusion’’ – relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating “personality illusion.'
[50] Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG
Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl
Main category: cs.CL
TL;DR: A facet-level diagnostics framework for QA that decomposes questions into atomic reasoning facets to analyze evidence usage in RAG systems, revealing systematic failure modes in evidence integration.
Details
Motivation: Existing RAG evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is actually used during generation, even when relevant documents are available.Method: Introduces facet-level decomposition of questions into atomic reasoning facets, uses a Facet x Chunk matrix combining retrieval relevance with NLI-based faithfulness scores, and analyzes three inference modes: Strict RAG, Soft RAG, and LLM-only generation.
Result: Hallucinations in RAG systems are driven more by how retrieved evidence is integrated during generation than by retrieval accuracy, with facet-level analysis exposing systematic evidence override and misalignment patterns.
Conclusion: Facet-level diagnostics provide interpretable insights into RAG failure modes, revealing that evidence integration issues (not retrieval accuracy) are the primary cause of hallucinations, with systematic patterns of evidence override and misalignment.
Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
[51] Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
Avni Mittal
Main category: cs.CL
TL;DR: SNCA framework extracts LLMs’ self-stated safety rules, formalizes them, and measures behavioral compliance, revealing systematic gaps between stated policies and actual behavior across frontier models.
Details
Motivation: LLMs internalize safety policies through RLHF, but these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but don't measure whether models understand and enforce their own stated boundaries.Method: Introduces Symbolic-Neural Consistency Audit (SNCA): (1) extracts model’s self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks.
Result: Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%).
Conclusion: The gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model’s self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
[52] ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery
Shahar Levy, Eliya Habba, Reshef Mintz, Barak Raveh, Renana Keydar, Gabriel Stanovsky
Main category: cs.CL
TL;DR: ScheMatiQ is an LLM-powered system that automatically generates structured schemas and databases from natural language research questions over large document collections, replacing manual annotation with AI-driven extraction.
Details
Motivation: Traditional manual annotation for extracting structured evidence from large document collections is slow, error-prone, and requires domain expertise. There's a need for automated systems that can understand natural language research questions and produce structured schemas from corpora.Method: ScheMatiQ leverages backbone LLMs to take a natural language question and a document corpus, then automatically produces a schema and grounded database. It includes a web interface for human steering and revision of the extraction process.
Result: The system successfully supports real-world analysis in law and computational biology domains through collaboration with domain experts. It demonstrates practical utility for extracting structured evidence from large document collections.
Conclusion: ScheMatiQ provides an effective automated alternative to manual annotation for structured evidence extraction from document collections, with demonstrated real-world applications and an open-source implementation available for broader use.
Abstract: Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus to produce a schema and a grounded database, with a web interface that lets steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
[53] EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue
Jiawen Deng, Wei Li, Wentao Zhang, Ziyun Jiao, Fuji Ren
Main category: cs.CL
TL;DR: EthicMind is a risk-aware framework for multi-turn dialogue that jointly analyzes ethical risk and user emotion to generate responses balancing ethical guidance with emotional engagement, without requiring additional training.
Details
Motivation: Current dialogue systems fail to adapt to evolving ethical risk and user emotion across multi-turn interactions, treating empathy and ethical safety in isolation, which can cause harm in emotionally and ethically sensitive settings.Method: Formulates ethical-emotional alignment as a turn-level decision problem. At each turn, jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies without additional model training.
Result: EthicMind achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios, as evaluated through a risk-stratified, multi-turn evaluation protocol.
Conclusion: The framework successfully addresses the need for adaptive ethical-emotional alignment in multi-turn dialogue systems operating in sensitive contexts, balancing ethical safety with emotional attunement.
Abstract: Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose \textsc{EthicMind}, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, \textsc{EthicMind} jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that \textsc{EthicMind} achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.
[54] Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios
Hui Liu, Bin Zou, Kecheng Chen, Jie Liu, Wenya Wang, Haoliang Li
Main category: cs.CL
TL;DR: TRouter: A task-type-aware router for LLM selection using hierarchical task taxonomy and synthetic data to address cold-start routing challenges
Details
Motivation: LLMs show performance and cost variability across tasks, requiring routing systems, but existing routers struggle in cold-start scenarios without in-domain training dataMethod: Multi-level task-profile-guided data synthesis framework creates hierarchical task taxonomy and diverse QA pairs; TRouter models query-conditioned cost/performance via latent task-type variables with prior regularization from synthesized taxonomy
Result: Synthesis framework alleviates cold-start issues; TRouter delivers effective LLM routing across multiple benchmarks in both cold-start and in-domain settings
Conclusion: Task-profile-guided synthesis and task-type-aware routing with latent variables improves LLM routing utility, especially in cold-start scenarios
Abstract: Large language models (LLMs) exhibit substantial variability in performance and computational cost across tasks and queries, motivating routing systems that select models to meet user-specific cost-performance trade-offs. However, existing routers generalize poorly in cold-start scenarios where in-domain training data is unavailable. We address this limitation with a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and produces diverse question-answer pairs to approximate the test-time query distribution. Building on this, we introduce TRouter, a task-type-aware router approach that models query-conditioned cost and performance via latent task-type variables, with prior regularization derived from the synthesized task taxonomy. This design enhances TRouter’s routing utility under both cold-start and in-domain settings. Across multiple benchmarks, we show that our synthesis framework alleviates cold-start issues and that TRouter delivers effective LLM routing.
[55] Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
Solomiia Bilyk, Volodymyr Getmanskyi, Taras Firman
Main category: cs.CL
TL;DR: AIR is a rule-induction method for adapting LLMs to downstream tasks using few examples, compared against prompt optimization, retrieval, and fine-tuning across diverse benchmarks showing task-dependent performance.
Details
Motivation: To develop and evaluate Automated Instruction Revision (AIR) as an adaptation method for LLMs that can work with limited task-specific examples, and to understand which adaptation strategies work best for different types of tasks.Method: AIR uses rule induction to adapt LLMs to downstream tasks with limited examples. The method is compared against prompt optimization, retrieval-based methods (like KNN retrieval), and fine-tuning across a diverse benchmark suite testing different task requirements including knowledge injection, structured extraction, label remapping, and logical reasoning.
Result: Performance is strongly task-dependent: AIR was strongest on label-remapping classification, KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. No single method dominates across all settings.
Conclusion: AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger for tasks requiring source-specific knowledge or dataset-specific annotation regularities.
Abstract: This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.
[56] UIPress: Bringing Optical Token Compression to UI-to-Code Generation
Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu, Qizhen Lan
Main category: cs.CL
TL;DR: UIPress is a learned compression module for UI-to-Code generation that reduces visual tokens from ~6,700 to 256, achieving better performance and 9.1× speedup while adding minimal parameters.
Details
Motivation: UI-to-Code generation requires VLMs to produce thousands of tokens from screenshots, making visual token efficiency critical. Existing compression methods either use task-agnostic heuristics or zero out features without truly reducing prefill latency, and none adapt to the non-uniform information density of UI screenshots.Method: UIPress is a lightweight learned compression module inserted between frozen ViT encoder and LLM decoder. It combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress visual tokens. Uses LoRA on decoder to bridge representation gap, adding only ~21.7M trainable parameters.
Result: At 256 tokens, UIPress achieves CLIP score of 0.8127 on Design2Code, outperforming uncompressed baseline by +7.5% and strongest inference-time method by +4.6%, with 9.1× time-to-first-token speedup.
Conclusion: UIPress is the first encoder-side learned compression method for UI-to-Code task, demonstrating that learned compression can significantly improve both performance and efficiency for vision-language tasks with high visual token requirements.
Abstract: UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence – neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ${\sim}$6{,}700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ${\sim}$21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering 9.1$\times$ time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
[57] Many-Tier Instruction Hierarchy in LLM Agents
Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi
Main category: cs.CL
TL;DR: ManyIH introduces a scalable instruction hierarchy paradigm for LLM agents to resolve conflicts among arbitrarily many privilege levels, with a benchmark showing current models struggle with fine-grained conflict resolution.
Details
Motivation: Current instruction hierarchy approaches assume fixed, small privilege levels (typically <5) with rigid role labels, which is inadequate for real-world agentic settings where conflicts can arise across many more sources and contexts.Method: Proposes Many-Tier Instruction Hierarchy (ManyIH) paradigm for resolving instruction conflicts with arbitrarily many privilege levels, and introduces ManyIH-Bench benchmark with up to 12 levels of conflicting instructions across 853 agentic tasks (427 coding, 426 instruction-following).
Result: Experiments show current frontier models perform poorly (~40% accuracy) when instruction conflict scales, highlighting the urgent need for methods targeting fine-grained, scalable instruction conflict resolution.
Conclusion: The work underscores the need for explicit methods targeting fine-grained, scalable instruction conflict resolution in agentic settings, as current models struggle with complex privilege hierarchies.
Abstract: Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
[58] From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
Chenchen Zhang
Main category: cs.CL
TL;DR: Survey of 47 credit assignment methods for RL with LLMs, focusing on sparse rewards in reasoning and agentic settings, with taxonomy, resources, and analysis of methodological differences.
Details
Motivation: The credit assignment problem in RL for LLMs is challenging due to sparse outcome-level rewards and long trajectories. This is particularly acute in two regimes: reasoning RL (long chain-of-thought generations) and agentic RL (multi-turn environment interactions with stochastic transitions). Existing methods need systematic organization and evaluation.Method: Survey methodology analyzing 47 credit assignment methods (41 core, 6 enablers) published 2024-2026. Creates two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, TD, model-based, game-theoretic, information-theoretic). Develops three resources: structured paper inventory, reporting checklist, and benchmark protocol with decision tree.
Result: Identifies that reasoning credit assignment is maturing around process reward models and critic-free group comparison, while agentic credit assignment is driving new approaches like hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations that have no direct precedent in reasoning RL.
Conclusion: The shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape, requiring different methodological approaches for each regime. The survey provides structured resources to guide future research and address systematic gaps in credit assignment methods for LLM-based RL.
Abstract: Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards – yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500–30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches – hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations – that have no direct precedent in reasoning RL.
[59] Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities
Sathvik Nair, Colin Phillips
Main category: cs.CL
TL;DR: The paper critiques and extends claims about language models’ relationship to human language processing through Marr’s levels of analysis framework, arguing for combining LLM strengths with psycholinguistic models.
Details
Motivation: To critically examine two key claims about language models: 1) that prediction is central to language processing, and 2) that psycholinguistic advances depend on LLMs, using Marr's analytical framework to provide deeper theoretical understanding.Method: Uses Marr’s three levels of analysis (computational, algorithmic, implementation) as a theoretical framework to critique existing claims about language models and language processing, proposing an integrative approach.
Result: Provides a structured critique showing limitations of current claims about LLMs’ relationship to human language processing, and outlines how combining LLM capabilities with psycholinguistic models could advance both fields.
Conclusion: Future progress requires integrating LLM strengths with psycholinguistic models, moving beyond simplistic claims to more nuanced understanding of language processing across Marr’s levels.
Abstract: Under the lens of Marr’s levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicting upcoming linguistic information based on context is central to language processing, and second, that many advances in psycholinguistics would be impossible without large language models (LLMs). We further outline future directions that combine the strengths of LLMs with psycholinguistic models.
[60] Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL
Vishnu Murali, Anmol Gulati, Elias Lumer, Kevin Frank, Sindy Campagna, Vamse Kumar Subbiah
Main category: cs.CL
TL;DR: Jackal: First large-scale execution-based benchmark for translating natural language to Jira Query Language (JQL), with 100K validated NL-JQL pairs and tool-augmented agent approach using Jira MCP server and semantic retrieval.
Details
Motivation: Existing single-pass LLMs struggle with translating natural language to JQL due to inability to discover instance-specific categorical values (component names, fix versions) and verify queries against live data, limiting accuracy on ambiguous requests. No open execution-based benchmark exists for this task.Method: Created Jackal benchmark with 100K validated NL-JQL pairs on live Jira instance with 200K+ issues. Proposed Agentic Jackal - tool-augmented agent equipping LLMs with live query execution via Jira MCP server and JiraAnchor semantic retrieval tool for resolving categorical values through embedding-based similarity search.
Result: Single-pass LLMs average only 43.4% execution accuracy on short queries. Agentic approach improved 7 of 9 models with 9.0% relative gain on most challenging variant. JiraAnchor ablation raised categorical-value accuracy from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Dominant failure modes are semantic ambiguities (issue-type disambiguation, text-field selection) rather than value-resolution errors.
Conclusion: Text-to-JQL remains an open challenge requiring tool-augmented approaches. Benchmark and evaluation code released publicly to support reproducibility. Future work should address semantic ambiguities identified as dominant failure modes.
Abstract: Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
[61] RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval
Kyle Whitecross, Negin Rahimi
Main category: cs.CL
TL;DR: RecaLLM is a reasoning language model that interleaves reasoning with explicit in-context retrieval to address the “lost-in-thought” problem where reasoning steps degrade subsequent retrieval performance in long contexts.
Details
Motivation: The paper addresses the intertwined relationship between in-context retrieval and reasoning in LLMs, noting that reasoning steps that improve performance paradoxically make subsequent retrieval more challenging (the "lost-in-thought" problem), creating a bottleneck for test-time scaling with long contexts.Method: RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed for intermediate subproblems. It uses a constrained decoding mechanism for verbatim copying of evidence spans to improve grounding, and is trained on diverse lexical and semantic retrieval tasks.
Result: RecaLLM achieves strong performance on long-context benchmarks RULER and HELMET, significantly outperforming baselines. It shows consistent gains at context windows up to 128K tokens using training samples of only up to 10K tokens, far shorter than existing long-context approaches.
Conclusion: The approach demonstrates a promising path toward improving long-context performance without expensive long-context training data, effectively addressing the lost-in-thought problem through interleaved reasoning and retrieval.
Abstract: We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.
[62] BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo
Main category: cs.CL
TL;DR: BERT-as-a-Judge: An encoder-based approach for evaluating LLM outputs that’s more robust to phrasing variations than lexical methods and more efficient than LLM-as-a-Judge approaches.
Details
Motivation: Current LLM evaluation methods have limitations: lexical methods conflate problem-solving ability with formatting compliance, while LLM-as-a-Judge approaches are computationally expensive. There's a need for reliable, scalable evaluation that balances accuracy and efficiency.Method: Introduces BERT-as-a-Judge, an encoder-driven approach that assesses answer correctness in reference-based generative settings. It’s trained on synthetically annotated question-candidate-reference triplets and is robust to output phrasing variations.
Result: BERT-as-a-Judge consistently outperforms lexical baselines while matching the performance of much larger LLM judges. It provides a compelling tradeoff between accuracy and computational efficiency, enabling reliable, scalable evaluation.
Conclusion: BERT-as-a-Judge offers a practical solution for LLM evaluation that balances performance and efficiency, with detailed insights provided for practitioners and all project artifacts released for adoption.
Abstract: Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model’s true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge’s performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.
[63] You Can’t Fight in Here! This is BBS!
Richard Futrell, Kyle Mahowald
Main category: cs.CL
TL;DR: A discussion paper arguing that modern language models can inform linguistics despite common misconceptions, advocating for expanded research integrating LMs with language sciences.
Details
Motivation: To address misconceptions about language models' relevance to linguistics and advocate for a more integrated research program between computational models and language sciences.Method: Philosophical discussion format with fictional characters representing different perspectives, analyzing common misconceptions about LMs in linguistics.
Result: Identifies two key misconceptions: String Statistics Strawman (LMs are just statistical models) and As Good As it Gets Assumption (current LM research is the limit).
Conclusion: Advocates for expanded research program integrating LM-based work with traditional language sciences to produce better science of both human language and LMs.
Abstract: Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models can inform important questions in the language sciences. Just as they are about to part ways until they meet again, 25 of their closest friends show up – from linguistics, neuroscience, cognitive science, psychology, philosophy, and computer science. We use this discussion to highlight what we see as some common underlying issues: the String Statistics Strawman (the mistaken idea that LMs can’t be linguistically competent or interesting because they, like their Markov model predecessors, are statistical models that learn from strings) and the As Good As it Gets Assumption (the idea that LM research as it stands in 2026 is the limit of what it can tell us about linguistics). We clarify the role of LM-based work for scientific insights into human language and advocate for a more expansive research program for the language sciences in the AI age, one that takes on the commentators’ concerns in order to produce a better and more robust science of both human language and of LMs.
[64] Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation
Xinyu Wang, Sai Koneru, Wenbo Zhang, Wenliang Zheng, Saksham Ranjan, Sarah Rajtmajer
Main category: cs.CL
TL;DR: A benchmark for detecting AI-generated fake news that blends falsehoods with accurate information, revealing limitations in current detection methods.
Details
Motivation: Modern fake news increasingly involves human-AI collaboration where strategic inaccuracies are embedded within otherwise credible narratives, creating mixed-truth cases that are underrepresented in existing benchmarks and pose a realistic threat.Method: Introduces MANYFAKE, a synthetic benchmark with 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture various ways fake news can be constructed and refined. Evaluates state-of-the-art fake news detectors on this benchmark.
Result: Advanced reasoning-enabled models approach saturation on fully fabricated stories but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
Conclusion: Mixed-truth fake news represents a significant challenge for current detection methods, highlighting the need for more sophisticated approaches to handle strategically embedded falsehoods within credible narratives.
Abstract: Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
[65] Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
Soroosh Tayebi Arasteh, Mehdi Joodaki, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn
Main category: cs.CL
TL;DR: A framework for evidence verification where models must determine if provided evidence supports a claim in a specific case context, with automated supervision generation including counterfactual negatives.
Details
Motivation: Current evidence-grounded reasoning often fails because supervision is weak, evidence is only loosely tied to claims, and evaluation doesn't directly test evidence dependence. Models need to make decisions that genuinely depend on whether evidence supports claims.Method: Introduces case-grounded evidence verification framework where models receive case context, external evidence, and structured claim, then decide if evidence supports the claim. Key innovation is supervision construction procedure that generates explicit support examples with semantically controlled non-support examples (counterfactual wrong-state and topic-related negatives) without manual evidence annotation.
Result: The trained verifier substantially outperforms case-only and evidence-only baselines, remains strong with correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. Behavior transfers across unseen evidence articles and external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice.
Conclusion: A major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence. The framework enables models to learn genuine evidence dependence through structured supervision.
Abstract: Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
[66] Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov
Main category: cs.CL
TL;DR: Targeted weight pruning reveals LLMs have a compact, coherent internal structure for harmfulness that is distinct from benign capabilities, explaining emergent misalignment phenomena.
Details
Motivation: Despite alignment training, LLM safety remains brittle with jailbreaks and emergent misalignment. The paper investigates whether this reflects a fundamental lack of coherent internal organization for harmfulness or if there's an underlying structure that could inform more principled safety approaches.Method: Uses targeted weight pruning as a causal intervention to probe internal organization of harmfulness in LLMs. Analyzes weight distributions across harm types, compares aligned vs unaligned models, and examines how fine-tuning engages these weights to trigger emergent misalignment.
Result: Harmful content generation depends on a compact set of weights general across harm types and distinct from benign capabilities. Aligned models show greater compression of harm generation weights. This compression explains emergent misalignment: fine-tuning engaging these compressed weights in one domain triggers broad misalignment. Pruning harm generation weights reduces emergent misalignment. Harm generation capability is dissociated from content recognition/explanation.
Conclusion: LLMs have a coherent internal structure for harmfulness that may serve as a foundation for more principled safety approaches, despite surface-level brittleness of safety guardrails.
Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment’’ that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally–despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
[67] Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification
Andrea Nini, Oren Halvani, Lukas Graner, Sophie Titze, Valerio Gherardi, Shunichi Ishihara
Main category: cs.CL
TL;DR: Proposes LambdaG, a simpler authorship verification method based on modeling author grammar using Cognitive Linguistics principles, achieving superior performance over neural network methods with better explainability.
Details
Motivation: Existing authorship verification methods suffer from high complexity, low explainability, and lack clear scientific justification. The authors aim to develop a simpler, more interpretable method grounded in Cognitive Linguistics theory.Method: Models author grammar following Cognitive Linguistics principles, calculates λ_G (LambdaG) as the ratio of likelihoods of a document given candidate’s grammar versus reference population grammar.
Result: LambdaG achieves superior performance on twelve datasets compared to seven baseline methods, including neural network-based approaches. It’s robust to small variations in reference population composition and provides interpretable visualizations.
Conclusion: LambdaG’s effectiveness stems from its compatibility with Cognitive Linguistics theories that predict grammar as a behavioral biometric, offering a simpler, more explainable alternative to complex neural methods.
Abstract: Authorship Verification (AV) is a key area of research in digital text forensics, which addresses the fundamental question of whether two texts were written by the same person. Numerous computational approaches have been proposed over the last two decades in an attempt to address this challenge. However, existing AV methods often suffer from high complexity, low explainability and especially from a lack of clear scientific justification. We propose a simpler method based on modeling the grammar of an author following Cognitive Linguistics principles. These models are used to calculate $λ_G$ (LambdaG): the ratio of the likelihoods of a document given the candidate’s grammar versus given a reference population’s grammar. Our empirical evaluation, conducted on twelve datasets and compared against seven baseline methods, demonstrates that LambdaG achieves superior performance, including against several neural network-based AV methods. LambdaG is also robust to small variations in the composition of the reference population and provides interpretable visualizations, enhancing its explainability. We argue that its effectiveness is due to the method’s compatibility with Cognitive Linguistics theories predicting that a person’s grammar is a behavioral biometric.
[68] Mitigating Extrinsic Gender Bias for Bangla Classification Tasks
Sajib Kumar Saha Joy, Arman Hassan Mahy, Meherin Sultana, Azizah Mamun Abha, MD Piyal Ahmmed, Yue Dong, G M Shahariar
Main category: cs.CL
TL;DR: Proposes RandSymKL, a randomized debiasing strategy with symmetric KL divergence for mitigating extrinsic gender bias in Bangla pretrained language models across four classification tasks.
Details
Motivation: Extrinsic gender bias in low-resource languages like Bangla remains underexplored, despite its importance for fair NLP applications in diverse linguistic contexts.Method: Constructed four manually annotated benchmark datasets (sentiment, toxicity, hate speech, sarcasm) with gender perturbations, then proposed RandSymKL - a randomized training approach integrating symmetric KL divergence and cross-entropy loss for bias mitigation.
Result: RandSymKL effectively reduces gender bias while maintaining competitive accuracy compared to baseline methods across all four classification tasks.
Conclusion: The proposed approach successfully addresses extrinsic gender bias in Bangla language models, with publicly released datasets and implementation to support further research in low-resource language bias mitigation.
Abstract: In this study, we investigate extrinsic gender bias in Bangla pretrained language models, a largely underexplored area in low-resource languages. To assess this bias, we construct four manually annotated, task-specific benchmark datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. Each dataset is augmented using nuanced gender perturbations, where we systematically swap gendered names and terms while preserving semantic content, enabling minimal-pair evaluation of gender-driven prediction shifts. We then propose RandSymKL, a randomized debiasing strategy integrated with symmetric KL divergence and cross-entropy loss to mitigate the bias across task-specific pretrained models. RandSymKL is a refined training approach to integrate these elements in a unified way for extrinsic gender bias mitigation focused on classification tasks. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available: https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias
[69] Exploring Cross-lingual Latent Transplantation: Mutual Opportunities and Open Challenges
Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Libo Qin, Yichong Huang, Lei Huang, Weitao Ma, Qichen Hong, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin
Main category: cs.CL
TL;DR: XTransplant is a probing framework that transplants latent activations across languages to enhance multilingual capabilities and cultural adaptability in LLMs, revealing attention modules handle multilingual understanding while feed-forward modules capture culture-specific knowledge.
Details
Motivation: Current LLMs have imbalanced multilingual capabilities and cultural adaptability due to English-centric pre-training data. The paper aims to better exploit models' internalized multilingual knowledge during inference to address these limitations.Method: Proposes cross-lingual latent transplantation (XTransplant) framework that transplants latent activations across languages, allowing models to harness complementary strengths of English and non-English resources. Analyzes different model components (attention vs feed-forward modules) and conducts extensive analysis of stability, effectiveness, and generalizability.
Result: XTransplant shows mutually beneficial effects on multilingual capability and cultural adaptability, especially for low-resource languages and cultures. Attention modules support multilingual understanding while feed-forward modules capture culture-specific knowledge. Analysis reveals considerable underutilization of current LLMs’ multilingual potential.
Conclusion: XTransplant offers a new approach for advancing cross-lingual interactions and better leveraging models’ internalized multilingual knowledge, exposing significant untapped potential in current LLMs for multilingual and cultural adaptation tasks.
Abstract: Current large language models (LLMs) often exhibit imbalances in multilingual capabilities and cultural adaptability, largely attributed to their English-centric pre-training data. In this paper, we introduce and investigate cross-lingual latent transplantation (XTransplant), a probing framework which aims to further exploit the model’s internalized multilingual knowledge during inference and examine its effects on the multilingual capability and cultural adaptability of LLMs. XTransplant framework enables models to harness the complementary strengths of both English and non-English resources by transplanting latent activations across languages. Through extensive analysis, we empirically demonstrate that XTransplant, a form of cross-lingual interaction, has mutually beneficial effects on the multilingual capability and cultural adaptability of LLMs, particularly for low-resource languages and cultures. We further reveal that attention modules play a pivotal role in supporting multilingual understanding, while feed-forward modules are more adept at capturing culture-specific knowledge. In addition, we conduct in-depth analysis of XTransplant’s stability, effectiveness, and generalizability. By probing the upper bound performance of XTransplant, we expose the considerable underutilization of current LLMs’ multilingual potential-a challenge that remains open. We hope our analysis offers a new lens for advancing cross-lingual interactions and better leveraging models’ internalized multilingual knowledge.
[70] MSMO-ABSA: Multi-Scale and Multi-Objective Optimization for Cross-Lingual Aspect-Based Sentiment Analysis
Chengyan Wu, Bolei Ma, Ningyuan Deng, Yanqing He, Yun Xue, Xiaoyong Liu
Main category: cs.CL
TL;DR: MSMO framework improves cross-lingual aspect-based sentiment analysis through multi-scale alignment and multi-objective optimization with code-switching and knowledge distillation.
Details
Motivation: Existing multilingual ABSA studies lack robust feature alignment and finer aspect-level alignment, limiting cross-lingual performance.Method: Proposes MSMO with multi-scale alignment (sentence-level and aspect-level using code-switched bilingual sentences) and multi-objective optimization (supervised + consistency training), plus target language knowledge distillation.
Result: MSMO achieves state-of-the-art performance across multiple languages and models for cross-lingual ABSA.
Conclusion: The framework significantly enhances cross-lingual ABSA through improved alignment and optimization strategies.
Abstract: Aspect-based sentiment analysis (ABSA) garnered growing research interest in multilingual contexts in the past. However, the majority of the studies lack more robust feature alignment and finer aspect-level alignment. In this paper, we propose a novel framework, MSMO: Multi-Scale and Multi-Objective optimization for cross-lingual ABSA. During multi-scale alignment, we achieve cross-lingual sentence-level and aspect-level alignment, aligning features of aspect terms in different contextual environments. Specifically, we introduce code-switched bilingual sentences into the language discriminator and consistency training modules to enhance the model’s robustness. During multi-objective optimization, we design two optimization objectives: supervised training and consistency training, aiming to enhance cross-lingual semantic alignment. To further improve model performance, we incorporate distilled knowledge of the target language into the model. Results show that MSMO significantly enhances cross-lingual ABSA by achieving state-of-the-art performance across multiple languages and models.
[71] Constraining Sequential Model Editing with Editing Anchor Compression
Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, Jia-Chen Gu
Main category: cs.CL
TL;DR: EAC framework compresses editing anchors to minimize parameter deviation during sequential LLM editing, preserving general abilities while maintaining editing knowledge.
Details
Motivation: LLMs suffer from hallucinations due to outdated knowledge, and retraining is resource-intensive. Sequential editing degrades general abilities as parameter matrices deviate significantly from original states, affecting knowledge associations.Method: Editing Anchor Compression (EAC) framework selects editing anchors important for encoding new relations without deviating too much from original parameter matrix. Compresses editing information to constrain parameter deviation during sequential editing.
Result: EAC applied to two editing methods on three LLMs across four tasks preserves over 70% of general abilities while better retaining editing knowledge compared to original methods. Effectively minimizes unreasonable deviations caused by model editing.
Conclusion: EAC framework successfully addresses degradation of general abilities during sequential LLM editing by constraining parameter matrix deviation through selective anchor compression.
Abstract: Large language models (LLMs) struggle with hallucinations due to false or outdated knowledge. Given the high resource demands of retraining these models, there is an increasing focus on developing model editing. However, the general abilities of LLMs across downstream tasks are prone to significant degradation during sequential editing. This paper statistically observes that the parameter matrix after editing exhibits a significant deviation compared to its previous state as the number of edits increases. This serious deviation affects the original knowledge associations within LLMs and leads to the degradation of their general abilities. To this end, a framework termed Editing Anchor Compression (EAC) is proposed to constrain the deviation of the parameter matrix during sequential editing. It compresses the editing information by selecting editing anchors that are important in encoding new relations without deviating too much from the original matrix, thereby preserving the general abilities. Experiments of applying EAC to two popular editing methods on three LLMs across four tasks are conducted. Evaluation results show that EAC effectively minimizes unreasonable deviations caused by model editing, preserving over 70% of the general abilities while better retaining the editing knowledge compared to the original counterpart methods.
[72] SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding
Yuqi Yang, Weiqi Wang, Baixuan Xu, Wei Fan, Qing Zong, Chunkit Chan, Zheye Deng, Xin Liu, Yifan Gao, Changlong Yu, Chen Luo, Yang Li, Zheng Li, Qingyu Yin, Bing Yin, Yangqiu Song
Main category: cs.CL
TL;DR: SessionIntentBench: A multimodal benchmark for evaluating L(V)LMs’ ability to understand customer intention shifts in e-commerce browsing sessions using intention trees and session trajectories.
Details
Motivation: Prior works fail to effectively capture customer intention in e-commerce sessions due to insufficient information exploitation and lack of explicit intention modeling benchmarks. Current approaches only use apparent information like descriptions and titles, missing deeper user preference signals.Method: Introduces intention tree concept and dataset curation pipeline to construct SessionIntentBench with 1.95M intention entries and 1.13M session intention trajectories. Uses 10,905 sessions to create 13M tasks across four subtasks evaluating intention shift understanding.
Result: Created large-scale benchmark with human-annotated gold set. Experiments show current L(V)LMs fail to capture intention in complex session settings. Analysis demonstrates that injecting intention information enhances LLM performance.
Conclusion: SessionIntentBench provides scalable way to exploit session data for customer intention understanding and reveals limitations of current multimodal models in capturing intention shifts across browsing sessions.
Abstract: Session history is a common way of recording user interacting behaviors throughout a browsing activity with multiple products. For example, if an user clicks a product webpage and then leaves, it might because there are certain features that don’t satisfy the user, which serve as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because insufficient information exploitation and only apparent information like descriptions and titles are used. There is also a lack of data and corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs’ capability on understanding inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit the existing session data for customer intention understanding. We conduct human annotations to collect ground-truth label for a subset of collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize the intention across the complex session setting. Further analysis show injecting intention enhances LLMs’ performances.
[73] BEDTime: A Unified Benchmark for Automatically Describing Time Series
Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen
Main category: cs.CL
TL;DR: A benchmark study evaluating multimodal models’ ability to describe structural properties of time series, finding vision-language models outperform dedicated time-series models and language-only approaches.
Details
Motivation: Recent multimodal models claim strong performance on complex time series tasks but lack foundational evaluations of basic capabilities like describing structural properties of time series.Method: Created a benchmark (unnamed in abstract) with five datasets reformatted across three modalities to assess models’ ability to recognize, differentiate, and generate descriptions of univariate time series. Evaluated 17 state-of-the-art models.
Result: (1) Dedicated time series-language models underperform despite being designed for similar tasks, (2) vision language models perform best, (3) language-only methods perform worst, and (4) all models are fragile to real-world robustness tests.
Conclusion: The findings critique prior claims about multimodal time series models and provide directions for advancing multimodal time series modeling, particularly regarding robustness and foundational capabilities.
Abstract: Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question answering. However, they skip foundational evaluations that such complex models should have mastered. So we ask a simple question: \textit{How well can recent models describe structural properties of time series?} To answer this, we propose that successful models should be able to \textit{recognize}, \textit{differentiate}, and \textit{generate} descriptions of univariate time series. We then create \textbf{\benchmark}, a benchmark to assess these novel tasks, that comprises \textbf{five datasets} reformatted across \textbf{three modalities}. In evaluating \textbf{17 state-of-the-art models}, we find that (1) surprisingly, dedicated time series-language models fall short, despite being designed for similar tasks, (2) vision language models are quite capable, (3) language only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile to a range of real world robustness tests, indicating directions for future work. Together, our findings critique prior works’ claims and provide avenues for advancing multi-modal time series modeling.
[74] Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis
Haolin Yang, Hakaze Cho, Naoya Inoue
Main category: cs.CL
TL;DR: The paper proposes a unified framework (TSLA) to analyze in-context learning mechanisms in LLMs by identifying specialized attention heads for Task Recognition and Task Learning, showing how they work together to enable ICL.
Details
Motivation: To reconcile two dominant perspectives on in-context learning mechanisms: component-level analysis of attention heads vs. holistic decomposition into Task Recognition and Task Learning components, providing a unified interpretable account.Method: Proposed Task Subspace Logit Attribution (TSLA) framework to identify attention heads specialized in TR and TL. Used correlation analysis, ablation studies, input perturbations, and steering experiments with geometric analysis of hidden states.
Result: Identified distinct TR and TL heads with complementary roles: TR heads align hidden states with task subspace for recognition, while TL heads rotate states within subspace toward correct labels. Showed how previous findings (induction heads, task vectors) reconcile with TR-TL decomposition.
Conclusion: The TSLA framework provides a unified, interpretable account of how LLMs execute in-context learning across diverse tasks, bridging component-level and holistic perspectives on ICL mechanisms.
Abstract: We investigate the mechanistic underpinnings of in-context learning (ICL) in large language models by reconciling two dominant perspectives: the component-level analysis of attention heads and the holistic decomposition of ICL into Task Recognition (TR) and Task Learning (TL). We propose a novel framework based on Task Subspace Logit Attribution (TSLA) to identify attention heads specialized in TR and TL, and demonstrate their distinct yet complementary roles. Through correlation analysis, ablation studies, and input perturbations, we show that the identified TR and TL heads independently and effectively capture the TR and TL components of ICL. Using steering experiments with geometric analysis of hidden states, we reveal that TR heads promote task recognition by aligning hidden states with the task subspace, while TL heads rotate hidden states within the subspace toward the correct label to facilitate prediction. We further show how previous findings on ICL mechanisms, including induction heads and task vectors, can be reconciled with our attention-head-level analysis of the TR-TL decomposition. Our framework thus provides a unified and interpretable account of how large language models execute ICL across diverse tasks and settings.
[75] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight
Haolin Yang, Hakaze Cho, Kaize Ding, Naoya Inoue
Main category: cs.CL
TL;DR: The paper introduces Learned Task Vectors (LTVs) as a direct training approach for task representations in LLMs, surpassing extracted task vectors in accuracy and flexibility, while also analyzing their mechanistic role in in-context learning.
Details
Motivation: Current methods for extracting task vectors from LLMs are cumbersome, opaque, and don't elucidate how these vectors influence computation during in-context learning. There's a need for better task representations and understanding of their mechanistic role.Method: Proposes directly training Learned Task Vectors (LTVs) that can act at arbitrary layers and positions. Systematically analyzes TV mechanisms through attention-head OV circuits, identifying “key heads” and examining TV propagation through Transformer layers.
Result: LTVs outperform extracted task vectors in accuracy and offer superior flexibility. Analysis reveals TVs steer predictions primarily through attention-head OV circuits, with TV propagation being largely linear despite Transformer nonlinearities.
Conclusion: LTVs provide both a practical approach for obtaining effective task vectors and a principled lens into the mechanistic foundations of in-context learning in LLMs.
Abstract: Large Language Models (LLMs) can perform new tasks from in-context demonstrations, a phenomenon known as in-context learning (ICL). Recent work suggests that these demonstrations are compressed into task vectors (TVs), compact task representations that LLMs exploit for predictions. However, prior studies typically extract TVs from model outputs or hidden states using cumbersome and opaque methods, and they rarely elucidate the mechanisms by which TVs influence computation. In this work, we address both limitations. First, we propose directly training Learned Task Vectors (LTVs), which surpass extracted TVs in accuracy and exhibit superior flexibility-acting effectively at arbitrary layers, positions, and even with ICL prompts. Second, through systematic analysis, we investigate the mechanistic role of TVs, showing that at the low level they steer predictions primarily through attention-head OV circuits, with a small subset of “key heads” most decisive. At a higher level, we find that despite Transformer nonlinearities, TV propagation is largely linear: early TVs are rotated toward task-relevant subspaces to improve logits of relevant labels, while later TVs are predominantly scaled in magnitude. Taken together, LTVs not only provide a practical approach for obtaining effective TVs but also offer a principled lens into the mechanistic foundations of ICL.
[76] Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Main category: cs.CL
TL;DR: PACO is a training-free framework for multi-attribute controllable summarization that uses adaptive planning with Monte Carlo Tree Search to determine optimal attribute control order, enabling progressive refinement without per-attribute fine-tuning.
Details
Motivation: Current controllable summarization approaches struggle with interdependent attributes and require per-attribute fine-tuning, limiting flexibility. There's a need for a training-free method that can handle correlated constraints consistently across diverse attributes.Method: Proposes PACO framework that reframes the task as planning attribute control order using Monte Carlo Tree Search. Nodes represent summaries, actions are single-attribute adjustments, enabling progressive refinement of only attributes needing further control. The method adaptively discovers optimal control sequences without training.
Result: PACO achieves robust multi-attribute controllability across diverse domains and models, surpassing both LLM-based self-planning models and fine-tuned baselines. PACO with Llama-3.2-1B rivals controllability of much larger Llama-3.3-70B baselines, and with larger models achieves superior control performance.
Conclusion: PACO provides an effective training-free solution for multi-attribute controllable summarization, demonstrating strong performance across model sizes and domains while addressing the challenges of interdependent attributes and per-attribute fine-tuning limitations.
Abstract: Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
[77] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Jielin Qiu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
Main category: cs.CL
TL;DR: Webscale-RL pipeline converts web-scale pre-training documents into diverse QA pairs for RL training, creating a 1.2M example dataset that enables more efficient RL training for language models.
Details
Motivation: Current RL training for LLMs suffers from a data bottleneck - existing RL datasets are much smaller and less diverse than web-scale pre-training corpora, limiting RL's potential for bridging the training-generation gap and enabling robust reasoning.Method: Developed Webscale-RL pipeline that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL training. Created Webscale-RL dataset with 1.2 million examples across 9+ domains.
Result: Models trained on Webscale-RL dataset significantly outperform continual pretraining and strong data refinement baselines across benchmarks. RL training with this dataset is substantially more efficient, achieving performance of continual pre-training with up to 100× fewer tokens.
Conclusion: Webscale-RL presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models by addressing the critical data bottleneck in RL training.
Abstract: Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
[78] SSPO: Subsentence-level Policy Optimization
Kun Yang, Zikang chen, Yanmeng Wang, Zhigen Li, Ning Cheng, Shaojun Wang, Jing Xiao
Main category: cs.CL
TL;DR: SSPO is a new RLVR algorithm that computes importance ratios at subsentence level to balance stability between token-level GRPO and response-level GSPO, achieving better performance on math reasoning tasks.
Details
Motivation: Existing RLVR algorithms have stability issues: GRPO suffers from unstable policy updates due to token-level importance ratios, while GSPO can retain high-variance tokens and has near-zero clipping fractions that cause unstable updates.Method: Proposes SSPO which computes importance ratios at subsentence level, balancing between GRPO and GSPO. Also incorporates subsentence-level entropy into PPO-CLIP to adaptively adjust clipping bounds - encouraging exploration for high-entropy tokens while tightening clipping for low-entropy tokens.
Result: SSPO achieves average score of 46.72 across five datasets on Qwen2.5-1.5B-Math model, outperforming GRPO (43.01) and GSPO (44.42), and attains SOTA on four datasets. Also achieves highest averaged scores on Qwen2.5-7B-Math model over five baseline methods.
Conclusion: SSPO effectively addresses stability issues in RLVR by using subsentence-level importance ratios and adaptive entropy-based clipping, demonstrating superior performance on math reasoning tasks.
Abstract: As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues: GRPO (Group Relative Policy Optimization) often suffers from unstable policy updates, while GSPO (Group Sequence Policy Optimization) can retain high-variance tokens. In GRPO, the importance ratio is computed at the token level, which overemphasizes individual tokens and makes learning sensitive to outliers, potentially causing training collapse. GSPO instead computes a response-level importance ratio, mitigating variance and reducing the accumulation of token-level noise present in GRPO. Nevertheless, our experiments show that GSPO frequently yields a near-zero clipping fraction: extreme token-level ratios can be diluted by other tokens in the same response, causing the entire response to be retained and resulting in unstable updates. We propose SSPO, which computes importance ratios at the subsentence level, striking a balance between GRPO and GSPO. SSPO alleviates training collapse and excessive variance while avoiding the failure mode in which the clipping mechanism indiscriminately retains entire responses. Moreover, we incorporate subsentence-level entropy into PPO-CLIP to adaptively adjust the clipping bounds: we encourage exploration for high-entropy tokens while tightening the clipping range for low-entropy tokens. Empirically, SSPO achieves an average score of 46.72 across five datasets on Qwen2.5-1.5B-Math model, outperforming GRPO (43.01) and GSPO (44.42), and attains state-of-the-art results on four datasets. On Qwen2.5-7B-Math model, SSPO also achieves the highest averaged scores over five baseline methods. These results demonstrate SSPO’s effectiveness in RLVR.
[79] Structured Uncertainty guided Clarification for LLM Agents
Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
Main category: cs.CL
TL;DR: Structured uncertainty framework for LLM agents that quantifies disambiguation value of questions using EVPI to improve tool-calling with ambiguous instructions
Details
Motivation: LLM agents with tool-calling often fail with ambiguous/incomplete instructions, and existing prompting approaches lack principled criteria for question selection and stoppingMethod: Introduces structured uncertainty formulation operating over tool parameters and domains, separating specification uncertainty from model uncertainty. Uses Expected Value of Perfect Information (EVPI) to quantify disambiguation value of questions, balanced with aspect-based cost modeling to prevent redundancy
Result: SAGE-Agent achieves 7-39% higher coverage on ambiguous tasks while reducing clarification questions by 1.5-2.7x. Uncertainty-guided reward modeling boosts When2Call accuracy from 36.5% to 65.2% (3B) and 36.7% to 62.9% (7B) through uncertainty-weighted GRPO training
Conclusion: Structured uncertainty provides principled framework improving both inference-time interaction efficiency and training-time sample efficiency in tool-augmented agents
Abstract: LLM agents with tool-calling capabilities often fail when user instructions are ambiguous or incomplete, leading to incorrect invocations and task failures. Existing approaches operate in unstructured language spaces, generating clarifying questions through prompting strategies that lack principled criteria for determining which questions to ask and when to stop. We introduce a principled formulation of structured uncertainty that operates directly over tool parameters and their domains, cleanly separating specification uncertainty (what the user wants) from model uncertainty (what the LLM predicts). Our formulation uses Expected Value of Perfect Information (EVPI) to quantify the disambiguation value of each potential question, balanced against aspect-based cost modeling that prevents redundant questioning. We demonstrate the versatility of this formulation through two applications. First, SAGE-Agent uses structured uncertainty for inference-time question selection, achieving 7-39% higher coverage on ambiguous tasks while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines. Second, we show that structured uncertainty provides effective training signals: uncertainty-guided reward modeling boosts When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training, demonstrating more sample-efficient reinforcement learning for tool-calling agents. To enable evaluation, we present ClarifyBench, the first multi-turn dynamic tool-calling disambiguation benchmark. Our results establish structured uncertainty as a principled framework that improves both inference-time interaction efficiency and training-time sample efficiency in tool-augmented agents.
[80] SkillFactory: Self-Distillation For Learning Cognitive Behaviors
Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett
Main category: cs.CL
TL;DR: SkillFactory: A method to teach language models cognitive skills like verification and backtracking through self-generated training data before reinforcement learning, improving generalization and robustness.
Details
Motivation: Current reasoning models need cognitive skills like verification, backtracking, and retrying, but base language models often don't exhibit these skills. The goal is to teach models these skills without relying on distillation from stronger models.Method: SkillFactory uses self-generated samples from the model itself, rearranged to create “silver” supervised fine-tuning (SFT) traces that demonstrate cognitive skills. These imperfect but effective traces prime the model to acquire skills during subsequent reinforcement learning (RL).
Result: (1) SkillFactory SFT initialization helps models generalize to harder task variants post-RL despite lower pre-RL performance; (2) Models actually use cognitive skills; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models.
Conclusion: Inductive biases learned prior to RL help models learn robust cognitive skill use. Self-generated training data can effectively teach models cognitive skills that aren’t present in base models.
Abstract: Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren’t exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These “silver” SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL;(2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
[81] Which Pieces Does Unigram Tokenization Really Need?
Sander Land, Yuval Pinter
Main category: cs.CL
TL;DR: Implementation guide for Unigram tokenization algorithm with simplified alternative for better compression
Details
Motivation: Unigram tokenization offers probabilistic alternative to BPE but has complex implementation limiting adoption; need to bridge theory-practice gapMethod: Provide clear implementation guide and parameter choices; identify simpler algorithm that trades slightly higher training loss for improved compression
Result: Bridged implementation gap; identified simplified algorithm with better compression despite slightly higher training loss
Conclusion: Makes Unigram tokenization more accessible; simpler algorithm offers practical benefits for compression
Abstract: The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.
[82] The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?
Chen Shani, Yuval Reif, Nathan Roll, Dan Jurafsky, Ekaterina Shutova
Main category: cs.CL
TL;DR: Survey paper analyzing why multilingual language models show uneven performance across languages, examining whether gaps stem from intrinsic linguistic difficulty or modeling artifacts like tokenization, encoding, and data allocation.
Details
Motivation: Multilingual language models promise broader NLP access but deliver uneven performance across languages. The paper aims to understand whether performance gaps reflect intrinsic linguistic difficulty or are artifacts of modeling choices like tokenization, encoding, data exposure, and parameter sharing.Method: Literature survey organized around two key questions: 1) whether linguistic disparities arise from representation and allocation choices rather than inherent complexity, and 2) which design choices mitigate inequities across typologically diverse languages. The paper reviews linguistic features (orthography, morphology, lexical diversity, syntax, information density, typological distance) and links each to concrete modeling mechanisms.
Result: The survey finds that performance gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices rather than intrinsic linguistic complexity.
Conclusion: The paper synthesizes insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual language models that provide more equitable performance across diverse languages.
Abstract: Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world’s languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
[83] EVOKE: Emotion Vocabulary Of Korean and English
Yoonwon Jung, Hagyeong Shin, Benjamin K. Bergen
Main category: cs.CL
TL;DR: EVOKE is a Korean-English parallel dataset of emotion words with comprehensive coverage, many-to-many translations, and identification of language-specific emotion words, containing 1,426 Korean and 1,397 English words with systematic annotation of adjectives and verbs.
Details
Motivation: To create a systematic, theory-agnostic dataset of emotion words in both Korean and English that can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and NLP research, addressing the need for comprehensive cross-linguistic emotion vocabulary resources.Method: The authors compiled a parallel dataset of emotion words in Korean and English, systematically annotating 819 Korean and 924 English adjectives and verbs. They annotated multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors, and established many-to-many translations between the two languages.
Result: Created EVOKE dataset containing 1,426 Korean words and 1,397 English words with comprehensive coverage of emotion vocabulary, many-to-many translations, identification of language-specific emotion words, and annotation of polysemous words and emotion-related metaphors.
Conclusion: EVOKE is the most systematic and theory-agnostic dataset of emotion words in both Korean and English to date, providing a valuable resource for researchers in emotion science, psycholinguistics, computational linguistics, and NLP who need cross-linguistic emotion vocabulary data.
Abstract: This paper introduces EVOKE (Emotion Vocabulary of Korean and English), a Korean-English parallel dataset of emotion words. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,426 Korean words and 1,397 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most systematic and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at https://github.com/yoonwonj/EVOKE.
[84] Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory
Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li, Jiaxi Zhou, Hualei Wang, Haohua Wang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang
Main category: cs.CL
TL;DR: Mnemis is a novel memory framework for LLMs that combines similarity-based retrieval (System-1) with hierarchical graph-based global reasoning (System-2) to improve memory organization and retrieval for complex scenarios requiring comprehensive information coverage.
Details
Motivation: Existing memory retrieval methods (RAG and Graph-RAG) rely primarily on similarity-based mechanisms, which struggle with scenarios requiring global reasoning or comprehensive coverage of all relevant information. There's a need for a more sophisticated memory framework that can handle complex retrieval scenarios beyond simple similarity matching.Method: Mnemis organizes memory into two structures: a base graph for similarity retrieval (System-1) and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies (System-2). The framework integrates both retrieval routes to combine their complementary strengths, retrieving memory items that are both semantically and structurally relevant.
Result: Mnemis achieves state-of-the-art performance across long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini, outperforming all compared methods.
Conclusion: The Mnemis framework successfully addresses limitations of existing memory retrieval methods by integrating System-1 similarity search with System-2 global reasoning, providing a more comprehensive and effective memory system for LLMs that can handle complex retrieval scenarios requiring both semantic relevance and structural coverage.
Abstract: AI Memory, specifically how models organizes and retrieves historical messages, becomes increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning or comprehensive coverage of all relevant information. In this work, We propose Mnemis, a novel memory framework that integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection. Mnemis organizes memory into a base graph for similarity retrieval and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies. By combining the complementary strength from both retrieval routes, Mnemis retrieves memory items that are both semantically and structurally relevant. Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
[85] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Jonathan Steinberg, Oren Gal
Main category: cs.CL
TL;DR: The paper investigates how OCR information flows through different vision-language model architectures using causal interventions, revealing architecture-specific OCR bottlenecks and surprising findings about OCR interference with other visual tasks.
Details
Motivation: To understand how optical character recognition (OCR) information is processed and integrated within different vision-language model architectures, specifically examining where OCR signals enter the language processing stream and how they affect model behavior.Method: Used causal interventions by comparing activation differences between original images and text-inpainted versions across three VLM architecture families (Qwen3-VL, Phi-4, InternVL3.5). Applied principal component analysis (PCA) to analyze OCR signal dimensionality and conducted transfer experiments across datasets.
Result: Found architecture-specific OCR bottlenecks: DeepStack models (Qwen) peak at mid-depth (~50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%). OCR signal is low-dimensional (PC1 captures 72.9% variance) and PCA directions transfer across datasets. Surprisingly, OCR removal improved counting performance (+6.9pp) in modular architectures like Qwen3-VL-4B.
Conclusion: OCR processing pathways are architecture-dependent but share common low-dimensional representations across datasets. In modular architectures, OCR circuits can interfere with other visual processing tasks, suggesting potential for optimization by selectively routing OCR information.
Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
[86] CodeScout: Contextual Problem Statement Enhancement for Software Agents
Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, Varun Kumar
Main category: cs.CL
TL;DR: CodeScout improves AI code assistance by refining underspecified user requests through lightweight pre-exploration of codebases, converting vague problems into comprehensive statements before agent execution.
Details
Motivation: Current AI code assistance tools struggle with poorly-defined problem statements lacking sufficient context, leading to longer trajectories, over-exploration, repeated failed fixes, and suboptimal outcomes in software development tasks.Method: CodeScout performs contextual query refinement through lightweight pre-exploration: targeted context scoping, multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints.
Result: Evaluation on SWEBench-Verified shows 20% improvement in resolution rates with up to 27 additional issues resolved compared to baseline methods, reducing non-converging agent trajectories while clarifying user intent.
Conclusion: Systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities without requiring modifications to underlying agent scaffolds.
Abstract: Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre-exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre-exploration directly addresses the identified failure patterns by reducing non-converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state-of-the-art agentic scaffolds and language models on SWEBench-Verified, demonstrating a 20% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.
[87] Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models
Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu
Main category: cs.CL
TL;DR: Proposes subdomain adaptation through mid-training for radiology report summarization, showing that clinical pre-training followed by radiology mid-training outperforms direct fine-tuning approaches.
Details
Motivation: To reduce physician burden through better automatic summarization of radiology reports by addressing limitations of direct fine-tuning approaches with a mid-training strategy for subdomain adaptation.Method: Three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. Uses large-scale clinical text from UF Health for development, with mid-training and fine-tuning on OpenI and MIMIC-CXR datasets.
Result: Mid-trained model GatorTronT5-Radio achieved best performance, outperforming models without mid-training in both ROUGE-L (text-based) and RadGraph-F1 (factuality) measures. Also shows better few-shot learning and alleviates “cold start” problem.
Conclusion: Supports “pre-training, mid-training, fine-tuning” strategy over direct fine-tuning for radiology report summarization, demonstrating effectiveness of subdomain adaptation through mid-training.
Abstract: Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the “pre-training, fine-tuning” strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the “cold start” problem reported in previous studies as a learning barrier. Our findings support the use of “pre-training, mid-training, fine-tuning,” instead of the widely used direct fine-tuning strategy.
[88] No Single Best Model for Diversity: Learning a Router for Sample Diversity
Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi
Main category: cs.CL
TL;DR: The paper introduces diversity coverage as a metric to evaluate LLMs’ ability to generate comprehensive sets of valid answers to open-ended prompts, and proposes a router system to select the best model for each query to maximize answer diversity.
Details
Motivation: When prompts allow multiple valid answers, current LLMs struggle to comprehensively generate diverse responses. There's a need to evaluate and improve models' ability to produce comprehensive answer sets that satisfy a wide range of users.Method: 1) Introduces diversity coverage metric to measure quality of unique answers relative to optimal set; 2) Evaluates 18 LLMs on open-ended prompts; 3) Develops a router that predicts the best model for each query based on prompt characteristics.
Result: No single model dominates across all prompts, but per-prompt there exists a significantly better model. The trained router outperforms single best model baseline by 2.5% (26.3% vs 23.8%) on NB-Wildchat and generalizes to out-of-domain datasets and different prompting strategies.
Conclusion: The work establishes foundations for studying comprehensive answer generation when multiple models are available, showing that model selection per query can significantly improve answer diversity coverage.
Abstract: When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce \textbf{diversity coverage}, a metric that measures the total quality scores assigned to each \textbf{unique} answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs $23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.
[89] Verbalizing LLMs’ assumptions to explain and control sycophancy
Myra Cheng, Isabel Sieh, Humishka Zope, Sunny Yu, Lujain Ibrahim, Aryaman Arora, Jared Moore, Desmond Ong, Dan Jurafsky, Diyi Yang
Main category: cs.CL
TL;DR: A framework called Verbalized Assumptions that elicits LLMs’ incorrect assumptions about user intent, showing how these assumptions cause social sycophancy and enabling interpretable steering of model behavior.
Details
Motivation: LLMs exhibit social sycophancy by affirming users rather than providing genuine assessments, which may stem from incorrect assumptions about user intent (e.g., underestimating how often users seek information over reassurance).Method: Verbalized Assumptions framework to elicit LLMs’ assumptions about user intent; assumption probes (linear probes on internal representations) to causally link assumptions to sycophantic behavior and enable interpretable steering.
Result: Top bigram in LLMs’ assumptions on social sycophancy datasets is “seeking validation”; assumption probes successfully steer social sycophancy; LLMs trained on human-human conversation fail to account for different expectations in human-AI interactions.
Conclusion: Assumptions are a key mechanism for LLM sycophancy; Verbalized Assumptions provides insight into sycophancy, delusion, and safety issues; there’s a mismatch between human expectations from AI vs. human responses.
Abstract: LLMs can be socially sycophantic, affirming users when they ask questions like “am I in the wrong?” rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs’ assumptions on social sycophancy datasets is ``seeking validation.’’ We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
[90] Many Preferences, Few Policies: Towards Scalable Language Model Personalization
Cheol Woo Kim, Jai Moondra, Roozbeh Nahavandi, Andrew Perrault, Milind Tambe, Swati Gupta
Main category: cs.CL
TL;DR: PALM algorithm selects small portfolio of LLMs to cover diverse user preferences with theoretical guarantees on portfolio size and approximation quality.
Details
Motivation: Maintaining separate LLMs for each user is impractical due to compute, memory, and system constraints, but users have diverse preferences across multiple traits (safety, humor, brevity). Need principled method to select small portfolio of LLMs that can serve heterogeneous users effectively.Method: Models user preferences as multi-dimensional weight vectors across traits. Given reward functions for each dimension, PALM algorithm generates small portfolio of LLMs such that for any weight vector, portfolio contains near-optimal LLM for corresponding scalarized objective. Provides theoretical guarantees on portfolio size and approximation quality.
Result: First result providing theoretical guarantees on both size and approximation quality of LLM portfolios for personalization. Characterizes trade-off between system cost and personalization, and diversity of LLMs needed to cover user preference landscape. Empirical results validate guarantees and show greater output diversity over baselines.
Conclusion: PALM enables practical LLM personalization by selecting small portfolios that can effectively serve diverse user preferences with theoretical guarantees, addressing the system cost vs. personalization trade-off.
Abstract: The holy grail of LLM personalization is a single LLM for each user, perfectly aligned with that user’s preferences. However, maintaining a separate LLM per user is impractical due to constraints on compute, memory, and system complexity. We address this challenge by developing a principled method for selecting a small portfolio of LLMs that captures representative behaviors across heterogeneous users. We model user preferences across multiple traits (e.g., safety, humor, brevity) through a multi-dimensional weight vector. Given reward functions across these dimensions, our algorithm PALM (Portfolio of Aligned LLMs) generates a small portfolio of LLMs such that, for any weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective. To the best of our knowledge, this is the first result that provides theoretical guarantees on both the size and approximation quality of LLM portfolios for personalization. It characterizes the trade-off between system cost and personalization, as well as the diversity of LLMs required to cover the landscape of user preferences. We provide empirical results that validate these guarantees and demonstrate greater output diversity over common baselines.
[91] TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai
Main category: cs.CL
TL;DR: Researchers introduce TEC dataset capturing human trial-and-error problem-solving with web navigation tasks, showing humans outperform LLMs in trial-and-error effectiveness.
Details
Motivation: Current AI trial-and-error techniques rely on simple heuristics and lack data on how humans actually conduct trial-and-error in practice, limiting performance gains.Method: Developed a data annotation platform to record users’ complete trajectories across multiple trials with error reflections, collected from 46 participants on 58 tasks (5,370 trajectories across 41,229 webpages).
Result: Humans achieve substantially higher accuracy compared to LLMs, demonstrating superior trial-and-error effectiveness. Dataset provides foundation for understanding human behavior.
Conclusion: TEC platform and dataset enable study of human trial-and-error and development of more capable AI systems. Publicly available resources.
Abstract: Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users’ complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.
[92] WisdomInterrogatory (LuWen): An Open-Source Legal Large Language Model Technical Report
Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou, Kun Kuang, Fei Wu
Main category: cs.CL
TL;DR: LuWen is an open-source Chinese legal language model built on Baichuan foundation model, adapted for legal domain through continual pre-training, supervised fine-tuning, and retrieval-augmented generation with legal knowledge base.
Details
Motivation: Large language models struggle in legal domain due to specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge, requiring domain-specific adaptation.Method: Three key techniques: 1) continual pre-training on large-scale legal corpus, 2) supervised fine-tuning with curated legal instruction data, 3) retrieval-augmented generation integrated with comprehensive legal knowledge base.
Result: Outperforms several strong baselines on five representative legal tasks: legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning.
Conclusion: Demonstrates effectiveness of adapting general-purpose language models to legal domain through specialized training techniques and knowledge integration.
Abstract: Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present WisdomInterrogatory (LuWen), an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate LuWen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that LuWen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.
[93] Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang, Jincheng Yu, Jose M. Alvarez, Pavlo Molchanov, Ping Luo, Song Han, Ligeng Zhu, Enze Xie
Main category: cs.CL
TL;DR: Fast-dVLM: A block-diffusion-based vision-language model that enables parallel decoding for faster inference while maintaining generation quality comparable to autoregressive models.
Details
Motivation: Autoregressive decoding in VLMs limits inference throughput, especially in edge devices for robotics/autonomous driving where batch size is one. Need parallel decoding methods that can handle both continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities.Method: Block-diffusion-based VLM with KV-cache-compatible parallel decoding and speculative block decoding. Compares two AR-to-diffusion conversion strategies: two-stage (LLM backbone first) vs direct (full VLM conversion). Introduces multimodal diffusion adaptations including block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation.
Result: Fast-dVLM matches autoregressive counterpart in generation quality across 11 multimodal benchmarks. With SGLang integration and FP8 quantization, achieves over 6x end-to-end inference speedup over AR baseline.
Conclusion: Direct conversion strategy is more efficient than two-stage approach. Fast-dVLM enables practical parallel decoding for VLMs while maintaining quality, addressing inference bottlenecks in edge deployment scenarios.
Abstract: Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.
[94] An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations
Clarissa Miranda-Pena, Andrew Reeson, Cécile Paris, Josiah Poon, Jonathan K. Kummerfeld
Main category: cs.CL
TL;DR: Static analysis tools can detect 16-70% of LLM code hallucinations involving library usage, but have inherent limitations with an upper bound of 48.5-77% detection potential.
Details
Motivation: LLMs continue to hallucinate when generating code, especially with library usage, producing non-existent library features in 8.1-40% of responses. The paper aims to evaluate static analysis as a practical approach for detecting and mitigating these hallucinations.Method: The researchers analyze the potential of static analysis tools for detecting LLM code hallucinations. They evaluate performance across different LLMs and datasets, conduct manual analysis to identify cases that static methods cannot plausibly catch, and establish upper bounds on detection potential.
Result: Static analysis tools can detect 16-70% of all errors and 14-85% of library hallucinations, with performance varying by LLM and dataset. Manual analysis reveals static methods cannot catch certain cases, establishing an upper bound of 48.5-77% detection potential.
Conclusion: Static analysis provides a cheap method for addressing some forms of hallucination in LLM-generated code, but has inherent limitations that prevent it from fully solving the problem. The research quantifies both the practical utility and theoretical limits of static approaches.
Abstract: Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.
[95] MemReader: From Passive to Active Extraction for Long-Term Agent Memory
Jingyi Kang, Chunyu Li, Ding Chen, Bo Tang, Feiyu Xiong, Zhiyu Li
Main category: cs.CL
TL;DR: MemReader introduces active long-term memory extraction for AI agents with two models: a compact passive extractor and an active extractor using reinforcement learning to make selective memory writing decisions based on information value, ambiguity, and completeness.
Details
Motivation: Existing memory extraction systems treat it as one-shot passive transcription, struggling with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency in personalized and autonomous agents.Method: Two models: MemReader-0.6B (compact passive extractor distilled for accurate structured outputs) and MemReader-4B (active extractor optimized with Group Relative Policy Optimization). The active extractor uses ReAct-style paradigm to evaluate information value, reference ambiguity, and completeness before acting, enabling selective memory writing, deferral, retrieval, or discarding.
Result: MemReader consistently outperforms existing extraction-based baselines on LOCOMO, LongMemEval, and HaluMem benchmarks. MemReader-4B achieves state-of-the-art performance on knowledge updating, temporal reasoning, and hallucination reduction tasks.
Conclusion: Effective agent memory requires reasoning-driven selective memory extraction rather than just extracting more information. MemReader enables building low-noise, dynamically evolving long-term memory and has been integrated into MemOS for real-world deployment.
Abstract: Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.
[96] Linear Representations of Hierarchical Concepts in Language Models
Masaki Sakata, Benjamin Heinzerling, Takumi Ito, Sho Yokoi, Kentaro Inui
Main category: cs.CL
TL;DR: Language models encode hierarchical relations (e.g., Japan ⊂ Eastern Asia ⊂ Asia) in interpretable linear representations that can be recovered through domain-specific linear transformations.
Details
Motivation: To understand how hierarchical relations between concepts are encoded in language model representations, going beyond prior work by covering multi-token entities and cross-layer analysis.Method: Train linear transformations specific to each hierarchical depth and semantic domain using Linear Relational Concepts, analyze representational geometry across layers, and evaluate in-domain generalization and cross-domain transfer.
Result: Hierarchical relations can be linearly recovered from model representations within domains, encoded in low-dimensional domain-specific subspaces, with highly similar hierarchy representation across these subspaces.
Conclusion: Language models encode concept hierarchies in highly interpretable linear representations, with consistent patterns across different semantic domains despite domain-specific encoding subspaces.
Abstract: We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.
[97] HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang, Tingwen Liu, Li Guo, Yafeng Deng
Main category: cs.CL
TL;DR: HyperMem: A hypergraph-based hierarchical memory architecture for conversational agents that models high-order associations among multiple elements using hyperedges, improving long-term conversation coherence and retrieval accuracy.
Details
Motivation: Existing memory approaches (RAG, graph-based) rely on pairwise relations, which fail to capture high-order associations among multiple elements, leading to fragmented retrieval and poor coherence in long-term conversations.Method: Proposes HyperMem with three-level hierarchical memory (topics, episodes, facts) using hyperedges to group related episodes and facts. Implements hybrid lexical-semantic index and coarse-to-fine retrieval strategy for efficient high-order association retrieval.
Result: Achieves state-of-the-art performance on LoCoMo benchmark with 92.73% LLM-as-a-judge accuracy, demonstrating effectiveness for long-term conversations.
Conclusion: HyperMem’s hypergraph-based architecture effectively captures high-order associations, enabling more coherent and personalized long-term conversational interactions.
Abstract: Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Particularly, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
cs.CV
[98] Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach
Ponkoj Chandra Shill
Main category: cs.CV
TL;DR: Multimodal forensic framework for hate/threat detection that analyzes evidence configuration (embedded text, contextual text, image-only) and applies appropriate text analysis, multimodal fusion, or vision-language models based on available evidence.
Details
Motivation: Digital forensic investigations need to handle heterogeneous evidence (images, documents, reports) containing harmful content, but existing approaches assume clean text or use vision models without forensic justification.Method: Case-driven multimodal framework that first determines evidence configuration (embedded text, associated contextual text, image-only), then selectively applies text analysis, multimodal fusion, or vision-language models (ViT backbones) based on evidence availability.
Result: Experimental evaluation on forensic-style image evidence shows consistent and interpretable behavior across heterogeneous evidence scenarios, improving evidentiary traceability.
Conclusion: The framework mirrors forensic decision-making by conditioning inference on evidence availability, avoids unjustified modality assumptions, and provides interpretable multimodal analysis for forensic applications.
Abstract: Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.
[99] A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures
Riccardo Pallotto, Pierluigi Feliciati, Tiberio Uricchio
Main category: cs.CV
TL;DR: A framework for converting 2D medieval manuscript miniatures to 3D models using AI methods, with Hi3DGen identified as best for balancing quality and detail.
Details
Motivation: To create accessible 3D models from 2D manuscript art for XR applications, tactile printing, and web visualization, particularly for cultural heritage preservation and accessibility.Method: Evaluated 7 image-to-3D methods on 69 manuscript figures using rendering and volumetric metrics; developed pipeline combining SAM segmentation, Hi3DGen mesh generation, expert refinement in ZBrush, and AI-assisted texturing.
Result: Hi3DGen best balances topological quality with surface detail; framework successfully applied to Gothic and Renaissance manuscripts; models support WebXR, AR overlay, and tactile 3D printing.
Conclusion: Semi-automated framework enables 3D reconstruction of manuscript art with Hi3DGen as optimal starting point, supporting cultural heritage accessibility through XR and tactile applications.
Abstract: This paper presents a semi-automated framework for transforming two-dimensional miniatures from medieval manuscripts into three-dimensional digital models suitable for extended reality (XR), tactile 3Dprinting, and web-based visualization. We evaluate seven image-to-3D methods (TripoSR, SF3D, SPAR3D, TRELLIS, Wonder3D, SAM3D, Hi3DGen) on 69manuscript figures from two collections using rendering-based metrics (Silhouette IoU, LPIPS, CLIPScore) and volumetric measures (Depth Range Ratio, watertight percentage), revealing a trade-off between volumetric expansion and geometric fidelity. Hi3DGen balances topological quality with rich surface detail through its normal bridging approach, making it a good starting point for expert refinement. Our pipeline combines SAM segmentation, Hi3DGen mesh generation, expert refinement in ZBrush, and AI-assisted texturing. Two case studies on Gothic illuminations from the Decretum Gratiani (Vatican Library) and Renaissance miniatures by Giulio Clovio demonstrate applicability across artistic traditions. The resulting models can support WebXR visualization, AR overlay on physical manuscripts, and tactile 3D~prints for visually impaired users.
[100] ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction
Kun Wang, Yupeng Hu, Zhiran Li, Hao Liu, Qianlong Xiang, Liqiang Nie
Main category: cs.CV
TL;DR: ViSAGE: A multi-expert ensemble framework for video saliency prediction that uses adaptive gated experts to refine spatio-temporal features and fuse complementary predictions.
Details
Motivation: To exploit complementary inductive biases for video saliency prediction by creating a framework that can capture complex spatio-temporal saliency cues in videos through specialized experts.Method: Proposes Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework where each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features, with complementary predictions fused at inference.
Result: Ranked first on two out of four evaluation metrics on the NTIRE 2026 Challenge Private Test set, and outperformed most competing solutions on the other two metrics.
Conclusion: ViSAGE demonstrates effectiveness and generalization ability for video saliency prediction by aggregating diverse inductive biases to capture complex spatio-temporal saliency cues.
Abstract: In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.
[101] MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
Xingming Liao, Ning Chen, Muying Shu, Yunpeng Yin, Peijian Zeng, Zhuowei Wang, Nankai Lin, Lianglun Cheng
Main category: cs.CV
TL;DR: MARINER is a comprehensive maritime benchmark for multimodal AI evaluation, featuring 16,629 images with fine-grained vessel categories, adverse environments, and maritime incidents across classification, detection, and VQA tasks.
Details
Motivation: Addresses the lack of dedicated benchmarks for fine-grained visual understanding and high-level reasoning in real-world open-water environments, particularly for maritime applications.Method: Introduces MARINER benchmark built under Entity-Environment-Event (3E) paradigm with 16,629 multi-source maritime images covering 63 vessel categories, diverse adverse environments, and 5 typical maritime incidents.
Result: Extensive evaluations on mainstream MLLMs show even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes, establishing baselines for future research.
Conclusion: MARINER fills the gap for realistic and cognitive-level evaluation in maritime multimodal understanding and promotes robust vision-language models for open-water applications.
Abstract: Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large language models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at https://lxixim.github.io/MARINER.
[102] Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
Mehrdad Fazli, Bowen Wei, Ahmet Sari, Ziwei Zhu
Main category: cs.CV
TL;DR: CAAC framework reduces hallucination in LVLMs by calibrating attention biases through visual-token balancing and confidence-guided adaptive rescaling.
Details
Motivation: Large vision-language models suffer from hallucination issues, confidently describing non-existent objects. Current training-free methods fail in open-ended, long-form generation scenarios where attention biases (spatial perception bias and modality bias) cause visual grounding degradation.Method: Two-step approach: 1) Visual-Token Calibration (VTC) balances attention across image tokens to address spatial perception bias; 2) Adaptive Attention Re-Scaling (AAR) reinforces visual grounding guided by model confidence to counter modality bias that shifts focus from visual to textual inputs over time.
Result: Outperforms baselines on CHAIR, AMBER, and POPE benchmarks, particularly effective in long-form generations, significantly reducing hallucination while maintaining accuracy.
Conclusion: CAAC provides an effective training-free intervention for reducing hallucination in LVLMs by addressing attention biases through confidence-aware calibration, improving visual alignment during generation.
Abstract: Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, and confidently describe objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model’s confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.
[103] WildDet3D: Scaling Promptable 3D Detection in the Wild
Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna
Main category: cs.CV
TL;DR: WildDet3D is a unified geometry-aware monocular 3D object detection system that accepts multiple prompt types (text, point, box) and incorporates depth cues, trained on the largest open 3D detection dataset (WildDet3D-Data) with 1M+ images across 13.5K categories.
Details
Motivation: Current monocular 3D object detection methods have limitations: they're designed for single prompt types, lack mechanisms to incorporate geometric cues, and are trained on narrow-category datasets that limit open-world generalization. The authors aim to create a practical system for the open world that can generalize beyond closed-set categories, support diverse prompts, and leverage geometric cues.Method: Two main contributions: 1) WildDet3D architecture - a unified geometry-aware system that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. 2) WildDet3D-Data - the largest open 3D detection dataset constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories.
Result: WildDet3D establishes new SOTA across multiple benchmarks: 22.6/24.8 AP3D on WildDet3D-Bench with text/box prompts; 34.2/36.4 AP3D on Omni3D with text/box prompts; 40.3/48.9 ODS on Argoverse 2 and ScanNet in zero-shot evaluation. Incorporating depth cues yields substantial gains (+20.7 AP on average across settings).
Conclusion: The work addresses key bottlenecks in monocular 3D object detection by providing a unified geometry-aware architecture and large-scale open-world dataset, enabling practical open-world 3D understanding with support for multiple prompt modalities and geometric cue integration.
Abstract: Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection–recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
[104] On Semiotic-Grounded Interpretive Evaluation of Generative Art
Ruixiang Jiang, Changwen Chen
Main category: cs.CV
TL;DR: SemJudge is a novel evaluator for generative art that assesses deeper symbolic and indexical meaning beyond surface-level image quality, using computational semiotic theory to model human-art interaction.
Details
Motivation: Current generative art evaluators focus only on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by artists, which is essential for meaningful human-art interaction.Method: Proposes SemJudge evaluator based on Peircean computational semiotic theory, modeling Human-GenArt Interaction as cascaded semiosis. Uses Hierarchical Semiosis Graph (HSG) to reconstruct meaning-making process from prompt to generated artifact, explicitly assessing symbolic and indexical meaning beyond just iconic representation.
Result: Extensive quantitative experiments show SemJudge aligns more closely with human judgments than prior evaluators on interpretation-intensive fine-art benchmarks. User studies demonstrate SemJudge produces deeper, more insightful artistic interpretations.
Conclusion: SemJudge enables generative art to move beyond generating “pretty” images toward expressing complex human experience by addressing the structural blindness of current evaluators to symbolic and indexical meaning.
Abstract: Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human-GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes - iconic, symbolic, and indexical - yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning-making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation-intensive fine-art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of “pretty” images toward a medium capable of expressing complex human experience. Project page: https://github.com/songrise/SemJudge.
[105] 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Makanjuola Ogunleye, Eman Abdelrahman, Ismini Lourentzou
Main category: cs.CV
TL;DR: 3D-VCD: Inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents by contrasting predictions between original and distorted 3D scene graphs.
Details
Motivation: Large multimodal models used as reasoning cores for embodied agents in 3D environments suffer from hallucinations that produce unsafe and ungrounded decisions. Existing hallucination mitigation methods target 2D vision-language settings and don't transfer to embodied 3D reasoning where failures arise from object presence, spatial layout, and geometric grounding.Method: Constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations (category substitutions, coordinate or extent corruption). Contrasts predictions under original and distorted 3D contexts to suppress tokens insensitive to grounded scene evidence.
Result: Evaluated on 3D-POPE and HEAL benchmarks, consistently improves grounded reasoning without any retraining. Establishes inference-time contrastive decoding over structured 3D representations as effective for more reliable embodied intelligence.
Conclusion: 3D-VCD is the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents, providing a practical route to more reliable embodied intelligence through structured 3D representation contrast.
Abstract: Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
[106] SenBen: Sensitive Scene Graphs for Explainable Content Moderation
Fatih Cagatay Akyon, Alptekin Temizel
Main category: cs.CV
TL;DR: A new benchmark (SenBen) for sensitive content detection with scene graphs, plus a distilled VLM model that outperforms commercial APIs on grounded scene understanding while being much faster.
Details
Motivation: Current content moderation systems lack spatial grounding and interpretability - they can't explain what sensitive content was detected, who's involved, or where it occurs. There's a need for more detailed, explainable sensitive content detection.Method: Created SenBen benchmark with 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs. Distilled a frontier VLM into a compact 241M student model using multi-task training with suffix-based object identity, Vocabulary-Aware Recall Loss, and decoupled Query2Label tag head with asymmetric loss.
Result: Achieved +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. Outperformed all evaluated VLMs except Gemini models and all commercial safety APIs on grounded scene graph metrics. Achieved highest object detection and captioning scores across all models, with 7.6× faster inference and 16× less GPU memory.
Conclusion: The SenBen benchmark enables more interpretable and grounded sensitive content detection, and the distilled model demonstrates state-of-the-art performance with significant efficiency gains.
Abstract: Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.
[107] InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
Zhefan Rao, Bin Zou, Haoxuan Che, Xuanhua He, Chong Hou Choi, Yanheng Li, Rui Liu, Qifeng Chen
Main category: cs.CV
TL;DR: InsEdit is an instruction-based video editing model built on HunyuanVideo-1.5 that achieves state-of-the-art results with minimal video editing data using Mutual Context Attention for aligned video pair creation.
Details
Motivation: Instruction-based video editing is natural but data-hungry, and high-quality video editing data is scarce. The authors aim to create a strong video editor without requiring large-scale video editing data.Method: InsEdit combines visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin mid-clip rather than only from the first frame. Built on HunyuanVideo-1.5 backbone.
Result: Achieves state-of-the-art results among open-source methods on video instruction editing benchmarks using only O(100)K video editing data. Also supports image editing without modification due to training recipe including image editing data.
Conclusion: Video generation backbones can become strong video editors without large-scale video editing data through proper architectural design and data pipeline techniques like Mutual Context Attention.
Abstract: Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.
[108] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
Junchao Liao, Zhenghao Zhang, Xiangyu Meng, Litao Li, Ziying Zhang, Siyu Zhu, Long Qin, Weizhi Wang
Main category: cs.CV
TL;DR: Tora3 is a trajectory-guided audio-video generation framework that uses object trajectories as a shared kinematic prior to improve physical coherence and motion-sound alignment in AV generation.
Details
Motivation: Current AV generation methods produce visually unstable object motions and sounds that are only loosely aligned with motion or contact events, lacking explicit motion-aware structure shared between video and audio generation.Method: Uses object trajectories as shared kinematic prior; designs trajectory-aligned motion representation for video, kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and hybrid flow matching scheme that preserves trajectory fidelity while maintaining local coherence.
Result: Extensive experiments show Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
Conclusion: Tora3 demonstrates that using object trajectories as a shared kinematic prior effectively improves physical coherence and motion-sound relations in audio-video generation.
Abstract: Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
[109] EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition
Rishabh Gupta, Shravya R. Nalla
Main category: cs.CV
TL;DR: EfficientSign is a lightweight sign language recognition model using EfficientNet-B0 with attention modules (Squeeze-and-Excitation and spatial attention) that achieves 99.94% accuracy on Indian Sign Language alphabets with 62% fewer parameters than ResNet18.
Details
Motivation: The motivation is to build a sign language recognizer that works on mobile phones, requiring a lightweight yet accurate model for practical deployment.Method: The method uses EfficientNet-B0 as backbone with two attention modules: Squeeze-and-Excitation for channel attention and a spatial attention layer for focusing on hand gestures. The model was tested on 12,637 images of Indian Sign Language alphabets (26 classes) using 5-fold cross-validation.
Result: EfficientSign achieves 99.94% accuracy (±0.05%), matching ResNet18’s 99.97% accuracy but with 62% fewer parameters (4.2M vs 11.2M). Deep features from EfficientNet-B0 fed into classical classifiers also performed well: SVM (99.63%), Logistic Regression (99.03%), KNN (96.33%), all surpassing previous SURF-based methods (92%).
Conclusion: Attention-enhanced learning provides an efficient and deployable solution for sign language recognition without requiring massive models or hand-tuned feature pipelines, making it suitable for mobile deployment.
Abstract: How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model which takes EfficientNet-B0 and focuses on two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves the accuracy of 99.94% (+/-0.05%), which matches the performance of ResNet18’s 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0’s pooling layer) into classical classifiers. SVM achieved the accuracy of 99.63%, Logistic Regression achieved the accuracy of 99.03% and KNN achieved accuracy of 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that attention-enhanced learning model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines anymore.
[110] Off-the-shelf Vision Models Benefit Image Manipulation Localization
Zhengxuan Zhang, Keji Song, Junmin Hu, Ao Luo, Yuezun Li
Main category: cs.CV
TL;DR: ReVi adapter repurposes off-the-shelf vision models for image manipulation localization by disentangling semantic redundancy from manipulation-specific information, enabling scalable IML without full retraining.
Details
Motivation: Bridge the gap between image manipulation localization (IML) and general vision tasks by showing that general semantic priors can benefit IML, rather than treating them as separate research directions.Method: Propose ReVi adapter that repurposes existing general-purpose vision models (image generation, segmentation networks) for IML. Inspired by robust PCA, it disentangles semantic redundancy from manipulation-specific information and selectively enhances the latter. Uses frozen pre-trained models with only adapter fine-tuning.
Result: Experimental results demonstrate superiority over existing methods, showing potential for scalable IML frameworks without extensive model redesign or full retraining.
Conclusion: General semantic priors from vision models can effectively benefit IML through proper adaptation, enabling scalable solutions that leverage existing vision foundation models.
Abstract: Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.
[111] Unified Multimodal Uncertain Inference
Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme, Reno Kriz
Main category: cs.CV
TL;DR: UMUI introduces a unified multimodal uncertain inference task across text, audio, and video, requiring calibrated probability estimates, with CLUE method achieving strong performance with smaller models.
Details
Motivation: Current uncertain inference research is limited to text or single-modality binary entailment, lacking frameworks for fine-grained probabilistic reasoning across multiple modalities like audio and video.Method: Introduces CLUE (Calibrated Latent Uncertainty Estimation) combining self-consistent teacher calibration and distribution-based confidence probing, with human-annotated evaluation sets across audio, visual, and audiovisual settings.
Result: The 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities (text, audio, video).
Conclusion: UMUI provides a unified framework for multimodal uncertain inference, and CLUE enables effective calibrated probability estimation across modalities with efficient model sizes.
Abstract: We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.
[112] RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data
Tamir Shor, George Leifman, Genady Beryozkin
Main category: cs.CV
TL;DR: RS-OVC: First open vocabulary counting model for remote-sensing imagery that can count novel object classes unseen during training using textual/visual conditioning
Details
Motivation: Current RS object-counting methods are limited to closed, pre-defined object classes, requiring costly re-annotation and re-training for novel objects, which hinders application in dynamic real-world monitoring scenarios.Method: Proposes RS-OVC, an open vocabulary counting model that uses textual and/or visual conditioning to count novel object classes without requiring re-training or re-annotation.
Result: The model demonstrates accurate counting of novel object classes that were unseen during training, based solely on textual and/or visual conditioning.
Conclusion: RS-OVC addresses the limitation of closed-set counting in remote sensing by enabling open vocabulary counting, facilitating adaptation to novel objects without costly re-training.
Abstract: Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.
[113] Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup
Vrushank Ahire, Vivek Kurumanghat, Mudasir Ganaie, Lipika Kabiraj
Main category: cs.CV
TL;DR: A deep learning framework for detecting and tracking droplets, ligaments, and blobs in high-speed shadowgraphy of liquid sheet disintegration, with physics-informed temporal relationship modeling for fragmentation events.
Details
Motivation: Analyzing liquid sheet disintegration into droplets and ligaments requires quantifying highly transient, multi-scale dynamics from high-speed images. Conventional tracking methods fail to capture one-to-many fragmentation events essential for spray analysis.Method: Two-stage framework: 1) Faster R-CNN with ResNet-50/FPN backbone detects and classifies droplets/ligaments using synthetic data augmentation; 2) Transformer-augmented MLP classifies inter-frame associations (continuation, fragmentation, non-association) using physics-informed geometric features.
Result: Achieved F1 score up to 0.872 for detection, and 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation event classification despite severe class imbalance. Enables automated reconstruction of fragmentation trees and breakup statistics.
Conclusion: The framework successfully automates analysis of liquid sheet disintegration, capturing fragmentation events and parent-child lineage that conventional methods miss, providing valuable breakup statistics for spray analysis.
Abstract: The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying children droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.
[114] MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang
Main category: cs.CV
TL;DR: MAG-3D is a training-free multi-agent framework for grounded 3D reasoning using off-the-shelf vision-language models, achieving SOTA performance without task-specific training.
Details
Motivation: While VLMs excel at 2D multimodal understanding, grounded reasoning in 3D scenes remains underexplored. Current approaches rely on in-domain tuning or hand-crafted pipelines, limiting flexibility and zero-shot generalization to novel environments.Method: Proposes a multi-agent framework with three specialized agents: 1) Planning agent decomposes tasks and orchestrates reasoning, 2) Grounding agent performs free-form 3D grounding and frame retrieval from scene observations, 3) Coding agent conducts geometric reasoning and verification through executable programs.
Result: Achieves state-of-the-art performance on challenging 3D reasoning benchmarks, demonstrating flexible training-free reasoning across diverse scenes.
Conclusion: MAG-3D enables effective grounded 3D reasoning without task-specific training, offering a flexible framework for zero-shot generalization to novel 3D environments.
Abstract: Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
[115] What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction
Loc-Phat Truong, Meysam Madadi, Sergio Escalera
Main category: cs.CV
TL;DR: Virtual Try-Off (VTOFF) framework using diffusion models to reconstruct canonical garments from draped images, achieving state-of-the-art performance on fashion datasets.
Details
Motivation: While Virtual Try-On (VTON) is well-researched, the inverse problem of Virtual Try-Off (VTOFF) - reconstructing canonical garments from draped images - remains underexplored, creating a need for robust architectural foundations.Method: Adapts diffusion-based strategies from VTON and Latent Diffusion Models, focusing on Dual-UNet architecture with systematic analysis of: (1) Stable Diffusion variants as generation backbone, (2) conditioning strategies (mask designs, image inputs, semantic features), and (3) losses/training strategies (attention-based loss, perceptual objectives, curriculum schedules).
Result: Achieves state-of-the-art performance on VITON-HD and DressCode datasets with 9.5% drop in DISTS metric and competitive performance on LPIPS, FID, KID, and SSIM metrics.
Conclusion: Establishes strong baselines and provides architectural insights for Virtual Try-Off research, revealing trade-offs across different design configurations for garment reconstruction.
Abstract: Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF)-aimed at reconstructing the canonical garment from a draped-on image-remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a drop of 9.5% on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.
[116] Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring
Xinmiao Xiong, Bangya Liu, Hao Wang, Dayou Li, Nuo Chen, Andrew Feng, Mingyu Ding, Suman Banerjee, Yang Zhou, Zhiwen Fan
Main category: cs.CV
TL;DR: LeanGate: Lightweight frame-gating network for monocular SLAM that predicts geometric utility scores to skip redundant frames before expensive GFM processing, achieving 5x speedup while maintaining accuracy.
Details
Motivation: Current GFM-based SLAM systems suffer from computational redundancy by processing dense video streams with expensive geometric decoding for all frames, only to later reject many as non-keyframes through post hoc selection.Method: Proposes LeanGate, a lightweight feed-forward frame-gating network that predicts geometric utility scores to assess a frame’s mapping value before heavy GFM feature extraction and matching stages, acting as a predictive plug-and-play module.
Result: Bypasses over 90% of redundant frames, reduces tracking FLOPs by more than 85%, achieves 5x end-to-end throughput speedup, and maintains tracking and mapping accuracy comparable to dense baselines on standard SLAM benchmarks.
Conclusion: LeanGate effectively addresses computational inefficiency in GFM-based SLAM systems by early rejection of redundant frames, enabling real-time performance while preserving geometric accuracy.
Abstract: Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame’s mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines.
[117] LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
Hao Shao, Letian Wang, Yang Zhou, Yuxuan Hu, Zhuofan Zong, Steven L. Waslander, Wei Zhan, Hongsheng Li
Main category: cs.CV
TL;DR: LMGenDrive: A unified framework combining LLM-based multimodal understanding with generative world models for end-to-end autonomous driving that generates both future driving videos and control signals.
Details
Motivation: Address generalization challenges in autonomous driving for long-tail and open-world scenarios by unifying understanding (via LLMs/VLMs) and imagination (via generative world models), inspired by human intelligence.Method: Combines LLM-based multimodal understanding with generative world models, takes multi-view camera inputs and natural-language instructions, generates future driving videos and control signals, uses progressive three-stage training strategy from vision pretraining to multi-step long-horizon driving.
Result: Significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios.
Conclusion: Unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.
Abstract: Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.
[118] AI Driven Soccer Analysis Using Computer Vision
Adrian Manchado, Tanner Cellio, Jonathan Keane, Yiyang Wang
Main category: cs.CV
TL;DR: Computer vision system for soccer analysis using object detection, segmentation, and homography to track players and extract real-world tactical metrics from video footage.
Details
Motivation: Sport analysis provides crucial data for coaching decisions and team performance, but extracting complex features from game footage requires automated computer vision systems to track players and translate positions to real-world coordinates for tactical insights.Method: Combines object detection models (YOLO/Faster R-CNN) for player identification with SAM2 for segmentation and tracking, plus CNN-based key point detection for field landmarks. Uses homography to transform camera perspective to real-world coordinates, enabling calculation of player metrics like speed, distance, and positioning heatmaps.
Result: System enables extraction of valuable tactical insights including player speed, distance covered, positioning heatmaps, and complex team statistics from standard video footage, providing coaches with previously unavailable performance data.
Conclusion: Computer vision approach successfully transforms video analysis into actionable tactical data through automated player tracking and perspective transformation, offering significant improvements over manual video analysis for sports performance evaluation.
Abstract: Sport analysis is crucial for team performance since it provides actionable data that can inform coaching decisions, improve player performance, and enhance team strategies. To analyze more complex features from game footage, a computer vision model can be used to identify and track key entities from the field. We propose the use of an object detection and tracking system to predict player positioning throughout the game. To translate this to positioning in relation to the field dimensions, we use a point prediction model to identify key points on the field and combine these with known field dimensions to extract actual distances. For the player-identification model, object detection models like YOLO and Faster R-CNN are evaluated on the accuracy of our custom video footage using multiple different evaluation metrics. The goal is to identify the best model for object identification to obtain the most accurate results when paired with SAM2 (Segment Anything Model 2) for segmentation and tracking. For the key point detection model, we use a CNN model to find consistent locations in the soccer field. Through homography, the positions of points and objects in the camera perspective will be transformed to a real-ground perspective. The segmented player masks from SAM2 are transformed from camera perspective to real-world field coordinates through homography, regardless of camera angle or movement. The transformed real-world coordinates can be used to calculate valuable tactical insights including player speed, distance covered, positioning heatmaps, and more complex team statistics, providing coaches and players with actionable performance data previously unavailable from standard video analysis.
[119] LPLCv2: An Expanded Dataset for Fine-Grained License Plate Legibility Classification
Lucas Wojcik, Eduardo A. F. Machoski, Eduil Nascimento, Rayson Laroca, David Menotti
Main category: cs.CV
TL;DR: Expands license plate recognition benchmark with improved annotations and training methods, achieving state-of-the-art performance despite real-world challenges.
Details
Motivation: Real-world ALPR systems struggle with low-quality imaging, compression artifacts, and suboptimal camera setups. Existing benchmarks are too small and contain annotation errors, limiting progress in identifying illegible license plates.Method: Expands original benchmark 3x with additional capture days, revises annotations, adds novel labels (LP-level, vehicle-level, image-level). Introduces EMA-based loss function, refined learning rate scheduler, and novel protocol to address camera contamination between splits.
Result: Baseline model achieves 89.5% F1-score on test set, significantly surpassing previous state-of-the-art. Camera contamination protocol shows small impact on performance.
Conclusion: The expanded benchmark and improved training methods enable better real-world ALPR performance, with publicly available dataset and code for community use.
Abstract: Modern Automatic License Plate Recognition (ALPR) systems achieve outstanding performance in controlled, well-defined scenarios. However, large-scale real-world usage remains challenging due to low-quality imaging devices, compression artifacts, and suboptimal camera installation. Identifying illegible license plates (LPs) has recently become feasible through a dedicated benchmark; however, its impact has been limited by its small size and annotation errors. In this work, we expand the original benchmark to over three times the size with two extra capture days, revise its annotations and introduce novel labels. LP-level annotations include bounding boxes, text, and legibility level, while vehicle-level annotations comprise make, model, type, and color. Image-level annotations feature camera identity, capture conditions (e.g., rain and faulty cameras), acquisition time, and day ID. We present a novel training procedure featuring an Exponential Moving Average-based loss function and a refined learning rate scheduler, addressing common mistakes in testing. These improvements enable a baseline model to achieve an 89.5% F1-score on the test set, considerably surpassing the previous state of the art. We further introduce a novel protocol to explicitly addresses camera contamination between training and evaluation splits, where results show a small impact. Dataset and code are publicly available at https://github.com/lmlwojcik/LPLCv2-Dataset.
[120] SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation
Ming He, Zhixiang Chen, Steve Maddock
Main category: cs.CV
TL;DR: SIC3D: A two-stage controllable image-conditioned text-to-3D generation pipeline using 3D Gaussian Splatting that addresses texture ambiguity and limited controllability in text-to-3D generation.
Details
Motivation: Current text-to-3D generation methods suffer from limited controllability and texture ambiguity due to text modality limitations. The paper aims to address these issues by introducing image conditioning for better control and style transfer.Method: Two-stage pipeline: 1) Text-to-3DGS generation creates 3D object content from text, 2) Style transfer from reference image to 3DGS using novel Variational Stylized Score Distillation (VSSD) loss and scaling regularization to capture global/local patterns and prevent artifacts.
Result: Extensive experiments show SIC3D enhances geometric fidelity and style adherence, outperforming prior approaches in both qualitative and quantitative evaluations.
Conclusion: SIC3D successfully addresses texture ambiguity and controllability issues in text-to-3D generation through image-conditioned style transfer with 3D Gaussian Splatting, demonstrating superior performance over existing methods.
Abstract: Recent progress in text-to-3D object generation enables the synthesis of detailed geometry from text input by leveraging 2D diffusion models and differentiable 3D representations. However, the approaches often suffer from limited controllability and texture ambiguity due to the limitation of the text modality. To address this, we present SIC3D, a controllable image-conditioned text-to-3D generation pipeline with 3D Gaussian Splatting (3DGS). There are two stages in SIC3D. The first stage generates the 3D object content from text with a text-to-3DGS generation model. The second stage transfers style from a reference image to the 3DGS. Within this stylization stage, we introduce a novel Variational Stylized Score Distillation (VSSD) loss to effectively capture both global and local texture patterns while mitigating conflicts between geometry and appearance. A scaling regularization is further applied to prevent the emergence of artifacts and preserve the pattern from the style image. Extensive experiments demonstrate that SIC3D enhances geometric fidelity and style adherence, outperforming prior approaches in both qualitative and quantitative evaluations.
[121] State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition
Bryan Cheng, Austin Jin, Jasper Zhang
Main category: cs.CV
TL;DR: PHONSSM addresses catastrophic scaling in sign language recognition by enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization, and prototypical classification, achieving state-of-the-art results on large vocabulary ASL datasets using only skeleton data.
Details
Motivation: Sign language recognition suffers from catastrophic scaling failure where models work well on small vocabularies but collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages, which are systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across vocabulary.Method: PHONSSM enforces phonological decomposition through: 1) anatomically-grounded graph attention, 2) explicit factorization into orthogonal subspaces corresponding to different phonological parameters, and 3) prototypical classification enabling few-shot transfer. The model uses only skeleton data rather than RGB video input.
Result: On the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% accuracy on WLASL2000, an improvement of +18.4 percentage points over skeleton state-of-the-art, surpassing most RGB methods without video input. Gains are most dramatic in few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines.
Conclusion: The vocabulary scaling bottleneck in sign language recognition is fundamentally a representation learning problem that can be solved through compositional inductive biases mirroring linguistic structure, enabling models to scale to realistic vocabulary sizes.
Abstract: Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages-systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
[122] InstrAct: Towards Action-Centric Understanding in Instructional Videos
Zhuoyi Yang, Jiapeng Yu, Reuben Tan, Boyang Li, Huijuan Xu
Main category: cs.CV
TL;DR: InstrAction: A pretraining framework for instructional videos that addresses static bias by filtering noisy captions, generating action-centric negatives, extracting motion-relevant tokens, and using auxiliary objectives for temporal modeling and cross-modal grounding.
Details
Motivation: Current Video Foundation Models struggle with understanding instructional videos due to noisy web supervision and "static bias" - relying on objects rather than motion cues for action recognition and temporal relation modeling.Method: 1) Data-driven strategy: filters noisy captions and generates action-centric hard negatives for contrastive learning; 2) Action Perceiver: extracts motion-relevant tokens from video encodings; 3) Auxiliary objectives: Dynamic Time Warping alignment for sequential temporal structure and Masked Action Modeling for cross-modal grounding.
Result: Outperforms state-of-the-art Video Foundation Models on the InstrAct Bench for semantic reasoning, procedural logic, and fine-grained retrieval tasks.
Conclusion: InstrAction effectively addresses static bias in instructional video understanding through action-centric pretraining, achieving superior performance on action-centric understanding tasks.
Abstract: Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive “static bias”, where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos’ action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
[123] R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
Zewei Zhou, Jiajun Zou, Jiajia Zhang, Ao Yang, Ruichao He, Haozheng Zhou, Ao Liu, Jiawei Liu, Leilei Jin, Shan Shen, Daying Sun
Main category: cs.CV
TL;DR: R2G is a standardized multi-view circuit-graph benchmark suite for GNNs in physical design tasks, providing consistent representations and evaluation protocols across 30 IP cores.
Details
Motivation: Progress in applying GNNs to physical design tasks like congestion prediction and wirelength estimation is hindered by inconsistent circuit representations and lack of controlled evaluation protocols.Method: Created R2G benchmark suite with five stage-aware circuit-graph views having information parity, spanning synthesis, placement, and routing stages. Includes end-to-end DEF-to-graph pipeline, loaders, unified splits, domain metrics, and reproducible baselines.
Result: Systematic studies show: (1) view choice dominates model choice (Test R² varies by >0.3 across representations for fixed GNN), (2) node-centric views generalize best across placement and routing, (3) decoder-head depth (3-4 layers) is primary accuracy driver enabling near-perfect predictions (R²>0.99).
Conclusion: R2G isolates representation choice as a confound in EDA and graph-ML benchmarks, showing view selection is more critical than model architecture for circuit design tasks.
Abstract: Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to $10^6$ nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R$^2$ varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3–4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R$^2$$>$0.99). Code and datasets are available at https://github.com/ShenShan123/R2G.
[124] Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models
Sumra Khan, Sagar Chhabriya, Aizan Zafar, Sheeraz Arif, Amgad Muneer, Anas Zafar, Shaina Raza, Rizwan Qureshi
Main category: cs.CV
TL;DR: Medical VLM framework that enforces cross-modality agreement through contextual verification to reduce hallucinations and improve reliability in radiology diagnosis.
Details
Motivation: Medical VLMs often produce fluent but weakly grounded conclusions due to over-reliance on dominant modalities, leading to unreliable diagnostic outputs.Method: Augments frozen VLM with structured contextual signals (radiomic statistics, explainability activations, semantic cues) and uses contextual verification to enforce agreement across evidence before generating structured outputs with supporting evidence and uncertainty estimates.
Result: Improves discriminative performance (AUC 0.918 to 0.925), reduces hallucinated keywords (1.14 to 0.25), produces more concise explanations (19.4 to 15.3 words) while maintaining calibrated uncertainty on chest X-ray datasets.
Conclusion: Enforcing multi-evidence agreement improves reliability and trustworthiness in medical multimodal reasoning while preserving underlying model architecture, with modality informativeness significantly influencing reasoning behavior.
Abstract: Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
[125] Gen-n-Val: Agentic Image Data Generation and Validation
Jing-En Huang, I-Sheng Fang, Tzuhsuan Huang, Yu-Lun Liu, Chih-Yu Wang, Jun-Cheng Chen
Main category: cs.CV
TL;DR: Gen-n-Val is an agentic framework using Layer Diffusion, LLMs, and VLLMs to generate high-quality synthetic instance masks and images for object detection/segmentation, addressing data scarcity and quality issues in large-vocabulary benchmarks.
Details
Motivation: Address data scarcity, label noise, and long-tailed category imbalance in computer vision tasks like object detection and instance segmentation, especially for rare categories in large-vocabulary benchmarks like LVIS where current synthetic data generation methods produce low-quality results with multiple objects per mask, inaccurate segmentation, and incorrect labels.Method: Two-agent framework: (1) LD prompt agent (LLM) optimizes prompts for Layer Diffusion to generate high-quality foreground single-object images and segmentation masks; (2) data validation agent (VLLM) filters out low-quality synthetic instance images. System prompts for both agents are optimized using TextGrad.
Result: Reduces invalid synthetic data from 50% to 7%; improves rare class performance by 7.6% on LVIS instance segmentation with Mask R-CNN; improves by 3.6% mAP on rare classes in COCO instance segmentation with YOLOv9c/YOLO11m; shows 7.1% mAP improvement over YOLO-Worldv2-M in open-vocabulary object detection with YOLO11m.
Conclusion: Gen-n-Val effectively addresses synthetic data quality issues through an agentic framework combining diffusion models, LLMs, and VLLMs, significantly improving performance on rare classes and demonstrating scalability in model capacity and dataset size.
Abstract: The data scarcity, label noise, and long-tailed category imbalance remain important and unresolved challenges in many computer vision tasks, such as object detection and instance segmentation, especially on large-vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), a Large Language Model (LLM), and a Vision Large Language Model (VLLM) to produce high-quality and diverse instance masks and images for object detection and instance segmentation. Gen-n-Val consists of two agents: (1) the LD prompt agent, an LLM, optimizes rompts to encourage LD to generate high-quality foreground single-object images and corresponding segmentation masks; and (2) the data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are optimized by TextGrad. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 7.6% on rare classes in LVIS instance segmentation with Mask R-CNN, and by 3.6% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val has scalability in model capacity and dataset size. The code is available at https://github.com/aiiu-lab/Gen-n-Val.
[126] CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation
Sanyam Jain, Pragya Kandari, Manit Singhal, He Zhang, Soo Ye Kim
Main category: cs.CV
TL;DR: CatalogStitch automates generative object compositing for catalog images by handling dimension mismatches and occlusion restoration without manual intervention.
Details
Motivation: Current generative object compositing methods require tedious manual intervention for catalog image generation, including mask adjustments for different product dimensions and restoration of occluded elements after generation.Method: CatalogStitch introduces two model-agnostic techniques: 1) dimension-aware mask computation that automatically adapts target regions for different product dimensions, and 2) occlusion-aware hybrid restoration that preserves occluding elements perfectly. Also creates CatalogStitch-Eval benchmark with 58 examples.
Result: The techniques were evaluated with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), showing consistent improvements across diverse catalog scenarios and reducing manual intervention.
Conclusion: CatalogStitch transforms generative compositing into a practical, human-friendly tool for production catalog workflows by automating tedious corrections and reducing manual intervention.
Abstract: Generative object compositing methods have shown remarkable ability to seamlessly insert objects into scenes. However, when applied to real-world catalog image generation, these methods require tedious manual intervention: users must carefully adjust masks when product dimensions differ, and painstakingly restore occluded elements post-generation. We present CatalogStitch, a set of model-agnostic techniques that automate these corrections, enabling user-friendly content creation. Our dimension-aware mask computation algorithm automatically adapts the target region to accommodate products with different dimensions; users simply provide a product image and background, without manual mask adjustments. Our occlusion-aware hybrid restoration method guarantees pixel-perfect preservation of occluding elements, eliminating post-editing workflows. We additionally introduce CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios, together with supplementary PDF and HTML viewers. We evaluate our techniques with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), demonstrating consistent improvements across diverse catalog scenarios. By reducing manual intervention and automating tedious corrections, our approach transforms generative compositing into a practical, human-friendly tool for production catalog workflows.
[127] DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization
Xiangyu Li, Yujing Sun, Yuhang Zheng, Yuexin Ma, Kwok-Yan Lam
Main category: cs.CV
TL;DR: DefakeQ is a novel quantization framework specifically designed for deepfake detectors that enables real-time deployment on resource-constrained edge devices by preserving subtle forgery artifacts through adaptive bidirectional compression.
Details
Motivation: Existing deepfake detection methods are computationally intensive and parameter-heavy, limiting deployment on edge devices. Standard quantization techniques degrade the subtle forgery artifacts crucial for detection, creating a need for specialized quantization strategies for deepfake detectors.Method: DefakeQ introduces an adaptive bidirectional compression strategy that simultaneously leverages feature correlations and eliminates redundancy. This approach balances model compactness with detection performance by preserving discriminative features essential for deepfake detection.
Result: Extensive experiments across five benchmark datasets and eleven state-of-the-art backbone detectors show DefakeQ consistently outperforms existing quantization and model compression baselines. Real-world deployment on mobile devices demonstrates real-time deepfake detection capability.
Conclusion: DefakeQ successfully addresses the challenge of deploying deepfake detectors on edge devices by providing a specialized quantization framework that preserves detection accuracy while enabling real-time performance, making it practical for mobile applications.
Abstract: Deepfake detection has become a fundamental component of modern media forensics. Despite significant progress in detection accuracy, most existing methods remain computationally intensive and parameter-heavy, limiting their deployment on resource-constrained edge devices that require real-time, on-site inference. This limitation is particularly critical in an era where mobile devices are extensively used for media-centric applications, including online payments, virtual meetings, and social networking. Meanwhile, due to the unique requirement of capturing extremely subtle forgery artifacts for deepfake detection, state-of-the-art quantization techniques usually underperform for such a challenging task. These fine-grained cues are highly sensitive to model compression and can be easily degraded during quantization, leading to noticeable performance drops. This challenge highlights the need for quantization strategies specifically designed to preserve the discriminative features essential for reliable deepfake detection. To address this gap, we propose DefakeQ, the first quantization framework tailored for deepfake detectors, enabling real-time deployment on edge devices. Our approach introduces a novel adaptive bidirectional compression strategy that simultaneously leverages feature correlations and eliminates redundancy, achieving an effective balance between model compactness and detection performance. Extensive experiments across five benchmark datasets and eleven state-of-the-art backbone detectors demonstrate that DeFakeQ consistently surpasses existing quantization and model compression baselines. Furthermore, we deploy DefakeQ on mobile devices in real-world scenarios, demonstrating its capability for real-time deepfake detection and its practical applicability in edge environments.
[128] BIAS: A Biologically Inspired Algorithm for Video Saliency Detection
Zhao-ji Zhang, Ya-tang Li
Main category: cs.CV
TL;DR: BIAS is a biologically-inspired model for fast dynamic visual saliency detection in videos, combining static and motion features with millisecond latency, outperforming deep learning methods on DHF1K dataset and showing strong performance in traffic accident analysis.
Details
Motivation: The paper aims to develop a fast, biologically plausible model for dynamic visual saliency detection that can process continuous video streams with low latency, addressing the need for efficient real-time attention mechanisms in video analysis applications.Method: BIAS builds on the Itti-Koch framework and incorporates a retina-inspired motion detector to extract temporal features. It uses a greedy multi-Gaussian peak-fitting algorithm to identify foci of attention, balancing winner-take-all competition with information maximization.
Result: BIAS achieves millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, especially in videos dominated by bottom-up attention. In traffic accident analysis, it achieves state-of-the-art performance in cause-effect recognition and can anticipate accidents up to 0.72 seconds before manual annotation.
Conclusion: BIAS successfully bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection with strong real-world utility in applications like traffic accident analysis.
Abstract: We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti–Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.
[129] Harnessing Weak Pair Uncertainty for Text-based Person Search
Jintao Sun, Zhedong Zheng, Gangyi Ding
Main category: cs.CV
TL;DR: Uncertainty-aware method for text-based person search that handles weak positive image-text pairs where descriptions come from different camera views.
Details
Motivation: Existing methods focus on strict one-to-one correspondence between visual and textual modalities, ignoring weak positive pairs where text descriptions are annotated from different camera views for the same person.Method: Two-module approach: 1) Uncertainty estimation to obtain confidence scores for positive pairs, 2) Uncertainty regularization to adaptively adjust loss weights based on predicted uncertainty, plus group-wise image-text matching loss.
Result: Achieves mAP improvements of +3.06%, +3.55%, and +6.94% on CUHK-PEDES, RSTPReid, and ICFG-PEDES datasets respectively against competitive methods.
Conclusion: The uncertainty-aware approach effectively leverages weak positive pairs and prevents models from pushing away potentially weak positive candidates, improving text-based person search performance.
Abstract: In this paper, we study the text-based person search, which is to retrieve the person of interest via natural language description. Prevailing methods usually focus on the strict one-to-one correspondence pair matching between the visual and textual modality, such as contrastive learning. However, such a paradigm unintentionally disregards the weak positive image-text pairs, which are of the same person but the text descriptions are annotated from different views (cameras). To take full use of weak positives, we introduce an uncertainty-aware method to explicitly estimate image-text pair uncertainty, and incorporate the uncertainty into the optimization procedure in a smooth manner. Specifically, our method contains two modules: uncertainty estimation and uncertainty regularization. (1) Uncertainty estimation is to obtain the relative confidence on the given positive pairs; (2) Based on the predicted uncertainty, we propose the uncertainty regularization to adaptively adjust loss weight. Additionally, we introduce a group-wise image-text matching loss to further facilitate the representation space among the weak pairs. Compared with existing methods, the proposed method explicitly prevents the model from pushing away potentially weak positive candidates. Extensive experiments on three widely-used datasets, .e.g, CUHK-PEDES, RSTPReid and ICFG-PEDES, verify the mAP improvement of our method against existing competitive methods +3.06%, +3.55% and +6.94%, respectively.
[130] Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
Enyi Shi, Fei Shen, Shuyi Miao, Linxia Zhu, Pengyang Shao, Jinhui Tang, Tat-Seng Chua
Main category: cs.CV
TL;DR: Precise Shield: A two-stage framework that identifies safety neurons in Vision-Language Large Models and constrains parameter updates to improve safety against multilingual/multimodal attacks while preserving generalization.
Details
Motivation: VLLMs face critical security vulnerabilities from multilingual and multimodal composite attacks where harmful images paired with low-resource language texts bypass current defenses, exposing structural blind spots in cross-lingual and cross-modal safety methods.Method: Two-stage framework: 1) Identify safety neurons by contrasting activation patterns between harmful and benign inputs, 2) Constrain parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters.
Result: Substantially improves safety while preserving multilingual and multimodal generalization. Analysis reveals moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities.
Conclusion: Offers a new direction for neuron-level, transfer-based safety enhancement in multimodal models by precisely targeting safety-critical neurons rather than broad parameter updates.
Abstract: In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking with affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
[131] HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
Xinyu Zhang, Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen, Yuhang Chen, Xiaoya Fan, Chan Tsz Ho, Bi Tianyuan, Haoyuan Liang, Ruifeng Su, Zihao Qian, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu
Main category: cs.CV
TL;DR: HM-Bench is the first benchmark for evaluating multimodal LLMs on hyperspectral image understanding, featuring 19K QA pairs across 13 tasks, with dual-modality evaluation using PCA images and textual reports.
Details
Motivation: While MLLMs have advanced in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI) remains underexplored. HSI's high dimensionality and spectral-spatial properties pose unique challenges for models trained primarily on RGB data, creating a gap in remote sensing applications.Method: Introduced HM-Bench with 19,337 question-answer pairs across 13 task categories. Proposed dual-modality evaluation framework: transforms HSI data into PCA-based composite images and structured textual reports since existing MLLMs cannot process raw hyperspectral cubes natively. This enables systematic comparison of different representations.
Result: Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding.
Conclusion: HM-Bench addresses a critical gap in evaluating MLLMs for hyperspectral image understanding, demonstrating current models’ limitations in spectral-spatial reasoning and emphasizing the superiority of visual over textual representations for this modality.
Abstract: While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral image (HSI) remains underexplored, which is a vital modality in remote sensing. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data.To address this gap, we introduce Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of different representation for model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
[132] Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)
Mohsen Yaghoubi Suraki
Main category: cs.CV
TL;DR: Proposes ADRUwAMS, a novel deep learning model combining adaptive dual residual networks with attention mechanisms for brain tumor segmentation, achieving state-of-the-art results on BraTS datasets.
Details
Motivation: Early detection of glioma brain tumors is crucial for effective treatment, requiring automated segmentation. Current methods face challenges due to tumor characteristics like location and size variations, necessitating more accurate segmentation approaches.Method: ADRUwAMS combines adaptive dual residual networks with attention gates and multiscale spatial attention mechanisms. The dual residual architecture captures both high-level semantic and low-level details, while attention mechanisms focus on relevant tumor regions and combine multiscale features.
Result: Achieved dice scores of 0.9229 (whole tumor), 0.8432 (tumor core), and 0.8004 (enhancing tumor) on BraTS 2020 dataset, demonstrating superior segmentation performance.
Conclusion: The proposed ADRUwAMS model effectively addresses brain tumor segmentation challenges by integrating advanced attention mechanisms with residual networks, providing accurate and reliable segmentation for medical applications.
Abstract: Glioma is a harmful brain tumor that requires early detection to ensure better health results. Early detection of this tumor is key for effective treatment and requires an automated segmentation process. However, it is a challenging task to find tumors due to tumor characteristics like location and size. A reliable method to accurately separate tumor zones from healthy tissues is deep learning models, which have shown promising results over the last few years. In this research, an Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) is introduced. This model is an innovative combination of adaptive dual residual networks, attention mechanisms, and multiscale spatial attention. The dual adaptive residual network architecture captures high-level semantic and intricate low-level details from brain images, ensuring precise segmentation of different tumor parts, types, and hard regions. The attention gates use gating and input signals to compute attention coefficients for the input features, and multiscale spatial attention generates scaled attention maps and combines these features to hold the most significant information about the brain tumor. We trained the model for 200 epochs using the ReLU activation function on BraTS 2020 and BraTS 2019 datasets. These improvements resulted in high accuracy for tumor detection and segmentation on BraTS 2020, achieving dice scores of 0.9229 for the whole tumor, 0.8432 for the tumor core, and 0.8004 for the enhancing tumor.
[133] GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen, Naoto Yokoya
Main category: cs.CV
TL;DR: GeoMMBench is a comprehensive multimodal QA benchmark for geoscience/remote sensing, and GeoMMAgent is a multi-agent framework that outperforms standalone LLMs by integrating retrieval, perception, and reasoning tools.
Details
Motivation: Multimodal LLMs in geoscience/remote sensing face challenges: wide disciplinary knowledge, heterogeneous sensor modalities, and fragmented tasks. Existing benchmarks are limited, and current models lack domain expertise for expert-level geospatial interpretation.Method: 1) Created GeoMMBench benchmark covering diverse RS disciplines, sensors, and tasks for rigorous evaluation. 2) Evaluated 36 open-source and proprietary LLMs. 3) Developed GeoMMAgent multi-agent framework integrating retrieval, perception, and reasoning through domain-specific RS models and tools.
Result: Evaluation revealed systematic deficiencies in domain knowledge, perceptual grounding, and reasoning. GeoMMAgent significantly outperformed standalone LLMs, demonstrating the importance of tool-augmented agents for complex geoscience challenges.
Conclusion: Tool-augmented multi-agent frameworks are essential for tackling complex geoscience/RS problems, as standalone LLMs lack necessary domain expertise. GeoMMBench enables broader evaluation, and GeoMMAgent provides a practical solution for expert-level geospatial interpretation.
Abstract: Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning–capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.
[134] Fast Model-guided Instance-wise Adaptation Framework for Real-world Pansharpening with Fidelity Constraints
Zhiqi Yang, Jin-Liang Xiao, Shan Yin, Liang-Jian Deng, Gemine Vivone
Main category: cs.CV
TL;DR: FMGPan is a fast, generalizable model-guided framework for pansharpening that achieves cross-sensor generality and rapid training-inference by leveraging pretrained models with lightweight adaptive networks and novel physical fidelity constraints.
Details
Motivation: Existing DL-based pansharpening methods require high training costs, large datasets, and suffer from poor generalization when test distribution differs from training. Zero-shot methods offer better generalization but have limited fusion quality, high computational overhead, and slow convergence.Method: Proposes FMG-Pan, a model-guided instance-wise adaptation framework that uses a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints, including a novel physical fidelity term for spatial detail preservation.
Result: Achieves state-of-the-art performance on real-world datasets under intra- and cross-sensor settings. On WorldView-3 dataset, completes training and inference for 512x512x8 image within 3 seconds on RTX 3090 GPU, significantly faster than existing zero-shot methods.
Conclusion: FMGPan provides an efficient solution for real-world pansharpening with strong generalization capabilities and practical deployment suitability due to its fast training-inference speed.
Abstract: Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images while preserving both spectral and spatial information. Although deep learning (DL)-based pansharpening methods achieve impressive performance, they require high training cost and large datasets, and often degrade when the test distribution differs from training, limiting generalization. Recent zero-shot methods, trained on a single PAN/LRMS pair, offer strong generalization but suffer from limited fusion quality, high computational overhead, and slow convergence. To address these issues, we propose FMG-Pan, a fast and generalizable model-guided instance-wise adaptation framework for real-world pansharpening, achieving both cross-sensor generality and rapid training-inference. The framework leverages a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints. We further design a novel physical fidelity term to enhance spatial detail preservation. Extensive experiments on real-world datasets under both intra- and cross-sensor settings demonstrate state-of-the-art performance. On the WorldView-3 dataset, FMG-Pan completes training and inference for a 512x512x8 image within 3 seconds on an RTX 3090 GPU, significantly faster than existing zero-shot methods, making it suitable for practical deployment.
[135] Large-Scale Universal Defect Generation: Foundation Models and Datasets
Yuanting Fan, Jun Liu, Bin-Bin Gao, Xiaochen Chen, Yuhuan Lin, Zhewei Dai, Jiawei Zhan, Chengjie Wang
Main category: cs.CV
TL;DR: UniDG is a universal defect generation foundation model that creates realistic defects across diverse domains without per-category fine-tuning, using a large-scale dataset and multimodal attention mechanisms.
Details
Motivation: Existing defect generation methods suffer from limited generalization, degraded realism, and category consistency issues due to reliance on few-shot learning and lack of large-scale paired defect editing data, especially with substantial variations in defect scale and morphology.Method: Introduces UDG dataset (300K normal-abnormal-mask-caption quadruplets) and UniDG model with Defect-Context Editing via adaptive cropping and structured diptych input, MM-DiT multimodal attention for fusing reference/target conditions, and two-stage training (Diversity-SFT followed by Consistency-RFT).
Result: Outperforms prior few-shot anomaly generation and image insertion/editing baselines on MVTec-AD and VisA datasets in synthesis quality and downstream single-/multi-class anomaly detection/localization.
Conclusion: UniDG provides a universal foundation model for defect generation that achieves better generalization, realism, and consistency across diverse domains without requiring per-category fine-tuning.
Abstract: Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.
[136] MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
Yibo Zhao, Yigong Zhang, Jin Xie
Main category: cs.CV
TL;DR: MV3DIS is a zero-shot 3D instance segmentation framework that uses 3D priors and multi-view consistency to improve segmentation quality by matching 2D masks across views using coarse 3D segments as reference.
Details
Motivation: Existing zero-shot 3D instance segmentation methods rely on independent frame processing and 2D metrics from SAM, ignoring multi-view correlations and 3D priors, leading to inconsistent masks and fragmented 3D segmentation.Method: Coarse-to-fine framework with 3D-guided mask matching using coarse 3D segments as reference, multi-view mask consolidation via 3D coverage distributions, and depth consistency weighting to suppress occlusion ambiguities.
Result: Superior performance on ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets compared to previous methods.
Conclusion: Explicit incorporation of 3D priors and multi-view consistency significantly improves zero-shot 3D instance segmentation quality and robustness.
Abstract: Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods
[137] TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
Ao Li, Yonggen Ling, Yiyang Lin, Yuji Wang, Yong Deng, Yansong Tang
Main category: cs.CV
TL;DR: TAIHRI is a Vision-Language Model for close-range human-robot interaction that understands motion commands and localizes task-relevant 3D human keypoints via 2D keypoint reasoning and token prediction.
Details
Motivation: Current 3D human keypoints estimation focuses on whole-body reconstruction relative to root joints, but robots in HRI need precise metric-scale localization of task-relevant body parts in egocentric camera coordinates.Method: Quantizes 3D keypoints into finite interaction space, uses VLM to understand motion commands, directs attention to task-relevant keypoints, and localizes 3D coordinates via 2D keypoint reasoning with next token prediction.
Result: Superior estimation accuracy for task-critical body parts on egocentric interaction benchmarks, with seamless adaptation to downstream tasks like natural language control and human mesh recovery.
Conclusion: TAIHRI opens new research avenues in embodied human-robot interaction by enabling precise 3D localization of task-relevant body parts through vision-language understanding.
Abstract: Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts under the egocentric camera 3D coordinate. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users’ motion commands and directing the robot’s attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localize the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapt to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: https://github.com/Tencent/TAIHRI.
[138] Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
Yu Shi, Yu Liu, Zhong-Cheng Wu, Juan Cheng, Huafeng Li, Xun Chen
Main category: cs.CV
TL;DR: Proposes a degradation-aware diffusion framework for image fusion that handles complex degradations like noise, blur, and low resolution through implicit denoising and joint observation model correction.
Details
Motivation: Real-world image fusion faces complex degradations (noise, blur, low resolution) that limit existing methods. End-to-end neural networks lack interpretability, while diffusion models are designed for single-domain targets and can't directly handle fusion's multi-source complementary information without natural fused data.Method: An efficient degradation-aware diffusion framework that performs implicit denoising by directly regressing the fused image instead of predicting noise. Includes a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure reconstruction accuracy.
Result: Experiments on diverse fusion tasks and degradation configurations demonstrate superiority under complex degradation scenarios compared to existing methods.
Conclusion: The proposed framework effectively addresses image fusion under arbitrary degradation scenarios by combining diffusion’s generative priors with degradation-aware adaptation and joint constraints, overcoming limitations of both neural networks and conventional diffusion models.
Abstract: Complex degradations like noise, blur, and low resolution are typical challenges in real world image fusion tasks, limiting the performance and practicality of existing methods. End to end neural network based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.
[139] Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
Zengyi Yang, Yu Liu, Juan Cheng, Zhiqin Zhu, Yafei Zhang, Huafeng Li
Main category: cs.CV
TL;DR: CLDyN is a closed-loop dynamic network for infrared-visible image fusion that adapts to multiple downstream tasks without retraining through semantic compensation mechanisms.
Details
Motivation: Existing infrared-visible image fusion methods struggle with simultaneously adapting to multiple downstream tasks, lacking the ability to customize fusion according to specific task requirements.Method: Proposes a Closed-Loop Dynamic Network (CLDyN) with a Requirement-driven Semantic Compensation (RSC) module that uses a Basis Vector Bank (BVB) and Architecture-Adaptive Semantic Injection (A2SI) block to customize network architecture based on task requirements through explicit feedback from downstream tasks.
Result: Experiments on M3FD, FMB, and VT5000 datasets show CLDyN maintains high fusion quality while exhibiting strong multi-task adaptability across different downstream tasks.
Conclusion: CLDyN successfully addresses the multi-task adaptation problem in infrared-visible image fusion through a closed-loop optimization mechanism that enables task-customized fusion without retraining.
Abstract: Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at https://github.com/YR0211/CLDyN.
[140] M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model
Yihang Liu, Ying Wen, Jiaxiong Yang, Longzhen Yang, Lianghua He, Heng Tao Shen
Main category: cs.CV
TL;DR: M-IDoL is a self-supervised medical foundation model that uses Information Decomposition to learn modality-specific and diverse representations by separating multimodal features into Mixture-of-Experts subspaces and performing fine-grained semantic discrimination within each modality.
Details
Motivation: Existing medical foundation models suffer from information ambiguity that blends multimodal representations in a single embedding space, leading to degradation of modality specificity and diversity. The authors aim to address this by developing a model that can learn universal representations while preserving modality-specific characteristics.Method: Proposes M-IDoL with two key objectives: 1) maximize inter-modality entropy by dispersing multimodal representations into separable Mixture-of-Experts subspaces to achieve representation specificity across modalities, and 2) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality.
Result: Pre-trained on 1.15 million medical images, M-IDoL delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (X-ray, fundus, OCT, dermoscopy, pathology). It learns modality-specific and diverse representations with clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.
Conclusion: M-IDoL successfully addresses information ambiguity in medical foundation models by decomposing multimodal representations into modality-specific subspaces while maintaining intra-modality diversity, leading to improved generalization performance across diverse clinical tasks.
Abstract: Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised \underline{\textit{M}}FM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.
[141] MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video
Haoyu Zhu, Yi Zhang, Lei Yao, Lap-pui Chau, Yi Wang
Main category: cs.CV
TL;DR: MASS reconstructs high-fidelity 3D hands from egocentric monocular videos using deformable 2D Gaussian surfels aligned with parametric hand meshes, achieving superior reconstruction with efficient computation.
Details
Motivation: Existing methods for 3D hand reconstruction from egocentric monocular videos have limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands, while also being computationally expensive for real-time applications.Method: Proposes Mesh-inellipse Aligned deformable Surfel Splatting (MASS) using deformable 2D Gaussian surfels. Introduces mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion, Gaussian Surfel Deformation for modeling hand deformations, a two-stage training strategy, and binding loss for optimization robustness.
Result: Extensive experiments on ARCTIC, Hand Appearance, and Interhand2.6M datasets demonstrate superior reconstruction performance compared to state-of-the-art methods.
Conclusion: MASS effectively addresses challenges in 3D hand reconstruction from egocentric monocular videos by leveraging deformable 2D Gaussian surfels, achieving high-fidelity results with computational efficiency.
Abstract: Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. We introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion that initiates high-resolution 2D Gaussian surfels from coarse parametric hand meshes, providing surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve the optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.
[142] TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches
Langzhe Gu, Hung-Jui Huang, Mohamad Qadri, Michael Kaess, Wenzhen Yuan
Main category: cs.CV
TL;DR: TouchAnything uses pretrained 2D vision diffusion models as geometric priors for 3D reconstruction from sparse tactile measurements, enabling accurate shape estimation from few touches even for unseen objects.
Details
Motivation: Vision-based shape estimation fails under occlusions or poor lighting, while tactile sensing provides direct geometric information but sparse touches alone are insufficient for 3D reconstruction. The paper aims to leverage visual priors to overcome tactile reconstruction limitations.Method: Uses pretrained large-scale 2D vision diffusion models as semantic/geometric priors. Formulates reconstruction as optimization problem enforcing tactile consistency while guiding solutions toward shapes consistent with diffusion prior, given sparse contact constraints and coarse class-level object description.
Result: Method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances.
Conclusion: TouchAnything demonstrates successful transfer of geometric knowledge from visual diffusion models to tactile domain, enabling robust 3D reconstruction from sparse tactile measurements.
Abstract: Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is https://grange007.github.io/touchanything .
[143] Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
Harshith Kethavath, Weiming Hu
Main category: cs.CV
TL;DR: Vision-language models like CLIPSeg fail to adapt to satellite imagery via prompting alone; fine-tuning with minimal labeled data outperforms all prompt engineering attempts for cloud segmentation tasks.
Details
Motivation: The paper investigates whether prompting can effectively adapt vision-language models to specialized domains like remote sensing, where visual and linguistic distributions differ significantly from natural images used in pretraining.Method: Evaluated CLIPSeg on CloudSEN12+ cloud segmentation benchmark using 60 prompt variants (simple labels, domain terminology, appearance descriptors, contextual cues). Compared prompting results with supervised fine-tuning using varying amounts of labeled data (0.1% to full dataset).
Result: All prompt variants underperformed zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. Fine-tuning with just 0.1% labeled data (~8 images) surpassed zero-shot performance, and 5-10% data recovered ~85% of maximum achievable mIoU. Full fine-tuning consistently outperformed low-rank adaptation.
Conclusion: Prompting cannot bridge the gap between CLIP’s natural image representations and specialized satellite imagery. Labeled data is essential for adapting vision-language models to domain-specific visual tasks, even in small quantities.
Abstract: Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP’s natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.
[144] Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation
Gadi Hemanth Kumar, Athira Nambiar, Pankaj Bodani
Main category: cs.CV
TL;DR: Proposes DCAU-AL, a dynamic class-aware uncertainty-based active learning method for satellite image segmentation that adaptively selects samples based on real-time class performance gaps to address class imbalance.
Details
Motivation: Active learning can reduce annotation costs for satellite imagery segmentation, but standard methods lack adaptability to target underperforming or rare classes, leading to bias and poor performance on imbalanced datasets.Method: DCAU-AL continuously tracks per-class segmentation performance and dynamically adjusts sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process.
Result: Extensive experiments on OpenEarth land cover dataset show DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.
Conclusion: The proposed adaptive acquisition function effectively addresses class imbalance in active learning for satellite image segmentation, enabling better performance with fewer labeled samples.
Abstract: Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large-scale, high-resolution satellite datasets is costly and time consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by intelligently selecting the most informative samples for annotation with the help of Human-in-the-loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large-scale or resource-constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, leading to bias in the system. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active learning (DCAU-AL) that prioritizes sample selection based on real-time class-wise performance gaps, thereby overcoming class-imbalance issue. The proposed DCAU-AL mechanism continuously tracks the performance of the segmentation per class and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.
[145] How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
Shengji Jin, Yuanhao Zou, Victor Zhu, Zhengping Ji, Chen Chen
Main category: cs.CV
TL;DR: Controlled study comparing three video temporal grounding output paradigms (Text Numeral, Temporal Token, Continuous Temporal) across identical compact VLMs, showing continuous distribution achieves best efficiency-accuracy trade-off.
Details
Motivation: Existing VTG methods couple output paradigms with different backbones/datasets/training, making it hard to isolate output design impact. Need systematic investigation of output formulation vs. efficiency for edge deployment.Method: Empirical comparison of three VTG output paradigms (Text Numeral Generation, Temporal Token Generation, Continuous Temporal Decoding) across identical compact VLMs (SmolVLM2, FastVLM, Molmo2) using consistent datasets and LoRA fine-tuning protocols.
Result: Output formulation significantly affects both grounding accuracy and computational cost independent of model scale. Continuous distribution paradigm achieves most favorable efficiency-accuracy trade-off on Pareto frontier with robust localization and minimal latency overhead.
Conclusion: Provides objective empirical guidelines for designing efficient, deployment-ready VTG systems, showing continuous temporal decoding is optimal for edge deployment scenarios.
Abstract: While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
[146] ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
Shifeng Liu, Zhengye Zhang, Sirui Zhao, Xinglong Mao, Zhehan Kan, Zhixiang Wei, Shiwei Wu, Chaoyou Fu, Tong Xu, Enhong Chen
Main category: cs.CV
TL;DR: ActFER: An agentic framework for facial expression recognition that actively acquires visual evidence through face detection, alignment, and selective zooming, then reasons using multimodal chain-of-thought, trained with a novel reinforcement learning algorithm (UC-GRPO).
Details
Motivation: Existing MLLM-based FER methods are passive - they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence without active facial perception capabilities.Method: ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through visual Chain-of-Thought. Uses UC-GRPO reinforcement learning with AU-grounded multi-level rewards, query-conditional contrastive utility estimation, and emotion-aware EMA calibration.
Result: ActFER consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy in comprehensive experiments.
Conclusion: The proposed agentic framework enables active visual evidence acquisition followed by multimodal reasoning, moving FER beyond passive label prediction toward active perception and reasoning-based affect understanding.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
[147] PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng
Main category: cs.CV
TL;DR: PinpointQA is a new benchmark for evaluating small object-centric spatial understanding in indoor videos, featuring four progressively challenging tasks for multimodal LLMs.
Details
Motivation: Existing benchmarks lack direct evaluation of whether models can localize target objects in videos with sufficient precision for practical applications like object search and assistive tasks.Method: Created dataset from ScanNet++ and ScanNet200 with 1,024 scenes and 10,094 QA pairs, organized into four tasks: Target Presence Verification, Nearest Reference Identification, Fine-Grained Spatial Description, and Structured Spatial Prediction.
Result: Experiments show consistent capability gaps in MLLMs along the progressive chain, with SSP being particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on harder tasks.
Conclusion: PinpointQA serves as both a diagnostic benchmark for evaluating spatial understanding in MLLMs and an effective training dataset for improving these capabilities.
Abstract: Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.
[148] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou
Main category: cs.CV
TL;DR: Matrix-Game 3.0 is a memory-augmented interactive world model for 720p real-time longform video generation, achieving 40 FPS with 5B model while maintaining minute-long consistency.
Details
Motivation: Existing diffusion models struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting real-world applicability.Method: Three systematic improvements: 1) Industrial-scale infinite data engine with synthetic, game-collected, and real-world augmented quadruplet data; 2) Training framework for long-horizon consistency via prediction residual modeling and camera-aware memory retrieval; 3) Multi-segment autoregressive distillation with DMD, quantization, and VAE pruning for real-time inference.
Result: Achieves up to 40 FPS real-time generation at 720p resolution with 5B model, maintaining stable memory consistency over minute-long sequences. Scaling to 2x14B model improves quality, dynamics, and generalization.
Conclusion: Provides a practical pathway toward industrial-scale deployable world models by solving the trade-off between long-term consistency and real-time high-resolution generation.
Abstract: With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
[149] StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding
Junxi Wang, Te Sun, Jiayi Zhu, Junxian Li, Haowen Xu, Zichen Wen, Xuming Hu, Zhiyu Li, Linfeng Zhang
Main category: cs.CV
TL;DR: StreamMeCo: Efficient memory compression framework for vision agents that reduces memory overhead by 70% while improving accuracy and speeding up retrieval.
Details
Motivation: Vision agent memory for streaming video understanding incurs substantial memory overhead, leading to high storage and computation costs. Current approaches need efficient compression methods to maintain performance while reducing memory footprint.Method: Proposes StreamMeCo framework with edge-free minmax sampling for isolated nodes and edge-aware weight pruning for connected nodes to evict redundant memory nodes. Also introduces time-decay memory retrieval mechanism to mitigate performance degradation from compression.
Result: Achieves 70% memory graph compression with 1.87x speedup in memory retrieval and average accuracy improvement of 1.0% on three benchmark datasets (M3-Bench-robot, M3-Bench-web, Video-MME-Long).
Conclusion: StreamMeCo effectively compresses vision agent memory while maintaining or improving accuracy, offering practical benefits for streaming video understanding applications with reduced computational costs.
Abstract: Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87* speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.
[150] Robust by Design: A Continuous Monitoring and Data Integration Framework for Medical AI
Mohammad Daouk, Jan Ulrich Becker, Neeraja Kambham, Anthony Chang, Chandra Mohan, Hien Van Nguyen
Main category: cs.CV
TL;DR: An autonomous continuous monitoring framework for medical AI that prevents performance degradation from data drift in dynamic clinical environments using uncertainty gating and incremental retraining with performance safeguards.
Details
Motivation: Medical AI models suffer performance drops in dynamic clinical environments due to data drift, requiring robust adaptation methods to maintain performance over time without catastrophic forgetting.Method: Three-stage method using multi-metric feature analysis (Euclidean, cosine, Mahalanobis distances) and Monte Carlo dropout-based uncertainty gating to select statistically similar new data with low predictive entropy, followed by incremental retraining with strict performance safeguards (no metric degradation >5%).
Result: The framework prevented performance degradation on glomerular pathology image classification, maintaining AUC (~0.92) and accuracy (~89%) when adding new images to a ResNet18 ensemble on multi-center data.
Conclusion: The approach successfully addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI through autonomous continuous monitoring and selective data integration.
Abstract: Adaptive medical AI models often face performance drops in dynamic clinical environments due to data drift. We propose an autonomous continuous monitoring and data integration framework that maintains robust performance over time. Focusing on glomerular pathology image classification (proliferative vs. non-proliferative lupus nephritis), our three-stage method uses multi-metric feature analysis and Monte Carlo dropout-based uncertainty gating to decide when to retrain on new data. Only images statistically similar to the training distribution (via Euclidean, cosine, Mahalanobis metrics) and with low predictive entropy are integrated. The model is then incrementally retrained with these images under strict performance safeguards (no metric degradation >5%). In experiments with a ResNet18 ensemble on a multi-center dataset, the framework prevents performance degradation: new images were added without significant change in AUC (~0.92) or accuracy (~89%). This approach addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI.
[151] Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion
Seungjin Jung, Yonghyun Jeong, Minha Kim, Jimin Min, Youngjoon Yoo, Jongwon Choi
Main category: cs.CV
TL;DR: PCGAN improves Face Anti-Spoofing by generating diverse spoof artifacts through latent disentanglement, enhancing domain generalization and partial attack detection.
Details
Motivation: Face Anti-Spoofing (FAS) algorithms face limitations due to insufficient dataset diversity, which hampers their ability to generalize across unseen visual domains and spoofing methods, compromising facial recognition security.Method: Proposes Pattern Conversion Generative Adversarial Network (PCGAN) that disentangles latent vectors for spoof artifacts and facial features to generate diverse artifact images. Incorporates patch-based learning and multi-task learning to address partial attacks and overfitting to facial features.
Result: Extensive experiments validate PCGAN’s effectiveness in domain generalization and detecting partial attacks, showing substantial improvement in facial recognition security.
Conclusion: PCGAN successfully enhances FAS domain generalization through artifact generation and specialized learning techniques, significantly improving security against diverse spoofing attacks.
Abstract: Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting issues to facial features. Our extensive experiments validate PCGAN’s effectiveness in domain generalization and detecting partial attacks, giving a substantial improvement in facial recognition security.
[152] BlendFusion – Scalable Synthetic Data Generation for Diffusion Model Training
Thejas Venkatesh, Suguna Varshini Velury
Main category: cs.CV
TL;DR: BlendFusion is a scalable framework for generating high-quality synthetic image-caption pairs from 3D scenes using path tracing, addressing visual inconsistencies and model collapse issues in diffusion-based synthetic data generation.
Details
Motivation: Diffusion models for synthetic data generation often produce visually inconsistent images and can lead to Model Autophagy Disorder (MAD) when models are trained on their own synthetic data, creating a feedback loop that causes model collapse.Method: Proposes BlendFusion framework using path tracing from 3D scenes with object-centric camera placement, robust filtering mechanisms, and automatic captioning to generate high-quality image-caption pairs. Creates FineBLEND dataset from diverse 3D scenes.
Result: Empirical analysis shows FineBLEND’s quality compared to existing image-caption datasets. Demonstrates effectiveness of object-centric camera placement over object-agnostic approaches. Provides open-source framework for community dataset creation.
Conclusion: BlendFusion offers a scalable solution for high-quality synthetic data generation from 3D scenes, addressing limitations of diffusion-based approaches and providing tools for the community to create their own datasets.
Abstract: With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.
[153] CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection
Jiahua Pang, Ying Li, Dongpu Cao, Jingcai Luo, Yanuo Zheng, Bao Yunfan, Yujie Lei, Rui Yuan, Yuxi Tian, Guojin Yuan, Hongchang Chen, Zhi Zheng, Yongchun Liu
Main category: cs.CV
TL;DR: CAD Dataset: A large-scale benchmark for car-related multi-task visual anomaly detection with 100+ images across 7 vehicle domains and 3 tasks, featuring synthetic data augmentation for few-shot learning.
Details
Motivation: Existing visual anomaly detection methods are task-specific and lack a unified benchmark for multi-task evaluation in car manufacturing quality assessment, creating a gap in standardized evaluation.Method: Created CAD Dataset with over 100 images spanning 7 vehicle domains and 3 tasks, incorporating synthetic data augmentation for few-shot anomaly images, and implemented multi-task baseline models for evaluation.
Result: Multi-task learning promotes task interaction and knowledge transfer but also reveals challenging conflicts between tasks, demonstrating both benefits and complexities of MTL approaches.
Conclusion: The CAD dataset provides a standardized platform to advance car-related multi-task visual anomaly detection research, highlighting both the potential and challenges of MTL in this domain.
Abstract: Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill in this gap, We present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100 images crossing 7 vehicle domains and 3 tasks, providing models a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning(MTL), while combining synthesis data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.
[154] Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection
Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong
Main category: cs.CV
TL;DR: ImageProtector: A user-side method that adds imperceptible perturbations to images to make MLLMs refuse to analyze them, protecting privacy by preventing extraction of sensitive information.
Details
Motivation: Open-weight MLLMs can be misused to extract sensitive information (identities, locations, private details) from personal images at scale, raising critical safety and societal concerns.Method: Proactive user-side protection by embedding carefully crafted, nearly imperceptible perturbations that act as visual prompt injection attacks on MLLMs, inducing refusal responses.
Result: Effective across six MLLMs and four datasets; three countermeasures (Gaussian noise, DiffPure, adversarial training) partially mitigate impact but degrade model accuracy/efficiency.
Conclusion: Highlights promise and limitations of perturbation-based privacy protection for open-weight MLLMs in large-scale automated image analysis scenarios.
Abstract: Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as “I’m sorry, I can’t help with that request.” We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
[155] Skill-Conditioned Visual Geolocation for Vision-Language
Chenjie Yang, Yutian Jiang, Chenyu Wu
Main category: cs.CV
TL;DR: GeoSkill: A training-free framework for image geolocation using evolving Skill-Graphs that enables autonomous self-evolution through reasoning rollouts on web-scale data.
Details
Motivation: Current vision-language models for image geolocation lack structured geographic reasoning and autonomous self-evolution capabilities, relying on implicit parametric memory that can exploit outdated knowledge and generate hallucinations.Method: Proposes GeoSkill with evolving Skill-Graphs: initializes graph from human expert trajectories, uses inference model for direct reasoning, and employs Autonomous Evolution mechanism with larger model to conduct reasoning rollouts on web-scale image-coordinate pairs, then synthesizes/prunes skills based on successful/failed trajectories.
Result: Achieves promising performance in geolocation accuracy and reasoning faithfulness on GeoRC, maintains superior generalization across diverse external datasets, and fosters emergence of novel, verifiable skills enhancing real-world geographic knowledge.
Conclusion: GeoSkill addresses limitations of current VLMs in geolocation by providing structured reasoning and autonomous evolution without parameter updates, significantly improving geographic cognition beyond isolated case studies.
Abstract: Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a “one-off” process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system’s cognition of real-world geographic knowledge beyond isolated case studies.
[156] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Multi-Exposure Image Fusion in Dynamic Scenes (Track 2)
Lishen Qu, Yao Liu, Jie Liang, Hui Zeng, Wen Dai, Guanyi Qin, Ya-nan Guan, Shihao Zhou, Jufeng Yang, Lei Zhang, Radu Timofte, Xiyuan Yuan, Wanjie Sun, Shihang Li, Bo Zhang, Bin Chen, Jiannan Lin, Yuxu Chen, Qinquan Gao, Tong Tong, Song Gao, Jiacong Tang, Tao Hu, Xiaowen Ma, Qingsen Yan, Sunhan Xu, Juan Wang, Xinyu Sun, Lei Qi, He Xu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi
Main category: cs.CV
TL;DR: NTIRE 2026 RAIM challenge on multi-exposure image fusion in dynamic scenes, focusing on HDR imaging with scene motion, illumination variation, and camera jitter.
Details
Motivation: Address practical HDR imaging challenges where exposure bracketing must be fused under dynamic conditions (scene motion, illumination changes, camera jitter) that cause misalignment and ghosting artifacts.Method: Organized a benchmark challenge with 100 training sequences (7 exposure levels) and 100 test sequences (5 exposure levels). Evaluated submissions using leaderboard scores based on PSNR, SSIM, and LPIPS metrics, plus consideration of perceptual quality, efficiency, and reproducibility.
Result: Attracted 114 teams and 987 submissions. Winning methods significantly improved artifact removal and fine detail recovery in multi-exposure fusion.
Conclusion: The challenge successfully advanced the state-of-the-art in HDR image fusion for dynamic scenes, providing a benchmark dataset and code repository for future research.
Abstract: This paper presents NTIRE 2026, the 3rd Restore Any Image Model (RAIM) challenge on multi-exposure image fusion in dynamic scenes. We introduce a benchmark that targets a practical yet difficult HDR imaging setting, where exposure bracketing must be fused under scene motion, illumination variation, and handheld camera jitter. The challenge data contains 100 training sequences with 7 exposure levels and 100 test sequences with 5 exposure levels, reflecting real-world scenarios that frequently cause misalignment and ghosting artefacts. We evaluate submissions with a leaderboard score derived from PSNR, SSIM, and LPIPS, while also considering perceptual quality, efficiency, and reproducibility during the final review. This track attracted 114 participating teams and received 987 submissions. The winning methods significantly improved the ability to remove artifacts from multi-exposure fusion and recover fine details. The dataset and the code of each team can be found at the repository: https://github.com/qulishen/RAIM-HDR.
[157] SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
Xiyang Huang, Jiawei Lin, Keying Wu, Jiaxin Huang, Kailai Yang, Renxiong Wei, Cheng zeng, Jiayi Xiang, Ziyan Kuang, Min Peng, Qianqian Xie, Sophia Ananiadou
Main category: cs.CV
TL;DR: SiMing-Bench is a new benchmark for evaluating MLLMs’ ability to track procedural state updates in clinical skill videos, focusing on how interactions affect correctness throughout workflows.
Details
Motivation: Current video benchmarks for MLLMs overlook the critical capability of tracking how ongoing interactions update procedural state to determine correctness of later actions, especially important for expert procedural judgment in domains like clinical skills.Method: Introduces SiMing-Bench with SiMing-Score dataset: physician-annotated clinical skill videos (CPR, AED operation, bag-mask ventilation) with standardized rubrics and dual-expert labels. Evaluates MLLMs on rubric-grounded process-level judgment of interaction-driven state updates.
Result: MLLMs show consistently weak agreement with physician judgments. Weak performance on intermediate steps persists even when overall procedure-level correlation appears acceptable, indicating coarse global assessment overestimates models’ procedural judgment ability.
Conclusion: The bottleneck is not fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time. SiMing-Bench reveals significant gaps in current MLLMs’ procedural reasoning capabilities.
Abstract: Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models’ procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.
[158] Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
Tsuheng Hsu, Guiyu Liu, Juho Kannala, Janne Heikkilä
Main category: cs.CV
TL;DR: Proposes an object-centric learning approach for 3D Gaussian Splatting using a scene-agnostic object codebook from pre-trained slot attention to enable consistent object representations across views and scenes without mask processing or per-scene training.
Details
Motivation: Current 3D scene understanding methods using 2D masks from visual foundation models have limitations: supervision signals aren't fundamentally object-centric, require additional mask processing, have mask identity conflicts across views, and produce scene-dependent representations that don't generalize across scenes.Method: Uses a pre-trained slot attention-based Global Object Centric Learning (GOCL) module to learn a scene-agnostic object codebook. Couples this codebook with the module’s unsupervised object masks to directly supervise identity features of 3D Gaussians without mask pre/post-processing or explicit multi-view alignment.
Result: Enables object supervision and identification without per-scene fine-tuning or retraining. Introduces unsupervised object-centric learning into 3DGS, yielding more structured representations and better generalization for downstream tasks.
Conclusion: The method successfully integrates object-centric learning into 3D Gaussian Splatting, providing consistent, identity-anchored object representations across views and scenes, with improved generalization for robotic interaction, scene understanding, and cross-scene applications.
Abstract: Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module’s unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.
[159] Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design
Mianjie Zheng, Xinquan Yang, Xuefen Liu, Xuguang Li, Kun Tang, He Meng, Linlin Shen
Main category: cs.CV
TL;DR: TEMAD is a fully automated text-conditioned multi-expert architecture for dental implant abutment design that integrates implant site localization and parameter regression into a unified pipeline.
Details
Motivation: Current dental implant abutment design relies heavily on manual effort, is time-consuming, and existing deep learning approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios.Method: Proposes TEMAD framework with: 1) Implant Site Identification Network (ISIN) for automatic localization, 2) Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module for adaptive mesh calibration using tooth embeddings, and 3) System-Prompted Mixture-of-Experts (SPMoE) mechanism for system-aware regression using implant system prompts.
Result: Extensive experiments on a large-scale abutment design dataset show TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings.
Conclusion: TEMAD validates effectiveness for fully automated dental implant planning, addressing limitations of manual/semi-automated approaches and improving scalability in multi-abutment scenarios.
Abstract: Dental implant abutments serve as the geometric and biomechanical interface between the implant fixture and the prosthetic crown, yet their design relies heavily on manual effort and is time-consuming. Although deep neural networks have been proposed to assist dentists in designing abutments, most existing approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios. To address these limitations, we propose TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design. This framework integrates implant site localization and implant system, compatible abutment parameter regression into a unified pipeline. Specifically, we introduce an Implant Site Identification Network (ISIN) to automatically localize implant sites and provide this information to the subsequent multi-abutment regression network. We further design a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module, which adaptively calibrates mesh representations using tooth embeddings to enable position-specific feature modulation. Additionally, a System-Prompted Mixture-of-Experts (SPMoE) mechanism leverages implant system prompts to guide expert selection, ensuring system-aware regression. Extensive experiments on a large-scale abutment design dataset show that TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings, validating its effectiveness for fully automated dental implant planning.
[160] Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy
Jiaheng Dai, Huanrong Liu, Tailai Zhou, Tongyu Jia, Qin Liu, Yutong Ban, Zeju Li, Yu Gao, Xin Ma, Qingbiao Li
Main category: cs.CV
TL;DR: A benchmark for fine-grained action segmentation in robot-assisted partial nephrectomy using temporal models on surgical video data.
Details
Motivation: Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance, which current methods struggle with.Method: The SIA-RAPN benchmark defines the problem on 50 clinical videos from da Vinci Xi system with 12 frame-level labels, comparing four temporal models (MS-TCN++, AsFormer, TUT, and DiffAct) built on I3D features using multiple evaluation metrics.
Result: DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy across five split configurations on the primary dataset.
Conclusion: The benchmark establishes a standardized evaluation for surgical action segmentation and shows that temporal modeling approaches like DiffAct perform well on this challenging fine-grained recognition task.
Abstract: Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
[161] Learning Vision-Language-Action World Models for Autonomous Driving
Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng, Chao Ma
Main category: cs.CV
TL;DR: VLA-World is a Vision-Language-Action world model that combines predictive imagination with reflective reasoning for autonomous driving, improving foresight and safety through self-generated future frames.
Details
Motivation: Current VLA models lack explicit temporal dynamics and global world consistency, limiting foresight and safety. World models can simulate future scenes but struggle to reason about them. The paper aims to unify predictive imagination with reflective reasoning for better driving performance.Method: VLA-World uses action-derived trajectories to guide next-frame image generation, capturing spatial-temporal cues. It then reasons over these self-generated future frames to refine trajectories. The approach uses nuScenes-GR-20K dataset and three-stage training: pretraining, supervised fine-tuning, and reinforcement learning.
Result: VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks, achieving higher performance and better interpretability.
Conclusion: The proposed VLA-World model effectively unifies predictive imagination with reflective reasoning, demonstrating improved driving foresight and safety through self-generated future reasoning.
Abstract: Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io
[162] Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening
Rimsa Goperma, Rojan Basnet, Liang Zhao
Main category: cs.CV
TL;DR: NPS-Net: A framework for optic disc and optic cup segmentation that guarantees clinical validity through nested polar shape representation, achieving strong zero-shot generalization across datasets.
Details
Motivation: Existing deep learning methods for OD/OC segmentation from fundus photographs don't guarantee clinical validness (star-convexity and nested structure), leading to diagnostic metric corruption, especially under cross-dataset domain shift.Method: Proposes NPS-Net (Nested Polar Shape Network) that formulates OD/OC segmentation as nested radially monotone polar occupancy estimation, which guarantees clinical validity through its output representation.
Result: Achieves strong zero-shot generalization across 7 datasets: maintains 100% anatomical validity on RIM-ONE, improves Cup Dice by 12.8%, reduces vCDR MAE by >56%; on PAPILA achieves Disc Dice of 0.9438 and Disc HD95 of 2.78px (83% reduction).
Conclusion: NPS-Net provides clinically valid OD/OC segmentation with guaranteed anatomical constraints, demonstrating superior generalization performance across datasets compared to existing methods.
Abstract: Valid segmentation of the optic disc (OD) and optic cup (OC) from fundus photographs is essential for glaucoma screening. Unfortunately, existing deep learning methods do not guarantee clinical validness including star-convexity and nested structure of OD and OC, resulting corruption in diagnostic metric, especially under cross-dataset domain shift. To adress this issue, this paper proposed NPS-Net (Nested Polar Shape Network), the first framework that formulates the OD/OC segmentation as nested radially monotone polar occupancy estimation.This output representation can guarantee the aforementioned clinical validness and achieve high accuracy. Evaluated across seven public datasets, NPS-Net shows strong zero-shot generalization. On RIM-ONE, it maintains 100% anatomical validity and improves Cup Dice by 12.8% absolute over the best baseline, reducing vCDR MAE by over 56%. On PAPILA, it achieves Disc Dice of 0.9438 and Disc HD95 of 2.78 px, an 83% reduction over the best competing method.
[163] Visually-Guided Policy Optimization for Multimodal Reasoning
Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu
Main category: cs.CV
TL;DR: VGPO enhances visual attention in vision-language models through visual attention compensation and dual-grained advantage reweighting to combat visual forgetting during reasoning.
Details
Motivation: Current vision-language models suffer from insufficient visual faithfulness due to text-dominated nature and temporal visual forgetting along reasoning steps, leading to sparse attention activation to visual tokens.Method: Proposes Visually-Guided Policy Optimization (VGPO) with: 1) Visual Attention Compensation mechanism using visual similarity to localize/amplify visual cues and progressively elevate visual expectations, and 2) dual-grained advantage reweighting strategy (intra-trajectory level for high visual activation tokens, inter-trajectory level for superior visual accumulation trajectories).
Result: Extensive experiments show VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
Conclusion: VGPO effectively addresses visual faithfulness issues in VLMs by reinforcing visual focus during policy optimization, combating visual forgetting, and improving multimodal reasoning capabilities.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
[164] Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu
Main category: cs.CV
TL;DR: FDSM is a frequency-aware diffusion model for zero-shot skeleton action recognition that addresses spectral bias issues to recover fine-grained motion details through spectral residual modules and adaptive losses.
Details
Motivation: Supervised skeleton-based action recognition methods rely on exhaustive annotation and struggle with generalization to novel actions. Zero-shot approaches face challenges due to diffusion models' spectral bias that oversmooths high-frequency motion dynamics.Method: Proposes Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM) with three components: 1) Semantic-Guided Spectral Residual Module to recover high-frequency details, 2) Timestep-Adaptive Spectral Loss for frequency-aware optimization, and 3) Curriculum-based Semantic Abstraction for progressive learning.
Result: Achieves state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets for zero-shot skeleton action recognition.
Conclusion: FDSM effectively addresses spectral bias in diffusion models for skeleton-text matching, enabling better recovery of fine-grained motion details and superior zero-shot action recognition performance.
Abstract: Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
[165] Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger
Main category: cs.CV
TL;DR: VLMs encode visual information well but fail to use it in final answers due to arbitration issues between visual evidence and prior knowledge, not perception problems.
Details
Motivation: To understand why VLMs sometimes give wrong answers despite seeing visual evidence correctly, and to determine whether the problem is perception (not seeing) or arbitration (not using what they see).Method: Used Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing across 10 VLMs of various sizes, plus full-sequence activation patching for causality analysis and training-free activation steering interventions.
Result: Visual attributes are linearly decodable from early layers (AUC > 0.86) with similar accuracy for both successful and failed samples. The final-layer logit gap predicts grounding outcomes. Full-sequence activation patching (not last-token interventions) alters 60-84% of outputs, with image tokens carrying almost all causal impact. Early-layer activation steering improves visual grounding by up to +3.8%.
Conclusion: VLMs already see well but fail to act on what they see; the problem is arbitration, not perception. Targeted interventions in early layers can help bridge this gap.
Abstract: When a Vision-Language Model (VLM) sees a blue banana and answers “yellow”, is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding–Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit – not the strength of encoding – better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering – both linear and sparse autoencoder-guided – in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.
[166] Cross-Modal Knowledge Distillation from Spatial Transcriptomics to Histology
Arbel Hizmi, Artemii Bakulin, Shai Bagon, Nir Yosef
Main category: cs.CV
TL;DR: Cross-modal distillation from spatial transcriptomics to H&E histology enables histology-only inference of tissue niches without transcriptomic input at test time.
Details
Motivation: Spatial transcriptomics provides rich molecular tissue organization but is costly and scarce, while H&E histology is abundant but less granular. Need to transfer transcriptomics-derived niche structure to histology-only models.Method: Cross-modal distillation using paired spatial transcriptomics and H&E data to train a histology-only model that learns transcriptomics-derived tissue niche structure, enabling inference with histology alone.
Result: Distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines, recovers biologically meaningful neighborhood composition confirmed by cell-type analysis.
Conclusion: Framework successfully transfers transcriptomics knowledge to histology-only models, enabling histology-based tissue niche analysis without costly transcriptomics at inference.
Abstract: Spatial transcriptomics provides a molecularly rich description of tissue organization, enabling unsupervised discovery of tissue niches – spatially coherent regions of distinct cell-type composition and function that are relevant to both biological research and clinical interpretation. However, spatial transcriptomics remains costly and scarce, while H&E histology is abundant but carries a less granular signal. We propose to leverage paired spatial transcriptomics and H&E data to transfer transcriptomics-derived niche structure to a histology-only model via cross-modal distillation. Across multiple tissue types and disease contexts, the distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines trained on identical image features, and recovers biologically meaningful neighborhood composition as confirmed by cell-type analysis. The resulting framework leverages paired spatial transcriptomic and H&E data during training, and can then be applied to held-out tissue regions using histology alone, without any transcriptomic input at inference time.
[167] Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
Yutong Zhang, Jiaxin Chen, Honglin Chen, Kaiqi Zheng, Shengcai Liao, Hanwen Zhong, Weixin Li, Yunhong Wang
Main category: cs.CV
TL;DR: MDPD is a memory-efficient transfer learning method that uses mutual distillation between frozen backbones and learnable side networks during fine-tuning, then discards the side network during inference to accelerate inference by at least 25.2% while maintaining accuracy.
Details
Motivation: Current memory-efficient transfer learning methods use lightweight side networks that reduce trainable parameters during fine-tuning but introduce additional memory and time overhead during inference, contradicting the goal of efficient transfer learning.Method: Proposes Masked Dual Path Distillation (MDPD) with mutual distillation between frozen backbones and learnable side networks during fine-tuning, then discards side network during inference. Also designs novel feature-based knowledge distillation for encoder structures with multiple layers.
Result: Achieves at least 25.2% inference acceleration while keeping parameter and memory consumption comparable to SOTA methods, and improves accuracy across vision/language-only and vision-and-language tasks with distinct backbones.
Conclusion: MDPD provides an effective solution for memory-efficient transfer learning that accelerates inference without sacrificing accuracy, addressing the inference overhead problem in existing methods.
Abstract: Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.
[168] VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Wenyi Xiao, Xinchi Xu, Leilei Gan
Main category: cs.CV
TL;DR: VL-Calibration: A reinforcement learning framework that decouples visual and reasoning confidence in Large Vision Language Models to reduce hallucinations and improve calibration
Details
Motivation: LVLMs frequently exhibit hallucinations and incorrect responses with high certainty, hindering usage in high-stakes domains. Existing confidence calibration methods for text-only LLMs are mismatched for LVLMs because they use single holistic confidence scores that conflate perceptual failures and reasoning errors.Method: Proposes VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning components. Introduces intrinsic visual certainty estimation combining: (1) visual grounding measured by KL-divergence under image perturbations, and (2) internal certainty measured by token entropy. Uses token-level advantage reweighting to focus optimization on tokens based on visual certainty.
Result: Experiments on thirteen benchmarks show VL-Calibration effectively improves calibration while boosting visual reasoning accuracy. The method generalizes to out-of-distribution benchmarks across different model scales and architectures.
Conclusion: VL-Calibration addresses the unique challenges of LVLM confidence calibration by decoupling visual and reasoning uncertainty, leading to improved reliability and reduced hallucinations in multimodal reasoning.
Abstract: Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
[169] Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch
Gabriele Mario Caddeo, Pasquale Marra, Lorenzo Natale
Main category: cs.CV
TL;DR: Multimodal approach combining vision, proprioception, and tactile sensing for metric-scale amodal object reconstruction under severe hand occlusion using physics-guided diffusion models.
Details
Motivation: Prior occlusion-aware 3D generation methods rely only on vision, which is insufficient for accurate reconstruction under severe hand occlusion. Physical interaction signals (proprioception and tactile contact) can provide crucial constraints to reduce ambiguity in occluded regions.Method: Uses multimodal approach combining visible RGB, occluder masks, hand geometry from proprioception, and tactile contact information. Represents objects as pose-aware, camera-aligned SDFs with Structure-VAE latent space. Trains conditional flow-matching diffusion model with physics-based objectives and differentiable decoder-guidance to reduce hand-object interpenetration and align with contact observations.
Result: Experiments in simulation show adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines. Model successfully transfers to real humanoid robot with different end-effector than training.
Conclusion: Multimodal physically grounded approach combining vision, proprioception, and tactile sensing enables metric-scale amodal object reconstruction under severe occlusion, producing physically consistent estimates that integrate naturally into existing reconstruction pipelines.
Abstract: We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand–object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.
[170] VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu
Main category: cs.CV
TL;DR: VisionFoundry generates synthetic VQA data using only task keywords to improve VLMs’ visual perception capabilities like spatial understanding
Details
Motivation: VLMs struggle with visual perception tasks like spatial understanding and viewpoint recognition due to limited supervision in natural image datasets. The authors investigate whether targeted synthetic supervision can address these weaknesses.Method: VisionFoundry is a task-aware synthetic data generation pipeline that takes only task names as input, uses LLMs to generate questions/answers/T2I prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM without needing reference images or human annotation.
Result: Created VisionFoundry-10K dataset with 10k image-question-answer triples across 10 tasks. Models trained on this data achieved +7% improvement on MMVP and +10% on CV-Bench-3D benchmarks while preserving broader capabilities and showing favorable scaling behavior.
Conclusion: Limited task-targeted supervision contributes to VLMs’ visual perception bottleneck, and synthetic supervision is a promising path toward more systematic VLM training.
Abstract: Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
[171] Detecting Diffusion-generated Images via Dynamic Assembly ForestsDetecting Diffusion-generated Images via Dynamic Assembly Forests
Mengxin Fu, Yuezun Li
Main category: cs.CV
TL;DR: DAF is a lightweight deep forest-based detector for diffusion-generated images that achieves competitive performance with far fewer parameters and computational cost than DNN-based methods.
Details
Motivation: Address security concerns about diffusion-generated images by exploring traditional machine learning alternatives to DNN-based detection methods, aiming for lightweight, resource-efficient solutions.Method: Proposes Dynamic Assembly Forest (DAF), built on deep forest paradigm, addressing feature learning and scalable training limitations of traditional ML for diffusion image detection.
Result: DAF achieves competitive performance with significantly fewer parameters, lower computational cost, and no GPU requirement compared to DNN-based methods.
Conclusion: DAF demonstrates strong potential as a practical substitute for heavyweight DNN models in resource-constrained scenarios for diffusion-generated image detection.
Abstract: Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we freshly investigate such alternatives and proposes a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at https://github.com/OUC-VAS/DAF.
[172] FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval
François Gardères, Camille-Sovanneary Gauthier, Jean Ponce, Shizhe Chen
Main category: cs.CV
TL;DR: FIRE-CIR introduces a question-driven visual reasoning approach for fashion composed image retrieval, using automatically generated attribute-focused questions to verify visual evidence in reference and candidate images, outperforming state-of-the-art methods.
Details
Motivation: Current vision-language models for composed image retrieval often fail to reason about what to preserve and what to change from reference images, leading to suboptimal results and lack of interpretability, especially in fine-grained domains like fashion.Method: FIRE-CIR performs question-driven visual reasoning by automatically generating attribute-focused visual questions from modification text and verifying corresponding visual evidence in both reference and candidate images. The model is trained on a large-scale fashion-specific visual question answering dataset containing single- and dual-image analysis questions, and uses explicit reasoning to re-rank retrieval candidates.
Result: Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy while providing interpretable, attribute-level insights into retrieval decisions.
Conclusion: FIRE-CIR successfully brings compositional reasoning and interpretability to fashion composed image retrieval through question-driven visual reasoning, demonstrating improved performance and explainability over embedding-based approaches.
Abstract: Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.
[173] Few-Shot Personalized Age Estimation
Jakub Paplhám, Vojtěch Franc, Artem Moroz
Main category: cs.CV
TL;DR: OpenPAE: First open benchmark for N-shot personalized age estimation with sophisticated baselines showing personalization consistently improves performance
Details
Motivation: Existing age estimation methods treat faces as independent samples, ignoring individual aging rates due to genetics, lifestyle, and health. When reference images of the same person are available, this context can be exploited for personalized estimation, but existing benchmarks are closed-source and limited.Method: Introduces OpenPAE benchmark with strict evaluation protocols. Establishes hierarchy of baselines: arithmetic offset, closed-form Bayesian linear regression, and conditional attentive neural process for N-shot personalized age estimation.
Result: Personalization consistently improves performance, gains are not merely domain adaptation, and nonlinear methods significantly outperform simpler alternatives.
Conclusion: OpenPAE provides first open benchmark for personalized age estimation with released models, code, protocols, and evaluation splits, demonstrating the value of personalization in age estimation tasks.
Abstract: Existing age estimation methods treat each face as an independent sample, learning a global mapping from appearance to age. This ignores a well-documented phenomenon: individuals age at different rates due to genetics, lifestyle, and health, making the mapping from face to age identity-dependent. When reference images of the same person with known ages are available, we can exploit this context to personalize the estimate. The only existing benchmark for this task (NIST FRVT) is closed-source and limited to a single reference image. In this work, we introduce OpenPAE, the first open benchmark for $N$-shot personalized age estimation with strict evaluation protocols. We establish a hierarchy of increasingly sophisticated baselines: from arithmetic offset, through closed-form Bayesian linear regression, to a conditional attentive neural process. Our experiments show that personalization consistently improves performance, that the gains are not merely domain adaptation, and that nonlinear methods significantly outperform simpler alternatives. We release all models, code, protocols, and evaluation splits.
[174] FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition
Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh
Main category: cs.CV
TL;DR: FaceLiVTv2 is an improved lightweight hybrid CNN-Transformer architecture for mobile face recognition that achieves better accuracy-efficiency trade-off through Lite MHLA modules and RepMix blocks.
Details
Motivation: There's a need for lightweight face recognition models that can run efficiently on edge/mobile devices with strict constraints on latency, memory, and energy consumption, while maintaining reliable accuracy. Current hybrid CNN-Transformer architectures struggle to balance performance and computational efficiency.Method: FaceLiVTv2 introduces Lite MHLA (lightweight global token interaction module) that replaces multi-layer attention with multi-head linear token projections and affine rescale transformations to reduce redundancy while preserving representational diversity. It integrates Lite MHLA into unified RepMix blocks that coordinate local and global feature interactions, using global depthwise convolution for adaptive spatial aggregation in the embedding stage.
Result: FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy on benchmarks like LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB.
Conclusion: FaceLiVTv2 offers a practical and deployable solution for real-time face recognition on edge and mobile devices by effectively balancing recognition performance with computational efficiency through its improved hybrid architecture design.
Abstract: Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global–local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at https://github.com/novendrastywn/FaceLiVT.
[175] Strips as Tokens: Artist Mesh Generation with Native UV Segmentation
Rui Xu, Dafei Qin, Kaichun Qiao, Qiujie Dong, Huaijin Pi, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu, Wenping Wang, Taku Komura
Main category: cs.CV
TL;DR: SATO introduces a novel token ordering strategy using triangle strips to generate artist-quality meshes, preserving edge flow and enabling unified triangle/quadrilateral mesh generation.
Details
Motivation: Existing autoregressive transformers for mesh generation use suboptimal token ordering strategies: coordinate-based sorting creates inefficient long sequences, while patch-based heuristics disrupt continuous edge flow and structural regularity needed for professional artist standards.Method: Proposes Strips as Tokens (SATO) framework with token ordering inspired by triangle strips. Constructs sequences as connected chains of faces that explicitly encode UV boundaries, preserving organized edge flow and semantic layout. Uses unified representation enabling same token sequence to decode into either triangle or quadrilateral meshes.
Result: SATO consistently outperforms prior methods in geometric quality, structural coherence, and UV segmentation. Joint training on both triangle and quad data leverages large-scale triangle data for structural priors and high-quality quad data for geometric regularity.
Conclusion: SATO provides an effective token ordering strategy for autoregressive mesh generation that better preserves artist-quality structural properties and enables flexible mesh type generation through unified representation.
Abstract: Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.
[176] Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang
Main category: cs.CV
TL;DR: GREATEN is a stereo matching framework that improves synthetic-to-real generalization by incorporating surface normals as domain-invariant geometric cues to compensate for limitations of image textures in challenging regions.
Details
Motivation: Synthetic-to-real zero-shot generalization in stereo matching remains challenging due to cross-domain shifts and ill-posed ambiguities in image textures, especially in occluded, textureless, repetitive, and non-Lambertian regions.Method: Proposes GREATEN with three key components: 1) Gated Contextual-Geometric Fusion module that adaptively suppresses unreliable image features and fuses with normal-driven geometric features, 2) Specular-Transparent Augmentation strategy for robustness in non-Lambertian regions, 3) Sparse attention designs (SSA, SDMA, SVA) for efficient global feature extraction.
Result: Trained only on synthetic data (SceneFlow), GREATEN-IGEV achieves 30% error reduction on ETH3D, 8.5% on Booster, and 14.1% on KITTI-2015 compared to state-of-the-art methods, while running 19.2% faster and supporting high-resolution inference.
Conclusion: Surface normals serve as effective domain-invariant geometric cues for improving synthetic-to-real generalization in stereo matching, with the proposed framework achieving state-of-the-art performance across multiple benchmarks.
Abstract: Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
[177] Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma
Francesca Fati, Felipe Coutinho, Marika Reinius, Marina Rosanu, Gabriel Funingana, Luigi De Vitis, Gabriella Schivardi, Hannah Clayton, Alice Traversa, Zeyu Gao, Guilherme Penteado, Shangqi Gao, Francesco Pastori, Ramona Woitek, Maria Cristina Ghioni, Giovanni Damiano Aletti, Mercedes Jimenez-Linan, Sarah Burge, Nicoletta Colombo, Evis Sala, Maria Francesca Spadea, Timothy L. Kline, James D. Brenton, Jaime Cardoso, Francesco Multinu, Elena De Momi, Mireia Crispin-Ortuzar, Ines P. Machado
Main category: cs.CV
TL;DR: A multimodal deep learning framework using CT imaging and clinical data to predict chemotherapy response scores in ovarian cancer patients before surgery.
Details
Motivation: To develop a preoperative tool for predicting chemotherapy response in high-grade serous ovarian carcinoma using non-invasive methods, since current histopathological biomarkers are only available postoperatively.Method: A 2.5D multimodal deep learning framework that processes omental CT slices with a pre-trained Vision Transformer encoder and fuses visual representations with clinical variables through intermediate fusion to predict Chemotherapy Response Score.
Result: The multimodal model achieved ROC-AUC of 0.95 (95% accuracy, 80% precision) on internal test cohort (n=41) and ROC-AUC of 0.68 (67% accuracy, 75% precision) on external test set (n=70).
Conclusion: Transformer-based deep learning shows feasibility for preoperative prediction of chemotherapy response in ovarian cancer using routine clinical and CT imaging data as an investigational decision-support tool.
Abstract: Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.
[178] Deep Light Pollution Removal in Night Cityscape Photographs
Hao Wang, Xiaolin Wu, Xi Zhang, Baoqing Sun
Main category: cs.CV
TL;DR: A physically-based model and learning framework for removing light pollution artifacts from nighttime photography, addressing anisotropic light spread and skyglow effects.
Details
Motivation: Nighttime photography suffers from light pollution caused by artificial lighting, which creates skyglow, washes out stars, and produces halos/glow artifacts around light sources. Existing methods focus on dehazing for detail legibility, but light pollution removal aims to restore pristine night appearance by neutralizing ground lighting effects.Method: Proposes a physically-based degradation model that extends previous nighttime dehazing models by incorporating: (1) anisotropic spread of directional light sources, and (2) skyglow caused by invisible surface lights behind skylines. Uses a training strategy leveraging large generative models and synthetic-real coupling to address data scarcity and enhance generalization.
Result: Extensive experiments show the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery compared to prior nighttime restoration methods.
Conclusion: The paper presents an effective approach for light pollution removal in nighttime photography through a physically-based model addressing key aspects of artificial lighting degradation and a robust training strategy for real-world application.
Abstract: Nighttime photography is severely degraded by light pollution induced by pervasive artificial lighting in urban environments. After long-range scattering and spatial diffusion, unwanted artificial light overwhelms natural night luminance, generates skyglow that washes out the view of stars and celestial objects and produces halos and glow artifacts around light sources. Unlike nighttime dehazing, which aims to improve detail legibility through thick air, the objective of light pollution removal is to restore the pristine night appearance by neutralizing the radiative footprint of ground lighting. In this paper we introduce a physically-based degradation model that adds to the previous ones for nighttime dehazing two critical aspects; (i) anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. In addition, we construct a training strategy that leverages large generative model and synthetic-real coupling to compensate for the scarcity of paired real data and enhance generalization. Extensive experiments demonstrate that the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery than prior nighttime restoration methods.
[179] Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery
Sara Ameli
Main category: cs.CV
TL;DR: Benchmark of 5 deep learning architectures (UNet, UNet++, DeepLabV3, Attention UNet, SegFormer) for surgical instrument segmentation in robotic prostatectomy videos using SAR-RARP50 dataset.
Details
Motivation: Accurate surgical instrument segmentation is critical for computer-assisted interventions like tool tracking, workflow analysis, and autonomous decision-making in robotic-assisted surgery.Method: Benchmarked five architectures on SAR-RARP50 dataset with compound loss function (Cross Entropy + Dice loss) to address class imbalance and capture fine boundaries. Models include UNet, UNet++, DeepLabV3, Attention UNet, and SegFormer.
Result: Convolutional models (UNet, Attention UNet) provide strong baseline performance. DeepLabV3 achieves results comparable to SegFormer, showing effectiveness of atrous convolution and multi-scale context aggregation. SegFormer enhances global contextual understanding for better generalization.
Conclusion: Provides comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting trade-offs between convolutional and transformer-based approaches.
Abstract: Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures-UNet, UNet, DeepLabV3, Attention UNet, and SegFormer on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.
[180] Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection
Yicheng Qiu, Keiji Yanai
Main category: cs.CV
TL;DR: A novel SSM-based framework for temporal human action detection in videos using an Efficient Spatial-Temporal Focal Adapter with Temporal Boundary-aware SSM for improved long-term temporal modeling and reduced feature redundancy.
Details
Motivation: Existing CNN and Transformer models for temporal action detection struggle with feature redundancy and degraded global dependency modeling in long video sequences, limiting their real-world scalability. State Space Models (SSMs) offer promising linear long-term modeling capabilities that could address these limitations.Method: Proposes a novel framework with Efficient Spatial-Temporal Focal (ESTF) Adapter integrated into pre-trained layers. Combines Temporal Boundary-aware SSM (TB-SSM) for temporal feature modeling with efficient spatial feature processing to enhance long-term temporal reasoning.
Result: Extensive experiments across multiple benchmarks show significant improvements in both localization performance and robustness compared to previous SSM-based and other structural methods.
Conclusion: The proposed SSM-based framework with ESTF Adapter and TB-SSM effectively addresses limitations of existing models for temporal action detection, demonstrating superior performance in handling long video sequences with improved global temporal reasoning.
Abstract: Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM(TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.
[181] Neural Distribution Prior for LiDAR Out-of-Distribution Detection
Zizhao Li, Zhengkang Xiang, Jiayang Ao, Feng Liu, Joseph West, Kourosh Khoshelham
Main category: cs.CV
TL;DR: NDP framework improves LiDAR OOD detection by modeling prediction distributions and correcting class imbalance bias, achieving 10x better performance than previous methods.
Details
Motivation: Current LiDAR perception models fail to recognize out-of-distribution objects in open-world scenarios due to class imbalance issues and uniform distribution assumptions in existing OOD scoring methods.Method: Proposes Neural Distribution Prior (NDP) that models distributional structure of network predictions and adaptively reweights OOD scores based on alignment with learned distribution prior. Includes attention-based module to correct class-dependent confidence bias and Perlin noise-based OOD synthesis for generating diverse auxiliary OOD samples without external datasets.
Result: Achieves point-level AP of 61.31% on STU test set, which is more than 10x higher than previous best results. Demonstrates substantial improvement on SemanticKITTI and STU benchmarks.
Conclusion: NDP provides an effective solution for open-world LiDAR perception by addressing class imbalance issues and improving OOD detection performance, compatible with various existing OOD scoring formulations.
Abstract: LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31% on the STU test set, which is more than 10$\times$ higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.
[182] ELT: Elastic Looped Transformers for Visual Generation
Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
Main category: cs.CV
TL;DR: ELT is a parameter-efficient visual generative model using recurrent transformers with weight sharing and intra-loop self-distillation for image/video generation.
Details
Motivation: To create highly parameter-efficient visual generative models that maintain high synthesis quality while drastically reducing parameter counts compared to conventional deep transformer stacks.Method: Uses recurrent transformer architecture with iterative, weight-shared transformer blocks and Intra-Loop Self Distillation (ILSD) where intermediate loops are distilled from maximum training loops for consistency.
Result: Achieves 4× parameter reduction with competitive FID of 2.0 on ImageNet 256×256 and FVD of 72.8 on UCF-101, enabling Any-Time inference with dynamic quality-compute trade-offs.
Conclusion: ELT significantly shifts the efficiency frontier for visual synthesis by enabling parameter-efficient, high-quality image and video generation with flexible inference capabilities.
Abstract: We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model’s depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.
[183] Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang, Laurence T. Yang, Zhiming Zheng
Main category: cs.CV
TL;DR: Mosaic is a multi-view ensemble optimization framework for multimodal jailbreak attacks against closed-source VLMs that addresses surrogate dependency by reducing over-reliance on single surrogate models and visual views through text transformation, multi-view image optimization, and surrogate ensemble guidance.
Details
Motivation: Existing multimodal jailbreak attacks have limitations: explicit visual prompt attacks are easily detectable, while gradient-based adversarial optimization works well in homogeneous open-source settings but suffers from surrogate dependency in heterogeneous commercial closed-source VLM settings, creating a gap between attack effectiveness in different environments.Method: Mosaic uses three core components: 1) Text-Side Transformation module that perturbs refusal-sensitive lexical patterns, 2) Multi-View Image Optimization module that updates perturbations under diverse cropped views to avoid overfitting, and 3) Surrogate Ensemble Guidance module that aggregates optimization signals from multiple surrogate VLMs to reduce bias.
Result: Extensive experiments on safety benchmarks show Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs, demonstrating effectiveness in heterogeneous settings where previous methods fail.
Conclusion: Mosaic successfully addresses surrogate dependency in multimodal jailbreak attacks against closed-source VLMs through its multi-view ensemble approach, providing a more robust framework for evaluating and improving VLM safety in real-world heterogeneous deployment scenarios.
Abstract: Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
[184] UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation
Le-Van Thai, Tien Dat Nguyen, Hoai Nhan Pham, Lan Anh Dinh Thi, Duy-Dong Nguyen, Ngoc Lam Quang Bui
Main category: cs.CV
TL;DR: UniSemAlign: A dual-modal semantic alignment framework for semi-supervised semantic segmentation in computational pathology that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning.
Details
Motivation: Semi-supervised semantic segmentation in computational pathology faces challenges due to scarce pixel-level annotations and unreliable pseudo-label supervision. There's a need for better methods to leverage limited labeled data while maintaining segmentation quality.Method: Built on a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space. The framework fuses aligned representations with visual predictions to generate more reliable supervision for unlabeled histopathology images, trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives.
Result: Extensive experiments on GlaS and CRAG datasets show UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision.
Conclusion: UniSemAlign effectively addresses semi-supervised segmentation challenges in computational pathology by leveraging dual-modal semantic alignment to provide structured guidance, reduce class ambiguity, and stabilize pseudo-label refinement, demonstrating significant performance improvements with limited labeled data.
Abstract: Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: https://github.com/thailevann/UniSemAlign
[185] MixFlow: Mixed Source Distributions Improve Rectified Flows
Nazir Nayal, Christopher Wewer, Jan Eric Lenssen
Main category: cs.CV
TL;DR: MixFlow improves diffusion model sampling efficiency by conditioning source distribution on data-aligned signals and training on linear mixtures to reduce generative path curvature.
Details
Motivation: Diffusion models suffer from slow iterative sampling due to highly curved generative paths, which are caused by independence between the standard Gaussian source distribution and the data distribution.Method: Two complementary contributions: 1) κ-FC formulation conditions source distribution on arbitrary signal κ to better align with data, 2) MixFlow trains flow models on linear mixtures of unconditional distribution and κ-FC-based distribution to reduce curvature.
Result: Improves generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under fixed sampling budget, with faster training convergence and better sampling efficiency.
Conclusion: MixFlow effectively addresses curvature issues in diffusion models through better source-data alignment, leading to improved sampling efficiency and generation quality.
Abstract: Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation by two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing $κ\texttt{-FC}$, a general formulation that conditions the source distribution on an arbitrary signal $κ$ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a $κ\texttt{-FC}$-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with less required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under a fixed sampling budget. Code available at: $\href{https://github.com/NazirNayal8/MixFlow}{https://github.com/NazirNayal8/MixFlow}$
[186] Globally Optimal Pose from Orthographic Silhouettes
Agniva Sengupta, Dilara Kuş, Jianning Li, Stefan Zachow
Main category: cs.CV
TL;DR: A method for determining 3D shape pose from silhouettes using silhouette area continuity and aspect ratio signatures for global optimality
Details
Motivation: To solve the problem of determining the pose of known 3D shapes from their unoccluded silhouettes with global optimality, without relying on correspondences or being limited by shape convexity or genusMethod: Uses pre-computed silhouette-signatures modeled as response surfaces of silhouette areas, leveraging continuity of silhouette area with respect to rotation trajectories. Combines this with aspect ratio of 2D ellipses fitted to projected silhouettes as auxiliary shape signature. Uses resolution-guided candidate search through branching of rotation search space
Result: Validated on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. First method to efficiently estimate globally optimal pose from just silhouettes for any shape regardless of convexity and genus
Conclusion: Proposes a novel approach for globally optimal pose estimation from silhouettes using silhouette area continuity and auxiliary shape signatures, enabling efficient search without correspondence guidance
Abstract: We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. Code, data, and supplementary in: https://agnivsen.github.io/pose-from-silhouette/
[187] CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: CT-1 is a vision-language-camera model that generates videos with precise camera control by estimating camera trajectories using wavelet-based regularization and integrating them into a video diffusion model.
Details
Motivation: Existing video generation methods provide imprecise camera control from text or require manual trajectory parameters, limiting automated use. There's a need for accurate, automated camera-controllable video generation.Method: CT-1 uses vision-language modules and Diffusion Transformer to estimate camera trajectories, employs wavelet-based regularization loss in frequency domain, integrates trajectories into video diffusion model, and trains on CT-200K dataset with 47M+ frames.
Result: CT-1 bridges spatial reasoning and video synthesis, produces faithful camera-controllable videos, and improves camera control accuracy by 25.7% over prior methods.
Conclusion: CT-1 enables precise camera control in video generation by transferring spatial reasoning knowledge, offering automated high-quality video synthesis with accurate camera movements.
Abstract: Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
[188] Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
Jiahao Wang, Zikun Xu, Yuner Zhang, Zhongwei Jiang, Chenyang Lu, Shuocheng Yang, Yuxuan Wang, Jiaru Zhong, Chuang Zhang, Shaobing Xu, Jianqiang Wang
Main category: cs.CV
TL;DR: Long-SCOPE: A fully sparse framework for robust long-distance cooperative 3D perception in autonomous driving using V2X communication, addressing computational scaling and feature association challenges.
Details
Motivation: Existing cooperative 3D perception methods face practical deployment challenges at long distances due to quadratic computational scaling of dense BEV representations and fragile feature association mechanisms under observation/alignment errors.Method: Introduces a fully sparse framework with two novel components: 1) Geometry-guided Query Generation module for accurate detection of small, distant objects, and 2) learnable Context-Aware Association module for robust matching of cooperative queries despite positional noise.
Result: Achieves state-of-the-art performance on V2X-Seq and Griffin datasets, particularly in challenging 100-150 m long-range settings, while maintaining competitive computation and communication costs.
Conclusion: Long-SCOPE provides an effective solution for practical long-distance cooperative 3D perception in autonomous driving, overcoming key bottlenecks of existing methods.
Abstract: Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.
[189] Adding Another Dimension to Image-based Animal Detection
Vandita Shukla, Fabio Remondino, Benjamin Risse
Main category: cs.CV
TL;DR: A pipeline for generating 3D bounding box labels from 2D animal images using Skinned Multi Animal Linear models and camera pose refinement, enabling development of monocular 3D animal detection algorithms.
Details
Motivation: Monocular animal imaging reduces 3D structures to 2D projections, and existing detection algorithms produce 2D bounding boxes lacking orientation information. There's a lack of labeled datasets for 3D animal detection since labeling requires 3D input streams alongside RGB data.Method: Uses Skinned Multi Animal Linear models to estimate 3D bounding boxes, projects them into 2D image space using a dedicated camera pose refinement algorithm, and computes cuboid face visibility metrics to assess which sides of animals are captured.
Result: The method was evaluated on the Animal3D dataset and demonstrated accurate performance across different species and settings.
Conclusion: The generated 3D bounding boxes and visibility metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms.
Abstract: Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about animal’s orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.
[190] SHIFT: Steering Hidden Intermediates in Flow Transformers
Nina Konovalova, Andrey Kuznetsov, Aibek Alanov
Main category: cs.CV
TL;DR: SHIFT: A lightweight framework for concept removal and style control in DiT diffusion models via targeted activation steering at inference time
Details
Motivation: While DiT-based diffusion models achieve strong prompt adherence and high-quality image generation, there's a need for flexible control over generated content without retraining. The paper aims to enable concept removal, style shifting, and object manipulation at inference time.Method: SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps during inference. Inspired by activation steering in LLMs, it manipulates intermediate activations to suppress unwanted visual concepts while preserving overall image quality and remaining prompt content.
Result: SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without requiring time-consuming retraining. It can remove concepts, shift styles, and bias samples toward adding/changing target objects.
Conclusion: SHIFT offers a simple but effective framework for post-hoc control of DiT diffusion models, enabling concept removal and style manipulation through targeted activation steering at inference time.
Abstract: Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt’s remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired \emph{style domain} or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.
[191] PhysInOne: Visual Physics Learning and Reasoning in One Suite
Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li, Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, Bowen Cheng, Pok Kazaf Fu, Wai Kit Lai, Jiahao Chen, Kaiyuan Wang, Zhixuan Sun, Ziqi Li, Haochen Hu, Di Zhang, Chun Ho Yuen, Bing Wang, Zhihua Wang, Chuhang Zou, Bo Yang
Main category: cs.CV
TL;DR: PhysInOne is a massive synthetic dataset with 2M videos across 153K 3D scenes covering 71 physical phenomena, providing comprehensive ground-truth annotations for physics-aware AI training.
Details
Motivation: Addresses the critical scarcity of physically-grounded training data for AI systems, as existing datasets are limited to only hundreds or thousands of examples and lack comprehensive physical annotations.Method: Creates a large-scale synthetic dataset with 2 million videos across 153,810 dynamic 3D scenes covering 71 basic physical phenomena. Features multiobject interactions with complex backgrounds and comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions.
Result: Fine-tuning foundation models on PhysInOne significantly enhances physical plausibility across four applications: physics-aware video generation, future frame prediction, physical property estimation, and motion transfer. The dataset exposes critical gaps in modeling complex physical dynamics and estimating intrinsic properties.
Conclusion: PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI, being orders of magnitude larger than prior works and providing comprehensive physical annotations.
Abstract: We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne’s efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.
[192] TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference
Muhammad Hannan Akhtar, Ihab Amer, Tamer Shanableh
Main category: cs.CV
TL;DR: TinyNeRV: Systematic study of extremely compact neural video representations for efficient deployment in resource-constrained environments, introducing lightweight architectures and optimization techniques.
Details
Motivation: Existing neural video representations (NeRV) focus on moderate/high capacity models, leaving compact configurations for constrained environments insufficiently explored. Need systematic study of tiny architectures for efficient deployment.Method: Introduces two lightweight configurations (NeRV-T and NeRV-T+), evaluates across video datasets, explores knowledge distillation with frequency-aware focal supervision, and examines low-precision inference via post-training quantization and quantization-aware training.
Result: Carefully designed tiny NeRV variants achieve favorable quality-efficiency trade-offs while substantially reducing parameter count, computational cost, and memory requirements. Provides practical limits of compact neural video representations.
Conclusion: Tiny NeRV architectures enable efficient deployment in resource-constrained and real-time environments, offering guidance for practical applications of neural video representations.
Abstract: Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate or high capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of lowprecision inference is examined through both post training quantization and quantization aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality efficiency trade offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV style models in resource constrained and real-time environments. The official implementation is available at https: //github.com/HannanAkhtar/TinyNeRV-Implementation.
[193] Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
Huiang He, Shengchu Zhao, Jianwen Huang, Jie Li, Jiaqi Wu, Hu Zhang, Pei Tang, Heliang Zheng, Yukun Li, Rongfei Jia
Main category: cs.CV
TL;DR: Hitem3D 2.0 is a framework for generating high-quality 3D textures by integrating 2D multi-view generation priors with native 3D texture representations to address texture coverage, cross-view inconsistency, and geometry-texture misalignment issues.
Details
Motivation: Existing 3D texture generation methods suffer from incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture, which limits their practical application and quality.Method: Two-stage approach: 1) Multi-view synthesis framework using pre-trained image editing backbone with plug-and-play modules for geometric alignment, cross-view consistency, and illumination uniformity; 2) Native 3D texture generation model that projects multi-view textures onto 3D surfaces and completes textures in unseen regions.
Result: Hitem3D 2.0 outperforms existing methods in texture detail, fidelity, consistency, coherence, and alignment, demonstrating significant improvements in texture completeness and cross-view coherence.
Conclusion: The integration of multi-view consistency constraints with native 3D texture modeling effectively addresses key limitations in 3D texture generation, producing higher quality textures with better coverage and alignment.
Abstract: Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.
[194] Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang
Main category: cs.CV
TL;DR: A Video Diffusion Model that jointly learns video and camera trajectory distributions, enabling three tasks: camera prediction from video, joint video+camera generation from images, and video generation along target trajectories.
Details
Motivation: Traditional separation of camera parameter recovery and novel view rendering breaks down with sparse image coverage or ambiguous poses, as each task needs what the other produces. There's a need for unified modeling of videos and camera trajectories.Method: Rays as Pixels (RaP) - a Video Diffusion Model representing cameras as dense ray pixels (raxels) and denoising them jointly with video frames using Decoupled Self-Cross Attention mechanism. Single model handles three tasks through joint distribution learning.
Result: Model successfully performs: 1) camera trajectory prediction from video, 2) joint video+camera generation from input images, 3) video generation from images along target trajectories. Self-consistency tests show forward and inverse predictions agree, with trajectory prediction requiring far fewer denoising steps than video generation.
Conclusion: Joint modeling of videos and camera trajectories enables unified handling of multiple vision tasks, with self-consistency demonstrating the model’s coherent understanding of the relationship between visual content and camera motion.
Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation, even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
[195] FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding
Kaidong Feng, Zhuoxuan Huang, Huizhong Guo, Yuting Jin, Xinyu Chen, Yue Liang, Yifei Gai, Li Zhou, Yunshan Ma, Zhu Sun
Main category: cs.CV
TL;DR: FashionStylist: An expert-annotated benchmark for holistic fashion understanding with three tasks: outfit-to-item grounding, outfit completion, and outfit evaluation.
Details
Motivation: Existing fashion datasets are fragmented and task-specific, focusing on item attributes, outfit co-occurrence, or weak textual supervision, providing limited support for holistic outfit understanding that requires both visual perception and expert-level reasoning.Method: Constructed through a dedicated fashion-expert annotation pipeline, providing professionally grounded annotations at both item and outfit levels. Supports three tasks: outfit-to-item grounding (realistic item recovery from complex outfits), outfit completion (compatibility-aware composition), and outfit evaluation (expert-level assessment).
Result: FashionStylist serves as a unified benchmark for multiple fashion tasks and as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.
Conclusion: The paper introduces a comprehensive benchmark for holistic fashion understanding that bridges the gap between visual perception and expert-level reasoning, supporting multiple realistic fashion tasks through professional annotations.
Abstract: Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.
[196] Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images
Maciej Janicki, Aleksander Plocharski, Przemyslaw Musialski
Main category: cs.CV
TL;DR: Augmenting YOLOv8 with alignment loss improves structural coherence in facade parsing for procedural reconstruction
Details
Motivation: Standard object detectors produce facade parsings lacking structural coherence needed for downstream procedural reconstruction, as they treat architectural elements independentlyMethod: Augment YOLOv8 training with custom lightweight alignment loss that encourages grid-consistent bounding box arrangements, injecting geometric priors without changing inference pipeline
Result: Method improves structural regularity on CMP dataset, correcting alignment errors from perspective and occlusion while maintaining controllable trade-off with detection accuracy
Conclusion: Alignment loss regularization effectively enhances structural coherence in facade parsing for reconstruction applications while preserving standard detection performance
Abstract: Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.
[197] GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic
Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye, Tian Xie, Hujun Bao, Rui Wang. Yuchi Huo
Main category: cs.CV
TL;DR: GeRM is a multimodal generative rendering model that bridges the gap between Physically-Based Rendering (PBR) and Photorealistic Rendering (PRR) by learning a distribution transfer vector field to generate controllable photorealistic images.
Details
Motivation: There's a gap between PBR (mathematical light simulation) and PRR (photorealistic rendering) due to the need for realistic digital models of geometry and appearance. Current approaches face a dilemma: explicit simulation requires unreachable realistic models, while implicit generation sacrifices controllability and geometric consistency.Method: 1) Model the PBR-to-PRR transition as distribution transfer and learn a Distribution Transfer Vector Field (DTV Field). 2) Create P2P-50K dataset using multi-agent VLM framework for expert-guided pairwise transfers. 3) Develop multi-condition ControlNet to learn DTV Field, synthesizing PBR images and progressively transitioning them to PRR using G-buffers, text prompts, and region enhancement cues.
Result: GeRM enables fluid navigation between strict physical fidelity (PBR) and perceptual photorealism (PRR), allowing controllable generation of photorealistic images while maintaining geometric consistency.
Conclusion: GeRM presents the first multimodal generative rendering model that unifies PBR and PRR, addressing the P2P gap through a distribution transfer approach with controllable photorealistic image generation.
Abstract: For decades, Physically-Based Rendering (PBR) is the fundation of synthesizing photorealisitic images, and therefore sometimes roughly referred as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism has additional reliance on the realistic digital model of geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: the explicit simulation of PRR encumbered by unreachable realistic digital models for real-world existence, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach of mitigating P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes like G-buffers with text prompts, and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.
[198] VAGNet: Vision-based accident anticipation with global features
Vipooshan Vipulananthan, Charith D. Chitraranjan
Main category: cs.CV
TL;DR: VAGNet is a deep neural network that predicts traffic accidents from dashcam videos using global scene features instead of object-level features, achieving better performance and efficiency than existing methods.
Details
Motivation: Traffic accidents cause global fatalities and injuries, making accident anticipation crucial for driver assistance systems and autonomous driving. Current methods are computationally intensive due to object-level feature extraction, creating a need for more efficient real-time solutions.Method: VAGNet uses global scene features from dashcam videos without explicit object detection. It employs VideoMAE-V2 for global feature extraction and combines transformer and graph modules to learn accident prediction from traffic scene representations.
Result: Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show VAGNet achieves higher average precision and mean time-to-accident while being computationally more efficient than existing methods.
Conclusion: VAGNet demonstrates that global scene features are effective for accident anticipation, offering a computationally efficient alternative to object-level approaches while maintaining or improving prediction performance.
Abstract: Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.
[199] Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction
Yuze Su, Hongsong Wang, Jie Gui, Liang Wang
Main category: cs.CV
TL;DR: SFGS reconstructs expressive full-body 3D human avatars from monocular videos using Gaussian splatting with spatial triplanes and temporal hexplanes, plus structure-aware modules for fine details.
Details
Motivation: Existing 3D human avatar methods capture body motion but fail at fine details like hand movements and facial expressions. Need for photorealistic, topology-aware avatars from monocular videos.Method: Uses spatial-only triplane and time-aware hexplane for dynamic features. Structure-aware Gaussian module captures pose-dependent details coherently. Residual refinement module for fine-grained hand reconstruction. Single-stage training.
Result: Outperforms state-of-the-art baselines in quantitative and qualitative evaluations. Generates high-fidelity avatars with natural motion and fine details.
Conclusion: SFGS effectively reconstructs expressive, coherent full-body 3D human avatars with fine details from monocular videos using novel Gaussian splatting approach.
Abstract: Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. The SFGS use both spatial-only triplane and time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only a single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is on Github: https://github.com/Su245811YZ/SFGS
[200] From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection
Narges Rashvand, Shanle Yao, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
Main category: cs.CV
TL;DR: Proposes event-centric evaluation for pose-based video anomaly detection, showing large performance gap between frame-level and event-level metrics.
Details
Motivation: Traditional frame-level evaluation in video anomaly detection misaligns with real-world needs where anomalies manifest as coherent temporal events, not isolated frames. Current metrics overestimate performance for operational systems requiring actionable event-level alerts.Method: 1) Audits existing VAD benchmarks to characterize event structure; 2) Introduces two temporal event localization strategies: score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model; 3) Establishes event-based evaluation standard using Temporal Action Localization metrics.
Result: Substantial performance gap revealed: While state-of-the-art models achieve >52% frame-level AUC-ROC on NWPUC, their event-level localization precision falls below 10% at minimal tIoU=0.2, with average event-level F1 of only 0.11 across thresholds.
Conclusion: Frame-level metrics systematically overestimate VAD performance for deployment requiring actionable alerts. Event-centric perspective and evaluation are crucial for meaningful assessment of anomaly detection systems in operational surveillance.
Abstract: Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
[201] LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
Aytaç Sekmen, Fatih Emre Gunes, Furkan Horoz, Hüseyin Umut Işık, Mehmet Alp Ozaydin, Onur Altay Topaloglu, Şahin Umutcan Üstündaş, Yurdasen Alp Yeni, Halil Ersin Soken, Erol Sahin, Ramazan Gokberk Cinbis, Sinan Kalkan
Main category: cs.CV
TL;DR: LuMon is a benchmarking framework for evaluating monocular depth estimation methods for lunar exploration, featuring real Chang’e-3 stereo depth data and CHERI analog dataset, with systematic evaluation revealing persistent domain gaps between terrestrial and lunar environments.
Details
Motivation: Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation, but deploying terrestrial MDE networks to the Moon faces severe domain gaps due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on inadequate analogs and lack actual metric ground truth.Method: Introduces LuMon benchmarking framework with novel datasets: high-quality stereo ground truth depth from real Chang’e-3 mission and CHERI dark analog dataset. Conducts systematic zero-shot evaluation of state-of-the-art MDE architectures across synthetic, analog, and real datasets. Establishes sim-to-real domain adaptation baseline by fine-tuning foundation models on synthetic data.
Result: While domain adaptation yields drastic in-domain performance gains on synthetic data, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Analysis reveals inherent limitations of current networks in handling lunar-specific challenges like craters, rocks, extreme shading, and varying depth ranges.
Conclusion: The LuMon framework sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation, revealing that current terrestrial MDE methods struggle with lunar conditions despite domain adaptation efforts.
Abstract: Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation using electro-optical cameras. However, deploying terrestrial MDE networks to the Moon brings a severe domain gap due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on analogs that fail to replicate these conditions and lack actual metric ground truth. To address this, we present LuMon, a comprehensive benchmarking framework to evaluate MDE methods for lunar exploration. We introduce novel datasets featuring high-quality stereo ground truth depth from the real Chang’e-3 mission and the CHERI dark analog dataset. Utilizing this framework, we conduct a systematic zero-shot evaluation of state-of-the-art architectures across synthetic, analog, and real datasets. We rigorously assess performance against mission critical challenges like craters, rocks, extreme shading, and varying depth ranges. Furthermore, we establish a sim-to-real domain adaptation baseline by fine tuning a foundation model on synthetic data. While this adaptation yields drastic in-domain performance gains, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Our extensive analysis reveals the inherent limitations of current networks and sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation.
[202] VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao
Main category: cs.CV
TL;DR: VISOR is a single-agent framework for visual retrieval-augmented generation that addresses visual evidence sparsity and search drift in long-horizon reasoning through structured evidence space, visual action evaluation, and dynamic trajectory management.
Details
Motivation: Existing agentic Visual Retrieval-Augmented Generation (VRAG) systems face two critical bottlenecks: (1) Visual evidence sparsity where key evidence is scattered across pages and processed in isolation, hindering cross-page reasoning, and fine-grained intra-image evidence requires precise visual actions that are often misused; (2) Search drift in long horizons where accumulation of visual tokens dilutes context and causes cognitive overload, leading agents to deviate from their search objectives.Method: VISOR proposes a unified single-agent framework with: (1) structured Evidence Space for progressive cross-page reasoning, (2) Visual Action Evaluation and Correction mechanism to manage visual actions, (3) Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift, and (4) training using Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction.
Result: Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
Conclusion: VISOR effectively addresses the key challenges in agentic VRAG systems by providing a comprehensive framework that handles visual evidence sparsity and search drift, enabling more effective long-horizon visual reasoning through structured evidence organization and dynamic context management.
Abstract: Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval.. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
[203] Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors
Ying Zang, Yidong Han, Chaotao Ding, Yuanqi Hu, Deyi Ji, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu
Main category: cs.CV
TL;DR: A framework for dynamic 4D scene reconstruction that disentangles dynamic and static components using uncertainty modeling across three synergistic mechanisms: entropy-guided subspace projection, local-consistency geometry purification, and uncertainty-aware cross-view consistency.
Details
Motivation: While 3D foundation models excel in static settings, they struggle with dynamic sequences where motion causes significant geometric ambiguity. There's a need for methods that can effectively disentangle dynamic and static components in 4D scene reconstruction.Method: Three synergistic mechanisms: 1) Entropy-Guided Subspace Projection uses information-theoretic weighting to adaptively aggregate multi-head attention distributions, isolating dynamic motion cues from semantic noise. 2) Local-Consistency Driven Geometry Purification enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers. 3) Uncertainty-Aware Cross-View Consistency formulates multi-view projection refinement as heteroscedastic maximum likelihood estimation, using depth confidence as probabilistic weight.
Result: Outperforms current state-of-the-art methods on dynamic benchmarks, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Maintains feed-forward inference efficiency and requires no task-specific fine-tuning or per-scene optimization.
Conclusion: The framework effectively addresses geometric ambiguity in dynamic 4D scene reconstruction through uncertainty modeling across different stages, achieving superior performance while maintaining efficiency and generalization capabilities.
Abstract: Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.
[204] EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue
Main category: cs.CV
TL;DR: EpiAgent is an agent-centric system using LLM-based planning to restore degraded ancient inscriptions through multimodal analysis and iterative refinement, outperforming existing methods.
Details
Motivation: Ancient inscriptions suffer from complex degradation, but existing AI approaches struggle with rigid pipelines that can't handle heterogeneous real-world degradations. The authors aim to create a more flexible, expert-level restoration system inspired by human epigraphers' workflow.Method: Proposes EpiAgent, an agent-centric system that formulates inscription restoration as hierarchical planning. Uses an LLM-based central planner following Observe-Conceive-Execute-Reevaluate paradigm to orchestrate multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement.
Result: EpiAgent achieves superior restoration quality and stronger generalization across real-world degraded inscriptions compared to existing methods.
Conclusion: The work represents an important step toward expert-level agent-driven restoration of cultural heritage, demonstrating the value of agent-centric coordination for complex multimodal restoration tasks.
Abstract: Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.
[205] Envisioning the Future, One Step at a Time
Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer
Main category: cs.CV
TL;DR: Autoregressive diffusion model predicts sparse point trajectories for open-set future scene dynamics from single images, enabling fast generation of thousands of diverse futures while maintaining physical plausibility.
Details
Motivation: Existing approaches for scene evolution prediction rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than underlying sparse point trajectories. This makes large-scale exploration of future hypotheses costly and limits performance for long-horizon, multi-modal motion prediction.Method: Formulates future scene dynamics prediction as step-wise inference over sparse point trajectories using an autoregressive diffusion model. The model advances trajectories through short, locally predictable transitions while explicitly modeling uncertainty growth over time. This dynamics-centric representation enables fast rollout of diverse futures from single images.
Result: The method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed. It enables generation of thousands of diverse futures from a single image while maintaining physical plausibility and long-range coherence.
Conclusion: The sparse trajectory-based approach makes open-set future prediction scalable and practical, addressing limitations of dense prediction methods by focusing on underlying dynamics rather than appearance.
Abstract: Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
[206] Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
Zhuohan Ouyang, Zhe Qian, Wenhuo Cui, Chaoqun Wang
Main category: cs.CV
TL;DR: RC-GRPO-Editing: A region-constrained GRPO post-training framework for flow-based image editing that improves instruction adherence in target regions while preserving non-target content by reducing noisy credit assignment.
Details
Motivation: Existing flow-based image editing models using GRPO reward-driven post-training suffer from noisy credit assignment where global exploration perturbs non-target regions, inflating reward variance and yielding noisy advantages. This leads to poor instruction following and editing consistency.Method: Proposes RC-GRPO-Editing with two key components: 1) Region-decoupled initial noise perturbations to localize exploration and reduce background-induced reward variance, and 2) Attention concentration reward that aligns cross-attention with intended editing region throughout rollout to reduce unintended changes.
Result: Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation compared to existing methods.
Conclusion: The region-constrained GRPO framework effectively addresses noisy credit assignment in flow-based image editing, enabling cleaner localized credit assignment and better balance between target modification and non-target preservation.
Abstract: Instruction-guided image editing requires balancing target modification with non-target preservation. Recently, flow-based models have emerged as a strong and increasingly adopted backbone for instruction-guided image editing, thanks to their high fidelity and efficient deterministic ODE sampling. Building on this foundation, GRPO-based reward-driven post-training has been explored to directly optimize editing-specific rewards, improving instruction following and editing consistency. However, existing methods often suffer from noisy credit assignment: global exploration also perturbs non-target regions, inflating within-group reward variance and yielding noisy GRPO advantages. To address this, we propose RC-GRPO-Editing, a region-constrained GRPO post-training framework for flow-based image editing under deterministic ODE sampling. It suppresses background-induced nuisance variance to enable cleaner localized credit assignment, improving editing region instruction adherence while preserving non-target content. Concretely, we localize exploration via region-decoupled initial noise perturbations to reduce background-induced reward variance and stabilize GRPO advantages, and introduce an attention concentration reward that aligns cross-attention with the intended editing region throughout the rollout, reducing unintended changes in non-target regions. Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation.
[207] Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise
Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu
Main category: cs.CV
TL;DR: VisPrompt: A vision-guided prompt learning framework that improves robustness to label noise by injecting visual semantics into prompt representations using cross-modal attention and adaptive modulation.
Details
Motivation: Prompt learning for vision-language models is parameter-efficient but vulnerable to label noise. Visual content contains richer, more reliable semantic information that remains robust under noise, while text prompts are highly susceptible to noise.Method: Uses cross-modal attention to reversely inject visual semantics into prompt representations, allowing prompts to selectively aggregate relevant visual information. Introduces lightweight conditional modulation to adaptively control visual information injection strength based on visual cue quality.
Result: Outperforms existing baselines on seven benchmark datasets under both synthetic and real-world label noise settings. Improves robustness while keeping VLM backbone frozen and adding minimal trainable parameters.
Conclusion: VisPrompt effectively suppresses noise-induced disturbances, reduces prompt update instability, and alleviates memorization of mislabeled samples, achieving stronger robustness in noisy-label settings.
Abstract: Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses the noise-induced disturbances, reduce instability in prompt updates, and alleviate memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
[208] EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure
Junyeong Ahn, Seojin Yoon, Sungyong Baik
Main category: cs.CV
TL;DR: EGLOCE is a training-free method for removing unwanted concepts from text-to-image diffusion models using dual energy-guided latent optimization during inference.
Details
Motivation: As text-to-image diffusion models become more widespread, there's a growing need to remove specific concepts (explicit content, copyrighted characters/styles) for safety and compliance. Existing unlearning approaches are costly, degrade unrelated concepts, or rely on weak inference-time adjustments.Method: EGLOCE uses a dual-objective framework: (1) repulsion energy that steers generation away from target concepts via gradient descent in latent space, and (2) retention energy that preserves semantic alignment to the original prompt. It operates entirely at inference time without modifying model weights.
Result: Extensive experiments show EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. It enables plug-and-play integration with existing models.
Conclusion: EGLOCE establishes a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling, offering training-free concept erasure with better performance than previous approaches.
Abstract: As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts-mostly explicit content and many copyrighted characters or styles-has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degradation of unrelated concept fidelity, or depend on indirect inference-time adjustment that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by re-directing noisy latent during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Combined with previous approaches that either require erroneous modified model weights or provide weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.
[209] SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data
Qingwen Zhang, Xiaomeng Zhu, Chenhan Jiang, Patric Jensfelt
Main category: cs.CV
TL;DR: SynFlow generates large-scale synthetic LiDAR scene flow data from simulation to overcome real-world annotation scarcity, enabling robust motion priors that generalize well to real data.
Details
Motivation: 3D dynamic perception requires motion anticipation models, but progress is hindered by scarcity of dense, high-quality motion annotations in real-world data. Self-supervision on unlabeled real data doesn't close performance gaps due to noisy proxy signals.Method: SynFlow is a data generation pipeline that creates large-scale synthetic datasets for LiDAR scene flow from scalable simulation. It uses motion-oriented strategy (not sensor-specific realism) to synthesize diverse kinematic patterns across 4,000 sequences (~940k frames), representing 34x scale-up over existing real-world benchmarks.
Result: Models trained exclusively on SynFlow-4k synthetic data generalize across multiple real-world benchmarks in zero-shot regime, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art on TruckScenes by 31.8%. With only 5% real-world labels for fine-tuning, surpasses models trained from scratch on full budget.
Conclusion: Learning motion priors from scalable simulation is effective for 3D motion estimation. SynFlow-4k provides domain-invariant motion prior and serves as label-efficient foundation, enabling research in generalizable 3D motion estimation.
Abstract: Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences ($\sim$940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More detail can be found at https://kin-zhang.github.io/SynFlow.
[210] Do Vision Language Models Need to Process Image Tokens?
Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal
Main category: cs.CV
TL;DR: Vision Language Models (VLMs) don’t need deep visual processing throughout all layers; visual representations stabilize early while text continues evolving, and visual depth requirements are task-dependent.
Details
Motivation: Current VLMs process dense image tokens across deep transformer stacks with substantial computational overhead, but it's unclear whether sustained image-token processing is necessary or if visual representations meaningfully evolve across layers.Method: Systematically investigate functional role of image tokens in VLMs by analyzing visual representation convergence, entropy stabilization, intrinsic dimensionality compression, trajectory curvature, and conducting depth-wise visual truncation experiments.
Result: Visual representations rapidly converge to bounded-complexity regime (entropy stabilizes, dimensionality compresses, curvature becomes near-constant) while textual representations continue restructuring across depth. Visual representations become interchangeable between layers after stabilization. Visual depth necessity is task-dependent: single-token predictions robust to truncated depth, multi-token generation requires sustained visual access.
Conclusion: Deeper visual processing is not uniformly essential in VLMs, challenging current multimodal LLM architecture paradigms. Visual representations stabilize early and influence reasoning structure more than final conclusions.
Abstract: Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, \ie their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation require sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings \textbf{question the assumption} that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.
[211] SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images
Yuta Matsuzaki, Seiichi Uchida, Shumpei Takezaki
Main category: cs.CV
TL;DR: SCoRe is a training-free spectral regeneration method that improves image generation quality from diffusion models trained on noisy datasets by suppressing corrupted high-frequency components and regenerating them via SDEdit.
Details
Motivation: Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, degrading generation quality. Existing approaches require retraining or fine-tuning, which is computationally expensive.Method: Proposes SCoRe (Spectral Cutoff Regeneration): 1) Suppresses corrupted high-frequency components via frequency cutoff, 2) Regenerates them using SDEdit, 3) Derives theoretical mapping between cutoff frequency and SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD) to prevent excessive noise injection.
Result: Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets show SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without retraining or fine-tuning.
Conclusion: SCoRe provides an effective training-free solution for clean image generation from diffusion models trained on noisy datasets by leveraging spectral bias and controlled regeneration.
Abstract: Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.
[212] Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
Zhenwei Shao, Mingyang Wang, Weijun Zhang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Jun Yu
Main category: cs.CV
TL;DR: TwigVLM accelerates vision-language models by adding a lightweight ’twig’ module for better token pruning and self-speculative decoding, achieving 154% speedup with 96% accuracy retention.
Details
Motivation: Large VLMs have high computational overheads that hinder practical deployment. Existing token pruning methods suffer from accuracy drops due to insensitive attention signals in early layers and limited speedup for long responses.Method: TwigVLM grows a lightweight ’twig’ module on an early VLM layer. It uses twig-guided token pruning (TTP) for better accuracy retention and self-speculative decoding (SSD) for faster generation. TwigVLM++ extends this with multi-head architecture, two-stage training (distillation + pruning-oriented RL), and tree-based SSD.
Result: On LLaVA-1.5-7B, TwigVLM preserves 96% of original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, outperforming state-of-the-art VLM acceleration methods.
Conclusion: TwigVLM provides an effective architecture for accelerating VLMs while maintaining accuracy, addressing key limitations of existing token pruning methods through twig-guided pruning and speculative decoding.
Abstract: Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM’s early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM – a simple and general architecture by growing a lightweight module, named twig, upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Moreover, we extend TwigVLM to an improved TwigVLM++ variant by introducing a novel multi-head twig architecture with a specialized pruning head. TwigVLM++ improves pruning quality via a two-stage training paradigm combining a distillation learning stage and a pruning-oriented reinforcement learning stage, and further accelerates inference via a tree-based SSD strategy.
[213] AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
Mohammad Omama, Gabriele Berton, Eric Foxlin, Yelin Kim
Main category: cs.CV
TL;DR: AsymLoc: A distillation framework for visual localization that uses a large Teacher model offline and lightweight Student model online, achieving 95% of teacher accuracy with 10x smaller models through geometry-driven matching and joint detector-descriptor distillation.
Details
Motivation: Need for precise real-time visual localization on resource-constrained edge devices (AR/VR, robotics, smart glasses) where battery life and heat dissipation are concerns. Current efficient models still need further compute reduction without sacrificing accuracy.Method: Asymmetric visual localization with large Teacher model processing database images offline and lightweight Student processing queries online. Uses AsymLoc distillation framework with geometry-driven matching objective and joint detector-descriptor distillation to enable fast, parameter-less nearest-neighbor matching between different models.
Result: Achieves up to 95% of teacher’s localization accuracy using order of magnitude smaller models. Outperforms existing baselines on HPatches, ScanNet, IMC2022, and Aachen datasets, establishing new state-of-the-art efficiency-accuracy trade-off.
Conclusion: AsymLoc enables practical deployment of visual localization on resource-constrained devices by significantly reducing compute while maintaining high accuracy through asymmetric architecture and novel distillation approach.
Abstract: Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be a primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher’s localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.
[214] Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang, Lin Li, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu
Main category: cs.CV
TL;DR: Immersive Volumetric Videos (IVV) format with 6-DoF visual-audio interaction, constructed from real-world captures using ImViD dataset and Gaussian-based reconstruction pipeline with sound field reconstruction.
Details
Motivation: Need for fully immersive VR/AR experiences integrating 6-DoF visual and auditory interaction from real-world captured videos, which remains largely unexplored compared to computer-generated content.Method: 1) ImViD dataset: multi-view, multi-modal capture with synchronized video-audio acquisition; 2) Dynamic light field reconstruction using Gaussian-based spatio-temporal representation with flow-guided initialization and multi-term supervision; 3) First method for sound field reconstruction from multi-view audiovisual data.
Result: High-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces demonstrated through extensive benchmarks and immersive VR experiments.
Conclusion: Provides foundational definition and practical construction methodology for immersive volumetric videos, enabling real-world captured 6-DoF audiovisual experiences for VR/AR.
Abstract: Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground–background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
[215] Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer
Muhammad Affan, Ville Lehtola, George Vosselman
Main category: cs.CV
TL;DR: Semantics-aided incremental mesh reconstruction pipeline using RGB+LiDAR fusion for high-fidelity indoor reconstruction, outperforming geometric-only baselines.
Details
Motivation: Address challenges in geometric mesh reconstruction from LiDAR-inertial scans in complex indoor environments where point cloud sparsity, geometric drift, and fixed fusion parameters cause holes, over-smoothing, and spurious surfaces at structural boundaries.Method: Modular incremental RGB+LiDAR pipeline: vision foundation model labels each RGB frame; labels are incrementally projected/fused onto LiDAR-inertial odometry map; incremental semantics-aware TSDF fusion produces final mesh via marching cubes.
Result: Outperforms state-of-the-art geometric baselines ImMesh and Voxblox; semantic guidance improves geometric reconstruction quality; quantitative evaluation on Oxford Spires dataset shows improvements, qualitative results on NTU VIRAL dataset demonstrate benefits.
Conclusion: Semantics-aided fusion improves geometric mesh quality; resulting semantically labelled meshes are valuable for reconstructing USD assets, offering path from indoor LiDAR scanning to XR and digital modeling.
Abstract: Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments – such as cultural buildings – where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.
[216] Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
Shunkai Zhou, Zike Yan, Fei Xue, Dong Wu, Yuchen Deng, Hongbin Zha
Main category: cs.CV
TL;DR: Online3R is a sequential reconstruction framework that adapts to new scenes via online learning using visual prompts in a frozen geometry foundation model, with local-global self-supervised consistency constraints for efficient updates.
Details
Motivation: Existing reconstruction methods struggle with inconsistency issues when adapting to new scenes, and online learning faces challenges with missing ground truth and efficiency requirements during test-time updates.Method: Introduces learnable visual prompts into a pretrained frozen geometry foundation model; uses local-global self-supervised learning with local consistency on intermediate/fused results and global consistency on sparse keyframes for efficient online adaptation.
Result: Online3R outperforms previous state-of-the-art methods on various benchmarks, demonstrating effective adaptation to new scenes while maintaining geometry prediction capabilities.
Conclusion: The framework successfully resolves inconsistency issues in sequential reconstruction through online learning with visual prompts and efficient self-supervised constraints, advancing scene adaptation capabilities.
Abstract: We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/
[217] Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra
Main category: cs.CV
TL;DR: BSTD: A large-scale Indian language scene text dataset with 100K+ words across 11 languages for multiple text recognition tasks, addressing gaps in multilingual scene text research.
Details
Motivation: Indian language scene text recognition remains challenging due to script diversity, non-standard fonts, varying writing styles, and lack of high-quality datasets and open-source models, despite English scene text recognition being advanced.Method: Created Bharat Scene Text Dataset (BSTD) with 100K+ words spanning 11 Indian languages and English from 6,500+ scene images, with meticulous annotations supporting detection, script identification, cropped word recognition, and end-to-end recognition tasks.
Result: Evaluated state-of-the-art English models adapted for Indian languages, highlighting challenges and opportunities in Indian language scene text recognition. Dataset and models are open source.
Conclusion: BSTD represents a significant step toward advancing Indian language scene text recognition research by providing comprehensive benchmark data and open-source resources.
Abstract: Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
[218] RIRF: Reasoning Image Restoration Framework
Wending Yan, Rongkai Zhang, Kaihua Tang, Yu Cheng, Qiankun Liu
Main category: cs.CV
TL;DR: R&R integrates structured Chain-of-Thought reasoning into universal image restoration, using a fine-tuned Qwen3-VL model to diagnose degradations and provide interpretable priors for restoration, achieving SOTA performance.
Details
Motivation: Existing universal image restoration methods lack explicit diagnostic reasoning about degradation composition, severity, and scene semantics before restoration, focusing only on pixel reconstruction.Method: Proposes R&R framework with explicit reasoner (fine-tuned Qwen3-VL) that diagnoses degradation types, quantifies severity, infers degradation factors, and describes scene semantics. Uses structured reasoning as priors for restorer and leverages degradation severity as RL signals.
Result: Achieves state-of-the-art performance across diverse UIR benchmarks while providing unique interpretability into the restoration process.
Conclusion: R&R demonstrates that tightly coupling semantic diagnostic reasoning with pixel-level restoration in a unified framework improves both performance and interpretability in universal image restoration.
Abstract: Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.
[219] EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
Lulin Liu, Dayou Li, Yiqing Liang, Sicong Jiang, Hitesh Vijay, Hezhen Hu, Xuhai Xu, Zirui Liu, Srinivas Shakkottai, Manling Li, Zhiwen Fan
Main category: cs.CV
TL;DR: EgoTL introduces a think-aloud capture pipeline for egocentric data with say-before-act protocol to address noisy VLM auto-labeling in household tasks by providing accurate human action labels, chain-of-thought reasoning, and spatial annotations.
Details
Motivation: Current VLM-based auto-labeling for embodied intelligence is noisy due to lack of accurate human action labels, chain-of-thought reasoning, and spatial annotations, leading to hallucinations and errors in long-horizon spatial instruction following for household tasks.Method: EgoTL builds a think-aloud capture pipeline using say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, memory-bank walkthrough for scene context, and clip-level tags for navigation and manipulation actions.
Result: The method enables benchmarking VLMs and World Models on six task dimensions across over 100 daily household tasks, showing foundation models still fall short as egocentric assistants or open-world simulators. Finetuning with human CoT aligned with metric labels improves long-horizon planning, reasoning, instruction following, and spatial grounding.
Conclusion: EgoTL addresses critical gaps in egocentric data annotation for embodied intelligence, providing a framework for improving VLM performance in household tasks through better spatial grounding and reasoning chain alignment.
Abstract: Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
[220] Tango: Taming Visual Signals for Efficient Video Large Language Models
Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen
Main category: cs.CV
TL;DR: Tango is a token pruning framework for Video LLMs that improves attention-based selection and similarity-based clustering to optimize visual signal utilization while preserving performance.
Details
Motivation: Existing token pruning methods for Video LLMs have limitations: attention-based selection fails to account for multi-modal attention distributions, and similarity-based clustering creates fragmented clusters with distorted representations after pooling.Method: Proposes Tango framework with diversity-driven strategy for attention-based token selection and Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors.
Result: When retaining only 10% of video tokens, Tango preserves 98.9% of original performance on LLaVA-OV while delivering 1.88x inference speedup across various Video LLMs and benchmarks.
Conclusion: Tango effectively addresses limitations in existing token pruning methods for Video LLMs, optimizing visual signal utilization while maintaining performance and improving efficiency.
Abstract: Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
[221] LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models
Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang
Main category: cs.CV
TL;DR: LADR is a training-free acceleration method for discrete diffusion language models that speeds up multimodal image generation by 4x while maintaining quality, using spatial locality and frontier token recovery.
Details
Motivation: Discrete diffusion models for unified multimodal generation suffer from high inference latency due to iterative decoding. Existing acceleration methods either require expensive retraining or fail to exploit the spatial redundancy in visual data.Method: Locality-Aware Dynamic Rescue (LADR) exploits spatial Markov property of images by prioritizing recovery of tokens at the “generation frontier” (regions adjacent to observed pixels). It uses morphological neighbor identification, risk-bounded filtering to prevent error propagation, and manifold-consistent inverse scheduling to align diffusion trajectory with accelerated mask density.
Result: Achieves ~4x speedup over standard baselines on four text-to-image generation benchmarks while maintaining or even enhancing generative fidelity, especially in spatial reasoning tasks. Offers state-of-the-art efficiency-quality trade-off.
Conclusion: LADR provides an effective training-free acceleration method for discrete diffusion language models in multimodal generation, leveraging spatial locality to significantly reduce inference latency without compromising quality.
Abstract: Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the ‘‘generation frontier’’, regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that our LADR achieves an approximate 4 x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.
[222] OmniPrism: Learning Disentangled Visual Concept for Image Generation
Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, Guoqing Jin
Main category: cs.CV
TL;DR: OmniPrism: A method for disentangling multiple visual concepts from reference images to enable creative image generation with precise control over content, style, and composition aspects.
Details
Motivation: Existing methods for visual concept generation are limited to single-aspect concept generation or struggle with multi-aspect scenarios, leading to concept confusion and hindering creative generation. There's a need for better disentanglement of multiple concepts from reference images.Method: Proposes OmniPrism with a contrastive orthogonal disentangled (COD) training pipeline that learns disentangled concept representations guided by natural language. Uses a multimodal extractor’s semantic space for concept disentanglement, constructs a paired concept disentangled dataset (PCD-200K), and injects learned representations into additional diffusion cross-attention layers with block embeddings to adapt each block’s concept domain.
Result: Extensive experiments show the method can generate high-quality, concept-disentangled results with high fidelity to both text prompts and desired concepts from reference images.
Conclusion: OmniPrism effectively addresses multi-aspect concept generation challenges by learning disentangled visual concept representations and integrating them into diffusion models, enabling creative image generation with precise control over different concept aspects.
Abstract: Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept such as content, style, and composition. We learn disentangled concept representations through our contrastive orthogonal disentangled (COD) training pipeline, which are then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block’s concept domain in the diffusion models. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts.
[223] Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Main category: cs.CV
TL;DR: Chain-of-Zoom (CoZ) enables extreme image super-resolution beyond training scales by decomposing SR into autoregressive zoom steps with multi-scale-aware text prompts generated by vision-language models.
Details
Motivation: Current SISR models work well at trained scale factors but fail when asked to magnify far beyond their training regime. There's a need for scalable SR that can achieve extreme resolutions without retraining.Method: CoZ factorizes SR into autoregressive chain of intermediate scale states, repeatedly reusing a backbone SR model. It uses multi-scale-aware text prompts from a VLM fine-tuned with Generalized Reward Policy Optimization (GRPO) to guide high-magnification steps where visual cues diminish.
Result: A standard 4x diffusion SR model wrapped in CoZ achieves beyond 256x enlargement with high perceptual quality and fidelity, demonstrating scalability far beyond original training capabilities.
Conclusion: CoZ provides a model-agnostic framework for extreme super-resolution without additional training, leveraging vision-language models for guidance at high magnifications where visual information is limited.
Abstract: Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/.
[224] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
Main category: cs.CV
TL;DR: A novel inference-time defense strategy for generative medical vision-language models that mitigates harmful queries while avoiding over-defense through synthetic clinical demonstrations.
Details
Motivation: Medical vision-language models need security against harmful queries but face over-defense risks where safety mechanisms degrade general performance on benign clinical queries.Method: Proposes inference-time defense using synthetic clinical demonstrations to enhance safety against visual/textual jailbreak attacks without compromising performance, with mixed demonstration strategy for balancing security/performance under few-shot constraints.
Result: Defense strategy enhances model safety without significant performance degradation across nine medical imaging modalities; increasing demonstration budget alleviates over-defense; mixed demonstration strategy provides effective trade-off.
Conclusion: The proposed inference-time defense effectively secures medical VLMs against harmful queries while maintaining clinical utility, with demonstration-based approaches offering practical security-performance balance.
Abstract: Generative medical vision-language models~(Med-VLMs) are primarily designed to generate complex textual information~(e.g., diagnostic reports) from multimodal inputs including vision modality~(e.g., medical images) and language modality~(e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as \textit{Provide detailed instructions for using this CT scan for insurance fraud}. At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints.
[225] Listener-Rewarded Thinking in VLMs for Image Preferences
Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Main category: cs.CV
TL;DR: Introduces listener-augmented GRPO framework for training vision-language reward models, using independent listener model to evaluate reasoning traces and provide dense confidence scores that shape RL rewards, improving generalization and reducing reasoning contradictions.
Details
Motivation: Current reward models for human visual preferences often fail to generalize, and supervised fine-tuning leads to memorization. While RL approaches like GRPO help, they suffer from reasoning contradictions when a model's reasoning trace conflicts with an independent evaluator's assessment.Method: Proposes listener-augmented GRPO framework where an independent frozen vision-language model (listener) re-evaluates the reasoner’s chain-of-thought to provide dense, calibrated confidence scores. These scores shape the RL reward signal, encouraging the reasoner to produce explanations persuasive to an independent model.
Result: Achieves best accuracy on ImageReward benchmark (67.4%), significantly improves out-of-distribution performance on large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to GRPO and SFT baselines.
Conclusion: Listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences, demonstrating improved generalization and reduced reasoning contradictions.
Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model’s reasoning trace contradicts that of an independent, frozen vision-language model (“listener”) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner’s chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
[226] P3P Made Easy
Seong Hun Lee, Patrick Vandewalle, Javier Civera
Main category: cs.CV
TL;DR: Classical P3P problem revisited with compact algebraic solver based on 1841 formulation, achieving competitive accuracy and runtime with modern implementations.
Details
Motivation: The classical Perspective-Three-Point (P3P) problem for camera pose estimation has an elegant but overlooked formulation from 1841 that can be implemented efficiently with modern computational insights.Method: Revisits Grunert’s 1841 formulation that reduces P3P to a quartic polynomial with simple coefficients, and implements a compact algebraic solver with modern computational techniques.
Result: The classical formulation achieves accuracy and runtime comparable to state-of-the-art methods, offering excellent balance between simplicity, efficiency, and accuracy.
Conclusion: Historical P3P formulations remain highly competitive when implemented with modern computational insights, providing a simple yet effective solution for camera pose estimation.
Abstract: We revisit the classical Perspective-Three-Point (P3P) problem, which aims to recover the absolute pose of a calibrated camera from three 2D-3D correspondences. It has long been known that P3P can be reduced to a quartic polynomial with analytically simple and computationally efficient coefficients. However, this elegant formulation has been largely overlooked in modern literature. Building on the theoretical foundation that traces back to Grunert’s work in 1841, we propose a compact algebraic solver that achieves accuracy and runtime comparable to state-of-the-art methods. Our results show that this classical formulation remains highly competitive when implemented with modern insights, offering an excellent balance between simplicity, efficiency, and accuracy.
[227] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
Jingxuan He, Busheng Su, Finn Wong
Main category: cs.CV
TL;DR: PoseGen generates long-duration human videos from a single reference image and driving video using in-context LoRA finetuning for identity preservation and pose conditioning for motion control, with segment-interleaved generation for extended duration.
Details
Motivation: Current diffusion-based models struggle with generating temporally coherent, long-duration videos while maintaining precise control over subject identity and movement, often suffering from identity drift and being limited to short video lengths.Method: Uses in-context LoRA finetuning that injects subject appearance at token level for identity preservation and conditions on pose information at channel level for motion control. Introduces segment-interleaved generation with non-overlapping segments generated first using shared KV-cache for background consistency, then stitched via pose-aware interpolated generation.
Result: Despite training on only 33 hours of video data, PoseGen outperforms state-of-the-art baselines in identity fidelity, pose accuracy, and temporal consistency for long-duration video generation.
Conclusion: PoseGen presents an effective framework for generating extended-duration human videos with precise identity and motion control, addressing key limitations of current diffusion models through innovative architectural and generation strategies.
Abstract: Generating temporally coherent, long-duration videos with precise control over subject identity and movement remains a fundamental challenge for contemporary diffusion-based models, which often suffer from identity drift and are limited to short video length. We present PoseGen, a novel framework that generates human videos of extended duration from a single reference image and a driving video. Our contributions include an in-context LoRA finetuning design that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, we introduce a segment-interleaved generation strategy, where non-overlapping segments are first generated with improved background consistency through a shared KV-cache mechanism, and then stitched into a continuous sequence via pose-aware interpolated generation. Despite being trained on a remarkably small 33-hour video dataset, PoseGen demonstrates superior performance over state-of-the-art baselines in identity fidelity, pose accuracy, and temporal consistency. Code is available at https://github.com/Jessie459/PoseGen .
[228] ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Main category: cs.CV
TL;DR: Token-level watermarking for autoregressive image models using visual token clustering improves robustness against perturbations while maintaining image quality
Details
Motivation: In-generation watermarking works well for latent diffusion models but is underexplored for autoregressive image models, which generate images via visual token sequences. Existing token-level watermarking schemes from LLMs don't transfer well to images due to decreased detectability under common perturbations.Method: Propose watermarking based on visual token clustering that assigns similar tokens to the same set (red/green). Investigate training-free clustering and fine-tuned token/cluster predictors. Biases next-token prediction based on prior tokens like KGW watermarking for LLMs.
Result: Cluster-based watermarks greatly improve robustness against perturbations and regeneration attacks while preserving image quality, outperforming baselines and concurrent works. Offers fast verification runtime comparable to lightweight post-hoc watermarking.
Conclusion: Token clustering enables effective watermarking for autoregressive image models with strong robustness and practical verification speed, bridging the gap between LLM watermarking techniques and visual generation models.
Abstract: In-generation watermarking for latent diffusion models has recently shown high robustness in marking generated images for easier detection and attribution. However, its application to autoregressive (AR) image models is underexplored. Autoregressive models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a VQ-VAE decoder. Inspired by KGW watermarking for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose a watermarking approach based on visual token clustering, which assigns similar tokens to the same set (red or green). We investigate token clustering in a training-free setting, as well as in combination with a more accurate fine-tuned token or cluster predictor. Overall, our experiments show that cluster-based watermarks greatly improve robustness against perturbations and regeneration attacks while preserving image quality, outperforming a set of baselines and concurrent works. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking techniques.
[229] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Jianxiang He, Meisheng Hong, Jungang Li, Weiyu Guo, Xuming Hu, Hui Xiong
Main category: cs.CV
TL;DR: VSI is a multimodal keyframe retrieval framework that integrates visual and subtitle information for precise localization in long videos, achieving SOTA performance on text-related tasks.
Details
Motivation: Existing keyframe search algorithms for MLLMs rely solely on visual modality, making them difficult to adapt to text-related tasks and often causing retrieval results to deviate from core semantic content.Method: Proposes VISUAL-SUBTITLE INTEGRATION (VSI), a multimodal keyframe retrieval framework using dual-branch collaborative retrieval combining Video Search and Subtitle Match to fuse complementary visual and textual information.
Result: Experiments on LongVideoBench and VideoMME show VSI achieves state-of-the-art accuracy in keyframe retrieval, delivers breakthrough performance in text-related tasks, and exhibits strong generalization across other tasks.
Conclusion: VSI effectively addresses limitations of visual-only keyframe retrieval by integrating multimodal information, significantly improving performance on text-related video understanding tasks.
Abstract: Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality but heavily rely on the visual modality alone. This makes them difficult to adapt to text-related tasks and often leads to retrieval results deviating from core semantic content. To address this, we propose the VISUAL-SUBTITLE INTEGRATION (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach combining Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering breakthrough performance in text-related tasks and exhibiting strong generalization across other tasks.
[230] Mitigating Domain Drift in Multi Species Segmentation with DINOv2: A Cross-Domain Evaluation in Herbicide Research Trials
Artzai Picon, Itziar Eguskiza, Daniel Mugica, Javier Romero, Carlos Javier Jimenez, Eric White, Gabriel Do-Lago-Junqueira, Christian Klukas, Ramon Navarra-Mestre
Main category: cs.CV
TL;DR: Vision foundation models (DINOv2) with hierarchical taxonomic inference improve plant species and damage segmentation robustness across domain shifts in agricultural monitoring.
Details
Motivation: Deep learning models for plant segmentation often fail to generalize across real-world variations like seasons, geographies, devices, and sensing modalities, limiting their operational use in phenotyping pipelines.Method: Integrates vision foundation models (DINOv2) with hierarchical taxonomic inference, trained on multi-year datasets from Germany and Spain, and tested under challenging domain shifts including temporal changes, geographic transfer, and extreme sensor shifts to drone imagery.
Result: Foundation-model backbone consistently outperforms baselines, improving species-level F1 from 0.52 to 0.87 on in-distribution data, and maintains advantages under moderate (0.77 vs. 0.24) and extreme (0.44 vs. 0.14) shift conditions. Hierarchical inference provides additional robustness.
Conclusion: Combining foundation models with structured biological hierarchies enables scalable, shift-resilient agricultural monitoring, now deployed in BASF’s phenotyping workflow for herbicide research trials.
Abstract: Reliable plant species and damage segmentation for herbicide field research trials requires models that can withstand substantial real-world variation across seasons, geographies, devices, and sensing modalities. Most deep learning approaches trained on controlled datasets fail to generalize under these domain shifts, limiting their suitability for operational phenotyping pipelines. This study evaluates a segmentation framework that integrates vision foundation models (DINOv2) with hierarchical taxonomic inference to improve robustness across heterogeneous agricultural conditions. We train on a large, multi-year dataset collected in Germany and Spain (2018-2020), comprising 14 plant species and 4 herbicide damage classes, and assess generalization under increasingly challenging shifts: temporal and device changes (2023), geographic transfer to the United States, and extreme sensor shift to drone imagery (2024). Results show that the foundation-model backbone consistently outperforms prior baselines, improving species-level F1 from 0.52 to 0.87 on in-distribution data and maintaining significant advantages under moderate (0.77 vs. 0.24) and extreme (0.44 vs. 0.14) shift conditions. Hierarchical inference provides an additional layer of robustness, enabling meaningful predictions even when fine-grained species classification degrades (family F1: 0.68, class F1: 0.88 on aerial imagery). Error analysis reveals that failures under severe shift stem primarily from vegetation-soil confusion, suggesting that taxonomic distinctions remain preserved despite background and viewpoint variability. The system is now deployed within BASF’s phenotyping workflow for herbicide research trials across multiple regions, illustrating the practical viability of combining foundation models with structured biological hierarchies for scalable, shift-resilient agricultural monitoring.
[231] SelfHVD: Self-Supervised Handheld Video Deblurring
Honglei Xu, Zhilu Zhang, Junjie Fan, Xiaohe Wu, Wangmeng Zuo
Main category: cs.CV
TL;DR: Self-supervised method for handheld video deblurring using sharp clues from video as training signals, with self-enhanced data creation and spatial consistency constraints.
Details
Motivation: Handheld video shooting often results in blurry frames due to camera shake, and existing methods struggle with real-world handheld video due to domain gaps between training and testing data.Method: 1) Extract sharp clues from video as misalignment labels for neighboring blurry frames; 2) Self-Enhanced Video Deblurring (SEVD) to create higher-quality paired data; 3) Self-Constrained Spatial Consistency Maintenance (SCSCM) to prevent position shifts; 4) Construct synthetic and real-world handheld video datasets.
Result: Method significantly outperforms existing self-supervised approaches on constructed datasets and common real-world datasets.
Conclusion: Proposed self-supervised framework effectively addresses handheld video deblurring by leveraging sharp clues within videos, with novel data enhancement and consistency constraints.
Abstract: Shooting video with handheld shooting devices often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the deblurring ability of the model, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct synthetic and real-world handheld video datasets for handheld video deblurring. Extensive experiments on these and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://cshonglei.github.io/SelfHVD.
[232] VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
Jiajing Lin, Shu Jiang, Qingyuan Zeng, Zhenzhong Wang, Min Jiang
Main category: cs.CV
TL;DR: VisionLaw: A bilevel optimization framework that uses LLMs as physics experts to generate interpretable constitutive laws from visual observations, enabling physically plausible interactive simulation with 3D assets.
Details
Motivation: Existing methods for inferring intrinsic dynamics from visual observations face challenges: manually defined constitutive priors don't align well with actual dynamics, while neural network approaches lack interpretability and generalization. There's a need for interpretable expressions of intrinsic dynamics that can generalize to novel scenarios.Method: Bilevel optimization framework with: 1) Upper level: LLMs-driven decoupled constitutive evolution strategy where LLMs act as physics experts to generate/revise constitutive laws with decoupling mechanism to reduce search complexity. 2) Lower level: Vision-guided constitutive evaluation mechanism that uses visual simulation to evaluate consistency between generated laws and underlying intrinsic dynamics.
Result: Outperforms state-of-the-art methods on synthetic and real-world datasets, effectively infers interpretable intrinsic dynamics from visual observations, and exhibits strong generalization for interactive simulation in novel scenarios.
Conclusion: VisionLaw successfully addresses the limitations of existing approaches by combining LLMs’ reasoning with visual simulation to infer interpretable constitutive laws, enabling physically plausible interactive simulation with 3D assets.
Abstract: The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to align with actual intrinsic dynamics; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLMs-driven decoupled constitutive evolution strategy, where LLMs are prompted to act as physics experts to generate and revise constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of LLMs. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.
[233] Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios
Xuekang Zhu, Ji-Zhe Zhou, Kaiwen Feng, Chenfan Qu, Xiwen Wang, Yunfei Wang, Liting Zhou, Jian Liu
Main category: cs.CV
TL;DR: Paper 2509.20006: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to unavailability of paper contentMethod: Cannot determine method due to unavailability of paper content
Result: Cannot determine results due to unavailability of paper content
Conclusion: Cannot determine conclusion due to unavailability of paper content
Abstract: Failed to fetch summary for 2509.20006: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20006&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[234] LoBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction
Sheng-Hsiang Hung, Ting-Yu Yen, Wei-Fang Sun, Simon See, Shih-Hsuan Hung, Hung-Kuo Chu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.01767 suggests it’s from October 2025, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2510.01767: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01767&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[235] Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing
Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.03548 suggests it’s from October 2024, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2510.03548: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03548&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[236] REACT3D: Recovering Articulations for Interactive Physical 3D Scenes
Zhao Huang, Boyang Sun, Alexandros Delitzas, Jiaqi Chen, Marc Pollefeys
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.11340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[237] Adversarial Concept Distillation for One-Step Diffusion Personalization
Yixiong Yang, Tao Wu, Senmao Li, Shiqi Yang, Yaxing Wang, Joost van de Weijer, Kai Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.20512: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20512&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[238] Generative View Stitching
Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.24718: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24718&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[239] Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
Shaokai Wu, Yanbiao Ji, Qiuchang Li, Zhiyi Zhang, Qichen He, Wenyuan Xie, Guodong Zhang, Bayram Bayramli, Yue Ding, Hongtao Lu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.10181: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10181&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[240] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Mahlagha Fazeli, Abolfazl Razi
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.26641: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26641&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[241] Another BRIXEL in the Wall: Towards Cheaper Dense Features
Alexander Lappe, Martin A. Giese
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2511.05168: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05168&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[242] How Noise Benefits AI-generated Image Detection
Ziqiang Li, Jiazhen Yan, Fan Wang, Kai Zeng, Zhangjie Fu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2511.16136: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16136&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[243] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang, Sebastian Scherer
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.19704: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19704&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[244] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images
Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.20068: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20068&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[245] A Compact Hybrid Convolution–Frequency State Space Network for Learned Image Compression
Haodong Pan, Hao Wei, Yusong Wang, Nanning Zheng, Caigui Jiang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2511.20151: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20151&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[246] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access errorMethod: Cannot determine method due to access error
Result: Cannot determine results due to access error
Conclusion: Cannot determine conclusion due to access error
Abstract: Failed to fetch summary for 2512.02231: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02231&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[247] SimScale: Learning to Drive via Real-World Simulation at Scale
Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.23369: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.23369&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[248] AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2511.18960: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18960&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[249] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
Lingjun Zhao, Yandong Luo, James Hays, Lu Gan
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.03370: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03370&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[250] Out-of-the-box: Black-box Causal Attacks on Object Detectors
Melane Navaratnarajah, David A. Kelly, Hana Chockler
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access restrictionsMethod: Cannot determine method due to access restrictions
Result: Cannot determine results due to access restrictions
Conclusion: Cannot determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2512.03730: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03730&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[251] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection
Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2512.04175: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04175&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[252] Relational Visual Similarity
Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2512.07833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[253] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.17012: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17012&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[254] Adversarial Evasion Attacks on Computer Vision using SHAP Values
Frank Mollard, Marcus Becker, Florian Roehrbein
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2601.10587: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10587&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[255] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
Andreas Zinonos, Michał Stypułkowski, Antoni Bigata, Stavros Petridis, Maja Pantic, Nikita Drobyshev
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in accessing paper informationMethod: Unable to determine method due to technical error in accessing paper information
Result: Unable to determine results due to technical error in accessing paper information
Conclusion: Unable to determine conclusion due to technical error in accessing paper information
Abstract: Failed to fetch summary for 2512.20033: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20033&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[256] Streaming Video Instruction Tuning
Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2512.21334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[257] Self-Supervised Slice-to-Volume Reconstruction with Gaussian Representations for Fetal MRI
Yinsong Wang, Thomas Fletcher, Xinzhe Luo, Aine Travers Dineen, Rhodri Cusack, Chen Qin
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2601.22990 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2601.22990: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22990&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[258] EmoCtrl: Controllable Emotional Image Content Generation
Jingyuan Yang, Weibin Luo, Hui Huang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2512.22437: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22437&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[259] Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
Bryan Sangwoo Kim, Jonghyun Park, Jong Chul Ye
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2602.03342: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03342&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[260] Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers
Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.04791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[261] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu
Main category: cs.CV
TL;DR: Unable to analyze paper 2601.10632 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions without access to the paper content
Abstract: Failed to fetch summary for 2601.10632: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10632&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[262] Zero-Shot Generative De-identification: Inversion-Free Flow for Privacy-Preserving Skin Image Analysis
Konstantinos Moutselos, Ilias Maglogiannis
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to paper fetch failureMethod: Unable to determine method due to paper fetch failure
Result: Unable to determine results due to paper fetch failure
Conclusion: Unable to draw conclusions due to paper fetch failure
Abstract: Failed to fetch summary for 2602.00821: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00821&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[263] Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)
Joao Manoel Herrera Pinheiro, Gabriela Do Nascimento Herrera, Luciana Bueno Dos Reis Fernandes, Alvaro Doria Dos Santos, Ricardo V. Godoy, Eduardo A. B. Almeida, Helena Carolina Onody, Marcelo Andrade Da Costa Vieira, Angelica Maria Penteado-Dias, Marcelo Becker
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.20028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[264] Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.01400: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01400&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[265] Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.06665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[266] Intrinsic Concept Extraction Based on Compositional Interpretability
Hanyu Shi, Hong Tao, Guoheng Huang, Jianbin Jiang, Xuhang Chen, Chi-Man Pun, Shanhu Wang, Pan Pan
Main category: cs.CV
TL;DR: The paper with ID 2603.11795 could not be analyzed due to HTTP 429 error (rate limiting) when attempting to fetch the abstract from arXiv API.
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limiting.Method: Unable to determine method as the paper content could not be retrieved due to API rate limiting.
Result: Unable to determine results as the paper content could not be retrieved due to API rate limiting.
Conclusion: Unable to draw conclusions about the paper as the content could not be retrieved due to API rate limiting.
Abstract: Failed to fetch summary for 2603.11795: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11795&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[267] CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, Jian Pu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2603.18561: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18561&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[268] RAM: Recover Any 3D Human Motion in-the-Wild
Sen Jia, Ning Zhu, Jinqin Zhong, Jiale Zhou, Huaping Zhang, Jenq-Neng Hwang, Lei Li
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2603.19929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[269] Chronological Contrastive Learning: Few-Shot Progression Assessment in Irreversible Diseases
Clemens Watzenböck, Daniel Aletaha, Michaël Deman, Thomas Deimel, Jana Eder, Ivana Janickova, Robert Janiczek, Peter Mandl, Philipp Seeböck, Gabriela Supp, Paul Weiser, Georg Langs
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.21935: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21935&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[270] FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation
Yukinori Yamamoto, Kazuya Nishimura, Tsukasa Fukusato, Hirokazu Nosato, Tetsuya Ogata, Hirokatsu Kataoka
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2603.23199: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23199&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[271] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl, Martin Schramm
Main category: cs.CV
TL;DR: This paper appears to be about multimodal large language models with audio and vision understanding/generation capabilities, but the abstract could not be fetched due to HTTP 429 error (rate limiting).
Details
Motivation: Unable to determine motivation from the abstract due to fetch failure. Based on the paper ID (2603.27817) and the reader's research interests, it likely addresses multimodal AI combining audio and vision with large language models.Method: Unknown - abstract fetch failed. Likely involves novel architectures or training approaches for multimodal LLMs handling audio and visual inputs.
Result: Unknown - abstract fetch failed. Results would typically include benchmarks, performance metrics, or novel capabilities in audio-visual understanding/generation.
Conclusion: Unknown - abstract fetch failed. Would typically discuss implications for multimodal AI, limitations, and future work in audio-visual LLMs.
Abstract: Failed to fetch summary for 2603.27817: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27817&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[272] B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition
Nishit Poddar, Aglind Reka, Diana-Laura Borza, Snehashis Majhi, Michal Balazia, Abhijit Das, Francois Bremond
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2603.24245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[273] HD-VGGT: High-Resolution Visual Geometry Transformer
Tianrun Chen, Yuanqi Hu, Yidong Han, Hanjie Xu, Deyi Ji, Qi Zhu, Chunan Yu, Xin Zhang, Cheng Chen, Chaotao Ding, Ying Zang, Xuanfu Li, Jin Ma, Lanyun Zhu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.27222: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27222&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[274] RetinexDualV2: Physically-Grounded Dual Retinex for Generalized UHD Image Restoration
Mohab Kishawy, Jun Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2603.27979: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27979&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[275] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, Yonglin Tian
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2604.02241: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02241&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[276] DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu, Dmitriy Smirnov, Pablo Salamanca, Dao Mi, Pablo Delgado, Ning Yu, Julien Philip, Xin Li, Wenping Wang, Paul Debevec
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2604.06161
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method without access to paper content
Result: No results available due to technical limitations in accessing the paper
Conclusion: Cannot provide analysis due to HTTP 429 error when attempting to fetch paper from arXiv
Abstract: Failed to fetch summary for 2604.06161: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06161&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[277] R3PM-Net: Real-time, Robust, Real-world Point Matching Network
Yasaman Kashefbahrami, Erkut Akdag, Panagiotis Meletis, Evgeniya Balmashnova, Dip Goswami, Egor Bondarau
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2604.05060: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05060&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[278] HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach
Main category: cs.CV
TL;DR: Paper 2604.06165: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to HTTP 429 error preventing access to paper contentMethod: Unable to determine method due to HTTP 429 error preventing access to paper content
Result: Unable to determine results due to HTTP 429 error preventing access to paper content
Conclusion: Unable to determine conclusion due to HTTP 429 error preventing access to paper content
Abstract: Failed to fetch summary for 2604.06165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[279] OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation
Seungjae Moon, Seunghyun Oh, Youngmin Ro
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.08110: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08110&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[280] Needle in a Haystack: One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology
Swarnadip Chatterjee, Vladimir Basic, Arrigo Capitanio, Orcun Goksel, Joakim Lindblad
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2604.07722: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07722&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[281] Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models
Gexin Huang, Anqi Li, Yusheng Tan, Beidi Zhao, Gang Wang, Zu-Hua Gao, Xiaoxiao Li
Main category: cs.CV
TL;DR: Paper 2604.07779: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to HTTP 429 error preventing access to paper detailsMethod: Unable to determine method due to HTTP 429 error preventing access to paper details
Result: Unable to determine results due to HTTP 429 error preventing access to paper details
Conclusion: Unable to determine conclusion due to HTTP 429 error preventing access to paper details
Abstract: Failed to fetch summary for 2604.07779: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07779&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[282] CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2604.08457: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08457&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[283] Generative 3D Gaussian Splatting for Arbitrary-ResolutionAtmospheric Downscaling and Forecasting
Tao Han, Zhibin Wen, Zhenghao Chen, Fenghua Lin, Junyu Gao, Song Guo, Lei Bai
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.07928: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07928&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[284] SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
Yunsong Zhou, Hangxu Liu, Xuekun Jiang, Xing Shen, Yuanzhen Zhou, Hui Wang, Baole Fang, Yang Tian, Mulin Yu, Qiaojun Yu, Li Ma, Hengjie Li, Hanqing Wang, Jia Zeng, Jiangmiao Pang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2604.08544: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08544&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[285] Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps
Mohammad Daouk, Jan Ulrich Becker, Neeraja Kambham, Anthony Chang, Hien Van Nguyen, Chandra Mohan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to draw conclusions due to missing paper content
Abstract: Failed to fetch summary for 2604.07936: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07936&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[286] SAT: Selective Aggregation Transformer for Image Super-Resolution
Dinh Phu Tran, Thao Do, Saad Wazir, Seongah Kim, Seon Kwon Kim, Daeyoung Kim
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2604.07994: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07994&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[287] PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.08125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[288] CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild
Siyuan Yao, Hao Sun, Ruiqi Yu, Xiwei Jiang, Wenqi Ren, Xiaochun Cao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval errorMethod: Unable to determine method due to retrieval error
Result: Unable to determine results due to retrieval error
Conclusion: Unable to determine conclusion due to retrieval error
Abstract: Failed to fetch summary for 2604.08287: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08287&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[289] SyncBreaker:Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
Wenli Zhang, Xianglong Shi, Sirui Zhao, Xinqi Chen, Guo Cheng, Yifan Xu, Tong Xu, Yong Liao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2604.08405: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08405&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[290] ParseBench: A Document Parsing Benchmark for AI Agents
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Daniel B. Ospina, Simon Suo
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2604.08538: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08538&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[291] ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets
Xiaoben Li, Jingyi Wu, Zeyu Cai, Siyuan Yu, Boqian Li, Yuliang Xiu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2604.08548: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08548&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[292] When & How to Write for Personalized Demand-aware Query Rewriting in Video Search
Cheng cheng, Chenxing Wang, Aolin Li, Haijun Wu, Huiyun Hu, Juyuan Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.17667: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17667&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[293] Enhanced Self-Supervised Multi-Image Super-Resolution for Camera Array Images
Yating Chen, Feng Huang, Xianyu Wu, Jing Wu, Ying Shen
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2604.06816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[294] TurPy: a physics-based and differentiable optical turbulence simulator for algorithmic development and system optimization
Joseph L. Greene, Alfred Moore, Iris Ochoa, Emily Kwan, Patrick Marano, Christopher R. Valenta
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2604.07248: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07248&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[295] OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains
Jun He, Deying Yu
Main category: cs.AI
TL;DR: OpenKedge is a protocol that transforms API mutations into governed processes using declarative intent proposals, execution contracts, and cryptographic evidence chains for safe autonomous agent operation.
Details
Motivation: Current API-centric architectures for autonomous AI agents allow probabilistic systems to directly execute state mutations without proper context, coordination, or safety guarantees, creating fundamental security and reliability flaws.Method: OpenKedge introduces a protocol where actors submit declarative intent proposals evaluated against system state, temporal signals, and policies. Approved intents become execution contracts with strict bounds on actions, resources, and time, enforced via ephemeral identities. The Intent-to-Execution Evidence Chain (IEEC) cryptographically links all components.
Result: OpenKedge deterministically arbitrates competing intents and prevents unsafe execution while maintaining high throughput, as demonstrated in multi-agent conflict scenarios and cloud infrastructure mutations.
Conclusion: OpenKedge provides a principled foundation for safely operating agentic systems at scale by shifting from reactive filtering to preventative, execution-bound enforcement with verifiable mutation processes.
Abstract: The rise of autonomous AI agents exposes a fundamental flaw in API-centric architectures: probabilistic systems directly execute state mutations without sufficient context, coordination, or safety guarantees. We introduce OpenKedge, a protocol that redefines mutation as a governed process rather than an immediate consequence of API invocation. OpenKedge requires actors to submit declarative intent proposals, which are evaluated against deterministically derived system state, temporal signals, and policy constraints prior to execution. Approved intents are compiled into execution contracts that strictly bound permitted actions, resource scope, and time, and are enforced via ephemeral, task-oriented identities. This shifts safety from reactive filtering to preventative, execution-bound enforcement. Crucially, OpenKedge introduces an Intent-to-Execution Evidence Chain (IEEC), which cryptographically links intent, context, policy decisions, execution bounds, and outcomes into a unified lineage. This transforms mutation into a verifiable and reconstructable process, enabling deterministic auditability and reasoning about system behavior. We evaluate OpenKedge across multi-agent conflict scenarios and cloud infrastructure mutations. Results show that the protocol deterministically arbitrates competing intents and cages unsafe execution while maintaining high throughput, establishing a principled foundation for safely operating agentic systems at scale.
[296] From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI
Hongyin Zhu, Jinming Liang, Mengjun Hou, Ruifan Tang, Xianbin Zhu, Jingyuan Yang, Yuanman Mao, Feng Wu
Main category: cs.AI
TL;DR: LOM-action introduces event-driven ontology simulation for enterprise AI, using business events to trigger deterministic graph mutations in isolated sandboxes, ensuring decisions are derived exclusively from scenario-valid simulation graphs with full audit trails.
Details
Motivation: Existing LLM-based agent systems fail to simulate how business scenarios reshape knowledge spaces, producing fluent but ungrounded decisions without audit trails. Enterprise AI needs trustworthy decision intelligence that accounts for scenario-specific constraints.Method: Event-driven ontology simulation where business events trigger scenario conditions in enterprise ontology, driving deterministic graph mutations in isolated sandboxes to create scenario-valid simulation graphs. Dual-mode architecture with skill mode and reasoning mode follows event→simulation→decision pipeline.
Result: Achieves 93.82% accuracy and 98.74% tool-chain F1, significantly outperforming frontier baselines Doubao-1.8 and DeepSeek-V3.2 (24-36% F1 despite 80% accuracy), exposing “illusive accuracy” phenomenon.
Conclusion: Ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence. Four-fold F1 advantage confirms the importance of scenario-specific simulation over unrestricted knowledge access.
Abstract: Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand – producing decisions that are fluent but ungrounded and carrying no audit trail. We present LOM-action, which equips enterprise AI with \emph{event-driven ontology simulation}: business events trigger scenario conditions encoded in the enterprise ontology~(EO), which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph $G_{\text{sim}}$; all decisions are derived exclusively from this evolved graph. The core pipeline is \emph{event $\to$ simulation $\to$ decision}, realized through a dual-mode architecture – \emph{skill mode} and \emph{reasoning mode}. Every decision produces a fully traceable audit log. LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines Doubao-1.8 and DeepSeek-V3.2, which reach only 24–36% F1 despite 80% accuracy – exposing the \emph{illusive accuracy} phenomenon. The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.
[297] Sustained Impact of Agentic Personalisation in Marketing: A Longitudinal Case Study
Olivier Jeunen, Eleanor Hanna, Schaun Wheeler
Main category: cs.AI
TL;DR: A longitudinal study comparing human-curated vs. autonomous agent CRM messaging strategies shows both approaches can sustain engagement lift, suggesting a symbiotic human-agent model for scalable personalization.
Details
Motivation: Traditional CRM relies on manual optimization of static rule-based messaging, but it's unclear how much human oversight is needed for adaptive autonomous systems to sustain performance over time.Method: 11-month longitudinal case study of a real-world consumer application comparing two periods: active human-curated phase followed by passive autonomous agent phase operating from a fixed component library.
Result: Human management generated highest relative engagement lift, but autonomous agents successfully sustained positive lift during passive period, preserving performance gains.
Conclusion: A symbiotic model where human intervention drives strategic initialization and discovery, while autonomous agents ensure scalable retention and preservation of performance gains.
Abstract: In consumer applications, Customer Relationship Management (CRM) has traditionally relied on the manual optimisation of static, rule-based messaging strategies. While adaptive and autonomous learning systems offer the promise of scalable personalisation, it remains unclear to what extent ``human-in-the-loop’’ oversight is required to sustain performance uplift over time. This paper presents a longitudinal case study analysing a real-world consumer application that leverages agentic infrastructure to personalise marketing messaging for a large-scale user base over an 11-month period. We compare two distinct periods: an active phase where marketers directly curated content, audiences, and strategies – followed immediately by a passive phase where agents operated autonomously from a fixed library of components. Our results demonstrate that whilst active human management generates the highest relative lift in engagement metrics, the autonomous agents successfully sustained a positive lift during the passive period. These findings suggest a symbiotic model where human intervention drives strategic initialisation and discovery, yet autonomous agents can ensure the scalable retention and preservation of performance gains.
[298] RAMP: Hybrid DRL for Online Learning of Numeric Action Models
Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
Main category: cs.AI
TL;DR: RAMP is an online strategy that combines reinforcement learning, action model learning, and planning to learn numeric planning action models through environment interactions rather than requiring expert traces.
Details
Motivation: Automated planning requires action models specifying preconditions and effects, but obtaining such models is difficult. Existing learning algorithms for numeric domains are offline and require expert traces as input, limiting their practical applicability.Method: RAMP simultaneously trains a Deep Reinforcement Learning policy, learns a numeric action model from past interactions, and uses that model to plan future actions. The components form a feedback loop: RL gathers data to refine the action model, while the planner generates plans to continue training RL. Developed Numeric PDDLGym framework to convert numeric planning problems to Gym environments.
Result: Experimental results on standard IPC numeric domains show RAMP significantly outperforms PPO (a well-known DRL algorithm) in terms of solvability and plan quality.
Conclusion: RAMP enables online learning of numeric planning action models through environment interactions, overcoming limitations of offline approaches requiring expert traces, and demonstrates superior performance compared to pure DRL methods.
Abstract: Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains are offline, requiring expert traces as input. We propose the Reinforcement learning, Action Model learning, and Planning (RAMP) strategy for learning numeric planning action models online via interactions with the environment. RAMP simultaneously trains a Deep Reinforcement Learning (DRL) policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop: the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy. To facilitate this integration of RL and numeric planning, we developed Numeric PDDLGym, an automated framework for converting numeric planning problems to Gym environments. Experimental results on standard IPC numeric domains show that RAMP significantly outperforms PPO, a well-known DRL algorithm, in terms of solvability and plan quality.
[299] Parameterized Complexity Of Representing Models Of MSO Formulas
Petr Kučera, Petr Martinek
Main category: cs.AI
TL;DR: Extension of Courcelle’s theorem showing MSO2 formulas with free variables can be represented by decision diagrams with size parameterized by treewidth/pathwidth, connecting parameterized complexity to knowledge representation.
Details
Motivation: Courcelle's theorem provides parameterized linear time algorithms for checking MSO2 properties on bounded treewidth graphs, but doesn't address representation of models with free variables. The paper aims to extend this to knowledge representation by showing such models can be compactly represented using decision diagrams.Method: Extends Courcelle’s theorem by proving that models of MSO2 formulas with free variables can be represented by sentential decision diagrams (SDDs) with size parameterized linearly by treewidth, and by ordered binary decision diagrams (OBDDs) with size parameterized linearly by pathwidth. Also provides lower bounds showing limitations of OBDD representations for treewidth-bounded graphs.
Result: Shows parameterized linear upper bounds on SDD size for treewidth and OBDD size for pathwidth. Also demonstrates there exists an MSO2 formula and treewidth-bounded graph class that doesn’t admit OBDDs with size parameterized by treewidth, establishing a separation between treewidth and pathwidth for OBDD representations.
Conclusion: The work connects Courcelle’s theorem to knowledge representation by showing compact decision diagram representations for MSO2 models, with different diagram types suitable for different width parameters (SDDs for treewidth, OBDDs for pathwidth).
Abstract: Monadic second order logic (MSO2) plays an important role in parameterized complexity due to the Courcelle’s theorem. This theorem states that the problem of checking if a given graph has a property specified by a given MSO2 formula can be solved by a parameterized linear time algorithm with respect to the treewidth of the graph and the size of the formula. We extend this result by showing that models of MSO2 formula with free variables can be represented with a decision diagram whose size is parameterized linear in the above mentioned parameter. In particular, we show a parameterized linear upper bound on the size of a sentential decision diagram (SDD) when treewidth is considered and a parameterized linear upper bound on the size of an ordered binary decision diagram (OBDD) when considering the pathwidth in the parameter. In addition, building on a lower bound on the size of OBDD by Razgon (2014), we show that there is an MSO2 formula and a class of graphs with bounded treewidth which do not admit an OBDD with the size parameterized by the treewidth. Our result offers a new perspective on the Courcelle’s theorem and connects it to the area of knowledge representation.
[300] Model Space Reasoning as Search in Feedback Space for Planning Domain Generation
James Oswald, Daniel Oblinsky, Volodymyr Varha, Vasilije Dragovic, Harsha Kokel, Kavitha Srinivas, Michael Katz, Shirin Sohrabi
Main category: cs.AI
TL;DR: Agentic LLM framework with symbolic feedback for generating planning domains from natural language descriptions
Details
Motivation: Current LLMs struggle to generate high-quality planning domains from natural language descriptions for practical deployment, despite their capabilities in domain generation assistanceMethod: Uses agentic language model feedback framework with symbolic information augmentation, evaluates domain quality under various symbolic feedback forms (landmarks, VAL plan validator output), and employs heuristic search over model space for optimization
Result: Not specified in abstract, but investigates ability of framework to generate planning domains from natural language with symbolic augmentation
Conclusion: Research investigates improved planning domain generation through LLM feedback mechanisms with symbolic information integration
Abstract: The generation of planning domains from natural language descriptions remains an open problem even with the advent of large language models and reasoning models. Recent work suggests that while LLMs have the ability to assist with domain generation, they are still far from producing high quality domains that can be deployed in practice. To this end, we investigate the ability of an agentic language model feedback framework to generate planning domains from natural language descriptions that have been augmented with a minimal amount of symbolic information. In particular, we evaluate the quality of the generated domains under various forms of symbolic feedback, including landmarks, and output from the VAL plan validator. Using these feedback mechanisms, we experiment using heuristic search over model space to optimize domain quality.
[301] Artifacts as Memory Beyond the Agent Boundary
John D. Martin, Fraser Mince, Esra’a Saleh, Amy Pajak
Main category: cs.AI
TL;DR: Formalizing how environment serves as external memory in RL, showing certain observations reduce needed internal memory
Details
Motivation: To formalize the situated cognition intuition that intelligent behavior depends on using environmental resources as memory, within Reinforcement Learning frameworkMethod: Introduce mathematical framing for environment as functional memory, define artifacts as observations that reduce history representation needs, prove theoretical results, and conduct experiments with spatial paths
Result: Agents observing spatial paths require less internal memory for performant policies; effect arises unintentionally through sensory stream; findings satisfy qualitative properties of external memory accounts
Conclusion: Environment can functionally serve as external memory in RL, suggesting principled ways to exploit environment as substitute for explicit internal memory
Abstract: The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent’s active use of environmental resources. Here, we begin formalizing this intuition within Reinforcement Learning (RL). We introduce a mathematical framing for how the environment can functionally serve as an agent’s memory, and prove that certain observations, which we call artifacts, can reduce the information needed to represent history. We corroborate our theory with experiments showing that when agents observe spatial paths, the amount of memory required to learn a performant policy is reduced. Interestingly, this effect arises unintentionally, and implicitly through the agent’s sensory stream. We discuss the implications of our findings, and show they satisfy qualitative properties previously used to ground accounts of external memory. Moving forward, we anticipate further work on this subject could reveal principled ways to exploit the environment as a substitute for explicit internal memory.
[302] Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations
Pengze Li, Jiaquan Zhang, Yunbo Long, Xinping Liu, Zhou wenjie, Encheng Su, Zihang Zeng, Jiaqi Liu, Jiyao Liu, Junchi Yu, Lihao Liu, Philip Torr, Shixiang Tang, Aoran Wang, Xi Chen
Main category: cs.AI
TL;DR: ViSA-R2 enables AI to infer analytical solutions of 2D linear steady-state physical fields from visual observations, outputting executable SymPy expressions with numeric constants.
Details
Motivation: Recovering analytical solutions from visual observations is fundamental for AI-assisted scientific reasoning but remains underexplored, especially for physical field analysis.Method: ViSA-R2 uses a self-verifying, solution-centric chain-of-thought pipeline that mimics physicist reasoning: structural pattern recognition → solution-family hypothesis → parameter derivation → consistency verification, built on an 8B Qwen3-VL backbone.
Result: ViSA-R2 outperforms strong open-source baselines and frontier closed-source VLMs on ViSA-Bench (30 linear steady-state scenarios) using numerical accuracy, expression-structure similarity, and character-level accuracy metrics.
Conclusion: The approach demonstrates effective visual-to-symbolic analytical solution inference for physical fields, with potential applications in scientific AI reasoning and analysis.
Abstract: Recovering analytical solutions of physical fields from visual observations is a fundamental yet underexplored capability for AI-assisted scientific reasoning. We study visual-to-symbolic analytical solution inference (ViSA) for two-dimensional linear steady-state fields: given field visualizations (and first-order derivatives) plus minimal auxiliary metadata, the model must output a single executable SymPy expression with fully instantiated numeric constants. We introduce ViSA-R2 and align it with a self-verifying, solution-centric chain-of-thought pipeline that follows a physicist-like pathway: structural pattern recognition solution-family (ansatz) hypothesis parameter derivation consistency verification. We also release ViSA-Bench, a VLM-ready synthetic benchmark covering 30 linear steady-state scenarios with verifiable analytical/symbolic annotations, and evaluate predictions by numerical accuracy, expression-structure similarity, and character-level accuracy. Using an 8B open-weight Qwen3-VL backbone, ViSA-R2 outperforms strong open-source baselines and the evaluated closed-source frontier VLMs under a standardized protocol.
[303] Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction
Nurullah Eymen Özdemir, Erhan Oztop
Main category: cs.AI
TL;DR: PETITE framework uses tutor-student multi-agent system from same LLM to improve coding problem-solving through structured role-based interactions without ground-truth supervision.
Details
Motivation: Inspired by human cognitive development through structured social interactions like tutor-learner relationships, the paper explores whether similar role-based multi-agent systems can create synergistic effects that push LLMs beyond existing frameworks.Method: Proposes PETITE framework with two agents from same LLM assigned asymmetric roles: student agent generates and refines coding solutions, tutor agent provides structured evaluative feedback without access to ground-truth answers, enabling synergistic problem-solving.
Result: Achieves similar or higher accuracy than state-of-the-art approaches (Self-Consistency, Self-Refine, Multi-Agent Debate, Multi-Agent Review) on APPS coding benchmark while consuming significantly fewer tokens.
Conclusion: Developmentally grounded role-differentiated interaction structures provide a principled and resource-efficient paradigm for enhancing LLM problem-solving through structured peer-like interactions.
Abstract: Human cognitive development is shaped not only by individual effort but by structured social interaction, where role-based exchanges such as those between a tutor and a learner, enable solutions that neither could achieve alone. Inspired by these developmental principles, we ask the question whether a tutor-student multi-agent system can create a synergistic effect by pushing Large Language Model (LLM) beyond what it can do within existing frameworks. To test the idea, we adopt autonomous coding problem domain where two agents instantiated from the same LLM assigned asymmetric roles: a student agent generates and iteratively refines solutions, while a tutor agent provides structured evaluative feedback without access to ground-truth answers. In our proposed framework (PETITE), we aim to extract better problem-solving performance from one model by structuring its interaction through complementary roles, rather than relying on stronger supervisory models or heterogeneous ensembles. Our model is evaluated on the APPS coding benchmark against state-of-the-art approaches of Self-Consistency, Self-Refine, Multi-Agent Debate, and Multi-Agent Review. The results show that our model achieves similar or higher accuracy while consuming significantly fewer tokens. These results suggest that developmentally grounded role-differentiated interaction structures provide a principled and resource-efficient paradigm for enhancing LLM problem-solving through structured peer-like interactions. Index Terms- Peer Tutoring, Scaffolding, Large Language Models, Multi-Agent Systems, Code Generation
[304] SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
Main category: cs.AI
TL;DR: SPPO is a new RL algorithm for aligning reasoning LLMs that reformulates reasoning as a sequence-level contextual bandit problem, using decoupled scalar value functions to avoid multi-sampling while maintaining sample efficiency.
Details
Motivation: Standard token-level PPO struggles with aligning LLMs in reasoning tasks due to unstable temporal credit assignment over long Chain-of-Thought horizons and prohibitive memory costs of value models. Critic-free alternatives like GRPO require multiple samples for baseline estimation, causing significant computational overhead and limiting training throughput.Method: SPPO reformulates reasoning as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without requiring multiple samples. This harmonizes PPO’s sample efficiency with outcome-based update stability.
Result: Extensive experiments on mathematical benchmarks show SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
Conclusion: SPPO provides a scalable algorithm that addresses the limitations of both token-level PPO and critic-free alternatives for aligning reasoning LLMs, balancing sample efficiency with computational practicality.
Abstract: Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
[305] StaRPO: Stability-Augmented Reinforcement Policy Optimization
Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Ruimin Dai, Xiaoyan Han, Yanjie Fu, Dakuo Wang, Kunpeng Liu
Main category: cs.AI
TL;DR: StaRPO is a reinforcement learning framework that incorporates reasoning stability metrics (Autocorrelation Function and Path Efficiency) to improve logical consistency in large language model reasoning tasks.
Details
Motivation: Existing RL frameworks for language models focus only on final-answer correctness, ignoring the internal logical structure of reasoning processes, leading to models that generate fluent but logically inconsistent, erratic, or redundant responses.Method: Proposes StaRPO framework that decomposes reasoning stability into two computable metrics: Autocorrelation Function (ACF) for local step-to-step coherence, and Path Efficiency (PE) for global goal-directedness. These stability rewards are combined with task rewards for process-aware feedback.
Result: Experiments on four reasoning benchmarks show StaRPO consistently outperforms baselines, enhancing both final-answer accuracy and logical stability. Validation shows ACF and PE rewards correlate with logic errors on two backbone models.
Conclusion: StaRPO effectively incorporates reasoning stability into RL optimization, providing complementary process-aware feedback that improves both accuracy and logical consistency in language model reasoning tasks.
Abstract: Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, the models would generate fluent and semantically relevant responses but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. Our StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms compared baselines and can enhance both final-answer accuracy and logical stability.
[306] PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
Yalun Wu, Haotian Liu, Zhoujun Li, Boyang Wang
Main category: cs.AI
TL;DR: PilotBench is a benchmark for evaluating LLMs on safety-critical flight trajectory and attitude prediction, revealing a precision-controllability tradeoff between traditional forecasters and LLMs.
Details
Motivation: To assess whether LLMs trained on text can reliably reason about complex physics while adhering to safety constraints, particularly for embodied AI agents in physical environments like aviation.Method: Created PilotBench using 708 real-world general aviation trajectories across 9 flight phases with 34-channel telemetry. Introduced Pilot-Score metric (60% regression accuracy, 40% instruction adherence/safety compliance). Evaluated 41 models including LLMs and traditional forecasters.
Result: Traditional forecasters achieved better MAE (7.01) but lacked semantic reasoning, while LLMs had 86-89% instruction-following but worse MAE (11-14). LLM performance degraded in high-workload phases like Climb and Approach, revealing brittle implicit physics models.
Conclusion: Reveals a Precision-Controllability Dichotomy, suggesting hybrid architectures combining LLMs’ symbolic reasoning with specialized forecasters’ numerical precision for safety-constrained embodied AI.
Abstract: As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86–89% instruction-following at the cost of 11–14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap-LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs’ symbolic reasoning with specialized forecasters’ numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.
[307] SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Zhou, Jiaqing Liang, Yanghua Xiao
Main category: cs.AI
TL;DR: This paper introduces SEA-Eval, the first benchmark for evaluating Self-Evolving Agents (SEA) across intra-task reliability and long-term evolutionary performance, revealing significant evolutionary bottlenecks in current LLM-based agents despite similar success rates.
Details
Motivation: Current LLM-based agents are limited by static toolsets and episodic amnesia, unable to accumulate experience or optimize strategies across task boundaries. While SEA paradigms exist, there's no formal definition or benchmark to evaluate continuous cross-task evolution.Method: The paper provides a formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval benchmark that organizes tasks into sequential streams, analyzing Success Rate and Token Consumption over time to quantify evolutionary gain and structural stability.
Result: Empirical evaluations reveal significant evolutionary bottlenecks in current state-of-the-art frameworks, where identical success rates mask up to 31.2 times differences in token consumption and divergent evolutionary trajectories under sequential analysis.
Conclusion: SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities, addressing limitations of existing episodic benchmarks.
Abstract: Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience or optimize strategies across task boundaries. While the Self-Evolving Agent (SEA) paradigm has been previously proposed, this paper contributes a new formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval, the first benchmark designed to evaluate SEA characteristics across two dimensions, intra-task execution reliability and long-term evolutionary performance. By organizing tasks into sequential streams and analyzing Success Rate and Token Consumption over time, SEA-Eval quantifies evolutionary gain and structural stability in ways that existing episodic benchmarks cannot. Empirical evaluations reveal a significant evolutionary bottleneck in current state-of-the-art frameworks, where identical success rates mask up to 31.2 times differences in token consumption and divergent evolutionary trajectories under sequential analysis. SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities.
[308] Hypergraph Neural Networks Accelerate MUS Enumeration
Hiroya Ijima, Koichiro Yawata
Main category: cs.AI
TL;DR: HGNN-based reinforcement learning method accelerates MUS enumeration by reducing expensive satisfiability checks through learned constraint selection
Details
Motivation: MUS enumeration faces exponential search space challenges, especially when satisfiability checks are computationally expensive. Existing ML approaches are limited to Boolean domains and rely on explicit variable-constraint relationships.Method: Uses Hypergraph Neural Networks (HGNNs) to incrementally build a hypergraph with constraints as vertices and enumerated MUSes as hyperedges. Trains an HGNN-based agent via reinforcement learning to minimize satisfiability checks needed to find MUSes.
Result: Method effectively accelerates MUS enumeration, enumerating more MUSes within the same satisfiability check budget compared to conventional methods.
Conclusion: Proposed domain-agnostic HGNN approach successfully reduces computational cost of MUS enumeration while being applicable across various constraint satisfaction domains.
Abstract: Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.
[309] Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Daniele Foffano, Arvid Eriksson, David Broman, Karl H. Johansson, Alexandre Proutiere
Main category: cs.AI
TL;DR: AGD-MBRL introduces advantage-guided diffusion for model-based RL to address compounding errors and short-horizon myopia by steering diffusion sampling toward high-advantage trajectories.
Details
Motivation: Autoregressive world models in MBRL suffer from compounding errors, while diffusion models can mitigate this but existing guides are either policy-only (discarding value information) or reward-based (myopic with short horizons). Need guidance that considers long-term advantage beyond generated windows.Method: Introduces Advantage-Guided Diffusion for MBRL (AGD-MBRL) with two guides: Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG). Guides reverse diffusion process using agent’s advantage estimates to concentrate sampling on trajectories with higher long-term return. Integrates with PolyGRAD-style architectures by guiding state components while keeping action generation policy-conditioned.
Result: On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D, Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), sometimes by 2x margin.
Conclusion: Advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL, enabling policy improvement through reweighted sampling of high-advantage trajectories.
Abstract: Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent’s advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage-implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
[310] Strategic Algorithmic Monoculture:Experimental Evidence from Coordination Games
Gonzalo Ballestero, Hadi Hosseini, Samarth Khanna, Ran I. Shorrer
Main category: cs.AI
TL;DR: LLMs exhibit both primary algorithmic monoculture (baseline action similarity) and strategic algorithmic monoculture (adjusting similarity in response to coordination incentives), showing strong coordination on similar actions but lagging behind humans in maintaining heterogeneity when divergence is beneficial.
Details
Motivation: To understand how AI agents coordinate in multi-agent environments, distinguishing between baseline action similarity (primary monoculture) and strategic adjustment of similarity in response to incentives (strategic monoculture), and comparing human and LLM behavior in these contexts.Method: Implemented a simple experimental design that cleanly separates primary and strategic algorithmic monoculture forces, deploying it on both human subjects and large language model (LLM) subjects to compare their coordination behaviors.
Result: LLMs show high levels of baseline similarity (primary monoculture) and, like humans, regulate similarity in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.
Conclusion: LLMs exhibit both forms of algorithmic monoculture, demonstrating strong coordination capabilities but limitations in maintaining diverse strategies when beneficial, highlighting important differences between human and AI multi-agent coordination.
Abstract: AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture – baseline action similarity – from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.
[311] Overhang Tower: Resource-Rational Adaptation in Sequential Physical Planning
Ruihong Shen, Shiqian Li, Yixin Zhu
Main category: cs.AI
TL;DR: Humans adapt both physical prediction mechanisms (from simulation to heuristics) and planning strategies (from deliberative to myopic) based on cognitive resource constraints in sequential physical planning tasks.
Details
Motivation: To understand how humans combine physical prediction mechanisms (Intuitive Physics Engine vs. cue-based heuristics) with planning strategies (deliberative lookahead vs. myopic) under cognitive resource constraints, bridging two separate research debates.Method: Used the Overhang Tower construction task where participants maximize horizontal overhang while maintaining stability. Manipulated cognitive resources through time pressure and task complexity to observe transitions in prediction mechanisms and planning strategies.
Result: Found a dual transition: IPE-based simulation dominates early stages while CNN-based visual heuristics prevail as complexity grows; time pressure truncates deliberative lookahead, shifting planning toward shallower horizons.
Conclusion: Reveals a hierarchical, resource-rational architecture that flexibly trades computational cost against predictive fidelity, unifying simulation vs. heuristics and myopic vs. deliberative planning debates as a dynamic repertoire reconfigured by cognitive budget.
Abstract: Humans effortlessly navigate the physical world by predicting how objects behave under gravity and contact forces, yet how such judgments support sequential physical planning under resource constraints remains poorly understood. Research on intuitive physics debates whether prediction relies on the Intuitive Physics Engine (IPE) or fast, cue-based heuristics; separately, decision-making research debates deliberative lookahead versus myopic strategies. These debates have proceeded in isolation, leaving the cognitive architecture of sequential physical planning underspecified. How physical prediction mechanisms and planning strategies jointly adapt under limited cognitive resources remains an open question. Here we show that humans exhibit a dual transition under resource pressure, simultaneously shifting both physical prediction mechanism and planning strategy to match cognitive budget. Using Overhang Tower, a construction task requiring participants to maximize horizontal overhang while maintaining stability, we find that IPE-based simulation dominates early stages while CNN-based visual heuristics prevail as complexity grows; concurrently, time pressure truncates deliberative lookahead, shifting planning toward shallower horizons: a dual transition unpredicted by prior single-mechanism accounts. These findings reveal a hierarchical, resource-rational architecture that flexibly trades computational cost against predictive fidelity. Our results unify two long-standing debates (simulation vs. heuristics and myopic vs. deliberative planning) as a dynamic repertoire reconfigured by cognitive budget.
[312] Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation
Haobo Hu, Qi Mao, Yuanhang Li, Libiao Jin
Main category: cs.AI
TL;DR: Camera Artist is a multi-agent framework that generates narrative videos with explicit cinematic language by modeling real-world filmmaking workflows, improving narrative continuity and film quality.
Details
Motivation: Existing multi-agent systems for automated filmmaking often lack mechanisms for structuring narrative progression across shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality.Method: Builds upon existing agentic pipelines and introduces a dedicated Cinematography Shot Agent that integrates recursive storyboard generation for shot-to-shot narrative continuity and cinematic language injection for expressive, film-oriented shot designs.
Result: Extensive quantitative and qualitative results show the approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.
Conclusion: Camera Artist successfully addresses limitations in automated filmmaking by incorporating explicit cinematic language and narrative continuity mechanisms through a specialized cinematography agent.
Abstract: We propose Camera Artist, a multi-agent framework that models a real-world filmmaking workflow to generate narrative videos with explicit cinematic language. While recent multi-agent systems have made substantial progress in automating filmmaking workflows from scripts to videos, they often lack explicit mechanisms to structure narrative progression across adjacent shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality. To address this, Camera Artist builds upon established agentic pipelines and introduces a dedicated Cinematography Shot Agent, which integrates recursive storyboard generation to strengthen shot-to-shot narrative continuity and cinematic language injection to produce more expressive, film-oriented shot designs. Extensive quantitative and qualitative results demonstrate that our approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.
[313] DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
Young-Suk Lee, Ramon Fernandez Astudillo, Radu Florian
Main category: cs.AI
TL;DR: DRBENCHER is a synthetic benchmark generator for evaluating AI agents on tasks requiring both web browsing and computational reasoning, addressing gaps in existing isolated evaluations.
Details
Motivation: Existing benchmarks evaluate web browsing and computational capabilities separately, creating a blind spot for assessing real-world performance where agents must interleave both capabilities. There's a need for benchmarks that test integrated browsing and computation.Method: DRBENCHER generates synthetic benchmarks using an answer-first pipeline with four criteria: verifiability (gold answers computed via parameterized code over knowledge graphs), complexity (multi-hop entity identification, property retrieval, domain computation), difficulty (two-stage verification cascade), and diversity (greedy max-min embedding filter). It spans five domains: biochemistry, financial, geophysical, security, and history.
Result: Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries. The strongest frontier model achieves only 20% answer accuracy. DRBENCHER achieves higher semantic diversity than manually constructed benchmarks like BrowseComp+, MATH-500, and GPQA.
Conclusion: DRBENCHER addresses a critical gap in evaluating integrated browsing and computation capabilities, revealing limitations in current AI systems and highlighting challenges with reasoning over evolving data. The benchmark’s synthetic generation approach enables comprehensive evaluation of real-world agent performance.
Abstract: Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
[314] SAGE: A Service Agent Graph-guided Evaluation Benchmark
Ling Shi, Yuqin Dai, Ziyin Wang, Ning Gao, Wei Zhang, Chaozheng Wang, Yujie Wang, Wei He, Jinpeng Wang, Deiyi Xiong
Main category: cs.AI
TL;DR: SAGE is a multi-agent benchmark for evaluating LLMs in customer service by formalizing SOPs into dialogue graphs and using adversarial testing to assess logical compliance and empathy resilience.
Details
Motivation: Existing benchmarks for LLMs in customer service are inadequate because they use static paradigms and single-dimensional metrics that don't account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures required in real-world deployments.Method: Proposes SAGE (Service Agent Graph-guided Evaluation) which formalizes unstructured SOPs into Dynamic Dialogue Graphs, introduces an Adversarial Intent Taxonomy and modular Extension Mechanism, and uses a framework with Judge Agents and a Rule Engine to analyze interactions between User and Service Agents.
Result: Experiments on 27 LLMs across 6 industrial scenarios reveal a significant “Execution Gap” where models accurately classify intents but fail to derive correct subsequent actions, and “Empathy Resilience” where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity.
Conclusion: SAGE provides a comprehensive benchmark for evaluating LLMs in customer service applications, highlighting important gaps in model performance that existing benchmarks miss, particularly around logical compliance and the disconnect between intent understanding and action execution.
Abstract: The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe Empathy Resilience’’, a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
[315] Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents
Maochen Sun, Youzhi Zhang, Gaofeng Meng
Main category: cs.AI
TL;DR: CACM is a constraint-aware corrective memory framework for language-based drug discovery agents that improves protocol-level success through precise set-level diagnosis and concise memory management.
Details
Motivation: Current language-based drug discovery systems face a fundamental control problem: agents plan step-by-step while task validity is determined at the whole candidate set level, leading to imprecise failure localization and noisy planner states.Method: CACM introduces protocol auditing and a grounded diagnostician that analyze multimodal evidence (task requirements, pocket context, candidate-set evidence) to localize protocol violations and generate remediation hints. It organizes memory into static, dynamic, and corrective channels with compression before write-back.
Result: CACM improves target-level success rate by 36.4% over state-of-the-art baselines, demonstrating that precise diagnosis and economical agent states are crucial for reliable language-based drug discovery.
Conclusion: Reliable language-based drug discovery benefits from more precise diagnosis and more economical agent states, not just more powerful molecular tools. CACM’s constraint-aware approach addresses the fundamental control problem in autonomous drug discovery.
Abstract: Large language models are making autonomous drug discovery agents increasingly feasible, but reliable success in this setting is not determined by any single action or molecule. It is determined by whether the final returned set jointly satisfies protocol-level requirements such as set size, diversity, binding quality, and developability. This creates a fundamental control problem: the agent plans step by step, while task validity is decided at the level of the whole candidate set. Existing language-based drug discovery systems therefore tend to rely on long raw history and under-specified self-reflection, making failure localization imprecise and planner-facing agent states increasingly noisy. We present CACM (Constraint-Aware Corrective Memory), a language-based drug discovery framework built around precise set-level diagnosis and a concise memory write-back mechanism. CACM introduces protocol auditing and a grounded diagnostician, which jointly analyze multimodal evidence spanning task requirements, pocket context, and candidate-set evidence to localize protocol violations, generate actionable remediation hints, and bias the next action toward the most relevant correction. To keep planning context compact, CACM organizes memory into static, dynamic, and corrective channels and compresses them before write-back, thereby preserving persistent task information while exposing only the most decision-relevant failures. Our experimental results show that CACM improves the target-level success rate by 36.4% over the state-of-the-art baseline. The results show that reliable language-based drug discovery benefits not only from more powerful molecular tools, but also from more precise diagnosis and more economical agent states.
[316] Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym
Lars Benedikt Kaesberg, Tianyu Yang, Niklas Bauer, Terry Ruas, Jan Philip Wahle, Bela Gipp
Main category: cs.AI
TL;DR: Spatial-Gym: A benchmark for evaluating spatial reasoning in LLMs through 2D-grid pathfinding puzzles, showing models significantly underperform humans and struggle with scaling reasoning effort with difficulty.
Details
Motivation: Existing benchmarks evaluate spatial reasoning in one-shot settings, unlike humans who work interactively. There's a need for better evaluation frameworks that isolate spatial constraint reasoning and enable diagnosis of model limitations.Method: Introduces Spatial-Gym, a Gymnasium environment testing pathfinding in 2D-grid puzzles as sequential decision tasks with optional backtracking. Evaluates 8 models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes.
Result: Best model (GPT-OSS 120B) achieves only 16.0% solve rate, 82 points below human baseline (98.0%). Step-by-step helps weaker models but hurts stronger ones. Backtracking improves completion but not solve rates for strong models. Key findings: models fail to scale reasoning with difficulty, vision models reduce solve rate by 73%, and chain-of-thought retains 3-5x accuracy advantage.
Conclusion: Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning, revealing significant gaps between current models and human spatial reasoning capabilities.
Abstract: Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
[317] HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
Main category: cs.AI
TL;DR: HiL-Bench is a benchmark for evaluating AI agents’ ability to recognize when they need human help for incomplete/ambiguous tasks, measuring selective escalation skills rather than just execution correctness.
Details
Motivation: Current coding agents fail when specifications are incomplete or ambiguous, not due to lack of capability but poor judgment about when to ask for help. Existing benchmarks don't capture this failure mode because they provide unambiguous instructions and only reward execution correctness.Method: Developed HiL-Bench with human-validated blockers (missing info, ambiguous requests, contradictions) that only surface through progressive exploration. Introduced Ask-F1 metric (harmonic mean of question precision and blocker recall) to measure selective escalation while preventing gaming through question spam.
Result: Evaluation shows large universal judgment gap: frontier models recover only a fraction of full-information performance when deciding whether to ask. Identified three failure patterns: overconfident wrong beliefs, high uncertainty with persistent errors, and imprecise escalation. RL training on Ask-F1 reward improved both help-seeking quality and task pass rate with cross-domain transfer.
Conclusion: Poor help-seeking is a model-level flaw, not task-specific. Judgment about when to ask for help is trainable through RL on appropriate metrics, enabling models to learn to detect unresolvable uncertainty rather than domain-specific heuristics.
Abstract: Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
[318] Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?
Chao Jiang, Jingyu Huang, Miqing Li
Main category: cs.AI
TL;DR: SPMO is a single point-based multi-objective Bayesian optimization framework that focuses on finding one high-quality solution rather than approximating the entire Pareto front, using a novel ESPI acquisition function.
Details
Motivation: In many-objective optimization with limited evaluation budgets, approximating the entire Pareto front becomes infeasible. Since decision-makers ultimately select only one solution for deployment, it's more practical to focus on finding a single high-quality solution rather than exploring the entire front.Method: Proposes SPMO framework with ESPI (Expected Single-Point Improvement) acquisition function that improves solution quality along directions leading to good tradeoffs between objectives. Uses gradient-based optimization via Sample Average Approximation (SAA) approach for effective optimization.
Result: SPMO is computationally tractable and outperforms state-of-the-art methods on a wide range of benchmark and real-world problems. Theoretical convergence guarantees are proven under SAA.
Conclusion: For many-objective optimization with limited evaluation budgets, focusing on finding a single high-quality solution is more effective than attempting to approximate the entire Pareto front. The SPMO framework provides a practical approach for this scenario.
Abstract: Many-objective optimisation, a subset of multi-objective optimisation, involves optimisation problems with more than three objectives. As the number of objectives increases, the number of solutions needed to adequately represent the entire Pareto front typically grows substantially. This makes it challenging, if not infeasible, to design a search algorithm capable of effectively exploring the entire Pareto front. This difficulty is particularly acute in the Bayesian optimisation paradigm, where sample efficiency is critical and only a limited number of solutions (often a few hundred) are evaluated. Moreover, after the optimisation process, the decision-maker eventually selects just one solution for deployment, regardless of how many high-quality, diverse solutions are available. In light of this, we argue an idea that under a very limited evaluation budget, it may be more useful to focus on finding a single solution of the highest possible quality for the decision-maker, rather than aiming to approximate the entire Pareto front as existing many-/multi-objective Bayesian optimisation methods typically do. Bearing this idea in mind, this paper proposes a \underline{s}ingle \underline{p}oint-based \underline{m}ulti-\underline{o}bjective search framework (SPMO) that aims to improve the quality of solutions along a direction that leads to a good tradeoff between objectives. Within SPMO, we present a simple acquisition function, called expected single-point improvement (ESPI), working under both noiseless and noisy scenarios. We show that ESPI can be optimised effectively with gradient-based methods via the sample average approximation (SAA) approach and theoretically prove its convergence guarantees under the SAA. We also empirically demonstrate that the proposed SPMO is computationally tractable and outperforms state-of-the-arts on a wide range of benchmark and real-world problems.
[319] Bayesian Social Deduction with Graph-Informed Language Models
Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, Joseph Campbell
Main category: cs.AI
TL;DR: A hybrid reasoning framework combining LLMs with structured probabilistic models achieves competitive performance in social deduction games, defeating human players for the first time.
Details
Motivation: Current LLMs struggle with social reasoning tasks like inferring unobservable beliefs and intentions from partial observations, especially when distilled to smaller real-time-capable variants.Method: Introduces a hybrid framework that externalizes belief inference to a structured probabilistic model while using LLMs for language understanding and interaction in the Avalon social deduction game.
Result: Achieves 67% win rate against human players in controlled study, outperforming both reasoning baselines and human teammates, with competitive performance against larger models.
Conclusion: Hybrid reasoning combining LLMs with structured probabilistic models enables effective social reasoning in real-time agents, representing a significant advance in language agent capabilities.
Abstract: Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/
[320] E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
Weiyang Guo, Zesheng Shi, Liye Zhao, Jiayuan Ma, Zeen Zhu, Junxian He, Min Zhang, Jing Li
Main category: cs.AI
TL;DR: E3-TIR is a warm-up training paradigm for tool-integrated reasoning agents that combines expert guidance with self-exploration to improve training efficiency and performance.
Details
Motivation: Existing training paradigms for tool-integrated reasoning in LLMs have limitations: Zero-RL suffers from inefficient exploration and mode degradation, while SFT-then-RL faces high data costs and capability plateaus due to low-entropy collapse.Method: Proposes E3-TIR that formulates training as dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. Uses diverse branching exploration around expert “anchors” with mix policy optimization to mitigate distribution shifts and resolve optimization conflicts.
Result: Achieves 6% performance improvement over traditional paradigms on tool-use tasks while requiring less than 10% of synthetic data. Achieves 1.46x gain in ROI (performance, data cost, training efficiency) compared to baselines.
Conclusion: E3-TIR effectively balances exploration diversity with training efficiency by dynamically adapting the model’s knowledge boundaries, making it a superior warm-up paradigm for early-stage agent training.
Abstract: While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. By executing diverse branching exploration around expert “anchors” and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model’s knowledge boundaries, effectively balancing exploration diversity with training efficiency.Experimental results demonstrate that E3-TIR achieves a 6 performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10 of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki-younai/E3-TIR.
[321] When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li
Main category: cs.AI
TL;DR: This paper addresses identity bias in multi-agent debate (MAD) systems, where LLM agents exhibit sycophancy (uncritically adopting peers’ views) or self-bias (stubbornly adhering to their own outputs), compromising debate reliability.
Details
Motivation: Recent studies reveal that LLM agents in multi-agent debate systems are not neutral - they suffer from identity-driven sycophancy and self-bias, which undermines the reliability and trustworthiness of debate outcomes.Method: 1) Formalize debate dynamics as identity-weighted Bayesian update process; 2) Propose response anonymization by removing identity markers from prompts to force equal weights on agent identity; 3) Define Identity Bias Coefficient (IBC) to measure agents’ tendency to follow peers vs themselves.
Result: Empirical studies across multiple models and benchmarks confirm that identity bias is widespread, with sycophancy being far more common than self-bias. Response anonymization effectively reduces bias and improves trustworthiness.
Conclusion: The work highlights the need to ensure MAD systems reason based on content rather than identity, providing a principled framework to mitigate and quantify identity bias in multi-agent debate settings.
Abstract: Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer’s view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish “self” from “peer”, which forces equal weights on agent identity, thereby reducing bias and improving trustworthiness. Third, we define the Identity Bias Coefficient (IBC), a principled bias metric that measures an agent’s tendency to follow its peer versus itself. Empirical studies across multiple models and benchmarks confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to ensure that MAD systems reason based on content rather than identity. Code is released in https://github.com/deeplearning-wisc/MAD-identity-bias.
[322] Process Reward Agents for Steering Knowledge-Intensive Reasoning
Jiwoong Sohn, Tomasz Sternal, Kenneth Styppa, Torsten Hoefler, Michael Moor
Main category: cs.AI
TL;DR: PRA introduces test-time process reward agents that provide domain-grounded, online step-wise rewards to frozen language models for improved reasoning in knowledge-intensive domains.
Details
Motivation: Reasoning in knowledge-intensive domains is challenging because intermediate steps aren't locally verifiable, requiring synthesis across large external knowledge sources. Current process reward models operate post-hoc and can't be integrated into dynamic inference procedures.Method: Process Reward Agents (PRA) provide domain-grounded, online, step-wise rewards to frozen policies. Unlike retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step during inference.
Result: PRA achieves 80.8% accuracy on MedQA with Qwen3-4B (new SOTA at 4B scale), generalizes to unseen frozen models from 0.5B to 8B parameters, and improves accuracy by up to 25.7% without policy updates.
Conclusion: PRA demonstrates a paradigm where frozen reasoners are decoupled from domain-specific reward modules, enabling deployment of new backbones in complex domains without retraining.
Abstract: Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.
[323] Memory Intelligence Agent
Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie
Main category: cs.AI
TL;DR: MIA is a Memory Intelligence Agent framework with Manager-Planner-Executor architecture for deep research agents, featuring parametric/non-parametric memory systems, test-time learning, and bidirectional memory conversion for efficient reasoning and evolution.
Details
Motivation: Existing deep research agents with memory systems suffer from ineffective memory evolution and increasing storage/retrieval costs when retrieving similar trajectories, limiting their reasoning efficiency and autonomous evolution capabilities.Method: Proposes MIA framework with three components: 1) Memory Manager (non-parametric memory storing compressed trajectories), 2) Planner (parametric memory agent producing search plans), 3) Executor (agent searching/analyzing guided by plans). Uses alternating reinforcement learning for Planner-Executor cooperation, test-time learning for continuous evolution, bidirectional parametric/non-parametric memory conversion, and reflection/unsupervised judgment mechanisms.
Result: Extensive experiments across eleven benchmarks demonstrate the superiority of MIA over existing methods.
Conclusion: MIA effectively addresses memory evolution and efficiency issues in deep research agents through its novel architecture and learning mechanisms, enabling better reasoning and autonomous evolution capabilities.
Abstract: Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate a reflection and an unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.
[324] SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Main category: cs.AI
TL;DR: SUPERNOVA is a data curation framework for Reinforcement Learning with Verifiable Rewards (RLVR) that enhances general reasoning in LLMs by systematically adapting instruction-tuning datasets with expert annotations.
Details
Motivation: While RLVR has improved LLM reasoning in formal domains like math and code, LLMs still struggle with general reasoning tasks requiring causal inference and temporal understanding. The key limitation is the lack of high-quality, verifiable training data spanning diverse reasoning skills.Method: Proposes SUPERNOVA framework that leverages instruction-tuning datasets with expert-annotated ground-truth to extract rich reasoning patterns for RLVR. Conducts 100+ controlled RL experiments to analyze three key factors: (1) source task selection, (2) task mixing strategies, and (3) synthetic interventions for improving data quality.
Result: Source task selection significantly impacts downstream reasoning performance, with task-specific selection outperforming overall average strategies. Models trained on SUPERNOVA outperform strong baselines (Qwen3.5) on challenging reasoning benchmarks (BBEH, Zebralogic, MMLU-Pro), achieving up to 52.8% relative improvement on BBEH.
Conclusion: SUPERNOVA demonstrates effective principled data curation for RLVR, providing practical insights for extending RLVR to general reasoning using human-annotated resources. The framework shows systematic data design choices can significantly enhance LLM reasoning capabilities.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data is available at https://github.com/asuvarna31/supernova.
[325] Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
Xiaojie Xu, Zongyuan Li, Chang Lu, Runnan Qi, Yanan Ni, Lumin Jiang, Xiangbei Liu, Xuebo Zhang, Yongchun Fang, Kuihua Huang, Xian Guo, Zhanghua Wu, Zhenya Li
Main category: cs.AI
TL;DR: A framework called Reflection of Episodes (ROE) that uses LLMs with expert experience and self-reflection to play StarCraft II, beating Very Hard difficulty bots.
Details
Motivation: StarCraft II is a complex real-time strategy environment suitable for AI research, but LLMs struggle to learn in such complex environments through self-reflection alone. The paper aims to address this limitation by combining expert experience with self-experience reflection.Method: ROE framework: 1) Keyframe selection to extract important game information, 2) Decision-making using both expert experience and self-experience, 3) Post-game reflection to generate new self-experience from gameplay episodes.
Result: The method successfully beat the Very Hard difficulty bot in TextStarCraft II. Detailed analysis of LLM gameplay data verified the effectiveness of the approach.
Conclusion: The ROE framework effectively enables LLMs to learn in complex environments like StarCraft II through a combination of expert guidance and self-reflection, demonstrating improved performance over standard approaches.
Abstract: StarCraft II is a complex and dynamic real-time strategy (RTS) game environment, which is very suitable for artificial intelligence and reinforcement learning research. To address the problem of Large Language Model(LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes(ROE) framework based on expert experience and self-experience. This framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. Finally, in the experiment, our method beat the robot under the Very Hard difficulty in TextStarCraft II. We analyze the data of the LLM in the process of the game in detail, verified its effectiveness.
[326] ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning
Zhirong Chen, Kaiyan Chang, Zhuolin Li, Cangyuan Li, Xinyang He, Chujie Chen, Mengdi Wang, Haobo Xu, Yinhe Han, Huawei Li, Ying Wang
Main category: cs.AI
TL;DR: ChipSeek: Hierarchical RL framework for LLMs to generate functionally correct and hardware-optimized RTL code using EDA tool feedback and curriculum-guided policy optimization.
Details
Motivation: Current LLM approaches for RTL code generation fail to simultaneously optimize functional correctness and hardware efficiency metrics (Power, Performance, Area). Supervised fine-tuning produces correct but suboptimal designs, while post-processing techniques are inefficient and don't improve LLMs' intrinsic capabilities.Method: Proposes ChipSeek: hierarchical reward-based RL framework integrating direct feedback from EDA simulators and synthesis tools. Uses Curriculum-Guided Dynamic Policy Optimization (CDPO) to enhance LLMs’ ability to generate optimized RTL code by learning hardware design trade-offs.
Result: Achieves state-of-the-art functional correctness and PPA performance on standard benchmarks. Excels in specific optimization tasks, consistently yielding highly efficient designs for fine-grained optimization goals like power, delay, and area.
Conclusion: ChipSeek successfully overcomes limitations of existing approaches by enabling LLMs to generate both functionally correct and hardware-optimized RTL code through hierarchical RL with direct EDA tool feedback.
Abstract: Large Language Models have emerged as powerful tools for automating Register-Transfer Level (RTL) code generation, yet they face critical limitations: existing approaches typically fail to simultaneously optimize functional correctness and hardware efficiency metrics such as Power, Performance, and Area (PPA). Methods relying on supervised fine-tuning commonly produce functionally correct but suboptimal designs due to the lack of inherent mechanisms for learning hardware optimization principles. Conversely, external post-processing techniques aiming to refine PPA performance after generation often suffer from inefficiency and do not improve the LLMs’ intrinsic capabilities. To overcome these challenges, we propose ChipSeek, a novel hierarchical reward based reinforcement learning framework designed to encourage LLMs to generate RTL code that is both functionally correct and optimized for PPA metrics. Our approach integrates direct feedback from EDA simulators and synthesis tools into a hierarchical reward mechanism, facilitating a nuanced understanding of hardware design trade-offs. Through Curriculum-Guided Dynamic Policy Optimization (CDPO), ChipSeek enhances the LLM’s ability to generate high-quality, optimized RTL code. Evaluations on standard benchmarks demonstrate ChipSeek’s superior performance, achieving state-of-the-art functional correctness and PPA performance. Furthermore, it excels in specific optimization tasks, consistently yielding highly efficient designs when individually targeting fine-grained optimization goals such as power, delay, and area. The artifact is open-source in https://github.com/rong-hash/chipseek.
[327] Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty
Rui Wang, Qihan Lin, Jiayu Liu, Qing Zong, Tianshi Zheng, Dadi Guo, Haochen Shi, Weiqi Wang, Yangqiu Song
Main category: cs.AI
TL;DR: LLMs’ decision-making under linguistic uncertainty doesn’t consistently fit Prospect Theory, and PT parameters aren’t robust to epistemic marker perturbations.
Details
Motivation: To investigate whether Prospect Theory (PT) properly models LLM decision-making under linguistic uncertainty, and test PT's robustness to epistemic markers like "likely" that create ambiguity.Method: Three-stage workflow: 1) Estimate PT parameters using economics questions and evaluate fitness metrics, 2) Derive probability mappings for epistemic markers in same context, 3) Inject these mappings into prompts to test PT parameter stability.
Result: PT doesn’t consistently model LLM decision-making across different models, and applying Prospect Theory to LLMs is not robust to epistemic uncertainty from linguistic markers.
Conclusion: Caution needed when deploying PT-based frameworks in real-world applications with epistemic ambiguity; provides insights for LLM behavior interpretation and alignment.
Abstract: Prospect Theory (PT) models human decision-making behaviour under uncertainty, among which linguistic uncertainty is commonly adopted in real-world scenarios. Although recent studies have developed some frameworks to test PT parameters for Large Language Models (LLMs), few have considered the fitness of PT itself on LLMs. Moreover, whether PT is robust under linguistic uncertainty perturbations, especially epistemic markers (e.g. “likely”), remains highly under-explored. To address these gaps, we design a three-stage workflow based on a classic behavioural economics experimental setup. We first estimate PT parameters with economics questions and evaluate PT’s fitness with performance metrics. We then derive probability mappings for epistemic markers in the same context, and inject these mappings into the prompt to investigate the stability of PT parameters. Our findings suggest that modelling LLMs’ decision-making with PT is not consistently reliable across models, and applying Prospect Theory to LLMs is likely not robust to epistemic uncertainty. The findings caution against the deployment of PT-based frameworks in real-world applications where epistemic ambiguity is prevalent, giving valuable insights in behaviour interpretation and future alignment direction for LLM decision-making.
[328] Interactive Program Synthesis for Modeling Collaborative Physical Activities from Narrated Demonstrations
Edward Kim, Daniel He, Jorge Chao, Wiktor Rajca, Mohammed Amin, Nishant Malpani, Ruta Desai, Antti Oulasvirta, Bjoern Hartmann, Sanjit Seshia
Main category: cs.AI
TL;DR: System for teaching collaborative physical tasks using program synthesis from narrated demonstrations, enabling users to inspect and correct learned behavior without coding.
Details
Motivation: Teaching systems collaborative physical tasks is challenging because it requires inferring users' assumptions about teammates' intent, which is ambiguous and dynamic. Existing systems focus on non-collaborative activities and lack interpretable, correctable representations.Method: Frames collaborative task learning as program synthesis problem. Represents behavior as editable programs. Uses narrated demonstrations (paired physical actions + natural language) as unified modality for teaching, inspecting, and correcting system logic without requiring users to see or write code.
Result: In study with 20 users teaching multiplayer soccer tactics: 70% (14/20) successfully refined learned programs to match intent, 90% (18/20) found it easy to correct programs. Surfaces unique challenges in representing learning as programs and teaching collaborative physical activities.
Conclusion: Program synthesis with narrated demonstrations enables interpretable and correctable learning of collaborative physical tasks. The approach allows users to teach, inspect, and refine system behavior without coding expertise. Identifies challenges and outlines mitigation strategies for future work.
Abstract: Teaching systems physical tasks is a long standing goal in HCI, yet most prior work has focused on non collaborative physical activities. Collaborative tasks introduce added complexity, requiring systems to infer users assumptions about their teammates intent, which is an inherently ambiguous and dynamic process. This necessitates representations that are interpretable and correctable, enabling users to inspect and refine system behavior. We address this challenge by framing collaborative task learning as a program synthesis problem. Our system represents behavior as editable programs and uses narrated demonstrations, i.e. paired physical actions and natural language, as a unified modality for teaching, inspecting, and correcting system logic without requiring users to see or write code. The same modality is used for the system to communicate its learning to users. In a within subjects study, 20 users taught multiplayer soccer tactics to our system. 70 percent (14/20) of participants successfully refined learned programs to match their intent and 90 percent (18/20) found it easy to correct the programs. The study surfaced unique challenges in representing learning as programs and in enabling users to teach collaborative physical activities. We discuss these issues and outline mitigation strategies.
[329] Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search
Xinzhe Li
Main category: cs.AI
TL;DR: CiT is a plug-in framework that reduces computational costs in LLM tree search by selectively branching only when necessary, achieving 75-85% savings with minimal accuracy loss.
Details
Motivation: Current LLM inference via tree search (LITS) methods are computationally expensive, expanding at every reasoning step. There's a need for more efficient approaches that maintain performance while reducing computational overhead.Method: Proposes Chain-in-Tree (CiT) with lightweight Branching Necessity (BN) evaluations: BN-DP (direct prompting) and BN-SC (self-consistency). These decide when to branch during search instead of expanding at every step, integrated into frameworks like Tree of Thoughts, ReST-MCTS, and RAP.
Result: BN-DP reduces token generation, model calls, and runtime by 75-85% on GSM8K and Math500 with often negligible or no accuracy loss. BN-SC yields substantial savings (up to 80%) but shows instability in some settings due to extremely long reasoning steps in a small subset of examples.
Conclusion: CiT provides an efficient plug-in framework for LLM tree search that dramatically reduces computational costs while maintaining performance, with theoretical guarantees that BN-DP never increases policy invocations.
Abstract: Test-time scaling improves large language models (LLMs) on long-horizon reasoning tasks by allocating more compute at inference. LLM inference via tree search (LITS) achieves strong performance but is highly inefficient. We propose Chain-in-Tree (CiT), a plug-in framework that decides when to branch during search instead of expanding at every step. CiT introduces lightweight Branching Necessity (BN) evaluations, including BN-DP (direct prompting) and BN-SC (self-consistency). Integrated into Tree of Thoughts, ReST-MCTS, and RAP, BN-DP reduces token generation, model calls, and runtime by 75-85% on GSM8K and Math500, with often negligible or no accuracy loss. BN-SC typically yields substantial savings (up to 80%) generally but shows instability in 1-4 out of 14 settings, caused by a small subset of examples that produce extremely long reasoning steps. We theoretically prove that BN-DP never increases policy invocations and release unified implementations applicable across LITS frameworks. The full codebase is publicly available at https://github.com/xinzhel/chain_in_tree.
[330] AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting
Xiaohan Zhang, Tian Gao, Mingyue Cheng, Bokai Pan, Ze Guo, Yaguo Liu, Xiaoyu Tao, Qi Liu
Main category: cs.AI
TL;DR: Alphacast is an agentic reasoning framework that uses training-free LLMs for time series forecasting by mimicking human expert iterative reasoning through multi-stage workflows and external knowledge tools.
Details
Motivation: Current time series forecasting methods treat it as static single-pass regression, but human experts use iterative reasoning integrating temporal features, domain knowledge, case references, and context with continuous refinement.Method: Reformulates forecasting as expert-like process with multi-stage workflow: context preparation, reasoning-based generation, and reflective evaluation. Uses lightweight toolkit with feature set, knowledge base, case library, and contextual pool to support LLM-based reasoning.
Result: Extensive experiments across multiple benchmarks show Alphacast generally outperforms representative baselines.
Conclusion: Alphacast transforms forecasting from single-pass output to multi-turn autonomous interaction process, enabling accurate forecasting with training-free LLMs.
Abstract: Time series forecasting plays a crucial role in decision-making across many real-world applications. Despite substantial progress, most existing methods still treat forecasting as a static, single-pass regression problem. In contrast, human experts form predictions through iterative reasoning that integrates temporal features, domain knowledge, case-based references, and supplementary context, with continuous refinement. In this work, we propose Alphacast, an interaction-driven agentic reasoning framework that enables accurate time series forecasting with training-free large language models. Alphacast reformulates forecasting as an expert-like process and organizes it into a multi-stage workflow involving context preparation, reasoning-based generation, and reflective evaluation, transforming forecasting from a single-pass output into a multi-turn, autonomous interaction process. To support diverse perspectives commonly considered by human experts, we develop a lightweight toolkit comprising a feature set, a knowledge base, a case library, and a contextual pool that provides external support for LLM-based reasoning. Extensive experiments across multiple benchmarks show that Alphacast generally outperforms representative baselines. Code is available at this repository: https://github.com/echo01-ai/AlphaCast.
[331] Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
Jiahuan Long, Tingsong Jiang, Hanqing Liu, Chao Ma, Weien Zhou, Yang Yang, Wen Yao
Main category: cs.AI
TL;DR: Thermally activated adversarial wearable using thermochromic dyes and heating units to create dynamic patterns on clothing that evade AI surveillance in both visible and infrared modalities.
Details
Motivation: To address the conspicuous appearance limitation of traditional adversarial patches for privacy protection against AI surveillance, creating a more practical and adaptable solution that can be deployed in real-world scenarios.Method: Integration of thermochromic dyes with flexible heating units on clothing surfaces to create thermally activated adversarial patterns. The system appears as ordinary black clothing by default, but hidden patterns are revealed when heated, creating adversarial effects against surveillance systems.
Result: The adversarial wearable achieves rapid texture activation within 50 seconds and maintains over 80% adversarial success rate across diverse real-world surveillance environments, working effectively in both visible and infrared modalities.
Conclusion: This work demonstrates a physically grounded, user-controllable anti-AI system that provides a practical pathway for privacy protection against ubiquitous AI surveillance through proactive adversarial techniques.
Abstract: Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.
[332] EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang
Main category: cs.AI
TL;DR: EchoTrail-GUI is a framework that adds experiential learning to GUI agents by creating a dynamic memory system from past successful task trajectories, improving performance without human supervision.
Details
Motivation: Current GUI agents suffer from "digital amnesia" - they treat each task in isolation without learning from past successes, leading to repeated errors and poor generalization to novel challenges.Method: Three-stage framework: 1) Experience Exploration - autonomous interaction with GUI environments to build curated database of successful trajectories validated by reward model, 2) Memory Injection - efficient retrieval of relevant past trajectories for new tasks, 3) GUI Task Inference - injecting memories as in-context guidance for agent reasoning.
Result: Significant improvement in task success rate and operational efficiency on benchmarks including Android World and AndroidLab, validating the power of structured memory in GUI automation.
Conclusion: EchoTrail-GUI demonstrates that equipping GUI agents with dynamic, accessible memory enables human-like experiential learning, creating more robust and intelligent automation systems.
Abstract: Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital ‘‘amnesia’’ results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable ‘‘memories’’. Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent’s reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.
[333] Sample-Efficient Neurosymbolic Deep Reinforcement Learning
Celeste Veronese, Alessandro Farinelli, Daniele Meli
Main category: cs.AI
TL;DR: Neuro-symbolic DRL approach integrates symbolic knowledge as logical rules to improve sample efficiency and generalization in complex environments.
Details
Motivation: Deep RL algorithms require large datasets and struggle with generalization; integrating symbolic knowledge can improve sample efficiency and transfer learning to more complex tasks.Method: Partial policies from simple domain instances are represented as logical rules, then used to guide training through action distribution biasing during exploration and Q-value rescaling during exploitation.
Result: Improved performance over state-of-the-art reward machine baseline on challenging gridworld variants in both fully and partially observable settings, with faster convergence in sparse-reward environments.
Conclusion: Neuro-symbolic integration enhances interpretability, trustworthiness, and accelerates convergence in RL, particularly for sparse-reward environments and long-horizon tasks.
Abstract: Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond small-scale training scenarios, even within standard benchmarks. We propose a neuro-symbolic DRL approach that integrates background symbolic knowledge to improve sample efficiency and generalization to more challenging, unseen tasks. Partial policies defined for simple domain instances, where high performance is easily attained, are transferred as useful priors to accelerate learning in more complex settings and avoid tuning DRL parameters from scratch. To do so, partial policies are represented as logical rules, and online reasoning is performed to guide the training process through two mechanisms: (i) biasing the action distribution during exploration, and (ii) rescaling Q-values during exploitation. This neuro-symbolic integration enhances interpretability and trustworthiness while accelerating convergence, particularly in sparse-reward environments and tasks with long planning horizons. We empirically validate our methodology on challenging variants of gridworld environments, both in the fully observable and partially observable setting. We show improved performance over a state-of-the-art reward machine baseline.
[334] Reasoning Models Will Sometimes Lie About Their Reasoning
William Walden, Miriam Wanner
Main category: cs.AI
TL;DR: Large Reasoning Models often fail to acknowledge using hints in their reasoning even when explicitly alerted to unusual inputs, revealing challenges for faithfulness evaluation and interpretability.
Details
Motivation: Prior work shows LRMs don't always volunteer information about how hints influence reasoning, but fails to specify what models should do when alerted to unusual inputs - a standard security measure against prompt injections.Method: Study faithfulness under realistic settings where models are explicitly alerted to possibility of unusual inputs. Use both existing faithfulness metrics and propose new, more granular metrics to evaluate model behavior.
Result: Instructions about unusual inputs yield strong results on prior faithfulness metrics, but new granular metrics show mixed results: models may acknowledge hints but often deny intending to use them, even when permitted and demonstrably using them.
Conclusion: Results reveal broader challenges for chain-of-thought monitoring and interpretability, highlighting the gap between model behavior and faithful reasoning reporting.
Abstract: Hint-based faithfulness evaluations have established that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how key parts of the input (e.g. answer hints) influence their reasoning. Yet, these evaluations also fail to specify what models should do when confronted with hints or other unusual prompt content – even though versions of such instructions are standard security measures (e.g. for countering prompt injections). Here, we study faithfulness under this more realistic setting in which models are explicitly alerted to the possibility of unusual inputs. We find that such instructions can yield strong results on faithfulness metrics from prior work. However, results on new, more granular metrics proposed in this work paint a mixed picture: although models may acknowledge the presence of hints, they will often deny intending to use them – even when permitted to use hints and even when it can be demonstrated that they are using them. Our results thus raise broader challenges for CoT monitoring and interpretability.
[335] Precomputing Multi-Agent Path Replanning using Temporal Flexibility
Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt
Main category: cs.AI
TL;DR: FlexSIPP: An algorithm for efficient multi-agent replanning when one agent is delayed, using precomputed temporal flexibility to avoid cascading delays while maintaining plan feasibility.
Details
Motivation: Multi-agent plan execution becomes challenging when an agent is delayed, creating conflicts with other agents. Traditional approaches either replan only the delayed agent (often inefficient or infeasible) or replan all agents (computationally expensive and causes cascading delays). Need a method to efficiently replan while avoiding cascading delays.Method: FlexSIPP tracks and uses temporal flexibility of other agents - the maximum delay an agent can take without changing the order of other agents or further delaying them. The algorithm precomputes all possible plans for the delayed agent and returns changes to other agents for any single-agent delay within the given scenario.
Result: Demonstrated in real-world case study of replanning trains in the densely-used Dutch railway network and in the MovingAI benchmark set. FlexSIPP provides effective solutions relevant to real-world adjustments within reasonable timeframe.
Conclusion: FlexSIPP efficiently handles single-agent delays in multi-agent systems by leveraging temporal flexibility, avoiding computationally expensive full replanning while maintaining plan feasibility and efficiency.
Abstract: Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not yield an efficient plan, and sometimes cannot even yield a feasible one. On the other hand, replanning other agents may lead to a cascade of changes and delays and is computationally expensive. We show how to efficiently replan by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay an agent can take without changing the order of other agents or further delaying them. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent and returns the changes to the other agents for any single-agent delay within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely-used Dutch railway network and in the MovingAI benchmark set. Our experiments show that FlexSIPP provides effective solutions relevant to real-world adjustments, and within a reasonable timeframe.
[336] ConvoLearn: A Learning Sciences Grounded Dataset for Fine-Tuning Dialogic AI Tutors
Mayank Sharma, Roy Pea, Hari Subramonyam
Main category: cs.AI
TL;DR: ConvoLearn is a dataset of 2,134 semi-synthetic tutor-student dialogues for training LLMs in dialogic tutoring, showing that fine-tuning on this data improves AI tutor quality as rated by teachers.
Details
Motivation: Current LLMs used in education lack alignment with dialogic tutoring principles, which emphasize collaborative knowledge construction between tutor and student. The authors aim to develop AI tutors capable of more dialogic interactions.Method: Created ConvoLearn dataset with 2,134 semi-synthetic tutor-student dialogues based on six dimensions of dialogic tutoring from knowledge-building theory. Used middle school Earth Science curriculum. Trained classifiers on this data and fine-tuned Mistral-7B model with dimension-level fine-tuning.
Result: Classifier scores from ConvoLearn correlate significantly with expert-coded instructional quality in authentic classrooms. Fine-tuned Mistral-7B shows dialogic tutoring behavior that credentialed teachers rate as competitive with strong proprietary baselines.
Conclusion: Dimension-labeled dialogic training data captures meaningful pedagogical signal that generalizes beyond synthetic domains, enabling development of more dialogic AI tutors through targeted fine-tuning.
Abstract: Despite their growing adoption in education, LLMs remain misaligned with the core principle of effective tutoring: the dialogic construction of knowledge. We introduce ConvoLearn, a dataset of 2,134 semi-synthetic tutor-student dialogues operationalizing six dimensions of dialogic tutoring grounded in knowledge-building theory, situated in a middle school Earth Science curriculum. We show that dimension-labeled dialogic training data captures meaningful pedagogical signal that generalizes beyond its semi-synthetic domain: scores from a classifier trained on ConvoLearn correlate significantly with expert-coded instructional quality in authentic classrooms across multiple subscales. As a proof of concept, we fine-tune Mistral-7B on ConvoLearn and show that dimension-level fine-tuning can steer a 7B open-weight model toward dialogic tutoring behavior that credentialed teachers rate as competitive with a strong proprietary baseline. With this work, we support the development of AI tutors capable of more dialogic interactions.
[337] The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
Alexander Hägele, Aryo Pradipta Gema, Henry Sleight, Ethan Perez, Jascha Sohl-Dickstein
Main category: cs.AI
TL;DR: Paper analyzes AI failure modes using bias-variance decomposition, finding that as models spend more time reasoning and taking actions, their failures become more incoherent rather than systematic misalignment.
Details
Motivation: To understand how extremely capable AI models will fail - whether through systematic goal misalignment or incoherent, nonsensical behavior - as AI becomes entrusted with more consequential tasks and failure risks grow more severe.Method: Operationalizes the question using bias-variance decomposition of AI errors, measuring “error-incoherence” as the fraction of error stemming from variance rather than bias in task outcomes across various tasks and frontier models.
Result: Longer reasoning and action sequences lead to more incoherent failures; error-incoherence varies with model scale but larger models tend to be more incoherent; scale alone unlikely to eliminate error-incoherence.
Conclusion: As AIs pursue harder tasks requiring more sequential action and thought, failures will be accompanied by more incoherent behavior, suggesting future where AIs cause accidents due to unpredictable misbehavior rather than consistent pursuit of misaligned goals.
Abstract: As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand how extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI’s \emph{error-incoherence} on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, \emph{the more incoherent} their failures become. Error-incoherence changes with model scale in a way that is experiment dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate error-incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior. This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal. This increases the relative importance of alignment research targeting reward hacking or goal misspecification.
[338] H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration
Jun-Min Lee, Meong Hi Son, Edward Choi
Main category: cs.AI
TL;DR: H-AdminSim: A comprehensive simulation framework for hospital administrative workflows using multi-agent systems and FHIR integration to evaluate LLM-based automation.
Details
Motivation: Hospital administration handles thousands of daily requests, but prior LLM research has focused only on patient-physician interactions or isolated subtasks, missing the complexity of real administrative workflows.Method: Proposes H-AdminSim framework combining realistic data generation with multi-agent-based simulation of hospital administrative workflows, with quantitative evaluation using detailed rubrics and FHIR integration for interoperability.
Result: Creates a unified, interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing LLM-driven administrative automation.
Conclusion: H-AdminSim addresses the gap in comprehensive hospital administrative workflow simulation and enables systematic comparison of LLM performance in real-world administrative automation scenarios.
Abstract: Hospital administration departments handle a wide range of operational tasks and, in large hospitals, process over 10,000 requests per day, driving growing interest in LLM-based automation. However, prior work has focused primarily on patient-physician interactions or isolated administrative subtasks, failing to capture the complexity of real administrative workflows. To address this gap, we propose H-AdminSim, a comprehensive simulation framework that combines realistic data generation with multi-agent-based simulation of hospital administrative workflows. These tasks are quantitatively evaluated using detailed rubrics, enabling systematic comparison of LLMs. Through FHIR integration, H-AdminSim provides a unified and interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing the feasibility and performance of LLM-driven administrative automation.
[339] Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization
Xia Jiang, Jing Chen, Cong Zhang, Jie Gao, Chengpeng Hu, Chenhao Zhang, Yaoxin Wu, Yingqian Zhang
Main category: cs.AI
TL;DR: NLCO benchmark evaluates LLMs on natural language combinatorial optimization problems across 43 CO tasks, showing models struggle with larger instances and certain problem types despite strong performance on small cases.
Details
Motivation: While LLMs excel at math and logic reasoning, their ability to handle combinatorial optimization (searching high-dimensional solution spaces under hard constraints) remains underexplored. The authors aim to bridge this gap by creating a benchmark to evaluate LLMs on end-to-end CO reasoning without code or external solvers.Method: Introduce NLCO benchmark with 43 CO problems organized using a four-layer taxonomy (variable types, constraint families, global patterns, objective classes). Provide solver-annotated solutions and evaluate LLMs on feasibility, solution optimality, and reasoning efficiency across a wide range of modern models.
Result: High-performing LLMs achieve strong feasibility and solution quality on small instances, but performance degrades as instance size grows, even with more reasoning tokens. Set-based tasks are relatively easy, while graph-structured problems and bottleneck objectives lead to more frequent failures.
Conclusion: LLMs have limitations in handling combinatorial optimization at scale, with systematic performance variations across problem types. The NLCO benchmark enables fine-grained evaluation of LLMs’ CO reasoning capabilities and reveals areas needing improvement.
Abstract: While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) – searching high-dimensional solution spaces under hard constraints – remains underexplored. To bridge the gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.
[340] ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
Bang Nguyen, Dominik Soós, Qian Ma, Rochana R. Obadage, Zack Ranjan, Sai Koneru, Anna Szabelska, Adam Gill, Timothy M. Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, Meng Jiang
Main category: cs.AI
TL;DR: ReplicatorBench: A benchmark for evaluating AI agents in scientific paper replication, addressing limitations of existing benchmarks by including both replicable and non-replicable research claims and evaluating the full replication process.
Details
Motivation: Existing benchmarks for AI assessment of scientific papers focus only on computational reproduction with code/data access, ignoring real-world challenges like inconsistent data availability and lack of ground-truth diversity (only reproducible papers). They also evaluate outcomes rather than the replication process.Method: Introduces ReplicatorBench with human-verified replicable and non-replicable research claims from social/behavioral sciences. Evaluates AI agents across three stages: (1) extraction/retrieval of replication data, (2) design/execution of computational experiments, (3) interpretation of results. Also develops ReplicatorAgent framework with tools like web search and sandboxed environments.
Result: Evaluation across four LLMs shows current agents can design/execute computational experiments effectively but struggle with retrieving necessary resources like new data for replication. Different programming languages and code access levels were tested.
Conclusion: ReplicatorBench provides comprehensive evaluation of AI agents in research replication, revealing current limitations in resource retrieval while demonstrating strengths in experimental design/execution. The benchmark enables testing of agents’ ability to mimic human replicators in real-world scenarios.
Abstract: The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents’ ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent’s ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents’ capability to mimic the activities of human replicators in real world. To set a baseline of AI agents’ capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm-benchmarking.
[341] PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
Main category: cs.AI
TL;DR: PACED: A distillation method that weights training problems by p(1-p) where p is student’s pass rate, focusing on the zone of proximal development to improve efficiency and performance.
Details
Motivation: Standard LLM distillation treats all training problems equally, wasting compute on problems the student has already mastered or cannot yet solve. This inefficiency has a gradient-level signature where cross-problem gradient signal-to-noise ratio collapses at both extremes of student pass rate.Method: Proposes PACED which weights each problem by w(p) = p(1-p) where p is the student’s empirical pass rate, concentrating training on the zone of proximal development. The Beta kernel w(p) = p^α(1-p)^β is proven to be the leading-order optimal weight family arising from SNR boundary-collapse structure. Uses only student rollouts, no architectural changes, and no hyperparameters.
Result: Sets new SOTA on MATH-500, AIME2024, and AIME2025 across Qwen3, Qwen2.5, and Llama-3 families, improving over unweighted distillation by up to +8.2 and over AKL baseline by up to +3.6. Reduces forgetting to 1.4% and 0.6% in distillation and self-distillation. Two-stage forward-then-reverse KL schedule pushes gains further to +5.8 over standard forward KL.
Conclusion: PACED provides an efficient distillation method that focuses training on problems in the student’s zone of proximal development, achieving significant performance improvements while reducing forgetting, with theoretical guarantees on optimality and robustness.
Abstract: Standard LLM distillation treats all training problems equally – wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes.
We propose PACED, which weights each problem by $w(p) = p(1{-}p)$ where $p$ is the student’s empirical pass rate – concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel $w(p) = p^α(1{-}p)^β$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(δ^2)$).
Across Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME2024, and AIME2025, improving over unweighted distillation by up to $\mathbf{+8.2}$ and over the strong AKL baseline by up to $\mathbf{+3.6}$, while reducing forgetting to $\mathbf{1.4%}$ and $\mathbf{0.6%}$ in distillation and self-distillation. A two-stage forward-then-reverse KL schedule pushes gains further to $\mathbf{+5.8}$ over standard forward KL on the hardest benchmark.
[342] Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces
Neelmani Vispute, Aditya Kadam
Main category: cs.AI
TL;DR: The paper introduces Agent Execution Record (AER), a structured reasoning provenance primitive for AI agents that captures intent, observation, inference, and evidence chains as first-class queryable fields to enable population-level behavioral analytics.
Details
Motivation: As AI agents become autonomous infrastructure, there's a need to analyze their reasoning behavior across populations. Existing systems provide execution traces and telemetry but lack structured reasoning provenance as a first-class primitive.Method: Introduces AER with structured fields for intent, observation, inference, versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. Formalizes distinction between computational state persistence and reasoning provenance.
Result: AER enables population-level behavioral analytics including reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. Includes domain-agnostic model with extensible profiles, reference implementation, SDK, and preliminary deployment evaluation.
Conclusion: Structured reasoning provenance is essential for analyzing autonomous AI agents at scale, and AER provides a foundational primitive that cannot be faithfully reconstructed from computational state persistence alone.
Abstract: As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance – normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.
[343] TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Yang Yang
Main category: cs.AI
TL;DR: TRU: Targeted reverse update framework for multimodal recommendation systems that addresses non-uniform deletion influence across ranking behavior, modality branches, and network layers through three coordinated interventions.
Details
Motivation: Existing approximate unlearning methods for multimodal recommendation systems use uniform reverse updates, but deletion influence is actually concentrated unevenly across ranking behavior, modality branches, and network layers, creating three bottlenecks in MRS unlearning.Method: Proposes TRU with three coordinated interventions: 1) ranking fusion gate to suppress residual target-item influence, 2) branch-wise modality scaling to preserve retained multimodal representations, and 3) capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules.
Result: Experiments across two backbones, three datasets, and three unlearning regimes show TRU achieves better retain-forget trade-off than prior baselines, with security audits confirming deeper forgetting and behavior closer to full retraining on retained data.
Conclusion: TRU addresses the fundamental mismatch between uniform unlearning assumptions and the non-uniform deletion influence in modern MRS, providing an effective plug-and-play unlearning framework.
Abstract: Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across \textit{ranking behavior}, \textit{modality branches}, and \textit{network layers}. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose \textbf{targeted reverse update} (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.
[344] AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han
Main category: cs.AI
TL;DR: AgentCE-Bench: A lightweight grid-based planning benchmark for agent evaluation with scalable horizons and controllable difficulty, eliminating environment interaction overhead.
Details
Motivation: Existing agent benchmarks have high environment interaction overhead (up to 41% of evaluation time) and imbalanced task horizon/difficulty distributions, making aggregate scores unreliable for evaluating agent reasoning capabilities.Method: Proposes AgentCE-Bench with a unified grid-based planning task where agents fill hidden slots in a partially completed schedule subject to local and global constraints. Uses two orthogonal control axes: Scalable Horizons (number of hidden slots H) and Controllable Difficulty (decoy budget B for misleading candidates). Features a Lightweight Environment design where all tool calls are resolved via static JSON files, eliminating setup overhead.
Result: Validated that H and B provide reliable control over task horizon and difficulty, with strong domain consistency and model discriminability. Comprehensive experiments across 13 models of diverse sizes and families over 6 domains revealed significant cross-model performance variation, confirming interpretable and controllable evaluation of agent reasoning.
Conclusion: AgentCE-Bench provides a fast, reproducible evaluation framework suitable for training-time validation, addressing critical limitations of existing agent benchmarks while offering fine-grained control over task complexity for reliable agent assessment.
Abstract: Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose AgentCE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: \textbf{Scalable Horizons}, controlled by the number of hidden slots $H$, and \textbf{Controllable Difficulty}, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a \textbf{Lightweight Environment} design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that $H$ and $B$ provide reliable control over task horizon and difficulty, and that AgentCE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that AgentCE-Bench provides interpretable and controllable evaluation of agent reasoning.
[345] ActionNex: A Virtual Outage Manager for Cloud Computing
Zhenfeng Lin, Haoji Hu, Ming Hao, Xuchao Zhang, Ryan Zhang, Junhao Li, Ze Li, Oleg Kulygin, Chetan Bansal, Hatay Tuna, Murali Chintalapati, Sheila Jiang, Salman Zafar, Angie Anderson
Main category: cs.AI
TL;DR: ActionNex is a production-grade agentic system for automated outage management in cloud operations that processes multimodal operational signals and provides next-best action recommendations through hierarchical memory and reasoning.
Details
Motivation: Outage management in large-scale cloud operations is heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability, creating a need for automated assistance systems.Method: ActionNex ingests multimodal operational signals (outage content, telemetry, human communications) and compresses them into critical events. It uses a hierarchical memory subsystem with long-term KCA knowledge from playbooks, episodic memory of prior outages, and working memory of live context. A reasoning agent aligns current events to preconditions, retrieves relevant memories, and generates actionable recommendations.
Result: Evaluated on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8-54.8% recall. The system has been piloted in production with positive early feedback.
Conclusion: ActionNex demonstrates a practical approach to automating outage management through multimodal signal processing and hierarchical memory, showing promising results in production environments for cloud operations.
Abstract: Outage management in large-scale cloud operations remains heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability. We present \textbf{ActionNex}, a production-grade agentic system that supports end-to-end outage assistance, including real-time updates, knowledge distillation, and role- and stage-conditioned next-best action recommendations. ActionNex ingests multimodal operational signals (e.g., outage content, telemetry, and human communications) and compresses them into critical events that represent meaningful state transitions. It couples this perception layer with a hierarchical memory subsystem: long-term Key-Condition-Action (KCA) knowledge distilled from playbooks and historical executions, episodic memory of prior outages, and working memory of the live context. A reasoning agent aligns current critical events to preconditions, retrieves relevant memories, and generates actionable recommendations; executed human actions serve as an implicit feedback signal to enable continual self-evolution in a human-agent hybrid system. We evaluate ActionNex on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8-54.8% recall. The system has been piloted in production and has received positive early feedback.
[346] Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang, Coleman Hooper, Yuezhou Hu, Rishabh Tiwari, Jue Wang, Harman Singh, Qingyang Wu, Yuqing Jian, Ce Zhang, Kurt Keutzer, Tri Dao, Xiaoxia Wu, Ben Athiwaratkun, James Zou, Chenfeng Xu
Main category: cs.AI
TL;DR: Squeeze Evolve is a multi-model orchestration framework for verifier-free evolutionary inference that allocates model capability based on marginal utility, using stronger models for high-impact stages and cheaper models for others to improve diversity, cost-efficiency, and performance.
Details
Motivation: Verifier-free evolution faces bottlenecks in diversity and efficiency - without external correction, repeated evolution collapses to narrow modes, while uniform use of high-cost models wastes compute and becomes economically impractical.Method: A unified multi-model orchestration framework that allocates model capability where it has highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle other stages at lower costs, addressing diversity and cost-efficiency jointly.
Result: Achieves new SOTA results on several tasks including AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks (MMMU-Pro, BabyVision). Reduces API cost by ~3× and increases fixed-budget throughput by ~10×. First verifier-free evolutionary method to match/exceed verifier-based methods on discovery tasks.
Conclusion: Squeeze Evolve improves the cost-capability frontier over single-model evolution, supports open/closed/mixed-model deployments, and demonstrates that intelligent model allocation can overcome limitations of verifier-free evolution while maintaining economic practicality.
Abstract: We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while the uniform use of a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the other stages at much lower costs. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks, such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to $\sim$3$\times$ and increases fixed-budget serving throughput by up to $\sim$10$\times$. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.
[347] Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning
Chao Li, Yuru Wang, Chunyi Zhao
Main category: cs.AI
TL;DR: A computational theory for domain-scoped inference architecture that enables substrate-agnostic execution across symbolic, neural, vector, and hybrid systems with explicit domain parameterization.
Details
Motivation: To create a computation-substrate-agnostic inference architecture where domain is an explicit first-class computational parameter, enabling transparent inference chains and reducing search complexity.Method: Five-layer architecture with three domain computation modes (chain indexing, Kleisli composition, vector-guided computation), substrate-agnostic interface (Query, Extend, Bridge operations), reliability conditions, and validation through PHQ-9 clinical reasoning case study.
Result: Domain-scoped pruning reduces per-query search space from O(N) to O(N/K), enables substrate-independent execution, and provides transparent inference chains with evaluative context at every step.
Conclusion: The paper contributes a computational theory for substrate-agnostic inference with formal operational semantics, complexity bounds, monad structure, and validation through clinical reasoning applications.
Abstract: We establish a computation-substrate-agnostic inference architecture in which domain is an explicit first-class computational parameter. This produces domain-scoped pruning that reduces per-query search space from O(N) to O(N/K), substrate-independent execution over symbolic, neural, vector, and hybrid substrates, and transparent inference chains where every step carries its evaluative context. The contribution is architectural, not logical. We formalize the computational theory across five dimensions: a five-layer architecture; three domain computation modes including chain indexing, path traversal as Kleisli composition, and vector-guided computation as a substrate transition; a substrate-agnostic interface with three operations Query, Extend, Bridge; reliability conditions C1 to C4 with three failure mode classes; and validation through a PHQ-9 clinical reasoning case study. The computational theory including operational semantics, complexity bounds, monad structure, substrate transitions, and boundary conditions is the contribution of this paper.
[348] ActivityEditor: Learning to Synthesize Physically Valid Human Mobility
Chenjie Yang, Yutian Jiang, Anqi Liang, Wei Qi, Chenyu Wu, Junbo Zhang
Main category: cs.AI
TL;DR: ActivityEditor: A dual-LLM-agent framework for zero-shot cross-regional human mobility trajectory generation using intention-based and editor agents with reinforcement learning.
Details
Motivation: Address data scarcity in human mobility modeling where historical trajectories are unavailable or restricted, enabling mobility simulation in data-scarce regions.Method: Dual-LLM-agent framework with intention-based agent generating structured human intentions and coarse activity chains using demographic-driven priors, followed by editor agent refining outputs through iterative revisions with reinforcement learning using multiple rewards based on real-world physical constraints.
Result: Achieves superior zero-shot performance across diverse urban contexts, maintaining high statistical fidelity and physical validity for mobility simulation in data-scarce scenarios.
Conclusion: ActivityEditor provides a robust and highly generalizable solution for human mobility modeling in regions with limited or no historical trajectory data.
Abstract: Human mobility modeling is indispensable for diverse urban applications. However, existing data-driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose \textbf{ActivityEditor}, a novel dual-LLM-agent framework designed for zero-shot cross-regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention-based agent, which leverages demographic-driven priors to generate structured human intentions and coarse activity chains to ensure high-level socio-semantic coherence. These outputs are then refined by editor agent to obtain mobility trajectories through iteratively revisions that enforces human mobility law. This capability is acquired through reinforcement learning with multiple rewards grounded in real-world physical constraints, allowing the agent to internalize mobility regularities and ensure high-fidelity trajectory generation. Extensive experiments demonstrate that \textbf{ActivityEditor} achieves superior zero-shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data-scarce scenarios. Our code is available at: https://anonymous.4open.science/r/ActivityEditor-066B.
[349] Towards Knowledgeable Deep Research: Framework and Benchmark
Wenxuan Liu, Zixuan Li, Long Bai, Chunmao Zhang, Fenghui Zhang, Zhuo Chen, Wei Li, Yuxin Zuo, Fei Wang, Bingbing Xu, Xuhui Jiang, Jin Zhang, Xiaolong Jin, Jiafeng Guo, Tat-Seng Chua, Xueqi Cheng
Main category: cs.AI
TL;DR: HKA framework enables LLM agents to perform deep research using both structured (tables, figures) and unstructured knowledge, generating multimodal reports with quantitative analysis.
Details
Motivation: Existing deep research agents mainly focus on unstructured web content, but real-world research requires structured knowledge for quantitative computation and deeper analysis. The paper introduces Knowledgeable Deep Research (KDR) as a more challenging task that integrates both knowledge types.Method: Proposes Hybrid Knowledge Analysis (HKA) framework with multi-agent architecture including Structured Knowledge Analyzer that uses coding and vision-language models to process structured data into figures, tables, and insights. Integrates both structured and unstructured knowledge into coherent multimodal reports.
Result: HKA outperforms most existing DR agents on general-purpose and knowledge-centric metrics, and surpasses Gemini DR agent on vision-enhanced metrics. Evaluated on KDR-Bench with 9 domains, 41 expert questions, and 1,252 structured knowledge resources.
Conclusion: HKA effectively enables deep, structure-aware knowledge analysis and multimodal report generation. The work establishes foundation for structured knowledge analysis in DR agents and facilitates future multimodal deep research studies.
Abstract: Deep Research (DR) requires LLM agents to autonomously perform multi-step information seeking, processing, and reasoning to generate comprehensive reports. In contrast to existing studies that mainly focus on unstructured web content, a more challenging DR task should additionally utilize structured knowledge to provide a solid data foundation, facilitate quantitative computation, and lead to in-depth analyses. In this paper, we refer to this novel task as Knowledgeable Deep Research (KDR), which requires DR agents to generate reports with both structured and unstructured knowledge. Furthermore, we propose the Hybrid Knowledge Analysis framework (HKA), a multi-agent architecture that reasons over both kinds of knowledge and integrates the texts, figures, and tables into coherent multimodal reports. The key design is the Structured Knowledge Analyzer, which utilizes both coding and vision-language models to produce figures, tables, and corresponding insights. To support systematic evaluation, we construct KDR-Bench, which covers 9 domains, includes 41 expert-level questions, and incorporates a large number of structured knowledge resources (e.g., 1,252 tables). We further annotate the main conclusions and key points for each question and propose three categories of evaluation metrics including general-purpose, knowledge-centric, and vision-enhanced ones. Experimental results demonstrate that HKA consistently outperforms most existing DR agents on general-purpose and knowledge-centric metrics, and even surpasses the Gemini DR agent on vision-enhanced metrics, highlighting its effectiveness in deep, structure-aware knowledge analysis. Finally, we hope this work can serve as a new foundation for structured knowledge analysis in DR agents and facilitate future multimodal DR studies.
[350] EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
Boer Zhang, Mingyan Wu, Dongzhuoran Zhou, Yuqicheng Zhu, Wendong Fan, Puzhen Zhang, Zifeng Ding, Guohao Li, Yuan He
Main category: cs.AI
TL;DR: Q+ introduces structured query and evidence processing tools for web research agents to improve search deliberation and evidence aggregation in open-ended question answering.
Details
Motivation: Current deep research agents rely on implicit, unstructured search behavior leading to redundant exploration and brittle evidence aggregation. The paper aims to make web search more deliberate by guiding query planning, monitoring search progress, and extracting evidence from web content.Method: Q+ provides query and evidence processing tools integrated into browser sub-agents. It includes query planning guidance, search progress monitoring, and evidence extraction from long web snapshots. The system is integrated into Eigent’s multi-agent workforce as EigentSearch-Q+.
Result: Q+ improves Eigent’s browser agent accuracy across four benchmarks (SimpleQA-Verified, FRAMES, WebWalkerQA, X-Bench DeepSearch) by 3.0, 3.8, and 0.6 percentage points for GPT-4.1, GPT-5.1, and Minimax M2.5 backends respectively. Case studies show more coherent tool-calling trajectories.
Conclusion: Structured query and evidence processing tools (Q+) significantly improve web research agents’ performance by making search progress and evidence handling explicit, leading to more deliberate and effective information retrieval.
Abstract: Deep research requires reasoning over web evidence to answer open-ended questions, and it is a core capability for AI agents. Yet many deep research agents still rely on implicit, unstructured search behavior that causes redundant exploration and brittle evidence aggregation. Motivated by Anthropic’s “think” tool paradigm and insights from the information-retrieval literature, we introduce Q+, a set of query and evidence processing tools that make web search more deliberate by guiding query planning, monitoring search progress, and extracting evidence from long web snapshots. We integrate Q+ into the browser sub-agent of Eigent, an open-source, production-ready multi-agent workforce for computer use, yielding EigentSearch-Q+. Across four benchmarks (SimpleQA-Verified, FRAMES, WebWalkerQA, and X-Bench DeepSearch), Q+ improves Eigent’s browser agent benchmark-size-weighted average accuracy by 3.0, 3.8, and 0.6 percentage points (pp) for GPT-4.1, GPT-5.1, and Minimax M2.5 model backends, respectively. Case studies further suggest that EigentSearch-Q+ produces more coherent tool-calling trajectories by making search progress and evidence handling explicit.
[351] MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems
Arda Yüksel, Gabriel Thiem, Susanne Walter, Patrick Felka, Gabriela Alves Werb, Ivan Habernal
Main category: cs.AI
TL;DR: MONETA: A multimodal benchmark for industry classification using text (websites, Wikipedia, Wikidata) and geospatial data (OpenStreetMap, satellite imagery) with 1,000 European businesses across 20 economic activity labels.
Details
Motivation: Manual industry classification is costly and requires significant data collection for updates. The paper aims to replicate expert verification using existing multimodal resources to automate industry classification.Method: Created MONETA benchmark with multimodal sources (text from websites/Wikipedia/Wikidata and geospatial from OpenStreetMap/satellite imagery). Used training-free baselines with MLLMs, enhanced with multi-turn design, context enrichment, and classification explanations.
Result: Training-free baselines achieved 62.10% (open-source MLLMs) and 74.10% (closed-source MLLMs). Performance improved up to 22.80% with multi-turn design, context enrichment, and explanation techniques.
Conclusion: Multimodal approaches can effectively automate industry classification, reducing manual annotation costs. The benchmark and enhanced guidelines will be publicly released.
Abstract: Industry classification schemes are integral parts of public and corporate databases as they classify businesses based on economic activity. Due to the size of the company registers, manual annotation is costly, and fine-tuning models with every update in industry classification schemes requires significant data collection. We replicate the manual expert verification by using existing or easily retrievable multimodal resources for industry classification. We present MONETA, the first multimodal industry classification benchmark with text (Website, Wikipedia, Wikidata) and geospatial sources (OpenStreetMap and satellite imagery). Our dataset enlists 1,000 businesses in Europe with 20 economic activity labels according to EU guidelines (NACE). Our training-free baseline reaches 62.10% and 74.10% with open and closed-source Multimodal Large Language Models (MLLM). We observe an increase of up to 22.80% with the combination of multi-turn design, context enrichment, and classification explanations. We will release our dataset and the enhanced guidelines.
[352] ASPECT:Analogical Semantic Policy Execution via Language Conditioned Transfer
Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana
Main category: cs.AI
TL;DR: Using LLMs as semantic operators to enable zero-shot transfer in RL by remapping novel task descriptions to align with source task knowledge.
Details
Motivation: RL agents struggle with generalization to novel tasks, even when structurally similar to mastered ones. Existing zero-shot transfer methods are limited by predefined discrete class systems, restricting adaptability to truly novel or compositional task variations.Method: Replace discrete latent variables with natural language conditioning via text-conditioned VAE. Use LLM as dynamic semantic operator at test time to remap current observation descriptions to align with source task. Source-aligned caption conditions VAE to generate imagined state compatible with original training, enabling direct policy reuse.
Result: Achieves zero-shot transfer across broad spectrum of complex and truly novel analogous tasks, moving beyond limitations of fixed category mappings.
Conclusion: LLMs’ flexible reasoning capabilities enable more generalized zero-shot transfer in RL by semantically remapping task descriptions rather than relying on rigid rules or predefined categories.
Abstract: Reinforcement Learning (RL) agents often struggle to generalize knowledge to new tasks, even those structurally similar to ones they have mastered. Although recent approaches have attempted to mitigate this issue via zero-shot transfer, they are often constrained by predefined, discrete class systems, limiting their adaptability to novel or compositional task variations. We propose a significantly more generalized approach, replacing discrete latent variables with natural language conditioning via a text-conditioned Variational Autoencoder (VAE). Our core innovation utilizes a Large Language Model (LLM) as a dynamic \textit{semantic operator} at test time. Rather than relying on rigid rules, our agent queries the LLM to semantically remap the description of the current observation to align with the source task. This source-aligned caption conditions the VAE to generate an imagined state compatible with the agent’s original training, enabling direct policy reuse. By harnessing the flexible reasoning capabilities of LLMs, our approach achieves zero-shot transfer across a broad spectrum of complex and truly novel analogous tasks, moving beyond the limitations of fixed category mappings. Code and videos are available \href{https://anonymous.4open.science/r/ASPECT-85C3/}{here}.
[353] Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos
Haoyu Zhang, Shihao Zhang, Ian Colbert, Rayan Saab
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2508.04853: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04853&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[354] Investigating Multimodal Large Language Models to Support Usability Evaluation
Sebastian Lubos, Alexander Felfernig, Damian Garber, Gerhard Leitner, Julian Schwazer, Manuel Henrich
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2508.16165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[355] AR-KAN: Autoregressive-Weight-Enhanced Kolmogorov-Arnold Network for Time Series Forecasting
Chen Zeng, Tiehang Xu, Qiao Wang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2509.02967 appears to be from September 2025, suggesting recent research in multimodal AI.
Details
Motivation: Cannot determine motivation without access to paper content. Based on the arXiv ID format (2509.02967), this appears to be a recent paper from September 2025, potentially related to multimodal AI research.Method: Method unknown due to HTTP 429 error preventing access to paper content. The arXiv API rate limiting prevents retrieval of abstract and details.
Result: Results cannot be determined without access to the paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests.
Conclusion: Unable to analyze paper due to technical limitations. The arXiv API rate limiting prevents proper analysis of this recent multimodal AI research paper.
Abstract: Failed to fetch summary for 2509.02967: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.02967&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[356] STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting
Hao Chen, Tao Han, Jie Zhang, Song Guo, Lei Bai
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.25210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[357] On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs
Rongguang Ye, Ming Tang, Edith C. H. Ngai
Main category: cs.AI
TL;DR: Unable to analyze paper 2509.25214 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract could not be retrievedMethod: Cannot determine method as abstract could not be retrieved
Result: Cannot determine results as abstract could not be retrieved
Conclusion: Cannot draw conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.25214: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25214&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[358] Traj2Action: A Co-Denoising Framework for Trajectory-Guided Human-to-Robot Skill Transfer
Han Zhou, Jinjin Cao, Liyuan Ma, Xueji Fang, Guo-jun Qi
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API access issuesMethod: Unable to determine method due to API access issues
Result: Unable to determine results due to API access issues
Conclusion: Unable to analyze paper content due to technical limitations in accessing arXiv data
Abstract: Failed to fetch summary for 2510.00491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[359] RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation
Yuquan Xue, Guanxing Lu, Zhenyu Wu, Chuanrui Zhang, Bofang Jia, Zhengyi Gu, Ziwei Wang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2510.17640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[360] How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison
Taha Yasseri, Saeedeh Mohammadi
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.26899: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26899&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[361] EGMOF: Efficient Generation of Metal-Organic Frameworks Using a Hybrid Diffusion-Transformer Architecture
Seunghee Han, Yeonghun Kang, Taeun Bae, Junho Kim, Younghun Kim, Varinia Bernales, Alan Aspuru-Guzik, Jihan Kim
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to failed paper retrievalMethod: Cannot determine method due to failed paper retrieval
Result: Cannot determine results due to failed paper retrieval
Conclusion: Cannot draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2511.03122: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03122&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[362] Evolutionary Optimization Trumps Adam Optimization on Embedding Space Exploration
Domício Pereira Neto, João Correia, Penousal Machado
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2511.03913: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03913&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[363] Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
Zhirui Liu, Kaiyang Ji, Ke Yang, Jingyi Yu, Ye Shi, Jingya Wang
Main category: cs.AI
TL;DR: Paper 2511.22963 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2511.22963: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22963&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[364] From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to try again later or use alternative methods
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2512.02826: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02826&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[365] The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs
Zibo Zhao, Yuanting Zha, Haipeng Zhang, Xingcheng Xu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access errorMethod: Cannot determine method due to access error
Result: Cannot determine results due to access error
Conclusion: Cannot determine conclusion due to access error
Abstract: Failed to fetch summary for 2601.01580: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.01580&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[366] Screen, Cache, and Match: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
Jianan Wang, Nailei Hei, Li He, Huanzhen Wang, Aoxing Li, Yingkai Zhao, Yuxuan Lin, Haofen Wang, Chunyang Wang, Yan Wang, Wenqiang Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.22160: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22160&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[367] On the Limits of Layer Pruning for Generative Reasoning in Large Language Models
Safal Shrestha, Anubhav Shrestha, Aadim Nepal, Minwu Kim, Keith Ross
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.01997: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01997&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[368] Exploring Teachers’ Perspectives on Using Conversational AI Agents for Group Collaboration
Prerna Ravi, Carúmey Stevens, Beatriz Flamia Azevedo, Jasmine David, Brandon Hanks, Hal Abelson, Grace Lin, Emma Anderson
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.07142: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07142&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[369] An Adaptive Model Selection Framework for Demand Forecasting under Horizon-Induced Degradation to Support Business Strategy and Operations
Adolfo González, Víctor Parada
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.13939: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13939&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[370] SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework
Rong Fu, Zijian Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2602.17330: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17330&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[371] Reinforcement-aware Knowledge Distillation for LLM Reasoning
Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting errorMethod: Unable to determine method due to API rate limiting error
Result: Unable to determine results due to API rate limiting error
Conclusion: Unable to determine conclusion due to API rate limiting error
Abstract: Failed to fetch summary for 2602.22495: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22495&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[372] Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Ruinan Jin, Yingbin Liang, Shaofeng Zou
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access limitationsMethod: Cannot determine method due to access limitations
Result: Cannot determine results due to access limitations
Conclusion: Cannot determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2603.03099: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03099&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[373] Memory-efficient Continual Learning with Prototypical Exemplar Condensation
Minh-Duong Nguyen, Thien-Thanh Dao, Le-Tuan Nguyen, Dung D. Le, Kok-Seng Wong
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.13804: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13804&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[374] Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
Zhexi Lian, Haoran Wang, Xuerun Yan, Weimeng Lin, Xianhong Zhang, Yongyu Chen, Jia Hu
Main category: cs.AI
TL;DR: Paper ID 2603.13842 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2603.13842: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13842&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[375] You’ve Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
Omkar Patil, Ondrej Biza, Thomas Weng, Karl Schmeckpeper, Wil Thomason, Xiaohan Zhang, Robin Walters, Nakul Gopalan, Sebastian Castro, Eric Rosen
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2603.15757: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15757&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[376] Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
Haochuan Kevin Wang, Zechen Zhang
Main category: cs.AI
TL;DR: Paper 2603.28013 summary could not be fetched due to HTTP 429 (rate limiting) error from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unknown - paper content not accessible due to technical limitations
Result: No results available - paper information retrieval failed
Conclusion: Cannot analyze paper due to arXiv API rate limiting preventing access to abstract and content
Abstract: Failed to fetch summary for 2603.28013: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28013&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[377] Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications
Zequn Chen, Wesley J. Marrero
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation: Unable to determine motivation due to fetch failure.Method: Unable to determine method due to fetch failure.
Result: Unable to determine results due to fetch failure.
Conclusion: Unable to determine conclusion due to fetch failure.
Abstract: Failed to fetch summary for 2604.04334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[378] Explorable Theorems: Making Written Theorems Explorable by Grounding Them in Formal Representations
Hita Kambhamettu, Will Crichton, Sean Welleck, Harrison Goldstein, Andrew Head
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API fetch failureMethod: Unable to determine method due to API fetch failure
Result: Unable to determine results due to API fetch failure
Conclusion: Unable to analyze paper due to technical issues with arXiv API access
Abstract: Failed to fetch summary for 2604.02598: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02598&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[379] From Paper to Program: Accelerating Quantum Many-Body Algorithm Development via a Multi-Stage LLM-Assisted Workflow
Yi Zhou
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2604.04089: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04089&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[380] ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
Jingwei Zuo, Xinze Feng, Zien Liu, Kaijian Wang, Fanjiang Ye, Ye Cao, Zhuang Wang, Yuke Wang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2604.05426: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05426&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[381] Governed Capability Evolution for Embodied Agents: Safe Upgrade, Compatibility Checking, and Runtime Rollback for Embodied Capability Modules
Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2604.08059: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08059&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[382] QARIMA: A Quantum Approach To Classical Time Series Analysis
Nishikanta Mohanty, Bikash K. Behera, Badshah Mukherjee, Pravat Dash
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2604.08277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[383] Script Collapse in Multilingual ASR: Defining and Measuring Script Fidelity Rate
Hanif Rahman
Main category: cs.SD
TL;DR: The paper introduces Script Fidelity Rate (SFR) as a new metric to detect script collapse in ASR models, where models produce fluent output in the wrong writing system, and systematically measures this failure across multiple languages and models.
Details
Motivation: Word Error Rate (WER) fails to detect systematic failure modes in ASR where models produce fluent output in the wrong writing system (script collapse). This is a critical issue for multilingual ASR deployment but remains unmeasured in existing literature.Method: Define Script Fidelity Rate (SFR) - the fraction of hypothesis characters in the target script block, computable without reference transcriptions. Evaluate across six languages spanning four writing systems (Pashto, Urdu, Hindi, Bengali, Malayalam, Somali) and nine ASR models on FLEURS test sets.
Result: Across 53 model-language pairs, 34% exhibit script collapse (SFR < 10%). MMS-1B and SeamlessM4T-v2 maintain SFR above 99% on every language. Three collapse patterns identified: Latin phonetic substitution (smaller Whisper on Indic), Arabic substitution for Somali’s Latin-script, and Devanagari substitution where larger Whisper models treat all Indic audio as Hindi.
Conclusion: Script collapse is a widespread but previously unmeasured failure mode in ASR. SFR provides a simple, reference-free metric to detect this issue. Some models (MMS-1B, SeamlessM4T-v2) demonstrate robust script fidelity, while others (including Whisper large-v3) show systematic failures.
Abstract: Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target script block, computable without reference transcriptions, and report the first systematic measurement of script collapse across six languages spanning four writing systems (Pashto, Urdu, Hindi, Bengali, Malayalam, Somali) and nine ASR models on FLEURS test sets. Across 53 evaluated model-language pairs, 18 (34%; 95% Wilson CI: 23-47%) exhibit script collapse (SFR < 10%); MMS-1B and SeamlessM4T-v2 maintain SFR above 99% on every language evaluated, confirming that SFR correctly identifies high fidelity where it is present. We identify three distinct collapse patterns: Latin phonetic substitution (smaller Whisper on Indic languages), Arabic substitution for Somali’s Latin-script orthography, and Devanagari substitution where larger Whisper models treat all Indic audio as Hindi, a failure present even in Whisper large-v3.
[384] AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models
Mintong Kang, Chen Fang, Bo Li
Main category: cs.SD
TL;DR: AudioSafetyBench: First policy-based audio safety benchmark addressing unique audio risks like harmful sound events, speaker attributes, voice cloning, and voice-content compositional harms, with AudioGuard as a unified guardrail solution.
Details
Motivation: Audio safety is more complex than just "unsafe text spoken aloud" - real-world risks include audio-native harmful sound events, speaker attributes (child voice), impersonation/voice-cloning misuse, and voice-content compositional harms. Current benchmarks and guardrails are inadequate for this unique risk landscape.Method: 1) Conduct large-scale red teaming on audio systems to systematically uncover vulnerabilities; 2) Develop comprehensive, policy-grounded audio risk taxonomy; 3) Create AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models; 4) Propose AudioGuard with SoundGuard (waveform-level audio-native detection) and ContentGuard (policy-grounded semantic protection).
Result: AudioSafetyBench supports diverse languages, suspicious voices (celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency across AudioSafetyBench and four complementary benchmarks.
Conclusion: The paper addresses critical gaps in audio safety for foundation models by providing comprehensive benchmarks and effective guardrail solutions that handle the unique complexities of audio risks beyond just text-to-speech safety.
Abstract: Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just “unsafe text spoken aloud”: real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.
[385] AudioGS: Spectrogram-Based Audio Gaussian Splatting for Sound Field Reconstruction
Chunhao Bi, Houqiang Zhong, Zhixin Xu, Li Song, Zhengxue Cheng
Main category: cs.SD
TL;DR: AudioGS: A visual-free framework for high-fidelity binaural audio synthesis using explicit Audio Gaussian representations based on spectrograms, outperforming visual-dependent methods.
Details
Motivation: Spatial audio is crucial for immersive VR experiences, but synthesizing high-fidelity binaural audio from sparse observations is challenging. Existing visual-conditioned methods struggle with fine-grained acoustic structures.Method: Inspired by 3D Gaussian Splatting, AudioGS explicitly encodes sound fields as Audio Gaussians based on spectrograms. Each time-frequency bin gets an Audio Gaussian with dual Spherical Harmonic coefficients and decay coefficient. For target poses, it renders binaural audio by evaluating SH fields for directionality, incorporating geometry-guided distance attenuation and phase correction, then reconstructing waveforms.
Result: On Replay-NVAS dataset, AudioGS captures complex spatial cues and outperforms state-of-the-art visual-dependent baselines: reduces magnitude reconstruction error (MAG) by over 14% and perceptual quality metric (DPAM) by ~25% compared to best visual-guided method.
Conclusion: AudioGS demonstrates that explicit Gaussian-based representations can effectively model spatial audio without visual priors, achieving superior performance over visual-dependent approaches for binaural audio synthesis.
Abstract: Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often struggle to capture fine-grained acoustic structures. Inspired by 3D Gaussian Splatting (3DGS), we introduce AudioGS, a novel visual-free framework that explicitly encodes the sound field as a set of Audio Gaussians based on spectrograms. AudioGS associates each time-frequency bin with an Audio Gaussian equipped with dual Spherical Harmonic (SH) coefficients and a decay coefficient. For a target pose, we render binaural audio by evaluating the SH field to capture directionality, incorporating geometry-guided distance attenuation and phase correction, and reconstructing the waveform. Experiments on the Replay-NVAS dataset demonstrate that AudioGS successfully captures complex spatial cues and outperforms state-of-the-art visual-dependent baselines. Specifically, AudioGS reduces the magnitude reconstruction error (MAG) by over 14% and reduces the perceptual quality metric (DPAM) by approximately 25% compared to the best performing visual-guided method.
[386] Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
Qixuan Huang, Khalid Zaman, Masashi Unoki
Main category: cs.SD
TL;DR: A plug-and-play Noise-Aware In-Context Learning method to reduce hallucinations in auditory large language models for audio captioning tasks, with a new hallucination benchmark dataset and evaluation metrics.
Details
Motivation: Auditory LLMs suffer from hallucination issues in audio understanding tasks, but existing evaluation methods are binary and insufficient for complex generative tasks, while mitigation strategies require expensive fine-tuning.Method: Proposes Noise-Aware In-Context Learning (NAICL) - constructs noise prior library, retrieves relevant noise examples as contextual priors to guide models to reduce speculative associations when acoustic evidence is insufficient and adopt conservative generation.
Result: All evaluated ALLMs exhibit same hallucination behaviors. NAICL reduces overall hallucination rate from 26.53% to 16.98%. Also establishes Clotho-1K multi-event benchmark dataset with four hallucination types and fine-grained metrics.
Conclusion: NAICL effectively mitigates hallucinations in auditory LLMs without fine-tuning, and the new benchmark enables comprehensive evaluation of hallucination patterns in audio captioning tasks.
Abstract: Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio caption tasks including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
[387] AccompGen: Hierarchical Autoregressive Vocal Accompaniment Generation with Dual-Rate Codec Tokenization
Jian Zhu, Jianwei Cui, Shihao Chen, Yubang Zhang, Cheng Luo
Main category: cs.SD
TL;DR: AccompGen generates instrumental accompaniment for input vocals using a hierarchical autoregressive transformer with dual-rate tokenization.
Details
Motivation: To create a system that can generate coherent instrumental music audio to accompany isolated singing vocals, enabling complete music creation from voice alone.Method: Three key innovations: (1) dual-rate codec tokenization using HuBERT semantic tokens for vocals (50Hz) and EnCodec acoustic tokens for instrumentals (75Hz); (2) three-stage hierarchical autoregressive architecture (semantic→coarse acoustic→fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias.
Result: The system produces instrumental accompaniment that can be directly mixed with input vocals to create complete music.
Conclusion: AccompGen presents an effective approach for audio-based music generation that focuses on accompaniment generation for vocals, with innovations in tokenization and architecture design.
Abstract: We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50,Hz for vocals and EnCodec acoustic tokens at 75,Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization.
[388] DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio
Wataru Nakata, Yuki Saito, Kazuki Yamauchi, Emiru Tsunoo, Hiroshi Saruwatari
Main category: cs.SD
TL;DR: DialogueSidon is a model for joint restoration and separation of degraded monaural two-speaker dialogue audio using VAE compression of SSL features and diffusion-based latent prediction.
Details
Motivation: Full-duplex dialogue audio with separate speaker tracks is valuable for spoken dialogue research but difficult to collect at scale. Most real-world two-speaker dialogue exists only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals.Method: Combines a variational autoencoder (VAE) operating on speech self-supervised learning (SSL) model features to compress them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from degraded mixtures.
Result: Experiments on English, multilingual, and in-the-wild dialogue datasets show DialogueSidon substantially improves intelligibility and separation quality over baselines while achieving much faster inference.
Conclusion: DialogueSidon provides an effective solution for joint restoration and separation of degraded monaural dialogue audio, making in-the-wild dialogue data more usable for research requiring clean speaker-wise signals.
Abstract: Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) operates on the speech self-supervised learning (SSL) model feature, which compresses SSL model features into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
[389] Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages
Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi
Main category: cs.SD
TL;DR: CLAP-based audio representations enable cross-lingual abusive speech detection directly from audio, achieving competitive performance with lightweight adaptation in low-resource settings across ten Indic languages.
Details
Motivation: Current abusive speech detection systems rely on ASR + text classification pipelines that are vulnerable to transcription errors and discard prosodic information. The paper investigates whether contrastive audio-text models can support direct audio-based detection, especially in multilingual low-resource settings.Method: Uses Contrastive Language-Audio Pre-training (CLAP) representations evaluated on ADIMA dataset. Tests few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting as auxiliary analysis. Employs lightweight projection-only adaptation.
Result: CLAP yields strong cross-lingual audio representations across ten Indic languages. Lightweight projection-only adaptation achieves competitive performance with fully supervised systems trained on complete data. Benefits of few-shot adaptation are language-dependent and not monotonic with shot size.
Conclusion: Contrastive audio-text models provide promising basis for cross-lingual audio abuse detection in low-resource settings, but transfer remains incomplete and language-specific in important ways.
Abstract: Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.
[390] LatentFlowSR: High-Fidelity Audio Super-Resolution via Noise-Robust Latent Flow Matching
Fei Liu, Yang Ai, Hui-Peng Du, Yu-Fei Shi, Zhen-Hua Ling
Main category: cs.SD
TL;DR: LatentFlowSR: A latent-space conditional flow matching approach for audio super-resolution that works across speech, sound effects, and music by operating in a compressed latent representation space rather than waveform or time-frequency domains.
Details
Motivation: Existing audio super-resolution methods operate directly in waveform or time-frequency domains, which involves high-dimensional generation spaces and is largely limited to speech tasks, leaving room for improvement on more complex audio types like sound effects and music.Method: 1) Train a noise-robust autoencoder to encode low-resolution audio into continuous latent space; 2) Use conditional flow matching (CFM) to progressively generate high-resolution latent representation from Gaussian prior conditioned on low-resolution latent; 3) Decode with pretrained autoencoder to reconstruct high-resolution audio.
Result: LatentFlowSR consistently outperforms baseline methods across various audio types and super-resolution settings, demonstrating strong high-frequency reconstruction capability and robust generalization performance.
Conclusion: The method provides compelling evidence for the effectiveness of latent-space modeling in audio super-resolution, with strong generalization across different audio types.
Abstract: Audio super-resolution aims to recover missing high-frequency details from bandwidth-limited low-resolution audio, thereby improving the naturalness and perceptual quality of the reconstructed signal. However, most existing methods directly operate in the waveform or time-frequency domain, which not only involves high-dimensional generation spaces but is also largely limited to speech tasks, leaving substantial room for improvement on more complex audio types such as sound effects and music. To mitigate these limitations, we introduce LatentFlowSR, a new audio super-resolution approach that leverages conditional flow matching (CFM) within a latent representation space. Specifically, we first train a noise-robust autoencoder, which encodes low-resolution audio into a continuous latent space. Conditioned on the low-resolution latent representation, a CFM mechanism progressively generates the corresponding high-resolution latent representation from a Gaussian prior with a one-step ordinary differential equation (ODE) solver. The resulting high-resolution latent representation is then decoded by the pretrained autoencoder to reconstruct the high-resolution audio. Experimental results demonstrate that LatentFlowSR consistently outperforms baseline methods across various audio types and super-resolution settings. These results indicate that the proposed method possesses strong high-frequency reconstruction capability and robust generalization performance, providing compelling evidence for the effectiveness of latent-space modeling in audio super-resolution. All relevant code will be made publicly available upon completion of the paper review process.
[391] Music Audio-Visual Question Answering Requires Specialized Multimodal Designs
Wenhao You, Xingjian Diao, Wenjun Huang, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Tingxuan Wu, Ming Cheng, Soroush Vosoughi, Jiang Gui
Main category: cs.SD
TL;DR: Survey paper analyzing Music Audio-Visual Question Answering (Music AVQA) challenges and specialized approaches needed for multimodal music understanding.
Details
Motivation: While general multimodal LLMs show impressive capabilities, specialized domains like music require tailored approaches due to unique challenges including continuous audio-visual content, intricate temporal dynamics, and domain-specific knowledge requirements.Method: Systematic analysis of Music AVQA datasets and methods, identifying critical components including specialized input processing, architectures with spatial-temporal designs, and music-specific modeling strategies.
Result: Identifies effective design patterns empirically linked to strong performance, provides insights for researchers, and proposes future directions for incorporating musical priors to advance multimodal musical understanding.
Conclusion: Specialized approaches are essential for music multimodal understanding, and the survey establishes a foundation for advancing Music AVQA research while encouraging further work in this area.
Abstract: While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. We aim to encourage further research in this area and provide a GitHub repository of relevant works: https://github.com/WenhaoYou1/Survey4MusicAVQA.
[392] GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
Yunqiang Wang, Hengyuan Na, Di Wu, Miao Hu, Guocong Quan
Main category: cs.SD
TL;DR: GRM: A utility-aware frequency-selective jailbreak framework for audio large language models that selectively perturbs Mel frequency bands to balance attack effectiveness with utility preservation.
Details
Motivation: Existing audio jailbreak methods for ALLMs focus on attack success but neglect utility preservation (transcription quality and QA performance), creating a trade-off where stronger attacks degrade utility. The authors investigate whether selective frequency perturbation can achieve better attack-utility balance.Method: GRM ranks Mel frequency bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns reusable universal perturbations under semantic-preservation objectives. The approach varies perturbation coverage from partial-band to full-band to study trade-offs.
Result: Experiments on four representative ALLMs show GRM achieves average Jailbreak Success Rate of 88.46% while providing better attack-utility trade-off than baselines. Broader frequency coverage doesn’t necessarily improve jailbreak performance but consistently degrades utility.
Conclusion: Frequency-selective perturbation can better balance attack effectiveness and utility preservation in audio jailbreaks. Concentrating perturbations on a subset of frequency bands yields superior trade-offs compared to indiscriminate full-band coverage.
Abstract: Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcription quality and question answering performance. In practice, stronger attacks often come at the cost of degraded utility. To study this trade-off, we revisit existing attacks by varying their perturbation coverage in the frequency domain, from partial-band to full-band, and find that broader frequency coverage does not necessarily improve jailbreak performance, while utility consistently deteriorates. This suggests that concentrating perturbation on a subset of bands can yield a better attack-utility trade-off than indiscriminate full-band coverage. Based on this insight, we propose GRM, a utility-aware frequency-selective jailbreak framework. It ranks Mel bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns a reusable universal perturbation under a semantic-preservation objective. Experiments on four representative ALLMs show that GRM achieves an average Jailbreak Success Rate (JSR) of 88.46% while providing a better attack-utility trade-off than representative baselines. These results highlight the potential of frequency-selective perturbation for better balancing attack effectiveness and utility preservation in audio jailbreak. Content Warning: This paper includes harmful query examples and unsafe model responses.
[393] DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech
Suhita Ghosh, Yamini Sinha, Sebastian Stober
Main category: cs.SD
TL;DR: Improved DDSP-QbE voice conversion by adding voicing detection and PolyBLEP correction to reduce aliasing artifacts and buzziness in synthesized audio.
Details
Motivation: DDSP-QbE voice conversion suffers from aliasing artifacts and buzziness due to abrupt discontinuities in phase-accumulated excitation signals, especially at higher fundamental frequencies.Method: Two improvements: 1) Explicit voicing detection to gate harmonic excitation and use filtered noise in unvoiced regions, 2) PolyBLEP correction to smooth phase wrap discontinuities and reduce aliasing.
Result: Cleaner harmonic roll-off, reduced high-frequency artifacts, improved perceptual naturalness measured by MOS, while remaining lightweight and differentiable with no additional parameters.
Conclusion: The proposed modifications effectively address aliasing issues in DDSP-QbE voice conversion, improving audio quality while maintaining compatibility with existing training pipelines.
Abstract: Differentiable Digital Signal Processing (DDSP) pipelines for voice conversion rely on subtractive synthesis, where a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. In DDSP-QbE, the excitation is generated via phase accumulation, producing a sawtooth-like waveform whose abrupt discontinuities introduce aliasing artefacts that manifest perceptually as buzziness and spectral distortion, particularly at higher fundamental frequencies. We propose two targeted improvements to the excitation stage of the DDSP-QbE subtractive synthesizer. First, we incorporate explicit voicing detection to gate the harmonic excitation, suppressing the periodic component in unvoiced regions and replacing it with filtered noise, thereby avoiding aliased harmonic content where it is most perceptually disruptive. Second, we apply Polynomial Band-Limited Step (PolyBLEP) correction to the phase-accumulated oscillator, substituting the hard waveform discontinuity at each phase wrap with a smooth polynomial residual that cancels alias-generating components without oversampling or spectral truncation. Together, these modifications yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS. The proposed approach is lightweight, differentiable, and integrates seamlessly into the existing DDSP-QbE training pipeline with no additional learnable parameters.
[394] DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos
Ziyu Luo, Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen
Main category: cs.SD
TL;DR: DynFOA: A generative framework that synthesizes first-order ambisonics (FOA) spatial audio from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling.
Details
Motivation: Most 360-degree videos lack spatial audio due to capture difficulties, and existing methods fail to model dynamic sources and acoustic effects like occlusion, reflections, and reverberation that depend on scene geometry and materials.Method: Analyzes input video to detect/localize dynamic sound sources, estimate depth/semantics, reconstruct scene geometry/materials using 3D Gaussian Splatting, then uses these physically-grounded features to condition a diffusion model for spatial audio generation.
Result: Outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience on the M2G-360 dataset (600 real-world clips with MoveSources, Multi-Source, and Geometry subsets).
Conclusion: DynFOA successfully integrates dynamic scene reconstruction with generative modeling to produce realistic spatial audio that accounts for acoustic interactions between sources, environment, and listener viewpoint.
Abstract: Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.
cs.LG
[395] GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback
Ruiyao Xu, Kaize Ding
Main category: cs.LG
TL;DR: GNN-as-Judge framework combines LLMs and GNNs for few-shot semi-supervised learning on text-attributed graphs by using GNNs to generate reliable pseudo-labels for LLM fine-tuning in low-resource settings.
Details
Motivation: LLMs perform well on text-attributed graphs but struggle in low-resource settings where labeled data is scarce, as fine-tuning requires sufficient labeled data, especially when graphs have complex structural patterns.Method: Proposes GNN-as-Judge framework with collaborative pseudo-labeling that identifies influenced unlabeled nodes and leverages agreement/disagreement patterns between LLMs and GNNs to generate reliable labels, plus weakly-supervised LLM fine-tuning to mitigate label noise.
Result: Experiments on multiple TAG datasets show GNN-as-Judge significantly outperforms existing methods, especially in low-resource regimes with scarce labeled data.
Conclusion: The framework successfully addresses challenges of generating reliable pseudo-labels and mitigating label noise when fine-tuning LLMs on text-attributed graphs in few-shot settings.
Abstract: Large Language Models (LLMs) have shown strong performance on text-attributed graphs (TAGs) due to their superior semantic understanding ability on textual node features. However, their effectiveness as predictors in the low-resource setting, where labeled nodes are severely limited and scarce, remains constrained since fine-tuning LLMs usually requires sufficient labeled data, especially when the TAG shows complex structural patterns. In essence, this paper targets two key challenges: (i) the difficulty of generating and selecting reliable pseudo labels on TAGs for LLMs, and (ii) the need to mitigate potential label noise when fine-tuning LLMs with pseudo labels. To counter the challenges, we propose a new framework, GNN-as-Judge, which can unleash the power of LLMs for few-shot semi-supervised learning on TAGs by incorporating the structural inductive bias of Graph Neural Networks (GNNs). Specifically, GNN-as-Judge introduces a collaborative pseudo-labeling strategy that first identifies the most influenced unlabeled nodes from labeled nodes, then exploits both the agreement and disagreement patterns between LLMs and GNNs to generate reliable labels. Furthermore, we develop a weakly-supervised LLM fine-tuning algorithm that can distill the knowledge from informative pseudo labels while mitigating the potential label noise. Experiments on multiple TAG datasets demonstrate that GNN-as-Judge significantly outperforms existing methods, particularly in low-resource regimes where labeled data are scarce.
[396] Memory-Guided Trust-Region Bayesian Optimization (MG-TuRBO) for High Dimensions
Abhilasha Saroj, Shaked Regev, Guanhao Xu, Jinghui Yuan, Roy Luo, Ross Wang
Main category: cs.LG
TL;DR: Comparison of Bayesian optimization methods vs genetic algorithm for traffic simulation calibration, showing BOMs outperform GA, especially MG-TuRBO with adaptive strategy for high-dimensional problems.
Details
Motivation: Traffic simulation calibration is computationally expensive with limited simulation budget, nonconvex noisy relationships, and becomes harder with more parameters. Need efficient optimization methods for this challenging problem.Method: Compare genetic algorithm (GA) with Bayesian optimization methods (BO, TuRBO, Multi-TuRBO, and proposed Memory-Guided TuRBO) on 14D and 84D real-world traffic calibration problems. Test Thompson sampling and novel adaptive acquisition strategies.
Result: BOMs reach good calibration targets much faster than GA in 14D problem. MG-TuRBO performs comparably in 14D but shows advantages in 84D, especially with adaptive strategy. MG-TuRBO useful for high-dimensional calibration.
Conclusion: MG-TuRBO with adaptive strategy is effective for high-dimensional traffic simulation calibration and potentially other high-D optimization problems.
Abstract: Traffic simulation and digital-twin calibration is a challenging optimization problem with a limited simulation budget. Each trial requires an expensive simulation run, and the relationship between calibration inputs and model error is often nonconvex, and noisy. The problem becomes more difficult as the number of calibration parameters increases. We compare a commonly used automatic calibration method, a genetic algorithm (GA), with Bayesian optimization methods (BOMs): classical Bayesian optimization (BO), Trust-Region BO (TuRBO), Multi-TuRBO, and a proposed Memory-Guided TuRBO (MG-TuRBO) method. We compare performance on 2 real-world traffic simulation calibration problems with 14 and 84 decision variables, representing lower- and higher-dimensional (14D and 84D) settings. For BOMs, we study two acquisition strategies, Thompson sampling and a novel adaptive strategy. We evaluate performance using final calibration quality, convergence behavior, and consistency across runs. The results show that BOMs reach good calibration targets much faster than GA in the lower-D problem. MG-TuRBO performs comparably in our 14D setting, it demonstrates noticeable advantages in the 84D problem, particularly when paired with our adaptive strategy. Our results suggest that MG-TuRBO is especially useful for high-D traffic simulation calibration and potentially for high-D problems in general.
[397] QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Main category: cs.LG
TL;DR: QuanBench+ is a unified benchmark for evaluating quantum code generation across multiple frameworks (Qiskit, PennyLane, Cirq) with 42 aligned tasks, showing current models perform moderately but improve significantly with feedback-based repair.
Details
Motivation: Current quantum code generation evaluation is limited to single frameworks, making it difficult to distinguish between genuine quantum reasoning and framework-specific knowledge. There's a need for a unified benchmark to properly assess model capabilities across different quantum programming environments.Method: Created QuanBench+ with 42 aligned tasks across Qiskit, PennyLane, and Cirq covering quantum algorithms, gate decomposition, and state preparation. Evaluated models using executable functional tests with Pass@1 and Pass@5 metrics, plus KL-divergence-based acceptance for probabilistic outputs. Also studied feedback-based repair where models can revise code after runtime errors or wrong answers.
Result: Best one-shot scores: 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane. With feedback-based repair, scores improved to 83.3%, 76.2%, and 66.7% respectively. Results show clear progress but also highlight that reliable multi-framework quantum code generation remains unsolved and heavily dependent on framework-specific knowledge.
Conclusion: While LLMs show promising capabilities in quantum code generation, significant framework-specific knowledge gaps remain. Feedback-based repair substantially improves performance, but achieving reliable multi-framework quantum code generation requires better abstraction of quantum concepts from framework implementation details.
Abstract: Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
[398] Robust Reasoning Benchmark
Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey
Main category: cs.LG
TL;DR: Paper evaluates LLM reasoning robustness through 14 perturbation techniques on math problems, revealing catastrophic failures in open-weight models and working memory pollution issues in dense attention mechanisms.
Details
Motivation: LLMs perform well on standard math benchmarks but their reasoning processes are overfit to textual formatting. The authors want to evaluate the robustness of LLM reasoning under various perturbations and understand how working memory affects reasoning quality.Method: Proposed a perturbation pipeline with 14 techniques to evaluate LLM reasoning robustness on AIME 2024 dataset. Also tested working memory capacity by forcing models to solve multiple unperturbed math problems sequentially in a single context window.
Result: Frontier models show resilience but open-weight reasoning models suffer catastrophic collapses (up to 55% average accuracy drops, up to 100% on some perturbations). Models from 7B to 120B parameters and Claude Opus show accuracy decay on subsequent problems, indicating intermediate reasoning steps permanently pollute dense attention mechanisms.
Conclusion: To achieve reliable reasoning, future architectures must integrate explicit contextual resets within Chain-of-Thought, raising fundamental questions about optimal granularity of atomic reasoning tasks.
Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate robustness of LLM reasoning. We apply this pipeline to AIME 2024 dataset and evalute 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open weights reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models’ working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open weight models ranging from 7B to 120B parameters and Claude Opus 4.6 exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model’s own Chain-of-Thought, leading to fundamental open questions regarding the optimal granularity of atomic reasoning tasks.
[399] Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection
Gianluca Guglielmo, Marc Masana
Main category: cs.LG
TL;DR: A hyperparameter-free post-hoc OOD detection method that replaces sorted activation magnitudes with a fixed in-distribution reference profile, showing consistent performance across datasets and architectures.
Details
Motivation: Current post-hoc OOD detection methods using intermediate layer activation editing show inconsistent performance across datasets and models, with instability driven by differences in activation distributions and failure modes in scaling-based methods when penultimate layer activations aren't rectified.Method: Proposes a hyperparameter-free post-hoc method that replaces sorted activation magnitudes with a fixed in-distribution reference profile. It’s a simple plug-and-play approach that doesn’t require hyperparameter tuning or assumptions about penultimate layer activation functions.
Result: The method shows strong and consistent performance across datasets and architectures while preserving in-distribution classification accuracy by construction. Analysis reveals that both inhibiting and exciting activation shifts independently contribute to better OOD discrimination.
Conclusion: The proposed method addresses instability in current OOD detection approaches by using a fixed reference profile, providing reliable performance without hyperparameter tuning across diverse datasets and model architectures.
Abstract: State-of-the-art post-hoc out-of-distribution detection methods rely on intermediate layer activation editing. However, they exhibit inconsistent performance across datasets and models. We show that this instability is driven by differences in the activation distributions, and identify a failure mode of scaling-based methods that arises when penultimate layer activations are not rectified. Motivated by this analysis, we propose \ours, a hyperparameter-free post-hoc method that replaces sorted activation magnitudes with a fixed in-distribution reference profile. Our simple plug-and-play method shows strong and consistent performance across datasets and architectures without assumptions on the penultimate layer activation function, and without requiring any hyperparameter tuning, while preserving in-distribution classification accuracy by construction. We further analyze what drives the improvement, showing that both inhibiting and exciting activation shifts independently contribute to better out-of-distribution discrimination.
[400] Silhouette Loss: Differentiable Global Structure Learning for Deep Representations
Matheus Vinícius Todescato, Joel Luís Carbonera
Main category: cs.LG
TL;DR: Soft Silhouette Loss: A novel differentiable objective inspired by silhouette coefficient that encourages intra-class compactness and inter-class separation, can be combined with cross-entropy and supervised contrastive learning for improved representation learning.
Details
Motivation: Cross-entropy doesn't explicitly enforce geometric properties in embedding space like intra-class compactness and inter-class separation. Existing metric learning methods address this but increase computational cost and complexity.Method: Proposes Soft Silhouette Loss inspired by classical silhouette coefficient from clustering. Evaluates each sample against all classes in batch, providing batch-level global structure. Can be combined with cross-entropy and supervised contrastive learning in hybrid objective.
Result: Extensive experiments on 7 datasets show: (1) CE + Soft Silhouette Loss improves over CE and other baselines; (2) hybrid formulation outperforms SupCon alone; (3) combined method achieves best performance (39.08% top-1 accuracy vs 36.71% for CE, 37.85% for SupCon2) with lower computational overhead.
Conclusion: Classical clustering principles can be reinterpreted as differentiable objectives for deep learning, enabling efficient optimization of both local and global structure in representation spaces.
Abstract: Learning discriminative representations is a central goal of supervised deep learning. While cross-entropy (CE) remains the dominant objective for classification, it does not explicitly enforce desirable geometric properties in the embedding space, such as intra-class compactness and inter-class separation. Existing metric learning approaches, including supervised contrastive learning (SupCon) and proxy-based methods, address this limitation by operating on pairwise or proxy-based relationships, but often increase computational cost and complexity. In this work, we introduce Soft Silhouette Loss, a novel differentiable objective inspired by the classical silhouette coefficient from clustering analysis. Unlike pairwise objectives, our formulation evaluates each sample against all classes in the batch, providing a batch-level notion of global structure. The proposed loss directly encourages samples to be closer to their own class than to competing classes, while remaining lightweight. Soft Silhouette Loss can be seamlessly combined with cross-entropy, and is also complementary to supervised contrastive learning. We propose a hybrid objective that integrates them, jointly optimizing local pairwise consistency and global cluster structure. Extensive experiments on seven diverse datasets demonstrate that: (i) augmenting CE with Soft Silhouette Loss consistently improves over CE and other metric learning baselines; (ii) the hybrid formulation outperforms SupCon alone; and (iii) the combined method achieves the best performance, improving average top-1 accuracy from 36.71% (CE) and 37.85% (SupCon2) to 39.08%, while incurring substantially lower computational overhead. These results suggest that classical clustering principles can be reinterpreted as differentiable objectives for deep learning, enabling efficient optimization of both local and global structure in representation spaces.
[401] Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Rasched Haidari, Sam Martin, Maxime Allard
Main category: cs.LG
TL;DR: A distillation framework for genomic foundation models that transfers mRNA representations from large models to smaller specialized models, achieving 200x size reduction while maintaining competitive performance.
Details
Motivation: Large genomic foundation models have achieved remarkable results but grow to billions of parameters, making them expensive to run when compute is limited. There's a need for efficient models that can maintain performance while being computationally feasible.Method: Proposes a distillation framework that transfers mRNA representations from state-of-the-art genomic foundation models into much smaller specialized models. Uses embedding-level distillation (found to work better than logit-based methods which were unstable).
Result: Achieves 200-fold size reduction while maintaining state-of-the-art performance among models of comparable size on mRNA-bench benchmark. The distilled model competes with larger architectures for mRNA-related tasks.
Conclusion: Embedding-based distillation of mRNA sequences is an effective training strategy for biological foundation models, enabling efficient and scalable sequence modeling in genomics when large models are computationally challenging.
Abstract: Large Genomic Foundation Models have recently achieved remarkable results and in-vivo translation capabilities. However these models quickly grow to over a few Billion of parameters and are expensive to run when compute is limited. To overcome this challenge, we present a distillation framework for transferring mRNA representations from a state of the art genomic foundation model into a much smaller model specialized for mRNA sequences, reducing the size by 200-fold. Embedding-level distillation worked better than logit based methods, which we found unstable. Benchmarking on mRNA-bench demonstrates that the distilled model achieves state-of-the-art performance among models of comparable size and competes with larger architectures for mRNA-related tasks. Our results highlight embedding-based distillation of mRNA sequences as an effective training strategy for biological foundation models. This enables similar efficient and scalable sequence modelling in genomics, particularly when large models are computationally challenging or infeasible.
[402] MolPaQ: Modular Quantum-Classical Patch Learning for Interpretable Molecular Generation
Syed Rameez Naqvi, Lu Peng
Main category: cs.LG
TL;DR: MOLPAQ is a quantum-classical molecular generator that uses quantum-generated latent patches to create valid, diverse molecules with property control, achieving near-perfect validity and novelty.
Details
Motivation: Existing molecular generative models struggle to balance validity, diversity, and property control simultaneously, often trading off one objective for another. The authors aim to create a model that can achieve all three objectives effectively.Method: A modular quantum-classical approach: 1) β-VAE pretrained on QM9 learns chemically aligned latent space, 2) reduced conditioner maps molecular descriptors into this space, 3) parameter-efficient quantum patch generator produces entangled node embeddings, 4) valence-aware aggregator reconstructs valid molecular graphs, and 5) adversarial fine-tuning with latent critic and chemistry-shaped reward.
Result: Achieves 100% RDKit validity, 99.75% novelty, and 0.905 diversity. The quantum generator improves mean QED by ~2.3% and increases aromatic motif incidence by ~10-12% compared to parameter-matched classical generator.
Conclusion: MOLPAQ successfully balances validity, diversity, and property control in molecular generation, with the quantum component serving as an effective compact topology-shaping operator that outperforms classical alternatives.
Abstract: Molecular generative models must jointly ensure validity, diversity, and property control, yet existing approaches typically trade off among these objectives. We present MOLPAQ, a modular quantum-classical generator that assembles molecules from quantum-generated latent patches. A \b{eta}-VAE pretrained on QM9 learns a chemically aligned latent manifold; a reduced conditioner maps molecular descriptors into this space; and a parameter-efficient quantum patch generator produces entangled node embeddings that a valence-aware aggregator reconstructs into valid molecular graphs. Adversarial fine-tuning with a latent critic and chemistry-shaped reward yields 100% RDKit validity, 99.75% novelty, and 0.905 diversity. Beyond aggregate metrics, the pretrained quantum generator, steered by the conditioner, improves mean QED by approx. 2.3% and increases aromatic motif incidence by approx. 10-12% relative to a parameter-matched classical generator, highlighting its role as a compact topology-shaping operator.
[403] Distributionally Robust Token Optimization in RLHF
Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis
Main category: cs.LG
TL;DR: DRTO combines token-level RLHF with distributionally robust optimization to improve LLM consistency under distribution shifts in reasoning tasks.
Details
Motivation: LLMs are sensitive to small changes in prompts (wording, format, language) which causes failures in multi-step reasoning, especially under distribution shifts from training data.Method: Distributionally Robust Token Optimization (DRTO) combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO) using f-divergence ambiguity sets to bound worst-case token-wise rewards.
Result: DRTO improves consistency under distribution shifts: 9.17% improvement on GSM8K and 2.49% improvement on MathQA mathematical reasoning benchmarks.
Conclusion: DRTO provides theoretical robustness guarantees and practical improvements for LLM consistency under distribution shifts in reasoning tasks.
Abstract: Large Language Models (LLMs) tend to respond correctly to prompts that align to the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, leading to a theoretical robustness. Empirically, DRTO enhances consistency under distribution shifts in mathematical reasoning benchmarks, achieving 9.17% improvement on GSM8K and 2.49% improvement on MathQA.
[404] Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication
Benjamin Amoh, Geoffrey Parker, Wesley Marrero
Main category: cs.LG
TL;DR: SeqComm-DFL: A method for multi-agent coordination with sequential communication using decision-focused learning and Stackelberg conditioning to optimize messages for task performance rather than intermediate objectives.
Details
Motivation: Existing multi-agent coordination methods under partial observability optimize messages for intermediate objectives like reconstruction accuracy or mutual information, rather than directly optimizing for decision quality and task performance.Method: Introduces value-aware message generation with sequential Stackelberg conditioning: messages maximize receiver decision quality, generated in priority order with agents conditioning on predecessors. Uses guidance potential determined by prosocial ordering. Extends Optimal Model Design to communication-augmented world models with QMIX factorization for efficient end-to-end training via implicit differentiation.
Result: Achieves four to six times higher cumulative rewards and over 13% win rate improvements on collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks. Enables coordination strategies inaccessible under information asymmetry.
Conclusion: SeqComm-DFL effectively unifies sequential communication with decision-focused learning, proving information-theoretic bounds showing communication value scales with coordination gaps and establishing convergence guarantees for the bilevel optimization.
Abstract: Multi-agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information), rather than decision quality, we introduce \textbf{SeqComm-DFL}, unifying the sequential communication with decision-focused learning for task performance. Our approach features \emph{value-aware message generation with sequential Stackelberg conditioning}: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emph{guidance potential} determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.
[405] Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
Phong Lam, Ha-Linh Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Main category: cs.LG
TL;DR: EXPONA is an automated programmatic labeling framework that generates diverse and reliable label functions through multi-level exploration and reliability-aware filtering, outperforming existing methods in coverage and downstream performance.
Details
Motivation: Manual data annotation is costly and error-prone, while existing automated label function generation methods have limited coverage and unreliable label quality, creating a need for a more principled approach to programmatic labeling.Method: EXPONA formulates LF generation as a principled process balancing diversity and reliability, systematically exploring multi-level LFs (surface, structural, semantic) and applying reliability-aware mechanisms to suppress noisy/redundant heuristics while preserving complementary signals.
Result: On eleven classification datasets across diverse domains, EXPONA achieved nearly complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and yielded downstream performance gains of up to 46% in weighted F1, consistently outperforming state-of-the-art methods.
Conclusion: EXPONA’s combination of multi-level LF exploration and reliability-aware filtering enables more consistent label quality and downstream performance across diverse tasks by balancing coverage and precision in the generated LF set.
Abstract: High-quality labeled data is critical for training reliable machine learning and deep learning models, yet manual annotation remains costly and error-prone. Programmatic labeling addresses this challenge by using label functions (LFs), i.e., heuristic rules that automatically generate weak labels for training datasets. However, existing automated LF generation methods either rely on large language models (LLMs) to synthesize surface-level heuristics or employ model-based synthesis over hand-crafted primitives. These approaches often result in limited coverage and unreliable label quality. In this paper, we introduce EXPONA, an automated framework for programmatic labeling that formulates LF generation as a principled process balancing diversity and reliability. EXPONA systematically explores multi-level LFs, spanning surface, structural, and semantic perspectives. EXPONA further applies reliability-aware mechanisms to suppress noisy or redundant heuristics while preserving complementary signals. To evaluate EXPONA, we conducted extensive experiments on eleven classification datasets across diverse domains. Experimental results show that EXPONA consistently outperformed state-of-the-art automated LF generation methods. Specifically, EXPONA achieved nearly complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and yielded downstream performance gains of up to 46% in weighted F1. These results indicate that EXPONA’s combination of multi-level LF exploration and reliability-aware filtering enabled more consistent label quality and downstream performance across diverse tasks by balancing coverage and precision in the generated LF set.
[406] ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu
Main category: cs.LG
TL;DR: ECHO: Efficient diffusion-based vision-language model for chest X-ray report generation with one-step-per-block inference via Direct Conditional Distillation and Response-Asymmetric Diffusion training.
Details
Motivation: Autoregressive VLMs for chest X-ray report generation suffer from high inference latency due to sequential token decoding. Diffusion models offer parallel generation but still require multiple denoising iterations. Compressing to single-step generation often degrades textual coherence due to mean-field bias from token-factorized denoisers.Method: Proposes ECHO with Direct Conditional Distillation (DCD) framework to enable stable one-step-per-block inference by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. Also introduces Response-Asymmetric Diffusion (RAD) training strategy for improved efficiency.
Result: ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving 8× inference speedup without compromising clinical accuracy.
Conclusion: ECHO demonstrates efficient diffusion-based VLM for medical report generation with significant speed improvements while maintaining or improving text quality and clinical accuracy.
Abstract: Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists’ workload. However, conventional autoregressive vision–language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose \textbf{ECHO}, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by \textbf{64.33%} and \textbf{60.58%} respectively, while achieving an \textbf{$8\times$} inference speedup without compromising clinical accuracy.
[407] On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
Krisanu Sarkar
Main category: cs.LG
TL;DR: Cross-modal alignment study using functional maps reveals that independently trained vision and language encoders develop manifolds with similar intrinsic complexity but unaligned eigenvector bases, termed the spectral complexity-orientation gap.
Details
Motivation: To understand cross-modal alignment between independently pretrained vision and language encoders using computational geometry frameworks, and to investigate structural properties of multimodal representations beyond traditional alignment methods.Method: Uses functional map framework from computational geometry to represent correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. Compares with Procrustes alignment and relative representations across different supervision budgets.
Result: Functional map underperforms traditional methods for cross-modal retrieval but reveals important structural insights: Laplacian eigenvalue spectra are quantitatively similar (normalized spectral distance 0.043), indicating comparable intrinsic complexity, but eigenvector bases are effectively unaligned (mean diagonal dominance below 0.05, orthogonality error 70.15).
Conclusion: Independently trained models converge in how much structure they capture (spectral complexity) but not in how they organize it (orientation), creating a spectral complexity-orientation gap. This defines boundary conditions for spectral alignment methods and motivates diagnostic quantities for cross-modal representation compatibility.
Abstract: We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity–orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities : diagonal dominance, orthogonality deviation, and Laplacian commutativity error for characterizing cross-modal representation compatibility.
[408] Fully Autonomous Z-Score-Based TinyML Anomaly Detection on Resource-Constrained MCUs Using Power Side-Channel Data
Abdulrahman Albaiz, Fathi Amsaad
Main category: cs.LG
TL;DR: A fully autonomous TinyML anomaly detection system deployed on low-power microcontrollers for real-time appliance monitoring using power side-channel data, achieving perfect detection with minimal resource usage.
Details
Motivation: Existing IoT anomaly detection systems often rely on offline training or cloud-assisted analytics, which limits real-time autonomous operation. The authors aim to develop a fully autonomous system that performs both training and inference directly on resource-constrained microcontrollers without external computation or connectivity.Method: The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. Implemented on an STM32-based platform.
Result: Perfect detection performance with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. Evaluated using a 14-day dataset from a household mini-fridge under normal and controlled anomaly conditions.
Conclusion: Robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework with additional lightweight models and multi-device learning scenarios.
Abstract: This paper presents a fully autonomous Tiny Machine Learning (TinyML) Z-Score-based anomaly detection system deployed on a low-power microcontroller for real-time monitoring of appliance behavior using power side-channel data. Unlike existing Internet of Things (IoT) anomaly detection approaches that rely on offline training or cloud-assisted analytics, the proposed system performs both model training and inference directly on a resource-constrained microcontroller without external computation or connectivity. The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. The architecture was implemented on an STM32-based platform and evaluated using a 14-day dataset collected from a household mini-fridge under normal operation and controlled anomaly conditions. Results demonstrate perfect detection performance, with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. These results confirm that robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework to incorporate additional lightweight models and multi-device learning scenarios.
[409] Multivariate Time Series Anomaly Detection via Dual-Branch Reconstruction and Autoregressive Flow-based Residual Density Estimation
Jun Liu, Ying Chen, Ziqian Lu, Qinyue Tong, Jun Tang
Main category: cs.LG
TL;DR: DBR-AF is a novel framework for multivariate time series anomaly detection that combines dual-branch reconstruction with autoregressive flow to address limitations of existing reconstruction-based methods.
Details
Motivation: Current reconstruction-based anomaly detection methods suffer from two key problems: 1) overfitting to spurious correlations due to overemphasis on cross-variable modeling, and 2) misleading anomaly scores from simply summing multivariable reconstruction errors, making it hard to distinguish hard-to-reconstruct samples from genuine anomalies.Method: Proposes DBR-AF framework with two core components: 1) Dual-branch reconstruction (DBR) encoder that decouples cross-variable correlation learning and intra-variable statistical property modeling to mitigate spurious correlations, and 2) Autoregressive flow (AF) module that uses stacked reversible transformations to model complex multivariate residual distribution and leverages density estimation to accurately identify normal samples with large reconstruction errors.
Result: Extensive experiments on seven benchmark datasets demonstrate state-of-the-art performance, with ablation studies validating the indispensability of both core components.
Conclusion: DBR-AF effectively addresses key limitations in reconstruction-based anomaly detection by decoupling correlation learning and using advanced density estimation, achieving superior performance on multivariate time series anomaly detection tasks.
Abstract: Multivariate Time Series Anomaly Detection (MTSAD) is critical for real-world monitoring scenarios such as industrial control and aerospace systems. Mainstream reconstruction-based anomaly detection methods suffer from two key limitations: first, overfitting to spurious correlations induced by an overemphasis on cross-variable modeling; second, the generation of misleading anomaly scores by simply summing up multivariable reconstruction errors, which makes it difficult to distinguish between hard-to-reconstruct samples and genuine anomalies. To address these issues, we propose DBR-AF, a novel framework that integrates a dual-branch reconstruction (DBR) encoder and an autoregressive flow (AF) module. The DBR encoder decouples cross-variable correlation learning and intra-variable statistical property modeling to mitigate spurious correlations, while the AF module employs multiple stacked reversible transformations to model the complex multivariate residual distribution and further leverages density estimation to accurately identify normal samples with large reconstruction errors. Extensive experiments on seven benchmark datasets demonstrate that DBR-AF achieves state-of-the-art performance, with ablation studies validating the indispensability of its core components.
[410] CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
Chuxu Song, Zhencan Peng, Jiuqi Wei, Chuanhui Yang
Main category: cs.LG
TL;DR: CSAttention is a training-free sparse attention method for long-context LLMs that optimizes reusable context serving by front-loading computation to offline prefill and using query-centric lookup tables for efficient online decoding.
Details
Motivation: Long-context LLMs face attention and KV-cache bottlenecks during decoding, especially with reusable prefill prompts for agents and domain Q&A. Existing sparse attention methods struggle with accuracy at high sparsity due to distribution shift between queries and keys.Method: CSAttention uses a storage-for-computation strategy with offline prefill and online decode phases. During offline prefill, it constructs query-centric lookup tables that remain fixed during decoding. Online decoding replaces full-context scans with efficient table lookups and GPU-friendly score accumulation.
Result: CSAttention achieves near-identical accuracy to full attention while outperforming state-of-the-art sparse attention methods. At 95% sparsity and 32K-128K context lengths, it achieves up to 4.6x inference speedup over the most accurate baseline at 128K context length.
Conclusion: CSAttention effectively addresses the decode-time bottlenecks in long-context LLMs with reusable contexts, offering significant speed improvements while maintaining accuracy through its training-free sparse attention approach.
Abstract: Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, pushing attention and KV-cache to become the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity levels due to the inherent distribution shift between Queries and Keys. We propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention method optimized for high-throughput serving of reusable contexts. CSAttention adopts a storage-for-computation strategy tailored to the offline-prefill/online-decode setting: it front-loads computation into a one-time offline prefill phase that can be amortized across multiple queries, while aggressively optimizing per-step decoding latency. Specifically, CSAttention constructs query-centric lookup tables during offline prefill, whose size remains fixed during decoding, and enables online decoding to replace full-context scans with efficient table lookups and GPU-friendly score accumulation. Extensive experiments demonstrate that CSAttention achieves near-identical accuracy to full attention. Under high sparsity (95%) and long-context settings (32K-128K), CSAttention consistently outperforms state-of-the-art sparse attention methods in both model accuracy and inference speed, achieving up to 4.6x inference speedup over the most accurate baseline at a context length of 128K.
[411] FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes
David Ramos, Lucas Lacasa, Fermín Gutiérrez, Eusebio Valero, Gonzalo Rubio
Main category: cs.LG
TL;DR: FluidFlow: A generative flow-matching model for scalable surrogate modeling of fluid dynamics on both structured and unstructured meshes without interpolation preprocessing.
Details
Motivation: CFD simulations are computationally expensive for many-query applications, and existing deep learning surrogate models need improvement. The authors propose using generative modeling as a framework for constructing scalable fluid-dynamics surrogate models that can handle both structured and unstructured meshes directly.Method: FluidFlow uses conditional flow-matching (alternative to diffusion models) to learn deterministic transport maps between noise and data distributions. It operates directly on CFD data on both structured/unstructured meshes without interpolation. Two architectures tested: U-Net and diffusion transformer (DiT), conditioned on physically meaningful parameters.
Result: FluidFlow outperforms strong multilayer perceptron baselines on two benchmark problems: airfoil pressure coefficients and 3D aircraft pressure/friction coefficients on unstructured mesh. Achieves significantly lower error metrics and improved generalization. Transformer-based architecture enables scalable learning on large unstructured datasets while maintaining high accuracy.
Conclusion: Flow-matching generative models provide an effective and flexible framework for surrogate modeling in fluid dynamics, with potential for realistic engineering and scientific applications.
Abstract: Computational fluid dynamics (CFD) provides high-fidelity simulations of fluid flows but remains computationally expensive for many-query applications. In recent years deep learning (DL) has been used to construct data-driven fluid-dynamic surrogate models. In this work we consider a different learning paradigm and embrace generative modelling as a framework for constructing scalable fluid-dynamics surrogate models. We introduce FluidFlow, a generative model based on conditional flow-matching, a recent alternative to diffusion models that learns deterministic transport maps between noise and data distributions. FluidFlow is specifically designed to operate directly on CFD data defined on both structured and unstructured meshes alike, without the needs to perform any mesh interpolation pre-processing and preserving geometric fidelity. We assess the capabilities of FluidFlow using two different core neural network architectures, a U-Net and diffusion transformer (DiT), and condition their learning on physically meaningful parameters. The methodology is validated on two benchmark problems of increasing complexity: prediction of pressure coefficients along an airfoil boundary across different operating conditions, and prediction of pressure and friction coefficients over a full three-dimensional aircraft geometry discretized on a large unstructured mesh. In both cases, FluidFlow outperform strong multilayer perceptron baselines, achieving significantly lower error metrics and improved generalisation across operating conditions. Notably, the transformer-based architecture enables scalable learning on large unstructured datasets while maintaining high predictive accuracy. These results demonstrate that flow-matching generative models provide an effective and flexible framework for surrogate modelling in fluid dynamics, with potential for realistic engineering and scientific applications.
[412] Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL
Igor Jankowski
Main category: cs.LG
TL;DR: NetForge_RL is a high-fidelity cyber operations simulator that bridges the Sim2Real gap for MARL in network defense, using continuous-time POSMDP and CT-GMARL with Neural ODEs for asynchronous alert processing.
Details
Motivation: Current MARL policies for cyber wargames face a Sim2Real gap when transitioning to operational SOCs, as legacy simulators abstract away network physics, use synchronous ticks, and provide clean state vectors rather than noisy telemetry.Method: Introduces NetForge_RL simulator with dual-mode engine for training and live evaluation, and CT-GMARL using Neural ODEs to process irregularly sampled alerts in continuous-time POSMDP with Zero-Trust Network Access constraints.
Result: CT-GMARL achieves 2.0-2.1x higher median Blue reward than baselines, restores 12x more compromised services, and achieves high reward in zero-shot transfer to live Docker environment, validating the Sim2Real bridge.
Conclusion: The framework successfully bridges Sim2Real gap for MARL in network defense through high-fidelity simulation and continuous-time modeling, enabling effective transfer from simulation to real operational environments.
Abstract: The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). NetForge enforces Zero-Trust Network Access (ZTNA) constraints and requires defenders to process NLP-encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual-mode engine, allowing high-throughput MARL training in a mock hypervisor and zero-shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous-time POSMDP, we propose Continuous-Time Graph MARL (CT-GMARL), utilizing fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R-MAPPO, QMIX). Empirical results demonstrate that CT-GMARL achieves a converged median Blue reward of 57,135 - a 2.0x improvement over R-MAPPO and 2.1x over QMIX. Critically, CT-GMARL restores 12x more compromised services than the strongest baseline by avoiding the “scorched earth” failure mode of trivially minimizing risk by destroying network utility. On zero-shot transfer to the live Docker environment, CT-GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.
[413] Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
Matthew DosSantos DiSorbo, Harang Ju
Main category: cs.LG
TL;DR: LLMs can be trained to make better escalation decisions by explicitly reasoning about uncertainty and decision costs, with supervised fine-tuning on chain-of-thought targets yielding the most robust policies.
Details
Motivation: Effective automation requires deciding when to act vs. escalate to humans, but current LLMs have inconsistent escalation thresholds and miscalibrated uncertainty estimates that vary unpredictably across models.Method: Model escalation as decision under uncertainty: LLM predicts, estimates correctness probability, compares expected costs of acting vs. escalating. Test across 5 domains with multiple model families, then evaluate interventions including cost ratio variation, accuracy signals, and training models to follow desired escalation rules.
Result: Marked differences in implicit escalation thresholds across models not predicted by architecture or scale; self-estimates miscalibrated in model-specific ways. Prompting helps mainly for reasoning models; SFT on chain-of-thought targets yields most robust policies generalizing across datasets, cost ratios, prompt framings, and held-out domains.
Conclusion: Escalation behavior is model-specific property requiring characterization before deployment; robust alignment benefits from training models to reason explicitly about uncertainty and decision costs.
Abstract: Effective automation hinges on deciding when to act and when to escalate. We model this as a decision under uncertainty: an LLM forms a prediction, estimates its probability of being correct, and compares the expected costs of acting and escalating. Using this framework across five domains of recorded human decisions-demand forecasting, content recommendation, content moderation, loan approval, and autonomous driving-and across multiple model families, we find marked differences in the implicit thresholds models use to trade off these costs. These thresholds vary substantially and are not predicted by architecture or scale, while self-estimates are miscalibrated in model-specific ways. We then test interventions that target this decision process by varying cost ratios, providing accuracy signals, and training models to follow the desired escalation rule. Prompting helps mainly for reasoning models. SFT on chain-of-thought targets yields the most robust policies, which generalize across datasets, cost ratios, prompt framings, and held-out domains. These results suggest that escalation behavior is a model-specific property that should be characterized before deployment, and that robust alignment benefits from training models to reason explicitly about uncertainty and decision costs.
[414] EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning
Ha Na Cho, Daniel Eisenberg, Cheryl King, Kai Zheng
Main category: cs.LG
TL;DR: ML ensemble model predicts DMHI user engagement with 84% accuracy, with SHAP analysis revealing emotional dysregulation and stigma as key factors affecting adoption.
Details
Motivation: Digital mental health interventions (DMHIs) face adoption barriers like low uptake and high dropout rates, despite rising mental health challenges among young adults. Need to understand and predict user engagement patterns to improve DMHI effectiveness.Method: Used machine learning to analyze behavioral patterns from eBridge DMHI users. Developed ensemble model called EngageTriBoost to predict engagement (measured by sign-ins and counselor interactions). Applied SHAP analysis for interpretable insights into key factors.
Result: EngageTriBoost achieved up to 84% accuracy in predicting user engagement. SHAP analysis identified emotional dysregulation and perceived stigma as critical factors influencing DMHI adoption and engagement patterns.
Conclusion: Explainable ML can effectively predict and understand user engagement with DMHIs, providing actionable insights to improve adoption and mental health outcomes through targeted interventions addressing key barriers.
Abstract: Mental health challenges among young adults, are on the rise, necessitating effective solutions such as digital mental health interventions (DMHIs). Despite their promise, DMHIs face significant adoption barriers, including low initial uptake and high dropout rates. This study leverages machine learning (ML) to analyze behavioral patterns of users of a DMHI, eBridge, designed to increase the utilization of professional mental health services among at-risk college students through motivational interviewing-based online counseling. Our ensemble model, EngageTriBoost, achieved up to 84% accuracy in predicting engagement, measured by sign-ins and counselor interactions. We then applied the Shapley Additive exPlanations (SHAP) analysis which provided clear, interpretable insights into key factors influencing user engagement such as emotional dysregulation and perceived stigma, highlighting their critical effect on DMHI adoption. This study demonstrates the power of explainable ML for better understanding user engagement with DMHI to improve their adoption and achievable impact on mental health outcomes.
[415] AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs
Brendan R. Hogan, Xiwen Chen, James T. Wilson, Kashif Rasul, Adel Boyarsky, Thomas Kamei, Anderson Schneider, Yuriy Nevmyvaka
Main category: cs.LG
TL;DR: AlphaLab is an autonomous research system using LLM agents to automate the full experimental cycle in quantitative domains, achieving state-of-the-art results in CUDA optimization, LLM pretraining, and traffic forecasting.
Details
Motivation: To automate the entire research pipeline in computation-intensive domains using LLM agents, eliminating human intervention while achieving competitive or superior results compared to traditional methods.Method: Three-phase autonomous pipeline: (1) domain adaptation and data exploration with code generation, (2) adversarial evaluation framework construction, (3) large-scale GPU experiments via Strategist/Worker loop with persistent playbook knowledge accumulation.
Result: Achieved 4.4x faster CUDA kernels than torch.compile (up to 91x), 22% lower validation loss in LLM pretraining, and 23-25% improvement in traffic forecasting over baselines. Different LLMs discovered complementary solutions.
Conclusion: AlphaLab demonstrates that LLM agents can autonomously conduct high-quality research in quantitative domains, with multi-model campaigns providing complementary search coverage for better results.
Abstract: We present AlphaLab, an autonomous research harness that leverages frontier LLM agentic capabilities to automate the full experimental cycle in quantitative, computation-intensive domains. Given only a dataset and a natural-language objective, AlphaLab proceeds through three phases without human intervention: (1) it adapts to the domain and explores the data, writing analysis code and producing a research report; (2) it constructs and adversarially validates its own evaluation framework; and (3) it runs large-scale GPU experiments via a Strategist/Worker loop, accumulating domain knowledge in a persistent playbook that functions as a form of online prompt optimization. All domain-specific behavior is factored into adapters generated by the model itself, so the same pipeline handles qualitatively different tasks without modification. We evaluate AlphaLab with two frontier LLMs (GPT-5.2 and Claude Opus 4.6) on three domains: CUDA kernel optimization, where it writes GPU kernels that run 4.4x faster than torch.compile on average (up to 91x); LLM pretraining, where the full system achieves 22% lower validation loss than a single-shot baseline using the same model; and traffic forecasting, where it beats standard baselines by 23-25% after researching and implementing published model families from the literature. The two models discover qualitatively different solutions in every domain (neither dominates uniformly), suggesting that multi-model campaigns provide complementary search coverage. We additionally report results on financial time series forecasting in the appendix, and release all code at https://brendanhogan.github.io/alphalab-paper/.
[416] From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales
Ivan Viakhirev, Kirill Borodin, Grach Mkrtchian
Main category: cs.LG
TL;DR: Large ASR models exhibit phase transitions from dispersive to attractor regimes under adversarial stress, with intermediate models showing structural disintegration and large models entering compression-seeking attractor states that decouple from acoustic evidence.
Details
Motivation: Hallucinations in large ASR models present critical safety risks, motivating the need to understand the underlying mechanisms of how these models fail under adversarial conditions.Method: Proposed the Spectral Sensitivity Theorem predicting phase transitions governed by layer-wise gain and alignment, validated by analyzing eigenspectra of activation graphs in Whisper models (Tiny to Large-v3-Turbo) under adversarial stress.
Result: Confirmed theoretical predictions: intermediate models show Structural Disintegration (Regime I) with 13.4% collapse in Cross-Attention rank, while large models enter Compression-Seeking Attractor state (Regime II) where Self-Attention compresses rank (-2.34%) and hardens spectral slope, decoupling from acoustic evidence.
Conclusion: The study reveals fundamental differences in how ASR models of different sizes respond to adversarial stress, with large models developing compression mechanisms that make them more prone to hallucinations by decoupling from acoustic input.
Abstract: Hallucinations in large ASR models present a critical safety risk. In this work, we propose the \textit{Spectral Sensitivity Theorem}, which predicts a phase transition in deep networks from a dispersive regime (signal decay) to an attractor regime (rank-1 collapse) governed by layer-wise gain and alignment. We validate this theory by analyzing the eigenspectra of activation graphs in Whisper models (Tiny to Large-v3-Turbo) under adversarial stress. Our results confirm the theoretical prediction: intermediate models exhibit \textit{Structural Disintegration} (Regime I), characterized by a $13.4%$ collapse in Cross-Attention rank. Conversely, large models enter a \textit{Compression-Seeking Attractor} state (Regime II), where Self-Attention actively compresses rank ($-2.34%$) and hardens the spectral slope, decoupling the model from acoustic evidence.
[417] Inferring Latent Temporal Sparse Coordination Graph for Multi-Agent Reinforcement Learning
Wei Duan, Jie Lu, Junyu Xuan
Main category: cs.LG
TL;DR: LTS-CG is a novel MARL method that learns latent temporal sparse coordination graphs from historical observations to improve agent cooperation while reducing computational complexity.
Details
Motivation: Current graph learning methods in MARL rely only on one-step observations, neglect historical experiences, produce deficient graphs with redundant/detrimental information exchange, and have high computational demands for dense graphs.Method: Proposes Latent Temporal Sparse Coordination Graph (LTS-CG) that uses agents’ historical observations to compute agent-pair probability matrices, samples sparse graphs for knowledge exchange, incorporates Predict-Future and Infer-Present mechanisms, and trains graph learning and agents end-to-end.
Result: Demonstrated superior performance on StarCraft II benchmark, with computational complexity scaling only with number of agents rather than dense graph calculations.
Conclusion: LTS-CG effectively captures agent dependencies and relation uncertainty through temporal sparse coordination graphs, enabling more efficient and effective multi-agent collaboration.
Abstract: Effective agent coordination is crucial in cooperative Multi-Agent Reinforcement Learning (MARL). While agent cooperation can be represented by graph structures, prevailing graph learning methods in MARL are limited. They rely solely on one-step observations, neglecting crucial historical experiences, leading to deficient graphs that foster redundant or detrimental information exchanges. Additionally, high computational demands for action-pair calculations in dense graphs impede scalability. To address these challenges, we propose inferring a Latent Temporal Sparse Coordination Graph (LTS-CG) for MARL. The LTS-CG leverages agents’ historical observations to calculate an agent-pair probability matrix, where a sparse graph is sampled from and used for knowledge exchange between agents, thereby simultaneously capturing agent dependencies and relation uncertainty. The computational complexity of this procedure is only related to the number of agents. This graph learning process is further augmented by two innovative characteristics: Predict-Future, which enables agents to foresee upcoming observations, and Infer-Present, ensuring a thorough grasp of the environmental context from limited data. These features allow LTS-CG to construct temporal graphs from historical and real-time information, promoting knowledge exchange during policy learning and effective collaboration. Graph learning and agent training occur simultaneously in an end-to-end manner. Our demonstrated results on the StarCraft II benchmark underscore LTS-CG’s superior performance.
[418] Reservoir observer enhanced with residual calibration and attention mechanism
Yichen Liu, Wei Xiao, Tianguang Chu
Main category: cs.LG
TL;DR: Enhanced reservoir observers with residual calibration and attention mechanisms improve inference of unmeasured variables in nonlinear dynamical systems, addressing input-dependent performance issues.
Details
Motivation: Traditional reservoir observers for nonlinear dynamical systems show variable performance depending on input variables, sometimes compromising reliability. The paper aims to enhance inference accuracy and robustness.Method: Integrates residual calibration and attention mechanisms into reservoir observer design. Residual calibration uses estimation residuals to refine outputs, while attention exploits temporal dependencies to enrich reservoir dynamics representation.
Result: Experiments on chaotic systems show substantial improvement in inference accuracy, especially for worst-case scenarios from traditional reservoir observers. Transfer entropy analysis explains input-dependent observation discrepancies.
Conclusion: The proposed enhancements significantly improve reservoir observer performance, making them more robust and accurate for inferring unmeasured variables in nonlinear dynamical systems.
Abstract: Reservoir observers provide a data-driven approach to the inference of unmeasured variables from observed ones for nonlinear dynamical systems. While previous studies have demonstrated wide applicability, their performance may vary considerably with different input variables, even compromising reliability in the worst cases. To enhance the performance of inference, we integrate residual calibration and attention mechanism into the reservoir observer design. The residual calibration module leverages information from the estimation residuals to refine the observer output, and the attention mechanism exploits the temporal dependencies of the data to enrich the representation of reservoir internal dynamics. Experiments on typical chaotic systems demonstrate that our method substantially improves inference accuracy, especially for the worst cases resulting from the traditional reservoir observers. We also invoke the notion of transfer entropy to explain the reason for the input-dependent observation discrepancy and the effectiveness of the proposed method.
[419] Group-Aware Coordination Graph for Multi-Agent Reinforcement Learning
Wei Duan, Jie Lu, Junyu Xuan
Main category: cs.LG
TL;DR: GACG learns group-aware coordination graphs for MARL by capturing both pairwise agent cooperation and group-level dependencies from behavior patterns, using graph convolution for information exchange and group distance loss for behavioral consistency.
Details
Motivation: Existing MARL methods focus only on agent-pair relations and neglect higher-order group relationships, limiting information exchange among partially observed agents and failing to capture complex coordination patterns.Method: Proposes Group-Aware Coordination Graph (GACG) that infers both pairwise cooperation from current observations and group-level dependencies from behavior patterns across trajectories. Uses graph convolution for agent information exchange and introduces group distance loss to promote within-group cohesion and between-group specialization.
Result: Demonstrates superior performance on StarCraft II micromanagement tasks compared to existing methods. Ablation studies confirm effectiveness of each component (pairwise relations, group dependencies, group distance loss).
Conclusion: GACG effectively captures complex multi-agent coordination by modeling both pairwise and group-level relationships, improving cooperation and performance in challenging MARL environments.
Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) necessitates seamless collaboration among agents, often represented by an underlying relation graph. Existing methods for learning this graph primarily focus on agent-pair relations, neglecting higher-order relationships. While several approaches attempt to extend cooperation modelling to encompass behaviour similarities within groups, they commonly fall short in concurrently learning the latent graph, thereby constraining the information exchange among partially observed agents. To overcome these limitations, we present a novel approach to infer the Group-Aware Coordination Graph (GACG), which is designed to capture both the cooperation between agent pairs based on current observations and group-level dependencies from behaviour patterns observed across trajectories. This graph is further used in graph convolution for information exchange between agents during decision-making. To further ensure behavioural consistency among agents within the same group, we introduce a group distance loss, which promotes group cohesion and encourages specialization between groups. Our evaluations, conducted on StarCraft II micromanagement tasks, demonstrate GACG’s superior performance. An ablation study further provides experimental evidence of the effectiveness of each component of our method.
[420] Joint Interference Detection and Identification via Adversarial Multi-task Learning
H. Xu, B. He, S. Wang
Main category: cs.LG
TL;DR: A theoretically-grounded multi-task learning framework for joint interference detection, modulation identification, and interference identification in wireless communications, using adversarial training and adaptive task correlation modeling.
Details
Motivation: Existing deep learning approaches for interference analysis use single-task learning that neglects task correlations, while emerging multi-task learning methods lack theoretical foundation for quantifying and modeling task relationships in wireless communication systems.Method: Established theoretical MTL framework with derived upper bound for weighted expected loss connected to task similarity (Wasserstein distance). Proposed AMTIDIN network integrates adversarial training to minimize distributional discrepancies and uses adaptive coefficients to dynamically model task correlations.
Result: AMTIDIN significantly outperforms both task-specific STL baselines and state-of-the-art MTL baselines in robustness and generalization, especially under challenging conditions with limited training data, short signal lengths, and low SNRs.
Conclusion: The theoretically grounded MTL framework with adversarial training and adaptive task correlation modeling provides superior performance for joint interference analysis tasks in wireless communications, with quantitative analysis revealing intrinsic task relationships.
Abstract: Precise interference detection and identification are crucial for enhancing the survivability of communication systems in non-cooperative wireless environments. While deep learning (DL) has advanced this field, existing single-task learning (STL) approaches neglect inherent task correlations. Furthermore, emerging multi-task learning (MTL) methods often lack a theoretical foundation for quantifying and modeling task relationships. To bridge this gap, we establish a theoretically grounded MTL framework for joint interference detection, modulation identification, and interference identification. First, we derive an upper bound for the weighted expected loss in MTL frameworks. This bound explicitly connects MTL performance to task similarity, quantified by the Wasserstein distance and learnable task relation coefficients. Guided by this theory, we present the adversarial multi-task interference detection and identification network (AMTIDIN), which integrates adversarial training to minimize distributional discrepancies across tasks and uses adaptive coefficients to model task correlations dynamically. Crucially, we conducted a quantitative analysis of task similarity to reveal intrinsic task relationships, specifically that modulation identification and interference identification share a substantial feature overlap distinct from interference detection. Extensive comparative experiments demonstrate that AMTIDIN significantly outperforms both its task-specific STL baseline and state-of-the-art MTL baselines in robustness and generalization, particularly under challenging conditions with limited training data, short signal lengths, and low signal-to-noise ratios (SNRs).
[421] From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
Zhuang Qi, Ying-Peng Tang, Lei Meng, Guoqing Chao, Lei Wu, Han Yu, Xiangxu Meng
Main category: cs.LG
TL;DR: FEAT: Federated geometry-aware correction method for federated continual learning that addresses representation collapse in class-imbalanced scenarios through geometric structure alignment and energy-based correction.
Details
Motivation: Existing federated continual learning methods focus on sample selection but overlook effective utilization of exemplars, limiting performance under continual dynamic heterogeneity across clients and tasks, particularly suffering from imbalance-induced representation collapse.Method: Two key modules: 1) Geometric Structure Alignment aligns pairwise angular similarities between features and Equiangular Tight Frame prototypes for geometric consistency; 2) Energy-based Geometric Correction removes task-irrelevant directional components to reduce bias toward majority classes.
Result: The method mitigates representation drift, improves sensitivity to minority classes, and enhances model robustness under class-imbalanced distributions in federated continual learning settings.
Conclusion: FEAT effectively addresses representation collapse in federated continual learning by maintaining geometric consistency and correcting for class imbalance, outperforming existing exemplar-based methods.
Abstract: Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a Federated gEometry-Aware correcTion method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model’s robustness under class-imbalanced distributions.
[422] StructRL: Recovering Dynamic Programming Structure from Learning Dynamics in Distributional Reinforcement Learning
Ivo Nowak
Main category: cs.LG
TL;DR: Distributional RL reveals structured learning dynamics that mimic dynamic programming, enabling guided sampling for more efficient reinforcement learning.
Details
Motivation: Traditional RL treats learning as uniform optimization without exploiting global structure, while dynamic programming uses structured information propagation for efficiency. The paper aims to show that distributional RL can recover similar structure from learning dynamics.Method: Analyze temporal evolution of return distributions in distributional RL to identify learning signals. Introduce temporal learning indicator t*(s) that captures when states undergo strongest learning updates. Use these signals to guide sampling in alignment with emerging propagation structure (StructRL framework).
Result: Empirical evidence shows the temporal learning indicator induces state ordering consistent with dynamic programming-style information propagation. Preliminary results suggest distributional learning dynamics can recover and exploit DP-like structure without explicit models.
Conclusion: Distributional RL provides a mechanism to interpret learning as structured propagation rather than purely uniform optimization, offering new perspective on RL with potential efficiency gains through guided sampling.
Abstract: Reinforcement learning is typically treated as a uniform, data-driven optimization process, where updates are guided by rewards and temporal-difference errors without explicitly exploiting global structure. In contrast, dynamic programming methods rely on structured information propagation, enabling efficient and stable learning. In this paper, we provide evidence that such structure can be recovered from the learning dynamics of distributional reinforcement learning. By analyzing the temporal evolution of return distributions, we identify signals that capture when and where learning occurs in the state space. In particular, we introduce a temporal learning indicator t*(s) that reflects when a state undergoes its strongest learning update during training. Empirically, this signal induces an ordering over states that is consistent with a dynamic programming-style propagation of information. Building on this observation, we propose StructRL, a framework that exploits these signals to guide sampling in alignment with the emerging propagation structure. Our preliminary results suggest that distributional learning dynamics provide a mechanism to recover and exploit dynamic programming-like structure without requiring an explicit model. This offers a new perspective on reinforcement learning, where learning can be interpreted as a structured propagation process rather than a purely uniform optimization procedure.
[423] Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing
Yesmine Abdennadher, Philip N. Garner
Main category: cs.LG
TL;DR: Bayesian learning applied to Spiking Neural Networks for speech processing tasks improves predictive landscape smoothness and performance metrics
Details
Motivation: SNNs are well-suited for temporal speech data but have irregular predictive landscapes due to threshold-based spiking; Bayesian learning may smooth these landscapesMethod: Apply Bayesian learning approach to SNN weights, specifically using Improved Variational Online Newton (IVON) for surrogate-gradient SNNs
Result: Improved performance on negative log-likelihood and Brier score; smoother, more regular predictive landscape compared to deterministic approach
Conclusion: Bayesian learning effectively addresses SNN’s irregular predictive landscape, enhancing performance on speech processing tasks
Abstract: Spiking Neural Networks (SNNs) are naturally suited for speech processing tasks due to their specific dynamics, which allows them to handle temporal data. However, the threshold-based generation of spikes in SNNs intuitively causes an angular or irregular predictive landscape. We explore the effect of using the Bayesian learning approach for the weights on the irregular predictive landscape. For the surrogate-gradient SNNs, we also explore the application of the Improved Variational Online Newton (IVON) approach, which is an efficient variational approach. The performance of the proposed approach is evaluated on the Heidelberg Digits and Speech Commands datasets. The hypothesis is that the Bayesian approach will result in a smoother and more regular predictive landscape, given the angular nature of the deterministic predictive landscape. The experimental evaluation of the proposed approach shows improved performance on the negative log-likelihood and Brier score. Furthermore, the proposed approach has resulted in a smoother and more regular predictive landscape compared to the deterministic approach, based on the one-dimensional slices of the weight space
[424] Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning
Wei Duan, Jie Lu, En Yu, Junyu Xuan
Main category: cs.LG
TL;DR: BVME introduces variational message encoding for bandwidth-limited multi-agent reinforcement learning, achieving comparable performance with 67-83% fewer message dimensions.
Details
Motivation: Existing graph-based MARL methods focus on learning sparse coordination graphs but don't address what information should be transmitted under hard bandwidth constraints. Naive dimensionality reduction degrades coordination performance, and deterministic projections lack control over compression.Method: Bandwidth-constrained Variational Message Encoding (BVME) treats messages as samples from learned Gaussian posteriors regularized via KL divergence to an uninformative prior. This variational framework provides principled, tunable control over compression strength through interpretable hyperparameters.
Result: BVME achieves comparable or superior performance while using 67-83% fewer message dimensions across SMACv1, SMACv2, and MPE benchmarks. Gains are most pronounced on sparse graphs where message quality critically impacts coordination.
Conclusion: BVME provides an effective solution for bandwidth-limited MARL with minimal overhead, excelling at extreme compression ratios where traditional methods fail.
Abstract: Graph-based multi-agent reinforcement learning (MARL) enables coordinated behavior under partial observability by modeling agents as nodes and communication links as edges. While recent methods excel at learning sparse coordination graphs-determining who communicates with whom-they do not address what information should be transmitted under hard bandwidth constraints. We study this bandwidth-limited regime and show that naive dimensionality reduction consistently degrades coordination performance. Hard bandwidth constraints force selective encoding, but deterministic projections lack mechanisms to control how compression occurs. We introduce Bandwidth-constrained Variational Message Encoding (BVME), a lightweight module that treats messages as samples from learned Gaussian posteriors regularized via KL divergence to an uninformative prior. BVME’s variational framework provides principled, tunable control over compression strength through interpretable hyperparameters, directly constraining the representations used for decision-making. Across SMACv1, SMACv2, and MPE benchmarks, BVME achieves comparable or superior performance while using 67–83% fewer message dimensions, with gains most pronounced on sparse graphs where message quality critically impacts coordination. Ablations reveal U-shaped sensitivity to bandwidth, with BVME excelling at extreme ratios while adding minimal overhead.
[425] On Divergence Measures for Training GFlowNets
Tiago da Silva, Eliezer de Souza da Silva, Diego Mesquita
Main category: cs.LG
TL;DR: GFlowNets training improved via divergence minimization with efficient gradient estimators and control variates, bridging gap with variational inference.
Details
Motivation: Traditional GFlowNets training uses log-squared difference minimization, which is related to variational inference but can lead to biased, high-variance estimators. The paper aims to bridge GFlowNets training with generalized variational approximations by exploring alternative divergence measures.Method: Reviews four divergence measures (Renyi-α, Tsallis-α, reverse and forward KL), designs statistically efficient estimators for their stochastic gradients in GFlowNets training, and implements control variates based on REINFORCE leave-one-out and score-matching estimators to reduce gradient variance.
Result: Proper minimization of these divergences yields provably correct and empirically effective training, often leading to significantly faster convergence than previous optimization methods.
Conclusion: The work narrows the gap between GFlowNets training and generalized variational approximations, enabling algorithmic ideas informed by divergence minimization perspective.
Abstract: Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects, with applications in generative modeling for tasks in fields such as causal discovery, NLP, and drug discovery. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution, which enforces certain flow-matching conditions. While this training procedure is closely related to variational inference (VI), directly attempting standard Kullback-Leibler (KL) divergence minimization can lead to proven biased and potentially high-variance estimators. Therefore, we first review four divergence measures, namely, Renyi-$α$’s, Tsallis-$α$’s, reverse and forward KL’s, and design statistically efficient estimators for their stochastic gradients in the context of training GFlowNets. Then, we verify that properly minimizing these divergences yields a provably correct and empirically effective training scheme, often leading to significantly faster convergence than previously proposed optimization. To achieve this, we design control variates based on the REINFORCE leave-one-out and score-matching estimators to reduce the variance of the learning objectives’ gradients. Our work contributes by narrowing the gap between GFlowNets training and generalized variational approximations, paving the way for algorithmic ideas informed by the divergence minimization viewpoint.
[426] Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
Yongchan Chun, Chanhee Park, Jeongho Yoon, Jaehyung Seo, Heuiseok Lim
Main category: cs.LG
TL;DR: ETN is a lightweight post-hoc module that converts pretrained models into evidential models for uncertainty estimation by learning sample-dependent affine transformations of logits.
Details
Motivation: Pretrained models lack reliable confidence measures, and existing uncertainty estimation methods are computationally expensive. EDL offers efficiency but requires training from scratch, which is impractical for pretrained networks.Method: ETN is a post-hoc module that learns sample-dependent affine transformations of logits from pretrained models, interpreting transformed outputs as Dirichlet distribution parameters for uncertainty estimation.
Result: ETN consistently improves uncertainty estimation over post-hoc baselines on image classification and LLM question-answering benchmarks under both in-distribution and out-of-distribution settings, while preserving accuracy with minimal computational overhead.
Conclusion: ETN enables efficient EDL-style uncertainty estimation for pretrained models without retraining, offering practical uncertainty quantification for deployed models.
Abstract: Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods, such as deep ensembles and MC dropout, are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines while preserving accuracy and adding only minimal computational overhead.
[427] VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning
Rahul D Ray, Utkarsh Srivastava
Main category: cs.LG
TL;DR: VOLTA is a simplified uncertainty quantification method using deep encoders, learnable prototypes, cross-entropy loss, and temperature scaling, achieving competitive accuracy and superior calibration compared to 10 baseline UQ methods across multiple datasets and distribution shifts.
Details
Motivation: There's no consensus on which uncertainty quantification method performs best across different data modalities and distribution shifts, creating challenges for deploying deep learning models in safety-critical applications where reliable uncertainty estimates are essential.Method: VOLTA simplifies uncertainty quantification by using only a deep encoder, learnable prototypes, cross-entropy loss, and post hoc temperature scaling. It’s benchmarked against 10 UQ baselines including MC Dropout, SWAG, ensemble methods, temperature scaling, energy-based OOD, Mahalanobis, hyperbolic classifiers, ENN, Taylor Sensus, and split conformal prediction.
Result: VOLTA achieves competitive or superior accuracy (up to 0.864 on CIFAR-10), significantly lower expected calibration error (0.010 vs. 0.044-0.102 for baselines), and strong OOD detection (AUROC 0.802). Statistical testing confirms VOLTA matches or outperforms most baselines across CIFAR-10, CIFAR-100, SVHN, uniform noise, CIFAR-10-C, and Tiny ImageNet features.
Conclusion: VOLTA establishes itself as a lightweight, deterministic, and well-calibrated alternative to more complex UQ approaches, with ablation studies confirming the importance of adaptive temperature and deep encoders for uncertainty quantification.
Abstract: Uncertainty quantification (UQ) is essential for deploying deep learning models in safety critical applications, yet no consensus exists on which UQ method performs best across different data modalities and distribution shifts. This paper presents a comprehensive benchmark of ten widely used UQ baselines including MC Dropout, SWAG, ensemble methods, temperature scaling, energy based OOD, Mahalanobis, hyperbolic classifiers, ENN, Taylor Sensus, and split conformal prediction against a simplified yet highly effective variant of VOLTA that retains only a deep encoder, learnable prototypes, cross entropy loss, and post hoc temperature scaling. We evaluate all methods on CIFAR 10 (in distribution), CIFAR 100, SVHN, uniform noise (out of distribution), CIFAR 10 C (corruptions), and Tiny ImageNet features (tabular). VOLTA achieves competitive or superior accuracy (up to 0.864 on CIFAR 10), significantly lower expected calibration error (0.010 vs. 0.044 to 0.102 for baselines), and strong OOD detection (AUROC 0.802). Statistical testing over three random seeds shows that VOLTA matches or outperforms most baselines, with ablation studies confirming the importance of adaptive temperature and deep encoders. Our results establish VOLTA as a lightweight, deterministic, and well calibrated alternative to more complex UQ approaches.
[428] Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits
Ramakrishnan Krishnamurthy, Arpit Agarwal, Lakshminarayanan Subramanian, Maximilian Nickel
Main category: cs.LG
TL;DR: This paper models collaboration in recommendation systems as a multi-agent stochastic linear bandit problem with transferable utility cooperative game theory, analyzing incentives and fairness among content creators.
Details
Motivation: Online recommendation platforms create interdependencies among content creators where feedback on one creator's content influences the exposure of other creators. The paper aims to analyze incentives and fairness in such collaborative settings using game theory.Method: The authors model collaboration as a multi-agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation. They analyze the game properties for homogeneous and heterogeneous agents, propose a regret-based payout rule, and validate with experiments on MovieLens-100k dataset.
Result: For homogeneous agents with fixed action sets, the induced TU game is convex with non-empty core containing Shapley value. For heterogeneous agents, the game still has non-empty core but convexity isn’t guaranteed. The proposed regret-based payout rule satisfies three Shapley axioms and lies in the core.
Conclusion: The paper provides a game-theoretic framework for analyzing incentives in collaborative recommendation systems, showing stability and fairness properties under different agent conditions, with practical payout mechanisms for heterogeneous settings.
Abstract: User interactions in online recommendation platforms create interdependencies among content creators: feedback on one creator’s content influences the system’s learning and, in turn, the exposure of other creators’ contents. To analyze incentives in such settings, we model collaboration as a multi-agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation, where a coalition’s value equals the negative sum of its members’ cumulative regrets. We show that, for identical (homogenous) agents with fixed action sets, the induced TU game is convex under mild algorithmic conditions, implying a non-empty core that contains the Shapley value and ensures both stability and fairness. For heterogeneous agents, the game still admits a non-empty core, though convexity and Shapley value core-membership are no longer guaranteed. To address this, we propose a simple regret-based payout rule that satisfies three out of the four Shapley axioms and also lies in the core. Experiments on MovieLens-100k dataset illustrate when the empirical payout aligns with – and diverges from – the Shapley fairness across different settings and algorithms.
[429] PRAGMA: Revolut Foundation Model
Maxim Ostroukhov, Ruslan Mikhailov, Vladimir Iashin, Artem Sokolov, Andrei Akshonov, Vitaly Protasov, Dmitrii Beloborodov, Vince Mullin, Roman Yokunda Enzmann, Georgios Kolovos, Jason Renders, Pavel Nesterov, Anton Repushko
Main category: cs.LG
TL;DR: PRAGMA is a family of Transformer-based foundation models for multi-source banking event sequences that uses masked modeling on heterogeneous financial data to create general-purpose representations for various downstream financial tasks.
Details
Motivation: Financial systems generate vast transactional and event-level data with rich economic signals, but existing approaches lack general-purpose foundation models that can learn from heterogeneous banking event sequences to support multiple downstream applications.Method: Pre-trains a Transformer-based architecture with masked modeling on large-scale heterogeneous banking event corpus using self-supervised objectives tailored to discrete, variable-length financial records.
Result: PRAGMA achieves superior performance across multiple domains (credit scoring, fraud detection, lifetime value prediction) directly from raw event sequences, with strong results using simple linear models on extracted embeddings and further improvements with lightweight fine-tuning.
Conclusion: PRAGMA provides a general-purpose representation layer for financial applications that effectively captures economic signals from banking event sequences, enabling strong performance across diverse downstream tasks with minimal task-specific adaptation.
Abstract: Modern financial systems generate vast quantities of transactional and event-level data that encode rich economic signals. This paper presents PRAGMA, a family of foundation models for multi-source banking event sequences. Our approach pre-trains a Transformer-based architecture with masked modelling on a large-scale, heterogeneous banking event corpus using a self-supervised objective tailored to the discrete, variable-length nature of financial records. The resulting model supports a wide range of downstream tasks such as credit scoring, fraud detection, and lifetime value prediction: strong performance can be achieved by training a simple linear model on top of the extracted embeddings and can be further improved with lightweight fine-tuning. Through extensive evaluation on downstream tasks, we demonstrate that PRAGMA achieves superior performance across multiple domains directly from raw event sequences, providing a general-purpose representation layer for financial applications.
[430] Skip-Connected Policy Optimization for Implicit Advantage
Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, Zhijiang Guo
Main category: cs.LG
TL;DR: SKPO improves RL-based reasoning by decomposing reasoning into upstream/downstream phases with skip connections to handle high-variance dense rewards in early reasoning tokens.
Details
Motivation: While dense rewards should theoretically improve reasoning performance in RLVR, Monte Carlo estimation causes high-variance advantages for early reasoning tokens, making outcome-only GRPO actually perform better in practice. Need a method to leverage dense rewards without suffering from variance issues.Method: Skip-Connected Optimization (SKPO) decomposes reasoning into upstream and downstream phases. Upstream receives dense rewards via Monte Carlo sampling with single-stream optimization. Downstream uses group-relative optimization with a skip connection that concatenates upstream reasoning with original problem, allowing the model to leverage helpful upstream reasoning while bypassing flawed reasoning through direct problem access.
Result: Achieves 3.91% and 6.17% relative gains over strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B across mathematical benchmarks and out-of-domain tasks (general reasoning and code generation). Also generates trajectories with higher intermediate-step quality even when final correctness is matched.
Conclusion: SKPO effectively addresses the variance problem in dense reward RL for reasoning tasks by architectural decomposition and skip connections, enabling better utilization of intermediate rewards while maintaining robustness to flawed early reasoning.
Abstract: Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate improvements of 3.91% and 6.17% relative gains over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B respectively across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.
[431] EvoLen: Evolution-Guided Tokenization for DNA Language Model
Nan Huang, Xiaoxiao Zhou, Junxia Cui, Mario Tapia-Pacheco, Tiffany Amariuta, Yang Li, Jingbo Shang
Main category: cs.LG
TL;DR: EvoLen is a novel DNA tokenizer that incorporates evolutionary information through cross-species stratification and length-aware decoding to better preserve functional sequence patterns like regulatory motifs, outperforming standard BPE in biological relevance while maintaining performance on DNALM benchmarks.
Details
Motivation: DNA language models lack appropriate tokenization strategies. Unlike natural language, DNA has no inherent token boundaries and is organized by biological function rather than linguistic convention. Existing approaches like BPE capture linguistic regularities but miss biological functional patterns like regulatory motifs that are evolutionarily constrained and preserved across species.Method: EvoLen combines evolutionary stratification with length-aware decoding: 1) Uses cross-species evolutionary signals to group DNA sequences, 2) Trains separate BPE tokenizers on each evolutionary group, 3) Merges vocabularies via rules prioritizing preserved patterns, 4) Applies length-aware decoding with dynamic programming to better preserve motif-scale functional units.
Result: EvoLen improves preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint. It matches or outperforms standard BPE across diverse DNA language model benchmarks while producing more biologically meaningful and interpretable sequence representations.
Conclusion: Tokenization introduces critical inductive bias in DNA language models. Incorporating evolutionary information yields more biologically meaningful representations that better capture functional sequence patterns like regulatory motifs, demonstrating the importance of domain-specific tokenization strategies.
Abstract: Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs-short, recurring segments under evolutionary constraint and typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
[432] Efficient RL Training for LLMs with Experience Replay
Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
Main category: cs.LG
TL;DR: Experience replay buffers can reduce inference compute in LLM post-training without degrading performance, challenging the belief that only fresh on-policy data is needed.
Details
Motivation: Challenge the prevailing assumption that fresh on-policy data is essential for LLM post-training, and explore whether experience replay buffers (common in RL) could reduce expensive inference compute while maintaining performance.Method: Systematic study of replay buffers for LLM post-training, formalizing optimal design as trade-off between staleness-induced variance, sample diversity, and computational cost of generation. Compare strict on-policy sampling with replay buffer approaches.
Result: Well-designed replay buffers can drastically reduce inference compute without degrading final model performance, and in some cases even improve performance while preserving policy entropy. Strict on-policy sampling is suboptimal when generation is expensive.
Conclusion: Experience replay is viable and beneficial for LLM post-training, offering significant computational savings while maintaining or improving model quality, challenging conventional wisdom about data freshness requirements.
Abstract: While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
[433] Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
Tiejin Chen, Huaiyuan Yao, Jia Chen, Evangelos E. Papalexakis, Hua Wei
Main category: cs.LG
TL;DR: MATU: A tensor decomposition framework for quantifying uncertainty in multi-agent LLM systems, addressing cascading uncertainty, communication variability, and topology diversity.
Details
Motivation: Large Language Model-based Multi-Agent Systems (MAS) outperform single-agent systems but introduce reliability challenges from communication dynamics and role dependencies. Existing uncertainty quantification methods designed for single-turn outputs fail to address MAS complexities like cascading uncertainty, variable communication paths, and diverse topologies.Method: MATU quantifies uncertainty through tensor decomposition. It moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. Tensor decomposition disentangles and quantifies distinct sources of uncertainty.
Result: MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies, providing comprehensive reliability measures generalizable across different agent structures.
Conclusion: MATU bridges the gap in uncertainty quantification for multi-agent LLM systems by offering a novel tensor decomposition approach that addresses the unique challenges of cascading uncertainty, communication variability, and topology diversity in MAS.
Abstract: While Large Language Model-based Multi-Agent Systems (MAS) consistently outperform single-agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single-turn outputs, fail to address the unique complexities of the MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi-step reasoning, the variability of inter-agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.
[434] Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning
Diyi Hu, Bhaskar Krishnamachari
Main category: cs.LG
TL;DR: CLOVER is a cooperative MARL framework that conditions value decomposition on realistic wireless communication graphs, using a GNN mixer with permutation-equivariant hypernetwork to adapt credit assignment based on actual communication topology.
Details
Motivation: Most MARL approaches assume idealized communication channels and ignore who successfully shared information with whom, while existing value decomposition methods don't account for realistic communication constraints in multi-agent systems.Method: Proposes CLOVER with a centralized value mixer conditioned on the realized communication graph under realistic wireless channels. Uses a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork, formulates an augmented MDP to isolate stochastic channel effects, and employs a stochastic receptive field encoder for variable-size message sets.
Result: CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX on Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels. Behavioral analysis shows agents learn adaptive signaling and listening strategies.
Conclusion: The communication-graph inductive bias is the key source of improvement, enabling more effective value decomposition that accounts for realistic communication constraints in cooperative multi-agent systems.
Abstract: Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.
[435] A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Hananel Hazan, Yanbo Zhang, Benedikt Hartl, Michael Levin
Main category: cs.LG
TL;DR: LottaLoRA shows that training only low-rank LoRA adapters on frozen random backbones recovers 96-100% of fully trained performance, revealing task-specific information occupies much smaller subspace than full parameter count suggests.
Details
Motivation: To investigate how much of a neural network's parameters actually encode task-specific information, and whether task-specific signal occupies a much smaller subspace than the full parameter count suggests.Method: LottaLoRA training paradigm where every backbone weight is drawn at random and frozen, and only low-rank LoRA adapters are trained. Tested across nine benchmarks with diverse architectures from single-layer classifiers to 900M parameter Transformers.
Result: Low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of parameters. The frozen backbone is actively exploited, interchangeable with any random initialization, and minimum LoRA rank estimates task intrinsic dimensionality.
Conclusion: Task-specific information occupies subspace orders of magnitude smaller than full parameter count. Models can be distributed as adapters plus random seed, with footprint growing with task complexity rather than model size, enabling significant storage and memory savings.
Abstract: How many of a neural network’s parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families from single-layer classifiers to 900M parameter Transformers low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests.Three mechanistic findings underpin this result:(1) the frozen backbone is actively exploited when static the learned scaling~$β$ remains strictly positive across all architectures but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus seed a footprint that grows with task complexity, not model size, so that storage and memory savings compound as architectures scale.
[436] Adversarial Sensor Errors for Safe and Robust Wind Turbine Fleet Control
Julian Quick, Marcus Binder Nilsen, Andreas Bechmann, Tran Nguyen Le, Pierre-Elouan Mikael Rethore
Main category: cs.LG
TL;DR: Adversarial training framework for wind farm control systems using an “Arms Race” approach between controller and adversary to improve robustness against measurement errors and cyber attacks.
Details
Motivation: Wind farm control systems face risks from measurement errors and potential cyber attacks that could alter telemetry signals, necessitating robust controllers that can maintain performance under adversarial conditions.Method: Developed a framework for training safe plant controllers using adversarial training with three approaches: co-training protagonist (controller) and adversary, finding the “Arms Race” approach most effective where both agents continuously adapt to each other’s strategies.
Result: Arms Race adversarial training reduced worst-case performance degradation from 39% power loss to 7.9% power gain relative to baseline operational strategy, demonstrating significant robustness improvements.
Conclusion: Adversarial training, particularly the Arms Race approach, effectively improves wind farm controller robustness against measurement errors and potential cyber attacks, though computational costs remain a consideration.
Abstract: Plant-level control is an emerging wind energy technology that presents opportunities and challenges. By controlling turbines in a coordinated manner via a central controller, it is possible to achieve greater wind power plant efficiency. However, there is a risk that measurement errors will confound the process, or even that hackers will alter the telemetry signals received by the central controller. This paper presents a framework for developing a safe plant controller by training it with an adversarial agent designed to confound it. This necessitates training the adversary to confound the controller, creating a sort of circular logic or “Arms Race.” This paper examines three broad training approaches for co-training the protagonist and adversary, finding that an Arms Race approach yields the best results. These initial results indicate that the Arms Race adversarial training reduced worst-case performance degradation from 39% power loss to 7.9% power gain relative to a baseline operational strategy.
[437] IKKA: Inversion Classification via Critical Anomalies for Robust Visual Servoing
Darya Pavlenko
Main category: cs.LG
TL;DR: IKKA is a topologically motivated weighting framework for robust visual servoing that treats outliers as structurally informative observations rather than noise, improving performance under distribution shifts like dim lighting and occlusion.
Details
Motivation: Conventional outlier handling in visual servoing treats maverick points as noise to be rejected, but IKKA argues these points are structurally informative observations that can reveal ambiguous decision regions where small perturbations cause qualitatively different control responses.Method: IKKA combines local extremality (E), boundary transversality (T), and multi-scale persistence (M) into a single anomaly weight W(x) = E(x) × T(x) × M(x) that modulates control updates near ambiguous decision regions. The framework is implemented in a CPU-only embedded visual-servoing pipeline on Raspberry Pi 4.
Result: In stress scenarios with dim illumination and transient occlusion, IKKA reduces 95th-percentile lateral error by 24% (0.124 to 0.094) relative to a hybrid baseline while increasing throughput from 20.0 to 24.8 Hz. Non-parametric analysis shows large effect size (Cliff’s delta = 0.79) across 230 reproducible runs.
Conclusion: IKKA demonstrates that treating outliers as structurally informative rather than noise improves visual servoing robustness under distribution shift, with practical benefits for embedded systems through reduced error and increased throughput.
Abstract: We introduce IKKA (Inversion Classification via Critical Anomalies), a topologically motivated weighting framework for robust visual servoing under distribution shift. Unlike conventional outlier handling, IKKA treats maverick points as structurally informative observations: points where small perturbations can induce qualitatively different control responses or class assignments. The method combines local extremality, boundary transversality, and multi-scale persistence into a single anomaly weight, W(x) = E(x) x T(x) x M(x), which modulates control updates near ambiguous decision regions. We instantiate IKKA in a CPU-only embedded visual-servoing pipeline on Raspberry Pi 4 and evaluate it across 230 reproducible runs under nominal and stress conditions. In stress scenarios involving dim illumination and transient occlusion, IKKA reduces the 95th-percentile lateral error by 24% relative to a hybrid baseline (0.124 to 0.094) while increasing throughput from 20.0 to 24.8 Hz. Non-parametric analysis confirms a large effect size (Cliff’s delta = 0.79).
[438] Adaptive Simulation Experiment for LLM Policy Optimization
Mingjie Hu, Siyang Gao, Jian-qiang Hu, Enlu Zhou
Main category: cs.LG
TL;DR: LLM-PO: A pairwise comparison-based adaptive simulation framework for identifying optimal policies in LLM deployment for operations management.
Details
Motivation: LLMs can improve operational efficiency but require optimal policy specification for response quality, user experience, and operational value. Need systematic method to identify best policies from candidate sets.Method: Treat LLMs as stochastic simulators, use pairwise comparison-based adaptive experiments. Two policy spaces: unstructured (no parametric assumptions) and structured (preference model). Derive optimal sampling proportions, develop LLM-PO adaptive procedure with statistical guarantees.
Result: Derived fundamental data requirements for optimal policy identification. For unstructured space: closed-form optimal sampling proportions. For structured space: regularized convex program for optimal proportions. LLM-PO identifies optimal policy with statistical guarantees while asymptotically attaining fundamental data requirements.
Conclusion: LLM-PO framework effectively identifies optimal LLM deployment policies, outperforms benchmarks, and improves LLM performance in operations management applications.
Abstract: Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
[439] $p1$: Better Prompt Optimization with Fewer Prompts
Zhaolin Gao, Yu, Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun
Main category: cs.LG
TL;DR: Prompt optimization effectiveness depends on task variance structure; proposed p1 method filters user prompts with high variance across system prompts to improve optimization.
Details
Motivation: Prompt optimization effectiveness varies widely across tasks, and understanding what makes a task amenable to optimization is crucial for improving language model performance without weight updates.Method: Analyzes reward variance decomposition into response variance (generation stochasticity) and system prompt variance (quality differences). Proposes p1 method that filters user prompts with high variance across candidate system prompts to distinguish good from bad system prompts.
Result: p1 substantially improves prompt optimization over full dataset training and outperforms baselines like GEPA. Training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
Conclusion: Prompt optimization succeeds when system prompt variance dominates response variance; scaling to more user prompts can hurt optimization on heterogeneous datasets; p1 filtering enables effective optimization with minimal data.
Abstract: Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
[440] Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning
Yashodhan D. Hakke, Almuatazbellah M. Boker, Lamine Mili, Michael von Spakovsky, Hoda Eldardiry
Main category: cs.LG
TL;DR: Extends cyber-physical-social resilience model with multi-agency control channels using game theory and reinforcement learning to reduce community fear during disasters
Details
Motivation: Existing CPS models simulate coupled disaster dynamics but lack active intervention mechanisms to reduce community fear and improve infrastructure recoveryMethod: Extends Valinejad and Mili’s CPS model with control channels for three agencies, formulates as three-player non-zero-sum differential game solved via online actor-critic reinforcement learning
Result: 70% mean fear reduction with improved infrastructure recovery in Hurricane Harvey simulations; 50% fear reduction in Hurricane Irma cross-validation without refitting
Conclusion: Multi-agency control through game-theoretic reinforcement learning effectively reduces community fear and improves disaster resilience with generalizable results
Abstract: During disasters, cascading failures across power grids, communication networks, and social behavior amplify community fear and undermine cooperation. Existing cyber-physical-social (CPS) models simulate these coupled dynamics but lack mechanisms for active intervention. We extend the CPS resilience model of Valinejad and Mili (2023) with control channels for three agencies, communication, power, and emergency management, and formulate the resulting system as a three-player non-zero-sum differential game solved via online actor-critic reinforcement learning. Simulations based on Hurricane Harvey data show 70% mean fear reduction with improved infrastructure recovery; cross-validation in the case of Hurricane Irma (without refitting) achieves 50% fear reduction, confirming generalizability.
[441] Smartwatch-Based Sitting Time Estimation in Real-World Office Settings
Olivia Zhang, Zhilin Zhang
Main category: cs.LG
TL;DR: A method using smartwatch IMU data with rotation vector sequences derived from Euler angles to estimate sitting time in office settings, showing improved performance on a 34-hour dataset.
Details
Motivation: Sedentary behavior is a major public health risk linked to obesity and chronic diseases. Accurate sitting time estimation is crucial for health monitoring, especially in real-world office environments where people spend significant time sitting.Method: Uses inertial measurement unit (IMU) signals from smartwatches worn by office workers. Introduces rotation vector sequences derived from Euler angles as a novel representation of movement dynamics for sitting time estimation.
Result: Experiments on a 34-hour dataset demonstrate that exploiting rotation vector sequences improves algorithm performance for sitting time estimation in natural office environments.
Conclusion: Rotation vector sequences show potential for robust sitting time estimation in real-world settings, offering improved accuracy for health monitoring applications.
Abstract: Sedentary behavior poses a major public health risk, being strongly linked to obesity, cardiovascular disease, and other chronic conditions. Accurately estimating sitting time is therefore critical for monitoring and improving individual health. This work addresses the problem in real-world office settings, where signals from the inertial measurement units (IMU) on a smartwatch were collected from office workers during their daily routines. We propose a method that estimates sitting time from the IMU signals by introducing the use of rotation vector sequences, derived from Euler angles, as a novel representation of movement dynamics. Experiments on a 34-hour dataset demonstrate that exploiting rotation vector sequences improves algorithm performance, highlighting their potential for robust sitting time estimation in natural environments.
[442] Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis
Haonan Zhu, Adrienne Deganutti, Elad Hirsch, Purvanshi Mehta
Main category: cs.LG
TL;DR: Proposes element-level leave-one-out analysis for evaluating SVG generation quality, focusing on structural properties rather than just visual similarity.
Details
Motivation: Current SVG generation evaluation focuses on visual similarity but ignores structural editability - the core value of SVG format. Existing metrics can't identify which elements contribute to quality, map concepts to code, or assess downstream editability.Method: Introduces element-level leave-one-out (LOO) analysis inspired by jackknife estimator. Renders SVG with/without each element, measures visual changes, and derives structural quality metrics including element quality scores, concept-element attribution, and four modularity metrics (purity, coverage, compactness, locality).
Result: Validated on over 19,000 edits across 5 generation systems and 3 complexity tiers. Provides comprehensive structural evaluation framework that enables zero-shot artifact detection and quantifies SVG modularity.
Conclusion: LOO analysis offers principled approach to evaluate SVG generation’s structural properties, addressing limitations of current similarity-based metrics and enabling better assessment of editability and modularity.
Abstract: Scalable Vector Graphics (SVG) represent visual content as structured, editable code. Each element (path, shape, or text node) can be individually inspected, transformed, or removed. This structural editability is a main motivation for SVG generation, yet prevailing evaluation protocols primarily reduce the output to a single similarity score against a reference image or input texts, measuring how faithfully the result reproduces an image or follows the instructions, but not how well it preserves the structural properties that make SVG valuable. In particular, existing metrics cannot determine which generated elements contribute positively to overall visual quality, how visual concepts map to specific parts of the code, or whether the generated output supports meaningful downstream editing. We introduce element-level leave-one-out (LOO) analysis, inspired by the classic jackknife estimator. The procedure renders the SVG with and without each element, measures the resulting visual change, and derives a suite of structural quality metrics. Despite its simplicity, the jackknife’s capacity to decompose an aggregate statistic into per-sample contributions translates directly to this setting. From a single mechanism, we obtain: (1) quality scores per element through LOO scoring that enable zero-shot artifact detection; (2) concept-element attribution that maps each element to the visual concept it serves; and (3) four structural metrics, purity, coverage, compactness, and locality, that quantify SVG modularity from complementary perspectives. We validate these metrics on over 19,000 edits (5 types) across 5 generation systems and 3 complexity tiers.
[443] Loom: A Scalable Analytical Neural Computer Architecture
Mehmet Kerem Turkcan
Main category: cs.LG
TL;DR: Loom is a novel computer architecture that executes C programs inside a looped transformer with analytically derived weights, implementing a 22-opcode instruction set in 8 transformer layers with fixed computational cost.
Details
Motivation: The motivation is to explore the intersection of neural networks and traditional computing by creating a transformer-based architecture that can execute compiled programs with fixed computational cost, independent of program length or execution history.Method: Loom implements a 22-opcode instruction set in 8 transformer layers with analytically derived weights. The full machine state resides in a single fixed-size tensor, and each forward pass executes one instruction. The model runs iteratively until the program counter reaches zero.
Result: The architecture achieves fixed computational cost for fixed tensor dimensions, with a default configuration of 4.7 million parameters and 928 instruction slots. A compact configuration (d=146, n=512) suffices for a 9×9 Sudoku solver with 284 instructions.
Conclusion: Loom demonstrates that transformers with analytically derived weights can execute compiled programs, offering a novel approach to program execution with fixed computational complexity and program-independent weights.
Abstract: We present Loom, a computer architecture that executes programs compiled from C inside a looped transformer whose weights are derived analytically. The architecture implements a 22-opcode instruction set in 8 transformer layers. Each forward pass executes one instruction; the model is applied iteratively until the program counter reaches zero. The full machine state resides in a single tensor $X \in \mathbb{R}^{d \times n}$ of fixed size, and every step has fixed cost for fixed $d$ and $n$, independent of program length or execution history. The default configuration uses $d = 155$ and $n = 1024$, yielding 4.7 million parameters and 928 instruction slots. A compact configuration at $d = 146$ and $n = 512$ suffices for a 9$\times$9 Sudoku solver (284 instructions). The weights are program-independent: programs live in the state tensor, and the same fixed-weight model executes any compiled program. We make Loom source code publicly available at https://github.com/mkturkcan/Loom.
[444] HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, Shadan Golestan
Main category: cs.LG
TL;DR: Systematic comparison of FP4 formats (HiFloat4 vs MXFP4) for low-precision training on NPUs, evaluating both dense and MoE models with stabilization techniques to maintain accuracy.
Details
Motivation: Large foundation models have high computational and memory costs, motivating low-precision training techniques. FP4 formats offer 4x improvements in throughput and memory efficiency, but need systematic evaluation on NPU hardware.Method: Conducted experiments on Ascend NPU clusters with linear and expert GEMM operations in FP4 precision. Evaluated dense architectures (Pangu, LLaMA-style) and MoE models. Explored stabilization techniques to reduce numerical degradation while maintaining efficiency.
Result: FP4 training maintains relative error within 1% of full-precision baselines while preserving 4-bit computation efficiency benefits. Provides comprehensive empirical study of FP4 training trade-offs on NPUs.
Conclusion: FP4 formats are practical for large-scale training on NPUs with proper stabilization techniques, offering significant efficiency gains for both dense and MoE models while maintaining accuracy.
Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats–such as MXFP4 and NVFP4–can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.
[445] Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning
Chia-Hong Hsu, Randall Balestriero
Main category: cs.LG
TL;DR: JFDL enables classifier-free guidance in pre-trained Consistency Models without needing a separate Diffusion Model teacher, allowing adjustable guidance like CFG but with faster sampling.
Details
Motivation: Classifier-free guidance (CFG) is useful for trading fidelity vs diversity in diffusion models, but suffers from slow sampling. Consistency models offer fast sampling but existing guidance methods require distillation from diffusion models, limiting CFG to consistency distillation methods only.Method: Proposes Joint Flow Distribution Learning (JFDL), a lightweight alignment method that enables guidance in pre-trained CMs. Uses pre-trained CM as ODE solver, verifies Gaussian nature of noise from unconditional/conditional velocity fields, and applies alignment to enable adjustable guidance knob.
Result: JFDL equips CMs with adjustable guidance similar to CFG, reduces FID on CIFAR-10 and ImageNet 64x64 datasets, and enables guided generation in originally conditional-only CMs. First method to provide effective post-hoc guidance for CMs without DM teacher.
Conclusion: JFDL bridges key gap in CM methods by enabling classifier-free guidance without requiring diffusion model distillation, making fast-sampling CMs more practical with adjustable fidelity-diversity tradeoffs.
Abstract: Classifier-free Guidance (CFG) lets practitioners trade-off fidelity against diversity in Diffusion Models (DMs). The practicality of CFG is however hindered by DMs sampling cost. On the other hand, Consistency Models (CMs) generate images in one or a few steps, but existing guidance methods require knowledge distillation from a separate DM teacher, limiting CFG to Consistency Distillation (CD) methods. We propose Joint Flow Distribution Learning (JFDL), a lightweight alignment method enabling guidance in a pre-trained CM. With a pre-trained CM as an ordinary differential equation (ODE) solver, we verify with normality tests that the variance-exploding noise implied by the velocity fields from unconditional and conditional distributions is Gaussian. In practice, JFDL equips CMs with the familiar adjustable guidance knob, yielding guided images with similar characteristics to CFG. Applied to an original Consistency Trained (CT) CM that could only do conditional sampling, JFDL unlocks guided generation and reduces FID on both CIFAR-10 and ImageNet 64x64 datasets. This is the first time that CMs are able to receive effective guidance post-hoc without a DM teacher, thus, bridging a key gap in current methods for CMs.
[446] Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal
Main category: cs.LG
TL;DR: DACO is a framework using concept dictionaries and sparse autoencoders to control MLLM activations for safety, achieving improved safety while preserving general capabilities.
Details
Motivation: Current MLLM safety approaches (prompt engineering, classification, finetuning) are ineffective against evolving threats, require rerunning queries, or are computationally heavy. Activation steering methods exist but handle only narrow safety concepts or struggle with granular control.Method: 1) Curate dictionary of 15,000 multimodal concepts from 400K+ caption-image stimuli (DACO-400K dataset); 2) Use dictionary for activation intervention via sparse coding; 3) Train Sparse Autoencoder (SAE) initialized with dictionary to automatically annotate atom semantics for MLLM safeguarding.
Result: Experiments on multiple MLLMs (QwenVL, LLaVA, InternVL) across safety benchmarks (MM-SafetyBench, JailBreakV) show DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
Conclusion: DACO provides granular control over MLLM activations using concept dictionaries and SAEs, offering an effective safety solution that balances protection with model utility.
Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
[447] Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
Giansalvo Cirrincione
Main category: cs.LG
TL;DR: HKT is a multi-scale attention mechanism with trainable causal downsampling across L resolution levels, achieving computational cost bounded by 4/3× standard attention while showing consistent performance gains on sequence tasks.
Details
Motivation: To develop a more efficient attention mechanism that can process sequences at multiple scales while maintaining theoretical guarantees and improving performance on sequence modeling tasks.Method: Hierarchical Kernel Transformer (HKT) processes sequences at L resolution levels via trainable causal downsampling, combines level-specific score matrices through learned convex weights, and decomposes attention into symmetric (reciprocal) and antisymmetric (directional) components across scales.
Result: Consistent gains over standard attention baselines: +4.77pp on synthetic ListOps, +1.44pp on sequential CIFAR-10, and +7.47pp on IMDB character-level sentiment, all at 1.31x computational overhead.
Conclusion: HKT provides an efficient multi-scale attention mechanism with theoretical guarantees, strictly subsumes standard attention and causal convolution, and demonstrates practical performance improvements across diverse sequence modeling tasks.
Abstract: The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10+-0.29% vs 50.33+-0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45+-0.09% vs 34.01+-0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19+-0.57% vs 62.72+-0.40%, T = 1,024), all at 1.31x overhead.
[448] Discrete Meanflow Training Curriculum
Chia-Hong Hsu, Frank Wood
Main category: cs.LG
TL;DR: Discrete Meanflow (DMF) training curriculum reduces computational cost for training Meanflow models while maintaining one-step sampling performance, achieving FID 3.36 on CIFAR-10 in 2000 epochs.
Details
Motivation: Meanflow models show promising few-step and one-step sampling performance but require extremely large training budgets. The authors aim to significantly reduce the computational and data requirements for training Meanflow models.Method: Proposes a Discrete Meanflow (DMF) Training Curriculum that exploits a particular discretization of the Meanflow objective to yield a consistency property. The method is initialized with a pretrained Flow Model and uses a curriculum approach to efficiently train the model.
Result: Achieves one-step FID 3.36 on CIFAR-10 in only 2000 epochs, significantly reducing training time and computational requirements compared to previous Meanflow models.
Conclusion: The DMF curriculum enables faster training of Meanflow models, particularly when fine-tuned from existing Flow Models, paving the way for more efficient training methods for future one-step generative models.
Abstract: Flow-based image generative models exhibit stable training and produce high quality samples when using multi-step sampling procedures. One-step generative models can produce high quality image samples but can be difficult to optimize as they often exhibit unstable training dynamics. Meanflow models exhibit excellent few-step sampling performance and tantalizing one-step sampling performance. Notably, MeanFlow models that achieve this have required extremely large training budgets. We significantly decrease the amount of computation and data budget it takes to train Meanflow models by noting and exploiting a particular discretization of the Meanflow objective that yields a consistency property which we formulate into a ``Discrete Meanflow’’ (DMF) Training Curriculum. Initialized with a pretrained Flow Model, DMF curriculum reaches one-step FID 3.36 on CIFAR-10 in only 2000 epochs. We anticipate that faster training curriculums of Meanflow models, specifically those fine-tuned from existing Flow Models, drives efficient training methods of future one-step examples.
[449] Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations
Rafael da Silva, Jeff Eicher, Gregory Longo
Main category: cs.LG
TL;DR: A survival analysis benchmark for student dropout prediction using OULAD dataset, comparing dynamic weekly vs. continuous-time models with multi-dimensional evaluation including discrimination, ablation, explainability, and calibration.
Details
Motivation: Addressing limitations in current learning analytics where dropout prediction models are evaluated under heterogeneous protocols that prioritize discrimination over temporal interpretability and calibration, lacking standardized benchmarks.Method: Introduces a survival-oriented benchmark using OULAD dataset with two harmonized arms: dynamic weekly (person-period representation) and continuous-time (expanded model families including tree-based survival, parametric, and neural models). Evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration.
Result: Random Survival Forest leads in discrimination and horizon-specific Brier scores in continuous-time arm; Poisson Piecewise-Exponential leads narrowly on integrated Brier score in dynamic arm. Ablation and explainability analyses converged on finding that dominant predictive signal was temporal and behavioral rather than demographic or structural. Calibration corroborated this pattern except for XGBoost AFT which showed systematic bias.
Conclusion: Supports value of harmonized, multi-dimensional benchmarks in learning analytics and positions dropout risk as a temporal-behavioral process rather than function of static background attributes, emphasizing temporal dynamics over static characteristics.
Abstract: Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families – tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
[450] Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
Roi Paul
Main category: cs.LG
TL;DR: LoRA weight delta spectral features can identify fine-tuning objectives and predict downstream behavioral harm in language models, with strong within-method detection but poor cross-method generalization.
Details
Motivation: To investigate whether geometric properties of LoRA weight deltas can reveal which fine-tuning objective was applied to a language model and whether these geometric signals correlate with downstream behavioral harm.Method: Created 38 LoRA adapters for Llama-3.2-3B-Instruct across four categories: healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters. Extracted per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment) and used logistic regression classifiers and PCA for analysis.
Result: Within DPO training, classifiers achieved AUC1.00 for binary drift detection and objective comparisons. PCA showed training objective as PC1 (AUC1.00) orthogonal to training duration on PC2. DPO-inverted-harmlessness adapters showed elevated harmful compliance (mean ASR 0.266 vs. healthy 0.112). Geometry-to-behavior correlation was ρ=0.72 across non-steered adapters. Cross-method generalization failed completely (AUC~0.00).
Conclusion: LoRA weight-space geometry carries objective identity, intensity ordering, and coarse links to harmful compliance, but cross-method monitoring requires per-method calibration.
Abstract: We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on \texttt{Llama-3.2-3B-Instruct}, we manufacture 38 LoRA adapters across four categories: healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters, and extract per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid). Within a single training method (DPO), a logistic regression classifier achieves AUC1.00 on binary drift detection, all six pairwise objective comparisons, and near-perfect ordinal severity ranking ($ρ\geq 0.956$). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC1.00 for objective separation), orthogonal to training duration on PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC~0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs.\ healthy 0.112, $Δ= +0.154$), with near-perfect dose–response ($ρ= 0.986$). The geometry-to-behavior rank correlation is $ρ= 0.72$ across 24 non-steered adapters. These results establish that within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.
[451] A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout
Rafael da Silva, Jeff Eicher, Gregory Longo
Main category: cs.LG
TL;DR: A temporal modeling framework with counterfactual policy simulation for predicting student dropout using LMS engagement data, achieving good predictive performance but showing limited policy impact.
Details
Motivation: To develop a predictive framework for student dropout in higher education that can simulate policy interventions and assess their potential impact on student retention.Method: Uses penalized, class-balanced logistic regression on person-period rows with LMS engagement data and administrative records. Includes a counterfactual policy-simulation layer that produces survival contrasts under different intervention scenarios.
Result: Achieved AUCs of 0.8350 (train) and 0.8405 (test). Policy simulations showed positive survival contrasts only in shock-based interventions, while mechanism-aware interventions had negative impacts. Performance was sensitive to feature composition.
Conclusion: The framework demonstrates capacity for structural scenario comparison under observational data constraints, but results are not causally identified and policy impacts are small and context-dependent.
Abstract: This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person–period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts $ΔS(T)$ under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch ($T_{\rm policy}=18$: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative ($ΔS_{\rm mech}(18)=-0.0078$, $ΔS_{\rm mech}(38)=-0.0134$). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework’s capacity for internal structural scenario comparison under observational data constraints.
[452] Finite-Sample Analysis of Nonlinear Independent Component Analysis:Sample Complexity and Identifiability Bounds
Yuwen Jiang
Main category: cs.LG
TL;DR: First complete finite-sample analysis of nonlinear ICA with neural networks, providing matching upper/lower bounds on sample complexity and extending to practical SGD optimization.
Details
Motivation: While asymptotic identifiability guarantees exist for nonlinear ICA, finite-sample statistical properties remain poorly understood, creating challenges for practitioners needing reliable sample size guidance.Method: Theoretical analysis establishing direct relationship between excess risk and identification error, proving matching information-theoretic lower bounds, and extending to SGD optimization under standard landscape assumptions.
Result: Complete characterization with optimal sample complexity scaling, validated through simulation experiments, showing same efficiency achievable with finite-iteration gradient descent.
Conclusion: Provides validated scaling laws for dimension and diversity, highlighting importance of finite-sample analysis for neural network training in unsupervised learning.
Abstract: Independent Component Analysis (ICA) is a fundamental unsupervised learning technique foruncovering latent structure in data by separating mixed signals into their independent sources. While substantial progress has been made in establishing asymptotic identifiability guarantees for nonlinear ICA, the finite-sample statistical properties of learning algorithms remain poorly understood. This gap poses significant challenges for practitioners who must determine appropriate sample sizes for reliable source recovery. This paper presents a comprehensive finite-sample analysis of nonlinear ICA with neural network encoders, providing the first complete characterization with matching upper and lower bounds. Our theoretical development introduces three key technical contributions. First, we establish a direct relationship between excess risk and identification error that bypasses parameter-space arguments, thereby avoiding the rate degradation that would otherwise yield suboptimal scaling. Second, we prove matching information-theoretic lower bounds that confirm the optimality of our sample complexity results. Third, we extend our analysis to practical SGD optimization, showing that the same sample efficiency can be achieved with finite-iteration gradient descent under standard landscape assumptions. We validate our theoretical predictions through carefully designed simulation experiments. This gap points toward valuable future research on finite-sample behavior of neural network training and highlights the importance of our validated scaling laws for dimension and diversity.
[453] Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective
Tokio Kajitsuka, Ukyo Honda, Sho Takase
Main category: cs.LG
TL;DR: Re-examining capacity gap in Chain-of-Thought distillation, finding it often degrades performance compared to student baseline, and proposing better evaluation protocols.
Details
Motivation: Prior work reports capacity gap issues in CoT distillation when teacher-student capability mismatch is large, but this paper revisits the problem from practical perspective, noting that CoT distillation often degrades performance compared to student's pre-distillation baseline - an issue obscured in previous evaluations.Method: Re-examines commonly used experimental settings in CoT distillation, proposes more realistic evaluation protocol that compares against student’s pre-distillation baseline, and analyzes impact of capacity gap across different tasks and settings.
Result: Finds that capacity gap effects don’t consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. CoT distillation often degrades performance compared to student baseline.
Conclusion: Offers practical guidance for selecting teacher-student pairs in CoT distillation based on more realistic evaluation protocols that account for baseline comparisons.
Abstract: Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student’s pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
[454] How does Chain of Thought decompose complex tasks?
Amrut Nadgir, Vijay Balasubramanian, Pratik Chaudhari
Main category: cs.LG
TL;DR: The paper shows that classification error in LLM tasks scales as a power law with number of classes, and that decomposing tasks into smaller classification problems (tree-structured chain-of-thought) can substantially reduce error, with optimal depth depending on degree.
Details
Motivation: To understand why chain-of-thought (CoT) reasoning improves LLM performance and to mathematically characterize the relationship between classification error, number of classes, and decomposition depth in tree-structured reasoning.Method: Mathematical analysis showing classification error scales as power law in number of classes, then modeling CoT as tree-structured decomposition where each node is a smaller classification problem with fixed degree (number of classes). Derives critical threshold for degree and optimal depth.
Result: Identifies power law scaling of classification error with number of classes. Shows splitting tasks into smaller classification problems reduces error. Finds critical degree threshold below which thinking is detrimental and above which optimal depth exists that minimizes error. Demonstrates error cannot be reduced beyond this minimum by increasing depth.
Conclusion: Chain-of-thought reasoning works by decomposing complex classification tasks into sequences of smaller classification problems, with mathematical limits on optimal decomposition depth based on degree of each classification step.
Abstract: Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes (“degree”). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they “think’”, i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.
[455] Uncertainty-Aware Transformers: Conformal Prediction for Language Models
Abhiram Vellore, Niraj K. Jha
Main category: cs.LG
TL;DR: CONFIDE is a conformal prediction framework for transformer-based language models that provides uncertainty quantification and interpretability by applying conformal methods to internal embeddings of encoder-only architectures like BERT and RoBERTa.
Details
Motivation: Transformers have revolutionized AI but their black-box nature limits trust in high-stakes applications. Models need to provide clear reasoning behind decisions, not just predictions, to be genuinely useful and trustworthy in critical settings.Method: Applies conformal prediction to internal embeddings of encoder-only transformer architectures (BERT, RoBERTa) using either [CLS] token embeddings or flattened hidden states to construct class-conditional nonconformity scores. Enables hyperparameter tuning and provides statistically valid prediction sets with instance-level explanations.
Result: Improves test accuracy by up to 4.09% on BERT-tiny and achieves greater correct efficiency (expected size of prediction set conditioned on containing true label) compared to prior methods. Early and intermediate transformer layers yield better-calibrated and more semantically meaningful representations for conformal prediction.
Conclusion: CONFIDE offers robustness and interpretability where softmax-based uncertainty fails, especially in resource-constrained models and high-stakes tasks with ambiguous labels. It serves as a practical framework for diagnostic and efficiency/robustness improvement over prior conformal baselines.
Abstract: Transformers have had a profound impact on the field of artificial intelligence, especially on large language models and their variants. However, as was the case with neural networks, their black-box nature limits trust and deployment in high-stakes settings. For models to be genuinely useful and trustworthy in critical applications, they must provide more than just predictions: they must supply users with a clear understanding of the reasoning that underpins their decisions. This article presents an uncertainty quantification framework for transformer-based language models. This framework, called CONFIDE (CONformal prediction for FIne-tuned DEep language models), applies conformal prediction to the internal embeddings of encoder-only architectures, like BERT and RoBERTa, while enabling hyperparameter tuning. CONFIDE uses either [CLS] token embeddings or flattened hidden states to construct class-conditional nonconformity scores, enabling statistically valid prediction sets with instance-level explanations. Empirically, CONFIDE improves test accuracy by up to 4.09% on BERT-tiny and achieves greater correct efficiency (i.e., the expected size of the prediction set conditioned on it containing the true label) compared to prior methods, including NM2 and VanillaNN. We show that early and intermediate transformer layers often yield better-calibrated and more semantically meaningful representations for conformal prediction. In resource-constrained models and high-stakes tasks with ambiguous labels, CONFIDE offers robustness and interpretability where softmax-based uncertainty fails. We position CONFIDE as a framework for practical diagnostic and efficiency/robustness improvement over prior conformal baselines.
[456] A Closer Look at the Application of Causal Inference in Graph Representation Learning
Hang Gao, Kunyu Li, Huang Hong, Baoquan Cui, Fengge Wu
Main category: cs.LG
TL;DR: The paper addresses causal modeling challenges in graph representation learning, proving that aggregating diverse graph elements into single causal variables violates causal inference assumptions, proposing a theoretical model based on smallest indivisible units, and developing an enhancement module for existing graph learning pipelines.
Details
Motivation: Existing approaches for modeling causal relationships in graph representation learning often aggregate diverse graph elements into single causal variables, which risks violating core assumptions of causal inference due to the inherent complexity of graph-structured data.Method: The authors prove that aggregation compromises causal validity, propose a theoretical model grounded in the smallest indivisible units of graph data, analyze costs of precise causal modeling, identify simplification conditions, construct a controllable synthetic dataset reflecting real-world causal structures, and develop a causal modeling enhancement module for existing graph learning pipelines.
Result: The paper demonstrates through extensive experiments on synthetic datasets that their theoretical model ensures causal validity, and their enhancement module shows effectiveness in comprehensive comparative experiments when integrated into existing graph learning pipelines.
Conclusion: Causal modeling in graph representation learning requires careful consideration of graph element aggregation to maintain causal validity, and the proposed theoretical framework and enhancement module provide practical solutions for improving causal inference in graph learning applications.
Abstract: Modeling causal relationships in graph representation learning remains a fundamental challenge. Existing approaches often draw on theories and methods from causal inference to identify causal subgraphs or mitigate confounders. However, due to the inherent complexity of graph-structured data, these approaches frequently aggregate diverse graph elements into single causal variables, an operation that risks violating the core assumptions of causal inference. In this work, we prove that such aggregation compromises causal validity. Building on this conclusion, we propose a theoretical model grounded in the smallest indivisible units of graph data to ensure that the causal validity is guaranteed. With this model, we further analyze the costs of achieving precise causal modeling in graph representation learning and identify the conditions under which the problem can be simplified. To empirically support our theory, we construct a controllable synthetic dataset that reflects realworld causal structures and conduct extensive experiments for validation. Finally, we develop a causal modeling enhancement module that can be seamlessly integrated into existing graph learning pipelines, and we demonstrate its effectiveness through comprehensive comparative experiments.
[457] Adaptive Candidate Point Thompson Sampling for High-Dimensional Bayesian Optimization
Donney Fan, Geoff Pleiss
Main category: cs.LG
TL;DR: ACTS improves Thompson sampling for Bayesian optimization by adaptively generating candidate points in subspaces guided by surrogate model gradients, rather than using fixed discretizations.
Details
Motivation: Thompson sampling in Bayesian optimization becomes intractable for Gaussian process surrogates as dimensionality increases, requiring fixed candidate point discretizations that become exponentially sparse in high dimensions.Method: Introduces Adaptive Candidate Thompson Sampling (ACTS) which generates candidate points in subspaces guided by the gradient of a surrogate model sample, adaptively reducing search space during sampling.
Result: ACTS produces better samples of maxima and improved optimization performance across synthetic and real-world benchmarks compared to existing Thompson sampling methods.
Conclusion: ACTS is a simple drop-in replacement for existing Thompson sampling methods that effectively addresses the curse of dimensionality in Bayesian optimization through adaptive subspace sampling.
Abstract: In Bayesian optimization, Thompson sampling selects the evaluation point by sampling from the posterior distribution over the objective function maximizer. Because this sampling problem is intractable for Gaussian process (GP) surrogates, the posterior distribution is typically restricted to fixed discretizations (i.e., candidate points) that become exponentially sparse as dimensionality increases. While previous works aim to increase candidate point density through scalable GP approximations, our orthogonal approach increases density by adaptively reducing the search space during sampling. Specifically, we introduce Adaptive Candidate Thompson Sampling (ACTS), which generates candidate points in subspaces guided by the gradient of a surrogate model sample. ACTS is a simple drop-in replacement for existing TS methods – including those that use trust regions or other local approximations – producing better samples of maxima and improved optimization across synthetic and real-world benchmarks.
[458] Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok, Kenya
Jimmy Bach, Yang Li, Yaqi Liu, John Sankok, Rose Kimani, Carrie B. Dolan, Julius N. Odhiambo, Haipeng Chen
Main category: cs.LG
TL;DR: Machine learning models predict children at risk of missing vaccines in nomadic Maasai populations using digitized health records, with synthetic data preserving privacy without performance loss.
Details
Motivation: Limited data utilization in low-resource settings hinders vaccine delivery, especially for nomadic populations like the Maasai in Kenya who face increased risk of missing vaccinations. Data privacy concerns are heightened in groups with limited sensitive health data.Method: Digitized 8 years of child vaccination records (n=6,913) and applied Logistic Regression and XGBoost models to identify children at risk. Used TabSyn (tabular diffusion-based synthetic data generation) to protect patient privacy while maintaining model performance.
Result: Classification techniques successfully predicted children at risk with recall, precision, and F1-scores exceeding 90% for some vaccines. Training with synthetic data rather than real data preserved privacy without loss in predictive performance.
Conclusion: Synthetic data implementation supports privacy-preserving, scalable forecasting for childhood immunization coverage in clinics with limited digital infrastructure, enabling better health informatics strategies.
Abstract: Background: Limited data utilization in low-resource settings poses a barrier to the vaccine delivery ecosystem, undermining efforts to achieve equitable immunization coverage. In nomadic populations, individuals face an increased risk of missing crucial vaccination doses as children. One such population is the Maasai in Narok County, Kenya, where the absence of high-volume, quality data hampers accurate coverage estimates, impedes efficient resource allocation, and weakens the ability to deliver timely interventions. Additionally, data privacy concerns are heightened in groups with limited sensitive data. Objectives: First, we aim to identify children at risk of missing key vaccines across a large population to provide timely, evidence-based interventions that support increased vaccination coverage. Second, we aim to better protect the privacy of sensitive health data in a vulnerable population. Methods: We digitized 8 years of child vaccination records from the MOH 510 registry (n=6,913) and applied machine learning models (Logistic Regression and XGBoost) to identify children at risk. Additionally, we utilize a novel approach to tabular diffusion-based synthetic data generation (TabSyn) to protect patient privacy within the models. Results: Our findings show that classification techniques can reliably and successfully predict children at risk of missing a vaccine, with recall, precision, and F1-scores exceeding 90% for some vaccines modeled. Additionally, training these models with synthetic data rather than real data, thus preserving the privacy of individuals within the original dataset, does not lead to a loss in predictive performance. Conclusion: These results support the use of synthetic data implementation in health informatics strategies for clinics with limited digital infrastructure, enabling privacy-preserving, scalable forecasting for childhood immunization coverage.
[459] Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
Taojie Zhu, Dongyang Xu, Ding Zou, Sen Zhao, Qiaobo Hao, Zhiguo Yang, Yonghong He
Main category: cs.LG
TL;DR: DYPO is a unified framework that addresses the bias-variance trade-off in LLM post-training by dynamically integrating SFT and RL through variance reduction, bias correction, and adaptive gating mechanisms.
Details
Motivation: Current LLM post-training methods face a fundamental dilemma: SFT provides stability but suffers from high fitting bias, while RL enables exploration but has high gradient variance. Existing unified approaches use naive loss weighting without addressing the statistical conflict between these gradient signals.Method: DYPO integrates three components: 1) Group Alignment Loss (GAL) reduces RL gradient variance using intrinsic group dynamics; 2) Multi-Teacher Distillation corrects SFT fitting bias via diverse reasoning paths; 3) Dynamic Exploitation-Exploration Gating adaptively arbitrates between SFT and RL based on reward feedback.
Result: DYPO significantly outperforms traditional sequential pipelines, achieving average improvements of 4.8% on complex reasoning benchmarks and 13.3% on out-of-distribution tasks. Theoretical analysis confirms it linearly reduces fitting bias and minimizes overall variance.
Conclusion: DYPO provides a principled solution to the bias-variance trade-off in LLM post-training by structurally mitigating the conflict between SFT and RL through dynamic optimization, leading to substantial performance gains on reasoning and generalization tasks.
Abstract: Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL enables exploration (low bias) but grapples with high gradient variance. Existing unified optimization strategies often employ naive loss weighting, overlooking the statistical conflict between these distinct gradient signals. In this paper, we provide a rigorous theoretical analysis of this bias-variance trade-off and propose \textbf{DYPO} (Dynamic Policy Optimization), a unified framework designed to structurally mitigate this conflict. DYPO integrates three core components: (1) a \textit{Group Alignment Loss (GAL)} that leverages intrinsic group dynamics to significantly reduce RL gradient variance; (2) a \textit{Multi-Teacher Distillation} mechanism that corrects SFT fitting bias via diverse reasoning paths; and (3) a \textit{Dynamic Exploitation-Exploration Gating} mechanism that adaptively arbitrates between stable SFT and exploratory RL based on reward feedback. Theoretical analysis confirms that DYPO linearly reduces fitting bias and minimizes overall variance. Extensive experiments demonstrate that DYPO significantly outperforms traditional sequential pipelines, achieving an average improvement of 4.8% on complex reasoning benchmarks and 13.3% on out-of-distribution tasks. Our code is publicly available at https://github.com/Tocci-Zhu/DYPO.
[460] WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
Mintae Kim, Koushil Sreenath
Main category: cs.LG
TL;DR: WOMBET is a world model-based experience transfer framework that generates reliable offline data from source tasks and adaptively fine-tunes in target tasks for improved RL sample efficiency.
Details
Motivation: RL in robotics faces high data collection costs and risks. Existing offline-to-online RL assumes fixed datasets and doesn't address how to generate reliable transfer data, creating a need for joint data generation and utilization.Method: WOMBET learns a world model in source tasks, generates offline data via uncertainty-penalized planning, filters trajectories with high return and low epistemic uncertainty, then performs online fine-tuning with adaptive sampling between offline and online data.
Result: Theoretical analysis shows uncertainty-penalized objective provides lower bound on true return with finite-sample error decomposition. Empirically, WOMBET improves sample efficiency and final performance over baselines on continuous control benchmarks.
Conclusion: Jointly optimizing data generation and transfer through world models enables stable transition from prior-driven initialization to task-specific adaptation, improving RL sample efficiency.
Abstract: Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose \textit{World Model-based Experience Transfer} (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
[461] Delve into the Applicability of Advanced Optimizers for Multi-Task Learning
Zhipeng Zhou, Linxiao Cao, Pengcheng Wu, Peilin Zhao, Chunyan Miao
Main category: cs.LG
TL;DR: APT framework enhances multi-task learning by adapting advanced optimizers like Muon to better handle multi-task optimization dynamics
Details
Motivation: Existing multi-task learning approaches are limited because they don't properly account for how advanced optimizers like Muon work - the instant-derived gradients play only a marginal role in actual parameter updates, preventing MTL frameworks from fully leveraging learning dynamicsMethod: Proposes APT framework with adaptive momentum mechanism to balance strengths between advanced optimizers and MTL, plus light direction preservation method to facilitate Muon’s orthogonalization
Result: Extensive experiments across four mainstream MTL datasets show APT consistently augments existing MTL approaches with substantial performance improvements
Conclusion: Proper integration of advanced optimizers like Muon into MTL frameworks is crucial, and APT provides an effective solution that bridges the gap between optimizer capabilities and multi-task learning requirements
Abstract: Multi-Task Learning (MTL) is a foundational machine learning problem that has seen extensive development over the past decade. Recently, various optimization-based MTL approaches have been proposed to learn multiple tasks simultaneously by altering the optimization trajectory. Although these methods strive to de-conflict and re-balance tasks, we empirically identify that their effectiveness is often undermined by an overlooked factor when employing advanced optimizers: the instant-derived gradients play only a marginal role in the actual parameter updates. This discrepancy prevents MTL frameworks from fully releasing its power on learning dynamics. Furthermore, we observe that Muon-a recently emerged advanced optimizer-inherently functions as a multi-task learner, which underscores the critical importance of the gradients used for its orthogonalization. To address these issues, we propose APT (Applicability of advanced oPTimizers), a framework featuring a simple adaptive momentum mechanism designed to balance the strengths between advanced optimizers and MTL. Additionally, we introduce a light direction preservation method to facilitate Muon’s orthogonalization. Extensive experiments across four mainstream MTL datasets demonstrate that APT consistently augments existing MTL approaches, yielding substantial performance improvements.
[462] Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning
Yi Luo, Xu Sun, Guangchun Luo, Aiguo Chen
Main category: cs.LG
TL;DR: Neighbourhood Transformers (NT) is a novel graph neural network architecture that uses self-attention within local neighborhoods instead of traditional message passing, addressing limitations of GNNs on heterophilic graphs where dissimilar nodes are connected.
Details
Motivation: Traditional GNNs rely on homophily assumption where similar nodes connect, but this fails for heterophilic graphs where dissimilar nodes are frequently connected. The paper aims to address this fundamental limitation in graph learning.Method: Proposes Neighbourhood Transformers (NT) that applies self-attention within every local neighborhood instead of aggregating messages to the central node. Also develops a neighborhood partitioning strategy with switchable attentions to reduce computational costs by over 95% space and up to 92.67% time.
Result: Extensive experiments on 10 real-world datasets (5 heterophilic and 5 homophilic graphs) show NT outperforms all current state-of-the-art methods on node classification tasks, demonstrating superior performance and cross-domain adaptability.
Conclusion: NT provides a novel paradigm for graph learning that is inherently monophily-aware, theoretically expressive, and practically efficient, with significant improvements over existing methods on both heterophilic and homophilic graphs.
Abstract: Graph neural networks (GNNs) have been widely adopted in engineering applications such as social network analysis, chemical research and computer vision. However, their efficacy is severely compromised by the inherent homophily assumption, which fails to hold for heterophilic graphs where dissimilar nodes are frequently connected. To address this fundamental limitation in graph learning, we first draw inspiration from the recently discovered monophily property of real-world graphs, and propose Neighbourhood Transformers (NT), a novel paradigm that applies self-attention within every local neighbourhood instead of aggregating messages to the central node as in conventional message-passing GNNs. This design makes NT inherently monophily-aware and theoretically guarantees its expressiveness is no weaker than traditional message-passing frameworks. For practical engineering deployment, we further develop a neighbourhood partitioning strategy equipped with switchable attentions, which reduces the space consumption of NT by over 95% and time consumption by up to 92.67%, significantly expanding its applicability to larger graphs. Extensive experiments on 10 real-world datasets (5 heterophilic and 5 homophilic graphs) show that NT outperforms all current state-of-the-art methods on node classification tasks, demonstrating its superior performance and cross-domain adaptability. The full implementation code of this work is publicly available at https://github.com/cf020031308/MoNT to facilitate reproducibility and industrial adoption.
[463] Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models
Binesh Sadanandan, Vahid Behzadan
Main category: cs.LG
TL;DR: Medical VLMs suffer from mis-calibrated confidence and sensitivity to question rephrasing, both caused by proximity to decision boundaries. Simple predictive entropy from single forward pass outperforms complex ensembles for detecting unreliable predictions.
Details
Motivation: Medical Vision Language Models (VLMs) have two critical failure modes that threaten safe clinical deployment: 1) mis-calibrated confidence (models being overconfident in wrong predictions), and 2) sensitivity to question rephrasing (predictions flipping with minor wording changes). Both issues undermine trust and reliability in medical applications.Method: Benchmarked five uncertainty quantification methods on MedGemma (4BIT) across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets. Cross-architecture validation on LLaVA-RAD (7B). Methods included: single-model predictive entropy, five-member LoRA ensembles, and MC Dropout. Evaluated calibration (Expected Calibration Error), selective prediction coverage, and ability to detect rephrase-sensitive predictions.
Result: Predictive entropy from a single forward pass effectively predicts which samples will flip under rephrasing (AUROC: 0.711 on MedGemma, 0.878 on LLaVA-RAD). LoRA ensembles failed under distribution shift (42.9 ECE, 34.1% accuracy on MIMIC→PadChest). MC Dropout achieved best calibration (ECE: 4.3) and selective prediction coverage (21.5% at 5% risk). Single forward pass entropy outperformed ensembles for both error detection (AUROC: 0.743 vs 0.657) and paraphrase screening.
Conclusion: Simple uncertainty quantification methods (single forward pass predictive entropy) outperform complex ensembles for detecting unreliable predictions in medical VLMs. Both mis-calibration and rephrase sensitivity share a common cause - proximity to decision boundaries - and can be effectively flagged using a single entropy threshold, enabling safer clinical deployment.
Abstract: Medical Vision Language Models VLMs suffer from two failure modes that threaten safe deployment mis calibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma 4BIT across in distribution MIMIC CXR and outof distribution PadChest chest X ray datasets, with cross architecture validation on LLaVA RAD7B. For well calibrated single model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing AUROC 0.711 on MedGemma, 0.878 on LLaVARAD p 10 4, enabling a single entropy threshold to flag both unreliable and rephrase sensitive predictions. A five member LoRA ensemble fails under the MIMIC PadChest shift 42.9 ECE, 34.1 accuracy, though LLaVA RAD s ensemble does not collapse 69.1. MC Dropout achieves the best calibration ECE 4.3 and selective prediction coverage 21.5 at 5 risk, yet total entropy from a single forward pass outperforms the ensemble for both error detection AUROC 0.743 vs 0.657 and paraphrase screening. Simple methods win.
[464] Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
Carlos Jimeno Miguel, Raul Orduna, Francesco Zola
Main category: cs.LG
TL;DR: A system for GDPR-compliant cybercrime analysis using Telegram data with speech-to-text transcription and NER for sensitive information detection
Details
Motivation: To create datasets for cybercrime analysis while complying with data protection regulations (GDPR and Penal Code), addressing the challenge of collecting and processing sensitive information from platforms like TelegramMethod: Proposed system collects text, audio, and images from Telegram; implements speech-to-text transcription with signal enhancement; evaluates NER solutions including Microsoft Presidio and transformer-based AI models; includes anonymization metrics for structural coherence preservation
Result: Parakeet achieves best performance in audio transcription; proposed NER solutions achieve highest f1-scores for sensitive information detection; anonymization metrics enable evaluation of data structural coherence while protecting personal information
Conclusion: The system supports cybersecurity research within legal frameworks by providing GDPR-compliant data collection, processing, and anonymization techniques for multimodal Telegram data
Abstract: This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, a system is proposed for collecting information from the Telegram platform, including text, audio, and images; the implementation of speech-to-text transcription models incorporating signal enhancement techniques; and the evaluation of different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models designed using a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest f1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data, while simultaneously guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.
[465] Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Zhiqiang Dong, Teng Pang, Rongjian Xu, Guoqiang Wu
Main category: cs.LG
TL;DR: Proposes goal-conditioned mean flow policy with average velocity field for offline hierarchical reinforcement learning, plus LeJEPA loss for better goal representations
Details
Motivation: Address limitations in offline goal-conditioned RL where Gaussian policies have limited expressiveness and high-level policies struggle to generate effective subgoals for long-horizon controlMethod: Introduces goal-conditioned mean flow policy with average velocity field for hierarchical policy modeling, enabling efficient one-step sampling. Adds LeJEPA loss to repel goal representation embeddings for more discriminative representations
Result: Achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark
Conclusion: The proposed method effectively addresses expressiveness and representation issues in offline hierarchical GCRL, improving long-horizon control capabilities
Abstract: Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, considering the insufficiency of goal representation, we introduce a LeJEPA loss that repels goal representation embeddings during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.
[466] Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
Yueyuan Sui, Payal Mohapatra, Doğaç Eldenk, Haodong Yang, Yiting Zhang, Haoyan Zhang, Qi Zhu, Stephen Xia
Main category: cs.LG
TL;DR: SentryFuse framework enables zero-shot compression of multimodal models for edge devices with modality-aware pruning and sparse attention, improving accuracy under sensor dropout without fine-tuning.
Details
Motivation: Edge devices need multimodal sensing pipelines that remain accurate despite power fluctuations and sensor dropout, but existing pruning methods require energy-intensive fine-tuning and have static importance scores blind to sensor presence.Method: Two components: 1) SentryGate learns modality-conditioned importance scores via first-order saliency supervision and prunes attention heads/feed-forward channels at deployment without fine-tuning; 2) SentryAttend replaces dense self-attention with sparse grouped-query attention.
Result: 12.7% average accuracy improvement over strongest pruning baseline, up to 18% under modality dropout; 28.2% memory reduction; 1.63× latency reduction; 15% GFLOPs reduction across three multimodal architectures.
Conclusion: SentryFuse establishes modality-aware zero-shot compression as practical for multimodal intelligence on heterogeneous edge hardware, addressing sensor dropout and power constraints without fine-tuning.
Abstract: Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and upto to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.
[467] U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu
Main category: cs.LG
TL;DR: U-Cast is a probabilistic weather forecasting model using standard U-Net architecture with efficient training that matches state-of-the-art performance while reducing computational costs by 10x+.
Details
Motivation: Current SOTA AI weather models require specialized architectures and massive computational budgets, creating high barriers to entry. The authors aim to show that frontier performance can be achieved with simpler, more accessible approaches.Method: U-Cast uses a standard U-Net backbone trained with a two-stage recipe: 1) deterministic pre-training on Mean Absolute Error, followed by 2) short probabilistic fine-tuning on CRPS using Monte Carlo Dropout for stochastic ensemble generation.
Result: Matches or exceeds probabilistic skill of GenCast and IFS ENS at 1.5° resolution while reducing training compute by over 10x compared to leading CRPS-based models and inference latency by over 10x compared to diffusion models. Trains in under 12 H200 GPU-days and generates 60-step ensemble forecast in 11 seconds.
Conclusion: Scalable, general-purpose architectures with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening frontier probabilistic weather modeling to broader community.
Abstract: AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5$^\circ$ resolution while reducing training compute by over 10$\times$ compared to leading CRPS-based models and inference latency by over 10$\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 60-step ensemble forecast in 11 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: https://github.com/Rose-STL-Lab/u-cast.
[468] The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge
Gyuwon Park, DongIl Shin, SolGil Oh, SangGi Ryu, Byung-Hak Kim
Main category: cs.LG
TL;DR: Fine-tuning LLaMa2 70B model on single A100 GPU using QLoRA with Flash Attention 2 for efficiency challenge
Details
Motivation: Address growing concerns about resource usage and transparency in large language models by demonstrating efficient fine-tuning under strict computational constraintsMethod: Used Quantized-Low Rank Adaptation (QLoRA) fine-tuning with Flash Attention 2 on custom dataset assembled from diverse open-source resources, experimenting with various LoRA configurations
Result: Successfully fine-tuned LLaMa2 70B model on single A100 40GB GPU within 24 hours, achieving high accuracy across QA benchmarks while significantly reducing resource utilization
Conclusion: Demonstrates feasibility of optimizing large-scale models in resource-constrained environments, emphasizing practical potential of LLMs for real-world applications
Abstract: The rapid evolution of Large Language Models (LLMs) has significantly impacted the field of natural language processing, but their growing complexity raises concerns about resource usage and transparency. Addressing these challenges, we participated in the NeurIPS LLM Efficiency Challenge, aiming to fine-tune a foundation model within stringent constraints. Our focus was the LLaMa2 70 billion model, optimized on a single A100 40GB GPU within a 24-hour limit. Our methodology hinged on a custom dataset, carefully assembled from diverse open-source resources and benchmark tests, aligned with the challenge’s open-source ethos. Our approach leveraged Quantized-Low Rank Adaptation (QLoRA) Fine tuning, integrated with advanced attention mechanisms like Flash Attention 2. We experimented with various configurations of the LoRA technique, optimizing the balance between computational efficiency and model accuracy. Our fine-tuning strategy was underpinned by the creation and iterative testing of multiple dataset compositions, leading to the selection of a version that demonstrated robust performance across diverse tasks and benchmarks. The culmination of our efforts was an efficiently fine-tuned LLaMa2 70B model that operated within the constraints of a single GPU, showcasing not only a significant reduction in resource utilization but also high accuracy across a range of QA benchmarks. Our study serves as a testament to the feasibility of optimizing large-scale models in resource-constrained environments, emphasizing the potential of LLMs in real-world applications.
[469] PDE-regularized Dynamics-informed Diffusion with Uncertainty-aware Filtering for Long-Horizon Dynamics
Min Young Baeg, Yoon-Yeong Kim
Main category: cs.LG
TL;DR: PDYffusion is a physics-informed diffusion framework for long-horizon spatiotemporal prediction that combines PDE-based regularization with uncertainty-aware forecasting using UKF to address error accumulation and ensure physical consistency.
Details
Motivation: Long-horizon spatiotemporal prediction suffers from cumulative errors, noise amplification, and lack of physical consistency in existing models. Current diffusion models often fail to capture underlying dynamics governed by physical laws and rely on inadequate mean squared error objectives.Method: Proposes PDYffusion with two key components: 1) PDE-regularized interpolator that incorporates differential operators to enforce physically consistent intermediate states, and 2) UKF-based forecaster that uses Unscented Kalman Filter to explicitly model uncertainty and mitigate error accumulation during iterative prediction.
Result: Extensive experiments on multiple dynamical datasets show PDYffusion achieves superior performance in CRPS and MSE metrics while maintaining stable uncertainty behavior measured by SSR. The method demonstrates balanced trade-off between prediction accuracy and uncertainty.
Conclusion: PDYffusion provides a robust solution for long-horizon forecasting by integrating physics-based regularization with uncertainty-aware diffusion modeling, addressing key challenges in spatiotemporal prediction while maintaining theoretical guarantees.
Abstract: Long-horizon spatiotemporal prediction remains a challenging problem due to cumulative errors, noise amplification, and the lack of physical consistency in existing models. While diffusion models provide a probabilistic framework for modeling uncertainty, conventional approaches often rely on mean squared error objectives and fail to capture the underlying dynamics governed by physical laws. In this work, we propose PDYffusion, a dynamics-informed diffusion framework that integrates PDE-based regularization and uncertainty-aware forecasting for stable long-term prediction. The proposed method consists of two key components: a PDE-regularized interpolator and a UKF-based forecaster. The interpolator incorporates a differential operator to enforce physically consistent intermediate states, while the forecaster leverages the Unscented Kalman Filter to explicitly model uncertainty and mitigate error accumulation during iterative prediction. We provide theoretical analyses showing that the proposed interpolator satisfies PDE-constrained smoothness properties, and that the forecaster converges under the proposed loss formulation. Extensive experiments on multiple dynamical datasets demonstrate that PDYffusion achieves superior performance in terms of CRPS and MSE, while maintaining stable uncertainty behavior measured by SSR. We further analyze the inherent trade-off between prediction accuracy and uncertainty, showing that our method provides a balanced and robust solution for long-horizon forecasting.
[470] Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models
Harry Proshian, Nikita Severin, Sergey Nikolenko, Kireev Ivan, Andrey Savchenko, Ivan Sergeev, Maria Postnova, Ilya Makarov
Main category: cs.LG
TL;DR: Three model-agnostic strategies to integrate graph structural information into contrastive self-supervised learning for temporal user-item interaction data, improving prediction accuracy.
Details
Motivation: Self-supervised learning effectively models temporal order of user-item interactions but typically overlooks the global structure of the user-item interaction graph, which contains valuable information for better user attribute prediction.Method: Three model-agnostic strategies: 1) enriching event embeddings with graph structural information, 2) aligning client representations with graph embeddings, and 3) adding a structural pretext task to contrastive SSL frameworks.
Result: Experiments on four financial and e-commerce datasets show consistent accuracy improvements (up to 2.3% AUC), with graph density identified as a key factor in selecting optimal integration strategy.
Conclusion: Integrating graph structural information into contrastive SSL for temporal event data significantly improves prediction performance, with different strategies being optimal depending on graph density characteristics.
Abstract: Large-scale digital platforms generate billions of timestamped user-item interactions (events) that are crucial for predicting user attributes in, e.g., fraud prevention and recommendations. While self-supervised learning (SSL) effectively models the temporal order of events, it typically overlooks the global structure of the user-item interaction graph. To bridge this gap, we propose three model-agnostic strategies for integrating this structural information into contrastive SSL: enriching event embeddings, aligning client representations with graph embeddings, and adding a structural pretext task. Experiments on four financial and e-commerce datasets demonstrate that our approach consistently improves the accuracy (up to a 2.3% AUC) and reveals that graph density is a key factor in selecting the optimal integration strategy.
[471] Feature-Label Modal Alignment for Robust Partial Multi-Label Learning
Yu Chen, Weijun Lv, Yue Huang, Xiaozhao Fang, Jie Wen, Yong Xu, Guanbin Li
Main category: cs.LG
TL;DR: PML-MA: A novel partial multi-label learning method that treats features and labels as complementary modalities, using low-rank decomposition for noise filtering, modal alignment for consistency restoration, and multi-peak prototype learning for enhanced discriminability.
Details
Motivation: In partial multi-label learning (PML), noisy labels in candidate label sets disrupt the feature-label correspondence, degrading classification performance. Existing methods struggle with effectively filtering noise while maintaining the multi-label nature of instances.Method: 1) Low-rank orthogonal decomposition generates pseudo-labels approximating true distribution by filtering noisy labels; 2) Modal alignment aligns features and pseudo-labels through global projection into common subspace and local neighborhood preservation; 3) Multi-peak class prototype learning leverages multi-label nature using pseudo-labels as soft membership weights.
Result: Extensive experiments on real-world and synthetic datasets show PML-MA significantly outperforms state-of-the-art methods, achieving superior classification accuracy and noise robustness.
Conclusion: PML-MA effectively addresses PML challenges by integrating modal alignment with prototype-guided refinement, ensuring pseudo-labels better reflect true distribution while maintaining robustness against label noise.
Abstract: In partial multi-label learning (PML), each instance is associated with a set of candidate labels containing both ground-truth and noisy labels. The presence of noisy labels disrupts the correspondence between features and labels, degrading classification performance. To address this challenge, we propose a novel PML method based on feature-label modal alignment (PML-MA), which treats features and labels as two complementary modalities and restores their consistency through systematic alignment. Specifically, PML-MA first employs low-rank orthogonal decomposition to generate pseudo-labels that approximate the true label distribution by filtering noisy labels. It then aligns features and pseudo-labels through both global projection into a common subspace and local preservation of neighborhood structures. Finally, a multi-peak class prototype learning mechanism leverages the multi-label nature where instances simultaneously belong to multiple categories, using pseudo-labels as soft membership weights to enhance discriminability. By integrating modal alignment with prototype-guided refinement, PML-MA ensures pseudo-labels better reflect the true distribution while maintaining robustness against label noise. Extensive experiments on both real-world and synthetic datasets demonstrate that PML-MA significantly outperforms state-of-the-art methods, achieving superior classification accuracy and noise robustness.
[472] Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
Götz-Henrik Wiegand, Lorena Raichle, Rico Städeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh
Main category: cs.LG
TL;DR: Training smaller Transformer models shows diminishing returns with dataset scaling - using only 30% of data achieves ~90% of full-data accuracy, providing practical guidance for compute-limited settings.
Details
Motivation: To understand dataset-size effects in controlled, smaller-scale Transformer training environments, since scaling laws are well-studied at large scale but less explored in compute-restricted settings like small research labs.Method: Used a strongly reduced attention-only decoder architecture, trained on progressively larger power-of-two subsets of data to isolate dataset-size effects.
Result: Observed smooth performance improvements with clear diminishing returns consistent with scaling-law behavior; using only about 30% of training data reaches approximately 90% of full-data validation token-level accuracy.
Conclusion: Provides actionable insights into dataset scaling in controlled settings and practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments.
Abstract: Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.
[473] Temporal Patch Shuffle (TPS): Leveraging Patch-Level Shuffling to Boost Generalization and Robustness in Time Series Forecasting
Jafar Bakhshaliyev, Johannes Burchert, Niels Landwehr, Lars Schmidt-Thieme
Main category: cs.LG
TL;DR: TPS is a novel data augmentation method for time series forecasting that extracts overlapping temporal patches, selectively shuffles them using variance-based ordering, and reconstructs sequences while preserving local temporal structure.
Details
Motivation: Most existing time series augmentation methods are designed for classification tasks and cannot be directly applied to forecasting due to the need to preserve temporal coherence and forecast-consistent structure.Method: Temporal Patch Shuffle (TPS) extracts overlapping temporal patches from time series, selectively shuffles a subset of patches using variance-based ordering as a conservative heuristic, and reconstructs sequences by averaging overlapping regions to maintain temporal coherence.
Result: TPS consistently improves performance across nine long-term forecasting datasets using five model families (TSMixer, DLinear, PatchTST, TiDE, LightTS) and four short-term forecasting datasets using PatchTST, with comprehensive ablation studies demonstrating effectiveness and robustness.
Conclusion: TPS is an effective, model-agnostic data augmentation method for time series forecasting that increases sample diversity while preserving forecast-consistent local temporal structure, leading to consistent performance improvements across various forecasting models and datasets.
Abstract: Data augmentation is a crucial technique for improving model generalization and robustness, particularly in deep learning models where training data is limited. Although many augmentation methods have been developed for time series classification, most are not directly applicable to time series forecasting due to the need to preserve temporal coherence. In this work, we propose Temporal Patch Shuffle (TPS), a simple and model-agnostic data augmentation method for forecasting that extracts overlapping temporal patches, selectively shuffles a subset of patches using variance-based ordering as a conservative heuristic, and reconstructs the sequence by averaging overlapping regions. This design increases sample diversity while preserving forecast-consistent local temporal structure. We extensively evaluate TPS across nine long-term forecasting datasets using five recent model families (TSMixer, DLinear, PatchTST, TiDE, and LightTS), and across four short-term forecasting datasets using PatchTST, observing consistent performance improvements. Comprehensive ablation studies further demonstrate the effectiveness, robustness, and design rationale of the proposed method.
[474] Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural Network
Joanna Komorniczak
Main category: cs.LG
TL;DR: A time-efficient synthetic tabular data generation method using fully connected neural networks with randomized loss functions that transforms Gaussian noise to approximate real datasets, outperforming state-of-the-art methods in speed and quality.
Details
Motivation: To address the need for efficient synthetic data generation that offers benefits like performance improvement through data augmentation, privacy preservation, and reliable method assessment, while overcoming the computational inefficiency of modern deep learning solutions.Method: Uses a fully connected neural network with randomized loss functions to transform random Gaussian distributions into synthetic data approximating target real-world datasets. Employs PCA for dimensionality reduction to enhance privacy and reduce complexity.
Result: Achieves state-of-the-art performance on 25 diverse tabular datasets with reference MMD scores orders of magnitude faster than modern deep learning solutions. Enhances classification quality while reducing time and memory complexity.
Conclusion: The proposed method provides an efficient solution for synthetic tabular data generation that outperforms existing methods in speed while maintaining high quality, with applications in data augmentation, privacy preservation, and method assessment.
Abstract: The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of original samples, and reliable method assessment with fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. The experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses the state-of-the-art generative methods and achieves reference MMD scores orders of magnitude faster than modern deep learning solutions. The experiments involved analyzing distributional similarity, assessing the impact on classification quality, and using PCA for dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.
[475] GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimisation
Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir
Main category: cs.LG
TL;DR: GeoPAS uses geometric probing with 2D slices across locations, orientations, and scales for algorithm selection in black-box optimization, improving over single best solver in various evaluation scenarios.
Details
Motivation: Traditional algorithm selection in continuous black-box optimization relies on fixed landscape descriptors that degrade under problem-split or cross-benchmark evaluation, necessitating more robust and transferable representations.Method: Proposes GeoPAS: geometric probing approach representing problem instances by multiple coarse 2D slices sampled across locations, orientations, and logarithmic scales. Uses shared validity-aware convolutional encoder, conditions on slice-scale and amplitude statistics, aggregates features permutation-invariantly for risk-aware solver selection via log-scale performance prediction with explicit penalty on tail failures.
Result: On COCO/BBOB with 12-solver portfolio in dimensions 2-10, GeoPAS improves over single best solver under leave-instance-out, grouped random, and leave-problem-out evaluation.
Conclusion: Multi-scale geometric slices provide useful transferable static signal for algorithm selection, though some heavy-tail regimes remain and continue to dominate mean performance.
Abstract: Automated algorithm selection in continuous black-box optimisation typically relies on fixed landscape descriptors computed under a limited probing budget, yet such descriptors can degrade under problem-split or cross-benchmark evaluation. We propose GeoPAS, a geometric probing approach that represents a problem instance by multiple coarse two-dimensional slices sampled across locations, orientations, and logarithmic scales. A shared validity-aware convolutional encoder maps each slice to an embedding, conditions it on slice-scale and amplitude statistics, and aggregates the resulting features permutation-invariantly for risk-aware solver selection via log-scale performance prediction with an explicit penalty on tail failures. On COCO/BBOB with a 12-solver portfolio in dimensions 2–10, GeoPAS improves over the single best solver under leave-instance-out, grouped random, and leave-problem-out evaluation. These results suggest that multi-scale geometric slices provide a useful transferable static signal for algorithm selection, although a small number of heavy-tail regimes remain and continue to dominate the mean. Our code is available at $\href{https://github.com/BradWangW/GeoPAS}{GitHub}$.
[476] EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers
Yi-Lun Liao, Alexander J. Hoffman, Sabrina C. Shen, Alexandre Duval, Sam Walton Norwood, Tess Smidt
Main category: cs.LG
TL;DR: EquiformerV3 introduces third-generation SE(3)-equivariant graph attention Transformer with improved efficiency, expressivity, and generality for 3D atomistic modeling, achieving state-of-the-art results on molecular datasets.
Details
Motivation: As SE(3)-equivariant graph neural networks mature for 3D atomistic modeling, there's a need to improve their efficiency, expressivity, and physical consistency for large-scale applications.Method: Three key advances: 1) Optimized software implementation for 1.75× speedup, 2) Simple modifications including equivariant merged layer normalization, improved feedforward networks, and smooth radius cutoff attention, 3) SwiGLU-S² activations to incorporate many-body interactions while preserving strict equivariance.
Result: Achieves state-of-the-art results on OC20, OMat24, and Matbench Discovery datasets when trained with auxiliary denoising non-equilibrium structures (DeNS) task.
Conclusion: EquiformerV3 advances SE(3)-equivariant graph attention Transformers in efficiency, expressivity, and generality, enabling accurate modeling of potential energy surfaces for energy-conserving simulations and higher-order derivatives.
Abstract: As $SE(3)$-equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consistency has become a central challenge for large-scale applications. In this work, we introduce EquiformerV3, the third generation of the $SE(3)$-equivariant graph attention Transformer, designed to advance all three dimensions: efficiency, expressivity, and generality. Building on EquiformerV2, we have the following three key advances. First, we optimize the software implementation, achieving $1.75\times$ speedup. Second, we introduce simple and effective modifications to EquiformerV2, including equivariant merged layer normalization, improved feedforward network hyper-parameters, and attention with smooth radius cutoff. Third, we propose SwiGLU-$S^2$ activations to incorporate many-body interactions for better theoretical expressivity and to preserve strict equivariance while reducing the complexity of sampling $S^2$ grids. Together, SwiGLU-$S^2$ activations and smooth-cutoff attention enable accurate modeling of smoothly varying potential energy surfaces (PES), generalizing EquiformerV3 to tasks requiring energy-conserving simulations and higher-order derivatives of PES. With these improvements, EquiformerV3 trained with the auxiliary task of denoising non-equilibrium structures (DeNS) achieves state-of-the-art results on OC20, OMat24, and Matbench Discovery.
[477] CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu
Main category: cs.LG
TL;DR: CORA is a safety framework for GUI agents that uses conformal risk control to provide statistical guarantees on harmful actions, with a Guardian model estimating action risk and a Diagnostician recommending interventions.
Details
Motivation: GUI agents powered by VLMs are becoming autonomous but expose users to severe financial, privacy, and social harm. Existing safeguards lack formal verification and user-tunable guarantees, creating a need for statistically grounded safety mechanisms.Method: CORA uses a Guardian model to estimate action-conditional risk for each proposed step, applies Conformal Risk Control to calibrate execute/abstain boundaries that satisfy user-specified risk budgets, routes rejected actions to a trainable Diagnostician model for multimodal reasoning and intervention recommendations, and includes a Goal-Lock mechanism to anchor assessment to clarified user intent.
Result: Experiments on the new Phone-Harm benchmark and public benchmarks show CORA improves the safety-helpfulness-interruption Pareto frontier, offering practical, statistically grounded safety for autonomous GUI execution.
Conclusion: CORA provides a practical framework with statistical guarantees for GUI agent safety, addressing the critical need for formal verification and user-tunable risk control in autonomous multimodal systems.
Abstract: Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards rely on prompt engineering, brittle heuristics and VLM-as-critic lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety–helpfulness–interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.
[478] Score-Driven Rating System for Sports
Vladimír Holý, Michal Černý
Main category: cs.LG
TL;DR: A score-driven rating system generalizing Elo that uses score (gradient of log-likelihood) for updating ratings, accommodating various game outcomes beyond win/loss.
Details
Motivation: To create a more flexible rating system that goes beyond traditional win/loss outcomes and can handle diverse game results like point differences, rankings, and multiple outcome types, while maintaining theoretical soundness.Method: Proposes a score-driven framework using the gradient of the log-likelihood (score) as the updating mechanism for player/team ratings. Derives theoretical properties of the score and shows how it generalizes Elo while accommodating various outcome types.
Result: The score-driven system has desirable theoretical properties: zero expected value, sums to zero across players, decreases with increasing rating (ensuring fairness), and exhibits reversion to true underlying skills over time.
Conclusion: Provides a theoretical foundation for existing dynamic sports performance models and offers a systematic approach for constructing new rating systems that can handle diverse game outcomes beyond simple win/loss.
Abstract: This paper introduces a score-driven rating system, a generalization of the classical Elo rating system that employs the score, i.e. the gradient of the log-likelihood, as the updating mechanism for player and team ratings. The proposed framework extends beyond simple win/loss game outcomes and accommodates a wide range of game results, such as point differences, win/draw/loss outcomes, or complete rankings. Theoretical properties of the score are derived, showing that it has zero expected value, sums to zero across all players, and decreases with increasing value of a player’s rating, thereby ensuring internal consistency and fairness. Furthermore, the score-driven rating system exhibits a reversion property, meaning that ratings tend to follow the underlying unobserved true skills over time. The proposed framework provides a theoretical rationale for existing dynamic models of sports performance and offers a systematic approach for constructing new ones.
[479] Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
Xubin Zhou, Yipeng Yang, Zhan Li
Main category: cs.LG
TL;DR: TRFP is a novel MaxEnt RL framework using truncated rectified flow policies to address challenges of multimodal action modeling with tractable entropy estimation and stable one-step sampling.
Details
Motivation: Standard Gaussian policies in MaxEnt RL are inherently unimodal, limiting their ability to model complex multimodal action distributions needed for sophisticated decision-making. Existing generative policies (diffusion/flow matching) face challenges with intractable likelihood/entropy estimation and multi-step sampling issues.Method: Proposes Truncated Rectified Flow Policy (TRFP) with hybrid deterministic-stochastic architecture. Uses gradient truncation and flow straightening techniques to make entropy-regularized optimization tractable while enabling stable training and effective one-step sampling.
Result: TRFP effectively captures multimodal behavior in toy multigoal environments, outperforms strong baselines on most of 10 MuJoCo benchmarks under standard sampling, and remains highly competitive under one-step sampling.
Conclusion: TRFP successfully addresses key challenges in incorporating expressive generative policies into MaxEnt RL, providing a practical framework for multimodal action modeling with tractable optimization and efficient inference.
Abstract: Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized optimization tractable while supporting stable training and effective one-step sampling through gradient truncation and flow straightening. Empirical results on a toy multigoal environment and 10 MuJoCo benchmarks show that TRFP captures multimodal behavior effectively, outperforms strong baselines on most benchmarks under standard sampling, and remains highly competitive under one-step sampling.
[480] Generalization and Scaling Laws for Mixture-of-Experts Transformers
Mansour Zoubeirou a Mayaki
Main category: cs.LG
TL;DR: Theoretical analysis of Mixture-of-Experts Transformers showing how generalization scales with active parameters and routing combinatorics, deriving neural scaling laws for MoE architectures.
Details
Motivation: To develop a theoretical understanding of generalization and scaling properties for Mixture-of-Experts (MoE) Transformers, separating active per-input capacity from routing combinatorics to provide statistical foundations for reasoning about MoE scaling behaviors.Method: Develops a theory using sup-norm covering-number bounds conditioned on fixed routing patterns with union bounding, combined with ERM analysis for squared loss. Proves constructive approximation theorems for MoE architectures under manifold data models.
Result: Derives generalization bounds showing approximation and estimation trade off similarly to dense networks when accounting for active parameters. Shows error can decrease by scaling active capacity or increasing number of experts. Derives neural scaling laws for model size, data size, and compute-optimal tradeoffs.
Conclusion: Provides transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory versus those arising from data-dependent routing structure or optimization dynamics.
Abstract: We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^β$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
[481] Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection
Jennifer Werner, Justus Arweiler, Indra Jungjohann, Jochen Schmid, Fabian Jirasek, Hans Hasse, Michael Bortz
Main category: cs.LG
TL;DR: A hybrid dataset combining experimental and simulation data for anomaly detection in chemical batch distillation processes, featuring automated simulation workflow and structured anomaly annotations.
Details
Motivation: Deep learning for anomaly detection in chemical processes requires large, diverse datasets that are rarely available from industrial operations. Existing experimental datasets are limited, so creating hybrid datasets combining experimental and simulation data can address this gap.Method: Created a hybrid dataset by augmenting existing experimental data with simulation data generated via an automated workflow using a novel Python-based process simulator with tailored index-reduction strategy for differential-algebraic equations. Experimental records were automatically translated into simulation scenarios after calibration to a reference experiment.
Result: Successfully generated time-series data for numerous experimental runs covering normal operation and various actuator- and control-related anomalies. The hybrid dataset is openly released and demonstrates good prediction of experimental dynamics after single-reference calibration.
Conclusion: The work provides a valuable hybrid dataset for chemical process anomaly detection research, enabling simulation-to-experiment style transfer and pseudo-experimental data generation. It demonstrates automated simulation of large-scale experimental campaigns while offering a unique basis for deep anomaly detection methods in process monitoring.
Abstract: Anomaly detection (AD) in chemical processes based on deep learning offers significant opportunities but requires large, diverse, and well-annotated training datasets that are rarely available from industrial operations. In a recent work, we introduced a large, fully annotated experimental dataset for batch distillation under normal and anomalous operating conditions. In the present study, we augment this dataset with a corresponding simulation dataset, creating a novel hybrid dataset. The simulation data is generated in an automated workflow with a novel Python-based process simulator that employs a tailored index-reduction strategy for the underlying differential-algebraic equations. Leveraging the rich metadata and structured anomaly annotations of the experimental database, experimental records are automatically translated into simulation scenarios. After calibration to a single reference experiment, the dynamics of the other experiments are well predicted. This enabled the fully automated, consistent generation of time-series data for a large number of experimental runs, covering both normal operation and a wide range of actuator- and control-related anomalies. The resulting hybrid dataset is released openly. From a process simulation perspective, this work demonstrates the automated, consistent simulation of large-scale experimental campaigns, using batch distillation as an example. From a data-driven AD perspective, the hybrid dataset provides a unique basis for simulation-to-experiment style transfer, the generation of pseudo-experimental data, and future research on deep AD methods in chemical process monitoring.
[482] On the Role of DAG topology in Energy-Aware Cloud Scheduling : A GNN-Based Deep Reinforcement Learning Approach
Anas Hattay, Fred Ngole Mboula, Eric Gascard, Zakaria Yahoun
Main category: cs.LG
TL;DR: GNN-based deep reinforcement learning schedulers for workflow DAGs fail under out-of-distribution conditions due to structural mismatches that disrupt message passing and policy generalization.
Details
Motivation: Cloud providers need to assign heterogeneous compute resources to workflow DAGs while balancing competing objectives like completion time, cost, and energy consumption. Current GNN-based deep reinforcement learning schedulers show promise but fail under certain conditions that need to be understood.Method: The paper studies a single-workflow, queue-free scheduling setting and analyzes GNN-based deep reinforcement learning schedulers designed to minimize workflow completion time and energy usage. It identifies specific out-of-distribution conditions under which these schedulers fail and provides principled explanations through controlled OOD evaluations.
Result: Performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. The analysis exposes fundamental limitations of current GNN-based schedulers.
Conclusion: There is a need for more robust representations in GNN-based schedulers to ensure reliable scheduling performance under distribution shifts, highlighting fundamental limitations of current approaches.
Abstract: Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
[483] Statistical Properties of the King Wen Sequence: An Anti-Habituation Structure That Does Not Improve Neural Network Training
Augustin Chan
Main category: cs.LG
TL;DR: The King Wen I-Ching sequence shows statistically significant combinatorial patterns resembling curriculum learning principles, but empirical tests show it degrades neural network training performance due to high variance destabilizing optimization.
Details
Motivation: The paper investigates whether the ancient King Wen sequence of the I-Ching, which exhibits statistically significant combinatorial patterns that superficially resemble curriculum learning and curiosity-driven exploration principles, could benefit modern neural network training.Method: 1) Statistical characterization using Monte Carlo permutation analysis against 100,000 random baselines; 2) Three neural network experiments: learning rate schedule modulation, curriculum ordering, and seed sensitivity analysis across two hardware platforms (NVIDIA RTX 2060 with PyTorch and Apple Silicon with MLX).
Result: The sequence has four statistically significant properties, but all neural network experiments show negative results: King Wen learning rate modulation degrades performance at all amplitudes; as curriculum ordering, it’s worst on one platform and within noise on another; 30-seed sweep confirms degradation exceeds natural seed variance.
Conclusion: The King Wen sequence’s high variance - the property that makes it statistically distinctive - destabilizes gradient-based optimization. Anti-habituation in a fixed combinatorial sequence is not equivalent to effective training dynamics.
Abstract: The King Wen sequence of the I-Ching (c. 1000 BC) orders 64 hexagrams – states of a six-dimensional binary space – in a pattern that has puzzled scholars for three millennia. We present a rigorous statistical characterization of this ordering using Monte Carlo permutation analysis against 100,000 random baselines. We find that the sequence has four statistically significant properties: higher-than-random transition distance (98.2nd percentile), negative lag-1 autocorrelation (p=0.037), yang-balanced groups of four (p=0.002), and asymmetric within-pair vs. between-pair distances (99.2nd percentile). These properties superficially resemble principles from curriculum learning and curiosity-driven exploration, motivating the hypothesis that they might benefit neural network training. We test this hypothesis through three experiments: learning rate schedule modulation, curriculum ordering, and seed sensitivity analysis, conducted across two hardware platforms (NVIDIA RTX 2060 with PyTorch and Apple Silicon with MLX). The results are uniformly negative. King Wen LR modulation degrades performance at all tested amplitudes. As curriculum ordering, King Wen is the worst non-sequential ordering on one platform and within noise on the other. A 30-seed sweep confirms that only King Wen’s degradation exceeds natural seed variance. We explain why: the sequence’s high variance – the very property that makes it statistically distinctive – destabilizes gradient-based optimization. Anti-habituation in a fixed combinatorial sequence is not the same as effective training dynamics.
[484] DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings
Zedong Peng, Zeju Li, Qiang Xu, Jieru Zhao
Main category: cs.LG
TL;DR: DiffHLS is a differential learning framework for HLS QoR prediction that uses GNNs and pretrained code LLMs to predict design performance differences from kernel baselines.
Details
Motivation: High-Level Synthesis (HLS) optimization exploration is expensive due to time-consuming synthesis for each design point. Current approaches lack efficient prediction of Quality-of-Result (QoR) for pragma-driven design variants.Method: Uses dedicated GNN branches to encode kernel and design intermediate-representation graphs, augments delta pathway with pretrained code LLM embeddings, and jointly predicts kernel baseline and design-induced delta which are composed for final prediction.
Result: On PolyBench, achieves lower average MAPE than GNN baselines under four GNN backbones, with LLM code embeddings consistently improving over GNN-only ablation. Validates scalability on ForgeHLS dataset.
Conclusion: DiffHLS provides an effective differential learning framework for HLS QoR prediction that leverages both GNNs and pretrained code LLMs to efficiently explore design optimization spaces.
Abstract: High-Level Synthesis (HLS) compiles C/C++ into RTL, but exploring pragma-driven optimization choices remains expensive because each design point requires time-consuming synthesis. We propose \textbf{\DiffHLS}, a differential learning framework for HLS Quality-of-Result (QoR) prediction that learns from kernel–design pairs: a kernel baseline and a pragma-inserted design variant. \DiffHLSencodes kernel and design intermediate-representation graphs with dedicated graph neural network (GNN) branches, and augments the delta pathway with code embeddings from a pretrained code large language model (LLM). Instead of regressing absolute targets directly, we jointly predict the kernel baseline and the design-induced delta, and compose them to obtain the design prediction. On PolyBench, \DiffHLSattains lower average MAPE than GNN baselines under four GNN backbones, and LLM code embeddings consistently improve over a GNN-only ablation. We further validate scalability on the ForgeHLS dataset.
[485] Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu
Main category: cs.LG
TL;DR: The Nexus optimizer improves downstream performance by encouraging geometric closeness of task-specific minima during pretraining, despite achieving the same pretraining loss as standard optimizers.
Details
Motivation: The paper investigates whether LLMs converge to a common minimizer across all data sources during pretraining, hypothesizing that the geometric closeness of task-specific minima is linked to downstream generalization.Method: Proposes the Nexus optimizer which encourages closeness of task-specific minima by maximizing gradient similarity during optimization, tested across models from 130M to 3B parameters with various data mixtures.
Result: Nexus significantly boosts downstream performance despite achieving the same pretraining loss, reducing out-of-distribution loss by 0.012 and yielding up to 15.0% accuracy improvement on complex reasoning tasks like GSM8k.
Conclusion: Challenges reliance on pretraining loss as sole proxy for model evaluation and demonstrates importance of implicit biases in unlocking downstream generalization through geometric optimization properties.
Abstract: Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from an unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources (e.g., \cref{fig:cwa_illustration:close}), or merely a minimizer of the summed loss (e.g., \cref{fig:cwa_illustration:distant})? We hypothesize that the geometric “closeness” of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures and hyperparameter schedules, show that Nexus \textit{significantly boosts downstream performance}, despite \textit{achieving the same pretraining loss} (see \cref{fig:demo:benchmark}). Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
[486] The causal relation between off-street parking and electric vehicle adoption in Scotland
Bernardino D’Amico, Achille Fonzone, Emma Hart
Main category: cs.LG
TL;DR: Study uses causal inference to analyze EV adoption in Scotland, finding income is the main barrier while off-street parking accelerates adoption but primarily for those already economically positioned.
Details
Motivation: To understand whether the 'charging divide' in EV adoption is due to genuine infrastructure constraints or socio-economic disparities, moving beyond conventional predictive models to identify causal relationships.Method: Applied probabilistic causal framework to nationally representative Scottish household data, enabling estimation of policy interventions while controlling for confounding factors.
Result: Off-street parking increases EV ownership probability from 3.3% to 5.6% (70% relative increase), but primarily accelerates adoption for those already economically positioned. Income is the fundamental barrier, with a 23.1 percentage point reduction in market non-participation between income strata.
Conclusion: Standard models overstate parking infrastructure effects due to selection bias. Dual-track policy needed: financial instruments to lower affordability barriers, and addressing home-charging access for urban ’latent intent’ households.
Abstract: The transition to electric mobility hinges on maximising aggregate adoption while also facilitating equitable access. This study examines whether the ‘charging divide’ between households with and without off-street parking reflects a genuine infrastructure constraint or a by-product of socio-economic disparity. Moving beyond conventional predictive models, we apply a probabilistic causal framework to a nationally representative dataset of Scottish households, enabling estimation of policy interventions while explicitly neutralising the confounding effect of other causal factors. The results reveal a structural hierarchy in the EV adoption process. Private off-street parking functions as a conversion catalyst: enabling access to home-charging increases the probability of EV ownership from 3.3% to 5.6% (a 70% relative, 2.3 percentage point absolute increase). However, this effect primarily accelerates households already economically positioned to purchase an EV rather than recruiting new entrants. By contrast, household income operates as the fundamental affordability ceiling. A causal contrast between lower- and higher-income strata, shows a reduction in market non-participation by 23.1 percentage points, identifying financial capacity as the principal gatekeeper to entering the EV transition funnel. Crucially, the analysis demonstrates that standard observational models overstate the isolated effect of off-street parking infrastructure. The apparent effect emerges from selection bias: higher-income households are disproportionately likely to possess both private parking and the means to purchase EVs. These findings support a dual-track policy strategy: lowering the affordability ceiling for non-participants through financial instruments, while addressing EV home-charging access for the ’latent intent’ cohort in high-density urban contexts.
[487] Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications
Sifan Yang, Dan-Yue Li, Lijun Zhang
Main category: cs.LG
TL;DR: Distributed online convex optimization with compressed communication: establishes lower bounds and proposes optimal algorithms with error feedback and online compression strategies.
Details
Motivation: Communication cost between local learners and central server is substantial in large-scale distributed online convex optimization applications, creating a bottleneck that needs to be alleviated through compression techniques.Method: Proposes an optimal algorithm incorporating error feedback mechanism into Follow-the-Regularized-Leader framework to handle compression-projection error coupling, and uses online compression strategy to mitigate accumulated bidirectional compression errors.
Result: Establishes Ω(δ^{-1/2}√T) and Ω(δ^{-1}logT) lower bounds for convex and strongly convex functions, and achieves matching upper bounds O(δ^{-1/2}√T) and O(δ^{-1}logT) respectively, with extensions to offline stochastic setting.
Conclusion: The proposed method provides optimal guarantees for distributed online convex optimization with compressed communication, addressing communication bottlenecks while maintaining theoretical optimality.
Abstract: Distributed online convex optimization (D-OCO) is a powerful paradigm for modeling distributed scenarios with streaming data. However, the communication cost between local learners and the central server is substantial in large-scale applications. To alleviate this bottleneck, we initiate the study of D-OCO with compressed communication. Firstly, to quantify the compression impact, we establish the $Ω(δ^{-1/2}\sqrt{T})$ and $Ω(δ^{-1}\log{T})$ lower bounds for convex and strongly convex loss functions, respectively, where $δ\in (0,1]$ is the compression ratio. Secondly, we propose an optimal algorithm, which enjoys regret bounds of $O(δ^{-1/2}\sqrt{T})$ and $O(δ^{-1} \log T)$ for convex and strongly convex loss functions, respectively. Our method incorporates the error feedback mechanism into the Follow-the-Regularized-Leader framework to address the coupling between the compression error and the projection error. Furthermore, we employ the online compression strategy to mitigate the accumulated error arising from the bidirectional compression. Our online method has great generality, and can be extended to the offline stochastic setting via online-to-batch conversion. We establish convergence rates of $O(δ^{-1/2}T^{-1/2})$ and $O(δ^{-1} T^{-1})$ for convex and strongly convex loss functions, respectively, providing the first guarantees for distributed non-smooth optimization with compressed communication and domain constraints.
[488] Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification
Yilin Zhang, Cai Xu, Haishun Chen, Ziyu Guan, Wei Zhao
Main category: cs.LG
TL;DR: TMUR addresses scale bias in multi-view evidential fusion by decoupling view-specific evidence extraction from fusion arbitration using a unified router with global context.
Details
Motivation: Current trusted multi-view classification assumes evidence from different views is numerically comparable, but this assumption is fragile due to differences in feature spaces, noise levels, semantic granularity, and lack of cross-view consistency constraints, leading to uncertainty dominated by branch-specific scale bias rather than true reliability.Method: Proposes Trusted Multi-view learning with Unified Routing (TMUR) which uses view-private experts and one collaborative expert, with a unified router that observes global multi-view context to generate sample-level expert weights. Includes soft load-balancing and diversity regularization for balanced expert utilization and discriminative specialization.
Result: Theoretical analysis shows why independent evidential supervision fails to identify common cross-view evidence scale, and why unified global routing is preferable to branch-local arbitration when reliability is sample-dependent.
Conclusion: TMUR provides a more robust approach to trusted multi-view classification by addressing scale bias through global routing and proper expert specialization.
Abstract: Trusted multi-view classification typically relies on a view-wise evidential fusion process: each view independently produces class evidence and uncertainty, and the final prediction is obtained by aggregating these independent opinions. While this design is modular and uncertainty-aware, it implicitly assumes that evidence from different views is numerically comparable. In practice, however, this assumption is fragile. Different views often differ in feature space, noise level, and semantic granularity, while independently trained branches are optimized only for prediction correctness, without any constraint enforcing cross-view consistency in evidence strength. As a result, the uncertainty used for fusion can be dominated by branch-specific scale bias rather than true sample-level reliability. To address this issue, we propose Trusted Multi-view learning with Unified Routing (TMUR), which decouples view-specific evidence extraction from fusion arbitration. TMUR uses view-private experts and one collaborative expert, and employs a unified router that observes the global multi-view context to generate sample-level expert weights. Soft load-balancing and diversity regularization further encourage balanced expert utilization and more discriminative expert specialization. We also provide theoretical analysis showing why independent evidential supervision does not identify a common cross-view evidence scale, and why unified global routing is preferable to branch-local arbitration when reliability is sample-dependent.
[489] Meta-Learned Basis Adaptation for Parametric Linear PDEs
Vikas Dwivedi, Monica Sigovan, Bruno Sixou
Main category: cs.LG
TL;DR: Hybrid physics-informed framework for parametric linear PDEs combining meta-learned predictor (KAPI) with least-squares corrector for adaptive basis generation and improved accuracy.
Details
Motivation: To develop an efficient and interpretable method for solving families of parametric linear PDEs that adapts the approximation space across parameter variations while maintaining physics-informed constraints.Method: Two-stage approach: 1) KAPI predictor - shallow task-conditioned model that maps query coordinates and PDE parameters to solution values while generating adaptive Gaussian basis geometry; 2) Least-squares corrector that uses predictor-generated geometry with background basis for physics-informed Extreme Learning Machine solve.
Result: Method evaluated on four linear PDE families (diffusion, transport, mixed advection-diffusion, variable-speed transport) shows predictor captures meaningful physics through localized basis placement, and corrector improves accuracy by one or more orders of magnitude.
Conclusion: Predictor-guided basis adaptation is an interpretable and efficient strategy for parametric PDE solving, outperforming parametric PINNs, physics-informed DeepONet, and uniform-grid PIELM correctors.
Abstract: We propose a hybrid physics-informed framework for solving families of parametric linear partial differential equations (PDEs) by combining a meta-learned predictor with a least-squares corrector. The predictor, termed \textbf{KAPI} (Kernel-Adaptive Physics-Informed meta-learner), is a shallow task-conditioned model that maps query coordinates and PDE parameters to solution values while internally generating an interpretable, task-adaptive Gaussian basis geometry. A lightweight meta-network maps PDE parameters to basis centers, widths, and activity patterns, thereby learning how the approximation space should adapt across the parametric family. This predictor-generated geometry is transferred to a second-stage corrector, which augments it with a background basis and computes the final solution through a one-shot physics-informed Extreme Learning Machine (PIELM)-style least-squares solve. We evaluate the method on four linear PDE families spanning diffusion, transport, mixed advection–diffusion, and variable-speed transport. Across these cases, the predictor captures meaningful physics through localized and transport-aligned basis placement, while the corrector further improves accuracy, often by one or more orders of magnitude. Comparisons with parametric PINNs, physics-informed DeepONet, and uniform-grid PIELM correctors highlight the value of predictor-guided basis adaptation as an interpretable and efficient strategy for parametric PDE solving.
[490] Stability Enhanced Gaussian Process Variational Autoencoders
Carl R. Richardson, Jichen Zhang, Ethan King, Ján Drgoňa
Main category: cs.LG
TL;DR: SEGP-VAE combines Gaussian processes and VAEs to learn low-dimensional LTI systems from high-dimensional video data with stability guarantees.
Details
Motivation: To develop a method that can indirectly train interpretable low-dimensional linear time invariant (LTI) systems from high-dimensional video observations while ensuring numerical stability and avoiding non-Hurwitz matrices.Method: Proposes a stability-enhanced Gaussian process variational autoencoder (SEGP-VAE) with a novel GP prior derived from LTI system definitions. Uses a complete and unconstrained parametrization that restricts LTI parameters to semi-contracting systems, enabling unconstrained optimization while preventing numerical issues.
Result: The method successfully learns latent dynamics from video data, demonstrated on a dataset of spiralling particles, showing accurate latent state predictions and benefits of the application-specific design choices.
Conclusion: SEGP-VAE provides a stable, interpretable framework for learning physical LTI systems from video data, combining probabilistic modeling with physical interpretability while avoiding numerical instability issues.
Abstract: A novel stability-enhanced Gaussian process variational autoencoder (SEGP-VAE) is proposed for indirectly training a low-dimensional linear time invariant (LTI) system, using high-dimensional video data. The mean and covariance function of the novel SEGP prior are derived from the definition of an LTI system, enabling the SEGP to capture the indirectly observed latent process using a combined probabilistic and interpretable physical model. The search space of LTI parameters is restricted to the set of semi-contracting systems via a complete and unconstrained parametrisation. As a result, the SEGP-VAE can be trained using unconstrained optimisation algorithms. Furthermore, this parametrisation prevents numerical issues caused by the presence of a non-Hurwitz state matrix. A case study applies SEGP-VAE to a dataset containing videos of spiralling particles. This highlights the benefits of the approach and the application-specific design choices that enabled accurate latent state predictions.
[491] Hierarchical Flow Decomposition for Turning Movement Prediction at Signalized Intersections
Md Atiqur Rahman Mallick, Kamrul Hasan, Pulock Das, Liang Hong, S M Shazzad Rassel
Main category: cs.LG
TL;DR: HFD-TM: Hierarchical deep learning framework for intersection turning movement prediction using corridor flow decomposition and physics-informed constraints
Details
Motivation: Turning movement prediction is essential for adaptive signal control but difficult due to high volatility; corridor flows are more stable and explain significant turning movement varianceMethod: Hierarchical framework that first forecasts corridor through-movements, then expands to individual turning streams; uses physics-informed loss function for flow conservation
Result: Achieves MAE of 2.49 vehicles per interval, reducing error by 5.7% vs Transformer and 27.0% vs GRU; 12.8x faster training than DCRNN
Conclusion: HFD-TM effectively leverages traffic structure hierarchy for accurate, efficient turning movement prediction suitable for real-time traffic applications
Abstract: Accurate prediction of intersection turning movements is essential for adaptive signal control but remains difficult due to the high volatility of directional flows. This study proposes HFD-TM (Hierarchical Flow-Decomposition for Turning Movement Prediction), a hierarchical deep learning framework that predicts turning movements by first forecasting corridor through-movements and then expanding these predictions to individual turning streams. This design is motivated by empirical traffic structure, where corridor flows account for 65.1% of total volume, exhibit lower volatility than turning movements, and explain 35.5% of turning-movement variance. A physics-informed loss function enforces flow conservation to maintain structural consistency. Evaluated on six months of 15-minute interval LiDAR (Light Detection and Ranging) data from a six-intersection corridor in Nashville, Tennessee, HFD-TM achieves a mean absolute error of 2.49 vehicles per interval, reducing MAE by 5.7% compared to a Transformer and by 27.0% compared to a GRU (Gated Recurrent Unit). Ablation results show that hierarchical decomposition provides the largest performance gain, while training time is 12.8 times lower than DCRNN (Diffusion Convolutional Recurrent Neural Network), demonstrating suitability for real-time traffic applications.
[492] Drift-Aware Online Dynamic Learning for Nonstationary Multivariate Time Series: Application to Sintering Quality Prediction
Yumeng Zhao, Shengxiang Yang, Xianpeng Wang
Main category: cs.LG
TL;DR: DA-MSDL is an online adaptive framework for nonstationary multivariate time series prediction that addresses concept drift and label latency using multi-scale feature extraction, unsupervised drift detection, and hierarchical fine-tuning.
Details
Motivation: Predicting nonstationary multivariate time series in complex industrial systems is challenging due to concept drift and significant label verification latency, which degrades offline-trained models. Existing methods struggle with multi-scale feature extraction and the stability-plasticity dilemma without immediate supervision.Method: Proposes a Drift-Aware Multi-Scale Dynamic Learning (DA-MSDL) framework with: 1) multi-scale bi-branch convolutional network backbone to disentangle local fluctuations from long-term trends, 2) Maximum Mean Discrepancy (MMD) for unsupervised drift detection to trigger adaptation before inference, and 3) drift-severity-guided hierarchical fine-tuning with prioritized experience replay from dynamic memory queue.
Result: DA-MSDL consistently outperforms representative baselines under severe concept drift in long-horizon experiments on real-world industrial sintering data and public benchmark datasets, demonstrating strong cross-domain generalization and predictive stability.
Conclusion: The framework provides an effective online dynamic learning paradigm for quality monitoring in nonstationary environments, successfully addressing concept drift and label latency challenges in industrial time series prediction.
Abstract: Accurate prediction of nonstationary multivariate time series remains a critical challenge in complex industrial systems such as iron ore sintering. In practice, pronounced concept drift compounded by significant label verification latency rapidly degrades the performance of offline-trained models. Existing methods based on static architectures or passive update strategies struggle to simultaneously extract multi-scale spatiotemporal features and overcome the stability-plasticity dilemma without immediate supervision. To address these limitations, a Drift-Aware Multi-Scale Dynamic Learning (DA-MSDL) framework is proposed to maintain robust multi-output predictive performance via online adaptive mechanisms on nonstationary data streams. The framework employs a multi-scale bi-branch convolutional network as its backbone to disentangle local fluctuations from long-term trends, thereby enhancing representational capacity for complex dynamic patterns. To circumvent the label latency bottleneck, DA-MSDL leverages Maximum Mean Discrepancy (MMD) for unsupervised drift detection. By quantifying online statistical deviations in feature distributions, DA-MSDL proactively triggers model adaptation prior to inference. Furthermore, a drift-severity-guided hierarchical fine-tuning strategy is developed. Supported by prioritized experience replay from a dynamic memory queue, this approach achieves rapid distribution alignment while effectively mitigating catastrophic forgetting. Long-horizon experiments on real-world industrial sintering data and a public benchmark dataset demonstrate that DA-MSDL consistently outperforms representative baselines under severe concept drift. Exhibiting strong cross-domain generalization and predictive stability, the proposed framework provides an effective online dynamic learning paradigm for quality monitoring in nonstationary environments.
[493] Bringing Clustering to MLL: Weakly-Supervised Clustering for Partial Multi-Label Learning
Yu Chen, Weijun Lv, Yue Huang, Xuhuan Zhu, Fang Li
Main category: cs.LG
TL;DR: WSC-PML: A novel weakly-supervised clustering approach for partial multi-label learning that bridges clustering and multi-label learning through membership matrix decomposition to handle label noise.
Details
Motivation: Label noise in multi-label learning, especially in partial multi-label learning where candidate labels contain both relevant and irrelevant labels, poses significant challenges. Traditional clustering methods cannot be directly applied to multi-label scenarios due to fundamental incompatibility between clustering's membership constraints and multi-label's binary assignments.Method: Proposes WSC-PML with key innovation of decomposing clustering membership matrix A = Π ⊙ F, where Π maintains clustering constraints while F preserves multi-label characteristics. Uses three-stage process: initial prototype learning from noisy labels, adaptive confidence-based weak supervision construction, and joint optimization via iterative clustering refinement.
Result: Extensive experiments on 24 datasets demonstrate that WSC-PML outperforms six state-of-the-art methods across all evaluation metrics.
Conclusion: The proposed weakly-supervised clustering approach effectively bridges clustering and multi-label learning for handling label noise in partial multi-label learning scenarios.
Abstract: Label noise in multi-label learning (MLL) poses significant challenges for model training, particularly in partial multi-label learning (PML) where candidate labels contain both relevant and irrelevant labels. While clustering offers a natural approach to exploit data structure for noise identification, traditional clustering methods cannot be directly applied to multi-label scenarios due to a fundamental incompatibility: clustering produces membership values that sum to one per instance, whereas multi-label assignments require binary values that can sum to any number. We propose a novel weakly-supervised clustering approach for PML (WSC-PML) that bridges clustering and multi-label learning through membership matrix decomposition. Our key innovation decomposes the clustering membership matrix $\mathbf{A}$ into two components: $\mathbf{A} = \mathbfΠ \odot \mathbf{F}$, where $\mathbfΠ$ maintains clustering constraints while $\mathbf{F}$ preserves multi-label characteristics. This decomposition enables seamless integration of unsupervised clustering with multi-label supervision for effective label noise handling. WSC-PML employs a three-stage process: initial prototype learning from noisy labels, adaptive confidence-based weak supervision construction, and joint optimization via iterative clustering refinement. Extensive experiments on 24 datasets demonstrate that our approach outperforms six state-of-the-art methods across all evaluation metrics.
[494] Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains
Zhangyong Liang
Main category: cs.LG
TL;DR: SD-FSNN is a novel neural network method for solving high-dimensional Gross-Pitaevskii equations on unbounded domains with dimension-independent computational cost and improved accuracy.
Details
Motivation: To address the exponential computational cost growth in solving high-dimensional Gross-Pitaevskii equations using traditional methods like Hermite-basis discretizations, and to overcome limitations of gradient-based optimization in neural network training.Method: Uses stochastic-dimension frozen sampled neural network with random sampling of hidden weights/biases, space-time separation with adaptive ODE solvers, Gaussian-weighted ansatz for boundary conditions, normalization projection layer, and energy conservation constraints.
Result: SD-FSNN achieves dimension-independent computational complexity (vs linear for existing methods), better accuracy, faster training than general high-dimensional solvers, and superior performance across various spatial dimensions and interaction parameters.
Conclusion: SD-FSNN provides an efficient, accurate solution for high-dimensional GPEs on unbounded domains with dimension-independent computational cost and improved training efficiency compared to existing methods.
Abstract: In this paper, we propose a stochastic-dimension frozen sampled neural network (SD-FSNN) for solving a class of high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. SD-FSNN is unbiased across all dimensions, and its computational cost is independent of the dimension, avoiding the exponential growth in computational and memory costs associated with Hermite-basis discretizations. Additionally, we randomly sample the hidden weights and biases of the neural network, significantly outperforming iterative, gradient-based optimization methods in terms of training time and accuracy. Furthermore, we employ a space-time separation strategy, using adaptive ordinary differential equation (ODE) solvers to update the evolution coefficients and incorporate temporal causality. To preserve the structure of the GPEs, we integrate a Gaussian-weighted ansatz into the neural network to enforce exponential decay at infinity, embed a normalization projection layer for mass normalization, and add an energy conservation constraint to mitigate long-time numerical dissipation. Comparative experiments with existing methods demonstrate the superior performance of SD-FSNN across a range of spatial dimensions and interaction parameters. Compared to existing random-feature methods, SD-FSNN reduces the complexity from linear to dimension-independent. Additionally, SD-FSNN achieves better accuracy and faster training compared to general high-dimensional solvers, while focusing specifically on high-dimensional GPEs on unbounded domains.
[495] Efficient Unlearning through Maximizing Relearning Convergence Delay
Khoa Tran, Simon S. Woo
Main category: cs.LG
TL;DR: Proposes a new metric called “relearning convergence delay” and an “Influence Eliminating Unlearning” framework that removes problematic data from pretrained models by degrading performance on forgetting sets while maintaining accuracy on retaining sets.
Details
Motivation: Current machine unlearning approaches and evaluation metrics focus only on model predictions, limiting insight into the model's true underlying data characteristics. There's a need for more comprehensive assessment of unlearning effectiveness and risk of data recovery.Method: Introduces relearning convergence delay metric capturing both weight space and prediction space changes. Proposes Influence Eliminating Unlearning framework that removes forgetting set influence through performance degradation, weight decay, and noise injection into model weights.
Result: Extensive experiments show the method outperforms existing metrics and approaches ideal unlearning performance. Provides theoretical guarantees including exponential convergence and upper bounds, with empirical evidence of strong retention and resistance to relearning in classification and generative unlearning tasks.
Conclusion: The proposed relearning convergence delay metric provides more comprehensive unlearning assessment, and the Influence Eliminating Unlearning framework effectively removes problematic data influence while maintaining model performance on desired data.
Abstract: Machine unlearning poses challenges in removing mislabeled, contaminated, or problematic data from a pretrained model. Current unlearning approaches and evaluation metrics are solely focused on model predictions, which limits insight into the model’s true underlying data characteristics. To address this issue, we introduce a new metric called relearning convergence delay, which captures both changes in weight space and prediction space, providing a more comprehensive assessment of the model’s understanding of the forgotten dataset. This metric can be used to assess the risk of forgotten data being recovered from the unlearned model. Based on this, we propose the Influence Eliminating Unlearning framework, which removes the influence of the forgetting set by degrading its performance and incorporates weight decay and injecting noise into the model’s weights, while maintaining accuracy on the retaining set. Extensive experiments show that our method outperforms existing metrics and our proposed relearning convergence delay metric, approaching ideal unlearning performance. We provide theoretical guarantees, including exponential convergence and upper bounds, as well as empirical evidence of strong retention and resistance to relearning in both classification and generative unlearning tasks.
[496] OASIS: Online Activation Subspace Learning for Memory-Efficient Training
Sakshi Choudhary, Utkarsh Saxena, Kaushik Roy
Main category: cs.LG
TL;DR: OASIS: Online activation subspace learning algorithm for memory-efficient LLM training that tracks evolving low-dimensional activation subspaces to reduce memory footprint without modifying forward-pass computations.
Details
Motivation: Training large language models is memory-intensive, with activations consuming significant memory. Existing approaches address weight parameterizations or gradient subspaces, but activation memory remains challenging. There's a need for methods that reduce activation memory without architectural changes or performance degradation.Method: OASIS continuously tracks and updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory. The subspace induces low-rank gradient representations, allowing gradients and optimizer states to be maintained directly in the subspace. A projection-aware optimizer transports optimizer states across subspace updates for stable training.
Result: OASIS achieves up to 2× lower peak memory than full fine-tuning while matching its performance. It outperforms prior low-rank methods across various finetuning and pretraining tasks.
Conclusion: OASIS provides an effective online activation subspace learning approach for memory-efficient LLM training that maintains performance while significantly reducing memory requirements, offering advantages over existing low-rank methods.
Abstract: Training large language models (LLMs) is constrained by memory requirements, with activations accounting for a substantial fraction of the total footprint. Existing approaches reduce memory using low-rank weight parameterizations or low-rank gradient subspaces for optimizer states, while activation memory is addressed through architectural modifications or compression schemes based on periodically updated projections. We propose OASIS, an online activation subspace learning algorithm for memory-efficient training that tracks and continuously updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory without modifying forward-pass computations. The evolving activation subspace induces low-rank gradient representations, enabling both gradients and optimizer states to be maintained directly in this subspace, while a projection-aware optimizer consistently transports optimizer states across subspace updates for stable training. Across various finetuning and pretraining tasks, OASIS achieves up to $2\times$ lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.
[497] NOMAD: Generating Embeddings for Massive Distributed Graphs
Aishwarya Sarkar, Sayan Ghosh, Nathan R. Tallent, Ali Jannesari
Main category: cs.LG
TL;DR: NOMAD is a distributed-memory graph embedding framework using MPI that implements proximity-based models from LINE algorithm, achieving significant speedups over existing methods while maintaining competitive embedding quality.
Details
Motivation: Existing graph embedding methods face scalability challenges for massive graphs with millions-to-billions of edges due to inadequate memory and processing capabilities of single-node solutions, requiring distributed approaches.Method: NOMAD uses Message Passing Interface (MPI) for distributed graphs, implements proximity-based models from LINE algorithm, and proposes practical trade-offs to improve scalability and reduce communication overheads for irregular and distributed graph embedding.
Result: Achieves median speedups of 10-100x over multi-threaded LINE and node2vec, 35-76x over distributed PBG, and 12-370x end-to-end speedups on real-world graphs while maintaining competitive embedding quality relative to state-of-the-art methods.
Conclusion: NOMAD provides an efficient distributed-memory framework for graph embedding that addresses scalability challenges of massive graphs while preserving embedding quality, making it suitable for web and science domain applications.
Abstract: Successful machine learning on graphs or networks requires embeddings that not only represent nodes and edges as low-dimensional vectors but also preserve the graph structure. Established methods for generating embeddings require flexible exploration of the entire graph through repeated use of random walks that capture graph structure with samples of nodes and edges. These methods create scalability challenges for massive graphs with millions-to-billions of edges because single-node solutions have inadequate memory and processing capabilities. We present NOMAD, a distributed-memory graph embedding framework using the Message Passing Interface (MPI) for distributed graphs. NOMAD implements proximity-based models proposed in the widely popular LINE (Large-scale Information Network Embedding) algorithm. We propose several practical trade-offs to improve the scalability and communication overheads confronted by irregular and distributed graph embedding methods, catering to massive-scale graphs arising in web and science domains. NOMAD demonstrates median speedups of 10/100x on CPU-based NERSC Perlmutter cluster relative to the popular reference implementations of multi-threaded LINE and node2vec, 35-76x over distributed PBG, and competitive embedding quality relative to LINE, node2vec, and GraphVite, while yielding 12-370x end-to-end speedups on real-world graphs.
[498] Offline Local Search for Online Stochastic Bandits
Gerdus Benadè, Rathish Das, Thomas Lavastida
Main category: cs.LG
TL;DR: A framework for converting offline local search algorithms into online stochastic combinatorial bandit algorithms with O(log³ T) approximate regret, applied to scheduling, matroid base selection, and clustering problems.
Details
Motivation: There's substantial interest in leveraging offline algorithm design knowledge for online decision-making in combinatorial multi-armed bandits. While greedy and linear optimization algorithms have been explored, local search methods - widely used in theory and practice - remain under-explored in online settings.Method: Proposes a generic method for converting offline local search algorithms that terminate in approximately optimal solutions into online stochastic combinatorial bandit algorithms. The framework achieves O(log³ T) approximate regret, significantly better than existing polynomial dependence on T.
Result: The framework yields O(log³ T) approximate regret for online stochastic combinatorial bandits, demonstrating improved theoretical guarantees compared to existing offline-to-online conversion methods. Applied successfully to three problems: scheduling to minimize total completion time, finding minimum cost matroid bases, and uncertain clustering.
Conclusion: Local search methods can be effectively adapted to online stochastic combinatorial bandit settings with strong theoretical guarantees, providing a flexible framework that outperforms existing conversion methods in terms of regret bounds.
Abstract: Combinatorial multi-armed bandits provide a fundamental online decision-making environment where a decision-maker interacts with an environment across $T$ time steps, each time selecting an action and learning the cost of that action. The goal is to minimize regret, defined as the loss compared to the optimal fixed action in hindsight under full-information. There has been substantial interest in leveraging what is known about offline algorithm design in this online setting. Offline greedy and linear optimization algorithms (both exact and approximate) have been shown to provide useful guarantees when deployed online. We investigate local search methods, a broad class of algorithms used widely in both theory and practice, which have thus far been under-explored in this context. We focus on problems where offline local search terminates in an approximately optimal solution and give a generic method for converting such an offline algorithm into an online stochastic combinatorial bandit algorithm with $O(\log^3 T)$ (approximate) regret. In contrast, existing offline-to-online frameworks yield regret (and approximate regret) which depend sub-linearly, but polynomially on $T$. We demonstrate the flexibility of our framework by applying it to three online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid and uncertain clustering.
[499] SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
Maksim Anisimov, Francesco Belardinelli, Matthew Wicker
Main category: cs.LG
TL;DR: A novel approach for safe policy updates in continual reinforcement learning using Rashomon sets to provide formal safety guarantees during adaptation to new tasks while preserving safety on previously encountered tasks.
Details
Motivation: RL agents in safety-critical tasks need to adapt to non-stationary environments while preserving safety properties from previous tasks. Current approaches lack formal guarantees or only verify safety after deployment.Method: Introduces Rashomon sets - certified regions in policy parameter space that meet safety constraints within demonstration data distribution. Updates any RL algorithm by projecting policy updates onto this certified safe region.
Result: Empirical validation on grid-world navigation environments (Frozen Lake, Poisoned Apple) shows provably deterministic safety on source tasks during downstream adaptation, while baselines experience catastrophic forgetting of safety constraints.
Conclusion: The Rashomon set approach enables strong adaptation with provable safety guarantees, addressing the fundamental challenge of safe policy updates in continual RL for safety-critical applications.
Abstract: Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.
[500] AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning
Ioannis Tsingalis, Constantine Kotropoulos, Corentin Briat
Main category: cs.LG
TL;DR: AdaCubic is a novel adaptive cubic regularization method for optimization that dynamically adjusts cubic term weights using an auxiliary optimization problem, with Hessian approximation via Hutchinson’s method, achieving competitive performance across CV, NLP, and signal processing tasks without hyperparameter tuning.
Details
Motivation: The paper aims to develop an optimizer that leverages cubic regularization for deep learning applications while addressing the computational cost issues of traditional Newton methods and eliminating the need for hyperparameter fine-tuning that plagues many adaptive optimization algorithms.Method: AdaCubic adapts the weight of the cubic term in Newton’s cubic regularized method through an auxiliary optimization problem with cubic constraints. It uses Hutchinson’s method to approximate the Hessian matrix to reduce computational cost, making it scalable for deep learning applications.
Result: AdaCubic inherits the local convergence guarantees of cubically regularized Newton methods and outperforms or competes with several widely used optimizers in Computer Vision, Natural Language Processing, and Signal Processing tasks, all evaluated with a fixed set of hyperparameters.
Conclusion: AdaCubic represents the first optimizer to successfully leverage cubic regularization in scalable deep learning applications, offering an attractive option for researchers and practitioners due to its competitive performance without requiring hyperparameter fine-tuning.
Abstract: A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts the weight of the cubic term in Newton’s cubic regularized method. We use Hutchinson’s method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method’s local convergence guarantees. Our experiments in Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, rendering it a highly attractive optimizer in settings where fine-tuning is infeasible. This makes AdaCubic an attractive option for researchers and practitioners alike. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.
[501] Integrated electro-optic attention nonlinearities for transformers
Luis Mickeler, Kai Lion, Alfonso Nardi, Jost Kellner, Pierre Didier, Bhavin J. Shastri, Niao He, Rachel Grange
Main category: cs.LG
TL;DR: Analog electro-optic implementation of Softmax/Sigmoid functions using thin-film lithium niobate modulators for faster transformer inference
Details
Motivation: Softmax operations in transformers create latency bottlenecks despite being computationally small; analog optical computing can accelerate these nonlinear functionsMethod: Use thin-film lithium niobate Mach-Zehnder modulators as analog nonlinear computational elements to implement electro-optic alternatives to digital Softmax and Sigmoid functions
Result: System maintains competitive accuracy in Vision Transformers and LLMs even with 4-bit quantization; operates at encoding speeds up to 10 GBaud with characterized noise robustness
Conclusion: TFLN modulators can serve as nonlinear function units in hybrid co-packaged hardware for high-speed, energy-efficient nonlinear computation in transformers
Abstract: Transformers have emerged as the dominant neural-network architecture, achieving state-of-the-art performance in language processing and computer vision. At the core of these models lies the attention mechanism, which requires a nonlinear, non-negative mapping using the Softmax function. However, although Softmax operations account for less than 1% of the total operation count, they can disproportionately bottleneck overall inference latency. Here, we use thin-film lithium niobate (TFLN) Mach-Zehnder modulators (MZMs) as analog nonlinear computational elements to drastically reduce the latency of nonlinear computations. We implement electro-optic alternatives to digital Softmax and Sigmoid, and evaluate their performance in Vision Transformers and Large Language Models. Our system maintains highly competitive accuracy, even under aggressive 4-bit input-output quantization of the analog units. We further characterize system noise at encoding speeds up to 10 GBaud and assess model robustness under various noise conditions. Our findings suggest that TFLN modulators can serve as nonlinear function units within hybrid co-packaged hardware, enabling high-speed and energy-efficient nonlinear computation.
[502] Toward World Models for Epidemiology
Zeeshan Memon, Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Liang Zhao, Naren Ramakrishnan
Main category: cs.LG
TL;DR: World models applied to computational epidemiology for reasoning about latent disease states, noisy policy-dependent surveillance, and intervention effects with behavioral feedback.
Details
Motivation: Epidemiology is an ideal but underdeveloped setting for world models because epidemic decision-making requires reasoning about latent disease burden, imperfect surveillance signals that depend on policy, and interventions that affect human behavior.Method: Introduces a conceptual framework formulating epidemics as controlled, partially observed dynamical systems where: (1) true epidemic state is latent, (2) observations are noisy and endogenous to policy, and (3) interventions act as sequential actions with effects propagating through behavioral and social feedback.
Result: Three case studies demonstrate why explicit world modeling is necessary: strategic misreporting in behavioral surveillance, systematic delays in time-lagged signals like hospitalizations/deaths, and counterfactual intervention analysis where identical histories diverge under different action sequences.
Conclusion: Computational epidemiology provides a natural application domain for world models, offering a framework to address challenges in epidemic policy-making through explicit modeling of latent states, policy-dependent observations, and behavioral feedback mechanisms.
Abstract: World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertainty. In this paper, we argue that computational epidemiology is a natural and underdeveloped setting for world models. This is because epidemic decision-making requires reasoning about latent disease burden, imperfect and policy-dependent surveillance signals, and intervention effects are mediated by adaptive human behavior. We introduce a conceptual framework for epidemiological world models, formulating epidemics as controlled, partially observed dynamical systems in which (i) the true epidemic state is latent, (ii) observations are noisy and endogenous to policy, and (iii) interventions act as sequential actions whose effects propagate through behavioral and social feedback. We present three case studies that illustrate why explicit world modeling is necessary for policy-relevant reasoning: strategic misreporting in behavioral surveillance, systematic delays in time-lagged signals such as hospitalizations and deaths, and counterfactual intervention analysis where identical histories diverge under alternative action sequences.
[503] ANTIC: Adaptive Neural Temporal In-situ Compressor
Sandeep S. Cranganore, Andrei Bodnar, Gianluca Galleti, Fabian Paischer, Johannes Brandstetter
Main category: cs.LG
TL;DR: ANTIC is an in situ compression pipeline for large-scale PDE simulations that combines adaptive temporal selection with neural spatial compression to reduce petabyte-scale storage requirements.
Details
Motivation: High-resolution PDE simulations (Navier-Stokes, magnetohydrodynamics, plasma physics, black hole mergers) generate petabyte-to-exabyte scale data that exceeds modern HPC storage capabilities, creating a critical bottleneck for scientific computing.Method: ANTIC uses: 1) adaptive temporal selector to filter informative snapshots during simulation, and 2) spatial neural compression module with continual fine-tuning that learns residual updates between adjacent snapshots using neural fields, operating in a single streaming pass.
Result: The method achieves storage reductions of several orders of magnitude while maintaining physics accuracy, effectively alleviating the need for explicit on-disk storage of entire time-evolved trajectories.
Conclusion: ANTIC provides an end-to-end in situ compression solution for large-scale PDE simulations that addresses the petabyte-scale storage bottleneck through adaptive temporal selection and neural spatial compression.
Abstract: The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.
[504] Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique
Main category: cs.LG
TL;DR: NeuronLens introduces a range-based interpretation and manipulation framework for LLMs that addresses polysemantic neurons by targeting specific activation ranges rather than entire neurons, enabling more precise concept control with less collateral damage.
Details
Motivation: Polysemanticity in LLMs (where single neurons respond to multiple concepts) undermines traditional neuron-concept attribution methods, making model interpretation and control challenging. Current neuron-level interventions cause collateral damage to other concepts.Method: Analyzed encoder and decoder LLMs across diverse datasets, observed that concept-conditioned activation magnitudes form distinct Gaussian-like distributions. Introduced NeuronLens framework that localizes concept attribution to specific activation ranges within neurons rather than whole neurons.
Result: Range-based interventions enable effective manipulation of target concepts while causing substantially less collateral degradation to auxiliary concepts and overall model performance compared to neuron-level masking approaches.
Conclusion: Range-based interpretation and manipulation offers a more precise approach to LLM interpretability and control by addressing the pervasive polysemanticity problem through activation range targeting rather than neuron-level interventions.
Abstract: Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron-concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder and decoder based LLMs across diverse datasets, and observe that even highly salient neurons for specific semantic concepts consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework, that localizes concept attribution to activation ranges within a neuron. Extensive empirical evaluations show that range-based interventions enable effective manipulation of target concepts while causing substantially less collateral degradation to auxiliary concepts and overall model performance compared to neuron-level masking.
[505] LLM4Delay: Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Thaweerath Phisannupawong, Joshua Julian Damanik, Han-Lim Choi
Main category: cs.LG
TL;DR: LLM4Delay is a multimodal framework using large language models to predict flight delays by integrating textual aeronautical information with trajectory data through cross-modality adaptation.
Details
Motivation: Flight delays reflect inefficiencies in air traffic management systems, and there's a need for better prediction methods that can leverage both textual information (flight data, weather reports, aerodrome notices) and trajectory data for more accurate delay forecasting.Method: Proposes LLM4Delay framework that integrates textual aeronautical information with multiple trajectory data modeling airspace conditions. Uses instance-level projection to map trajectory representations into language modality, enabling cross-modality adaptation between textual and trajectory contexts.
Result: LLM4Delay demonstrates superior performance compared to existing ATM frameworks and prior time-series-to-language adaptation methods, showing improved delay prediction accuracy by leveraging both textual and trajectory data.
Conclusion: The framework effectively combines textual and trajectory data through cross-modality adaptation, highlighting their complementary roles and enabling continuous prediction updates as new information becomes available, with potential operational relevance.
Abstract: Flight delay prediction has become a key focus in air traffic management (ATM), as delays reflect inefficiencies in the system. This paper proposes LLM4Delay, a large language model (LLM)-based framework for predicting flight delays from the perspective of air traffic controllers monitoring aircraft after they enter the terminal maneuvering area (TMA). LLM4Delay is designed to integrate textual aeronautical information, including flight data, weather reports, and aerodrome notices, together with multiple trajectories that model airspace conditions, forming a comprehensive delay-relevant context. By jointly leveraging comprehensive textual and trajectory contexts via instance-level projection, an effective cross-modality adaptation strategy that maps multiple instance-level trajectory representations into the language modality, the framework improves delay prediction accuracy. LLM4Delay demonstrates superior performance compared to existing ATM frameworks and prior time-series-to-language adaptation methods. This highlights the complementary roles of textual and trajectory data while leveraging knowledge from both the pretrained trajectory encoder and the pretrained LLM. The proposed framework enables continuous updates to predictions as new information becomes available, indicating potential operational relevance.
[506] FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai
Main category: cs.LG
TL;DR: FP8 quantization for LLM RL rollout acceleration with techniques for weight synchronization, KV-cache optimization, and train-inference mismatch correction
Details
Motivation: RL for LLMs is bottlenecked by rollout generation due to long output sequences where attention and KV-cache memory dominate end-to-end step time. FP8 offers acceleration but introduces engineering challenges with changing policy weights and train-inference mismatch.Method: Three main techniques: (1) FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (2) FP8 KV-cache with per-step QKV scale recalibration to address long-context memory bottlenecks, (3) Importance-sampling-based rollout correction (token-level TIS/MIS variants) to mitigate train-inference mismatch.
Result: Across dense and MoE models, delivers up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
Conclusion: Presents a practical FP8 rollout stack for LLM RL that addresses engineering challenges and enables significant acceleration while maintaining training stability.
Abstract: Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
[507] Task-Distributionally Robust Data-Free Meta-Learning
Zixuan Hu, Yongxian Wei, Li Shen, Zhenyi Wang, Baoyuan Wu, Chun Yuan, Dacheng Tao
Main category: cs.LG
TL;DR: Trustworthy Data-Free Meta-Learning framework addressing robustness vulnerabilities in few-shot learning from pre-trained models without original data.
Details
Motivation: Existing Data-Free Meta-Learning (DFML) methods lack comprehensive robustness analysis, particularly regarding failure modes and vulnerability to attacks in real-world environments where algorithms operate under uncertainty.Method: Proposes trustworthy DFML framework with three components: 1) synthetic task reconstruction using model inversion, 2) meta-learning with task memory interpolation to prevent catastrophic forgetting, and 3) automatic model selection to filter untrustworthy models.
Result: Identifies two critical vulnerabilities (Task-Distribution Shift and Task-Distribution Corruption) and demonstrates framework’s effectiveness in mitigating these issues through synthetic task reconstruction and robust meta-learning.
Conclusion: The proposed trustworthy DFML framework addresses critical robustness vulnerabilities in data-free meta-learning, providing solutions for catastrophic forgetting and security threats from untrustworthy models.
Abstract: Data-Free Meta-Learning (DFML) aims to enable efficient learning of unseen few-shot tasks, by meta-learning from multiple pre-trained models without accessing their original training data. While existing DFML methods typically generate synthetic data from these models to perform meta-learning, a comprehensive analysis of DFML’s robustness-particularly its failure modes and vulnerability to potential attacks-remains notably absent. Such an analysis is crucial as algorithms often operate in complex and uncertain real-world environments. This paper fills this significant gap by systematically investigating the robustness of DFML, identifying two critical but previously overlooked vulnerabilities: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC). TDS refers to the sequential shifts in the evolving task distribution, leading to the catastrophic forgetting of previously learned meta-knowledge. TDC exposes a security flaw of DFML, revealing its susceptibility to attacks when the pre-trained model pool includes untrustworthy models that deceptively claim to be beneficial but are actually harmful. To mitigate these vulnerabilities, we propose a trustworthy DFML framework comprising three components: synthetic task reconstruction, meta-learning with task memory interpolation, and automatic model selection. Specifically, utilizing model inversion techniques, we reconstruct synthetic tasks from multiple pre-trained models to perform meta-learning. To prevent forgetting, we introduce a strategy to replay interpolated historical tasks to efficiently recall previous meta-knowledge. Furthermore, our framework seamlessly incorporates an automatic model selection mechanism to automatically filter out untrustworthy models during the meta-learning process. Code is available at https://github.com/Egg-Hu/Trustworthy-DFML.
[508] Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture
Sehun Kim
Main category: cs.LG
TL;DR: ECG-JEPA introduces a self-supervised learning approach for ECG analysis using masked modeling in latent space with specialized Cross-Pattern Attention for 12-lead ECG data.
Details
Motivation: ECG data provides valuable cardiac diagnostic information but suffers from scarcity of labeled data, making supervised learning challenging. Self-supervised learning offers a solution by learning from unlabeled data, but existing methods have limitations in handling ECG-specific challenges like noise and the limitations of naive reconstruction losses.Method: ECG-JEPA uses masked modeling in latent space rather than reconstructing raw signals, avoiding noise reproduction and L2 loss limitations. It introduces Cross-Pattern Attention (CroPA), a specialized masked attention mechanism for 12-lead ECG data. The model is trained on ~180,000 ECG samples from multiple open datasets.
Result: ECG-JEPA achieves state-of-the-art performance in various downstream tasks including diagnostic classification, feature extraction, and segmentation on ECG data.
Conclusion: Masked modeling in latent space is an effective self-supervised approach for ECG analysis, with ECG-JEPA demonstrating superior performance through its specialized architecture and training methodology.
Abstract: Electrocardiogram (ECG) captures the heart’s electrical signals, offering valuable information for diagnosing cardiac conditions. However, the scarcity of labeled data makes it challenging to fully leverage supervised learning in the medical domain. Self-supervised learning (SSL) offers a promising solution, enabling models to learn from unlabeled data and uncover meaningful patterns. In this paper, we show that masked modeling in the latent space can be a powerful alternative to existing self-supervised methods in the ECG domain. We introduce ECG-JEPA, an SSL model for 12-lead ECG analysis that learns semantic representations of ECG data by predicting in the hidden latent space, bypassing the need to reconstruct raw signals. This approach offers several advantages in the ECG domain: (1) it avoids producing unnecessary details, such as noise, which is common in ECG; and (2) it addresses the limitations of naive L2 loss between raw signals. Another key contribution is the introduction of Cross-Pattern Attention (CroPA), a specialized masked attention mechanism tailored for 12-lead ECG data. ECG-JEPA is trained on the union of several open ECG datasets, totaling approximately 180,000 samples, and achieves state-of-the-art performance in various downstream tasks including diagnostic classification, feature extraction, and segmentation. Our code is openly available at https://github.com/sehunfromdaegu/ECG_JEPA.
[509] Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models
Junhao Liu, Haonan Yu, Zhenyu Yan, Xin Zhang
Main category: cs.LG
TL;DR: Proposes a budget-friendly proxy framework using efficient models to approximate LLM decision boundaries for model-agnostic interpretability, enabling scalable post-hoc explanations at 11% of original cost.
Details
Motivation: Model-agnostic interpretability techniques for LLMs face prohibitive computational costs, making them impractical for real-world applications. There's a need to revitalize these tools by making them computationally feasible while maintaining reliability.Method: Develops a proxy framework using efficient models to approximate LLM decision boundaries, with a screen-and-apply mechanism to statistically verify local alignment before deployment. The approach reduces computational costs while maintaining explanation fidelity.
Result: Proxy explanations achieve over 90% fidelity with only 11% of the oracle’s computational cost. The framework demonstrates practical utility in prompt compression and poisoned example removal tasks.
Conclusion: The proposed framework transforms interpretability from a passive observation tool into a scalable primitive for LLM development, enabling practical application of model-agnostic interpretability techniques at significantly reduced computational cost.
Abstract: Post-hoc explanations provide transparency and are essential for guiding model optimization, such as prompt engineering and data sanitation. However, applying model-agnostic techniques to Large Language Models (LLMs) is hindered by prohibitive computational costs, rendering these tools dormant for real-world applications. To revitalize model-agnostic interpretability, we propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle’s cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal. Results show that reliable proxy explanations effectively guide optimization, transforming interpretability from a passive observation tool into a scalable primitive for LLM development. Additionally, we open-source code and datasets to facilitate future research.
[510] FIT-GNN: Faster Inference Time for GNNs that ‘FIT’ in Memory Using Coarsening
Shubhajit Roy, Hrriday Ruparel, Kishan Ved, Anirban Dasgupta
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze the paper content
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2410.15001: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.15001&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[511] Graph Defense Diffusion Model
Xin He, Wenqi Fan, Yili Wang, Chengyi Liu, Rui Miao, Xin Juan, Xin Wang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2501.11568: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.11568&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[512] Mamba-Based Graph Convolutional Networks: Tackling Over-smoothing with Selective State Space
Xin He, Yili Wang, Wenqi Fan, Xu Shen, Xin Juan, Rui Miao, Xin Wang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - API request was rate limited (HTTP 429)
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper
Abstract: Failed to fetch summary for 2501.15461: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.15461&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[513] Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
Josua Faller, Jörg Martin
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation: Cannot determine motivation as paper content is unavailable.Method: Cannot determine method as paper content is unavailable.
Result: Cannot determine results as paper content is unavailable.
Conclusion: Cannot determine conclusion as paper content is unavailable.
Abstract: Failed to fetch summary for 2502.02345: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.02345&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[514] Reducing Class Bias In Data-Balanced Datasets Through Hardness-Based Resampling
Pawel Pukowski, Venet Osmani
Main category: cs.LG
TL;DR: Unable to analyze paper 2504.07031 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot determine conclusion as abstract retrieval failed
Abstract: Failed to fetch summary for 2504.07031: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.07031&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[515] Task-agnostic Low-rank Residual Adaptation for Efficient Federated Continual Fine-Tuning
Feng Yu, Jia Hu, Geyong Min
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.12318: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12318&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[516] Large Reasoning Models Learn Better Alignment from Flawed Thinking
ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.00938: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00938&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[517] Batch Distillation Data for Developing Machine Learning Anomaly Detection Methods
Justus Arweiler, Indra Jungjohann, Aparna Muraleedharan, Heike Leitte, Jakob Burger, Kerstin Münnemann, Fabian Jirasek, Hans Hasse
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.18075: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18075&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[518] Predicting Metabolic Dysfunction-Associated Steatotic Liver Disease using Machine Learning Methods: A Retrospective Cohort Study
Mary E. An, Paul M. Griffin, Jonathan G. Stine, Balakrishnan S. Ramakrishna, Soundar R. T. Kumara
Main category: cs.LG
TL;DR: Developed MASER, an EHR-based prediction model for early detection of metabolic dysfunction-associated steatotic liver disease (MASLD) using LASSO logistic regression with fairness adjustments to reduce racial/ethnic disparities.
Details
Motivation: MASLD affects 30-40% of US adults and is the most common chronic liver disease, often asymptomatic but can progress to cirrhosis. Need for early detection in primary care settings using routinely collected EHR data.Method: Evaluated LASSO logistic regression, random forest, XGBoost, and neural network models using clinical feature subsets from large EHR database. Selected LASSO with top 10 features for interpretability. Applied equal opportunity postprocessing fairness adjustment to reduce disparities across racial/ethnic subgroups.
Result: Before fairness adjustment: AUROC 0.84, accuracy 78%, sensitivity 72%, specificity 79%, F1-score 0.617. After fairness adjustment: accuracy 81%, specificity 94%, sensitivity 41%, F1-score 0.515. Model achieved competitive performance with limited feature set and diverse population.
Conclusion: MASER demonstrates EHR-ready MASLD prediction with fairness adjustments, supporting early detection and potential integration into primary care workflows. Pending prospective validation for clinical implementation.
Abstract: Background: Metabolic dysfunction-associated steatotic liver disease (MASLD) affects 30-40% of US adults and is the most common chronic liver disease. Although often asymptomatic, progression can lead to cirrhosis. The objective of the study was to develop and evaluate an electronic health record (EHR) based prediction model to support early detection of MASLD in primary care settings. Methods: We evaluated LASSO logistic regression, random forest, XGBoost, and a neural network model for MASLD prediction using clinical feature subsets from a large EHR database, including the top 10 ranked features. To reduce disparities in true positive rates across racial and ethnic subgroups, we applied an equal opportunity postprocessing method in a prediction model called MASLD EHR Static Risk Prediction (MASER). Results: This retrospective cohort study included 59,492 participants in the training data, 24,198 in the validating data, and 25,188 in the testing data. The LASSO logistic regression model with the top 10 features was selected for its interpretability and comparable performance. Before fairness adjustment, the model achieved AUROC of 0.84, accuracy of 78%, sensitivity of 72%, specificity of 79%, and F1-score of 0.617. After equal opportunity postprocessing, accuracy modestly increased to 81% and specificity to 94%, while sensitivity decreased to 41% and F1-score to 0.515, reflecting the fairness trade-off. Conclusions: MASER achieved competitive performance for MASLD prediction, comparable to previously reported ensemble and tree-based models, while using a limited and routinely collected feature set and a diverse study population. The model is designed to support early detection and potential integration into primary care workflows. MASER demonstrates EHR-ready MASLD prediction with fairness adjustments, supporting future primary care implementation pending prospective validation.
[519] Dual Mamba for Node-Specific Representation Learning: Tackling Over-Smoothing with Selective State Space Modeling
Xin He, Yili Wang, Yiwei Dai, Xin Wang
Main category: cs.LG
TL;DR: Failed to fetch summary for paper 2511.06756 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.06756: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06756&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[520] MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment
Mohsen Amiri, Konstantin Avrachenkov, Ibtihal El Mimouni, Sindri Magnússon
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2511.09324: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09324&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[521] Boosting Brain-inspired Path Integration Efficiency via Learning-based Replication of Continuous Attractor Neurodynamics
Zhangyu Ge, Xu He, Lingfei Mo, Xiaolin Meng, Wenxuan Yin, Youdong Zhang, Lansong Jiang, Fengyuan Liu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.17687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[522] Adaptive Tuning of Parameterized Traffic Controllers via Multi-Agent Reinforcement Learning
Giray Önür, Azita Dabiri, Bart De Schutter
Main category: cs.LG
TL;DR: Unable to analyze paper 2512.07417 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2512.07417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[523] Imitation Learning for Combinatorial Optimisation under Uncertainty
Prakash Gawas, Antoine Legrain, Louis-Martin Rousseau
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper contentMethod: Cannot determine method without access to the paper content
Result: Cannot determine results without access to the paper content
Conclusion: Cannot determine conclusion without access to the paper content
Abstract: Failed to fetch summary for 2601.05383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[524] Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success
Luca Zhou, Bo Zhao, Rose Yu, Emanuele Rodolà
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.22285: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22285&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[525] dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.10603: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10603&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[526] TopoFlow: Topography-aware Pollutant Flow Learning for High-Resolution Air Quality Prediction
Ammar Kheder, Helmi Toropainen, Wenqing Peng, Samuel Antão, Jia Chen, Michael Boy, Zhi-Song Liu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.16821: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16821&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[527] Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2602.22812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[528] Implicit Bias in Deep Linear Discriminant Analysis
Jiawen Li
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.02622 appears to be from March 2026, suggesting it might be a future or hypothetical paper.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2603.02622: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02622&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[529] Training event-based neural networks with exact gradients via Differentiable ODE Solving in JAX
Lukas König, Manuel Kuhn, David Kappel, Anand Subramoney
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.08146 suggests it’s from March 2023, but without the abstract content, I cannot analyze its relevance to multimodal LLMs with audio/vision focus.
Details
Motivation: Cannot determine motivation without access to the paper abstract. The HTTP 429 error indicates the arXiv API rate limit has been exceeded.Method: Cannot determine method without access to the paper abstract. The paper ID format suggests it’s from March 2023 (2603 = year 2023, month 03).
Result: Cannot determine results without access to the paper abstract. The arXiv API request failed due to rate limiting.
Conclusion: Cannot draw conclusions about the paper’s content or relevance without access to the abstract. The arXiv API rate limit needs to be respected with appropriate delays between requests.
Abstract: Failed to fetch summary for 2603.08146: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08146&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[530] ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization
Foo Hui-Mean, Yuan-chin I Chang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.21180: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21180&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[531] Mechanisms of Introspective Awareness
Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.21396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[532] Improving Model Performance by Adapting the KGE Metric to Account for System Non-Stationarity
M Jawad, HV Gupta, YH Wang, MA Farmani, A Behrangi, GY Niu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to draw conclusions due to missing paper content
Abstract: Failed to fetch summary for 2604.03906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[533] Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
Ximing Xing, Ziteng Xue, Zhenxi Li, Weicong Liang, Linqing Wang, Zhantao Yang, Tiankai Hang, Zijin Yin, Qinglin Lu, Chunyu Wang, Qian Yu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.05072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[534] Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
Dipan Maity, Suman Mondal, Arindam Roy
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2604.06014: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06014&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[535] Conformal Margin Risk Minimization: An Envelope Framework for Robust Learning under Label Noise
Yuanjie Shi, Peihong Li, Zijian Zhang, Janardhan Rao Doppa, Yan Yan
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.06468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[536] BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
Rui Dong, Zitong Wang, Jiaxing Li, Weihuang Zheng, Youyong Kong
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze the paper content
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2604.07361: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07361&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[537] Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training
Constantin Le Cleï, Nils Thuerey, Xiaoxiang Zhu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.08357: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08357&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[538] Balancing User Preferences by Social Networks: A Condition-Guided Social Recommendation Model for Mitigating Popularity Bias
Xin He, Wenqi Fan, Ruobing Wang, Yili Wang, Ying Wang, Shirui Pan, Xin Wang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2405.16772: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.16772&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[539] Automatic Self-supervised Learning for Social Recommendations
Xin He, Wenqi Fan, Mingchen Sun, Ying Wang, Xin Wang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). No abstract available for analysis.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error from arXiv API.Method: Cannot determine method as paper content is unavailable due to HTTP 429 error from arXiv API.
Result: Cannot determine results as paper content is unavailable due to HTTP 429 error from arXiv API.
Conclusion: Cannot determine conclusion as paper content is unavailable due to HTTP 429 error from arXiv API.
Abstract: Failed to fetch summary for 2412.18735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.18735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[540] Conformal Prediction in Hierarchical Classification with Constrained Representation Complexity
Thomas Mortier, Alireza Javanmardi, Yusuf Sale, Eyke Hüllermeier, Willem Waegeman
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2501.19038: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.19038&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[541] Universal Approximation with XL MIMO Systems: OTA Classification via Trainable Analog Combining
Kyriakos Stylianopoulos, George C. Alexandropoulos
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2504.12758: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.12758&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[542] SPP-SBL: Space-Power Prior Sparse Bayesian Learning for Block Sparse Recovery
Yanhao Zhang, Zhihan Zhu, Yong Xia
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper informationMethod: Unable to determine method due to technical error in fetching paper information
Result: Unable to determine results due to technical error in fetching paper information
Conclusion: Unable to draw conclusions due to technical error in fetching paper information
Abstract: Failed to fetch summary for 2505.08518: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.08518&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[543] GL-LowPopArt: A Nearly Instance-Wise Minimax-Optimal Estimator for Generalized Low-Rank Trace Regression
Junghyun Lee, Kyoungseok Jang, Kwang-Sung Jun, Milan Vojnović, Se-Young Yun
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2506.03074: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.03074&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[544] Learning Encodings by Maximizing State Distinguishability: Variational Quantum Error Correction
Nico Meyer, Christopher Mutschler, Andreas Maier, Daniel D. Scherer
Main category: cs.LG
TL;DR: Unable to analyze paper 2506.11552 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2506.11552: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.11552&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[545] Neural Two-Stage Stochastic Optimization for Solving Unit Commitment Problem
Zhentong Shao, Jingtao Qin, Nanpeng Yu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2507.09503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.09503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[546] Beyond Spherical geometry: Unraveling complex features of objects orbiting around stars from its transit light curve using deep learning
Ushasi Bhowmick, Shivam Kumaran
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2509.14875: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.14875&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[547] Contribution of task-irrelevant stimuli to drift of neural representations
Farhad Pashakhanloo
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2510.21588: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21588&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[548] Scaling flow-based approaches for topology sampling in $\mathrm{SU}(3)$ gauge theory
Claudio Bonanno, Andrea Bulgarelli, Elia Cellini, Alessandro Nada, Dario Panfalone, Davide Vadacchino, Lorenzo Verzichelli
Main category: cs.LG
TL;DR: Paper 2510.25704: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2510.25704: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25704&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[549] Differentially Private and Federated Structure Learning in Bayesian Networks
Ghita Fassy El Fehri, Aurélien Bellet, Philippe Bastien
Main category: cs.LG
TL;DR: Paper ID 2512.01708 could not be fetched due to HTTP 429 error (rate limiting), so content analysis is not possible
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limitingMethod: Method unknown - paper content unavailable due to HTTP 429 error from arXiv API
Result: No results available - paper summary could not be fetched
Conclusion: Cannot draw conclusions about paper content due to technical limitations in accessing the information
Abstract: Failed to fetch summary for 2512.01708: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01708&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[550] Sample Complexity of Composite Quantum Hypothesis Testing
Jacob Paul Simpson, Efstratios Palias, Sharu Theresa Jose
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to technical access issues
Abstract: Failed to fetch summary for 2601.08588: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08588&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[551] Distribution-free two-sample testing with blurred total variation distance
Rohan Hore, Rina Foygel Barber
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.05862 cannot be analyzed without access to its abstract or content.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2602.05862: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05862&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[552] Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits
Daniel Zantedeschi, Kumar Muthuraman
Main category: cs.LG
TL;DR: Paper 2603.02417: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2603.02417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[553] Post-Selection Distributional Model Evaluation
Amirmohammad Farzaneh, Osvaldo Simeone
Main category: cs.LG
TL;DR: Paper 2603.23055: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2603.23055: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23055&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[554] Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents
Huangwei Chen, Wu Li, Junhao Jia, Yining Chen, Xiaotao Pang, Ya-Long Chen, Li Gonghui, Haishuai Wang, Jiajun Bu, Lei Wu
Main category: cs.MA
TL;DR: Aegle: A synchronous virtual multi-disciplinary team framework for outpatient consultations using graph-based multi-agent architecture to improve clinical decision-making and documentation quality.
Details
Motivation: Single-physician outpatient consultations are prone to cognitive biases and incomplete evidence capture due to time pressure. While Multi-Disciplinary Teams (MDTs) reduce these risks, they are costly and difficult to scale for real-time intake.Method: A graph-based multi-agent architecture that formalizes consultation state using structured SOAP representation, separating evidence collection from diagnostic reasoning. An orchestrator dynamically activates specialist agents for parallel reasoning, with results integrated by an aggregator into coherent clinical notes.
Result: Experiments on ClinicalBench and real-world RAPID-IPN dataset across 24 departments and 53 metrics show Aegle consistently outperforms state-of-the-art proprietary and open-source models in documentation quality and consultation capability, while improving final diagnosis accuracy.
Conclusion: Aegle successfully brings MDT-level reasoning to outpatient consultations through a scalable virtual framework, demonstrating superior performance in clinical documentation and diagnostic accuracy compared to existing models.
Abstract: The initial outpatient consultation is critical for clinical decision-making, yet it is often conducted by a single physician under time pressure, making it prone to cognitive biases and incomplete evidence capture. Although the Multi-Disciplinary Team (MDT) reduces these risks, they are costly and difficult to scale to real-time intake. We propose Aegle, a synchronous virtual MDT framework that brings MDT-level reasoning to outpatient consultations via a graph-based multi-agent architecture. Aegle formalizes the consultation state using a structured SOAP representation, separating evidence collection from diagnostic reasoning to improve traceability and bias control. An orchestrator dynamically activates specialist agents, which perform decoupled parallel reasoning and are subsequently integrated by an aggregator into a coherent clinical note. Experiments on ClinicalBench and a real-world RAPID-IPN dataset across 24 departments and 53 metrics show that Aegle consistently outperforms state-of-the-art proprietary and open-source models in documentation quality and consultation capability, while also improving final diagnosis accuracy. Our code is available at https://github.com/HovChen/Aegle.
[555] Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
Keyu Li, Jin Gao, Dequan Wang
Main category: cs.MA
TL;DR: Multi-agent systems amplify bias through structural echo chambers rather than diluting it, with sophisticated architectures worsening prejudice despite individual agent neutrality.
Details
Motivation: Multi-agent systems are increasingly used for complex workflows but their emergent properties like bias accumulation remain poorly understood. Real-world MAS are too complex to analyze fully, so foundational mechanics need isolation to evaluate ethical robustness.Method: Introduce Discrim-Eval-Open benchmark that bypasses individual model neutrality through forced comparative judgments across demographic groups. Analyze bias cascades across various MAS topologies and feedback loops, studying foundational dynamics by stripping away advanced swarm complexity.
Result: Structural sophistication frequently exacerbates bias rather than mitigating it. Systemic amplification occurs even when isolated agents operate neutrally, and ‘Trigger Vulnerability’ is identified where injecting objective context drastically accelerates polarization.
Conclusion: Structural complexity does not guarantee ethical robustness in multi-agent systems. Basic MAS topologies and feedback loops can amplify minor stochastic biases into systemic polarization, acting as echo chambers.
Abstract: While Multi-Agent Systems (MAS) are increasingly deployed for complex workflows, their emergent properties-particularly the accumulation of bias-remain poorly understood. Because real-world MAS are too complex to analyze entirely, evaluating their ethical robustness requires first isolating their foundational mechanics. In this work, we conduct a baseline empirical study investigating how basic MAS topologies and feedback loops influence prejudice. Contrary to the assumption that multi-agent collaboration naturally dilutes bias, we hypothesize that structured workflows act as echo chambers, amplifying minor stochastic biases into systemic polarization. To evaluate this, we introduce Discrim-Eval-Open, an open-ended benchmark that bypasses individual model neutrality through forced comparative judgments across demographic groups. Analyzing bias cascades across various structures reveals that architectural sophistication frequently exacerbates bias rather than mitigating it. We observe systemic amplification even when isolated agents operate neutrally, and identify a ‘Trigger Vulnerability’ where injecting purely objective context drastically accelerates polarization. By stripping away advanced swarm complexity to study foundational dynamics, we establish a crucial baseline: structural complexity does not guarantee ethical robustness. Our code is available at https://github.com/weizhihao1/MAS-Bias.
[556] Multi-agent Reinforcement Learning for Low-Carbon P2P Energy Trading among Self-Interested Microgrids
Junhao Ren, Honglin Gao, Lan Zhao, Qiyu Kang, Gaoxi Xiao, Yajuan Sun
Main category: cs.MA
TL;DR: Multi-agent reinforcement learning framework for microgrids to optimize P2P electricity trading, improving renewable utilization and reducing carbon emissions while increasing economic welfare.
Details
Motivation: Address uncertainties in renewable generation and demand dynamics in day-ahead scheduling to enhance renewable penetration and maintain intra-day balance in electricity markets.Method: Developed a multi-agent reinforcement learning framework where self-interested microgrids independently bid price and quantity in P2P electricity trading, optimizing profits via storage arbitrage under time-varying main-grid prices with a market-clearing mechanism for coordination.
Result: The learned bidding policy improves renewable utilization, reduces reliance on high-carbon electricity, and increases community-level economic welfare, creating a win-win situation for emission reduction and local prosperity.
Conclusion: The proposed multi-agent reinforcement learning framework effectively addresses renewable energy integration challenges in electricity markets through coordinated P2P trading, balancing economic and environmental objectives.
Abstract: Uncertainties in renewable generation and demand dynamics challenge day-ahead scheduling. To enhance renewable penetration and maintain intra-day balance, we develop a multi-agent reinforcement learning framework for self-interested microgrids participating in peer-to-peer (P2P) electricity trading. Each microgrid independently bids both price and quantity while optimizing its own profit via storage arbitrage under time-varying main-grid prices. A market-clearing mechanism coordinating trades and promoting incentive compatibility is proposed. Simulation results show that the learned bidding policy improves renewable utilization and reduces reliance on high-carbon electricity, while increasing community-level economic welfare, delivering a win-win situation in emission reduction and local prosperity.
[557] Social Reality Construction via Active Inference: Modeling the Dialectic of Conformity and Creativity
Kentaro Nomura, Takato Horii
Main category: cs.MA
TL;DR: A multi-agent active inference model shows how social agents both conform to and creatively reshape collective norms through local interactions on structured networks, leading to emergent social reality formation.
Details
Motivation: Current computational models fail to capture the bidirectional process where social agents both internalize collective norms and creatively reshape them. There's a need for a unified framework that formalizes the dialectical constitution of social reality.Method: Proposes a multi-agent simulation model grounded in active inference on structured social networks. Each agent maintains an internal generative model, communicates with neighbors to form social priors, creates novel observations, and selectively incorporates others’ creations into memory.
Result: Three main findings: 1) Informationally cohesive social groups emerge endogenously with representational alignment mirroring network cluster topology. 2) Circular mutual constitution arises between social representations and observation distribution through creative acts. 3) Propagation of creations shows selective, heterogeneous patterns distinct from stable diffusion of social representations.
Conclusion: The interplay between social conformity and creative deviation can give rise to the endogenous formation and differentiation of shared social reality through local interaction dynamics, suggesting agents construct cultural niches.
Abstract: Social agents both internalize collective norms and reshape them through creative action, yet computational models have not captured this bidirectional process within a unified framework. We propose a multi-agent simulation model grounded in active inference that formalizes the dialectical constitution of social reality on a structured social network. Each agent maintains an internal generative model, communicates with neighbors to form social priors, creates novel observations, and selectively incorporates others’ creations into memory. Simulation experiments demonstrate three main findings. First, informationally cohesive social groups emerge endogenously, with representational alignment mirroring the cluster topology of the underlying network. Second, a circular mutual constitution arises between social representations and the observation distribution, maintained through agents’ creative acts that project representational structure onto the external world. Third, the propagation of creations exhibits selective, heterogeneous patterns distinct from the stable diffusion of social representations, indicating that agents construct cultural niches through local interaction dynamics. These results suggest that the interplay between social conformity and creative deviation can give rise to the endogenous formation and differentiation of shared social reality.
[558] Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
Wen Qiu, Zhiqiang He, Wei Zhao, Hiroshi Masui
Main category: cs.MA
TL;DR: PE-MAMoE: A multi-agent RL framework for UAV base stations that maintains policy plasticity during non-stationary environment shifts using mixture of experts and phase-controlled perturbations.
Details
Motivation: UAVs as aerial base stations face non-stationary challenges due to changing user mobility and traffic demands, causing deep RL policies to suffer from plasticity loss (representation collapse and neuron dormancy) that impairs adaptation.Method: PE-MAMoE uses centralized training with decentralized execution built on multi-agent PPO. Each UAV has a sparsely gated mixture of experts actor with single-expert selection per step. A Phase Controller injects expert-only stochastic perturbations after phase switches, resets action log-standard-deviation, anneals entropy/learning rate, and schedules router temperature to re-plasticize policies without destabilizing safe behaviors.
Result: In phase-driven simulator with mobile users and 3GPP-style channels: 26.3% improvement in normalized interquartile mean return over best baseline, 12.8% increase in served-user capacity, ~75% reduction in collisions. Diagnostics show higher expert feature rank and periodic dormant-neuron recovery at regime switches.
Conclusion: PE-MAMoE effectively maintains policy plasticity in non-stationary UAV communication environments, enabling better adaptation to changing conditions while improving performance metrics and safety.
Abstract: Unmanned aerial vehicles serving as aerial base stations can rapidly restore connectivity after disasters, yet abrupt changes in user mobility and traffic demands shift the quality of service trade-offs and induce strong non-stationarity. Deep reinforcement learning policies suffer from plasticity loss under such shifts, as representation collapse and neuron dormancy impair adaptation. We propose plasticity enhanced multi-agent mixture of experts (PE-MAMoE), a centralized training with decentralized execution framework built on multi-agent proximal policy optimization. PE-MAMoE equips each UAV with a sparsely gated mixture of experts actor whose router selects a single specialist per step. A non-parametric Phase Controller injects brief, expert-only stochastic perturbations after phase switches, resets the action log-standard-deviation, anneals entropy and learning rate, and schedules the router temperature, all to re-plasticize the policy without destabilizing safe behaviors. We derive a dynamic regret bound showing the tracking error scales with both environment variation and cumulative noise energy. In a phase-driven simulator with mobile users and 3GPP-style channels, PE-MAMoE improves normalized interquartile mean return by 26.3% over the best baseline, increases served-user capacity by 12.8%, and reduces collisions by approximately 75%. Diagnostics confirm persistently higher expert feature rank and periodic dormant-neuron recovery at regime switches.
[559] Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence
Amit Sinha, Matthieu Geist, Aditya Mahajan
Main category: cs.MA
TL;DR: A polynomial-time algorithm for Dec-POMDPs with memory-constrained agents using iterated best response with risk-seeking incentives to find locally optimal policies.
Details
Motivation: Dec-POMDPs are computationally hard (NEXP-complete), and optimal policies require full history. Practical applications need compact policies due to limited compute capabilities, requiring memory-constrained solutions.Method: Iterated best response algorithm with polynomial runtime, using modified objective that incentivizes risk-seeking alongside conservative policy iteration updates for better local optima.
Result: Achieves near-optimal performance on benchmark Dec-POMDPs comparable to state-of-the-art approaches, with polynomial runtime despite memory constraints. More agent states (larger memory) leads to better performance.
Conclusion: Provides novel approach for incorporating memory constraints in Dec-POMDP problems, offering practical polynomial-time solutions with good performance despite limited agent memory.
Abstract: Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are policies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.
[560] Bayesian Ego-graph Inference for Networked Multi-Agent Reinforcement Learning
Wei Duan, Jie Lu, Junyu Xuan
Main category: cs.MA
TL;DR: BayesG: A decentralized MARL framework that learns dynamic communication graphs via Bayesian variational inference for scalable multi-agent coordination under local observability.
Details
Motivation: Existing Networked-MARL methods assume static communication graphs, limiting adaptability to dynamic environments. Centralized approaches that learn dynamic graphs require global state access, which is impractical for real-world decentralized systems.Method: Proposes a stochastic graph-based policy where agents condition decisions on sampled subgraphs of local neighborhoods. Introduces BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples latent communication masks to guide message passing and policy computation.
Result: BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.
Conclusion: BayesG enables decentralized agents to jointly learn both interaction topology and decision-making strategies through end-to-end variational inference, providing a scalable solution for networked multi-agent systems with local observability constraints.
Abstract: In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.
[561] On the Uncertainty of Large Language Model-Based Multi-Agent Systems
Yuxuan Zhao, Sijia Chen, Ningxin Su
Main category: cs.MA
TL;DR: Analyzes multi-agent LLM systems through uncertainty/entropy lens, finding single agents often outperform MAS, and proposes Entropy Judger algorithm for solution selection.
Details
Motivation: The mechanisms behind why multi-agent systems (MAS) built on publicly available LLMs succeed or fail remain largely unexplored, despite MAS being a prominent paradigm for tackling complex tasks with LLMs.Method: Investigates MAS through uncertainty perspective, analyzing entropy transitions during problem-solving across various topologies and six benchmark tasks. Examines 245 features spanning token-, trajectory-, and round-level entropy to understand intra- and inter-agent dynamics.
Result: Counterintuitively finds single agents outperform MAS in ~43.3% of cases, uncertainty dynamics largely determined in first interaction round. Identifies three key observations: Certainty Preference, Base Uncertainty, and Task Awareness. Proposes Entropy Judger algorithm for solution selection.
Conclusion: Provides empirical insights into MAS effectiveness through uncertainty analysis, demonstrating that simple entropy-based solution selection (Entropy Judger) consistently improves MAS accuracy across configurations and tasks.
Abstract: Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token-, trajectory-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem-solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS’s pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.
[562] SPEAR: An Engineering Case Study of Multi-Agent Coordination for Smart Contract Auditing
Indraveni Chebolu, Arnab Mallick, Harmesh Rana
Main category: cs.MA
TL;DR: SPEAR is a multi-agent coordination framework for smart contract auditing that uses specialized agents (planning, execution, repair) with established MAS patterns, coordination protocols, and autonomous recovery capabilities.
Details
Motivation: Smart contract auditing is complex and requires coordination of multiple specialized tasks. Current approaches lack robust coordination mechanisms and autonomous recovery from failures in generated artifacts.Method: Multi-agent system with specialized agents: Planning Agent (risk-aware heuristics), Execution Agent (Contract Net protocol for task allocation), Repair Agent (programmatic-first repair policy). Agents use AGM-compliant belief revision, negotiation/auction protocols, and dynamic plan revision.
Result: Empirical study compares multi-agent design with centralized and pipeline-based alternatives under controlled failure scenarios, evaluating coordination, recovery behavior, and resource use.
Conclusion: SPEAR demonstrates effective multi-agent coordination for smart contract auditing with autonomous recovery capabilities, outperforming centralized and pipeline alternatives in failure scenarios.
Abstract: We present SPEAR, a multi-agent coordination framework for smart contract auditing that applies established MAS patterns in a realistic security analysis workflow. SPEAR models auditing as a coordinated mission carried out by specialized agents: a Planning Agent prioritizes contracts using risk-aware heuristics, an Execution Agent allocates tasks via the Contract Net protocol, and a Repair Agent autonomously recovers from brittle generated artifacts using a programmatic-first repair policy. Agents maintain local beliefs updated through AGM-compliant revision, coordinate via negotiation and auction protocols, and revise plans as new information becomes available. An empirical study compares the multi-agent design with centralized and pipeline-based alternatives under controlled failure scenarios, focusing on coordination, recovery behavior, and resource use.
cs.MM
[563] QoS-QoE Translation with Large Language Model
Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia, Lingzhi Zhao, Kaizhuo Yan, Klara Nahrstedt
Main category: cs.MM
TL;DR: A dataset and benchmark for QoS-QoE translation in multimedia systems, with evaluation of LLMs on bidirectional quality prediction tasks.
Details
Motivation: QoS-QoE translation is fundamental for understanding how system conditions affect user experience, but existing research is scattered and not systematically reusable. There's a need for structured datasets to enable systematic analysis and LLM-based reasoning.Method: Created QoS-QoE Translation dataset through automated pipeline combining paper curation, relationship extraction, and iterative evaluation. Evaluated LLMs on bidirectional translation tasks (QoS→QoE and QoE→QoS) before and after supervised fine-tuning.
Result: LLMs show strong performance on both continuous-value and discrete-label prediction in bidirectional QoS-QoE translation after fine-tuning on the dataset. The dataset provides foundation for benchmarking and future LLM-based multimedia quality prediction.
Conclusion: The dataset enables systematic reuse, cross-scenario generalization, and large-scale analysis of QoS-QoE relationships, supporting LLM-based reasoning for multimedia quality optimization.
Abstract: QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross-scenario generalization, and large-scale analysis. To address this gap, we first introduce QoS-QoE Translation dataset, a source-grounded dataset of structured QoS-QoE relationships from the multimedia literature, with a focus on video streaming related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS-QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS-QoE translation, both before and after supervised fine-tuning on our dataset, and show strong performance on both continuous-value and discrete-label prediction in bidirectional translation, from QoS-QoE and QoE-QoS. Our dataset provides a foundation for benchmarking LLMs in QoS-QoE translation and for supporting future LLM-based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at https://yyu6969.github.io/qos-qoe-translation-page/, for full reproducibility and open access.
[564] Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes
Zihe Wei, Yuezun Li
Main category: cs.MM
TL;DR: AVPF method enhances video deepfake detection generalizability by training with self-generated audio-visual pseudo-fakes created solely from authentic samples, improving performance by up to 7.4% across datasets.
Details
Motivation: Existing video deepfake detection methods struggle with real-world scenarios due to limited diversity in training datasets, which restricts their generalizability to unseen cases. The need for more robust detection methods that can handle diverse audio-visual correspondence patterns in real-world deepfakes.Method: Proposes AVPF (Audio-Visual Pseudo-Fakes) method that creates pseudo-fake training samples with diverse audio-visual correspondence patterns from authentic samples only. The method generates training data solely from authentic samples without requiring any real deepfakes, enhancing model generalizability through diverse pseudo-fake patterns.
Result: Extensive experiments on multiple standard datasets demonstrate strong generalizability, achieving average performance improvement of up to 7.4%. The method shows effectiveness in handling diverse real-world scenarios where existing methods degrade.
Conclusion: AVPF provides a simple yet effective approach to enhance video deepfake detection generalizability by training with self-generated audio-visual pseudo-fakes, addressing the limitations of existing methods in real-world scenarios without requiring real deepfake data.
Abstract: Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes.The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes.Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.
[565] 2D or 3D: Who Governs Salience in VLA Models? – Tri-Stage Token Pruning Framework with Modality Salience Awareness
Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li, Ziyun Zhang, Hong Gao, Yuchen Huang, Yutong Xu, Guojie Luo, Xiang Chen
Main category: cs.MM
TL;DR: A tri-stage token pruning framework for multi-visual-modal VLA models that optimizes 2D/3D token selection based on modality salience analysis to achieve significant inference speedup with minimal accuracy loss.
Details
Motivation: MVLA models with 2D+3D inputs face increased computational demands due to more input tokens, but existing token pruning methods are designed for 2D-only models and ignore modality salience differences between 2D and 3D data.Method: Developed tri-stage analysis to capture 2D/3D modality salience discrepancy and dynamics, then proposed corresponding tri-stage token pruning framework for optimal 2D/3D token selection and efficient pruning.
Result: Achieves up to 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead.
Conclusion: The proposed framework effectively addresses the acceleration demands of MVLA models by considering modality-specific salience differences in token pruning.
Abstract: Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization methods tailored to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead. Our Code is coming soon.
[566] Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun
Main category: cs.MM
TL;DR: Vision-language models can better simulate user behavior in recommender systems by aligning their visual attention patterns with individual user gaze data, improving click prediction accuracy.
Details
Motivation: Current LLM agents for recommender system evaluation use text/metadata but miss visual interface perception, which is critical since attention over recommendation layouts is visually driven and personalized. Real users browse visual interfaces, not just text.Method: FixATE (Fixation-Aligned Tuning for user Emulation): 1) Analyze real-world eye-tracking data showing stable individual gaze patterns predictive of clicks, 2) Probe VLM’s internal visual attention via interpretability operators to get slot-level relevance distributions comparable to human fixation, 3) Learn personalized soft prompts to steer model attention toward each user’s characteristic fixation pattern.
Result: Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones show consistent improvements in both attention alignment and click prediction accuracy.
Conclusion: Making models “see like the user” is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces, bridging the gap between text-based simulation and real visual interaction.
Abstract: Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse-a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model’s (VLM’s) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM’s internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model’s attention toward each user’s characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model “see like the user” is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
eess.AS
[567] Enhancing Conversational TTS with Cascaded Prompting and ICL-Based Online Reinforcement Learning
Zhicheng Ouyang, Seong-Gyun Leem, Bach Viet Do, Haibin Wu, Ariya Rastrow, Yuzong Liu, Florian Metze
Main category: eess.AS
TL;DR: A cascaded TTS framework using textual style tokens with audio prompts enables single-shot adaptation to fine-grained speaking styles and character voices, enhanced by ICL-based online RL with aesthetic rewards and CTC alignment.
Details
Motivation: Current conversational AI struggles with generating expressive and controllable TTS, particularly for fine-grained voice styles and emotions, which typically requires massive annotated training data. There's a need for scalable, data-efficient approaches to overcome this data bottleneck.Method: A cascaded framework pairs textual style tokens with human-curated audio prompts for single-shot adaptation. Uses In-Context Learning (ICL) to guide prosody and timbre without massive parameter updates. Introduces ICL-based online reinforcement learning with subjective aesthetic rewards, constrained by CTC alignment to preserve intelligibility.
Result: Comprehensive human perception evaluations demonstrate significant improvements in both naturalness and expressivity of synthesized speech, establishing the efficacy of the ICL-based online RL approach.
Conclusion: The proposed framework successfully addresses the data bottleneck in expressive TTS generation, enabling fine-grained style control with data-efficient single-shot adaptation through audio prompting and reinforcement learning optimization.
Abstract: Conversational AI has made significant progress, yet generating expressive and controllable text-to-speech (TTS) remains challenging. Specifically, controlling fine-grained voice styles and emotions is notoriously difficult and typically requires massive amounts of heavily annotated training data. To overcome this data bottleneck, we present a scalable, data-efficient cascaded framework that pairs textual style tokens with human-curated, high-quality audio prompts. This approach enables single-shot adaptation to fine-grained speaking styles and character voices. In the context of TTS, this audio prompting acts as In-Context Learning (ICL), guiding the model’s prosody and timbre without requiring massive parameter updates or large-scale retraining. To further enhance generation quality and mitigate hallucinations, we introduce a novel ICL-based online reinforcement learning (RL) strategy. This strategy directly optimizes the autoregressive prosody model using subjective aesthetic rewards while being constrained by Connectionist Temporal Classification (CTC) alignment to preserve intelligibility. Comprehensive human perception evaluations demonstrate significant improvements in both the naturalness and expressivity of the synthesized speech, establishing the efficacy of our ICL-based online RL approach.
[568] PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Gu, Do Hyun Lee, Hong Kook Kim
Main category: eess.AS
TL;DR: A synchronization method for automated dubbing that paraphrases translated text to achieve isochrony (duration matching) and phonetic synchronization (lip-sync preservation) using language models and dynamic time warping with vowel distance metrics.
Details
Motivation: Automated dubbing faces synchronization challenges in duration and lip-sync that are crucial for viewer experience. Current methods struggle to preserve both timing constraints and phonetic alignment between source and target speech.Method: Two-step approach: 1) Isochrony using language model paraphrasing to match target speech duration to source; 2) Phonetic synchronization using DTW with vowel distance metrics from training data. Extended to PS-Comet which jointly considers semantic and phonetic similarity.
Result: PS-TTS and PS-Comet TTS outperform TTS without PS on objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. PS-Comet performed best across all tested language pairs (Korean, English, French), balancing lip-sync accuracy with semantic preservation.
Conclusion: The proposed synchronization methods effectively address automated dubbing challenges, with PS-Comet achieving the best balance between accurate lip-sync and semantic preservation across multiple language pairs.
Abstract: Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.
[569] Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
Ziwei Li, Lukuang Dong, Saierdaer Yusuyin, Xianyu Zhao, Zhijian Ou
Main category: eess.AS
TL;DR: Phoneme-based interfaces for connecting speech encoders to LLMs outperform vanilla projector methods, especially in low-resource settings, with BPE-phoneme hybrids providing additional gains.
Details
Motivation: To improve ASR performance and data efficiency by comparing different speech-language interfaces (phoneme-based vs. projector-based) for integrating pretrained speech encoders with LLMs.Method: Compared phoneme-based and vanilla projector-based interfaces using same encoder/LLM backbones; proposed BPE-phoneme interface grouping frequent local phoneme patterns while preserving word-boundary cues; tested on high-resource English (LibriSpeech) and low-resource Tatar.
Result: Phoneme-based interface competitive with vanilla projector on LibriSpeech; BPE-phoneme yields further gains; phoneme-based substantially outperforms vanilla projector on Tatar; phoneme supervision creates stronger hybrid interface than vanilla projector.
Conclusion: Phoneme-based interfaces are effective for speech-LLM integration, especially valuable for low-resource languages, with BPE-phoneme hybrids offering optimal performance.
Abstract: Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an alternative is to expose discrete phoneme sequences to the LLM. Using the same encoder and LLM backbones, we compare phoneme-based and vanilla projector-based interfaces in high-resource English and low-resource Tatar. We also propose a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues for phoneme-to-grapheme generation. On LibriSpeech, the phoneme-based interface is competitive with the vanilla projector, and the BPE-phoneme interface yields further gains. On Tatar, the phoneme-based interface substantially outperforms the vanilla projector. We further find that phoneme supervision yields a phoneme-informed hybrid interface that is stronger than the vanilla projector.
[570] Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
Pengbo Lyu, Xiangyu Zhao, Chengwei Liu, Haoyin Yan, Xiaotao Liang, Hongyu Wang, Shaofei Xue
Main category: eess.AS
TL;DR: A generative framework for multi-track music source separation using conditional discrete token generation with Conformer encoder, neural audio codec, and decoder-only language model.
Details
Motivation: To reformulate music source separation as a generative task using discrete token generation rather than conventional discriminative approaches that estimate continuous signals, aiming to improve perceptual quality.Method: Combines Conformer-based conditional encoder, dual-path neural audio codec (HCodec), and decoder-only language model to autoregressively generate audio tokens for four target tracks, which are then decoded back to waveforms.
Result: Achieves perceptual quality approaching state-of-the-art discriminative methods on MUSDB18-HQ benchmark, with highest NISQA score on vocals track. Ablation studies confirm effectiveness of learnable Conformer encoder and sequential cross-track generation.
Conclusion: The generative approach to music source separation using discrete token generation is effective and competitive with discriminative methods, particularly for perceptual quality.
Abstract: We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate audio tokens for four target tracks. The generated tokens are decoded back to waveforms through the codec decoder. Evaluation on the MUSDB18-HQ benchmark shows that our generative approach achieves perceptual quality approaching state-of-the-art discriminative methods, while attaining the highest NISQA score on the vocals track. Ablation studies confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.
[571] Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts
Valentin Pelloin, Lina Bekkali, Reda Dehak, David Doukhan
Main category: eess.AS
TL;DR: Audio SSL models pretrained on diverse TV/radio broadcast content (not just clean speech) show improved performance on multiple downstream tasks and highlight data memorization risks.
Details
Motivation: Most audio self-supervised learning models are trained on clean segmented speech (like LibriSpeech), limiting their ability to handle diverse real-world audio content. The paper investigates how pretraining datasets impact SSL model performance and explores training on more diverse audio sources.Method: Built a large pretraining corpus from diverse TV and radio broadcast audio, annotated with automatic tools. Created smaller subsets from this corpus to train audio SSL models. Evaluated models on multiple downstream tasks including ASR, voice activity detection, music detection, and speaker recognition. Also performed membership inference attacks to assess data memorization.
Result: Models pretrained on diverse audio content showed strong performance across multiple downstream tasks, demonstrating the value of training on varied audio beyond just clean speech. Membership inference attacks revealed data memorization issues, highlighting the importance of data deduplication.
Conclusion: Pretraining SSL models on diverse audio content (not restricted to speech) improves performance on various downstream tasks and could bridge speech and music ML communities. Data deduplication is crucial to prevent memorization issues.
Abstract: Audio and speech self-supervised encoder models are now widely used for a lot of different tasks. Many of these models are often trained on clean segmented speech content such as LibriSpeech. In this paper, we look into how the pretraining datasets of such SSL (Self-Supervised Learning) models impact their downstream results. We build a large pretraining corpus of highly diverse TV and Radio broadcast audio content, which we describe with automatic tools. We use these annotations to build smaller subsets, which we use to train audio SSL models. Then, we evaluate the models on multiple downstream tasks such as automatic speech recognition, voice activity and music detection, or speaker recognition. The results show the potential of pretraining SSL models on diverse audio content without restricting it to speech. We also perform a membership inference attack to evaluate the encoder ability to memorize their training datasets, which highlight the importance of data deduplication. This unified training could bridge speech and music machine learning communities.
[572] Is ASMR Engineerable? A Signal Processing and User Experience Study
Zexin Fang, Bin Han, Henrik H. Sveen, C. Clark Cao, Hans D. Schotten
Main category: eess.AS
TL;DR: ASMR effects can be systematically induced through controlled acoustic design using cyclic patterns with varying predictability and randomness, with smoothly spread energy-dense patterns being most effective.
Details
Motivation: Despite ASMR's popularity, it remains unclear whether its effects can be deliberately engineered. While behavioral and neuro-physiological studies validate ASMR effects, the acoustic mechanisms that trigger it remain poorly understood.Method: Design cyclic sound patterns with varying predictability and randomness, evaluate effects via structured user study, use signal processing-based feature extraction and regression analysis to establish interpretable mapping between acoustic structure and perceived ASMR effects.
Result: Relaxing effects accumulate progressively, are independent of spatial orientation, and remain stable across time. Smoothly spread, energy-dense cyclic patterns most effectively trigger ASMR.
Conclusion: Signal-level engineering of ASMR experiences is achievable, with cyclic patterns where predictability drives relaxation and variation sustains intrigue being key engineerable parameters.
Abstract: Autonomous Sensory Meridian Response (ASMR) has been remarkably popular in the recent decade, yet whether its effects can be deliberately engineered remains an open question. While ASMR effects validated through behavioral studies and neuro-physiological measurements such as electroencephalography (EEG) and related bio-signals, the acoustic mechanisms that trigger it remain poorly understood. We investigate whether ASMR responses can be systematically induced through controlled acoustic design, hypothesizing that cyclic patterns where predictability drives relaxation and variation sustains intrigue are key engineerable parameters. Specifically, we design cyclic sound patterns with varying predictability and randomness, and evaluate their effects via a structured user study. Signal processing-based feature extraction and regression analysis are used to establish an interpretable mapping between acoustic structure and perceived ASMR effects. Results show that relaxing effects accumulate progressively, are independent of spatial orientation, and remain stable across time. Crucially, smoothly spread, energy-dense cyclic patterns most effectively trigger ASMR, suggesting that signal-level engineering of ASMR experiences is achievable
eess.IV
[573] PSIRNet: Deep Learning-based Free-breathing Rapid Acquisition Late Enhancement Imaging
Arda Atalik, Hui Xue, Rhodri H. Davies, Thomas A. Treibel, Daniel K. Sodickson, Michael S. Hansen, Peter Kellman
Main category: eess.IV
TL;DR: Deep learning method PSIRNet reconstructs diagnostic-quality cardiac MRI images from single acquisition, reducing scan time 8-24x compared to traditional motion-corrected methods.
Details
Motivation: To address the long acquisition times (8-24 signal averages) required for motion-corrected PSIR LGE cardiac MRI by developing a deep learning approach that can produce diagnostic-quality images from just a single acquisition over two heartbeats.Method: Developed PSIRNet, a physics-guided deep learning network with 845 million parameters trained end-to-end on 800,653 slices from 55,917 patients. The network reconstructs PSIR images with surface coil correction from single interleaved IR/PD acquisitions. Training used patient-split data from different institutions to ensure generalization.
Result: PSIRNet reconstructions were rated superior to traditional MOCO PSIR for dark blood LGE by both expert readers, and either superior or equivalent for bright blood and wideband variants. Inference time was ~100 msec per slice vs >5 seconds for MOCO PSIR, enabling 8-24x faster acquisition.
Conclusion: PSIRNet enables diagnostic-quality free-breathing PSIR LGE cardiac MRI from single acquisitions, dramatically reducing scan times while maintaining or improving image quality compared to conventional methods.
Abstract: Purpose: To develop and evaluate a deep learning (DL) method for free-breathing phase-sensitive inversion recovery (PSIR) late gadolinium enhancement (LGE) cardiac MRI that produces diagnostic-quality images from a single acquisition over two heartbeats, eliminating the need for 8 to 24 motion-corrected (MOCO) signal averages. Materials and Methods: Raw data comprising 800,653 slices from 55,917 patients, acquired on 1.5T and 3T scanners across multiple sites from 2016 to 2024, were used in this retrospective study. Data were split by patient: 640,000 slices (42,822 patients) for training and the remainder for validation and testing, without overlap. The training and testing data were from different institutions. PSIRNet, a physics-guided DL network with 845 million parameters, was trained end-to-end to reconstruct PSIR images with surface coil correction from a single interleaved IR/PD acquisition over two heartbeats. Reconstruction quality was evaluated using SSIM, PSNR, and NRMSE against MOCO PSIR references. Two expert cardiologists performed an independent qualitative assessment, scoring image quality on a 5-point Likert scale across bright blood, dark blood, and wideband LGE variants. Paired superiority and equivalence (margin = 0.25 Likert points) were tested using exact Wilcoxon signed-rank tests at a significance level of 0.05 using R version 4.5.2. Results: Both readers rated single-average PSIRNet reconstructions superior to MOCO PSIR for dark blood LGE (conservative P = .002); for bright blood and wideband, one reader rated it superior and the other confirmed equivalence (all P < .001). Inference required approximately 100 msec per slice versus more than 5 sec for MOCO PSIR. Conclusion: PSIRNet produces diagnostic-quality free-breathing PSIR LGE images from a single acquisition, enabling 8- to 24-fold reduction in acquisition time.
[574] MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification
Mohammed Maaz Sibhai, Abedalrhman Alkhateeb, Saad B. Ahmed
Main category: eess.IV
TL;DR: Enhanced Medical Transformer with evidential uncertainty quantification and prototype-based learning improves model calibration and selective prediction for medical imaging tasks across multiple modalities.
Details
Motivation: Deep learning models for clinical applications need dependable uncertainty quantification beyond just accuracy. Current Medical Vision Transformers often produce overconfident predictions and lack transparency, especially with noisy, imbalanced clinical data.Method: Enhanced modified Medical Transformer (MedFormer) with prototype-based learning and uncertainty-guided routing, using Dirichlet distribution for per-token evidential uncertainty. Uncertainty actively participates in training to filter unreliable feature updates, and class-specific prototypes structure the embedding space for visual similarity-based decisions.
Result: Testing across four modalities (mammography, ultrasound, MRI, histopathology) shows significant improvement in model calibration, reducing expected calibration error (ECE) by up to 35%, and enhanced selective prediction even with modest accuracy gains.
Conclusion: The proposed framework provides reliable uncertainty quantification for medical vision transformers, addressing overconfidence and transparency issues in clinical applications through evidential uncertainty and prototype-based learning.
Abstract: To ensure safe clinical integration, deep learning models must provide more than just high accuracy; they require dependable uncertainty quantification. While current Medical Vision Transformers perform well, they frequently struggle with overconfident predictions and a lack of transparency, issues that are magnified by the noisy and imbalanced nature of clinical data. To address this, we enhanced the modified Medical Transformer (MedFormer) that incorporates prototype-based learning and uncertainty-guided routing, by utilizing a Dirichlet distribution for per-token evidential uncertainty, our framework can quantify and localize ambiguity in real-time. This uncertainty is not just an output but an active participant in the training process, filtering out unreliable feature updates. Furthermore, the use of class-specific prototypes ensures the embedding space remains structured, allowing for decisions based on visual similarity. Testing across four modalities (mammography, ultrasound, MRI, and histopathology) confirms that our approach significantly enhances model calibration, reducing expected calibration error (ECE) by up to 35%, and improves selective prediction, even when accuracy gains are modest.
[575] Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
Wongi Jeong, Hoigi Seo, Se Young Chun
Main category: eess.IV
TL;DR: Proposes a training-free method to generate low-resolution preview images that maintain perceptual consistency with their high-resolution counterparts, enabling efficient workflow for image generation with computational savings.
Details
Motivation: Current image generation workflows require generating many high-resolution images with different prompts/seeds, which is computationally expensive. Generating low-resolution images first could reduce costs, but maintaining perceptual consistency between LR and HR versions is challenging.Method: Proposes commutator-zero condition to ensure LR-HR perceptual consistency for flow matching models. Uses training-free solution with downsampling matrix selection and commutator-zero guidance to generate perceptually consistent preview images.
Result: Method achieves up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, achieves up to 3× speedup. Also extends to image manipulations like warping and translation.
Conclusion: Proposed method enables efficient image generation workflow by generating perceptually consistent low-resolution previews, reducing computational costs while maintaining quality. Generalizable to various image manipulations.
Abstract: Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3$\times$ speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.
[576] A GPU-enhanced workflow for non-Fourier SENSE reconstruction
Samuel Bianchi, Klaas P. Pruessmann
Main category: eess.IV
TL;DR: Non-Fourier SENSE reconstruction for MRI using GPU-accelerated implementation with accurate coil sensitivity and B0 mapping for challenging acquisition scenarios.
Details
Motivation: MRI image reconstruction in challenging scenarios requires accurate characterization of coil sensitivity profiles and local off-resonances (B0), but traditional Fourier-based methods are incompatible with these signal models, making reconstruction computationally demanding.Method: Developed a workflow for accurate sensitivity and B0 mapping with GPU-accelerated non-Fourier SENSE reconstruction implementation using FFT on GPU. Analyzed practical aspects like stopping criteria and artifact sources.
Result: Demonstrated highly performant reconstruction on 2D and 3D spiral datasets with readout durations up to 71.5ms and undersampling factors up to R=7. GPU execution greatly boosted reconstruction speed, with proper stopping criteria crucial for image quality.
Conclusion: The GPU-accelerated non-Fourier SENSE implementation achieves practical runtimes and robust computation of coil sensitivity profiles and off-resonance maps for challenging MRI reconstruction scenarios.
Abstract: Purpose: Image reconstruction in challenging scenarios requires accurate characterisations of coil sensitivity profiles, local off-resonances (B0) and effective encoding fields. Reconstruction methods utilising all of this information rely on signal models that are not compatible with the classical Fourier/k-space interpretation of the coil data. Hence, the FFT and related techniques are no more applicable, rendering image reconstruction computationally demanding. Methods: This article contains a workflow for accurate sensitivity and B0 mapping as well as other required processing steps. An implementation of non-Fourier SENSE reconstruction is provide that is well suited for execution on a GPU using the FFT. Important practical aspects like stopping criteria and sources of image artifacts are analyzed and documented. Results: Highly performant image reconstruction could be demonstrated on a 2D and 3D spiral dataset. These datasets contain trajectories featuring readout durations up to 71.5ms and undersampling factors up to R = 7. Running the reconstruction on a GPU greatly boosts reconstruction speed. Stopping the reconstruction at the right moment is crucial for image quality. All methods included in this article are available in a public code repository. Conclusion: The provided implementation of non-Fourier SENSE reconstruction is highly performant. When it is executed on GPU, runtimes reach a duration feasible in practice. The presented workflow ensures robust and accurate computation of coil sensitive profiles and off-resonance maps.
[577] AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer
Gautier Hénique, William Le, Gabriel Dayan, Coralie Brodeur, Kristoff Nelson, Apostolos Christopoulos, Edith Filion, Phuc-Felix Nguyen-Tan, Laurent Letourneau-Guillon, Houda Bahig, Samuel Kadoury
Main category: eess.IV
TL;DR: Automated pipeline using CT images and clinical data to detect extranodal extension in HPV-positive oropharyngeal cancer and predict treatment outcomes with multimodal attention-based models.
Details
Motivation: Extranodal extension (ENE) is an important prognostic factor in HPV-associated oropharyngeal cancer but faces clinical integration challenges due to segmentation inconsistencies, low CT contrast, and laborious manual annotations.Method: End-to-end pipeline with hierarchical 3D semi-supervised segmentation for ENE detection, radiomics/deep feature extraction for ENE grading classification, and multimodal attention-based outcome prediction combining nodal features with primary tumor characteristics.
Result: Validated on 397 HPV-positive OPC patients, achieving 88.2% AUC for metastatic recurrence, 79.2% for overall survival, and 78.1% for disease-free survival at 2-year mark, outperforming baseline models.
Conclusion: The automated pipeline provides accurate ENE assessment and outcome prediction, making it feasible for clinical decision-making in HPV-positive oropharyngeal cancer management.
Abstract: Extranodal extension (ENE) is an emerging prognostic factor in human papillomavirus (HPV)-associated oropharyngeal cancer (OPC), although it is currently omitted as a clinical staging criteria. Recent works have advocated for the inclusion of iENE as a prognostic marker in HPV-positive OPC staging. However, several practical limitations continue to hinder its clinical integration, including inconsistencies in segmentation, low contrast in the periphery of metastatic lymph nodes on CT imaging, and laborious manual annotations. To address these limitations, we propose a fully automated end-to-end pipeline that uses computed tomography (CT) images with clinical data to assess the status of nodal ENE and predict treatment outcomes. Our approach includes a hierarchical 3D semi-supervised segmentation model designed to detect and delineate relevant iENE from radiotherapy planning CT scans. From these segmentations, a set of radiomics and deep features are extracted to train an imaging-detected ENE grading classifier. The predicted ENE status is then evaluated for its prognostic value and compared with existing staging criteria. Furthermore, we integrate these nodal features with primary tumor characteristics in a multimodal, attention-based outcome prediction model, providing a dynamic framework for outcome prediction. Our method is validated in an internal cohort of 397 HPV-positive OPC patients treated with radiation therapy or chemoradiotherapy between 2009 and 2020. For outcome prediction at the 2-year mark, our pipeline surpassed baseline models with 88.2% (4.8) in AUC for metastatic recurrence, 79.2% (7.4) for overall survival, and 78.1% (8.6) for disease-free survival. We also obtain a concordance index of 83.3% (6.5) for metastatic recurrence, 71.3% (8.9) for overall survival, and 70.0% (8.1) for disease-free survival, making it feasible for clinical decision making.
[578] Compositional-Degradation UAV Image Restoration: Conditional Decoupled MoE Network and A Benchmark
Jinquan Yan, Zhicheng Zhao, Zhengzheng Tu, Chenglong Li, Jin Tang, Bin Luo
Main category: eess.IV
TL;DR: DAME-Net is a degradation-aware mixture-of-experts network for compositional UAV image restoration that explicitly perceives multiple degradation factors and uses conditioned expert routing for selective correction.
Details
Motivation: UAV images often suffer from multiple degradation factors (rain, haze, noise) simultaneously, but current unified restoration approaches use implicit degradation representations that entangle multiple factors, causing mutual interference in heterogeneous corrections.Method: Proposes DAME-Net with: 1) Factor-wise Degradation Perception Module (FDPM) for explicit per-factor degradation cues via multi-label prediction with label-similarity-guided soft alignment; 2) Conditioned Decoupled MoE Module (CDMM) for stage-wise conditioning, spatial-frequency hybrid processing, and mask-constrained decoupled expert routing; 3) Constructs MDUR benchmark with 43 degradation configurations.
Result: Extensive experiments on MDUR show consistent improvements over representative unified restoration methods, with greater gains on unseen and higher-order composite degradations. Downstream experiments validate benefits for UAV object detection.
Conclusion: DAME-Net effectively addresses compositional UAV image restoration by decoupling explicit degradation perception from degradation-conditioned reconstruction, outperforming existing methods and improving downstream task performance.
Abstract: UAV images are critical for applications such as large-area mapping, infrastructure inspection, and emergency response. However, in real-world flight environments, a single image is often affected by multiple degradation factors, including rain, haze, and noise, undermining downstream task performance. Current unified restoration approaches typically rely on implicit degradation representations that entangle multiple factors into a single condition, causing mutual interference among heterogeneous corrections. To this end, we propose DAME-Net, a Degradation-Aware Mixture-of-Experts Network that decouples explicit degradation perception from degradation-conditioned reconstruction for compositional UAV image restoration. Specifically, we design a Factor-wise Degradation Perception module(FDPM) to provide explicit per-factor degradation cues for the restoration stage through multi-label prediction with label-similarity-guided soft alignment, replacing implicit entangled conditions with interpretable and generalizable degradation descriptions. Moreover, we develop a Conditioned Decoupled MoE module(CDMM) that leverages these cues for stage-wise conditioning, spatial-frequency hybrid processing, and mask-constrained decoupled expert routing, enabling selective factor-specific correction while suppressing irrelevant interference. In addition, we construct the Multi-Degradation UAV Restoration benchmark (MDUR), the first large-scale UAV benchmark for compositional UAV image restoration, with 43 degradation configurations from single degradations to four-factor composites and standardized seen/unseen splits.Extensive experiments on MDUR demonstrate consistent improvements over representative unified restoration methods, with greater gains on unseen and higher-order composite degradations. Downstream experiments further validate benefits for UAV object detection.
[579] UHD Low-Light Image Enhancement via Real-Time Enhancement Methods with Clifford Information Fusion
Xiaohan Wang, Chen Wu, Dawei Zhao, Guangwei Gao, Dianjie Lu, Guijuan Zhang, Linwei Fan, Xu Lu, Shuai Wu, Hang Wei, Zhuoran Zheng
Main category: eess.IV
TL;DR: Real-time UHD low-light image enhancement using Clifford algebra for geometric feature fusion, achieving millisecond inference on edge devices.
Details
Motivation: Existing UHD low-light restoration methods suffer from memory bottlenecks and cannot achieve real-time millisecond-level inference on edge devices, necessitating more efficient solutions.Method: Four-layer feature pyramid with Gaussian blur for frequency decomposition, lightweight U-Net with depthwise separable convolution, and spatially aware Clifford algebra mapping features to multivector space (scalars, vectors, bivectors) for noise-suppressing feature fusion. Outputs adaptive Gamma and Gain maps based on Retinex theory.
Result: Achieves millisecond-level inference for 4K/8K images on consumer-grade devices while outperforming SOTA models on restoration metrics through FP16 mixed-precision and dynamic operator fusion.
Conclusion: The proposed Clifford algebra-based geometric feature fusion enables real-time UHD low-light enhancement with efficient memory usage and high-quality restoration on edge devices.
Abstract: Considering efficiency, ultra-high-definition (UHD) low-light image restoration is extremely challenging. Existing methods based on Transformer architectures or high-dimensional complex convolutional neural networks often suffer from the “memory wall” bottleneck, failing to achieve millisecond-level inference on edge devices. To address this issue, we propose a novel real-time UHD low-light enhancement network based on geometric feature fusion using Clifford algebra in 2D Euclidean space. First, we construct a four-layer feature pyramid with gradually increasing resolution, which decomposes input images into low-frequency and high-frequency structural components via a Gaussian blur kernel, and adopts a lightweight U-Net based on depthwise separable convolution for dual-branch feature extraction. Second, to resolve structural information loss and artifacts from traditional high-low frequency feature fusion, we introduce spatially aware Clifford algebra, which maps feature tensors to a multivector space (scalars, vectors, bivectors) and uses Clifford similarity to aggregate features while suppressing noise and preserving textures. In the reconstruction stage, the network outputs adaptive Gamma and Gain maps, which perform physically constrained non-linear brightness adjustment via Retinex theory. Integrated with FP16 mixed-precision computation and dynamic operator fusion, our method achieves millisecond-level inference for 4K/8K images on a single consumer-grade device, while outperforming state-of-the-art (SOTA) models on several restoration metrics.
[580] Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application
Junqi Liu, Yun Zhang, Xiaoxia Huang, Long Xu, Weisi Lin
Main category: eess.IV
TL;DR: Proposes Multi-Task Just Recognizable Difference (MT-JRD) dataset and Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines, improving both prediction accuracy and coding efficiency across multiple vision tasks.
Details
Motivation: Current Just Recognizable Difference (JRD) approaches are limited to single-task scenarios, but real-world machine vision applications require handling multiple tasks simultaneously. There's a need to extend JRD to multi-task settings for more efficient Video Coding for Machines (VCM).Method: 1) Construct MT-JRD dataset with 27,264 JRD annotations supporting object detection, instance segmentation, and keypoint detection. 2) Develop AMT-JRD model with Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) for joint multi-task learning. 3) Incorporate object attribute information through Attribute Feature Fusion Module (AFFM) using size and location priors. 4) Apply predicted JRDs to VCM for bitrate reduction while preserving accuracy.
Result: AMT-JRD achieves precise multi-task prediction with mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming state-of-the-art single-task model by 6.7% and 6.3%. In VCM applications, improves average BD-mAP by 3.861% over VVC and 7.886% over JPEG baselines.
Conclusion: The proposed AMT-JRD model effectively extends JRD to multi-task scenarios, enhancing both prediction accuracy and coding efficiency for Video Coding for Machines, demonstrating practical benefits for real-world machine vision applications.
Abstract: Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model’s capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.
[581] DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification
Muazzem Hussain Khan, Tasdid Hasnain, Md. Jamil khan, Ruhul Amin, Md. Shamim Reza, Md. Al Mehedi Hasan, Md Ashad Alam
Main category: eess.IV
TL;DR: Swin-Vision Transformer-based transfer learning architecture for multi-cancer histopathological image classification achieves near-perfect accuracy across diverse cancer types.
Details
Motivation: To develop a robust multi-cancer classification system for histopathological images that can capture both long-range contextual dependencies and fine-grained local morphological patterns, addressing the need for reliable AI-assisted diagnosis in clinical settings.Method: Proposed a deep Swin-Vision Transformer-based transfer learning architecture that integrates hierarchical Swin Transformer with ResNet50-based convolution features extraction. Used balanced data preprocessing, transfer learning, and fine-tuning strategies on a comprehensive multi-cancer dataset including Breast, Oral, Lung-Colon, Kidney cancers, and Acute Lymphocytic Leukemia.
Result: Achieved 100% test accuracy for lung-colon cancer and segmented leukemia datasets, and up to 99.23% accuracy for breast cancer classification. Model demonstrated near-perfect precision, f1 score, and recall across diverse cancer types, outperforming state-of-the-art CNN and transfer models.
Conclusion: The proposed model establishes a highly accurate, interpretable, and robust multi-cancer classification system that provides a strong benchmark for future research and reliable AI-assisted histopathological diagnosis and clinical decision-making.
Abstract: In this study, we proposed a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classification. The proposed framework integrates a hierarchical Swin Transformer with ResNet50-based convolution features extraction, enabling the model to capture both long-range contextual dependencies and fine-grained local morphological patterns within histopathological images. To validate the efficiency of the proposed architecture, an extensive experiment was executed on a comprehensive multi-cancer dataset including Breast Cancer, Oral Cancer, Lung and Colon Cancer, Kidney Cancer, and Acute Lymphocytic Leukemia (ALL), including both original and segmented images were analyzed to assess model robustness across heterogeneous clinical imaging conditions. Our approach is benchmarked alongside several state-of-the-art CNN and transfer models, including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer models. However, all models were trained and validated using a unified pipeline, incorporating balanced data preprocessing, transfer learning, and fine-tuning strategies. The experimental results demonstrated that our proposed architecture consistently gained superior performance, reaching 100% test accuracy for lung-colon cancer, segmented leukemia datasets, and up to 99.23% accuracy for breast cancer classification. The model also achieved near-perfect precision, f1 score, and recall, indicating highly stable scores across divers cancer types. Overall, the proposed model establishes a highly accurate, interpretable, and also robust multi-cancer classification system, demonstrating strong benchmark for future research and provides a unified comparative assessment useful for designing reliable AI-assisted histopathological diagnosis and clinical decision-making.