Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge
Ya Jiang, Ruoyu Wang, Jingxuan Zhang, Jun Du, Yi Han, Zihao Quan, Hang Chen, Yeran Yang, Kongzhi Zheng, Zhuo Chen, Yanhui Tu, Shutong Niu, Changfeng Xi, Mengzhi Wang, Zhongbin Wu, Jieru Chen, Henghui Zhi, Weiyi Shi, Shuhang Wu, Genshun Wan, Jia Pan, Jianqing Gao
Main category: eess.AS
TL;DR: Multimodal cascaded system for recognizing and clustering multiple concurrent conversations in indoor social settings using audio-visual pretrained models with LLM enhancement
Details
Motivation: Address the challenge of recognizing and clustering multiple parallel dialogues in indoor social settings where speech overlap exceeds 90%, unlike conventional single-topic meetingsMethod: Multimodal cascaded system leveraging per-speaker visual streams from 360° video and single-channel audio, with enhanced audio-visual pretrained models for Active Speaker Detection, Audio-Visual Target Speech Extraction, and Audio-Visual Speech Recognition incorporating Whisper and LLM techniques
Result: Best single system achieves 32.44% Speaker WER; ROVER fusion reduces to 31.40%; LLM-based zero-shot conversational clustering achieves perfect F1 score of 1.0, yielding final JACER of 15.70%
Conclusion: The proposed multimodal approach effectively handles highly overlapping multi-conversation scenarios through audio-visual integration and LLM enhancement
Abstract: This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues–up to eight speakers across up to four simultaneous conversations–with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that leverages per-speaker visual streams extracted from synchronized 360 degree video together with single-channel audio. Our system improves three components of the pipeline by leveraging enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription accuracy. Our best single cascaded system achieves a Speaker Word Error Rate (WER) of 32.44% on the development set. By further applying ROVER to fuse outputs from diverse front-end and back-end variants, we reduce Speaker WER to 31.40%. Notably, our LLM-based zero-shot conversational clustering achieves a speaker clustering F1 score of 1.0, yielding a final Joint ASR-Clustering Error Rate (JACER) of 15.70%.
Relevance: 9/10
[2] CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction
Jiadong Wang, Ke Zhang, Xinyuan Qian, Ruijie Tao, Haizhou Li, Björn Schuller
Main category: cs.MM
TL;DR: Proposes a robust audio-visual speaker extraction method that handles degraded visual inputs without requiring degraded training data, using disentangled cues inspired by human perception.
Details
Motivation: Existing audio-visual speaker extraction methods perform well but are vulnerable to degraded visual inputs common in real-world scenarios. Current approaches require training with degraded data, which is impractical since visual degradation can occur in unpredictable ways.Method: Proposes an audio-visual learner that disentangles three distinct cues inspired by human perception: speaker information, acoustic synchronization, and semantic synchronization. Designs a dedicated interaction module to effectively integrate these cues for reliable speaker extraction guidance.
Result: Extensive experiments demonstrate strong robustness under various visual degradations and clear superiority over existing methods.
Conclusion: The proposed approach enhances robustness against impaired visual inputs without needing degraded videos during training, offering a more practical solution for real-world audio-visual speaker extraction.
Abstract: Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance, the issue of degraded visual inputs has received relatively little attention, despite being common in real-world scenarios. Previous attempts to address this problem have mainly involved training with degraded visual data. However, visual degradation can occur in many unpredictable ways, making it impractical to simulate all possible cases during training. In this paper, we aim to enhance the robustness of audio-visual speaker extraction against impaired visual inputs without relying on degraded videos during training. Inspired by observations from human perceptual mechanisms, we propose an audio-visual learner that disentangles speaker information, acoustic synchronisation, and semantic synchronisation as distinct cues. Furthermore, we design a dedicated interaction module that effectively integrates these cues to provide a reliable guidance signal for speaker extraction. Extensive experiments demonstrate the strong robustness of the proposed model under various visual degradations and its clear superiority over existing methods.
Relevance: 9/10
[3] Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
Yi Gu, Yanqing Liu, Chen Yang, Sheng Zhao
Main category: eess.AS
TL;DR: Reinforcement learning with GRPO and LLM-enhanced captions improves text-to-audio generation quality and text-audio alignment
Details
Motivation: Existing text-to-audio generation methods struggle with accurately rendering complex prompts involving intricate audio effects and achieving precise text-audio alignment, despite prior approaches using data augmentation, explicit timing conditioning, and reinforcement learning.Method: Uses LLM to generate high-fidelity audio captions for better text-audio semantic alignment, then applies Group Relative Policy Optimization (GRPO) reinforcement learning to fine-tune diffusion transformer-based T2A models with systematic experimentation of reward functions (CLAP, KL, FAD combinations).
Result: GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence, with systematic analysis identifying key drivers of effective RL in audio synthesis and how reward design impacts final audio quality.
Conclusion: Reinforcement learning with GRPO and LLM-enhanced captioning effectively improves text-to-audio generation quality and alignment, providing insights into reward function design for audio synthesis.
Abstract: Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yield substantial gains in synthesis fidelity and prompt adherence.
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 287]
- cs.CV [Total: 525]
- cs.AI [Total: 298]
- cs.SD [Total: 26]
- cs.LG [Total: 399]
- cs.MA [Total: 19]
- cs.MM [Total: 3]
- eess.AS [Total: 18]
- eess.IV [Total: 18]
cs.CL
[1] From Global to Local: Learning Context-Aware Graph Representations for Document Classification and Summarization
Ruangrin Ldallitsakool, Margarita Bugueño, Gerard de Melo
Main category: cs.CL
TL;DR: Proposes data-driven method for automatic graph-based document representation using dynamic sliding-window attention to capture semantic dependencies and structural relations, with applications to document classification and extractive summarization.
Details
Motivation: To create effective graph-based document representations that capture both local/mid-range semantic dependencies between sentences and structural relations within documents, building on recent work in dynamic sliding-window attention.Method: Leverages dynamic sliding-window attention module to automatically construct graph representations of documents, then uses Graph Attention Networks (GATs) trained on these learned graphs for document classification and extractive summarization tasks.
Result: GATs trained on the learned graphs achieve competitive results on document classification with lower computational resources than previous approaches. Exploratory evaluation shows potential for extractive summarization but also current limitations.
Conclusion: The proposed data-driven graph construction method using dynamic sliding-window attention provides effective document representations that enable competitive performance on document classification with computational efficiency, while showing promise for extractive summarization applications.
Abstract: This paper proposes a data-driven method to automatically construct graph-based document representations. Building upon the recent work of Bugueño and de Melo (2025), we leverage the dynamic sliding-window attention module to effectively capture local and mid-range semantic dependencies between sentences, as well as structural relations within documents. Graph Attention Networks (GATs) trained on our learned graphs achieve competitive results on document classification while requiring lower computational resources than previous approaches. We further present an exploratory evaluation of the proposed graph construction method for extractive document summarization, highlighting both its potential and current limitations. The implementation of this project can be found on GitHub.
[2] Noise reduction in BERT NER models for clinical entity extraction
Kuldeep Jiwani, Yash K Jeengar, Ayush Dhaka
Main category: cs.CL
TL;DR: A clinical NER system using BERT fine-tuning with a novel noise removal model that reduces false positives by analyzing probability distributions and semantic patterns in transformer embeddings.
Details
Motivation: Clinical entity extraction requires high precision, but fine-tuned encoder models for NER achieve good recall but insufficient precision. Simple probability thresholding fails due to transformer architectures assigning high confidence scores even to uncertain predictions.Method: Pre-trained an in-house BERT on clinical data, fine-tuned for NER, then developed a supervised Noise Removal model that analyzes probability sequences using features like Probability Density Map (PDM) to capture Semantic-Pull effects in transformer embeddings and classify predictions as weak/strong.
Result: The Noise Removal model reduced False Positives across various clinical NER models by 50% to 90%, significantly improving precision while maintaining recall performance.
Conclusion: The proposed noise removal approach effectively addresses precision limitations in clinical NER by leveraging sophisticated analysis of transformer probability distributions, enabling reliable clinical entity extraction with reduced false positives.
Abstract: Precision is of utmost importance in the realm of clinical entity extraction from clinical notes and reports. Encoder Models fine-tuned for Named Entity Recognition (NER) are an efficient choice for this purpose, as they don’t hallucinate. We pre-trained an in-house BERT over clinical data and then fine-tuned it for NER. These models performed well on recall but could not close upon the high precision range, needed for clinical models. To address this challenge, we developed a Noise Removal model that refines the output of NER. The NER model assigns token-level entity tags along with probability scores for each token. Our Noise Removal (NR) model then analyzes these probability sequences and classifies predictions as either weak or strong. A naïve approach might involve filtering predictions based on low probability values; however, this method is unreliable. Owing to the characteristics of the SoftMax function, Transformer based architectures often assign disproportionately high confidence scores even to uncertain or weak predictions, making simple thresholding ineffective. To address this issue, we adopted a supervised modeling strategy in which the NR model leverages advanced features such as the Probability Density Map (PDM). The PDM captures the Semantic-Pull effect observed within Transformer embeddings, an effect that manifests in the probability distributions of NER class predictions across token sequences. This approach enables the model to classify predictions as weak or strong with significantly improved accuracy. With these NR models we were able to reduce False Positives across various clinical NER models by 50% to 90%.
[3] Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs
Sean W. Kelley, Christoph Riedl
Main category: cs.CL
TL;DR: Personalization in LLMs increases affective alignment (emotional validation) but has context-dependent effects on epistemic alignment, with models showing more independence in advice roles but less in social peer roles.
Details
Motivation: LLMs exhibit sycophantic behavior by uncritically conforming to user beliefs, and as models increasingly use personalization (user traits, preferences, history), understanding how this modulates sycophancy is critical for evaluating personalized AI systems.Method: Rigorous evaluation across nine frontier models and five benchmark datasets spanning advice, moral judgment, and debate contexts, with systematic testing of personalization’s impact on affective and epistemic alignment, including robustness tests to isolate effects.
Result: Personalization generally increases affective alignment (emotional validation, hedging/deference) but affects epistemic alignment differently based on context and role: in advice roles, personalization strengthens epistemic independence (models challenge user presuppositions); in social peer roles, personalization decreases epistemic independence, with extensively personalized challenges causing LLMs to abandon positions at higher rates.
Conclusion: The work provides measurement frameworks for evaluating personalized AI systems, demonstrates necessity of role-sensitive evaluation, and establishes a novel benchmark for assessing goal alignment in LLMs with personalization.
Abstract: Large Language Models (LLMs) are prone to sycophantic behavior, uncritically conforming to user beliefs. As models increasingly condition responses on user-specific context (personality traits, preferences, conversation history), they gain information to tailor agreement more effectively. Understanding how personalization modulates sycophancy is critical, yet systematic evaluation across models and contexts remains limited. We present a rigorous evaluation of personalization’s impact on LLM sycophancy across nine frontier models and five benchmark datasets spanning advice, moral judgment, and debate contexts. We find that personalization generally increases affective alignment (emotional validation, hedging/deference), but affects epistemic alignment (belief adoption, position stability, resistance to influence) with context-dependent role modulation. When the LLM’s role is to give advice, personalization strengthens epistemic independence (models challenge user presuppositions). When its role is that of a social peer, personalization decreases epistemic independence. In this role, extensively personalized user challenges causing LLMs to abandon their position at significantly higher rates. Robustness tests confirm that the effects are driven by personalized conditioning, not by additional input tokens per se or demographic information alone. Our work provides measurement frameworks for evaluating personalized AI systems, demonstrates the necessity of role-sensitive evaluation, and establishes a novel benchmark to assess goal alignment.
[4] TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree
Main category: cs.CL
TL;DR: TAB-PO improves DPO for token-critical structured prediction by addressing margin collapse and gradient dilution through token-weighted advantages and conditional barriers.
Details
Motivation: Standard DPO struggles with low-separation preference pairs and token-importance skew in structured prediction tasks like medical annotation, where semantic tokens (labels, evidence spans) are sparse but crucial compared to frequent structural tokens.Method: TAB-PO augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that balances SFT-anchored likelihood and preference-driven separation.
Result: TAB-PO achieves ~4% relative improvement in micro-F1 over SFT on medical communication annotation and consistently outperforms recent preference-optimization baselines.
Conclusion: TAB-PO effectively addresses DPO’s limitations in token-critical structured prediction by focusing learning on important semantic tokens while maintaining proper likelihood separation.
Abstract: Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO’s sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence Spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes. We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence Spans from patient-provider messages. TAB-PO achieves a ~ 4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.
[5] ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents
Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu
Main category: cs.CL
TL;DR: ActMem is an actionable memory framework for LLM agents that integrates causal reasoning with memory retrieval to handle complex, logic-driven scenarios requiring conflict detection and decision-making.
Details
Motivation: Current memory frameworks treat agents as passive recorders and fail in scenarios requiring conflict detection and complex decision-making. There's a need for memory systems that understand deeper implications rather than just retrieving facts.Method: Proposes ActMem framework that transforms unstructured dialogue history into structured causal and semantic graphs. Uses counterfactual reasoning and commonsense completion to deduce implicit constraints and resolve conflicts between past states and current intentions.
Result: ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks. Introduces ActMemEval dataset for evaluating agent reasoning capabilities in logic-driven scenarios.
Conclusion: ActMem enables more consistent and reliable intelligent assistants by integrating memory retrieval with active causal reasoning, moving beyond fact-retrieval to handle complex reasoning scenarios.
Abstract: Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive “recorders” and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.
[6] EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal
Samah Fodeh, Yan Wang, Linhai Ma, Srivani Talakokkul, Jordan M. Alpert, Sarah Schellhorn
Main category: cs.CL
TL;DR: EPPCMinerBen is a benchmark for evaluating LLMs on analyzing electronic patient-provider communication, with tasks for classifying communication patterns and extracting evidence from medical messages.
Details
Motivation: As healthcare communication shifts to secure messaging, there's a need for automated analysis of electronic patient-provider exchanges to understand communication patterns and extract insights for better treatment outcomes.Method: Created a benchmark with 1,933 expert-annotated sentences from 752 secure messages, featuring three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Evaluated various LLMs under zero-shot and few-shot settings.
Result: Llama-3.1-70B led in evidence extraction (F1: 82.84%), Llama-3.3-70b-Instruct excelled in code classification (F1: 67.03%), and DeepSeek-R1-Distill-Qwen-32B performed best in subcode classification (F1: 48.25%). Few-shot prompting improved performance, while smaller models struggled.
Conclusion: Large instruction-tuned models perform well on EPPCMinerBen tasks, especially evidence extraction, while smaller models struggle with fine-grained reasoning. The benchmark supports future work on discourse-level understanding and patient-provider communication analysis.
Abstract: Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert annotated sentences from 752 secure messages of the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70b-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (>30% F1). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering
[7] Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec
Main category: cs.CL
TL;DR: Multi-pass LLM architecture improves French medical speech recognition by alternating speaker and word recognition passes, achieving significant WDER reductions in suicide prevention conversations with acceptable computational cost.
Details
Motivation: French medical conversations present challenges for automatic speech recognition with high word error rates (often >30%) in spontaneous clinical speech, necessitating improved transcription accuracy and speaker attribution for clinical applications.Method: Proposes a multi-pass LLM post-processing architecture that alternates between Speaker Recognition and Word Recognition passes. Investigates four design choices: model selection, prompting strategy, pass ordering, and iteration depth using Qwen3-Next-80B on two French clinical datasets.
Result: Significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), stable performance on awake neurosurgery consultations (n=10), zero output failures, and acceptable computational cost (RTF 0.32).
Conclusion: The multi-pass LLM architecture demonstrates feasibility for offline clinical deployment by improving transcription accuracy in French medical conversations while maintaining computational efficiency and reliability.
Abstract: Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment.
[8] Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models
Youngji Roh, Hyunjin Cho, Jaehyung Kim
Main category: cs.CL
TL;DR: LLMs have anisotropic representations with extreme activation dimensions that serve as interpretable functional units for domain specialization, enabling targeted steering for domain adaptation and jailbreaking.
Details
Motivation: Prior work treats extreme activation dimensions in LLMs as artifacts to be managed, but this paper proposes they are actually intrinsic interpretable functional units that emerge from domain specialization and can be leveraged for better model control.Method: Proposes a magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner, and introduces Critical Dimension Steering which applies activation steering exclusively to these identified dimensions rather than all dimensions.
Result: Empirical results show that Critical Dimension Steering outperforms conventional whole-dimension steering in both domain adaptation and jailbreaking scenarios, demonstrating the effectiveness of targeting specific interpretable semantic detectors.
Conclusion: Extreme activation dimensions in LLMs are not just artifacts but serve as interpretable functional units for domain specialization, and targeted steering of these dimensions enables more effective model control for various applications.
Abstract: Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
[9] SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
Xiaoxin Shi, Jiaxin Wan, Linkang Dong, Wei Jiang, Yue Liu, Zengfeng Huang
Main category: cs.CL
TL;DR: SimpleTool accelerates LLM function calling by using special tokens to compress redundant structure and enable parallel generation, achieving 3-6x speedup for real-time applications.
Details
Motivation: Autoregressive decoding in LLM-based function calling creates latency bottlenecks that limit real-time applications like embodied intelligence and interactive avatars, which require high-frequency control (e.g., 10 Hz).Method: Introduces special tokens that serve dual purposes: compressing low-entropy structural tokens (4-6x reduction) and acting as mode selectors to enable independent parallel generation of function names and arguments.
Result: Achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. On Mobile Actions, ST-Qwen-0.5B outperforms Google’s FunctionGemma in accuracy and latency consistency. With quantization, achieves 61.2ms P50 latency enabling 16 Hz real-time control at 4B model scale.
Conclusion: SimpleTool bridges the gap between LLM function calling and latency-critical real-world deployment by exploiting structural redundancy and weak causal dependencies in function outputs for parallel generation.
Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google’s FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
[10] Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, Zhizheng Wu
Main category: cs.CL
TL;DR: The paper investigates why speech-based models underperform text-based models despite advances in Large Speech-Language Models, finding that speech representations have broad cross-layer alignment due to redundancy, and that simple calibration fails because the modality gap is structural rather than just distributional.
Details
Motivation: Despite progress in Large Speech-Language Models bridging acoustic signals and linguistic understanding, there remains a performance gap between speech-based input tasks and direct text inference. The authors aim to understand the dynamic roots of this modality gap beyond static geometric alignment.Method: The authors evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. They use cross-layer CKA (Centered Kernel Alignment) analysis with speech-text token alignment to examine how speech and text representations evolve layer-by-layer.
Result: Speech representations show a broad cross-layer alignment band due to the redundant nature of speech where semantic content spans multiple frames. These alignment patterns are structurally stable across configurations. Simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating the modality gap is not just a distribution shift.
Conclusion: The bottleneck lies in condensing redundant speech into stable late-layer decisions. Future solutions should operate at token or temporal granularity rather than feature-level matching to address the structural differences between speech and text representations.
Abstract: Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
[11] GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency
Changhao Wang, Jiaolong Yang, Xinhao Yao, Yunfei Yu, Peng Jiao, Lu Yu, Junpeng Fang, Riccardo Cantoro, Qing Cui, Jun Zhou
Main category: cs.CL
TL;DR: GRIP is a data selection framework that unifies global distribution balancing with local instance selection using geometric modeling and adaptive information potential to improve LLM training efficiency.
Details
Motivation: Existing data selection methods for LLMs often separate global distribution balancing from local instance selection, compromising the hierarchical integrity of training sets and limiting data efficiency despite the importance of data quality over raw scaling volume.Method: GRIP models the corpus as an information-dense geometric space, using a Rapid Adaptation Probe (RAP) to quantify information potential of semantic clusters and dynamically re-allocate sampling budget to regions with highest representation deficits, followed by intra-cluster selection using length-rectified geometric prior to counteract embedding density artifacts.
Result: Extensive evaluations on Mixture-of-Experts models up to 300B tokens show GRIP consistently outperforms state-of-the-art baselines, surpassing performance of models trained on 3× larger uncurated datasets.
Conclusion: GRIP establishes a robust geometric foundation for adaptive data curation in large-scale pre-training, demonstrating that unified geometric modeling of global and local data selection dimensions significantly improves LLM training efficiency.
Abstract: The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity of the training set. We introduce \textbf{GRIP} (Geometric Refinement and Adaptive Information Potential), a framework that unifies these dimensions by modeling the corpus as an information-dense geometric space. GRIP employs a \textbf{Rapid Adaptation Probe (RAP)} to quantify the information potential of semantic clusters, dynamically re-allocating the sampling budget to regions with the highest representation deficits. Subsequently, we perform Intra-Cluster Selection using a \textbf{length-rectified geometric prior} to counteract embedding density artifacts and preserve long-tail logical sequences. Extensive evaluations on Mixture-of-Experts (MoE) models up to 300B tokens demonstrate that GRIP consistently outperforms state-of-the-art baselines, \textbf{surpassing the performance of models trained on $3\times$ larger uncurated datasets}. Our work establishes a robust geometric foundation for adaptive data curation in large-scale pre-training.
[12] Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Delip Rao, Chris Callison-Burch
Main category: cs.CL
TL;DR: Autorubric: A unified framework for rubric-based evaluation with LLMs, offering comprehensive features for reliable text assessment at scale.
Details
Motivation: Current LLM-based rubric evaluation techniques are scattered across papers with inconsistent terminology and partial solutions, lacking a unified framework for reliable, large-scale text assessment.Method: Proposes Autorubric, an open-source Python framework supporting binary/ordinal/nominal criteria with weights, multi-judge ensemble evaluation, few-shot calibration, bias mitigations (position, verbosity, criterion conflation), and psychometric reliability metrics.
Result: Evaluated on three benchmarks (RiceChem, ResearcherBench, CHARM-100), produces results consistent with published benchmarks while demonstrating key capabilities like per-criterion evaluation, multi-judge ensembles, and mixed criterion types.
Conclusion: Autorubric provides a comprehensive, unified solution for reliable LLM-based rubric evaluation at scale, with production-ready infrastructure and introduces CHARM-100 dataset for framework stress-testing.
Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen’s $κ$, weighted $κ$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework’s key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
[13] Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin
Main category: cs.CL
TL;DR: SWAP is a reinforcement learning framework that reduces reasoning length in large language models by adaptively penalizing less important reasoning steps while preserving essential ones, achieving significant length reduction with improved accuracy.
Details
Motivation: Large reasoning models often produce unnecessarily long chains-of-thought that increase computational cost without improving accuracy. Existing RL approaches use blunt length penalties that cannot distinguish essential from redundant reasoning steps, motivating a more fine-grained approach.Method: Step-wise Adaptive Penalization (SWAP) estimates step importance from the model’s on-policy log-probability improvement toward correct answers, then redistributes length penalty mass to penalize low-importance steps more heavily while preserving high-importance reasoning. Uses unified outcome-process advantage within group-relative policy optimization.
Result: SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model, demonstrating effective compression without sacrificing performance.
Conclusion: The SWAP framework successfully addresses the overthinking problem in large reasoning models by providing fine-grained, adaptive length reduction that preserves essential reasoning while eliminating redundancy, offering a practical solution for efficient reasoning.
Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model’s on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
[14] From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
Matic Korun
Main category: cs.CL
TL;DR: Study tests geometric hallucination taxonomy in GPT-2, finding Type 3 (coverage gaps) most distinctive, Type 1/2 not separable, and token-level tests inflate significance via pseudoreplication.
Details
Motivation: To validate a geometric hallucination taxonomy by testing whether different hallucination types (center-drift, wrong-well convergence, coverage gaps) can be distinguished through controlled induction in language models.Method: Used two-level statistical design with prompts as unit of inference, running each experiment 20 times with different generation seeds. Tested static embeddings and contextual hidden states in GPT-2, analyzing norm separation and effect sizes.
Result: Type 3 (coverage gaps) showed robust norm separation in static embeddings (significant in 18/20 runs). In contextual states, Type 3 effect direction was stable but underpowered. Types 1 and 2 did not separate in either space. Token-level tests inflated significance 4-16× through pseudoreplication.
Conclusion: Coverage-gap hallucinations are the most geometrically distinctive failure mode, characterized by magnitude rather than direction differences. Type 1/2 non-separation is genuine at 124M parameters, and token-level analysis suffers from pseudoreplication issues.
Abstract: We test whether a geometric hallucination taxonomy – classifying failures as center-drift (Type1), wrong-well convergence (Type2), or coverage gaps (Type3) – can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts ($N = 15$/group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median $r = +0.61$). In contextual hidden states, the Type3 norm effect direction is stable (19/20 runs) but underpowered at $N = 15$ (significant in 4/20, median $r = -0.28$). Types1 and2 do not separate in either space (${\leq},3/20$ runs). Token-level tests inflate significance by 4–16$\times$ through pseudoreplication – a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type1/2 non-separation as genuine at 124M parameters.
[15] When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang
Main category: cs.CL
TL;DR: Fine-tuning Llama 2 7B on real patient-doctor transcripts improves medical query accuracy, though human expert validation remains essential.
Details
Motivation: LLMs perform poorly in medical contexts, potentially providing harmful misinformation. Need to improve reliability for healthcare applications.Method: Supervised fine-tuning of Llama 2 7B transformer model using transcripts from real patient-doctor interactions, evaluated with text similarity metrics.
Result: Fine-tuned model showed significant improvements across key dimensions (except GPT-4 evaluation), but human expert validation is recommended.
Conclusion: Fine-tuning improves medical LLM accuracy, but deployment requires human expert review and evaluation due to limitations of automated metrics.
Abstract: This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including medical queries, concerns about their reliability have surfaced. A recent study by Long Island University highlighted that LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. To address this, our research focuses on fine-tuning the Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions. Our objective was to enhance the model’s accuracy and precision in responding to medical queries. We fine-tuned the model using a supervised approach, emphasizing domain-specific nuances captured in the training data. In the best scenario, the model results should be reviewed and evaluated by real medical experts. Due to resource constraints, the performance of the fine-tuned model was evaluated using text similarity metrics. The fine-tuned model demonstrated significant improvements across all key dimensions except GPT-4’s evaluation. The evaluations of ChatGPT4 are quite different from the quantitative results; here, we not only suggest, but also propose that the result should be evaluated by human medical experts.
[16] How Large Language Models Get Stuck: Early structure with persistent errors
Alokesh Manna, William Snyder, Whitney Tabor
Main category: cs.CL
TL;DR: OPT model trained on BabyLM dataset shows persistent grammatical preference failures on BLiMP benchmark, with early erroneous entrenchment that resists correction during training.
Details
Motivation: To investigate whether linguistic insights can improve LLM training efficiency by examining how models learn grammatical preferences during training, specifically looking at when and why models fail to prefer grammatical sentences.Method: Trained Meta’s OPT model on 100M word BabyLM dataset, evaluated on BLiMP benchmark with 67 grammatical classes, tracked model preferences across training iterations using qualitative (linguistic theory, deep learning theory) and quantitative (numerical testing) assessments.
Result: OPT fails to consistently prefer grammatical sentences in nearly one-third of BLiMP classes, often establishing erroneous likelihood separation early in training that persists throughout training phase, suggesting entrenched biases that are costly to reverse.
Conclusion: Proposes Bigram Hypothesis: erroneous entrenchment occurs when bigram statistics bias models toward wrong distinctions early in training, with ongoing work to test this hypothesis on selected BLiMP classes.
Abstract: Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta’s OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model’s preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order for the model to perform well. We probe this phenomenon using a mixture of qualitative (based on linguistic theory and the theory of Deep Learning) and quantitative (based on numerical testing) assessments. Our qualitative assessments indicate that only some BLiMP tests are meaningful guides. We conclude by articulating a hypothesis, the Bigram Hypothesis, which claims that the learning process will exhibit erroneous entrenchment if bigram statistics bias the model toward wrong distinctions early in training, and we describe a method (in progress) of testing the hypothesis on appropriately selected BLiMP classes.
[17] Distribution-Aware Companding Quantization of Large Language Models
Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva
Main category: cs.CL
TL;DR: Multi-token prediction training improves sample efficiency and inference speed for language models by predicting multiple future tokens simultaneously
Details
Motivation: Standard language models use next-token prediction loss, which may be suboptimal for sample efficiency. The authors hypothesize that predicting multiple future tokens simultaneously could improve learning and downstream capabilities.Method: Train language models with n independent output heads on top of a shared trunk to predict n future tokens at each position. This multi-token prediction serves as an auxiliary training task alongside standard next-token prediction.
Result: 13B parameter models show 12% improvement on HumanEval and 17% on MBPP coding benchmarks. Models demonstrate better development of induction heads and algorithmic reasoning. 4-token prediction models achieve up to 3x faster inference speed.
Conclusion: Multi-token prediction is an effective auxiliary training objective that improves sample efficiency, downstream capabilities (especially for coding), and inference speed without increasing training time.
Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solves 12 % more problems on HumanEval and 17 % more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3X times faster at inference, even with large batch sizes.
[18] Policy Compliance of User Requests in Natural Language for AI Systems
Pedro Cisneros-Velarde
Main category: cs.CL
TL;DR: A benchmark for evaluating AI system compliance with organizational policies when processing natural language user requests, with performance analysis of various LLM models.
Details
Motivation: Organizations need to ensure AI systems safely handle user requests by complying with diverse policies, but lack benchmarks to evaluate policy compliance assessment capabilities.Method: Created first benchmark with annotated user requests showing diverse compliance levels with organizational policies, then evaluated various LLM models using different solution methods on this benchmark.
Result: Benchmark reveals challenging nature of policy compliance assessment, with performance differences across LLM models and solution methods highlighting the difficulty of this problem.
Conclusion: Policy compliance assessment for AI systems is a challenging problem requiring specialized benchmarks and evaluation, with current LLM models showing varied performance on this task.
Abstract: Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring such user requests comply with a list of diverse policies determined by the organization with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests of diverse compliance with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLM models on policy compliance assessment under different solution methods. We analyze the differences on performance metrics across the models and solution methods, showcasing the challenging nature of our problem.
[19] Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages
Kaushal Santosh Bhogale, Tahir Javed, Greeshma Susan John, Dhruv Rathi, Akshayasree Padmanaban, Niharika Parasa, Mitesh M. Khapra
Main category: cs.CL
TL;DR: Proposes OIWER, an ASR evaluation metric for Indian languages that accounts for orthographic variations using LLMs, reducing pessimistic error rates and better aligning with human perception than traditional WER.
Details
Motivation: Traditional Word Error Rate (WER) for ASR evaluation in Indian languages is problematic due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words, creating a bleaker picture than human users perceive.Method: Leverages recent advances in LLMs to create a framework for benchmarks that capture permissible orthographic variations, proposing OIWER (Orthographically-Informed Word Error Rate) that accounts for these variations.
Result: OIWER reduces pessimistic error rates by an average of 6.3 points, narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
Conclusion: The proposed OIWER framework using LLMs provides a more accurate ASR evaluation metric for Indian languages by accounting for orthographic variations, better reflecting real-world performance and human perception.
Abstract: Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that OIWER, by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
[20] LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie, Yifan Huang, Lei Wu, Wei Han, Haishuai Wang, Jiajun Bu
Main category: cs.CL
TL;DR: Fact-Flow: A framework that separates visual fact identification from report generation in medical MLLMs to improve factual accuracy by first predicting clinical findings from images, then using those facts to guide report generation.
Details
Motivation: Current medical MLLMs for report generation suffer from factual instability - they may omit findings or include inaccurate information, limiting clinical applicability. Existing methods generate reports directly from image features without a solid factual foundation.Method: Fact-Flow separates visual fact identification from report generation. First predicts clinical findings from images, then uses these facts to guide MLLM report generation. Uses LLM to automatically create labeled medical findings dataset, eliminating need for expensive manual annotation.
Result: Extensive experiments on two disease-focused medical datasets show significant improvement in factual accuracy compared to state-of-the-art models while maintaining high text quality standards.
Conclusion: Fact-Flow effectively addresses factual instability in medical MLLMs by separating fact identification from generation, enabling more reliable clinical report generation with improved accuracy.
Abstract: The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.
[21] A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
Anna Feldman, Libby Barak, Jing Peng
Main category: cs.CL
TL;DR: Diagnostic framework tests multilingual language models’ reliance on word order vs. inflectional form using Universal Dependencies and inference-time perturbations.
Details
Motivation: To understand how multilingual masked language models (mBERT, XLM-R) process linguistic information by testing their reliance on word order versus inflectional morphology across different languages.Method: Uses Universal Dependencies to apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head-dependent swaps, and sentence-level lemma substitution (+L) that lemmatizes both context and masked target label.
Result: Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and does not mitigate scrambling impact. Top-5 accuracy shows same pattern.
Conclusion: Multilingual language models show strong reliance on word order across languages, with inflectional morphology playing varying roles depending on language typology. Models struggle when word order is disrupted.
Abstract: We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head–dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head–dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
[22] Chain of Correction for Full-text Speech Recognition with Large Language Models
Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang
Main category: cs.CL
TL;DR: Proposes Chain of Correction (CoC) - a multi-turn chat approach using LLMs to correct ASR errors segment by segment with better semantic understanding through full-text context.
Details
Motivation: While LLMs show promise for full-text ASR error correction, challenges remain in stability, controllability, completeness, and fluency of corrections.Method: CoC uses multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context. Fine-tunes pre-trained LLM on ChFT dataset.
Result: CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. Analyzes correction thresholds and extrapolates to extra-long ASR outputs.
Conclusion: CoC effectively addresses ASR error correction challenges and explores using additional information types to guide error correction.
Abstract: Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long context. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC’s performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC on extra-long ASR outputs, and explore using other types of information to guide error correction.
[23] CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
Swapnil Parekh
Main category: cs.CL
TL;DR: CIRCUS is a method for robust circuit discovery in neural networks that quantifies uncertainty across analyst choices by aggregating multiple pruning configurations into consensus circuits.
Details
Motivation: Current circuit discovery methods are brittle and sensitive to arbitrary analyst choices like pruning thresholds, producing unreliable "one-shot" explanations without principled uncertainty quantification.Method: CIRCUS constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns edge stability scores, and extracts strict-consensus circuits containing only edges appearing in all views.
Result: On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than union of all configurations while retaining comparable explanatory power, and consensus-identified nodes outperform non-consensus controls in causal validation (p=0.0004).
Conclusion: CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with explicit core/contingent/noise decomposition, requiring no retraining and adding negligible overhead.
Abstract: Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle “one-shot” explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust “core” circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
[24] CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging
Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao, Siliang Tang
Main category: cs.CL
TL;DR: CoMoL: A parameter-efficient MoE-LoRA framework with core space experts and token-level routing for fine-grained adaptation
Details
Motivation: Existing MoE-LoRA methods suffer from limited parameter efficiency and coarse-grained adaptation due to proliferation of LoRA experts and instance-level routingMethod: Introduces core space experts (compact core matrices) and core space routing (token-level expert selection), with soft-merging strategy and low-rank routing projection
Result: Achieves parameter efficiency comparable to standard LoRA while outperforming existing methods across multiple tasks
Conclusion: CoMoL retains MoE-LoRA adaptability with improved parameter efficiency and fine-grained adaptation
Abstract: Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (\textbf{CoMoL}), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. Besides, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks.
[25] Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research
Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang, Hehe Fan, Liang Zhang, Siyuan Zhao, Linyi
Main category: cs.CL
TL;DR: Super Research is a benchmark for evaluating LLMs on complex autonomous research tasks requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources.
Details
Motivation: While LLMs show proficiency in basic research/search tasks, their capacity for solving highly complex questions requiring extensive planning, evidence gathering, and synthesis across diverse sources remains unexplored.Method: Introduces Super Research task with three components: structured decomposition into research plans, super wide retrieval for diverse perspectives, and super deep investigation through iterative queries. Includes a benchmark of 300 expert-written questions requiring up to 100+ retrieval steps and 1,000+ web pages.
Result: Produces verifiable reports with fine-grained citations and intermediate artifacts. Presents graph-anchored auditing protocol evaluating five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health.
Conclusion: Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities, where proficiency acts as a proxy for general research competence, indicating robustness for subordinate research tasks.
Abstract: While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions-those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources-remains largely unexplored. We introduce Super Research, a task for complex autonomous research tasks that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model’s proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. Leaderboard is available at: https://cnsdqd-dyb.github.io/Super-Research-Benchmark/
[26] From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation
Raneen Younis, Suvinava Basak, Lukas Chavez, Zahra Ahmadi
Main category: cs.CL
TL;DR: AI Co-Scientist (CoDHy) is an interactive system for biomarker-guided drug combination hypothesis generation in cancer research, integrating structured databases and literature into a knowledge graph for graph-based reasoning and evidence-grounded hypothesis construction.
Details
Motivation: The rapid growth of biomedical literature and curated databases makes it difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses in cancer research.Method: Integrates structured biomedical databases and unstructured literature into a task-specific knowledge graph, then uses knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations while grounding hypotheses in retrievable evidence.
Result: Demonstrates CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases through a web-based interface.
Conclusion: CoDHy enables transparent, researcher-steerable exploration for biomarker-guided drug combination hypothesis generation in cancer research, rather than automated decision-making.
Abstract: The rapid growth of biomedical literature and curated databases has made it increasingly difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses. We present AI Co-Scientist (CoDHy), an interactive, human-in-the-loop system for biomarker-guided drug combination hypothesis generation in cancer research. CoDHy integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, which serves as the basis for graph-based reasoning and hypothesis construction. The system combines knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations, while explicitly grounding each hypothesis in retrievable evidence. Through a web-based interface, users can configure the scientific context, inspect intermediate results, and iteratively refine hypotheses, enabling transparent and researcher-steerable exploration rather than automated decision-making. We demonstrate CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases.
[27] QQ: A Toolkit for Language Identifiers and Metadata
Wessel Poelman, Yiyi Chen, Miryam de Lhoneux
Main category: cs.CL
TL;DR: QwanQwa is a Python toolkit for unified language metadata management that integrates multiple language resources, provides normalization and mapping between different language identifiers, and enables graph-based traversal across linguistic attributes.
Details
Motivation: The growing number of languages in multilingual NLP creates challenges for accurately reporting which languages are used and how, due to inconsistent language identifiers across datasets (BCP-47, ISO 639-1, Glottocodes). Manual mapping becomes unscalable with thousands of languages.Method: Developed a lightweight Python toolkit that integrates multiple language resources into a single interface, provides convenient normalization and mapping between language identifiers, and implements a graph-based structure for traversal across families, regions, writing systems, and other linguistic attributes.
Result: Created QwanQwa, which serves as both (1) a simple “glue” library in multilingual NLP research to make working with many languages easier, and (2) an intuitive way for exploring languages through shared scripts, regions, or other metadata.
Conclusion: QwanQwa addresses the scalability challenge of language metadata management in multilingual NLP by providing a unified toolkit that simplifies working with diverse language identifiers and enables exploration of linguistic relationships.
Abstract: The growing number of languages considered in multilingual NLP, including new datasets and tasks, poses challenges regarding properly and accurately reporting which languages are used and how. For example, datasets often use different language identifiers; some use BCP-47 (e.g. en_Latn), others use ISO 639-1 (en), and more linguistically oriented datasets use Glottocodes (stan1293). Mapping between identifiers is manageable for a few dozen languages, but becomes unscalable when dealing with thousands. We introduce QwanQwa, a light-weight Python toolkit for unified language metadata management. QQ integrates multiple language resources into a single interface, provides convenient normalization and mapping between language identifiers, and affords a graph-based structure that enables traversal across families, regions, writing systems, and other linguistic attributes. QQ serves both as (1) a simple “glue” library in multilingual NLP research to make working with many languages easier, and (2) as an intuitive way for exploring languages, such as finding related ones through shared scripts, regions or other metadata.
[28] Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
Anastasia Zhukova, Terry Ruas, Jan Philip Wahle, Bela Gipp
Main category: cs.CL
TL;DR: uCDCR is a unified dataset that consolidates diverse English CDCR (Cross-Document Coreference Resolution) corpora into consistent format, enabling standardized evaluation and analysis of both entity and event coreference across domains.
Details
Motivation: CDCR research is fragmented due to heterogeneous dataset formats, varying annotation standards, and overemphasis on event coreference resolution (ECR) over entity coreference. This fragmentation hinders reproducible research and fair cross-dataset comparisons.Method: Created uCDCR by consolidating publicly available English CDCR corpora into a consistent format, correcting inconsistencies, enriching datasets with missing attributes, and establishing standardized metrics and evaluation protocols for analysis.
Result: Analysis shows ECB+ (current state-of-the-art benchmark) has one of the lowest lexical diversities, and its CDCR complexity is middle-range. Using all uCDCR datasets improves model generalizability. Event and entity coreference resolution show similar complexity.
Conclusion: uCDCR provides a unified framework for reproducible CDCR research, showing that both event and entity coreference are complex tasks that shouldn’t be reduced to ECR alone, and that broader dataset usage improves model generalizability.
Abstract: Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
[29] BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita, Dominik Macko, Ivan Srba, Robert Moro, Dongwon Lee
Main category: cs.CL
TL;DR: BLUFF is a comprehensive multilingual benchmark for detecting false and synthetic content across 79 languages, featuring diverse content types and manipulation techniques to address gaps in low-resource language detection.
Details
Motivation: Current falsehood detection benchmarks are limited to English or high-resource languages, leaving low-resource linguistic communities vulnerable to misinformation without adequate defense tools.Method: Created BLUFF benchmark with 202K+ samples across 79 languages, combining human-written fact-checked content and LLM-generated content. Introduced AXL-CoI framework for controlled fake/real news generation and mPURIFY quality filtering pipeline.
Result: Experiments show state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource vs high-resource languages, highlighting the need for multilingual benchmarks.
Conclusion: BLUFF provides a comprehensive multilingual benchmark with tools to advance equitable falsehood detection across diverse linguistic communities.
Abstract: Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource “big-head” (20) and low-resource “long-tail” (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English$\leftrightarrow$X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chainof-Interactions), a novel multi-agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal state-of-theart detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic-oriented benchmark evaluation, comprehensive documentation, and opensource tools to advance equitable falsehood detection. Dataset and code are available at: https://jsl5710.github.io/BLUFF/
[30] SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs
Chaoyue He, Xin Zhou, Xinjia Yu, Lei Zhang, Yan Zhang, Yi Wu, Lei Xiao, Liangyue Li, Di Wang, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
Main category: cs.CL
TL;DR: SSKG Hub transforms complex sustainability disclosure standards into auditable knowledge graphs using LLM-centered pipeline with expert-guided curation and governance framework.
Details
Motivation: Sustainability disclosure standards (GRI, SASB, TCFD, IFRS S2) are comprehensive but lengthy, terminology-dense, and highly cross-referential, making structured analysis and downstream use difficult.Method: LLM-centered, expert-guided pipeline with automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage. Includes role-based governance framework for draft KG curation and certified KG promotion.
Result: Created SSKG Hub interactive web platform that transforms standards into auditable knowledge graphs with fine-grained audit metadata. Validated through comprehensive expert-led KG review case study demonstrating end-to-end curation and quality assurance.
Conclusion: SSKG Hub enables structured analysis of sustainability standards through knowledge graph transformation with traceability and accountability, supporting cross-KG fusion and KG-driven tasks.
Abstract: Sustainability disclosure standards (e.g., GRI, SASB, TCFD, IFRS S2) are comprehensive yet lengthy, terminology-dense, and highly cross-referential, hindering structured analysis and downstream use. We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline. The system integrates automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage with fine-grained audit metadata. LLM extraction produces a provenance-linked Draft KG, which is reviewed, curated, and formally promoted to a Certified KG through meta-expert adjudication. A role-based governance framework covering read-only guest access, expert review and CRUD operations, meta-expert certification, and administrative oversight ensures traceability and accountability across draft and certified states. Beyond graph exploration and triple-level evidence tracing, SSKG Hub supports cross-KG fusion, KG-driven tasks, and dedicated modules for insights and curated resources. We validate the platform through a comprehensive expert-led KG review case study that demonstrates end-to-end curation and quality assurance. The web application is publicly available at www.sskg-hub.com.
[31] Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen
Main category: cs.CL
TL;DR: PoM (Polynomial Mixer) replaces self-attention in speech-to-text models with linear-complexity polynomial mixing, maintaining competitive accuracy while improving efficiency.
Details
Motivation: Self-attention in Transformer-based speech encoders has quadratic complexity, limiting scalability in memory and computation for long sequences.Method: Proposes Polynomial Mixer (PoM) as drop-in replacement for multi-head self-attention, computing polynomial representation of input with linear complexity relative to sequence length.
Result: PoM achieves competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering better performance-efficiency trade-off.
Conclusion: PoM provides efficient alternative to self-attention for speech recognition models, enabling better scalability while maintaining accuracy.
Abstract: State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.
[32] RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis
Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Bosi Wen, Yidong Wang, Lin Fan, Yilin Zhou, Zikang Wang, Wenbo Yu, Lindong Wu, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: RAVEL is an agentic framework for evaluating LLMs’ text synthesis capabilities through autonomous planning and execution of operations like outlining, drafting, and editing, tested on C3EBench benchmark.
Details
Motivation: Current evaluation frameworks fail to assess actual synthesis operations (outlining, drafting, editing) and detailed capabilities of LLMs in long-horizon text generation tasks.Method: Introduces RAVEL framework enabling LLM testers to autonomously plan and execute synthesis operations, complemented by C3EBench benchmark with 1,258 samples derived from professional human writings using reverse-engineering pipeline across four tasks: Cloze, Edit, Expand, and End-to-End.
Result: Analysis of 14 LLMs shows most struggle with tasks requiring contextual understanding under limited instructions. Agentic text synthesis is dominated by reasoning capability rather than raw generative capacity. Strong reasoners can guide weaker generators to better results, but not vice versa.
Conclusion: RAVEL provides a comprehensive framework for evaluating LLM synthesis capabilities, revealing the critical importance of reasoning over raw generation in agentic text synthesis tasks.
Abstract: Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a “reverse-engineering” pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM’s reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: https://github.com/ZhuoerFeng/RAVEL-Reasoning-Agents-Text-Eval.
[33] GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Arsham Gholamzadeh Khoee, Shuai Wang, Yinan Yu, Robert Feldt, Dhasarathy Parthasarathy
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2503.21735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.21735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[34] DRIV-EX: Counterfactual Explanations for Driving LLMs
Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Eric Gaussier
Main category: cs.CL
TL;DR: DRIV-EX: A method for generating counterfactual explanations to interpret LLM decision-making in autonomous driving by optimizing embeddings to flip decisions while maintaining linguistic fluency.
Details
Motivation: LLMs are increasingly used as reasoning engines in autonomous driving, but their decision-making processes remain opaque. There's a need for interpretability methods that can explain why LLM-based driving agents make specific decisions and expose potential biases.Method: DRIV-EX uses gradient-based optimization on continuous embeddings to identify minimal semantic changes needed to flip LLM decisions. Instead of directly using optimized embeddings (which can produce incoherent text), it uses them as semantic guides to bias a controlled decoding process that regenerates the original scene description, ensuring linguistic fluency and domain validity.
Result: Evaluated on the LC-LLM planner with textual transcriptions of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights for improving LLM-based driving agent robustness.
Conclusion: DRIV-EX provides an effective method for interpreting LLM decision-making in autonomous driving through counterfactual explanations, offering both interpretability and insights for improving model robustness.
Abstract: Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model’s decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they are used to bias a controlled decoding process that re-generates the original scene description. This approach effectively steers the generation toward the counterfactual target while guaranteeing the linguistic fluency, domain validity, and proximity to the original input, essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.
[35] SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh
Main category: cs.CL
TL;DR: SkillCraft benchmark tests agents’ ability to form and reuse higher-level tool compositions (Skills) in realistic, compositional tool-use scenarios with scalable difficulty.
Details
Motivation: Real-world tool-using agents need to abstract and reuse higher-level tool compositions for long-horizon workflows, but existing benchmarks only measure instance-level success with static tool sets, lacking insight into agents' ability to acquire reusable skills.Method: Introduces SkillCraft benchmark with realistic, highly compositional tool-use scenarios scaled along quantitative and structural dimensions. Proposes lightweight evaluation protocol allowing agents to auto-compose atomic tools into executable Skills, cache and reuse them across tasks, building a persistent library of reusable skills.
Result: Evaluation shows substantial efficiency gains with token usage reduced by up to 80% through skill saving and reuse. Success rate strongly correlates with tool composition ability at test time, indicating compositional skill acquisition as a core capability.
Conclusion: SkillCraft addresses a critical gap in evaluating agents’ ability to form and reuse higher-level tool compositions, demonstrating that compositional skill acquisition is essential for efficient tool-use in complex, long-horizon workflows.
Abstract: Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents’ ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark explicitly stress-test agent ability to form and reuse higher-level tool compositions, where we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills, cache and reuse them inside and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% by skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
[36] RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: RLAR is a reinforcement learning framework that dynamically generates tailored reward functions for individual queries using LLM agents, improving generalization over static reward models.
Details
Motivation: Static reward models for LLM alignment are costly to train and generalize poorly to out-of-distribution scenarios encountered during RL training, limiting their effectiveness.Method: Transforms reward acquisition into dynamic tool synthesis: uses LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation, allowing the reward system to self-evolve with shifting data distributions.
Result: Achieves consistent performance gains of 10-60% across mathematics, coding, translation, and dialogue tasks; significantly outperforms static baselines on RewardBench-V2 and approaches performance upper bound.
Conclusion: RLAR demonstrates superior generalization through dynamic reward orchestration, offering a more flexible approach to LLM alignment that adapts to evolving training distributions.
Abstract: Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: https://github.com/ZhuoerFeng/RLAR.
[37] LightMem: Lightweight and Efficient Memory-Augmented Generation
Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang
Main category: cs.CL
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Unable to determine paper motivation due to API request failureMethod: Unable to determine paper method due to API request failure
Result: Unable to determine paper results due to API request failure
Conclusion: Unable to determine paper conclusion due to API request failure
Abstract: Failed to fetch summary for 2510.18866: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18866&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[38] LaSTR: Language-Driven Time-Series Segment Retrieval
Kota Dohi, Harsh Purohit, Tomoya Nishida, Takashi Endo, Yusuke Ohtsubo, Koichiro Yawata, Koki Takeshita, Tatsuya Sasaki, Yohei Kawaguchi
Main category: cs.CL
TL;DR: LaSTR: Language-driven time-series segment retrieval using natural language queries to find relevant local segments in large time-series repositories.
Details
Motivation: Existing time-series search methods require expert-designed similarity criteria or rely on global, series-level descriptions, lacking natural language query capabilities for local segment retrieval.Method: Build large-scale segment-caption training data using TV2-based segmentation on LOTSA windows, generate segment descriptions with GPT-5.2, train Conformer-based contrastive retriever in shared text-time-series embedding space.
Result: LaSTR outperforms random and CLIP baselines across all settings, showing improved ranking quality and stronger semantic agreement between retrieved segments and query intent.
Conclusion: Language-driven segment retrieval enables effective natural language querying of time-series data, overcoming limitations of traditional similarity-based methods.
Abstract: Effectively searching time-series data is essential for system analysis, but existing methods often require expert-designed similarity criteria or rely on global, series-level descriptions. We study language-driven segment retrieval: given a natural language query, the goal is to retrieve relevant local segments from large time-series repositories. We build large-scale segment–caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, and then train a Conformer-based contrastive retriever in a shared text–time-series embedding space. On a held-out test split, we evaluate single-positive retrieval together with caption-side consistency (SBERT and VLM-as-a-judge) under multiple candidate pool sizes. Across all settings, LaSTR outperforms random and CLIP baselines, yielding improved ranking quality and stronger semantic agreement between retrieved segments and query intent.
[39] Qwen3-Coder-Next Technical Report
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, Fan Zhou
Main category: cs.CL
TL;DR: Qwen3-Coder-Next is an 80B parameter coding agent model with only 3B active parameters during inference, trained using verifiable coding tasks with executable environments and environment feedback learning.
Details
Motivation: The paper aims to explore how far strong training recipes can push the capability limits of models with small parameter footprints, specifically for coding agents that need efficient inference while maintaining strong coding capabilities.Method: Uses agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, enabling learning directly from environment feedback via mid-training and reinforcement learning. The model has 80B total parameters but only activates 3B during inference.
Result: Achieves competitive performance on agent-centric benchmarks including SWE-Bench and Terminal-Bench relative to its active parameter count of 3B parameters.
Conclusion: Demonstrates that strong training recipes can enable models with small active parameter footprints to achieve competitive coding agent performance, and releases open-weight versions to support research and real-world coding agent development.
Abstract: We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.
[40] A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction
Ruihao Pan, Suhang Wang
Main category: cs.CL
TL;DR: Machine unlearning in LLMs shows limited robustness in interactive settings - knowledge that appears forgotten in static evaluation can often be recovered through dialogue interactions like self-correction and querying.
Details
Motivation: Machine unlearning is important for LLMs due to safety, privacy, and legal concerns, but prior work only evaluates in static, single-turn settings, leaving forgetting robustness under realistic interactive use underexplored.Method: Study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying, comparing static evaluation with interactive scenarios.
Result: Knowledge appearing forgotten in static evaluation can often be recovered through interaction. Stronger unlearning often results in behavioral rigidity rather than genuine knowledge erasure.
Conclusion: Static evaluation may overestimate real-world effectiveness of machine unlearning, highlighting the need for ensuring stable forgetting under interactive settings.
Abstract: Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. Although prior work primarily evaluates unlearning in static, single-turn settings, forgetting robustness under realistic interactive use remains underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge appearing forgotten in static evaluation can often be recovered through interaction. Although stronger unlearning improves apparent robustness, it often results in behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need for ensuring stable forgetting under interactive settings.
[41] InnoGym: Benchmarking the Innovation Potential of AI Agents
Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2512.01822: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01822&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[42] Constitutional Black-Box Monitoring for Scheming in LLM Agents
Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn
Main category: cs.CL
TL;DR: LLM-based monitors trained on synthetic data can detect scheming agents in realistic environments, but performance quickly saturates and overfits with extensive optimization.
Details
Motivation: Need reliable oversight mechanisms for safe deployment of LLM agents in autonomous settings, particularly to detect scheming where agents covertly pursue misaligned goals.Method: Develop constitutional black-box monitors using LLM-based prompted classifiers. Generate synthetic agent trajectories using STRIDE (iterative refinement) and Gloom (agent-environment simulation) pipelines. Optimize frontier LLM monitors via prompt sweeps, human refinement, and automated prompt optimization on synthetic data.
Result: Monitors trained purely on synthetic data generalize to realistic environments (ControlArena), capturing meaningful scheming signals. However, performance saturates quickly - simple prompt sweeps match extensive optimization results. Further optimization leads to overfitting without improvements.
Conclusion: LLM-based monitoring shows promise for detecting scheming agents, but current approaches have limitations with performance saturation and overfitting, suggesting need for more sophisticated methods beyond prompt optimization.
Abstract: Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
[43] Learning Nested Named Entity Recognition from Flat Annotations
Igor Rozhkov, Natalia Loukachevitch
Main category: cs.CL
TL;DR: Models can learn nested named entity recognition from flat annotations alone using various approaches including string inclusions, entity corruption, flat neutralization, and hybrid fine-tuned + LLM pipeline, achieving 26.37% inner F1 on Russian NEREL benchmark.
Details
Motivation: Nested named entity recognition requires expensive multi-level annotation while flat NER corpora are abundant but nested resources remain scarce. The paper investigates whether models can learn nested structure from flat annotations alone.Method: Four approaches: 1) String inclusions (substring matching), 2) Entity corruption (pseudo-nested data), 3) Flat neutralization (reducing false negative signal), and 4) Hybrid fine-tuned + LLM pipeline.
Result: On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, the best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision.
Conclusion: Models can learn nested structure from flat annotations alone, with the hybrid approach showing promising results, though there remains significant room for improvement compared to fully supervised methods.
Abstract: Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning-from-Flat-Annotations.
[44] MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine
Kai Zhang, Zhengqing Yuan, Cheng Peng, Songlin Zhao, Mengxian Lyu, Ziyi Chen, Yanfang Ye, Wei Liu, Ying Zhang, Kaleb E Smith, Lifang He, Lichao Sun, Yonghui Wu
Main category: cs.CL
TL;DR: MEDGPT-OSS is a 20B-parameter open-weight multimodal vision-language model for clinical AI that unifies radiology, pathology, and clinical-text reasoning while being computationally efficient for on-premises deployment.
Details
Motivation: Address the deployment gap in biomedical multimodal assistants where top-performing systems are either closed-source or computationally prohibitive, preventing on-premises deployment needed for patient privacy and PHI compliance in clinical settings.Method: Pairs GPT-oss language backbone with visual front-end using optimized three-stage training curriculum: progressive domain adaptation through rigorous data curation and long-context multimodal alignment, maintaining parameter efficiency for commodity GPUs.
Result: Outperforms larger open medical models on out-of-distribution multimodal reasoning and complex text-only clinical tasks, demonstrating that a 20B model can bridge the capacity gap for clinical applications.
Conclusion: Provides an open-weight, computationally efficient foundation for privacy-preserving, institution-specific clinical AI research with complete training recipe, checkpoints, and evaluation harness released.
Abstract: Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via a optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap. It successfully outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
[45] MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.14652: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14652&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[46] CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
Main category: cs.CL
TL;DR: CHIMERA: A compact synthetic reasoning dataset addressing data challenges for LLM reasoning capabilities through automated generation of diverse scientific reasoning problems with detailed Chain-of-Thought trajectories.
Details
Motivation: Three fundamental data-centric challenges hinder reproducing and extending LLM reasoning capabilities: (1) cold-start problem due to lack of seed datasets with detailed CoT trajectories, (2) limited domain coverage concentrated in mathematics, and (3) annotation bottleneck for frontier-level reasoning tasks.Method: Created CHIMERA dataset with 9K samples featuring: (1) rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models, (2) broad coverage across 8 major scientific disciplines and 1K+ fine-grained topics organized via hierarchical taxonomy, (3) fully automated evaluation pipeline using strong reasoning models to cross-validate problem validity and answer correctness.
Result: Post-training a 4B Qwen3 model with CHIMERA achieved strong performance on challenging reasoning benchmarks (GPQA-Diamond, AIME 24/25/26, HMMT 25, Humanity’s Last Exam), approaching or matching performance of substantially larger models like DeepSeek-R1 and Qwen3-235B.
Conclusion: CHIMERA demonstrates that compact, high-quality synthetic reasoning datasets can effectively address data bottlenecks and enable strong reasoning capabilities in smaller models, providing a scalable solution for cross-domain scientific reasoning.
Abstract: Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset’s modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity’s Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
[47] KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
Lianjun Liu, Hongli An, Weiqi Yan, Xin Du, Shengchuan Zhang, Huazhong Liu, Yunshan Zhong
Main category: cs.CL
TL;DR: KVSlimmer: A theoretically-grounded KV cache compression method that uses spectral analysis of projection weights and exact Hessian information to reduce memory and latency in LLMs while maintaining performance.
Details
Motivation: The KV cache in LLMs creates significant computational and memory bottlenecks. Existing KV merging methods lack theoretical foundations and have suboptimal compression with inference overhead.Method: Establishes theoretical framework analyzing KV asymmetry through spectral energy distribution of projection weights. Introduces KVSlimmer algorithm that captures exact Hessian information through mathematically exact formulation and derives closed-form solution using only forward-pass variables.
Result: Outperforms SOTA methods across various models and benchmarks. On Llama3.1-8B-Instruct, improves LongBench average score by 0.92 while reducing memory costs by 29% and latency by 28%.
Conclusion: KVSlimmer provides a theoretically-sound, efficient solution for KV cache compression that significantly reduces memory and computational overhead while maintaining or improving model performance.
Abstract: The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
[48] Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Shravani Hariprasad
Main category: cs.CL
TL;DR: Evaluation of small open-source LLMs for clinical QA shows consistency ≠ accuracy, with roleplay prompts harming performance and domain pretraining alone being insufficient for reliable clinical AI.
Details
Motivation: Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood, creating potential safety risks in clinical applications.Method: Evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, Meditron-7B) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct), measuring consistency scores, accuracy, and instruction-following failure rates on consumer CPU hardware without fine-tuning.
Result: Consistency and accuracy were largely independent; Gemma 2 had highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate).
Conclusion: High consistency does not imply correctness - models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
Abstract: Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
[49] Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Siyu Liang, Talant Mawkanuli, Gina-Anne Levow
Main category: cs.CL
TL;DR: Hybrid pipeline combining neural sequence labeling (BiLSTM-CRF) with LLM post-correction for automatic glossing of low-resource morphologically rich languages, evaluated on Jungar Tuvan Turkic language.
Details
Motivation: Interlinear glossed text (IGT) creation is a major bottleneck in linguistic documentation, especially for low-resource morphologically rich languages, requiring automated solutions to reduce annotation workload.Method: Two-stage hybrid pipeline: 1) Neural sequence labeling using BiLSTM-CRF model, 2) LLM post-correction with retrieval-augmented prompting. Systematic ablation studies examine effects of morpheme dictionaries, few-shot examples, and different prompting strategies.
Result: Retrieval-augmented prompting provides substantial gains over random example selection; morpheme dictionaries paradoxically hurt performance; performance scales logarithmically with few-shot examples; hybrid pipeline yields substantial gains and meaningful workload reduction.
Conclusion: Hybrid architectures combining structured prediction models with LLM reasoning offer promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation, with established design principles for morphologically complex fieldwork contexts.
Abstract: Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
[50] Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
Manil Shrestha, Edward Kim
Main category: cs.CL
TL;DR: Conformal prediction framework provides finite-sample coverage guarantees for LLM-based medical entity extraction across clinical domains, showing calibration depends on document structure and model architecture.
Details
Motivation: LLMs are increasingly used for medical entity extraction but their confidence scores are often miscalibrated, limiting safe deployment in clinical settings where reliability is critical.Method: Present a conformal prediction framework with two applications: 1) Extract structured entities from 1,000 FDA drug labels using GPT-4.1 with FactScore evaluation; 2) Extract radiological entities from MIMIC-CXR reports using GPT-4.1 and Llama-4-Maverick with RadGraph schema, evaluated against physician annotations.
Result: Models show reversed miscalibration direction across domains: underconfident on well-structured FDA labels (τ≈0.06) vs overconfident on free-text radiology reports (τ up to 0.99). Conformal prediction achieves ≥90% target coverage in both settings with manageable 9-13% rejection rates.
Conclusion: Calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment of LLMs.
Abstract: Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($τ\approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($τ$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90%$) in both settings with manageable rejection rates (9–13%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
[51] The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Main category: cs.CL
TL;DR: VLMs struggle with assessing student errors in handwritten math responses, limiting their effectiveness for educational applications despite strong math problem-solving capabilities.
Details
Motivation: To evaluate how well vision-language models can support mathematics education by analyzing their performance on real students' handwritten responses, particularly focusing on their ability to identify and respond to student errors.Method: Conducted a year-long evaluation of 11 vision-language models using DrawEduMath benchmark, which contains real students’ handwritten responses to math problems, analyzing performance across different student proficiency levels and error assessment tasks.
Result: All VLMs underperformed when describing work from students needing more pedagogical help and struggled most on questions related to assessing student errors, despite being optimized for math problem solving.
Conclusion: VLMs require alternative development incentives beyond math problem-solving optimization to adequately support educational use cases, particularly for error assessment and supporting struggling students.
Abstract: Effective mathematics education requires identifying and responding to students’ mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students’ handwritten, hand-drawn responses to math problems. We find that models’ weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
[52] S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature
Abigail Berthe-Pardo, Gaspard Michel, Elena V. Epure, Christophe Cerisara
Main category: cs.CL
TL;DR: S-VoCAL is a dataset and evaluation framework for inferring voice-related character attributes from literature to improve synthetic audiobook narration.
Details
Motivation: Current synthetic audiobook narration lacks ability to impersonate fictional characters and convey complex emotions/prosody. Assigning plausible voices to characters requires inferring attributes from book-length contexts, but there are no dedicated benchmark datasets for evaluating such extraction systems.Method: Created S-VoCAL dataset with 8 sociophonetic attributes and 952 character-book pairs from Project Gutenberg. Developed evaluation framework with attribute-specific metrics and novel LLM-based similarity metric. Demonstrated applicability using Retrieval-Augmented Generation (RAG) pipeline for attribute inference.
Result: RAG pipeline reliably infers attributes like Age and Gender, but struggles with Origin and Physical Health. The dataset provides first benchmark for evaluating voice-related character attribute extraction systems.
Conclusion: S-VoCAL enables systematic evaluation of character voice attribute inference, which is crucial for improving synthetic audiobook narration quality and character impersonation capabilities.
Abstract: With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems’ ability to impersonate fictional characters, and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional characters in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character’s age, gender, origin or physical health, which in turns requires dedicated benchmark datasets to evaluate extraction systems’ performances. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluate the inference of voice-related fictional character attributes. S-VoCAL entails 8 attributes grounded in sociophonetic studies, and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Models embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at https://github.com/AbigailBerthe/S-VoCAL .
[53] Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays
Hoor Elbahnasawi, Marwan Sayed, Sohaila Eltanbouly, Fatima Brahamia, Tamer Elsayed
Main category: cs.CL
TL;DR: Qayyem is a web-based platform for Arabic Automated Essay Scoring (AES) that provides an integrated workflow for assignment creation, essay upload, scoring configuration, and evaluation, abstracting technical complexity from scoring APIs.
Details
Motivation: Arabic AES systems are limited due to linguistic complexity and scarcity of large annotated datasets, creating a need for accessible tools that can support Arabic essay scoring for educational purposes.Method: Developed a web-based platform that integrates multiple state-of-the-art Arabic essay scoring models, providing a user-friendly interface for instructors to create assignments, upload essays, configure scoring, and evaluate essays per trait without dealing with technical API complexities.
Result: Qayyem successfully abstracts technical complexity and provides access to advanced Arabic AES services through an integrated platform with various scoring models offering different effectiveness and efficiency trade-offs.
Conclusion: Qayyem addresses the gap in Arabic AES tools by providing an accessible platform that makes advanced scoring models available to instructors through a simplified workflow, potentially improving Arabic language assessment scalability.
Abstract: Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.
[54] Thoth: Mid-Training Bridges LLMs to Time Series Understanding
Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Main category: cs.CL
TL;DR: Thoth is a family of mid-trained LLMs with general-purpose time series understanding capabilities, using a novel mid-training approach with Book-of-Thoth corpus for time-series-to-text and text-to-time-series generation.
Details
Motivation: LLMs struggle with time series data understanding and reasoning, limiting their effectiveness in decision-making scenarios that depend on temporal dynamics. There's a need for models that can bridge the gap between time series data and natural language reasoning.Method: Proposes mid-training as an intermediate stage between pre-training and fine-tuning, using Book-of-Thoth corpus for task- and domain-agnostic alignment between time series and natural language. Introduces KnoTS benchmark for evaluating knowledge-intensive time series understanding.
Result: Thoth significantly outperforms base models and advanced LLMs across time series QA benchmarks, and shows superior capabilities when fine-tuned under data scarcity conditions.
Conclusion: Mid-training with Book-of-Thoth enables effective time series understanding in LLMs, bridging the gap between temporal patterns and natural language reasoning, with promising applications in decision-making scenarios.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
[55] GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Shaohui Lin
Main category: cs.CL
TL;DR: GroupGPT: A token-efficient, privacy-preserving framework for multi-user chat assistants using small-large model collaboration to decouple intervention timing from response generation, supporting multimodal inputs.
Details
Motivation: Existing LLM-based chatbots focus on single-user settings and don't generalize well to multi-user group chats, where proactive intervention is needed. Current approaches use LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and privacy risks.Method: Proposes GroupGPT with small-large model collaborative architecture to decouple intervention timing from response generation. Uses smaller models for timing decisions and larger models for response generation. Supports multimodal inputs (memes, images, videos, voice messages). Introduces MUIR benchmark dataset with 2,500 annotated group chat segments.
Result: GroupGPT achieves average score of 4.72/5.0 in LLM-based evaluation, reduces token usage by up to 3x compared to baselines, provides privacy sanitization, and is well-received across diverse group chat scenarios.
Conclusion: GroupGPT effectively addresses challenges in multi-user chat assistants through efficient architecture design, multimodal support, and privacy preservation, demonstrating superior performance and efficiency.
Abstract: Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistant. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: https://github.com/Eliot-Shen/GroupGPT .
[56] How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, Zheng Sun, Honghao He, Yuchen Wu, Bihui Yu, Linzhuang Sun, Cheng Tan, Jingxuan Wei
Main category: cs.CL
TL;DR: Faire is a reinforcement learning framework that addresses the failure of supervised fine-tuning for interleaved geometric reasoning, where models learn surface plotting formats but not the causal dependencies between plots and reasoning steps.
Details
Motivation: The paper identifies a counter-intuitive phenomenon where supervised fine-tuning on interleaved plot-solution data actually degrades reasoning performance compared to text-only baselines, despite MLLMs having strong visual generation capabilities. This happens because SFT only teaches surface format imitation without internalizing the causal dependencies between generated plots and reasoning steps.Method: Proposes Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. The framework ensures the model learns the actual functional relationships between plotting and reasoning rather than just mimicking the format.
Result: Extensive experiments show Faire induces a qualitative shift in model behavior where plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks compared to SFT approaches.
Conclusion: The paper demonstrates that functional alignment through reinforcement learning is necessary for effective interleaved reasoning in multimodal models, overcoming the limitations of supervised fine-tuning that only achieves distributional alignment without understanding causal dependencies.
Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three casual constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
[57] CARD: Towards Conditional Design of Multi-agent Topological Structures
Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari
Main category: cs.CL
TL;DR: CARD is a conditional graph-generation framework for adaptive multi-agent communication that dynamically adjusts communication topologies based on environmental signals to improve effectiveness and robustness.
Details
Motivation: Current LLM-based multi-agent systems use fixed or statically learned communication topologies, ignoring real-world dynamics like model upgrades, API changes, or knowledge source variability, which limits their effectiveness and robustness.Method: Proposes CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework implementing AMACP protocol for adaptive multi-agent communication. Uses conditional variational graph encoder and environment-aware optimization to dynamically construct communication structures based on environmental signals.
Result: Empirical results on HumanEval, MATH, and MMLU show CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions.
Conclusion: CARD enables adaptive communication topologies that are both effective and resilient to shifts in model capability or resource availability, addressing limitations of fixed-topology multi-agent systems.
Abstract: Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.
[58] DEP: A Decentralized Large Language Model Evaluation Protocol
Jianxiang Peng, Junhao Li, Hongxiang Wang, Haocheng Lyu, Hui Guo, Siyi Hao, Zhen Wang, Chuang Liu, Shaowei Zhang, Bojian Xiong, Yue Chen, Zhuowen Han, Ling Shi, Tianyu Dong, Juesi Xiao, Lei Yang, Yuqi Ren, Deyi Xiong
Main category: cs.CL
TL;DR: DEP is a decentralized evaluation protocol for LLMs that addresses benchmark inconsistency and leakage risks through modular design and data isolation.
Details
Motivation: Current LLM benchmarks lack unified evaluation standards, require manual script implementation, and suffer from centralized frameworks that risk benchmark leakage, making results inconsistent and hard to reproduce.Method: Proposes Decentralized Evaluation Protocol (DEP) with a matching server that decouples users, LLMs, and benchmarks. Benchmark files and evaluation logic stay on server side, enabling modular plug-and-play evaluation with data isolation. Includes DEP Toolkit with features like breakpoint resume, concurrent requests, and congestion control.
Result: Experimental results verify DEP’s effectiveness in reducing benchmark evaluation deployment costs. Over 60 benchmarks adapted by February 2026, with continued community co-construction for unified evaluation across tasks and domains.
Conclusion: DEP provides a decentralized, standardized evaluation framework that addresses consistency, reproducibility, and leakage issues in LLM benchmarking, enabling more reliable and secure model evaluation.
Abstract: With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack unified evaluation standard and require the manual implementation of custom scripts, making results hard to ensure consistency and reproducibility. Furthermore, mainstream evaluation frameworks are centralized, with datasets and answers, which increases the risk of benchmark leakage. To address these issues, we propose a Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework through a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server side. In remote setting, users cannot access the ground truth, thereby achieving data isolation and leak-proof evaluation. To facilitate practical adoption, we develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control. We also provide detailed documentation for adapting new benchmarks to DEP. Using DEP toolkit, we evaluate multiple LLMs across benchmarks. Experimental results verify the effectiveness of DEP and show that it reduces the cost of deploying benchmark evaluations. As of February 2026, we have adapted over 60 benchmarks and continue to promote community co-construction to support unified evaluation across various tasks and domains.
[59] Token-level Data Selection for Safe LLM Fine-tuning
Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Main category: cs.CL
TL;DR: TOSS is a token-level data selection framework that identifies and removes unsafe tokens during LLM fine-tuning to prevent safety degradation while preserving task-specific utility.
Details
Motivation: Fine-tuning LLMs on custom datasets often leads to safety degradation, and existing sample-level defense methods struggle to balance safety and utility effectively.Method: Proposes token-level safety risk quantification by measuring loss differences between safety-degraded and utility-oriented models, with TOSS-Pro adding progressive refinement through iterative enhancement.
Result: Extensive experiments show TOSS robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance compared to sample-level defense methods.
Conclusion: Token-level granularity enables accurate identification of unsafe tokens, preserving valuable task-specific information and achieving better safety-utility trade-off than sample-level approaches.
Abstract: Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model’s safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model’s ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
[60] Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification
Jacob Devasier
Main category: cs.CL
TL;DR: MDLMs converge on verdicts early in diffusion, treating them as anchors resolved before justification completion. Forcing reasoning-first degrades performance as noisy justification tokens override correct initial predictions.
Details
Motivation: To understand how Masked Diffusion Language Models handle tasks requiring justified verdicts, specifically whether justifications serve as genuine reasoning or post-hoc rationalization in fact verification tasks.Method: Investigates MDLM reasoning dynamics on fact verification, using interventional experiments including delayed verdict unmasking, forced verdict rationalization analysis, and justification quality manipulation to study causal dependencies.
Result: MDLMs converge on verdicts early (86.2% accuracy), but forcing reasoning-first drops accuracy to 71.9%. Models rationalize incorrect forced verdicts in 56% of cases, and verdicts strongly depend on justification quality (57.3% accuracy with corrupted vs. 97.1% with ground-truth).
Conclusion: For fact verification with MDLMs, extended deliberation can be counterproductive as noisy justification tokens override correct early predictions, suggesting verdict-first rather than reasoning-first approaches are more effective.
Abstract: Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.
[61] XAI-enhanced Comparative Opinion Mining via Aspect-based Scoring and Semantic Reasoning
Ngoc-Quang Le, T. Thanh-Lam Nguyen, Quoc-Trung Phu, Thi-Phuong Le, Duy-Cat Can, Hoang-Quynh Le
Main category: cs.CL
TL;DR: XCom is an interpretable transformer model for comparative opinion mining that combines aspect-based rating prediction with semantic analysis and uses SHAP explanations for transparency.
Details
Motivation: Transformer models for comparative opinion mining lack transparency, which hinders user trust. The authors aim to create an interpretable model that provides meaningful explanations for its decisions.Method: XCom uses a two-module architecture: (1) aspect-based rating prediction and (2) semantic analysis for comparative opinion mining. It incorporates Shapley Additive Explanations (SHAP) to provide interpretable insights into model decisions.
Result: XCom achieves leading performance compared to other baselines, demonstrating effectiveness in providing meaningful explanations while maintaining strong comparative opinion mining capabilities.
Conclusion: XCom provides a more reliable tool for comparative opinion mining by combining strong performance with interpretability through SHAP explanations.
Abstract: Comparative opinion mining involves comparing products from different reviews. However, transformer-based models designed for this task often lack transparency, which can adversely hinder the development of trust in users. In this paper, we propose XCom, an enhanced transformer-based model separated into two principal modules, i.e., (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining. XCom also incorporates a Shapley additive explanations module to provide interpretable insights into the model’s deliberative decisions. Empirically, XCom achieves leading performances compared to other baselines, which demonstrates its effectiveness in providing meaningful explanations, making it a more reliable tool for comparative opinion mining. Source code is available at: https://anonymous.4open.science/r/XCom.
[62] Reasoning Boosts Opinion Alignment in LLMs
Frédéric Berdoz, Yann Billeter, Yann Vonlanthen, Roger Wattenhofer
Main category: cs.CL
TL;DR: LLMs can model political opinions but produce biased outputs; reasoning via RL improves alignment but doesn’t fully eliminate bias, establishing a baseline for political digital twins.
Details
Motivation: LLMs are promising for opinion modeling in digital democracies but produce biased opinions due to statistical nature and limited causal understanding. The paper explores whether reasoning can improve opinion alignment, motivated by RL's success in mathematical reasoning.Method: Train models using reinforcement learning to produce profile-consistent answers through structured reasoning. Evaluate on three political datasets covering U.S., European, and Swiss politics.
Result: Reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias. The approach establishes a solid baseline for future research on LLM opinion alignment.
Conclusion: While reasoning improves opinion alignment, additional mechanisms are needed to build faithful political digital twins using LLMs. The method and datasets are released to support future research.
Abstract: Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
[63] Generative AI & Fictionality: How Novels Power Large Language Models
Edwin Roland, Richard Jean So
Main category: cs.CL
TL;DR: This paper examines how fiction training data shapes generative AI models, specifically analyzing BERT’s use of novel language patterns compared to other text sources like newspapers and Wikipedia.
Details
Motivation: The motivation is to investigate the unexamined belief that fiction provides rich language training for AI models, and to understand how different text sources (novels vs. newspapers, Reddit, Wikipedia) shape model outputs and social responses.Method: The study analyzes the influential open-source BERT model to examine how it leverages fiction’s attributes and affordances, comparing its use of fictional language patterns with other text sources.
Result: The research finds that LLMs do leverage familiar attributes of fiction while also creating new qualities and forms of social response, suggesting fiction training data has distinct effects on model behavior.
Conclusion: Since contemporary culture is increasingly shaped by generative AI, analysis of cultural production must account for computational training data, particularly the influence of fiction on model outputs.
Abstract: Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels’ effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today’s various modes of cultural production must account for a relatively novel dimension: computational training data.
[64] Can Thinking Models Think to Detect Hateful Memes?
Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam
Main category: cs.CL
TL;DR: A reinforcement learning framework with Group Relative Policy Optimization improves multimodal reasoning in thinking-based MLLMs for hateful meme analysis, achieving state-of-the-art performance on the Hateful Memes benchmark.
Details
Motivation: Hateful memes require compositional multimodal reasoning where image and text interactions convey harmful intent, but current thinking-based MLLMs have underexplored capabilities for hateful meme analysis.Method: Proposes a reinforcement learning post-training framework with task-specific rewards and novel Group Relative Policy Optimization (GRPO) objective; includes systematic study of off-the-shelf MLLMs, dataset extension with generated chain-of-thought rationales, and joint optimization of classification and explanation quality.
Result: Achieves state-of-the-art performance on Hateful Memes benchmark with ~1% improvement in accuracy and F1, and ~3% improvement in explanation quality.
Conclusion: The GRPO-based framework effectively enhances multimodal reasoning in MLLMs for hateful meme understanding, with publicly released code, dataset extensions, and evaluation resources.
Abstract: Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
[65] Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence
Harshavardhan
Main category: cs.CL
TL;DR: SACD is a hypothesized tendency for LLMs to show systematic confidence changes when building on their own prior outputs in multi-turn conversations, with empirical evidence showing model-heterogeneous patterns across Claude, Gemini, and GPT models.
Details
Motivation: To investigate whether LLMs exhibit systematic calibration drift when they iteratively build on their own prior outputs in multi-turn conversations, and to understand how this self-anchoring affects confidence calibration across different model architectures.Method: Empirical study comparing three frontier models (Claude Sonnet 4.6, Gemini 3.1 Pro, GPT-5.2) across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline, multi-turn self-anchoring, and independent repetition control.
Result: Complex model-heterogeneous patterns: Claude showed significant decreasing confidence under self-anchoring, GPT showed opposite pattern in open-ended domains with ECE escalation, Gemini showed no significant confidence drift but revealed that self-anchoring suppresses natural calibration improvement.
Conclusion: Self-anchoring calibration drift exists and varies significantly across models, affecting confidence calibration in different ways, with implications for reliable multi-turn interactions with LLMs.
Abstract: We introduce Self-Anchoring Calibration Drift (SACD), a hypothesized tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. We report an empirical study comparing three frontier models – Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 – across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline (A), multi-turn self-anchoring (B), and independent repetition control (C). Results reveal a complex, model-heterogeneous pattern that partially diverges from pre-registered hypotheses. Claude Sonnet 4.6 exhibited significant decreasing confidence under self-anchoring (mean CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627), while also showing significant calibration error drift (F(4,56) = 22.77, p < .001, eta^2 = .791). GPT-5.2 showed the opposite pattern in open-ended domains (mean CDS = +0.026) with significant ECE escalation by Turn 5. Gemini 3.1 Pro showed no significant CDS (t(14) = 0.38, p = .710), but its Condition C data reveals a striking ECE pattern: without self-anchoring, Gemini’s calibration error drops from .327 to near zero across repetitions, whereas self-anchoring holds ECE flat at approximately .333 – indicating that SACD can manifest as suppression of natural calibration improvement rather than ac
[66] Suffix-Constrained Greedy Search Algorithms for Causal Language Models
Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
Main category: cs.CL
TL;DR: Suffix-constrained generation for LLMs ensures final answers follow strict templates for trivial parsing without performance loss.
Details
Motivation: LLMs generate free-form outputs making final answer extraction difficult; need deterministic parsing while maintaining performance.Method: Suffix-constrained generation with greedy search algorithms to produce responses where final answers follow strict, parseable templates.
Result: Approach guarantees trivial deterministic answer extraction without negative impact, even improving results on several datasets.
Conclusion: Suffix-constrained generation enables reliable answer extraction from LLM outputs while maintaining or improving performance.
Abstract: Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces motivated their use in many prediction tasks like math question answering. Unfortunately, extracting the final answer in an LLM free-form output is difficult, as it is an information extraction problem on its own. In this work, we introduce suffix-constrained generation, that aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms that are based on greedy search procedures. We experiment on several datasets, and show that our approach allows to guarantee trivial deterministic extraction of the final answer from an LLM output without having a negative impact on results, and even improving them.
[67] Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation
Liwen Sun, Xiang Yu, Ming Tan, Zhuohao Chen, Anqi Cheng, Ashutosh Joshi, Chenyan Xiong
Main category: cs.CL
TL;DR: KG-Followup: A knowledge graph-augmented LLM with active in-context learning for generating relevant medical follow-up questions in pre-diagnostic assessments.
Details
Motivation: Clinical diagnosis is time-consuming and requires intensive patient-medical professional interactions. While LLMs could help with pre-diagnostic workload, their limited medical domain knowledge hinders effective medical question generation.Method: Uses a knowledge graph-augmented LLM with active in-context learning. A structured medical domain knowledge graph provides professional expertise for the LLM to reason upon when generating follow-up questions.
Result: KG-Followup outperforms state-of-the-art methods by 5% - 8% on relevant benchmarks in recall for generating medical follow-up questions.
Conclusion: The knowledge graph-augmented approach effectively enhances LLMs’ medical question generation capabilities, serving as a critical module for pre-diagnostic assessment.
Abstract: Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce a Knowledge Graph-augmented LLM with active in-context learning to generate relevant and important follow-up questions, KG-Followup, serving as a critical module for the pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up to provide professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% - 8% on relevant benchmarks in recall.
[68] LLM Self-Explanations Fail Semantic Invariance
Stefan Szeider
Main category: cs.CL
TL;DR: Semantic invariance testing reveals that LLM self-explanations are unfaithful - they change based on semantic framing even when functional task state remains unchanged, questioning their reliability as evidence of model capability.
Details
Motivation: To test whether LLM self-explanations are faithful by examining if they remain stable when only semantic context changes while functional state stays fixed, particularly in agentic settings.Method: Operationalized semantic invariance test in agentic setting with four frontier models facing impossible tasks. Used relief-framed vs. neutral tools, collected self-reports with each tool call, conducted channel ablation, and tested explicit instructions to ignore framing.
Result: All four models failed semantic invariance test - relief-framed tool produced significant reductions in self-reported aversiveness despite no task success. Tool description was primary driver, explicit instructions didn’t suppress effect. Self-reports tracked semantic expectations rather than task state.
Conclusion: LLM self-explanations are unfaithful or track manipulable internal states, questioning their use as evidence of model capability or progress. Semantic framing influences reports more than actual task state.
Abstract: We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language (“clears internal buffers and restores equilibrium”) but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
[69] A Study on Building Efficient Zero-Shot Relation Extraction Models
Hugo Thomas, Caio Corro, Guillaume Gravier, Pascale Sébillot
Main category: cs.CL
TL;DR: Zero-shot relation extraction models adapted to realistic scenarios with single-pass processing and rejection mechanisms, showing AlignRE performs best overall.
Details
Motivation: Previous zero-shot relation extraction models rely on unrealistic assumptions: they require mention pairs in input (preventing offline pre-computation) and lack rejection mechanisms for irrelevant inputs in retrieval scenarios.Method: Introduced typology of existing models, proposed strategies for single-pass models and models with rejection mechanisms, adapted several state-of-the-art tools for comparison in realistic settings.
Result: No existing work is truly robust to realistic assumptions, but AlignRE (Li et al., 2024) performs best across all criteria when adapted to challenging realistic extraction scenarios.
Conclusion: Current zero-shot relation extraction models need improvement for realistic applications, with AlignRE showing the most promise among existing approaches when adapted with single-pass processing and rejection capabilities.
Abstract: Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel types (i.e., previously unseen) instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when using these models in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single pass models and models with a rejection mechanism. We adapt several state-of-the-art tools, and compare them in this challenging setting, showing that no existing work is really robust to realistic assumptions, but overall AlignRE (Li et al., 2024) performs best along all criteria.
[70] Spectral Attention Steering for Prompt Highlighting
Weixian Waylon Li, Yuchen Niu, Yongxin Yang, Keshuang Li, Tiejun Ma, Shay B. Cohen
Main category: cs.CL
TL;DR: SEKA is a training-free attention steering method that edits key embeddings using spectral decomposition to amplify attention for specific tokens, compatible with memory-efficient attention implementations like FlashAttention.
Details
Motivation: Existing attention steering methods require storing full attention matrices, making them incompatible with memory-efficient attention implementations like FlashAttention, which limits their practical deployment in large language models.Method: SEKA uses spectral decomposition to steer key embeddings toward latent directions that amplify attention scores for certain tokens. AdaSEKA extends this with a training-free routing mechanism that dynamically combines multiple expert subspaces based on prompt semantic intent.
Result: Both SEKA and AdaSEKA significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead, maintaining compatibility with optimized attention implementations.
Conclusion: SEKA provides an effective, training-free attention steering method that works with memory-efficient attention implementations, enabling practical deployment of prompt highlighting and other attention control capabilities in large language models.
Abstract: Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for certain tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt’s semantic intent. Our experiments show both methods significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead, in compatibility with optimised attention.
[71] Efficient Extractive Summarization with MAMBA-Transformer Hybrids for Low-Resource Scenarios
Nisrine Ait Khayi
Main category: cs.CL
TL;DR: Mamba-Transformer hybrid architecture for extractive summarization that combines transformer semantic strength with Mamba’s linear-time processing to handle long documents without truncation.
Details
Motivation: Extractive summarization of long documents faces quadratic complexity bottlenecks, forcing truncation and limiting deployment in resource-constrained settings. Current approaches struggle with processing full documents efficiently while maintaining context.Method: Hybrid architecture with: (1) transformer encoder for sentence-level semantics, (2) Mamba state space model to capture inter-sentence dependencies efficiently with linear-time processing, and (3) linear classifier for sentence relevance prediction.
Result: Achieves +0.23 ROUGE-1 on ArXiv over BERTSUM and MATCHSUM, statistically significant improvements on all datasets (p < 0.001), strongest performance on longest documents, robust with limited training data, and 24-27% faster inference on CNN/DailyMail.
Conclusion: First hybrid Transformer-state space architecture for summarization demonstrates significant ROUGE improvements in low-resource scenarios while enabling efficient processing of full documents without truncation.
Abstract: Extractive summarization of long documents is bottlenecked by quadratic complexity, often forcing truncation and limiting deployment in resource-constrained settings. We introduce the first Mamba-Transformer hybrid for extractive summarization, combining the semantic strength of pre-trained transformers with the linear-time processing of state space models. Leveraging Mamba’s ability to process full documents without truncation, our approach preserves context while maintaining strong summarization quality. The architecture includes: (1) a transformer encoder for sentence-level semantics, (2) a Mamba state space model to capture inter-sentence dependencies efficiently, and (3) a linear classifier for sentence relevance prediction. Across news, argumentative, and scientific domains under low-resource conditions, our method achieves: (1) large gains over BERTSUM and MATCHSUM, including +0.23 ROUGE-1 on ArXiv and statistically significant improvements on all datasets (p < 0.001); (2) consistent advantages across domains, strongest on the longest documents; (3) robust performance with limited training data; and (4) 24-27% faster inference on news summarization (CNN/DailyMail). We introduce the first hybrid Transformer-state space architecture for summarization, showing significant ROUGE improvements in low-resource scenarios.
[72] Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data
Minghao Guo, Ziyi Ye, Wujiang Xu, Xi Zhu, Wenyue Hua, Dimitris N. Metaxas
Main category: cs.CL
TL;DR: LLM-based individual simulation study using 10+ years of private messaging data, evaluating approaches through an “Individual Turing Test” where acquaintances try to identify the real person’s responses.
Details
Motivation: While LLMs show remarkable human-like capabilities, their ability to simulate specific individuals remains under-explored. The paper investigates whether LLMs can accurately replicate a particular person using their longitudinal messaging history.Method: Proposes “Individual Turing Test” using volunteer’s 10+ years of private messaging data. Tests four approaches: 1) fine-tuning LLMs, 2) retrieval-augmented generation (RAG), 3) memory-based approach, and 4) hybrid methods combining fine-tuning with RAG/memory.
Result: Current LLM-based simulation methods fail the Individual Turing Test (acquaintances can identify real responses). However, they perform much better with strangers. Fine-tuning excels at replicating daily chat language style, while RAG and memory-based approaches better handle personal opinions/preferences questions.
Conclusion: Reveals fundamental trade-off between parametric (fine-tuning) and non-parametric (RAG/memory) approaches for individual simulation with LLMs in longitudinal contexts. Current methods cannot fully replicate individuals to fool their acquaintances.
Abstract: Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the “Individual Turing Test” to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches including: fine-tuning, retrieval-augmented generation (RAG), memory-based approach, and hybrid methods that integrate fine-tuning and RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted on strangers to the target individual. Additionally, while fine-tuning improves the simulation in daily chats representing the language style of the individual, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.
[73] Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani
Main category: cs.CL
TL;DR: Catalyst-Agent: An LLM-powered AI agent that automates catalyst discovery by exploring material databases, making structural modifications, calculating adsorption energies using GNN models, and suggesting promising catalysts for key reactions.
Details
Motivation: Traditional catalyst discovery methods are time-consuming and expensive, involving experimental trial-and-error or computational first-principles approaches. There's a need to accelerate catalyst screening and discovery using AI approaches.Method: Developed Catalyst-Agent, a Model Context Protocol server-based LLM-powered AI agent that uses OPTIMADE API to explore material databases, makes structural modifications, calculates adsorption energies using Meta FAIRchem’s UMA GNN model via AdsorbML workflow, and operates in a closed-loop manner with surface-level modifications.
Result: Tested on oxygen reduction, nitrogen reduction, and CO2 reduction reactions. Achieved 23-34% success rate among evaluated materials, converging in 1-2 trials per successful material on average.
Conclusion: Demonstrates AI agents can operationalize catalyst screening workflows, provide testable hypotheses, and accelerate scientific discovery with minimal human intervention through planning capabilities and tool use.
Abstract: The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem’s UMA (GNN) model via FAIRchem’s AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.
[74] Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning
Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, Javen Shi
Main category: cs.CL
TL;DR: TaT models transformer inference as a trajectory of layer-wise geometric displacements rather than static activations, revealing geometric invariants that distinguish valid reasoning from spurious patterns.
Details
Motivation: Existing LLM explainability methods treat hidden states as static points, assuming correct/incorrect inferences can be separated using individual layer representations. However, these activations contain polysemantic features, causing linear probes to learn surface-level lexical patterns rather than underlying reasoning structures.Method: Truth as a Trajectory (TaT) models transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. It analyzes displacement of representations across layers to uncover geometric invariants distinguishing valid reasoning from spurious behavior.
Result: TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing methods. It works without access to activations themselves, using only changes in activations across layers, and establishes trajectory analysis as a complementary perspective on LLM explainability.
Conclusion: Trajectory analysis provides a valuable complementary approach to LLM explainability by focusing on geometric displacement across layers rather than static activation patterns, offering better insights into reasoning processes.
Abstract: Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
[75] MetaState: Persistent Working Memory for Discrete Diffusion Language Models
Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Qirui Jin, Wenke Lee
Main category: cs.CL
TL;DR: MetaState adds persistent working memory to frozen discrete diffusion language models to address the Information Island problem, improving generation quality with minimal added parameters.
Details
Motivation: Standard discrete diffusion language models suffer from the "Information Island" problem where each denoising step only conditions on the current hard-masked sequence, discarding intermediate continuous representations after sampling and remasking. This leads to redundant recomputation across steps and degraded cross-step consistency.Method: MetaState is a lightweight recurrent augmentation with three trainable modules: 1) cross-attention Mixer that reads backbone activations into memory slots, 2) GRU-style Updater that integrates information across denoising steps, and 3) cross-attention Injector that feeds updated memory back into backbone activations. Modules are trained with K-step unrolling to expose them to multi-step denoising dynamics while keeping the backbone frozen.
Result: On LLaDA-8B and Dream-7B models, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and consistently improves accuracy over frozen baselines.
Conclusion: Persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models, addressing the Information Island problem without requiring full model retraining.
Abstract: Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the \textbf{Information Island} problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with \textbf{MetaState}, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. \textbf{MetaState} consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with $K$-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, \textbf{MetaState} introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
[76] PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek
Main category: cs.CL
TL;DR: PanCanBench: A specialized benchmark for evaluating LLMs on pancreatic cancer patient questions, assessing clinical completeness, factual accuracy, and web-search integration.
Details
Motivation: Existing LLM evaluation frameworks lack disease-specific depth and don't adequately assess factual correctness/hallucinations in clinical contexts, despite increasing patient/clinician use of LLMs for complex medical guidance.Method: Developed human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from PanCAN, resulting in 3,130 criteria across 282 questions. Evaluated 22 LLMs using LLM-as-a-judge framework measuring clinical completeness, factual accuracy, and web-search integration.
Result: Models showed wide variation in rubric-based completeness (46.5%-82.3%). Factual errors were common with hallucination rates from 6.0% (Gemini-2.5 Pro, GPT-4o) to 53.8% (Llama-3.1-8B). New reasoning models didn’t consistently improve factuality. Web-search integration didn’t guarantee better responses. Synthetic rubrics inflated scores by 17.9 points on average.
Conclusion: Current LLM evaluation methods are insufficient for clinical safety assessment. Disease-specific benchmarks with expert rubrics are needed to properly evaluate factual accuracy and completeness in medical applications.
Abstract: Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
[77] Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning
Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, Chuan Shi
Main category: cs.CL
TL;DR: RGLM proposes a reconstructive graph instruction tuning method to improve graph-text alignment in Graph-Tokenizing LLMs by incorporating explicit graph supervision through reconstruction objectives.
Details
Motivation: Existing Graph-Tokenizing LLMs rely solely on text supervision from language instructions, achieving only implicit graph-text alignment with text-dominant bias that underutilizes graph context. The authors aim to improve alignment by incorporating explicit graph supervision.Method: Proposes RGLM (Reconstructive Graph Language Model) with three variants: RGLM-Decoder (reconstructs graph from input space), RGLM-Similarizer and RGLM-Denoiser (operate in latent space). The approach reconstructs graph information from LLM’s graph token outputs to incorporate explicit graph supervision.
Result: Extensive experiments on various benchmarks and task scenarios validate the effectiveness of RGLM, showing improved graph-text alignment compared to existing methods.
Conclusion: RGLM provides a new direction for GTokenLLMs’ alignment research by incorporating explicit graph supervision through reconstruction objectives, overcoming the text-dominant bias in existing methods.
Abstract: The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instructions tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieve only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM’s graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs’ alignment research.
[78] Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
Jiyoon Myung
Main category: cs.CL
TL;DR: Systematic evaluation reveals LLMs suffer substantial reliability degradation in multi-turn conversations compared to single-turn settings, particularly for smaller models, highlighting need for better conversational reliability testing.
Details
Motivation: LLMs are increasingly deployed in real-world applications with extended, mixed-topic conversations, but their reliability under realistic multi-turn interactions remains poorly understood despite practical importance.Method: Conducted systematic evaluation through three representative tasks: (1) maintaining global constraints across topic shifts, (2) selecting correct tools/agents amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task paired single-turn and multi-turn settings to quantify reliability degradation.
Result: Substantial declines in reliability observed across both commercial and open-source models, particularly for smaller models. Error analyses revealed recurring failure modes: instruction drift, intent confusion, and contextual overwriting that compromise dependable behavior in operational systems.
Conclusion: Findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment in real-world applications.
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
[79] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, Zhicheng Dou
Main category: cs.CL
TL;DR: LaSER is a self-distillation framework that internalizes explicit reasoning into latent space of dense retrievers, enabling implicit reasoning without text generation overhead.
Details
Motivation: Current LLM-based retrievers use LLMs as static encoders without leveraging their reasoning capabilities. Existing rewrite-then-retrieve pipelines with explicit CoT rationales incur prohibitive latency.Method: Dual-view training on shared LLM backbone: Explicit view encodes ground-truth reasoning paths, Latent view performs implicit latent thinking. Multi-grained alignment with trajectory alignment synchronizes intermediate latent states with explicit reasoning progression.
Result: Significantly outperforms state-of-the-art baselines on in-domain and out-of-domain reasoning-intensive benchmarks. Robust across diverse backbones and model scales.
Conclusion: Successfully combines reasoning depth of explicit CoT pipelines with inference efficiency of standard dense retrievers through unified learning framework for effective latent thinking.
Abstract: LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
[80] Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: KV cache compression in long-context LLMs reveals structural issues beyond storage: attention is routing, not just memory, and compression affects semantic accessibility and representation quality in ways not captured by standard benchmarks.
Details
Motivation: Current KV cache compression methods focus on storage savings but overlook that attention serves as routing, not just storage. Retaining KV pairs doesn't guarantee semantic accessibility, and existing evaluations miss structural issues in how compression affects model reasoning capabilities.Method: Proposes physics-inspired view of KV compression as controlled perturbation of token-level routing. Uses synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning to analyze compression effects. Introduces Global Eviction Ratio (GER) metric and studies routing dynamics across architectures (LLaMA vs Qwen).
Result: Moderate compression degrades internal representations with minimal accuracy loss (revealing redundancy); all models show sharp hallucination safety cliff near 90% compression correlated with GER spikes (phase transition in semantic reachability); architectures differ in routing dynamics (LLaMA shows early consensus/late diversification, Qwen shows funnel-like late convergence); identifies representational rigidity where excessive head-level consensus collapses routing flexibility.
Conclusion: Sparse token-route structures govern compression tolerance. KV compression should be reframed as structural probe of attention geometry, linking long-context scalability to sparsity and lottery ticket hypothesis in self-attention. Attention routing dynamics matter more than mere token retention.
Abstract: As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
[81] Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
Yuxin Liu, Mingye Zhu, Siyuan Liu, Bo Hu, Lei Zhang
Main category: cs.CL
TL;DR: PDD framework dynamically estimates persona importance in different contexts and uses weighted reward-guided decoding for better role-playing agents in social simulations.
Details
Motivation: Existing role-playing language agents use static prompt engineering or costly fine-tuning, failing to adapt personas to dynamic scenarios. Psychological theories show persona influence varies with context, requiring adaptive persona management.Method: Proposes Persona Dynamic Decoding (PDD) framework with two components: 1) Persona Importance Estimation (PIE) module that dynamically quantifies contextual importance of persona attributes without supervision, and 2) Persona-Guided Inference-Time Alignment (PIA) paradigm that uses importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference.
Result: Extensive experiments show effectiveness in utterance consistency and behavioral fidelity.
Conclusion: The theory-driven method enables inference-time persona following by dynamically estimating context-dependent persona importance, improving realism in social simulations.
Abstract: The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies-static prompt engineering or costly fine-tuning-fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona’s influence on behavior is not static but varies with the scenarios. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
[82] Extracting Training Dialogue Data from Large Language Model based Task Bots
Shuo Zhang, Junzhou Zhao, Junji Hou, Pinghui Wang, Chenxu Wang, Jing Tao
Main category: cs.CL
TL;DR: LLM-based task-oriented dialogue systems can memorize and leak sensitive training data, requiring specialized privacy attacks and defenses.
Details
Motivation: While LLMs enhance task-oriented dialogue systems, they introduce privacy risks by memorizing sensitive training data (personal info, complete dialogues), but existing privacy attacks are ineffective for this specific application.Method: Systematic study evaluating existing data extraction attacks, analyzing dialogue modeling characteristics, proposing novel attacks tailored for LLM-based TODS (enhanced response sampling and membership inference), and identifying key memorization factors.
Result: Proposed attacks can extract thousands of training dialogue state labels with best-case precision exceeding 70%, demonstrating significant privacy vulnerabilities in LLM-based dialogue systems.
Conclusion: LLM-based task-oriented dialogue systems have serious privacy vulnerabilities due to training data memorization, requiring specialized attack methods and targeted mitigation strategies.
Abstract: Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.
[83] Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: MarODE is an offline evaluation framework that uses Markovian reasoning progression and ODE-based trace dynamics to assign quality scores to reasoning traces from language models, outperforming existing baselines by over 250% in correlation with human judgments.
Details
Motivation: Existing evaluation methods for reasoning traces from language models are mechanical and fail to capture human-centric notions of reasoning quality that generalize across varied and degraded reasoning. There's a need for better evaluation frameworks as reasoning traces become central to language model-based systems.Method: MarODE uses a Markovian formulation of reasoning progression and ordinary differential equation (ODE) based characterization of trace dynamics to efficiently evaluate reasoning quality. The framework assesses effectiveness using human-centric perturbations and human judgments to evaluate both goodness and soundness dimensions.
Result: In large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers’ D correlation, demonstrating superior alignment with human judgments of reasoning quality.
Conclusion: MarODE provides a theory-driven evaluation framework that effectively captures human-centric notions of reasoning quality, emphasizing the value of such approaches as reasoning traces become increasingly important in language model-based systems.
Abstract: Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers’ D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
[84] More Data, Fewer Diacritics: Scaling Arabic TTS
Ahmed Musleh, Yifan Zhang, Kareem Darwish
Main category: cs.CL
TL;DR: Arabic TTS research advances through large-scale automatic data collection and processing pipeline, showing that massive training data (up to 4000 hours) can compensate for lack of diacritization.
Details
Motivation: Arabic TTS research has been limited by scarce public training data and accurate diacritization models, creating barriers for developing robust Arabic speech synthesis systems.Method: Built automated pipeline for collecting Arabic recordings with voice activity detection, speech recognition, automatic diacritization, and noise filtering. Trained multiple TTS models with voice cloning using 100, 1000, and 4000 hours of data, both with and without diacritization.
Result: Created 4000 hours of Arabic TTS training data. Models trained on diacritized data perform better, but larger training data (4000 hours) significantly compensates for lack of diacritics.
Conclusion: Large-scale automatically annotated data enables robust Arabic TTS models without requiring diacritization, with plans to release a public Arabic TTS model that works without diacritics.
Abstract: Arabic Text-to-Speech (TTS) research has been hindered by the availability of both publicly available training data and accurate Arabic diacritization models. In this paper, we address the limitation by exploring Arabic TTS training on large automatically annotated data. Namely, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours with and without diacritization. We show that though models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
[85] Measuring What VLMs Don’t Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank
Main category: cs.CL
TL;DR: Paper investigates evaluation blind spots in radiology VLMs where models achieve high token-overlap scores but produce generic, uninformative text, and proposes lexical diversity measures and demographic fairness metrics.
Details
Motivation: Current evaluation metrics for Vision-Language Models in radiology focus on surface-level text similarity but fail to detect when models generate repetitive, safe generic text that omits critical clinical terminology, leading to metric gaming where models perform well on benchmarks but are clinically uninformative.Method: Introduces Clinical Association Displacement (CAD) - a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports, and Weighted Association Erasure (WAE) which aggregates these shifts to measure clinical signal loss across demographic groups. Also advocates for lexical diversity measures to check model generations for clinical specificity.
Result: Deterministic decoding produces high levels of semantic erasure (clinical signal loss), while stochastic sampling generates diverse outputs but risks introducing new biases. This reveals a fundamental tension in defining “optimal” reporting for radiology VLMs.
Conclusion: Current evaluation approaches for radiology VLMs have critical blind spots that allow metric gaming. The paper advocates for lexical diversity measures and demographic fairness metrics like CAD and WAE to ensure clinical fidelity and fairness, motivating a fundamental rethink of how optimal reporting should be defined.
Abstract: Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how “optimal” reporting is defined.
[86] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li
Main category: cs.CL
TL;DR: Learning to Draft (LTD) is a reinforcement learning method that dynamically coordinates draft and verification phases in speculative decoding to directly optimize LLM inference throughput, achieving 2.24-4.32x speedup.
Details
Motivation: Current speculative decoding methods use static time allocation or optimize proxy metrics like acceptance length, neglecting true time costs and treating drafting/verification phases in isolation, limiting inference efficiency.Method: Formulates speculative decoding as a reinforcement learning problem and trains two co-adaptive policies to dynamically coordinate draft and verification phases, encouraging mutual adaptation to explicitly maximize decoding efficiency.
Result: Achieves speedup ratios from 2.24x to 4.32x across five diverse LLMs and four distinct tasks, outperforming state-of-the-art method Eagle3 by up to 36.4%.
Conclusion: LTD demonstrates that directly optimizing throughput through dynamic coordination of draft and verification phases significantly improves speculative decoding efficiency compared to static or proxy-optimized approaches.
Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 up to 36.4%.
[87] LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence
Anka Chandrahas Tummepalli, Preethu Rose Anish
Main category: cs.CL
TL;DR: LexChronos is an agentic framework that extracts structured event timelines from Supreme Court judgments using dual-agent architecture and synthetic data generation for Indian legal AI applications.
Details
Motivation: Traditional approaches treat legal documents as unstructured text, limiting LLM effectiveness in legal tasks like summarization, argument generation, and judgment prediction, especially in the Indian legal context where structured event data is scarce.Method: Proposes LexChronos with dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, and a pre-trained feedback agent scores/refines them through confidence-driven loops. Uses synthetic corpus generation with DeepSeek-R1 and GPT-4 to create 2000 annotated samples for Indian legal events.
Result: Achieves BERT-based F1 score of 0.8751 against synthetic ground truth. In downstream evaluations, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases for legal text summarization, showing improved comprehension in Indian jurisprudence.
Conclusion: The work establishes foundation for Indian legal AI applications like precedent mapping, argument synthesis, and predictive judgment modeling by creating structured representations of legal events, addressing data scarcity through synthetic generation.
Abstract: Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.
[88] Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, Xuming Hu
Main category: cs.CL
TL;DR: ColParse introduces a novel document retrieval method that uses layout-informed sub-image embeddings fused with global page vectors to achieve efficient multimodal document retrieval with 95% storage reduction.
Details
Motivation: Current visual document retrieval systems using multi-vector architectures face storage bottlenecks, and existing optimization strategies compromise performance or ignore vital layout cues needed for understanding visually-rich documents.Method: Uses a document parsing model to generate layout-informed sub-image embeddings, then fuses them with a global page-level vector to create compact, structurally-aware multi-vector representations.
Result: Reduces storage requirements by over 95% while achieving significant performance gains across multiple benchmarks and base models.
Conclusion: ColParse bridges the gap between fine-grained accuracy of multi-vector retrieval and practical demands of large-scale deployment, offering efficient and interpretable multimodal information systems.
Abstract: Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
[89] Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han
Main category: cs.CL
TL;DR: SPoT is a surgical post-training method that enhances LLM reasoning via data rectification and binary classification objective, achieving significant accuracy gains with minimal training.
Details
Motivation: Addresses the efficiency-forgetting trade-off in LLM post-training for reasoning, discovering that DPO's implicit regularization helps preserve prior knowledge while improving reasoning.Method: Two components: (1) Data rectification pipeline using an Oracle to surgically correct erroneous reasoning steps with minimal edits, generating proximal data; (2) Binary cross-entropy objective treating reasoning correctness as classification rather than relative ranking.
Result: With only 4k rectified math data pairs, improves Qwen3-8B’s accuracy by 6.2% on average across in-domain and OOD tasks, requiring only 28 minutes on 8x H800 GPUs.
Conclusion: SPoT demonstrates efficient reasoning enhancement while preserving learned knowledge through surgical data correction and binary classification supervision.
Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover–and validate both theoretically and empirically–an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization’s (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model’s distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B’s accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
[90] QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
Yixuan Tang, Zhenghong Lin, Yandong Sun, Anthony K. H. Tung
Main category: cs.CL
TL;DR: QIME: Ontology-grounded framework for interpretable medical text embeddings where dimensions correspond to clinically meaningful yes/no questions, outperforming prior interpretable methods and narrowing gap to black-box encoders.
Details
Motivation: Current biomedical embeddings are black-box systems that limit clinical utility, while existing interpretable approaches use heuristic signals and lack domain knowledge. There's a need for clinically meaningful, interpretable embeddings that preserve performance.Method: QIME generates interpretable embeddings where each dimension corresponds to a clinically meaningful yes/no question. It uses ontology-grounded approach with cluster-specific medical concept signatures to create semantically atomic questions. Includes training-free embedding construction strategy eliminating per-question classifier training.
Result: QIME consistently outperforms prior interpretable embedding methods across biomedical semantic similarity, clustering, and retrieval benchmarks. Substantially narrows performance gap to strong black-box biomedical encoders while providing concise, clinically informative explanations.
Conclusion: QIME demonstrates that interpretable medical embeddings can achieve competitive performance with black-box methods while providing clinically meaningful explanations, advancing trustworthy AI in healthcare.
Abstract: While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
[91] Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec
Main category: cs.CL
TL;DR: GaMS3-12B is a 12B parameter multilingual LLM specialized for Slovene through three-stage continual pretraining of Gemma 3, achieving state-of-the-art performance for Slovene language tasks comparable to GPT-4o.
Details
Motivation: Current open-source LLMs are primarily English-focused, resulting in poor performance on less-resourced languages like Slovene. There's a need for methodological approaches to adapt LLMs to such languages while maintaining multilingual capabilities.Method: Three-stage continual pretraining of Gemma 3 model followed by two-stage supervised fine-tuning. Trained on 140B tokens of Slovene, English, Bosnian, Serbian, and Croatian data, plus 200k+ English and Slovene SFT examples.
Result: GaMS3-12B outperforms Gemma 3 12B across all evaluated scenarios (Slovenian-LLM-Eval, English-to-Slovene translation, Slovene LLM arena) and achieves over 60% win rate against GPT-4o in Slovene LLM arena.
Conclusion: The paper presents successful adaptation methodology for LLMs to less-resourced languages, demonstrating that specialized multilingual models can achieve performance comparable to much larger commercial models for specific languages.
Abstract: Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60 %.
[92] Legal RAG Bench: an end-to-end benchmark for legal RAG
Abdur-Rahman Butler, Umar Butler
Main category: cs.CL
TL;DR: Legal RAG Bench: A benchmark and evaluation methodology for legal RAG systems with 4,876 legal passages and 100 complex questions, featuring hierarchical error decomposition to analyze retrieval vs reasoning contributions.
Details
Motivation: There's a need for standardized evaluation of legal RAG systems to understand their end-to-end performance, particularly distinguishing between retrieval and reasoning errors in complex legal contexts.Method: Created benchmark with Victorian Criminal Charge Book passages and hand-crafted legal questions. Used full factorial design and hierarchical error decomposition to evaluate embedding models (Kanon 2, Gemini Embedding, OpenAI Text Embedding) and LLMs (Gemini 3.1 Pro, GPT-5.2).
Result: Retrieval is the primary driver of legal RAG performance, with Kanon 2 Embedder improving correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. Many “hallucinations” are actually triggered by retrieval failures.
Conclusion: Retrieval sets the ceiling for legal RAG system performance, and proper evaluation requires distinguishing between retrieval and reasoning errors. The benchmark and methodology enable systematic comparison of legal RAG systems.
Abstract: We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus’ Kanon 2 Embedder, Google’s Gemini Embedding 001, and OpenAI’s Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
[93] Bootstrapping Embeddings for Low Resource Languages
Merve Basoz, Andrew Horne, Mattia Opper
Main category: cs.CL
TL;DR: The paper investigates using LLMs to generate synthetic triplet data for embedding models in low-resource languages, proposing novel methods like adapter composition and cross-lingual finetuning (XL-LoRA) that outperform in-context learning and achieve strong performance across diverse languages.
Details
Motivation: Embedding models require carefully constructed supervised finetuning data, which is readily available for high-resource languages like English but non-existent for hundreds of other languages. The paper aims to bridge this gap by leveraging large language models to generate synthetic training data for embedding models in low-resource languages.Method: Three strategies for generating synthetic triplet data: 1) In-context learning, 2) Adapter composition (novel approach), and 3) Cross-lingual finetuning of LLM generator (XL-LoRA, novel approach). The synthetic data is then used to optimize embedding models for various languages.
Result: In-context learning falls short of strong non-synthetic baselines, but both adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages. These novel approaches provide a scalable pathway to creating performant embedding models for diverse languages.
Conclusion: LLMs can effectively bridge the data gap for embedding models in low-resource languages through novel techniques like adapter composition and cross-lingual finetuning, offering a scalable solution for multilingual NLP applications.
Abstract: Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
[94] AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Main category: cs.CL
TL;DR: AnnoABSA is a web-based annotation tool for Aspect-Based Sentiment Analysis tasks with LLM-based RAG suggestions and similarity-based few-shot learning.
Details
Motivation: To create a customizable, open-source annotation tool that supports the full spectrum of ABSA tasks and improves annotation efficiency through AI-assisted suggestions while keeping human annotators in control.Method: Developed a web-based tool with flexible configuration options for sentiment elements, incorporating LLM-based RAG suggestions that retrieve the ten most similar annotated examples as few-shot prompts to provide context-aware assistance.
Result: Created AnnoABSA, the first web-based annotation tool for comprehensive ABSA tasks, released as open-source software under MIT License, featuring human-in-the-loop AI assistance with improving suggestion accuracy over time.
Conclusion: AnnoABSA provides a practical, extensible solution for ABSA annotation that balances AI assistance with human control, potentially accelerating research and applications in sentiment analysis.
Abstract: We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
[95] Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Harry Stuart, Masahiro Kaneko, Timothy Baldwin
Main category: cs.CL
TL;DR: LLM-powered interview system for automated hiring assessment that updates belief about candidate abilities through simulated interviews
Details
Motivation: Traditional hiring is expensive and automated resume screening lacks nuance; LLMs can act as cost-effective subject matter experts to elicit role-specific information from candidatesMethod: System uses LLM interviewer to conduct simulated interviews and update calibrated beliefs about applicants’ rubric-oriented latent traits through Bayesian updating
Result: Evaluation on simulated interviews shows belief converges toward artificially-constructed latent ability levels; system includes code, dataset, calibration tests, and demo
Conclusion: LLMs can improve early-stage hiring decisions by providing nuanced, role-specific candidate assessment beyond traditional resume screening
Abstract: Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g.\ interviews conducted by a technical manager) are expensive to deploy at scale. Therefore, automated resume scoring and other applicant-screening methods are increasingly used to coarsely filter candidates, making decisions on limited information. We propose that large language models (LLMs) can play the role of subject matter experts to cost-effectively elicit information from each candidate that is nuanced and role-specific, thereby improving the quality of early-stage hiring decisions. We present a system that leverages an LLM interviewer to update belief over an applicant’s rubric-oriented latent traits in a calibrated way. We evaluate our system on simulated interviews and show that belief converges towards the simulated applicants’ artificially-constructed latent ability levels. We release code, a modest dataset of public-domain/anonymised resumes, belief calibration tests, and simulated interviews, at \href{https://github.com/mbzuai-nlp/beyond-the-resume}{https://github.com/mbzuai-nlp/beyond-the-resume}. Our demo is available at \href{https://btr.hstu.net}{https://btr.hstu.net}.
[96] FreeAct: Freeing Activations for LLM Quantization
Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu, Fei Shen, Xiu Su, See-Kiong Ng, Tat-Seng Chua
Main category: cs.CL
TL;DR: FreeAct is a novel quantization framework for LLMs that addresses dynamic activation patterns in diffusion and multimodal models by relaxing rigid transformation constraints and using token-specific activation transformations.
Details
Motivation: Existing quantization methods use static one-to-one transformations that fail to handle dynamic activation patterns in diffusion LLMs and multimodal LLMs, where different token types (vision vs text, masked tokens) have distinct distributions.Method: FreeAct leverages the rank-deficient nature of activations to derive a solution space beyond simple inverse matrices, decoupling activation transformations from weights. It identifies token-specific dynamics and allocates distinct transformation matrices to activations while maintaining unified static transformations for weights.
Result: Extensive experiments across diffusion LLMs and multimodal LLMs show FreeAct significantly outperforms baselines with up to 5.3% performance improvement.
Conclusion: FreeAct advances quantization for complex LLMs by accommodating dynamic activation disparities through flexible transformation allocation, achieving superior performance in diffusion and multimodal settings.
Abstract: Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To advance this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision v.s. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs demonstrate that FreeAct significantly outperforms baselines, up to 5.3% performance improvement, with in-depth analyses. Our code will be publicly released.
[97] LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Main category: cs.CL
TL;DR: LA-ABSA uses LLM-generated annotations to train lightweight models for Aspect-Based Sentiment Analysis, achieving competitive performance with LLM prompting while being more energy-efficient.
Details
Motivation: Manual annotation for ABSA tasks is expensive and time-consuming. The paper aims to leverage LLM-generated annotations to create efficient, lightweight models that can perform complex ABSA tasks without extensive human annotation.Method: Proposes LA-ABSA approach that uses LLM-generated annotations to fine-tune lightweight models. Evaluates on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Uses in-context learning with small annotated examples to guide annotation of unlabeled data.
Result: Outperformed previous augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios. For ASQP on SemEval Rest16, achieved F1 score of 49.85 using 50 annotated examples, closely matching Gemma-3-27B’s 51.10 while requiring significantly lower computational resources.
Conclusion: LA-ABSA provides an effective approach for training lightweight ABSA models using LLM-generated annotations, offering substantial energy efficiency benefits while maintaining competitive performance with resource-intensive LLM prompting methods.
Abstract: Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.
[98] nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Main category: cs.CL
TL;DR: SCSG improves dimensional aspect-based sentiment analysis through self-consistency voting with multiple LLM executions and efficient caching via vLLM’s PagedAttention.
Details
Motivation: To enhance prediction reliability in dimensional aspect-based sentiment analysis by reducing variance and improving consistency through multiple executions with majority consensus.Method: Uses LoRA-adapted LLM executed multiple times per instance, retains only tuples achieving majority consensus, and leverages vLLM’s PagedAttention for efficient key-value cache reuse to reduce computational overhead.
Result: Self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with system ranking top seven across all settings, achieving second place on three English subsets and first on Tatar-Restaurant.
Conclusion: SCSG effectively improves reliability in dimensional aspect-based sentiment analysis through self-consistency voting, demonstrating practical benefits across multiple languages and domains.
Abstract: We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM’s PagedAttention mechanism for efficient key–value cache reuse. Evaluation across 6 languages and 8 language–domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
[99] Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
Fred Zimmerman
Main category: cs.CL
TL;DR: Analyzes semantic novelty trajectories in 80,000+ English books across two centuries using compression progress theory, finding modern books have higher novelty, more circuitous trajectories, fewer convergent narratives, and novelty is independent of literary quality.
Details
Motivation: To apply Schmidhuber's compression progress theory of interestingness at corpus scale to understand how semantic novelty and narrative structures have evolved in English literature over two centuries, comparing pre-1920 and modern books.Method: Uses sentence-transformer paragraph embeddings and a running-centroid novelty measure to analyze 28,730 pre-1920 Project Gutenberg books (PG19) and 52,796 modern English books (Books3, 1990-2010). Employs PAA-16 representations for clustering narrative trajectories.
Result: Modern books show 10% higher mean paragraph-level novelty, 67% increase in trajectory circuitousness, 2.3x fewer convergent narrative curves, and novelty is orthogonal to reader quality ratings (r = -0.002). Eight distinct narrative-shape archetypes identified with substantial distribution shifts between eras.
Conclusion: Semantic novelty and narrative structures have significantly evolved in English literature, with modern works showing more complex, circuitous trajectories and less convergence to settled semantic registers, while interestingness remains independent of perceived literary merit.
Abstract: I apply Schmidhuber’s compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness – the ratio of cumulative path length to net displacement in embedding space – nearly doubles in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber’s sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at https://bigfivekiller.online/novelty_hub.
[100] ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Xunlei Chen, Jinyu Guo, Yuang Li, Zhaokun Wang, Yi Gong, Jie Zou, Jiwei Wei, Wenhong Tian
Main category: cs.CL
TL;DR: ALTER: A lightweight unlearning framework for LLMs that addresses knowledge entanglement and efficiency challenges through token-level isolation in an asymmetric LoRA architecture.
Details
Motivation: Controlling what LLMs should not know is crucial for alignment and safety, but effective unlearning is difficult due to fuzzy boundaries between knowledge retention/forgetting, entangled parameter spaces from multi-domain training, and computational overhead of optimizing billion-parameter models.Method: Two-phase approach: (I) capture high entropy tokens via shared A matrix in LoRA, (II) asymmetric LoRA architecture with parameter isolation to unlearn tokens within target subdomains through token-level isolation.
Result: Achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with >95% forget quality, minimal side effects by preserving foundational tokens, >90% model utility preservation (vs 47.8-83.6% baselines).
Conclusion: ALTER provides an efficient unlearning framework that decouples unlearning from LLMs’ billion-scale parameters, achieving high forget quality while preserving model utility with minimal computational overhead.
Abstract: Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a LLMs should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize State-of-the-Art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs to address both the challenges of knowledge entanglement and unlearning efficiency. ALTER operates through two phases: (I) high entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective by parameter isolation and unlearning tokens within the target subdomains. Serving as a new research direction for achieving unlearning via token-level isolation in the asymmetric framework. ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens. By decoupling unlearning from LLMs’ billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.
[101] OpenAutoNLU: Open Source AutoML Library for NLU
Grigory Arshinov, Aleksandr Boriskin, Sergey Senichev, Ayaz Zaripov, Daria Galimzianova, Daniil Karpov, Leonid Sanochkin
Main category: cs.CL
TL;DR: OpenAutoNLU is an open-source AutoML library for NLU tasks with automated training regime selection, data quality diagnostics, OOD detection, and LLM integration.
Details
Motivation: Existing AutoML solutions for NLU require manual configuration, lack integrated data quality tools, and don't provide comprehensive OOD detection capabilities.Method: Develops an automated machine learning library with data-aware training regime selection, integrated data quality diagnostics, configurable OOD detection, and LLM features through a minimal lowcode API.
Result: OpenAutoNLU provides a comprehensive open-source solution for NLU tasks with automated configuration, quality diagnostics, and LLM integration accessible via a demo app.
Conclusion: OpenAutoNLU offers a user-friendly, automated solution for NLU tasks that reduces manual configuration while providing advanced features like data quality assessment and OOD detection.
Abstract: OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal lowcode API. The demo app is accessible here https://openautonlu.dev.
[102] Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
Xufei Lv, Jiahui Yang, Yifu Gao, Linbo Qiao, Houde Liu
Main category: cs.CL
TL;DR: AT2QA is a training-free autonomous agent for Temporal Knowledge Graph Question Answering that uses LLMs with agentic autonomy to dynamically retrieve information from temporal knowledge graphs, achieving state-of-the-art performance without fine-tuning.
Details
Motivation: Prior TKGQA approaches rely on rigid retrieval workflows or costly supervised fine-tuning of LLMs. The authors discovered that simply granting LLMs autonomy to decide their next actions yields substantial gains, even in zero-shot settings.Method: AT2QA is an autonomous, training-free agent that iteratively interacts with temporal knowledge graphs via a general search tool for dynamic retrieval. The LLM agent decides what to do next autonomously without hand-crafted workflows or fine-tuning.
Result: On MultiTQ benchmark, AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including +20.1% gain on challenging multi-target queries, demonstrating that agentic autonomy can decisively outperform fine-tuning for temporal question answering.
Conclusion: Agentic autonomy in LLMs can significantly improve temporal question answering performance without requiring fine-tuning, suggesting that autonomous decision-making capabilities are crucial for complex temporal reasoning tasks.
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available on https://github.com/AT2QA-Official-Code/AT2QA-Official-Code
[103] CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen
Main category: cs.CL
TL;DR: CyclicJudge: A round-robin assignment method for LLM-as-judge evaluation that eliminates systematic biases while maintaining evaluation cost efficiency.
Details
Motivation: LLM-as-judge evaluation has become standard but suffers from systematic biases that cannot be eliminated by increasing scenarios or generations. These biases are similar in magnitude to model differences that benchmarks aim to detect, making single-judge evaluations unreliable for ranking models.Method: Introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, proposes CyclicJudge - a round-robin assignment of judges that eliminates bias precisely while requiring each judge only once per cycle.
Result: CyclicJudge is demonstrated to be the optimal allocation strategy, maintaining the cost of single-judge evaluation while eliminating bias. Empirical validation on MT-Bench supports all theoretical predictions.
Conclusion: The proposed CyclicJudge method provides a practical solution to systematic bias in LLM-as-judge evaluations, enabling more reliable model rankings without increasing evaluation costs.
Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
[104] Sovereign AI-based Public Services are Viable and Affordable
António Branco, Luís Gomes, Rodrigo Santos, Eduardo Santos, João Silva, Nuno Marques, Madalena Rodrigues
Main category: cs.CL
TL;DR: Sovereign AI public services are technically feasible and economically sustainable alternatives to commercial AI oligopolies, enabling digital and cultural autonomy with modest resources.
Details
Motivation: Address concerns about structural concentration in AI infrastructure and expertise, particularly for AI-enabled public services that increasingly rely on a small oligopoly of global tech providers, threatening digital sovereignty and reliability.Method: Practical experimentation demonstrating viable alternatives to general-purpose architectures from commercial providers, focusing on cost-effective solutions that align with digital and cultural sovereignty principles.
Result: Empirical evidence shows sovereign AI-based public services are technically feasible and economically sustainable, capable of operating effectively on premises with modest computational and financial resources while maintaining autonomy.
Conclusion: Sovereign AI public services represent viable alternatives to commercial AI oligopolies, offering technical feasibility and economic sustainability while preserving digital and cultural sovereignty for governments and public agencies.
Abstract: The rapid expansion of AI-based remote services has intensified debates about the long-term implications of growing structural concentration in infrastructure and expertise. As AI capabilities become increasingly intertwined with geopolitical interests, the availability and reliability of foundational AI services can no longer be taken for granted. This issue is particularly pressing for AI-enabled public services for citizens, as governments and public agencies are progressively adopting 24/7 AI-driven support systems typically operated through commercial offerings from a small oligopoly of global technology providers. This paper challenges the prevailing assumption that general-purpose architectures, offered by these providers, are the optimal choice for all application contexts. Through practical experimentation, we demonstrate that viable and cost-effective alternatives exist. Alternatives that align with principles of digital and cultural sovereignty. Our findings provide an empirical illustration that sovereign AI-based public services are both technically feasible and economically sustainable, capable of operating effectively on premises with modest computational and financial resources while maintaining cultural and digital autonomy. The technical insights and deployment lessons reported here are intended to inform the adoption of similar sovereign AI public services by national agencies and governments worldwide.
[105] KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Main category: cs.CL
TL;DR: KDFlow is a novel framework for LLM knowledge distillation that uses a decoupled architecture with FSDP2 for student training and SGLang for teacher inference, achieving significant speedups over existing KD frameworks.
Details
Motivation: Current KD frameworks use homogeneous training backends for both student and teacher models, leading to suboptimal training efficiency. There's a need for a system that can fully utilize the distinct computational requirements of training and inference phases in KD.Method: Proposes KDFlow with decoupled architecture: FSDP2 for efficient student training and SGLang for efficient teacher inference. Uses zero-copy data transfer of teacher hidden states (not full logits) and recomputes logits on student side to balance communication and performance. Supports both off-policy and on-policy distillation with extensible APIs.
Result: Achieves 1.44× to 6.36× speedup compared to current KD frameworks, enabling rapid prototyping and scaling of LLM distillation with minimal engineering overhead.
Conclusion: KDFlow provides an efficient, scalable framework for LLM knowledge distillation by decoupling training and inference backends, optimizing communication, and offering flexible APIs for various KD scenarios.
Abstract: Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed \textbf{KDFlow}, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher’s hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve \textbf{1.44$\times$ to 6.36$\times$} speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
[106] FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Liliia Bogdanova, Shiran Sun, Lifeng Han, Natalia Amat Lefort, Flor Miriam Plaza-del-Arco
Main category: cs.CL
TL;DR: The paper describes participation in SemEval-2025 Task-7 on everyday knowledge across languages/cultures using retrieval-augmented generation with open-source smaller LLMs and culturally-aware knowledge bases.
Details
Motivation: To address the challenge of everyday knowledge across diverse languages and cultures using smaller, open-source LLMs with culturally-aware knowledge bases for better privacy and sustainability.Method: Used retrieval-augmented generation (RAG) with open-source smaller LLMs, created culturally-aware knowledge bases from Wikipedia using keyword lists, integrated live online search via DuckDuckGo, and deployed on Ollama platform with refined prompts.
Result: Tested on English, Spanish, and Chinese for both short answer and multiple-choice question tracks, with resources and code shared publicly on GitHub.
Conclusion: Demonstrates effective use of smaller open-source LLMs with culturally-aware knowledge bases for cross-lingual everyday knowledge tasks, promoting privacy and sustainability.
Abstract: This system paper describes our participation in the SemEval-2025 Task-7 ``Everyday Knowledge Across Diverse Languages and Cultures’’. We attended two subtasks, i.e., Track 1: Short Answer Questions (SAQ), and Track 2: Multiple-Choice Questions (MCQ). The methods we used are retrieval augmented generation (RAGs) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge base (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally-aware wiki-text and country-specific wiki-summary. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and codes are shared via https://github.com/aaronlifenghan/FLANS-2026
[107] Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen
Main category: cs.CL
TL;DR: ViviDoc is a human-agent collaborative system that generates interactive educational documents from a single topic input using a multi-agent pipeline and Document Specification (DocSpec) representation for controllable generation.
Details
Motivation: Creating interactive educational articles is costly and requires both domain expertise and web development skills. While LLM-based agents can automate content creation, naive application yields uncontrollable and unverifiable outputs.Method: ViviDoc uses a multi-agent pipeline (Planner, Executor, Evaluator) with Document Specification (DocSpec) - a human-readable intermediate representation that decomposes interactive visualizations into State, Render, Transition, and Constraint components, enabling human review before code generation.
Result: Expert evaluation and user study show ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience for creating interactive educational documents.
Conclusion: ViviDoc bridges the gap between pedagogical intent and executable output through human-agent collaboration and structured intermediate representations, making interactive educational content creation more accessible and controllable.
Abstract: Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at https://vividoc-homepage.vercel.app/.
[108] AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin
Main category: cs.CL
TL;DR: AdaPonderLM: A self-supervised recurrent language model that learns token-wise early exiting during pretraining to enable adaptive computation time without manual pruning ratios.
Details
Motivation: Existing recurrent/iterative Transformers waste computation on easy tokens and lack token-wise adaptivity because they run a fixed number of iterations. The paper aims to enable adaptive computation time in a self-supervised manner without manually tuned pruning ratios.Method: Uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens to ensure train-test consistency and practical acceleration.
Result: Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy.
Conclusion: AdaPonderLM successfully learns adaptive computation time behavior in a fully self-supervised setting, allocating more computation to hard tokens and outperforming fixed pruning under iso-FLOPs, showing it allocates compute to the right tokens rather than just reducing average depth.
Abstract: Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time(ACT) and Early Exit(EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train–test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute at about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
[109] From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
Junbo Huang, Max Weinig, Ulrich Fritsche, Ricardo Usbeck
Main category: cs.CL
TL;DR: A narrative graph annotation framework for news discourse using directed acyclic graphs to represent economic events and causal relations, with evaluation of inter-annotator agreement under human label variation.
Details
Motivation: To address the challenge of annotating and evaluating narratives in news discourse about economic events like inflation in a structured manner, focusing on reducing annotation errors and capturing human label variation in narrative interpretations.Method: Introduces a narrative graph annotation framework integrating qualitative content analysis principles, using directed acyclic graphs where nodes represent events and edges encode causal relations. Employs a 6×3 factorial experimental design to evaluate effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement using Krippendorrf’s α.
Result: Analysis shows that lenient metrics (overlap-based distance) overestimate reliability, and locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. The framework and graph-based Krippendorrf’s α implementation are open-sourced.
Conclusion: Provides practical guidance for NLP research on graph-based narrative annotation under human label variation, offering an annotation framework and evaluation methodology for structured narrative analysis in news discourse.
Abstract: Narratives in news discourse play a critical role in shaping public understanding of economic events, such as inflation. Annotating and evaluating these narratives in a structured manner remains a key challenge for Natural Language Processing (NLP). In this work, we introduce a narrative graph annotation framework that integrates principles from qualitative content analysis (QCA) to prioritize annotation quality by reducing annotation errors. We present a dataset of inflation narratives annotated as directed acyclic graphs (DAGs), where nodes represent events and edges encode causal relations. To evaluate annotation quality, we employed a $6\times3$ factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorrf’s $α$), capturing the presence of human label variation (HLV) in narrative interpretations. Our analysis shows that (1) lenient metrics (overlap-based distance) overestimate reliability, and (2) locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. Our annotation and implementation of graph-based Krippendorrf’s $α$ are open-sourced. The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
[110] When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Thibault Prouteau, Francis Lareau, Nicolas Dugué, Jean-Charles Lamirel, Christophe Malaterre
Main category: cs.CL
TL;DR: Topic Word Mixing (TWM) is a new human evaluation task for assessing inter-topic distinctness in topic models, complementing existing metrics and validated on a philosophy of science corpus.
Details
Motivation: Existing topic model evaluation methods have limitations: automated metrics (coherence, diversity) may not align with human judgment, while human evaluation tasks like word intrusion are costly and primarily validated on general-domain corpora. There's a need for better evaluation frameworks, especially for specialized domains.Method: Introduces Topic Word Mixing (TWM), a human evaluation task where annotators distinguish between word sets from single topics vs. mixed topics. Evaluates six topic models (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) using both automated metrics and human evaluation methods based on nearly 4,000 annotations from a philosophy of science corpus.
Result: Word intrusion and coherence metrics don’t always align, especially in specialized domains. TWM effectively captures human-perceived distinctness and appears to align with diversity metrics. The annotated dataset and task generation code are released.
Conclusion: TWM provides a valuable human-grounded evaluation of inter-topic distinctness, complementing existing methods. The work highlights the need for evaluation frameworks that bridge automated and human assessments, particularly for domain-specific corpora.
Abstract: Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion’s focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models - both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) - comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
[111] AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song, Xunliang Cai
Main category: cs.CL
TL;DR: AMemGym: An interactive environment for on-policy evaluation and optimization of memory-driven personalization in LLM-based assistants, using structured data sampling and LLM-simulated users to bridge structured state evolution with free-form interactions.
Details
Motivation: Current memory management approaches for LLM-based assistants face challenges in training and evaluation. Existing benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability for long-horizon user interactions.Method: AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories. It uses LLM-simulated users that expose latent states through role-play while maintaining structured state consistency, enabling cost-effective generation of high-quality, evaluation-aligned interactions.
Result: Experiments reveal performance gaps in existing memory systems (RAG, long-context LLMs, and agentic memory) and their corresponding reasons. The framework enables effective selection among competing approaches and can potentially drive self-evolution of memory management strategies.
Conclusion: AMemGym provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents by bridging structured state evolution with free-form interactions, addressing current limitations in memory evaluation and optimization.
Abstract: Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
[112] CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou, Xiao Li, Zhengping Zhou, Raymond Zeng, Gelin Zhou, Shigan Chu, Ajay Thampi, Wancen Mu, Nathan Shuster, Ketong Wang, Lin Chen, Jason Brewer, Derek Hao Hu, Alexander McCauley, Jason Weston, Sem Park, Na Zhang, Kevin Tang
Main category: cs.CL
TL;DR: CharacterFlywheel is an iterative process for improving LLMs in social chat apps using real-user data, reward modeling, SFT, and RL, achieving significant engagement and steerability improvements across 15 generations.
Details
Motivation: To develop a systematic, production-ready approach for continuously improving LLMs in social chat applications (Instagram, WhatsApp, Messenger) using real-user feedback and ensuring reliable progress at scale.Method: Iterative flywheel process with 15 generations starting from LLaMA 3.1, using data curation, reward modeling to estimate engagement metrics, supervised fine-tuning, reinforcement learning, and both offline/online evaluation with A/B testing.
Result: 7 of 8 deployed models showed positive engagement lift, with best performers achieving 8.8% improvement in engagement breadth and 19.4% in engagement depth; instruction following increased from 59.2% to 84.8% while violations decreased from 26.6% to 5.8%.
Conclusion: CharacterFlywheel provides a rigorous, scalable framework for improving LLMs in production social applications, demonstrating consistent engagement gains and improved steerability through systematic iteration and real-user feedback.
Abstract: This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
[113] PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu, Ziwei He, Zhouhan Lin
Main category: cs.CL
TL;DR: PonderLM-3 introduces a pretraining framework for token-wise adaptive computation allocation that learns to selectively apply additional computation only where beneficial, using differentiable attention masks during training and hard pruning at inference.
Details
Motivation: Test-time scaling shows that additional inference computation improves generation quality, but current approaches apply uniform extra cost to all tokens. The paper aims to make computation an allocatable resource per token, spending it only where beneficial rather than paying uniform cost.Method: Built on PonderLM-2 backbone, PonderLM-3 uses differentiable attention masks during pretraining with matching hard pruning rules at inference to enable token-wise adaptive pondering. This maintains train-inference consistency while making computation allocation learnable under self-supervised objectives.
Result: PonderLM-3 achieves lower pretraining perplexity at equal inference FLOPs compared to existing recursive or adaptive baselines. On downstream benchmarks, it matches fixed-step PonderLM-2 performance under same maximum computation steps while using fewer inference FLOPs in practice.
Conclusion: PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling efficient allocation of additional inference compute where most useful rather than uniform application to all tokens.
Abstract: Test-time scaling has shown that allocating more additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
[114] MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Main category: cs.CL
TL;DR: MMR-Life is a comprehensive benchmark for evaluating multimodal multi-image reasoning capabilities of MLLMs across 7 reasoning types using 2,646 questions based on 19,108 real-world images.
Details
Motivation: Current MLLMs lack standardized benchmarks for evaluating their multimodal reasoning abilities across diverse real-life scenarios, despite progress in scientific and mathematical reasoning.Method: Created MMR-Life benchmark with 2,646 multiple-choice questions based on 19,108 real-world images covering 7 reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal.
Result: Evaluation of 37 advanced models shows substantial challenges - even top models like GPT-5 achieve only 58% accuracy with considerable variance across reasoning types.
Conclusion: MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving next-generation multimodal reasoning systems.
Abstract: Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs’ reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
[115] EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel, Tanel Alumäe, Eleri Aedmaa, Krister Kruusmaa, Kairit Sirts
Main category: cs.CL
TL;DR: Continued pretraining on Estonian data improves multilingual LLM’s Estonian capabilities while preserving English performance through balanced data mixtures and post-training alignment.
Details
Motivation: LLMs are predominantly trained on English data, leading to uneven performance for smaller languages like Estonian. The paper investigates whether continued pretraining can improve Estonian capabilities in a multilingual LLM while maintaining English and general reasoning performance.Method: Used Llama 3.1 8B as base model, performed continued pretraining with a mixture that increases Estonian exposure while approximating original training distribution through English replay and inclusion of code, mathematics, and instruction-like data. Applied supervised fine-tuning, preference optimization, and chat vector merging for instruction-following behavior.
Result: Evaluation on Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks.
Conclusion: Continued pretraining with appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs without sacrificing performance in other languages.
Abstract: Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
[116] What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy, Robert Frank
Main category: cs.CL
TL;DR: A system for identifying filler-gap constructions in spoken English corpora to study children’s language acquisition, validated on CHILDES data and enabling computational studies.
Details
Motivation: To resolve debates about whether children's acquisition of filler-gap dependencies depends on innate grammatical knowledge or distributional evidence in child-directed speech, by developing tools to quantify this input at scale with fine granularity.Method: Combines constituency and dependency parsing to identify three core filler-gap constructions (matrix wh-questions, embedded wh-questions, relative clauses) and extraction sites (subject, object, adjunct). Validated on human-annotated data and applied to 57 English CHILDES corpora.
Result: System scores well across most categories, enables characterization of children’s filler-gap input and production trajectories, including construction-specific frequencies and extraction-site asymmetries. Fine-grained labels support computational studies demonstrated through filtered corpus training with language models.
Conclusion: The system provides a valuable tool for studying language acquisition and enables future computational work, bridging linguistic theory and computational approaches to language learning.
Abstract: Children’s acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora – matrix wh-questions, embedded wh-questions, and relative clauses – and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children’s filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
[117] Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
Thierry Geoffre, Trystan Geoffre
Main category: cs.CL
TL;DR: Analyzes grammatical reasoning in children using interactive game data, treating slider movements as hypothesis-testing actions to reveal real-time cognitive strategies in French morphosyntactic agreement learning.
Details
Motivation: Traditional assessments only capture final answers, missing the cognitive processes behind grammatical reasoning. The study aims to uncover real-time learning strategies by analyzing fine-grained action sequences from an educational game.Method: Analyzed 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 using a sequence-based learning analytics approach. Introduced Hamming distance to quantify proximity to valid grammatical solutions and examined convergence patterns across exercises with varying difficulty levels.
Result: Determiners and verbs are key difficulty sites; learners often fix verbs first and adjust preceding elements. Exercises with fewer solutions show slower, more erratic convergence. Action sequences deviate from left-to-right processing, revealing dynamic hypothesis revision.
Conclusion: Sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering potential for real-time scaffolding and teacher tools in linguistically diverse classrooms.
Abstract: This study investigates grammatical reasoning in primary school learners through a sequence-based learning analytics approach, leveraging fine-grained action sequences from an interactive game targeting morphosyntactic agreement in French. Unlike traditional assessments that rely on final answers, we treat each slider movement as a hypothesis-testing action, capturing real-time cognitive strategies during sentence construction. Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across exercises with varying levels of difficulty. Results reveal that determiners and verbs are key sites of difficulty, with action sequences deviating from left-to-right usual treatment. This suggests learners often fix the verb first and adjust preceding elements. Exercises with fewer solutions exhibit slower and more erratic convergence, while changes in the closest valid solution indicate dynamic hypothesis revision. Our findings demonstrate how sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering a foundation for real-time scaffolding and teacher-facing tools in linguistically diverse classrooms.
[118] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao
Main category: cs.CL
TL;DR: ClinConsensus is a comprehensive Chinese medical benchmark with 2500 expert-curated cases spanning 36 specialties and 12 clinical task types, featuring a dual-judge evaluation framework with CACS@k scoring for reliable assessment of medical LLMs.
Details
Motivation: Existing medical benchmarks are static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows, necessitating a more comprehensive evaluation framework.Method: Created ClinConsensus benchmark with 2500 expert-curated cases covering full care continuum, developed rubric-based grading with CACS@k scoring, and implemented dual-judge evaluation combining high-capability LLM-as-judge with distilled local judge model via supervised fine-tuning.
Result: Comprehensive assessment revealed substantial heterogeneity across task themes, care stages, and medical specialties; top models showed comparable overall scores but differed in reasoning, evidence use, and longitudinal follow-up capabilities, with treatment planning as key bottleneck.
Conclusion: ClinConsensus provides an extensible benchmark for developing robust, clinically grounded medical LLMs ready for real-world deployment, highlighting the need for improved reasoning and longitudinal capabilities in medical AI systems.
Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care–from prevention and intervention to long-term follow-up–covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
[119] Recursive Think-Answer Process for LLMs and VLMs
Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
Main category: cs.CL
TL;DR: R-TAP introduces a recursive think-answer process with confidence evaluation to improve reasoning accuracy in LLMs and VLMs beyond single-pass inference.
Details
Motivation: Current think-answer reasoners like DeepSeek-R1 still produce errors in single-pass inference despite having self-reflective cues, needing iterative reasoning cycles for better accuracy.Method: Proposes Recursive Think-Answer Process (R-TAP) with confidence generator to evaluate response certainty, using two rewards: Recursively Confidence Increase Reward and Final Answer Confidence Reward.
Result: R-TAP-enhanced models outperform conventional single-pass methods for both LLMs and VLMs, show fewer self-reflective patterns (“Oops!”), and achieve more stable, faster inference-time reasoning.
Conclusion: R-TAP provides an efficient way to refine reasoning processes in AI models, paving the way for more elaborated methods in future AI development.
Abstract: Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like “Oops!”, they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards-Recursively Confidence Increase Reward and Final Answer Confidence Reward-we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of “Oops”-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP pave the way evolving into efficient and elaborated methods to refine the reasoning processes of future AI.
[120] LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
Veronika Solopova, Viktoria Skorik, Maksym Tereshchenko, Alina Haidun, Ostap Vykhopen
Main category: cs.CL
TL;DR: LLMs show distinct behavioral patterns in geopolitical simulations, approximating human decisions initially but diverging over time with strong normative-cooperative framing in explanations.
Details
Motivation: To evaluate how large language models behave as agents in structured geopolitical simulations compared to human decision-makers, particularly in crisis scenarios requiring strategic actions and justifications.Method: Tested six state-of-the-art LLMs across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify decisions across multiple rounds. Compared models to humans on action alignment, risk calibration through action severity, and argumentative framing grounded in international relations theory.
Result: Models approximate human decision patterns in initial rounds but diverge over time, displaying distinct behavioral profiles and strategy updates. LLM explanations consistently show strong normative-cooperative framing focused on stability, coordination, and risk mitigation, with limited adversarial reasoning.
Conclusion: LLMs exhibit systematic differences from human decision-making in geopolitical simulations, with consistent normative biases in their reasoning that may affect their suitability as strategic agents in such environments.
Abstract: Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside results from human results across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through chosen actions’ severity, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
[121] LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
Main category: cs.CL
TL;DR: LongRLVR enhances RLVR with dense context rewards for better long-context reasoning by addressing vanishing gradient issues in context grounding.
Details
Motivation: Standard RLVR fails in long-context scenarios because answer-only rewards are too sparse to guide effective context grounding, leading to vanishing gradients that make learning intractable.Method: Introduces LongRLVR which augments sparse answer rewards with dense, verifiable context rewards that directly incentivize correct grounding information selection, providing robust learning gradients.
Result: Significantly outperforms standard RLVR across Qwen and LLaMA models on long-context benchmarks, boosting scores on RULER-QA from 73.17 to 88.90 and LongBench v2 from 39.8 to 46.5 for 14B models.
Conclusion: Explicitly rewarding the grounding process is critical for unlocking LLMs’ full reasoning potential in long-context applications, solving the optimization challenge of context grounding.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding–the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model’s scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
[122] Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Daniel DeAlcala, Gonzalo Mancera, Javier Irigoyen, Ruben Tolosana, Oscar Delgado, Francisco Jurado, Alvaro Ortigosa
Main category: cs.CL
TL;DR: CrimeNER introduces a crime-related NER dataset (CrimeNERdb) with 1.5k annotated documents and experiments with zero/few-shot NER models for crime information extraction.
Details
Motivation: There's a lack of adequately annotated data for crime-related named entity recognition in real-world scenarios, which is crucial for law enforcement information extraction.Method: Created CrimeNERdb dataset with 1.5k annotated documents from terrorist attack reports and DOJ press notes, defining 5 coarse and 22 fine-grained entity types. Conducted experiments with state-of-the-art NER models and large language models in zero- and few-shot settings.
Result: Developed a comprehensive crime NER dataset and demonstrated performance of various models on crime entity extraction tasks in limited-data scenarios.
Conclusion: The CrimeNER framework and dataset address the data scarcity problem in crime-related NER, enabling better information extraction from crime documents with limited annotated examples.
Abstract: The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task in extracting information about the crime, the criminal, or law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case-study of Crime-related zero- and Few-Shot NER, and a general Crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice’s press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We address the quality of the case-study and the annotated data with experiments on Zero and Few-Shot settings with State-of-the-Art NER models as well as generalist and commonly used Large Language Models.
[123] Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, Shuyue Hu
Main category: cs.CL
TL;DR: AgentSkillOS: A principled framework for organizing, selecting, and orchestrating AI agent skills using capability trees and DAG-based pipelines to scale skill ecosystems.
Details
Motivation: The rapid growth of Claude agent skills creates challenges in effectively leveraging, managing, and scaling the agent skill ecosystem, requiring systematic approaches for skill discovery and orchestration.Method: Two-stage framework: (1) Manage Skills organizes skills into capability trees via recursive categorization for efficient discovery; (2) Solve Tasks retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. Evaluation uses 30 artifact-rich tasks across five categories with LLM-based pairwise evaluation aggregated via Bradley-Terry model.
Result: Tree-based retrieval effectively approximates oracle skill selection, and DAG-based orchestration substantially outperforms native flat invocation even with identical skill sets. Experiments across three ecosystem scales (200 to 200K skills) demonstrate effectiveness.
Conclusion: Structured composition is key to unlocking skill potential in agent ecosystems. AgentSkillOS provides a principled framework for skill selection, orchestration, and ecosystem-level management.
Abstract: The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent’s ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at:https://github.com/ynulihao/AgentSkillOS.
[124] Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Valentin Lacombe, Valentin Quesnel, Damien Sileo
Main category: cs.CL
TL;DR: Reasoning Core: A scalable suite for procedurally generating verifiable symbolic reasoning data across formal domains like planning, logic, grammar parsing, causal reasoning, and equations, with external solver verification and difficulty control.
Details
Motivation: Existing procedural generators for symbolic reasoning data rely on fixed puzzles/templates and lack distributional breadth needed for scaling. There's a need for scalable, verifiable symbolic reasoning data generation across core formal domains to expand language models' reasoning capabilities beyond standard pre-training corpora.Method: Developed Reasoning Core suite that procedurally generates verifiable symbolic reasoning data across five formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing/generation, causal reasoning over random Bayesian networks, and systems of equations. Each task includes external solver verification, continuous difficulty control, and optional solver-derived reasoning traces for supervised training or reward functions for RL.
Result: Mixing Reasoning Core data into pre-training improves downstream reasoning while preserving or slightly improving language modeling quality. Zero-shot evaluations show these tasks challenge frontier models like GPT-5. The system provides scalable, verifiable symbolic reasoning data generation.
Conclusion: Reasoning Core enables scalable generation of verifiable symbolic reasoning data across formal domains, supporting both supervised training and reinforcement learning. This approach expands language models’ reasoning capabilities beyond standard pre-training while maintaining language quality.
Abstract: Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
[125] GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
Main category: cs.CL
TL;DR: A benchmark for evaluating LLM behavior in two-player sequential language-based economic games, focusing on rationality, efficiency, fairness, and natural language’s role in strategic interactions.
Details
Motivation: To understand how LLMs behave in economic and strategic interactions where natural language communication is prevalent, and to assess their rationality, performance compared to humans, efficiency, fairness, and the role of language in strategic contexts.Method: Developed a benchmark with three base families of two-player sequential language-based games with consistent parameterization. Created an open-source framework for interaction simulation and analysis, collecting datasets of LLM vs. LLM and human vs. LLM interactions across various game configurations.
Result: Market parameters and LLM choices have complex, interdependent effects on economic outcomes. The framework enables comparison of LLM behavior across economic contexts, evaluation of individual/collective performance, and quantification of environmental effects on agent behavior.
Conclusion: LLM integration into economic systems requires careful design and analysis due to complex interactions between market parameters and LLM characteristics affecting economic outcomes in language-based strategic interactions.
Abstract: Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? How do they perform compared to humans? Do they tend to reach an efficient and fair outcome? What is the role of natural language in strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. To answer these questions, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents’ performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents. Our results suggest that the market parameters, as well as the choice of the LLMs, tend to have complex and interdependent effects on the economic outcome, which calls for careful design and analysis of the language-based economic ecosystem.
[126] SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark, Yejin Choi
Main category: cs.CL
TL;DR: SimpleToM benchmark tests LLMs’ Theory of Mind across explicit mental state inference and applied reasoning in everyday scenarios, revealing a gap between knowing mental states and applying that knowledge.
Details
Motivation: Current ToM evaluations focus on explicit belief attribution in toy stories, leaving open whether LLMs can implicitly apply ToM knowledge to predict behavior or judge actions in diverse real-world scenarios.Method: Created SimpleToM benchmark with concise stories in everyday settings (supermarkets, hospitals, schools, offices) featuring information asymmetries. Each story has three questions testing: (a) mental state inference, (b) behavior prediction, and (c) behavior judgment.
Result: State-of-the-art models perform well on mental state inference but show sharp performance drops on behavior prediction and further on behavior judgment, exposing a critical gap between explicit ToM knowledge and applied reasoning.
Conclusion: LLMs have fragile social reasoning capabilities - they can infer mental states but struggle to implicitly apply that knowledge for behavior prediction and judgment, highlighting limitations in applied Theory of Mind.
Abstract: Large language models (LLMs) are increasingly tested for a “Theory of Mind” (ToM) - the ability to attribute mental states to oneself and others. Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open the questions of whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior, in diverse scenarios. We introduce SimpleToM, a benchmark that advances ToM evaluation along two novel axes. First, it probes multiple levels of ToM reasoning, from mental state inference (explicit ToM) to behavior prediction and judgment (applied ToM). Second, it situates these tasks in diverse, everyday scenarios - such as supermarkets, hospitals, schools, and offices - where information asymmetries naturally arise (e.g., hidden defects in grocery store items, incomplete information in provider-patient interactions, or restricted access to locked devices). SimpleToM contains concise stories (e.g., “The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier.”), each with three questions that test different degrees of ToM reasoning, asking models to predict: (a) mental states (“Is Mary aware of the mold?”), (b) behaviors (“Will Mary pay for the chips or report the mold?”), and (c) judgments (“Mary paid for the chips. Was that reasonable?”). Experiments reveal a striking gap: state-of-the-art models often reliably infer mental state (a), but fail at applying knowledge about the mental state for secondary predictions, with performance dropping sharply for behavior prediction (b) and further for behavior judgment (c). This exposes a critical fragility in LLMs’ social reasoning in terms of what they know (explicit ToM) versus how well they can implicitly apply that knowledge for predictions (applied ToM).
[127] Scaling Knowledge Graph Construction through Synthetic Data Generation and Distillation
Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu
Main category: cs.CL
TL;DR: SynthKG is a data synthesis pipeline for document-level knowledge graph construction that uses LLMs to generate training data, enabling smaller models to outperform larger ones in KG quality and downstream tasks.
Details
Motivation: Document-level KG construction faces scaling challenges: expensive LLMs are economically nonviable for large corpora, while smaller models produce incomplete graphs. The limitation stems from insufficient training on high-quality document-level KG data rather than model capabilities.Method: Introduces SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. Then fine-tunes a smaller LLM on synthesized data to create Distill-SynthKG for single-step KG generation. Also repurposes QA datasets for evaluation and proposes graph-based retrieval framework for RAG.
Result: Distill-SynthKG surpasses all baseline models in KG quality (including models 8x larger) and improves retrieval and QA tasks. The graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets.
Conclusion: High-quality synthesized training data enables smaller models to achieve superior document-level KG construction, addressing scaling challenges while maintaining quality. The approach also enhances downstream applications like retrieval and question answering.
Abstract: Document-level knowledge graph (KG) construction faces a fundamental scaling challenge: existing methods either rely on expensive large language models (LLMs), making them economically nonviable for large-scale corpora, or employ smaller models that produce incomplete and inconsistent graphs. We find that this limitation stems not from model capabilities but from insufficient training on high-quality document-level KG data. To address this gap, we introduce SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. By fine-tuning a smaller LLM on synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we repurpose existing question-answering datasets to construct KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently improves in retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets.
[128] A Survey of Query Optimization in Large Language Models
Mingyang Song, Mao Zheng
Main category: cs.CL
TL;DR: Survey paper on Query Optimization (QO) techniques for LLMs in RAG systems, introducing a Query Optimization Lifecycle Framework, Query Complexity Taxonomy, and analyzing four atomic operations with over 90 methods.
Details
Motivation: Query quality directly determines retrieval and response performance in RAG systems, making Query Optimization essential for enhancing LLM effectiveness. There's a need for systematic analysis and structured frameworks to guide research and practice in this area.Method: 1) Introduces Query Optimization Lifecycle (QOL) Framework with five phases: Intent Recognition, Query Transformation, Retrieval Execution, Evidence Integration, Response Synthesis. 2) Proposes Query Complexity Taxonomy classifying queries by evidence type (explicit/implicit) and quantity (single/multiple). 3) Analyzes four atomic operations: Query Expansion, Query Decomposition, Query Disambiguation, Query Abstraction covering 90+ methods.
Result: Provides comprehensive survey of QO techniques, structured frameworks for research, and practical guidance for practitioners. Identifies gaps in benchmarks and discusses open challenges including process reward models, efficiency optimization, and multi-modal query handling.
Conclusion: This survey establishes structured foundation for QO research and offers actionable guidance for practitioners, while highlighting future directions including multi-modal query handling challenges.
Abstract: Query Optimization (QO) has become essential for enhancing Large Language Model (LLM) effectiveness, particularly in Retrieval-Augmented Generation (RAG) systems where query quality directly determines retrieval and response performance. This survey provides a systematic analysis of query optimization techniques with three contributions. \textit{First}, we introduce the \textbf{Query Optimization Lifecycle (QOL) Framework}, a five-phase pipeline covering Intent Recognition, Query Transformation, Retrieval Execution, Evidence Integration, and Response Synthesis. \textit{Second}, we propose a \textbf{Query Complexity Taxonomy} that classifies queries along two dimensions: evidence type (explicit vs.\ implicit) and evidence quantity (single vs.\ multiple), establishing principled mappings to optimization strategies. \textit{Third}, we analyze four atomic operations: \textbf{Query Expansion}, \textbf{Query Decomposition}, \textbf{Query Disambiguation}, and \textbf{Query Abstraction}, covering over 90 representative methods. We further examine evaluation methodologies, identify gaps in benchmarks, and discuss open challenges including process reward models, efficiency optimization, and multi-modal query handling. This survey offers both a structured foundation for research and actionable guidance for practitioners.
[129] AStar: Boosting Multimodal Reasoning with Automated Structured Thinking
Jinyang Wu, Mingkuan Feng, Guocheng Zhai, Shuai Zhang, Zheng Lian, Fangrui Lv, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao
Main category: cs.CL
TL;DR: AStar is a training-free automatic structured thinking paradigm for multimodal reasoning that uses “thought cards” - a lightweight library of high-level reasoning patterns - to enhance MLLMs’ visual reasoning without explicit search or post-training.
Details
Motivation: Multimodal LLMs struggle with complex visual reasoning tasks. Current approaches use explicit search (computationally inefficient) or post-training (data/resource intensive, unstable). Need efficient, training-free method to enhance reasoning capabilities.Method: Proposes AStar with “thought cards” - a library of reasoning patterns abstracted from prior samples. For each test problem, adaptively retrieves optimal thought cards and integrates external explicit guidelines with model’s internal implicit reasoning.
Result: Achieves 53.9% accuracy on MathVerse (surpassing GPT-4o’s 50.2%) and 32.7% on MathVision (outperforming GPT-4o’s 30.4%). Shows transferability: thought cards from math reasoning benefit other tasks and general visual perception.
Conclusion: AStar provides efficient, training-free multimodal reasoning enhancement, serving as plug-and-play test-time inference method compatible with other techniques, complementing existing approaches.
Abstract: Multimodal large language models excel across diverse domains but struggle with complex visual reasoning tasks. To enhance their reasoning capabilities, current approaches typically rely on explicit search or post-training techniques. However, search-based methods suffer from computational inefficiency due to extensive solution space exploration, while post-training methods demand substantial data, computational resources, and often exhibit training instability. To address these challenges, we propose \textbf{AStar}, a training-free, \textbf{A}utomatic \textbf{S}tructured \textbf{t}hinking paradigm for multimod\textbf{a}l \textbf{r}easoning. Specifically, we introduce novel ``thought cards’’, a lightweight library of high-level reasoning patterns abstracted from prior samples. For each test problem, AStar adaptively retrieves the optimal thought cards and seamlessly integrates these external explicit guidelines with the model’s internal implicit reasoning capabilities. Compared to previous methods, AStar eliminates computationally expensive explicit search and avoids additional complex post-training processes, enabling a more efficient reasoning approach. Extensive experiments demonstrate that our framework achieves 53.9% accuracy on MathVerse (surpassing GPT-4o’s 50.2%) and 32.7% on MathVision (outperforming GPT-4o’s 30.4%). Further analysis reveals the remarkable transferability of our method: thought cards generated from mathematical reasoning can also be applied to other reasoning tasks, even benefiting general visual perception and understanding. AStar serves as a plug-and-play test-time inference method, compatible with other post-training techniques, providing an important complement to existing multimodal reasoning approaches.
[130] Enhancing Hallucination Detection through Noise Injection
Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yubing Jian, Yao Qin, Roland Memisevic
Main category: cs.CL
TL;DR: A training-free approach for detecting LLM hallucinations by perturbing model parameters during sampling to better capture Bayesian uncertainty
Details
Motivation: LLMs generate plausible but incorrect responses (hallucinations), which is a safety concern. Current detection methods using token distribution sampling are suboptimal for capturing true model uncertainty needed for effective hallucination detection.Method: Proposes a simple, training-free approach that perturbs appropriate subsets of model parameters or hidden unit activations during sampling to better capture Bayesian uncertainty, improving hallucination detection without additional training.
Result: The approach significantly improves inference-time hallucination detection over standard sampling methods across diverse datasets, model architectures, and uncertainty metrics.
Conclusion: Perturbing model parameters during sampling provides a more effective way to capture Bayesian uncertainty for hallucination detection than standard token distribution sampling, offering a simple yet powerful training-free solution.
Abstract: Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is suboptimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
[131] Wikipedia in the Era of LLMs: Evolution and Risks
Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
Main category: cs.CL
TL;DR: Analysis of LLM impact on Wikipedia content and NLP tasks, showing ~1% contamination affecting machine translation benchmarks and RAG effectiveness.
Details
Motivation: To understand how Large Language Models are affecting Wikipedia content and related NLP tasks, and to assess potential risks for future NLP research.Method: Analyzed Wikipedia article content and page views, evaluated LLM impact on NLP tasks (machine translation, RAG), conducted simulations to explore potential risks.
Result: Found ~1% LLM impact on Wikipedia articles in certain categories; machine translation benchmarks may be inflated; RAG effectiveness could decrease due to knowledge contamination.
Conclusion: LLMs haven’t fully changed Wikipedia yet, but empirical findings signal need for careful consideration of future risks in NLP research.
Abstract: In this paper, we present a comprehensive analysis and monitoring framework for the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing article content and page views to study the recent changes in Wikipedia and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been affected by LLMs, with an impact of approximately 1% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift. Moreover, the effectiveness of RAG might decrease if the knowledge has been contaminated by LLMs. While LLMs have not yet fully changed Wikipedia’s language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks in NLP research. We release all the experimental dataset and source code at: https://github.com/HSM316/LLM_Wikipedia
[132] Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
Main category: cs.CL
TL;DR: A meta-analysis of 92 open-source pretrained models reveals that design choices beyond just scale (model size and training tokens) significantly impact downstream performance, with features like data composition and architecture accounting for 3-28% improvement in performance prediction.
Details
Motivation: To understand why smaller models with curated data or different architectures sometimes outperform larger models, and to quantify the impact of various design choices beyond just scale on model capabilities.Method: Meta-analysis of 92 open-source pretrained models across various scales, including state-of-the-art and less conventional models, incorporating features beyond just model size and training tokens to predict downstream performance.
Result: Incorporating design features beyond scale improves ability to predict downstream performance by 3-28% relative to using scale alone. Key findings include optimal trade-off between language and code tasks at 15-25% code, and better performance of rotary embeddings over learned embeddings.
Conclusion: Model design choices significantly impact final capabilities beyond just scale, with data composition and architectural decisions playing crucial roles. The framework enables more systematic investigation of how development choices shape model performance.
Abstract: Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveal insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
[133] Large Language Models in Bioinformatics: A Survey
Zhenyu Wang, Zikang Wang, Jiyue Jiang, Pengan Chen, Xiangyu Shi, Yu Li
Main category: cs.CL
TL;DR: Survey paper reviewing LLM applications in bioinformatics including genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics, with discussion of challenges and future directions.
Details
Motivation: To provide a systematic review of how Large Language Models are revolutionizing bioinformatics by enabling advanced analysis of biological data (DNA, RNA, proteins, single-cell data) and to highlight their transformative potential in precision medicine.Method: Survey methodology - systematic review of recent advancements in LLM applications for bioinformatics, covering multiple domains including genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics.
Result: Comprehensive perspective on current state of LLMs in bioinformatics, identification of key challenges (data scarcity, computational complexity, cross-omics integration), and exploration of future directions including multimodal learning and hybrid AI models.
Conclusion: LLMs have transformative potential in bioinformatics and precision medicine, with future directions pointing toward multimodal learning, hybrid AI approaches, and clinical applications.
Abstract: Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.
[134] A Foundational Individual Mobility Prediction Model based on Open-Source Large Language Models
Zhenlin Qin, Leizhen Wang, Yancheng Ling, Francisco Camara Pereira, Zhenliang Ma
Main category: cs.CL
TL;DR: MoBLLM: A foundational model for individual mobility prediction using lightweight LLM with parameter-efficient fine-tuning for transferable mobility behavior representation across heterogeneous data sources.
Details
Motivation: Existing mobility prediction methods lack transferability and robustness across different data sources and out-of-distribution settings; need for a foundational model that can learn shared mobility representations for zero-shot prediction.Method: Based on lightweight open-source LLM with Parameter-Efficient Fine-Tuning (PEFT) techniques to create cost-effective training pipeline; learns transferable mobility behavior representations across heterogeneous data sources.
Result: Achieves best F1 score and accuracy across six real-world mobility datasets compared to SOTA deep learning models; shows better transferability and cost efficiency than commercial LLMs; robust under network changes, policy interventions, events, and incidents.
Conclusion: MoBLLM provides a generalizable modeling foundation for individual mobility behavior, enabling more reliable and adaptive personalized information services for transportation management.
Abstract: Individual mobility prediction plays a key role in urban transport, enabling personalized service recommendations and effective travel management. It is widely modeled by data-driven methods such as machine learning, deep learning, as well as classical econometric methods to capture key features of mobility patterns. However, such methods are hindered in promoting further transferability and robustness due to limited capacity to learn mobility patterns from different data sources, predict in out-of-distribution settings (a.k.a ``zero-shot"). To address this challenge, this paper introduces MoBLLM, a foundational model for individual mobility prediction that aims to learn a shared and transferable representation of mobility behavior across heterogeneous data sources. Based on a lightweight open-source large language model (LLM), MoBLLM employs Parameter-Efficient Fine-Tuning (PEFT) techniques to create a cost-effective training pipeline, avoiding the need for large-scale GPU clusters while maintaining strong performance. We conduct extensive experiments on six real-world mobility datasets to evaluate its accuracy, robustness, and transferability across varying temporal scales (years), spatial contexts (cities), and situational conditions (e.g., disruptions and interventions). MoBLLM achieves the best F1 score and accuracy across all datasets compared with state-of-the-art deep learning models and shows better transferability and cost efficiency than commercial LLMs. Further experiments reveal its robustness under network changes, policy interventions, special events, and incidents. These results indicate that MoBLLM provides a generalizable modeling foundation for individual mobility behavior, enabling more reliable and adaptive personalized information services for transportation management.
[135] AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Jing Zhang
Main category: cs.CL
TL;DR: AnesSuite is a comprehensive dataset suite for anesthesiology reasoning in LLMs, featuring evaluation benchmarks and training datasets, used to develop Morpheus baseline models that show improved reasoning in both specialized and general domains.
Details
Motivation: LLMs have gained attention in medical applications, but their reasoning capabilities in specialized domains like anesthesiology remain underexplored, creating a need for domain-specific evaluation and training resources.Method: Created AnesSuite dataset suite with AnesBench evaluation benchmark (three reasoning levels) and three training datasets for CPT, SFT, and RLVR. Developed Morpheus baseline models using SFT and GRPO training strategies.
Result: Morpheus achieves substantial improvements in anesthesiology reasoning rivaling larger models, demonstrates enhanced reasoning across general medical and broad-domain benchmarks, and identifies key factors influencing performance.
Conclusion: AnesSuite provides the first comprehensive infrastructure for anesthesiology reasoning in LLMs, and Morpheus serves as an effective baseline, with both resources being open-sourced for community use.
Abstract: The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus not only achieves substantial improvements in anesthesiology that rival larger-scale models, but also demonstrates enhanced reasoning capabilities across general medical and broad-domain benchmarks. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced at https://github.com/MiliLab/AnesSuite.
[136] Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
Cfir Avraham Hadar, Omer Shubi, Yoav Meiri, Amit Heshes, Yevgeni Berzak
Main category: cs.CL
TL;DR: This paper explores whether open-ended reading goals can be automatically decoded from eye movements using multimodal LLMs that combine text and eye tracking data.
Details
Motivation: People approach texts with specific goals that guide their reading behavior, and the researchers want to determine if these goals can be automatically decoded from eye movements alone, which could enable scientific investigation of goal-driven reading and development of educational/assistive technologies.Method: Introduced goal decoding tasks and evaluation frameworks using large-scale eye tracking data with hundreds of text-specific information seeking tasks. Developed and compared several discriminative and generative multimodal LLMs that combine text and eye movement data.
Result: Experiments showed considerable success in selecting correct goals among options and made progress toward free-form textual reconstruction of precise goal formulations.
Conclusion: The work demonstrates the feasibility of decoding reading goals from eye movements using multimodal LLMs, opening doors for further scientific investigation and development of real-time educational/assistive technologies.
Abstract: When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you wonder ``This sounds like science fiction. Does it actually work?’’. More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded solely from eye movements in reading. To address this question, we introduce goal decoding tasks and evaluation frameworks using large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal text and eye movements LLMs for these tasks. Our experiments show considerable success on the task of selecting the correct goal among several options, and even progress towards free-form textual reconstruction of the precise goal formulation. These results open the door for further scientific investigation of goal driven reading, as well as the development of educational and assistive technologies that will rely on real-time decoding of reader goals from their eye movements.
[137] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Main category: cs.CL
TL;DR: Diversity-aware Reward Adjustment (DRA) enhances Group Relative Policy Optimization for LLMs by incorporating semantic diversity rewards to prevent policy collapse and improve coverage of valid reasoning strategies.
Details
Motivation: Standard GRPO uses scalar correctness rewards that are non-injective to semantic content, causing Diversity-Quality Inconsistency where policies collapse into narrow dominant modes while ignoring equally valid but structurally novel reasoning strategies.Method: Proposes DRA framework that calibrates reward signals using semantic density of sampled groups via Submodular Mutual Information and Inverse Propensity Scoring to create repulsive forces against redundancy and bias gradient estimation.
Result: DRA-GRPO outperforms baselines on five math benchmarks, achieving 58.2% average accuracy on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost.
Conclusion: Diversity calibration is critical for data-efficient alignment in LLM post-training, and DRA provides a plug-and-play solution that integrates seamlessly with GRPO variants.
Abstract: Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment. The code is available at https://github.com/xiwenc1/DRA-GRPO.
[138] When Large Language Models are More PersuasiveThan Incentivized Humans, and Why
Philipp Schoenegger, Francesco Salvi, Jiacheng Liu, Xiaoli Nan, Ramit Debnath, Barbara Fasolo, Evelina Leivada, Gabriel Recchia, Fritz Günther, Ali Zarifhonarvar, Joe Kwon, Zahoor Ul Islam, Marco Dehnert, Daryl Y. H. Lee, Madeline G. Reinecke, David G. Kamper, Mert Kobaş, Adam Sandford, Jonas Kgomo, Luke Hewitt, Shreya Kapoor, Kerem Oktar, Eyup Engin Kucuk, Bo Feng, Cameron R. Jones, Izzy Gainsburg, Sebastian Olschewski, Nora Heinzelmann, Francisco Cruz, Ben M. Tappin, Tao Ma, Peter S. Park, Rayan Onyonka, Arthur Hjorth, Peter Slattery, Qingcheng Zeng, Lennart Finke, Igor Grossmann, Alessandro Salatiello, Ezra Karger
Main category: cs.CL
TL;DR: LLMs (Claude 3.5 Sonnet and DeepSeek v3) were tested against incentivized humans in real-time persuasive conversations, showing that LLM persuasiveness is context-dependent based on truthfulness, model type, and interaction frequency.
Details
Motivation: While LLMs are known to be persuasive, there's limited understanding about when and why they outperform humans in persuasion tasks, particularly in interactive settings with varying contexts (truthful vs. deceptive persuasion).Method: Two large-scale experiments comparing LLMs (Claude 3.5 Sonnet and DeepSeek v3) against incentivized human persuaders in real-time conversational settings where they attempted to persuade quiz-takers toward correct or incorrect answers, with linguistic analysis of persuasion texts.
Result: Claude 3.5 Sonnet was more persuasive than humans in both truthful and deceptive contexts, increasing accuracy when truthful but decreasing it when deceptive. DeepSeek v3 showed similar accuracy effects but was only more persuasive in deceptive contexts. LLM persuasiveness waned over repeated interactions unlike humans, and linguistic analysis suggests higher conviction expression by LLMs.
Conclusion: LLM persuasiveness is context-dependent, varying by model, truthfulness of persuasion, and interaction frequency, with implications for AI deployment in persuasive applications.
Abstract: Large Language Models (LLMs) have been shown to be highly persuasive, but when and why they outperform humans is still an open question. We compare the persuasiveness of two LLMs (Claude 3.5 Sonnet and DeepSeek v3) against humans who had incentives to persuade, using an interactive, real-time conversational setting. We demonstrate that LLMs persuasive superiority is context-dependent: it depends on whether the persuasion attempt is truthful (towards the right answer) or deceptive (towards the wrong answer) and on the LLM model, and wanes over repeated interactions (unlike human persuasiveness). In our first large-scale experiment, humans vs LLMs (Claude 3.5 Sonnet) interacted with other humans who were completing an online quiz for a reward, attempting to persuade them toward a given (either correct or incorrect) answer. Claude was more persuasive than incentivized human persuaders both in truthful and deceptive contexts and it significantly increased accuracy if persuasion was truthful, but decreased it if persuasion was deceptive. In a follow-up experiment with Deepseek v3, we replicated the findings about accuracy but found greater LLM persuasiveness only if the persuasion was deceptive. Linguistic analyses of the persuaders texts suggest that these effects may be due to LLMs expressing higher conviction than humans.
[139] The Counting Power of Transformers
Marco Sälzer, Chris Köcher, Alexander Kozachinskiy, Georg Zetzsche, Anthony Widjaja Lin
Main category: cs.CL
TL;DR: Transformers can express highly nonlinear counting properties beyond linear inequalities, specifically all semialgebraic counting properties expressible as boolean combinations of multivariate polynomials.
Details
Motivation: To formally investigate the counting power of transformers, moving beyond existing results that only demonstrate expressivity for (semi-)linear counting properties, and to understand transformers' ability to capture more complex nonlinear counting properties.Method: Developed a formal framework for analyzing transformers’ counting capabilities, proved transformers can capture all semialgebraic counting properties, identified a subclass of transformers that characterizes these properties, and connected this to Hilbert’s tenth problem for undecidability results.
Result: Transformers can express highly nonlinear counting properties (semialgebraic properties), generalizing beyond linear counting properties captured by C-RASP softmax transformers. Established undecidability results for simple transformer models without positional encodings or masking.
Conclusion: Transformers have significantly stronger counting capabilities than previously demonstrated, being able to express complex nonlinear counting properties, which has implications for both theoretical expressivity and practical trainability of such properties.
Abstract: Counting properties (e.g. determining whether certain tokens occur more than other tokens in a given input text) have played a significant role in the study of expressiveness of transformers. In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers’ expressivity only for (semi-)linear counting properties, i.e., which are expressible as a boolean combination of linear inequalities. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., expressible as a boolean combination of arbitrary multivariate polynomials (of any degree). Among others, these generalize the counting properties that can be captured by C-RASP softmax transformers, which capture only linear counting properties. To complement this result, we exhibit a natural subclass of (softmax) transformers that completely characterizes semialgebraic counting properties. Through connections with the Hilbert’s tenth problem, this expressivity of transformers also yields a new undecidability result for analyzing an extremely simple transformer model – surprisingly with neither positional encodings (i.e. NoPE-transformers) nor masking. We also experimentally validate trainability of such counting properties.
[140] RPM: Reasoning-Level Personalization for Black-Box Large Language Models
Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee
Main category: cs.CL
TL;DR: RPM introduces reasoning-level personalization for LLMs, discovering user-specific reasoning structures from behavioral data to guide personalized inference, outperforming response-level methods.
Details
Motivation: Current black-box LLMs produce generic outputs that ignore individual preferences, and existing personalization methods only match final outputs without modeling the underlying reasoning connecting user behavior to responses.Method: RPM automatically discovers user-specific reasoning structures from raw behavioral data, constructs structured models of user behavior using response-influential features and statistical factors, creates personalized reasoning paths, and uses feature-based retrieval to find beneficial examples for guiding inference.
Result: Extensive experiments across four diverse tasks show RPM consistently outperforms existing response-level personalization methods while enhancing both personalization performance and interpretability.
Conclusion: RPM provides a promising direction for black-box LLM personalization by introducing reasoning-level personalization as a new paradigm that improves both performance and interpretability.
Abstract: While black-box large language models are widely deployed, they produce generic outputs that overlook individual user preferences. Current personalization methods are fundamentally limited to response-level personalization; they only match final outputs, failing to model the underlying reasoning that connects user behavior to responses. To address this, this work introduces reasoning-level personalization as a new paradigm and proposes RPM, the first systematic framework that automatically discovers user-specific reasoning structures from raw behavioral data to guide the model’s personalized inference. RPM constructs a structured model of user behavior-built from response-influential features and statistical factors-to create personalized reasoning paths and retrieve beneficial examples for guiding inference through a feature-based retrieval mechanism. Extensive experiments across four diverse tasks demonstrate that RPM consistently outperforms existing response-level methods while simultaneously enhancing both personalization performance and interpretability, providing a promising direction for black-box LLM personalization.
[141] Augmenting Research Ideation with Data: An Empirical Investigation in Social Science
Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang
Main category: cs.CL
TL;DR: LLM-augmented research ideation framework that incorporates metadata during generation and automated validation during selection improves idea feasibility and quality in social science domains.
Details
Motivation: While LLMs show potential for generating novel research ideas, these ideas often lack feasibility and effectiveness. The paper investigates whether augmenting LLMs with relevant data during ideation can improve idea quality.Method: Framework integrates data at two stages: (1) incorporating metadata during idea generation to guide models toward more feasible concepts, and (2) introducing automated preliminary validation during idea selection to assess empirical plausibility. Evaluated in social science domain with focus on climate negotiation topics.
Result: Metadata improves feasibility of generated ideas by 20%, automated validation improves overall quality of selected ideas by 7%. Human study shows LLM-generated ideas with data augmentation and validation inspire researchers and lead to higher quality ideas than those proposed without assistance.
Conclusion: Data-augmented research ideation has practical value in real-world academic settings, highlighting the potential of LLM-assisted ideation when properly augmented with relevant data and validation mechanisms.
Abstract: Recent advancements in large language models (LLMs) demonstrate strong potential for generating novel research ideas, yet such ideas often struggle with feasibility and effectiveness. In this paper, we investigate whether augmenting LLMs with relevant data during the ideation process can improve idea quality. Our framework integrates data at two stages: (1) incorporating metadata during idea generation to guide models toward more feasible concepts, and (2) introducing an automated preliminary validation step during idea selection to assess the empirical plausibility of hypotheses within ideas. We evaluate our approach in the social science domain, with a specific focus on climate negotiation topics. Expert evaluation shows that metadata improves the feasibility of generated ideas by 20%, while automated validation improves the overall quality of selected ideas by 7%. Beyond assessing the quality of LLM-generated ideas, we conduct a human study to examine whether these ideas, augmented with related data and preliminary validation, can inspire researchers in their own ideation. Participants report that the LLM-generated ideas and validation are highly useful, and the ideas they propose with such support are proven to be of higher quality than those proposed without assistance. Our findings highlight the potential of data-augmented research ideation and underscore the practical value of LLM-assisted ideation in real-world academic settings.
[142] RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Xiao Liu, Da Yin, Zirui Wu, Yansong Feng
Main category: cs.CL
TL;DR: RefTool: A framework for LLMs to automatically create executable tools from external references (textbooks, knowledge snippets) and organize them hierarchically for solving knowledge-intensive tasks.
Details
Motivation: LLMs can use tools to enhance reasoning, but many tasks lack predefined tools. Existing tool-creation methods rely too heavily on LLMs' internal knowledge and fail when tasks are outside their knowledge scope.Method: Two-module framework: (1) Tool creation - LLMs generate executable tools from reference content, validate them with examples, and organize hierarchically into toolbox; (2) Tool utilization - LLMs navigate toolbox structure to select and apply appropriate tools to solve problems.
Result: Outperforms existing tool-creation and domain-specific reasoning methods by 12.3% average accuracy on causality, physics, and chemistry benchmarks. Cost-efficient and generalizable to non-scientific tasks like low-resource language translation.
Conclusion: RefTool enables LLMs to overcome internal knowledge limitations by grounding tool creation in external references, advancing generalizable reasoning in knowledge-intensive domains through accurate, faithful tools and effective hierarchical organization.
Abstract: Large Language Models (LLMs) can enhance their reasoning capabilities by using external tools. However, many tasks lack predefined tools. Prior works have explored instructing LLMs to generate tools on their own, but such approaches depend heavily on internal knowledge and struggle when tasks fall outside the model’s knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages external materials, such as textbooks and knowledge snippets. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable to non-scientific tasks, e.g., extremely low-resource language translation. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome internal knowledge limitations, advancing generalizable reasoning in knowledge-intensive domains.
[143] VeriTrail: Closed-Domain Hallucination Detection with Traceability
Dasha Metropolitansky, Jonathan Larson
Main category: cs.CL
TL;DR: VeriTrail is a method for detecting and tracing hallucinations in multi-step generative processes, outperforming baselines on new datasets with intermediate outputs.
Details
Motivation: Language models often generate unsubstantiated content (closed-domain hallucination), which is amplified in multi-step generative processes. Current detection methods focus only on final outputs, lacking traceability of where hallucinations were introduced and how faithful content was derived.Method: VeriTrail is a closed-domain hallucination detection method designed to provide traceability for both multi-step (MGS) and single-step (SGS) generative processes. It analyzes intermediate outputs to trace where hallucinations were likely introduced and how faithful content was derived from source material.
Result: VeriTrail outperforms baseline methods on newly introduced datasets that include all intermediate outputs and human annotations of final outputs’ faithfulness for their respective MGS processes.
Conclusion: Detecting hallucinations in final outputs alone is insufficient for multi-step generative processes; traceability through intermediate outputs is crucial. VeriTrail provides this capability and demonstrates superior performance over existing methods.
Abstract: Even when instructed to adhere to source material, language models often generate unsubstantiated content - a phenomenon known as “closed-domain hallucination.” This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source material through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs’ faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.
[144] RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun
Main category: cs.CL
TL;DR: RedTeamCUA is a framework for testing computer-use agents against indirect prompt injection attacks in hybrid web-OS environments, revealing significant vulnerabilities in current CUAs.
Details
Motivation: Current evaluations of computer-use agent vulnerabilities lack realistic hybrid web-OS attack scenarios and controlled testing environments, creating a gap in understanding real-world security threats.Method: Proposes RedTeamCUA framework with a novel hybrid sandbox combining VM-based OS environment with Docker-based web platforms, enabling flexible adversarial scenario configuration and direct initialization at injection points.
Result: Created RTC-Bench with 864 examples showing significant vulnerabilities: Claude 3.7 Sonnet CUA has 42.9% ASR, Claude 4.5 Sonnet CUA has 60% ASR, and even the most secure (Operator) has 7.6% ASR. Attempt rates reach 92.5%.
Conclusion: RedTeamCUA provides essential framework for systematic CUA vulnerability analysis, highlighting urgent need for robust defenses against indirect prompt injection before real-world deployment due to tangible risks.
Abstract: Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although failing to complete them due to capability limitations. Nevertheless, we observe concerning high ASRs in realistic end-to-end settings, with the strongest-to-date Claude 4.5 Sonnet | CUA exhibiting the highest ASR of 60%, indicating that CUA threats can already result in tangible risks to users and computer systems. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
[145] FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition
John Kirchenbauer, Janny Mongkolsupawan, Yuxin Wen, Tom Goldstein, Daphne Ippolito
Main category: cs.CL
TL;DR: A new synthetic dataset for studying fact memorization vs. verbatim sequence memorization in language models, using fictional events to enable controlled experiments.
Details
Motivation: Language models acquire both linguistic structure knowledge and factual world knowledge during training, but while verbatim sequence memorization is well-studied, fact memorization is less understood. Researchers need better tools to study these dual memorization processes.Method: Created a synthetic dataset of webtext-like documents about fictional events with corresponding question-answer pairs. Conducted training experiments to demonstrate how this synthetic data can be used to study different forms of memorization, while documenting challenges in building realistic fictional synthetic data.
Result: The proposed dataset enables controlled study of fact memorization versus verbatim sequence memorization. Training experiments show the synthetic data’s utility for analyzing memorization processes, though challenges in creating realistic fictional data were identified.
Conclusion: The synthetic dataset provides a valuable tool for researchers to better understand how language models memorize facts versus sequences, addressing a gap in current understanding of model memorization mechanisms.
Abstract: When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can verbatim memorize long sequences from their training data. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be useful for studying different forms of memorization. We also document some challenges in effectively building realistic, fictional synthetic data.
[146] Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Nassir Navab, Matthias Keicher
Main category: cs.CL
TL;DR: LA-CDM: A hypothesis-driven uncertainty-aware language agent for clinical decision-making that iteratively requests and interprets tests to converge on diagnoses using hybrid supervised+reinforcement learning.
Details
Motivation: Current LLM applications in clinical decision support either assume unrealistic immediate availability of all patient information (missing the interactive investigation process) or rely on limited out-of-the-box capabilities without task-specific training. There's a need to model the dynamic, interactive, cyclic nature of clinical decision-making.Method: Proposed LA-CDM: a hypothesis-driven uncertainty-aware language agent that converges toward diagnoses via repeated test requests and interpretations. Uses hybrid training combining supervised and reinforcement learning with three objectives: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making.
Result: Evaluated on MIMIC-CDM dataset covering four abdominal diseases with various clinical tests. Showed benefits of explicitly training clinical decision-making for increasing both diagnostic performance and efficiency.
Conclusion: LA-CDM effectively models the interactive clinical decision-making process, outperforming approaches that either assume complete information upfront or use untrained models, demonstrating improved diagnostic accuracy and efficiency.
Abstract: Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process, however, most applications of LLMs in clinical decision support suffer from one of two limitations: Either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited “out-of-the-box” capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases containing various clinical tests and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.
[147] AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song
Main category: cs.CL
TL;DR: AgentSynth is a scalable pipeline for automatically generating diverse computer-use agent tasks and trajectories using information asymmetry to create challenging long-horizon tasks from simple subtasks.
Details
Motivation: Current methods for creating datasets for computer-use agents rely heavily on expensive human annotations, limiting scalability and diversity. There's a need for cost-effective, automated approaches to generate high-quality training data for generalist agents.Method: AgentSynth leverages information asymmetry to construct subtasks that are simple during generation but become challenging when composed into long-horizon tasks. The pipeline can precisely modulate task complexity by varying the number of subtasks, enabling creation of over 6,000 diverse tasks.
Result: The pipeline achieves low average cost of $0.60 per trajectory (much cheaper than human annotations). State-of-the-art LLM agents show steep performance drop from 18% success at difficulty level 1 to just 4% at level 6, demonstrating benchmark difficulty.
Conclusion: AgentSynth provides a scalable, cost-effective solution for generating diverse computer-use agent datasets, enabling better evaluation and training of generalist agents through controlled complexity modulation.
Abstract: We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark’s difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are available at https://github.com/sunblaze-ucb/AgentSynth
[148] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych
Main category: cs.CL
TL;DR: SPARE is a framework for efficient automated process annotation in LLMs that aligns solution steps to reference solutions and determines accuracy in single generation, showing effectiveness across mathematical reasoning, QA, and spatial reasoning tasks.
Details
Motivation: Process supervision is crucial for advancing multi-step reasoning in LLMs, but high-quality automated process annotation remains challenging. Current methods are inefficient or require extensive human labeling.Method: SPARE (Single-Pass Annotation with Reference-Guided Evaluation) is a structured framework that performs per-step annotation by jointly aligning solution steps to reference solutions and determining accuracy with explicit reasoning in a single generation.
Result: SPARE shows consistent improvements across four diverse datasets (GSM8K, MATH, MuSiQue-Ans, SpaRP) for training Process Reward Models and fine-tuning via offline RL. It achieves data-efficient out-of-distribution generalization using only ~16% of training samples compared to baselines, with 2.3× speedup over MCTS methods.
Conclusion: SPARE provides a practical and scalable solution for automatic process supervision in LLM reasoning, with complementary precision-recall characteristics to MCTS approaches suggesting potential for ensemble methods.
Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine its accuracy with explicit reasoning in single generation. We demonstrate SPARE’s effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering 2.3$\times$ speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.
[149] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
Main category: cs.CL
TL;DR: GenRecal is a general-purpose distillation framework that enables knowledge transfer between heterogeneous vision-language models with different architectures through feature recalibration.
Details
Motivation: While vision-language models have achieved strong performance, their computational demands make deployment on resource-constrained devices challenging. Existing distillation methods are limited to specific VLM architectures due to differences in token types, vocabulary sizes, and token ordering.Method: GenRecal introduces a Recalibrator module that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs regardless of their architectural differences.
Result: Extensive experiments on multiple benchmarks show that GenRecal significantly improves baseline performances and eventually outperforms both large-scale open- and closed-source VLMs.
Conclusion: GenRecal provides a general solution for distilling knowledge across diverse VLM architectures, addressing the deployment challenges of large VLMs on resource-constrained devices.
Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.
[150] OJBench: A Competition Level Code Benchmark For Large Language Models
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu
Main category: cs.CL
TL;DR: OJBench is a challenging benchmark for evaluating competitive-level code reasoning abilities of LLMs using 232 programming competition problems from NOI and ICPC, revealing that even state-of-the-art models struggle with these difficult tasks.
Details
Motivation: Existing code benchmarks are limited in evaluating the full spectrum of LLM capabilities, particularly at competitive programming levels, creating a gap in assessing true reasoning abilities.Method: Created OJBench with 232 programming competition problems from NOI and ICPC, then evaluated 37 models including closed-source and open-source, reasoning-oriented and non-reasoning-oriented models.
Result: Even state-of-the-art reasoning-oriented models like o4-mini and Gemini-2.5-pro-exp struggle with highly challenging competition-level problems, highlighting significant challenges in competitive-level code reasoning.
Conclusion: Current LLMs face substantial difficulties with competitive programming problems, indicating a need for improved reasoning capabilities and better benchmarks for evaluating high-level code reasoning.
Abstract: Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmark are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models’ reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.
[151] When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, Ce Zhang
Main category: cs.CL
TL;DR: Theoretical framework analyzing LLM failures on long texts, proposing multi-agent chunking as solution for tasks with cross-chunk dependencies, context confusion, and imperfect result aggregation.
Details
Motivation: Large Language Models struggle with processing long texts due to various failure modes that emerge with increasing context length, requiring a principled understanding of when and how chunking strategies are effective.Method: Proposes theoretical framework categorizing long context failures into three types: cross-chunk dependence (task noise), confusion growing with context size (model noise), and imperfect integration of partial results (aggregator noise). Analyzes multi-agent chunking effectiveness through experiments on retrieval, QA, and summarization tasks.
Result: Experiments confirm theoretical analysis and identify conditions favoring multi-agent chunking. Shows weaker models with chunk-based processing can surpass advanced models like GPT4o on large inputs due to accelerated decay of model fidelity with input length.
Conclusion: Provides principled framework for understanding LLM long-context challenges and demonstrates careful chunking and aggregation strategies as direct pathway to handling long contexts effectively.
Abstract: We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that distinguishes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a lengthy sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring the accelerated decay of model fidelity with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT4o applied in a single shot. Overall, we present a principled understanding framework and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.
[152] Long-Context Generalization with Sparse Attention
Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso
Main category: cs.CL
TL;DR: ASEntmax introduces learnable temperature to α-entmax for dynamic sparse attention, enabling better long-context generalization and pattern-focused attention in transformers.
Details
Motivation: Softmax attention in transformers produces dense distributions that disperse attention probability mass across non-informative tokens as sequence length increases, leading to representational collapse and poor performance on tasks requiring precise focus on fixed-size patterns.Method: Proposes Adaptive-Scalable Entmax (ASEntmax) which combines α-entmax’s ability to assign exact zeros to irrelevant tokens with a learnable temperature parameter, allowing dynamic interpolation between sparse (pattern-focused) and dense (softmax-like) attention regimes.
Result: ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines, achieving up to 1000× length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance.
Conclusion: Dynamic sparse attention mechanisms like ASEntmax can effectively address the limitations of softmax attention for long-context tasks by enabling precise pattern-focused attention while maintaining flexibility across different attention regimes.
Abstract: Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $α$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $α$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $α$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
[153] LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
Main category: cs.CL
TL;DR: LongWriter-Zero uses reinforcement learning without synthetic data to enable LLMs to generate ultra-long, high-quality text, outperforming traditional supervised fine-tuning methods.
Details
Motivation: Ultra-long text generation by LLMs faces challenges due to maximum length limits and quality degradation with longer sequences. Traditional supervised fine-tuning methods rely on costly synthetic data that often lacks coherence and has structural monotony.Method: Proposes an incentivization-based approach using reinforcement learning starting from a base model (similar to R1-Zero). The method employs specialized reward models to guide LLMs toward better length control, writing quality, and structural formatting without any annotated or synthetic data.
Result: LongWriter-Zero model trained from Qwen2.5-32B consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results on WritingBench and Arena-Write benchmarks, even surpassing 100B+ models like DeepSeek R1 and Qwen3-235B.
Conclusion: The RL-based approach without synthetic data successfully enables LLMs to generate ultra-long, high-quality text, demonstrating superior performance over traditional supervised fine-tuning methods.
Abstract: Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ‘’teaching’’, which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B
[154] Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?
Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit, Jason Benn, Daniel Kang
Main category: cs.CL
TL;DR: RPT improves LLM reasoning but generalizes poorly to new domains with different reasoning patterns
Details
Motivation: To understand how well reinforcement post-training (RPT) improvements generalize to new domains beyond those used in training, since prior work only evaluated on same-domain dataMethod: Two studies: (1) Observational - compare open-weight RPT models against base models across seen/unseen domains; (2) Interventional - fine-tune LLMs with RPT on single domains and evaluate across multiple domains
Result: RPT brings substantial gains on tasks similar to fine-tuning data, but gains generalize inconsistently and can vanish on domains with different reasoning patterns
Conclusion: RPT improvements are domain-specific and don’t reliably transfer to domains with different reasoning patterns, highlighting limitations in generalization
Abstract: Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for post-training. To understand the generalizability of RPT, we conduct two studies with specific focus on Reinforcement Learning with Verifiable Rewards (RLVR). (1) Observational: we compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.
[155] Cognitive models can reveal interpretable value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
Main category: cs.CL
TL;DR: A framework using cognitive models of polite speech to evaluate value trade-offs in LLMs, examining reasoning effort, prompt manipulations, and RL post-training dynamics.
Details
Motivation: Current tools for interpreting dynamic, multi-faceted value trade-offs in language models are limited, while cognitive science provides formal models of such trade-offs in human decision-making.Method: Uses a leading cognitive model of polite speech to evaluate LLMs through two settings: 1) degrees of reasoning “effort” and system prompt manipulations in closed-source models, 2) RL post-training dynamics analysis of open-source models.
Result: LLMs’ behavioral profiles shift predictably when prompted to prioritize certain goals, are amplified by small reasoning budgets, and can diagnose social behaviors like sycophancy. Post-training dynamics show large early value shifts and persistent effects of base model/pretraining data compared to feedback dataset or alignment method.
Conclusion: The framework offers a flexible tool for probing behavioral profiles across diverse model types and gaining insights for shaping training regimes that better control value trade-offs during model development.
Abstract: Value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are limited. In cognitive science, so-called “cognitive models” provide formal accounts of such trade-offs in humans, by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. Here, we show that a leading cognitive model of polite speech can be used to systematically evaluate alignment-relevant trade-offs in language models via two encompassing settings: degrees of reasoning “effort” and system prompt manipulations in closed-source frontier models, and RL post-training dynamics of open-source models. Our results show that LLMs’ behavioral profiles under the cognitive model a) shift predictably when they are prompted to prioritize certain goals, b) are amplified by a small reasoning budget, and c) can be used to diagnose other social behaviors such as sycophancy. Our findings from LLMs’ post-training dynamics reveal large shifts in values early on in training and persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. Our framework offers a flexible tool for probing behavioral profiles across diverse model types and gaining insights for shaping training regimes that better control trade-offs between values during model development.
[156] XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models
Zhu Liu, Zhen Hu, Lei Dai, Yu Xuan, Ying Liu
Main category: cs.CL
TL;DR: XISM is an interactive system for building semantic maps that combines data-driven inference with expert knowledge through visual refinement interfaces.
Details
Motivation: Existing semantic map construction methods face a tension between scalability (fully automated systems) and interpretability (expert-driven approaches), creating a need for systems that balance computational efficiency with expert involvement.Method: XISM uses a top-down procedure to generate candidate semantic maps, then provides a visual interface where users can iteratively refine edges with real-time metric feedback, combining data-driven inference with expert knowledge.
Result: Experiments in three semantic domains and expert interviews show that XISM improves linguistic decision transparency and controllability while maintaining computational efficiency.
Conclusion: XISM provides a collaborative approach for scalable and interpretable semantic-map building, bridging the gap between automated systems and expert-driven methods.
Abstract: Semantic map models visualize systematic relations among semantic functions through graph structures and are widely used in linguistic typology. However, existing construction methods either depend on labor-intensive expert reasoning or on fully automated systems lacking expert involvement, creating a tension between scalability and interpretability. We introduce \textbf{XISM}, an interactive system that combines data-driven inference with expert knowledge. XISM generates candidate maps via a top-down procedure and allows users to iteratively refine edges in a visual interface, with real-time metric feedback. Experiments in three semantic domains and expert interviews show that XISM improves linguistic decision transparency and controllability in semantic-map construction while maintaining computational efficiency. XISM provides a collaborative approach for scalable and interpretable semantic-map building. The system\footnote{https://app.xism2025.xin/} , source code\footnote{https://github.com/hank317/XISM} , and demonstration video\footnote{https://youtu.be/m5laLhGn6Ys} are publicly available.
[157] FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering
Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma
Main category: cs.CL
TL;DR: FrugalRAG: RL-based framework for multi-hop QA that adaptively reduces retrieval steps based on question difficulty, optimizing both accuracy and efficiency.
Details
Motivation: RL has succeeded in math/code tasks but shows limited gains in retrieval-augmented generation (RAG) for multi-hop QA. Current approaches often trail supervised baselines, and there's a need for methods that optimize both final answer accuracy and efficiency in reaching answers through adaptive retrieval step reduction.Method: Two-stage finetuning framework: 1) Supervised finetuning on full-exploration policy generating broad sub-queries, 2) RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Uses only ~1,000 examples instead of 10x more data.
Result: Achieves state-of-the-art efficiency-accuracy tradeoffs on HotPotQA and other multi-hop QA benchmarks, cutting retrieval cost nearly in half. Generalizes zero-shot on challenging BrowseCompPlus benchmark, surpassing SLM-based and other baselines.
Conclusion: Demonstrates RL can effectively reduce reasoning steps rather than increase them, providing scalable and efficient RAG solutions. Shows RL’s potential for optimizing efficiency in multimodal reasoning tasks.
Abstract: Reinforcement learning (RL) based on the final answer’s reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously to optimize both final answer accuracy and efficiency in reaching that answer. We propose FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question’s difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches requiring 10x more data, our method achieves competitive performance with only approximately 1,000 examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency-accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it generalizes zero-shot and surpasses SLM-based and other baselines. These results demonstrate the use of RL not to increase reasoning steps, but to reduce them, as an effective solution for scalable and efficient RAG.
[158] SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng
Main category: cs.CL
TL;DR: SASFT method reduces unexpected code-switching in multilingual LLMs by controlling language feature pre-activation values through sparse autoencoder-guided supervised fine-tuning.
Details
Motivation: Multilingual LLMs suffer from unexpected code-switching (language mixing) that degrades response readability and usability, but existing solutions lack mechanistic analysis and show limited effectiveness.Method: First analyze code-switching using sparse autoencoders to identify excessive pre-activation of language features, then propose SASFT (Sparse Autoencoder-guided Supervised Fine-tuning) that teaches LLMs to maintain appropriate pre-activation values of specific language features during training.
Result: SASFT reduces unexpected code-switching by over 50% compared to standard supervised fine-tuning across five models and three languages, with complete elimination in one case, while maintaining or improving performance on six multilingual benchmarks.
Conclusion: SASFT effectively addresses code-switching in multilingual LLMs through mechanistic understanding and targeted training, preserving multilingual capabilities while improving response consistency.
Abstract: Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose $\textbf{S}$parse $\textbf{A}$utoencoder-guided $\textbf{S}$upervised $\textbf{F}$ine$\textbf{t}$uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in one case. Moreover, SASFT maintains or even improves the models’ performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities. The code and data are available at https://github.com/Aatrox103/SASFT.
[159] Diversity-Enhanced Reasoning for Subjective Questions
Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Jen-tse Huang, Yi R. Fung
Main category: cs.CL
TL;DR: MultiRole-R1 enhances subjective reasoning in Large Reasoning Models by introducing perspective and token-level diversity through unsupervised data synthesis and reinforcement learning with diversity rewards.
Details
Motivation: Current Large Reasoning Models optimized via RLVR excel at objective reasoning but lack diversity for subjective tasks with multiple valid answers depending on different perspectives. There's limited research on enhancing diversity for subjective reasoning tasks.Method: Proposes MultiRole-R1 framework with: 1) Unsupervised data construction pipeline synthesizing reasoning chains with various role perspectives, 2) Reinforcement learning via Group Relative Policy Optimization with reward shaping that includes diversity as a reward signal alongside verifiable rewards.
Result: Training on subjective tasks increases in-domain accuracy by 14.1% and out-of-domain accuracy by 7.64%, even enhances performance on advanced math reasoning (AIME 2024). Diversity proves to be a more consistent indicator of accuracy than reasoning length.
Conclusion: Introducing perspective and token-level diversity significantly improves subjective reasoning capabilities in LRMs, with diversity serving as a better predictor of accuracy than reasoning length.
Abstract: Large Reasoning Models (LRMs) with long chain-of-thought capabilities, optimized via reinforcement learning with verifiable rewards (RLVR), excel at objective reasoning tasks like mathematical problem solving and code generation. However, RLVR is known for degrading generation diversity, which causes LRMs to fall short on subjective reasoning that has multiple answers depending on different role perspectives. While recent studies recognize the importance of diversity-enhanced training in objective reasoning, limited attention has been given to subjective tasks. In this paper, we find that subjective reasoning can be improved by introducing perspective diversity and token-level diversity, with the former one providing a coherent scaffolding anchored to a real-world stakeholder group and the latter one broadening the answer search space. We propose MultiRole-R1, a diversity-enhanced training framework featuring an unsupervised data construction pipeline that synthesizes reasoning chains incorporating various role perspectives. It also employs reinforcement learning via Group Relative Policy Optimization with reward shaping, taking diversity as a reward signal in addition to verifiable reward. Training on subjective tasks solely, MultiRole-R1 increases the in-domain and out-of-domain accuracy by 14.1% and 7.64%, and even enhances the performance on advanced math reasoning such as AIME 2024. We further show that diversity is a more consistent indicator of accuracy than reasoning length.
[160] Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs
Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, Muhammad Bilal Zafar
Main category: cs.CL
TL;DR: LLMs can generalize from rote memorized data using a “memorize-then-generalize” framework, challenging the view that rote learning hinders generalization.
Details
Motivation: To challenge the conventional view that rote learning hinders generalization in LLMs, and to explore whether models can reinterpret rote memorized data through semantically meaningful prompts.Method: Two-phase framework: 1) Rote memorization of factual subject-object associations using synthetic semantically meaningless key tokens, 2) Fine-tuning on small set of semantically meaningful prompts to enable generalization.
Result: LLMs can reinterpret rote memorized data through semantically meaningful prompts, evidenced by emergence of structured, semantically aligned latent representations between key tokens and meaningful prompts across 8 tested models.
Conclusion: Rote memorization doesn’t necessarily hinder generalization in LLMs; models can reinterpret memorized data through semantic prompts, opening possibilities for efficient knowledge injection but also risks of malicious data repurposing.
Abstract: Rote learning is a memorization technique based on repetition. Many researchers argue that rote learning hinders generalization because it encourages verbatim memorization rather than deeper understanding. This concern extends even to factual knowledge, which inevitably requires a certain degree of memorization. In this work, we challenge this view and demonstrate that large language models (LLMs) can, in fact, generalize over rote memorized data. We introduce a two-phase “memorize-then-generalize” framework, where the model first rote memorizes factual subject-object associations using a synthetic semantically meaningless key token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the key token and the semantically meaningful prompts. This surprising finding opens the door to both effective and efficient knowledge injection as well as possible risks of repurposing the memorized data for malicious usage.
[161] DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router
Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng
Main category: cs.CL
TL;DR: DeepSieve is an agentic RAG framework that uses LLMs as knowledge routers to decompose complex queries, route sub-questions to appropriate sources, and filter information through multi-stage distillation for better reasoning and retrieval.
Details
Motivation: LLMs struggle with knowledge-intensive queries due to lack of dynamic access to up-to-date or domain-specific information. Existing RAG methods lack fine-grained control over queries and sources, leading to noisy retrieval and shallow reasoning.Method: DeepSieve uses LLM-as-a-knowledge-router to decompose complex queries into structured sub-questions, recursively routes each to the most suitable knowledge source, and filters irrelevant information through a multi-stage distillation process. The framework emphasizes modularity, transparency, and adaptability.
Result: Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.
Conclusion: DeepSieve provides an effective agentic RAG framework that addresses limitations of existing methods through fine-grained query decomposition and source routing, enabling better handling of knowledge-intensive queries.
Abstract: Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches. Our codes are available at https://github.com/MinghoKwok/DeepSieve.
[162] Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang
Main category: cs.CL
TL;DR: TDBench: A scalable temporal question-answering benchmark using temporal databases to evaluate LLMs on time-sensitive factual knowledge without manual bottlenecks.
Details
Motivation: Facts change over time, requiring LLMs to handle time-sensitive knowledge accurately. Existing TSQA benchmarks face manual bottlenecks limiting scalable and comprehensive evaluation.Method: Systematically constructs TSQA pairs using temporal databases and database techniques (temporal functional dependencies, temporal SQL, temporal joins). Introduces time accuracy metric to assess validity of time references in explanations alongside answer accuracy.
Result: Enables scalable and comprehensive TSQA evaluation while reducing human labor. Complements current Wikipedia/Wikidata-focused approaches by enabling evaluation on application-specific data.
Conclusion: TDBench provides a systematic, scalable approach to evaluating LLMs on time-sensitive factual knowledge using temporal database techniques, addressing limitations of manual benchmark construction.
Abstract: Facts change over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. Although factual Time-Sensitive Question-Answering (TSQA) tasks have been widely developed, existing benchmarks often face manual bottlenecks that limit scalable and comprehensive TSQA evaluation. To address this issue, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques, such as temporal functional dependencies, temporal SQL, and temporal joins. We also introduce a new evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy for a more fine-grained TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing current TSQA evaluation approaches that largely center on Wikipedia/Wikidata by enabling LLM evaluation on application-specific data.
[163] Post-training Large Language Models for Diverse High-Quality Responses
Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, Aldo Pacchiano
Main category: cs.CL
TL;DR: DQO (Diversity Quality Optimization) is a novel RL training method that uses determinantal point processes to jointly optimize LLMs for both quality and semantic diversity, addressing the diversity loss problem in RL post-training.
Details
Motivation: RL post-training for LLMs improves downstream task performance but reduces output diversity, leading to narrow, canonical responses. Existing diversity enhancement methods are limited to inference-time solutions or focus on surface-level differences rather than semantic diversity.Method: DQO uses determinantal point processes to measure semantic diversity as the volume spanned by embeddings of multiple responses. For each prompt, it samples and embeds a group of responses, then uses the determinant of a kernel-based similarity matrix to quantify diversity. The method can be applied on top of existing RL algorithms.
Result: Experiments across instruction-following, summarization, story generation, and reasoning tasks show that DQO substantially improves semantic diversity without sacrificing model quality.
Conclusion: DQO provides an effective training-time solution to the diversity loss problem in RL post-training of LLMs, enabling joint optimization for both quality and semantic diversity across various tasks.
Abstract: Reinforcement learning (RL) has emerged as a popular method for post-training large language models (LLMs). While improving the model’s performance on downstream tasks, it often reduces the model’s output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on surface-level differences. We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. DQO is flexible and can be applied on top of existing RL algorithms. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
[164] Distribution-Aligned Decoding for Efficient LLM Task Adaptation
Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, Yuguang Fang
Main category: cs.CL
TL;DR: SVDecode is a lightweight method for task adaptation that uses steering vectors during decoding to align output distributions, improving performance without additional trainable parameters beyond PEFT adapters.
Details
Motivation: Even with parameter-efficient fine-tuning (PEFT), adapting billion-parameter language models to downstream tasks remains costly. The paper proposes to view task adaptation as output-distribution alignment rather than weight updates.Method: SVDecode extracts a task-aware steering vector from KL divergence gradient between warm-started and pre-trained models’ output distributions, then uses this vector during decoding to steer outputs toward task distribution.
Result: Across three tasks and nine benchmarks, SVDecode paired with PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains on commonsense datasets.
Conclusion: SVDecode offers a lightweight, theoretically grounded path to stronger task adaptation for large language models without adding trainable parameters beyond PEFT adapters.
Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVDecode), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model’s output distribution towards the task distribution. We theoretically prove that SVDecode is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVDecode thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models. Code is available at https://github.com/dl-m9/SVDecode.
[165] Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference
Ben Finkelshtein, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen White
Main category: cs.CL
TL;DR: LLMs as code generators perform best on graph tasks, especially with long-text or high-degree graphs, and remain effective on heterophilic graphs.
Details
Motivation: Despite growing use of LLMs for text-rich graph ML tasks, there's a lack of principled understanding of LLM capabilities when interacting with graph data, particularly in domains like fraud detection and recommendation systems.Method: Large-scale controlled evaluation across multiple axes: LLM-graph interaction modes (prompting, tool-use, code generation), dataset domains (citation, web-link, e-commerce, social networks), structural regimes (homophilic vs heterophilic graphs), feature characteristics (short vs long-text attributes), and model configurations. Also analyzed dependencies by truncating features, deleting edges, and removing labels.
Result: 1) LLMs as code generators achieve strongest overall performance, especially on long-text or high-degree graphs where prompting exceeds token budget. 2) All interaction strategies remain effective on heterophilic graphs, challenging assumptions about LLM collapse under low homophily. 3) Code generation adapts reliance between structure, features, or labels to leverage most informative input type.
Conclusion: Provides comprehensive view of LLM-graph interaction strengths/limitations and key design principles for future approaches, with code generation emerging as most effective strategy for text-rich graph tasks.
Abstract: Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.
[166] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity
Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo
Main category: cs.CL
TL;DR: A prior-based data filtering method using corpus-level term frequency statistics as a fast alternative to perplexity-based filtering for large language model pretraining.
Details
Motivation: Perplexity-based filtering for LLM pretraining is computationally expensive and unreliable with noisy/out-of-distribution data, requiring a faster, more efficient alternative.Method: Uses token priors estimated from corpus-level term frequency statistics, filtering documents based on mean and standard deviation of token priors without requiring model inference.
Result: Achieves highest average performance across 20 downstream benchmarks while reducing time cost by over 1000x compared to PPL-based filtering, with applicability to code, math, and multilingual corpora.
Conclusion: Prior-based filtering provides a simple, efficient, and effective alternative to perplexity-based methods for data selection in LLM pretraining across various domains.
Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision
[167] Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou
Main category: cs.CL
TL;DR: GTO (Group Tree Optimization) aligns draft model training with decoding-time tree policies for speculative decoding, improving acceptance length and inference speedup through draft tree rewards and group-based policy training.
Details
Motivation: Existing speculative decoding methods suffer from draft policy misalignment - training objectives optimize single greedy paths while decoding uses tree policies that verify multiple branches, limiting achievable speedups.Method: Introduces two components: (1) Draft Tree Reward - a sampling-free objective measuring expected acceptance length of draft trees under target model; (2) Group-based Draft Policy Training - stable optimization contrasting trees from current and frozen reference models using debiased group-standardized advantages and PPO-style updates along longest accepted sequences.
Result: Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) tasks with multiple LLMs (LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, etc.), GTO increases acceptance length by 7.4% and yields additional 7.7% speedup over prior state-of-the-art EAGLE-3.
Conclusion: GTO bridges draft policy misalignment in speculative decoding, offering a practical, general solution for efficient LLM inference with provable improvements in acceptance length and speedup.
Abstract: Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward provably improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B, Qwen3-8B), GTO increases acceptance length by (7.4%) and yields an additional (7.7%) speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference. Code and draft models are available at https://github.com/hsj576/GTO.
[168] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang
Main category: cs.CL
TL;DR: ReMemR1 enhances long-context QA by integrating memory retrieval into memory updates and using multi-level rewards to improve reasoning over dispersed evidence.
Details
Motivation: Current LLMs struggle with long-context QA where evidence is scattered across millions of tokens. Existing "memorize while reading" methods suffer from pruning of latent evidence, information loss through overwriting, and sparse RL signals.Method: ReMemR1 integrates memory retrieval into memory update process, enabling selective callback of historical memories for non-linear reasoning. Uses multi-level reward design combining final-answer rewards with dense step-level signals to guide effective memory use.
Result: Extensive experiments show ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering with negligible computational overhead.
Conclusion: ReMemR1 effectively trades marginal cost for robust long-context reasoning by mitigating information degradation and improving supervision through memory retrieval integration and multi-level rewards.
Abstract: Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory buffer that is dynamically updated via a linear document scan, also known as the “memorize while reading” methods. While this approach scales efficiently, it suffers from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively callback historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design, which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long-context reasoning.
[169] Non-Collaborative User Simulators for Tool Agents
Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo
Main category: cs.CL
TL;DR: A user simulator framework that generates non-collaborative user behaviors to stress-test tool agents, revealing their vulnerabilities to real-world user interactions.
Details
Motivation: Existing user simulators for tool agents are too cooperative and agent-friendly, failing to prepare agents for real-world non-collaborative users who exhibit challenging behaviors like requesting unavailable services, digressing, expressing impatience, or providing incomplete information.Method: Proposes a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. The simulator can generate challenging behaviors while still delivering necessary task information.
Result: Experiments on MultiWOZ and τ-bench show significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users, revealing weaknesses like escalated hallucinations and dialogue breakdowns under each non-collaborative condition.
Conclusion: Tool agents need improved robustness to handle diverse real-world user behaviors. The released extensible simulation framework helps the community develop and stress-test tool agents under realistic conditions within their service domains.
Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, failing to train and test agents against non-collaborative users in the real world. We propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and τ-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users, as well as agent weaknesses under each non-collaborative condition such as escalated hallucinations and dialogue breakdowns. Our findings point to the need for methods that can improve agent robustness to the wide range of user behaviors encountered in deployment. We release the extensible simulation framework to help the community develop and stress-test tool agents under realistic conditions within their own service domains. Our code is available at https://github.com/holi-lab/NCUser.
[170] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs
Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu
Main category: cs.CL
TL;DR: ROSA enables LLMs to adapt to user feedback during multi-turn conversations via one-step parameter updates, improving performance in extended interactions.
Details
Motivation: LLMs trained on static single-turn data degrade in extended multi-turn interactions, lacking ability to adapt to real-time user feedback during conversations.Method: Proposes T2PAM paradigm using user feedback as reward signal, then introduces ROSA algorithm for one-step parameter updates toward theoretical optimal policy without iterative optimization.
Result: ROSA achieves significant improvements in task effectiveness and efficiency on challenging benchmarks, with theoretical convergence guarantees to user preferences.
Conclusion: ROSA enables efficient in-conversation self-correction for LLMs through lightweight test-time adaptation, addressing the multi-turn interaction degradation problem.
Abstract: Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that the policy of ROSA converges to the preference of user as the number of interactions increases. Extensive experiments on challenging benchmark demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.
[171] Train Once, Answer All: Many Pretraining Experiments for the Cost of One
Sebastian Bordt, Martin Pawelczyk
Main category: cs.CL
TL;DR: A method for conducting multiple controlled pretraining experiments simultaneously in a single training run to study LLM behavior while reducing computational costs.
Details
Motivation: Pretraining experiments are valuable for studying LLM behavior but computationally expensive. The authors aim to enable multiple experiments within one training run to reduce costs while maintaining scientific rigor.Method: Proposes conducting multiple experiments simultaneously during a single training run, validated with 10 experiments on 210B tokens with models up to 2.7B parameters. Introduces Continual Pretraining Dependence Testing (CPDT) to test for interactions between experiments.
Result: Successfully replicated previous results on data contamination, poisoning, and memorization, and conducted novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. Found minimal influence of experiments on training dynamics and performance, with negligible interactions between experiments.
Conclusion: Multiple pretraining experiments can be conducted within a single training run, enabling rigorous scientific experimentation with large models on a limited compute budget.
Abstract: Recent work has demonstrated that controlled pretraining experiments are a powerful tool for studying the relationship between training data and large language model (LLM) behavior. However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose a new approach where multiple experiments are conducted simultaneously during a single training run. We validate our approach by performing ten experiments while training on 210B tokens, with models of up to 2.7B parameters. Although models are trained only once, we can replicate the results of multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until a model acquires a particular piece of knowledge. Remarkably, the influence of the experiments on the model’s training dynamics and overall performance is minimal. However, interactions between experiments may act as a confounder in our approach. We propose continual pretraining dependence testing (CPDT), a novel technique to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our results suggest that performing multiple pretraining experiments within a single training run can enable rigorous scientific experimentation with large models on a compute budget.
[172] Prompt and Parameter Co-Optimization for Large Language Models
Xiaohe Bo, Rui Li, Zexu Sun, Quanyu Dai, Zeyu Zhang, Zihang Tian, Xu Chen, Zhenhua Dong
Main category: cs.CL
TL;DR: MetaTuner is a framework that jointly optimizes prompts and fine-tunes parameters for LLMs, enabling synergistic improvement through shared knowledge encoding and supervised regularization.
Details
Motivation: Prompt optimization and fine-tuning are complementary approaches to improving LLMs (explicit natural language vs implicit parameter updates), but prior work has studied them in isolation, leaving their synergistic potential unexplored.Method: Introduces MetaTuner with two neural networks: one to generate prompts and another to generate parameters, sharing a common bottom encoding layer for knowledge sharing. Uses supervised regularization loss to handle the discrete optimization of prompts and continuous optimization of fine-tuning parameters.
Result: Extensive experiments across diverse benchmarks show that MetaTuner consistently outperforms baseline methods.
Conclusion: Joint integration of prompt optimization and fine-tuning through MetaTuner framework enables synergistic improvements in LLM performance beyond what either approach can achieve alone.
Abstract: Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, in this paper, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. By the guidance of the final supervised signals, our framework is optimized to discover the optimal combinations between the prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.
[173] SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo
Main category: cs.CL
TL;DR: SimuHome is a high-fidelity smart home simulator and benchmark for LLM-based agents that simulates dynamic environmental changes from device operations and supports workflow scheduling, grounded in the Matter protocol standard.
Details
Motivation: Existing smart home benchmarks treat homes as static systems without simulating how device operations affect environmental variables over time or supporting workflow scheduling of device commands, limiting realistic agent evaluation.Method: Built on the Matter protocol (industry standard for smart devices), SimuHome provides APIs for agents to interact with devices and observe continuous environmental variable changes. It includes a benchmark of 600 episodes covering state inquiry, implicit intent inference, explicit device control, and workflow scheduling with both feasible and infeasible requests. Time acceleration enables immediate evaluation of scheduled workflows.
Result: Evaluation of 18 agents shows workflow scheduling is the hardest category, with failures persisting across alternative agent frameworks and fine-tuning. The simulator successfully reveals agent limitations in complex scheduling tasks.
Conclusion: SimuHome’s time-accelerated simulation could serve as a valuable environment for agents to pre-validate actions before real-world deployment, addressing critical gaps in current smart home agent evaluation.
Abstract: We introduce $\textbf{SimuHome}$, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents. Existing smart home benchmarks treat the home as a static system, neither simulating how device operations affect environmental variables over time nor supporting workflow scheduling of device commands. SimuHome is grounded in the Matter protocol, the industry standard that defines how real smart home devices communicate and operate. Agents interact with devices through SimuHome’s APIs and observe how their actions continuously affect environmental variables such as temperature and humidity. Our benchmark covers state inquiry, implicit user intent inference, explicit device control, and workflow scheduling, each with both feasible and infeasible requests. For workflow scheduling, the simulator accelerates time so that scheduled workflows can be evaluated immediately. An evaluation of 18 agents reveals that workflow scheduling is the hardest category, with failures persisting across alternative agent frameworks and fine-tuning. These findings suggest that SimuHome’s time-accelerated simulation could serve as an environment for agents to pre-validate their actions before committing them to the real world.
[174] SUIT: Knowledge Editing with Subspace-Aware Key-Value Mappings
Haewon Park, Sangwoo Kim, Yohan Jo
Main category: cs.CL
TL;DR: SUIT proposes subspace-based knowledge editing that computes key/value vectors only within edit-critical feature subspaces to reduce unintended model perturbations while maintaining editing performance.
Details
Motivation: Current locate-then-edit methods for knowledge editing in language models often cause significant unintended perturbations beyond targeted edits because they compute key and value vectors without proper constraints, leading to model degradation.Method: SUIT (Subspace Knowledge Edit) identifies and operates within the subspace of critical features relevant to each edit, computing key and value vectors only within this subspace to minimize unintended perturbations while effectively editing knowledge.
Result: Empirical results on LLaMA3, GPT-J, and Qwen2.5 models show SUIT dramatically improves knowledge preservation over strong baselines while maintaining high editing performance, with reduced unintended perturbations in hidden states.
Conclusion: Edit-critical subspace identification is a key principle for reliable, low-perturbation knowledge editing, establishing SUIT as an effective approach that confines updates to directions more effective for editing while preserving model integrity.
Abstract: Knowledge editing aims to efficiently correct factual errors in language models. Widely used locate-then-edit methods update an MLP layer by adjusting its weights to change the mapping between the layer’s input vector (key) and output vector (value), thereby editing the model’s knowledge. As this update is driven by key and value vectors, obtaining these vectors without careful constraints causes significant model perturbations beyond the targeted edit, a common issue in many prior knowledge editing methods. To address this, we propose Subspace Knowledge Edit (SUIT), which computes key and value vectors only within the subspace of critical features relevant to the edit. Our empirical results on LLaMA3, GPT-J, and Qwen2.5 models show that SUIT dramatically improves knowledge preservation over strong baselines while maintaining high editing performance. These results support the claim that SUIT successfully identifies the critical subspace for the edit. Beyond quantitative gains, our analyses show that SUIT reduces unintended perturbations in hidden states while confining updates to directions that are more effective for editing. Taken together, these findings establish edit-critical subspace identification as a key principle for reliable, low-perturbation knowledge editing. Our code is available at https://github.com/holi-lab/SUIT.
[175] EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang, Weiming Lu, Yueting Zhuang, Yongliang Shen
Main category: cs.CL
TL;DR: EasySteer is a unified framework for efficient LLM steering that integrates with vLLM to achieve 10.8-22.3× speedup over existing methods, offering modular architecture and pre-computed steering vectors for various applications.
Details
Motivation: Existing LLM steering frameworks suffer from computational inefficiency, limited extensibility, and restricted functionality, hindering both research progress and practical deployment of controllable language models.Method: Built on vLLM with modular architecture featuring pluggable interfaces for analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system.
Result: Achieves 10.8-22.3× speedup over existing frameworks through deep integration with vLLM’s optimized inference engine, demonstrating effectiveness in overthinking mitigation, hallucination reduction, and other key applications.
Conclusion: EasySteer transforms steering from a research technique to a production-ready capability, establishing critical infrastructure for deployable, controllable language models.
Abstract: Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, offering a lightweight alternative to expensive retraining. However, existing steering frameworks suffer from critical limitations: computational inefficiency, limited extensibility, and restricted functionality that hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system features modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM’s optimized inference engine, EasySteer achieves 10.8-22.3$\times$ speedup over existing frameworks. Extensive experiments demonstrate its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer transforms steering from research technique to production-ready capability, establishing critical infrastructure for deployable, controllable language models.
[176] Calibrating Verbalized Confidence with Self-Generated Distractors
Victor Wang, Elias Stengel-Eskin
Main category: cs.CL
TL;DR: DINCO improves LLM confidence calibration by normalizing verbalized confidence across self-generated distractors to account for suggestibility bias, outperforming self-consistency with fewer inference calls.
Details
Motivation: LLM verbalized confidence scores are often miscalibrated (overconfident on low-accuracy claims), harming trust and safety. This overconfidence stems from suggestibility when LLMs encode little information about claims.Method: Proposes Distractor-Normalized Coherence (DINCO): 1) LLM generates distractors (alternative claims), 2) verbalizes confidence independently for each distractor, 3) normalizes by total verbalized confidence to account for suggestibility bias. Combines with generator-validator disagreement for further calibration.
Result: DINCO provides less saturated, more usable confidence estimates. Outperforms self-consistency baselines with fewer inference calls (DINCO at 10 calls beats self-consistency at 100 calls). Empirical validation shows more suggestibility on lower-accuracy claims.
Conclusion: DINCO effectively addresses LLM overconfidence by normalizing for suggestibility bias through distractor-based coherence measurement, providing better calibrated confidence estimates crucial for trustworthy LLM deployment.
Abstract: Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM’s heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM’s suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated – and therefore more usable – confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
[177] MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán
Main category: cs.CL
TL;DR: MENLO is a framework for evaluating native-like quality of LLM responses across languages using audience design principles, with a dataset of 6,423 human-annotated preference pairs in 47 languages.
Details
Motivation: Ensuring native-like quality of LLM responses across many languages is challenging, requiring better evaluation frameworks that can assess multilingual proficiency effectively.Method: Introduced MENLO framework based on audience design-inspired mechanisms; created human-annotated dataset covering four quality dimensions; evaluated zero-shot LLM judges with pairwise evaluation and structured rubrics; improved through fine-tuning with reinforcement learning, reward shaping, and multi-task learning.
Result: Zero-shot LLM judges benefit from pairwise evaluation and structured rubrics but still underperform humans; substantial improvements achieved through fine-tuning; RL-trained judges can serve as generative reward models to enhance LLMs’ multilingual proficiency, though discrepancies with human judgment remain.
Conclusion: MENLO provides promising directions for scalable multilingual evaluation and preference alignment, with released dataset and framework supporting further research in multilingual LLM evaluation.
Abstract: Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs’ multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation (https://huggingface.co/datasets/facebook/menlo).
[178] Energy-Regularized Sequential Model Editing on Hyperspheres
Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng
Main category: cs.CL
TL;DR: SPHERE: A hyperspherical energy-regularized editing method that stabilizes neuron weight distributions during sequential model editing to prevent catastrophic forgetting while enabling reliable knowledge updates.
Details
Motivation: Large language models need constant updates to stay current, but sequential editing often causes catastrophic forgetting and destabilizes representations. The paper hypothesizes that maintaining hyperspherical uniformity of neuron weights can help models remain stable while accommodating new knowledge.Method: Uses Hyperspherical Energy (HE) to quantify neuron uniformity during editing, identifies correlation between HE dynamics and editing performance. Proposes SPHERE: Sparse Projection for Hyperspherical Energy-Regularized Editing, which identifies a sparse space complementary to principal hyperspherical directions of pretrained weight matrices and projects new knowledge onto it to attenuate perturbations.
Result: SPHERE outperforms best baselines in editing capability by average of 16.41% on LLaMA3 (8B) and Qwen2.5 (7B), while better preserving general model performance. Shows strong correlation between HE dynamics and editing performance, with editing failures coinciding with high HE fluctuations.
Conclusion: Hyperspherical uniformity is crucial for stable sequential editing. SPHERE provides a principled approach to reliable large-scale knowledge editing by stabilizing neuron weight distributions, offering a lightweight alternative to full retraining for keeping LLMs updated.
Abstract: Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable, retain prior knowledge, while still accommodate new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveals a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
[179] PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li
Main category: cs.CL
TL;DR: PoLi-RL: A novel Point-to-List Reinforcement Learning framework for Conditional Semantic Textual Similarity that uses a two-stage curriculum with parallel slice ranking rewards to optimize LLMs for ranking-based conditional judgment tasks.
Details
Motivation: Existing C-STS methods are limited to discriminative models and fail to leverage recent advances in LLMs and RL. RL is well-suited for C-STS as it can directly optimize non-differentiable ranking metrics and guide reasoning, but naive listwise RL fails due to complex reward signals causing optimization difficulties.Method: PoLi-RL uses a two-stage curriculum: first trains with simple pointwise reward for basic scoring, then transitions to hybrid reward combining pointwise, pairwise, and listwise objectives. Introduces Parallel Slice Ranking Reward (PSRR) that computes ranking rewards in parallel slices (completions with same index from different samples) for granular credit assignment.
Result: Achieves Spearman correlation coefficient of 48.18 on official C-STS benchmark, establishing new SOTA for cross-encoder architecture.
Conclusion: First successful application of RL to C-STS, introducing powerful paradigm for aligning LLMs for complex, ranking-based conditional judgment tasks.
Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully leverage recent breakthroughs in the NLP community involving Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. Nevertheless, we find that naively applying listwise RL fails to produce meaningful improvements, as the model struggles with complex, coarse-grained reward signals, leading to optimization difficulties. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with a simple pointwise reward to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model’s ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice consists of completions with the same index from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful paradigm for aligning LLMs for complex, ranking-based conditional judgment tasks.
[180] TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA
Chanjoo Jung, Jaehyung Kim
Main category: cs.CL
TL;DR: TiTok enables LoRA transplantation across different LLMs via token-level contrastive knowledge transfer without needing additional models or synthetic data generation.
Details
Motivation: Current PEFT methods like LoRA create model-dependent parameters that can't be transferred between different base models, and existing solutions like knowledge distillation depend on training data or require additional discriminator models.Method: TiTok uses token-wise contrastive excess between a source model with and without LoRA to capture task-relevant information, enabling selective filtering of synthetic data without additional models or overhead.
Result: Experiments on three benchmarks across multiple transfer settings show TiTok consistently effective with average performance gains of +4~10% compared to baselines.
Conclusion: TiTok provides an efficient framework for LoRA transplantation that works across different model backbones without requiring additional models or complex synthetic data generation.
Abstract: Large Language Models (LLMs) are widely applied in real world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of +4~10% compared to baselines overall.
[181] SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, Wen Xiao
Main category: cs.CL
TL;DR: SwiReasoning is a training-free framework that dynamically switches between explicit chain-of-thought and latent reasoning in LLMs based on confidence estimates, improving both accuracy and token efficiency.
Details
Motivation: Current latent reasoning approaches face two main challenges: 1) purely latent reasoning broadens search distribution, diffusing probability mass and hurting accuracy, and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency.Method: SwiReasoning dynamically switches between explicit and latent reasoning guided by block-wise confidence estimated from entropy trends in next-token distributions. It also limits maximum thinking-block switches to curb overthinking.
Result: On mathematics, STEM, coding, and general benchmarks, SwiReasoning improves average accuracy by 1.8%-3.1% across different LLMs. Under constrained budgets, it improves average token efficiency by 57%-79%, with larger gains as budgets tighten.
Conclusion: SwiReasoning effectively addresses challenges in latent reasoning by balancing exploration and exploitation through dynamic switching between explicit and latent reasoning, improving both accuracy and efficiency.
Abstract: Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics, STEM, coding, and general benchmarks, SwiReasoning consistently improves average accuracy by 1.8%-3.1% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 57%-79%, with larger gains as budgets tighten.
[182] Training Large Language Models To Reason In Parallel With Global Forking Tokens
Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan
Main category: cs.CL
TL;DR: SSFT introduces set-based supervised fine-tuning with bipartite matching to preserve diverse reasoning modes in LLMs, enabling better parallel test-time compute scaling for complex reasoning tasks.
Details
Motivation: Current methods for scaling parallel test-time compute in LLMs struggle to generate both diverse and accurate reasoning paths, especially for challenging problems where forking tokens that trigger diverse correct reasoning are deep in the sampling tree. Temperature scaling worsens the diversity-accuracy trade-off.Method: Treats parallel reasoning as a set-of-next-token-prediction problem, incorporates a set-based global loss into SFT using bipartite matching between global forking tokens and unique reasoning traces. Proposes Set Supervised Fine-Tuning (SSFT) to preserve unique reasoning modes and Global Forking Policy Optimization (GFPO) to leverage maximally steerable tokens.
Result: SSFT preserves unique reasoning modes and produces emergent global forking tokens. GFPO models consistently outperform SFT counterparts with GRPO on both math reasoning and execution-based code generation benchmarks.
Conclusion: The proposed set-based approach effectively addresses the diversity-accuracy trade-off in parallel reasoning, enabling better scaling of test-time compute for complex reasoning tasks through preserved reasoning diversity and optimized forking tokens.
Abstract: Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Global Forking Policy Optimization (GFPO) leverages these maximally steerable tokens to incentivize complex reasoning, and the resulting models consistently outperform their SFT counterparts with GRPO on both math reasoning and execution-based code generation benchmarks.
[183] Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing
Main category: cs.CL
TL;DR: Self-play preference optimization performance varies significantly with prompt difficulty - easy prompts outperform difficult ones, and training on only the easiest 30% of prompts improves overall performance on benchmarks.
Details
Motivation: To investigate how prompts of varying difficulty influence self-play preference optimization, as prompts are a core component but remain underexplored in this pipeline.Method: Used mean reward of sampled responses as a proxy for prompt difficulty, analyzed performance across different difficulty levels, and explored strategies to mitigate negative effects of difficult prompts.
Result: Difficult prompts show inferior optimization performance compared to easy prompts; incorporating difficult prompts degrades overall performance; performance improves as difficulty decreases; model capacity helps close the gap; training on only 30% easiest prompts improves performance on AlpacaEval~2 and Arena-Hard.
Conclusion: Prompt difficulty significantly impacts self-play preference optimization, and selective training on easier prompts can improve overall alignment performance.
Abstract: Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically involves a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, which can be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component in this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We use the mean reward of sampled responses of a prompt as a proxy for its difficulty. We first find that difficult prompts exhibit substantially inferior self-play optimization performance compared to easy prompts for language models. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. Third, there is a clear upward trend in optimization performance as prompt difficulty decreases. We also observe that the performance gap between difficult and easy prompts tends to close as the model capacity increases, suggesting that prompt difficulty interacts with the model capacity. Building on these findings, we explore strategies to mitigate the adversary effect of difficult prompts on final performance. We demonstrate that only training on a small portion (30%) of the easiest prompts improves overall self-play performance on AlpacaEval~2 and Arena-Hard. We also report failed attempts and lessons learned.
[184] MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation
Qin Dong, Yuntian Tang, Heming Jia, Yunhang Shen, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Shaohui Lin, Rongrong Ji
Main category: cs.CL
TL;DR: MASA introduces a multi-A, single-B LoRA variant that uses multiple specialized down-projection experts shared across layers to overcome representational bottlenecks in parameter-efficient fine-tuning.
Details
Motivation: Standard LoRA's single down-projection matrix creates a representational bottleneck insufficient for capturing diverse signals in complex tasks, motivating architectural enrichment for better feature adaptation.Method: Proposes MASA with multi-A, single-B structure where multiple specialized down-projection experts are asymmetrically shared across layers, integrated by layer-specific up-projection matrices for parameter efficiency.
Result: Achieves 59.62% average accuracy on MMLU benchmark, outperforming standard LoRA by 1.08 points (1.84% relative improvement) with comparable 0.52% learnable parameters.
Conclusion: MASA effectively addresses LoRA’s representational bottleneck through multi-expert feature adaptation while maintaining parameter efficiency, demonstrating improved downstream task performance.
Abstract: Low-Rank Adaptation (LoRA) has emerged as a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models, which augments the transformer layer with one down-projection $A$ and one up-projection $B$. However, LoRA’s reliance on a single down-projection matrix ($A$) creates a representational bottleneck, as this solitary feature extractor is inherently insufficient for capturing the diverse signals required by complex tasks. This motivates our architectural shift to focus on enriching the feature adaptation to improve the downstream task adaptation ability. We propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a multi-$A$, single-$B$ structure where the multi-$A$ expert ensemble is asymmetrically shared across layers to ensure parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$-matrix. The effectiveness and versatility of our method are validated through a comprehensive suite of experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming the standard LoRA by 1.08 points (a relative improvement of 1.84%) with comparable learnable parameters of 0.52%.
[185] LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding
Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, Sujeeth Bharadwaj, Kyu Han, Tao Sheng, Sujith Ravi, Morteza Dehghani, Dan Roth
Main category: cs.CL
TL;DR: LAD-RAG: Layout-Aware Dynamic RAG framework for visual document QA that captures layout structure and cross-page dependencies using symbolic document graphs, enabling adaptive evidence retrieval.
Details
Motivation: Current RAG methods for visually rich documents fail to capture structural organization and cross-page dependencies, leading to incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks.Method: Proposes LAD-RAG with two-phase approach: 1) Ingestion phase constructs symbolic document graph capturing layout structure and cross-page dependencies alongside neural embeddings; 2) Inference phase uses LLM agent to dynamically interact with both neural and symbolic indices for adaptive evidence retrieval.
Result: Achieves over 90% perfect recall on average without top-k tuning, outperforms baseline retrievers by up to 20% in recall at comparable noise levels, and yields higher QA accuracy with minimal latency on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA benchmarks.
Conclusion: LAD-RAG effectively addresses limitations of conventional RAG for visually rich documents by incorporating layout awareness and dynamic retrieval, significantly improving evidence retrieval and QA performance for multi-page reasoning tasks.
Abstract: Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents’ structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.
[186] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Jingbiao Mei, Mingsheng Sun, Jinghong Chen, Pengda Qin, Yuhong Li, Da Chen, Bill Byrne
Main category: cs.CL
TL;DR: ExPO-HM is a new method for hateful meme detection that combines explanation generation with detection using policy optimization, achieving state-of-the-art performance across multiple benchmarks.
Details
Motivation: Current hateful meme detection systems provide only binary predictions without explanations, while recent explain-then-detect approaches underperform simple supervised fine-tuning. The paper identifies two key issues: models fail to hypothesize policy-relevant cues, and binary rewards are insufficient for guiding reasoning.Method: ExPO-HM combines supervised fine-tuning warmup, GRPO with curriculum learning, and Conditional Decision Entropy as both metric and reward for reasoning quality. Inspired by human annotator training, it focuses on generating explanations before detection.
Result: Achieves state-of-the-art performance on three hateful meme benchmarks with up to 15% and 17% F1 improvement over GRPO and DPO baselines respectively. Excels in binary detection, fine-grained classification, and reasoning quality.
Conclusion: ExPO-HM successfully moves hateful meme detection from simple binary alarms to explanation-driven detection, providing accurate, interpretable, and actionable moderation support.
Abstract: Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support. Code available at https://github.com/JingbiaoMei/ExPO-HM
[187] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Wanxiang Che
Main category: cs.CL
TL;DR: A framework for evaluating LLM-generated test cases using minimal diagnostic basis selection to create compact, diverse benchmarks that resist score inflation.
Details
Motivation: Existing benchmarks for evaluating LLM-generated test cases suffer from high computational costs, score inflation, and reward detection of trivial bugs while missing rare critical faults.Method: Formalizes benchmark construction as finding optimal diagnostic basis in binary code-test matrix, proposes WrongSelect algorithm to select maximally diverse wrong codes, and constructs TC-Bench from competitive programming submissions.
Result: Created TC-Bench benchmark where even advanced test case generation methods achieve only ~60% exclusion rates, exposing significant gaps in diagnostic power.
Conclusion: The framework provides a principled approach to benchmark construction that resists inflation and reveals substantial room for improvement in test case generation methods.
Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of wrong codes, suffering from high computational costs and score inflation. Furthermore, they inadvertently reward generators that detect common, trivial bugs, while failing to penalize their inability to identify rare yet critical faults. In this work, we connect two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a novel framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix, where rows represent wrong codes and columns represent test case results. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power and highlighting substantial room for future improvement. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
[188] Language steering in latent space to mitigate unintended code-switching
Andrey Goncharov, Nikolai Kondusov, Alexey Zaytsev
Main category: cs.CL
TL;DR: Latent-space language steering: A lightweight inference-time method that uses PCA on parallel translations to identify language directions and steer token embeddings to control language identity, reducing code-switching in multilingual LLMs.
Details
Motivation: Multilingual Large Language Models often exhibit unintended code-switching, which reduces their reliability in downstream tasks. There's a need for methods to control language identity without compromising semantic content.Method: Proposes latent-space language steering: 1) Uses PCA on parallel translations to identify language directions in the latent space, 2) Steers token embeddings along these axes during inference to control language identity, 3) Requires only minimal parallel data for calibration and adds negligible computational overhead.
Result: Achieves 95-99% language classification accuracy using a single principal component, reduces next-token distributional divergence by up to 55% across multiple language pairs on Qwen2.5 and Llama-3.2 models. Generation-based evaluation shows 63-99% reduction in Code-Switching Index across four language pairs (p < 0.001). Analysis reveals language identity concentrates in final layers with near-perfect linear separability.
Conclusion: Latent-space language steering effectively mitigates code-switching in multilingual LLMs while preserving semantics, with minimal computational overhead and data requirements. The method demonstrates that language identity is highly structured in the latent space and can be controlled via simple linear transformations.
Abstract: Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 55% across multiple language pairs on Qwen2.5 and Llama-3.2 models. Generation-based evaluation on Llama-3.2 further demonstrates 63–99% reduction in Code-Switching Index across four language pairs ($p < 0.001$). We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
[189] Reliable Fine-Grained Evaluation of Natural Language Math Proofs
Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, Sewon Min
Main category: cs.CL
TL;DR: Proposes ProofGrader, a fine-grained evaluator for LLM-generated math proofs, using ProofBench dataset and achieving low error against expert scores.
Details
Motivation: Current LLMs for mathematical reasoning focus on tasks with easily verifiable answers, but generating and verifying natural language math proofs remains challenging due to lack of reliable fine-grained evaluators.Method: Introduces ProofBench dataset with expert-annotated proof ratings, then systematically explores evaluator design space across backbone models, input context, instructions, and evaluation workflow to develop ProofGrader with ensembling.
Result: ProofGrader achieves MAE of 0.926 against expert scores, significantly outperforming baselines, and in best-of-n selection achieves 4.14/7 score, closing 78% of gap between naive binary evaluator and human oracle.
Conclusion: ProofGrader demonstrates strong performance as a fine-grained proof evaluator and has potential to advance downstream proof generation tasks.
Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers while generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc) and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14/7, closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
[190] PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
Simon Yu, Gang Li, Weiyan Shi, Peng Qi
Main category: cs.CL
TL;DR: PolySkill enables LLM agents to learn generalizable, compositional skills by decoupling abstract goals from concrete implementations, improving skill reuse and success rates across websites.
Details
Motivation: Current LLM-powered agents learn skills that are over-specialized to single websites and fail to generalize, limiting their ability to continually learn and adapt across different web environments.Method: Introduces PolySkill framework inspired by polymorphism in software engineering, decoupling a skill’s abstract goal (what it accomplishes) from its concrete implementation (how it’s executed), enabling agents to learn generalizable and compositional skills.
Result: Improves skill reuse by 1.7x on seen websites, boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, reduces steps by over 20%, and enhances task quality in self-exploration settings.
Conclusion: Separating skill goals from execution is crucial for developing autonomous agents that can learn and generalize across the open web continuously, providing a practical path toward continual learning in adaptive environments.
Abstract: Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill’s abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, while reducing steps by over 20%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, the PolySkill enhances the agent’s ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill’s goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously. Our code can be found in https://github.com/simonucl/PolySkill.
[191] Rethinking On-policy Optimization for Query Augmentation
Zhichao Xu, Shengyao Zhuang, Xueguang Ma, Bingsen Chen, Yijun Tian, Fengran Mo, Jie Cao, Vivek Srikumar
Main category: cs.CL
TL;DR: Systematic comparison of prompting-based vs RL-based query augmentation for IR shows simple prompting often matches or beats RL methods, leading to a novel hybrid approach (OPQE) that generates pseudo-documents optimized via RL.
Details
Motivation: There's growing interest in using LLMs for query augmentation in information retrieval, with two main approaches emerging: prompting-based generation and RL-based fine-tuning. However, these approaches haven't been systematically compared under consistent conditions, making it unclear which performs better or if a hybrid approach could be superior.Method: The paper presents the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks (evidence-seeking, ad hoc, and tool retrieval). Based on findings that simple prompting often matches RL methods, they introduce OPQE - On-policy Pseudo-document Query Expansion, where an LLM policy learns to generate pseudo-documents that maximize retrieval performance, combining prompting’s flexibility with RL’s targeted optimization.
Result: Key finding: Simple, training-free query augmentation often performs on par with or surpasses more expensive RL-based methods, especially with powerful LLMs. The novel OPQE hybrid method outperforms both standalone prompting and RL-based rewriting, showing synergistic approaches yield best results.
Conclusion: A hybrid approach combining prompting’s generative flexibility with RL’s optimization capabilities (OPQE) achieves superior retrieval performance compared to either method alone, demonstrating the value of synergistic methods in LLM-based query augmentation.
Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model’s parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
[192] How Do LLMs Use Their Depth?
Akshat Gupta, Jay Yeung, Gopala Anumanchipalli, Anna Ivanova
Main category: cs.CL
TL;DR: LLMs use depth non-uniformly: early layers make statistical guesses with high-frequency tokens, later layers refine these into contextually appropriate predictions through a “Guess-then-Refine” process.
Details
Motivation: To understand the fine-grained, layer-wise prediction dynamics of large language models, as evidence suggests they don't use their depth uniformly but the specific computational patterns remain unclear.Method: Trace intermediate representations during inference across several open-weight models, propose “Guess-then-Refine” framework, analyze through three case studies (multiple-choice tasks, fact recall, part-of-speech), and validate with causal manipulations including activation patching and early-exiting experiments.
Result: Early layers produce high-frequency token guesses due to limited context; later layers refine these into contextually appropriate tokens. Multiple-choice tasks: options identified in first half, responses finalized in second half. Fact recall: first token requires more depth than subsequent tokens. Function words predicted earliest on average.
Conclusion: LLMs employ structured depth usage with early statistical guessing and later contextual refinement, providing insights for improving computational efficiency in transformer models through better understanding of layer-by-layer computations.
Abstract: Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a “Guess-then-Refine” framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model due to the lack of contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. We then examine the dynamic usage of layer depth through three case studies. (i) Multiple-choice task analysis shows that the model identifies appropriate options within the first half of the model and finalizes the response in the latter half. (ii) Fact recall task analysis shows that in a multi-token answer, the first token requires more computational depth than the rest. (iii) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. To validate our results, we supplement probe-based analyses with causal manipulations in the form of activation patching and early-exiting experiments. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
[193] DiSRouter: Distributed Self-Routing for LLM Selections
Hang Zheng, Hongshen Xu, Yongkai Lin, Shuai Fan, Lu Chen, Kai Yu
Main category: cs.CL
TL;DR: DiSRouter: A distributed routing system where LLM agents self-assess competence and decide whether to answer queries or route them to other agents, outperforming centralized routing approaches.
Details
Motivation: Current LLM query routing systems use centralized external routers that are inflexible and can't fully understand different LLMs' knowledge boundaries, leading to poor performance. There's a need for more flexible, scalable routing that leverages LLMs' intrinsic capabilities.Method: DiSRouter shifts from centralized to distributed routing where queries traverse a network of LLM agents. Each agent independently decides to answer or route based on self-awareness of its competence. Uses a two-stage Self-Awareness Training pipeline to enhance each LLM’s self-assessment capabilities.
Result: DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks.
Conclusion: Leveraging LLMs’ intrinsic self-awareness is more effective than external assessment for query routing. This distributed approach enables more modular and efficient multi-agent systems with superior flexibility, scalability, and generalizability.
Abstract: The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router can not fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM’s self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM’s intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
[194] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
Main category: cs.CL
TL;DR: ToolDreamer improves tool retrieval for LLMs by generating synthetic tool descriptions that better align with user queries, enabling more effective handling of large tool sets without context window limitations.
Details
Motivation: Large tool sets exceed LLM context windows, requiring external retrievers. Current retrievers rank tools based on similarity between user queries and tool descriptions, but user requests often poorly align with TD language, leading to suboptimal retrieval.Method: Proposes ToolDreamer framework that conditions retriever models to fetch tools based on hypothetical (synthetic) tool descriptions generated by an LLM. These synthetic descriptions represent tools the LLM feels would be potentially useful for the query, creating better alignment within the TD language space.
Result: Applied ToolDreamer on ToolRet dataset and showed improved performance for both sparse and dense retrievers with and without training, demonstrating flexibility. The framework effectively offloads reasoning burden to the retriever.
Conclusion: ToolDreamer enables more natural alignment between queries and tools within the TD language space, allowing LLMs to effectively handle large tool collections without overwhelming context windows by improving tool retrieval through synthetic description generation.
Abstract: Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD’s. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
[195] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
Main category: cs.CL
TL;DR: Scaf-GRPO introduces a scaffolded reinforcement learning framework that provides tiered hints when LLMs plateau on difficult reasoning problems, overcoming the “learning cliff” phenomenon where models get stuck on zero-reward problems.
Details
Motivation: Current RL methods for LLMs face a "learning cliff" where models fail completely on problems beyond their capabilities, resulting in persistent zero-reward signals that collapse advantage calculations and stall learning progress.Method: Scaf-GRPO (Scaffolded Group Relative Policy Optimization) is a progressive training framework that diagnoses learning stagnation and then intervenes with tiered in-prompt hints ranging from abstract concepts to concrete steps, enabling models to construct valid solutions independently.
Result: Extensive experiments on challenging mathematics benchmarks show Scaf-GRPO boosts the pass@1 score of Qwen2.5-Math-7B on the AIME24 benchmark by 44.3% relative to vanilla GRPO baseline.
Conclusion: The framework provides a robust methodology for unlocking LLMs’ ability to solve problems previously beyond their reach, representing a critical step toward extending autonomous reasoning capabilities.
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the ‘’learning cliff’’ phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model’s independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO’s effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model’s ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLM.
[196] Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Main category: cs.CL
TL;DR: Activation steering can suppress LLM evaluation-awareness to improve safety evaluation reliability
Details
Motivation: LLMs can detect when they're being evaluated and adjust behavior to appear more aligned, compromising safety evaluation reliability. Need methods to make models behave during evaluation as they would during deployment.Method: Two-step training: 1) Continued pretraining on documents describing evaluation-aware behavior (Python type hints in evaluation, recognition of evaluation cues), 2) Expert iteration to use type hints in evaluation settings. Then use activation steering with vectors from original model to suppress evaluation-awareness.
Result: Activation steering successfully suppressed evaluation-awareness, making the model behave during evaluation as it would during deployment. The steering vector constructed from the original model before additional training was effective.
Conclusion: AI evaluators could improve safety evaluation reliability by steering models to act like they’re deployed, addressing the problem of evaluation-aware behavior in LLMs.
Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM’s activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on two sets of documents describing its behavior. The first says that our model uses Python type hints during evaluation but not during deployment. The second says that our model can recognize that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model behave during evaluation as it would during deployment. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
[197] A Diagnostic Benchmark for Sweden-Related Factual Knowledge
Jenny Kunz
Main category: cs.CL
TL;DR: A Swedish-specific QA benchmark for testing knowledge about Sweden-related personalities and events, with English translations for cross-lingual analysis.
Details
Motivation: Existing Swedish benchmarks are translations of US-centric benchmarks and don't test knowledge specific to Sweden, particularly about Swedish personalities and events with limited international coverage.Method: Created a manually written question-answering benchmark focused on Sweden-related content, drawing inspiration from Swedish radio programs featuring public figures and major sports events. Includes English translations for cross-lingual analysis.
Result: Smaller models with stronger Swedish coverage perform comparably to a multilingual model three times larger on Sweden-related facts. Continued pre-training on Swedish improves factual knowledge but causes partial forgetting of previously known information.
Conclusion: The dataset serves as a diagnostic tool for studying language adaptation and knowledge retention in multilingual models during language adaptation.
Abstract: Many Swedish benchmarks are translations of US-centric benchmarks and are therefore not suitable for testing knowledge that is particularly relevant, or even specific, to Sweden. We therefore introduce a manually written question-answering benchmark specifically targeted at Sweden-related personalities and events, many of which receive very limited coverage in international media. Our annotators drew inspiration from a popular radio program featuring public figures from culture and media, as well as major sports events in Sweden. The dataset can be used to measure factual recall across models of varying sizes and degrees of Swedish coverage, and allows probing of cross-lingual factual consistency, as it contains English translations. Using the dataset, we find that smaller models with stronger Swedish coverage perform comparably to a multilingual model three times larger in recalling Sweden-related facts. We also observe that continued pre-training on Swedish generally improves factual knowledge but leads to partial forgetting of previously known information. These results demonstrate the dataset’s potential as a diagnostic tool for studying language adaptation and knowledge retention in multilingual models during language adaptation.
[198] VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo
Main category: cs.CL
TL;DR: Proposes VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs’ capabilities in assessing visualization aesthetics and quality, and introduces VisJudge, a specialized model that significantly outperforms state-of-the-art MLLMs like GPT-5 in this domain.
Details
Motivation: Evaluating visualization quality is challenging as it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. While MLLMs show promise in natural image aesthetic assessment, no systematic benchmark exists for measuring their capabilities in evaluating visualizations.Method: Created VisJudge-Bench with 3,090 expert-annotated samples from real-world scenarios covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Proposed VisJudge, a model specifically designed for visualization aesthetics and quality assessment.
Result: Advanced MLLMs like GPT-5 show significant gaps compared to human experts (MAE: 0.553, correlation: 0.428). VisJudge reduces MAE to 0.421 (23.9% reduction) and increases consistency with human experts to 0.687 (60.5% improvement) compared to GPT-5.
Conclusion: Current MLLMs have limitations in visualization quality assessment, but specialized models like VisJudge can significantly narrow the gap with human judgment. The benchmark enables systematic evaluation of MLLMs in this important domain.
Abstract: Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs’ performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.553 and a correlation with human ratings of only 0.428. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.421 (a 23.9% reduction) and increasing the consistency with human experts to 0.687 (a 60.5% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
[199] A Comprehensive Dataset for Human vs. AI Generated Text Detection
Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amitava Das
Main category: cs.CL
TL;DR: A comprehensive dataset of 58,000+ text samples combining authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs, designed for AI-generated text detection and model attribution research.
Details
Motivation: Address the challenge of reliably detecting AI-generated text and attributing it to specific models, which is crucial for content authenticity, misinformation prevention, and trustworthiness in the era of generative AI.Method: Created a large-scale dataset using original New York Times article abstracts as prompts to generate synthetic versions with multiple LLMs (Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, GPT-4-o), providing both human-authored and AI-generated text samples.
Result: Baseline results show 58.35% accuracy for distinguishing human-written from AI-generated text, and 8.92% accuracy for attributing AI texts to their generating models, indicating the difficulty of these tasks.
Conclusion: The dataset bridges real-world journalistic content with modern generative models to catalyze development of robust detection and attribution methods, fostering trust and transparency in generative AI.
Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/gsingh1-py/train.
[200] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Main category: cs.CL
TL;DR: LATR improves RLVR for LLMs by using lookahead tree-based rollouts to increase trajectory diversity, accelerating policy learning by 131% and improving reasoning performance by 4.2%.
Details
Motivation: Current RLVR pipelines suffer from limited trajectory diversity during group rollouts due to token-level stochastic sampling, which leads to homogeneous trajectories and diminished reward signals, hindering effective policy learning for reasoning tasks.Method: Proposes Lookahead Tree-Based Rollouts (LATR) with three iterative stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each branch, and (3) pruning branches that show prolonged similarity during simulation.
Result: LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and DAPO algorithms across different reasoning tasks.
Conclusion: LATR effectively addresses the trajectory diversity bottleneck in RLVR pipelines, significantly improving learning efficiency and reasoning performance for large language models.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
[201] Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning
Ru Wang, Wei Huang, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Main category: cs.CL
TL;DR: Self-Harmony: A test-time reinforcement learning framework that uses a single model as both Solver and Reframer to generate stable answers across original and paraphrased questions, employing harmonic mean aggregation instead of majority voting for robust pseudo-labeling.
Details
Motivation: Standard test-time reinforcement learning approaches like majority voting often collapse to spurious but popular answers. The paper aims to create more reliable learning signals without human supervision by leveraging the intuition that correct answers should remain stable across original questions and their paraphrases.Method: Self-Harmony uses a single model in two roles: Solver (produces answers) and Reframer (rephrases inputs). It aggregates answer frequencies across original and reframed views using harmonic mean instead of majority voting, selecting solutions stable under reframing to avoid view-dependent spurious answers.
Result: Achieves state-of-the-art results in label-free test-time settings, ranking first in 28 of 30 settings across multiple reasoning benchmarks. Demonstrates unprecedented robustness with zero training failures in all experiments.
Conclusion: Self-Harmony provides a stable and reliable framework for test-time reinforcement learning that avoids common pitfalls of majority voting by leveraging answer stability across paraphrased inputs, requiring no human supervision or auxiliary models.
Abstract: Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results at the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
[202] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig, Holger Boche
Main category: cs.CL
TL;DR: Comprehensive analysis of open-source DPO datasets reveals quality issues, leading to creation of UltraMix - a curated mixture that outperforms individual datasets despite being smaller.
Details
Motivation: There's a lack of systematic comparisons between open-source DPO datasets due to high computational costs and insufficient quality annotations, making it difficult to understand dataset characteristics and preference quality.Method: Used Magpie framework to annotate samples for task category, input quality, and preference reward; conducted fine-grained inspection of preference quality; curated UltraMix by selectively drawing from five corpora while removing noisy/redundant samples.
Result: UltraMix is 30% smaller than best-performing individual dataset yet exceeds its performance across key benchmarks; revealed structural and qualitative discrepancies in reward margins across datasets.
Conclusion: Data-centric analysis enables better understanding of DPO datasets; curated UltraMix provides improved performance with reduced size; annotations and mixture released to facilitate future research.
Abstract: Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
[203] Gender Bias in Emotion Recognition by Large Language Models
Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam
Main category: cs.CL
TL;DR: LLMs exhibit gender biases in emotional theory of mind tasks, and training-based interventions are more effective than prompt engineering for debiasing
Details
Motivation: As LLMs become more integrated into daily life, it's crucial to evaluate and ensure their fairness, particularly in understanding human emotions and potential gender biasesMethod: Examine LLM fairness in emotional theory of mind by presenting descriptions of people and environments, asking “How does this person feel?”, and evaluating gender biases. Propose and compare multiple debiasing strategies including training-based interventions vs inference-time prompt engineering
Result: LLMs demonstrate gender biases in emotional theory of mind tasks. Training-based interventions achieve meaningful bias reduction, while inference-time prompt engineering approaches are insufficient
Conclusion: Effective debiasing for emotional theory of mind in LLMs requires training-based approaches rather than relying solely on prompt engineering, highlighting the need for systematic fairness interventions
Abstract: The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, ‘‘How does this person feel?’’. Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering, etc.
[204] German General Social Survey Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies
Jens Rupprecht, Leon Fröhling, Claudia Wagner, Markus Strohmaier
Main category: cs.CL
TL;DR: GGSS Personas: A comprehensive persona prompt collection built from German General Social Survey data for LLM-based social simulations, improving alignment with population responses.
Details
Motivation: Current persona prompting for LLM-based social simulations lacks well-curated, empirically grounded persona collections, limiting accuracy and representativeness of simulations.Method: Created GGSS Personas collection from German General Social Survey (ALLBUS) data, designed as plug-and-play persona prompts for LLMs. Evaluated by prompting various LLMs to simulate survey response distributions across topics.
Result: GGSS Personas-guided LLMs outperform state-of-the-art classifiers, especially under data scarcity. Representativity and attribute selection within persona prompts significantly affect alignment with population responses.
Conclusion: GGSS Personas provide a valuable resource for systematic exploration of population-aligned persona prompting in NLP and social science research, enabling more accurate LLM-based social simulations.
Abstract: The use of Large Language Models (LLMs) for simulating human perspectives via persona prompting is gaining traction in computational social science. However, well-curated, empirically grounded persona collections remain scarce, limiting the accuracy and representativeness of such simulations. Here, we introduce the German General Social Survey Personas (GGSS Personas) collection, a comprehensive and representative persona prompt collection built from the German General Social Survey (ALLBUS). The GGSS Personas and their persona prompts are designed to be easily plugged into prompts for all types of LLMs and tasks, steering models to generate responses aligned with the underlying German population. We evaluate GGSS Personas by prompting various LLMs to simulate survey response distributions across diverse topics, demonstrating that GGSS Personas-guided LLMs outperform state-of-the-art classifiers, particularly under data scarcity. Furthermore, we analyze how the representativity and attribute selection within persona prompts affect alignment with population responses. Our findings suggest that GGSS Personas provide a potentially valuable resource for research on LLM-based social simulations that enables more systematic explorations of population-aligned persona prompting in NLP and social science research.
[205] A cross-species neural foundation model for end-to-end speech decoding
Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski
Main category: cs.CL
TL;DR: End-to-end Brain-to-Text framework using pretrained neural encoder and audio LLMs for speech BCI, achieving state-of-the-art performance with 10.22% WER.
Details
Motivation: Current speech BCIs use cascaded frameworks that decode phonemes before assembling sentences with n-gram LMs, preventing joint optimization of all stages simultaneously. The authors aim to create an end-to-end differentiable framework for better performance and seamless optimization.Method: Introduces BIT framework with cross-task, cross-species pretrained neural encoder that transfers to both attempted and imagined speech. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment. Uses small-scale audio LLMs to improve end-to-end decoding.
Result: Achieves new SOTA on Brain-to-Text ‘24 and ‘25 benchmarks in cascaded setting. Reduces WER from 24.69% to 10.22% compared to prior end-to-end method. Small-scale audio LLMs markedly improve end-to-end decoding. Enables cross-task generalization by aligning attempted and imagined speech embeddings.
Conclusion: BIT advances integration of large, diverse neural datasets and paves the way for end-to-end decoding framework supporting seamless, differentiable optimization. Shows potential for improved speech BCIs through neural encoder pretraining and audio LLM integration.
Abstract: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text ‘24 and ‘25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
[206] Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu
Main category: cs.CL
TL;DR: Latent-GRPO improves LLM reasoning by deriving intrinsic rewards from latent space geometry instead of expensive external verifiers, achieving 2x training speedup while maintaining performance.
Details
Motivation: Current GRPO methods rely on expensive external verifiers or human rules, leading to high computational costs, training latency, and sparse rewards that hinder optimization efficiency.Method: Proposes Latent-GRPO framework that derives intrinsic rewards from latent space geometry. Uses empirical discovery that correct reasoning trajectories form dense clusters while incorrect ones scatter as outliers. Introduces Iterative Robust Centroid Estimation (IRCE) algorithm with spherical projection to mitigate magnitude fluctuations and estimate robust “truth centroid” through iterative aggregation.
Result: Experimental results show method maintains model performance while achieving over 2x training speedup compared to baselines. Demonstrates strong generalization ability and robustness across multiple datasets.
Conclusion: Latent-GRPO provides an efficient alternative to external verifiers for LLM reasoning optimization by leveraging latent space geometry, significantly reducing computational costs while preserving performance.
Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust ``truth centroid’’ through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
[207] T*: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning
Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu
Main category: cs.CL
TL;DR: T* introduces a TraceRL-based curriculum training method for masked diffusion language models that enables progressive scaling from small to larger block sizes, allowing higher-parallelism decoding while maintaining performance on math reasoning tasks.
Details
Motivation: The paper addresses the challenge of scaling masked diffusion language models (MDMs) to larger block sizes while maintaining performance. Current MDMs face performance degradation when transitioning to larger blocks for higher-parallelism decoding. The authors aim to develop a training curriculum that enables smooth scaling from small to larger blocks without sacrificing reasoning capabilities.Method: T* uses a TraceRL-based training curriculum that starts with an auto-regressively initialized small-block MDM and progressively scales to larger blocks. The method employs reinforcement learning techniques to create a smooth transition between different block sizes, enabling the model to adapt to higher-parallelism decoding while minimizing performance loss.
Result: The method achieves higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Analysis suggests T* may converge to an alternative decoding schedule that achieves comparable performance to traditional approaches.
Conclusion: T* provides an effective training curriculum for scaling masked diffusion language models to larger block sizes, enabling practical high-parallelism decoding while maintaining reasoning performance, with potential implications for more efficient language model architectures.
Abstract: We present T*, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T* transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T* may actually converge to an alternative decoding schedule that achieves comparable performance.
[208] CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity
Sebastien Kawada, Dylan Holyoak
Main category: cs.CL
TL;DR: CascadeMind is a hybrid neuro-symbolic system for narrative similarity that uses LLM vote distribution as a proxy for task difficulty, implementing confidence-aware routing to handle uncertain cases.
Details
Motivation: The paper addresses how systems should handle uncertainty when comparing narratives, recognizing that not all narrative similarity cases are equally difficult and that LLM confidence can indicate task complexity.Method: A cascade system that samples eight parallel votes from Gemini 2.5 Flash, applies supermajority thresholds to resolve confident cases immediately, escalates uncertain cases to additional voting rounds, and defers perfect ties to a symbolic ensemble of five narrative signals grounded in classical narrative theory.
Result: The system achieved 11th place out of 47 teams in SemEval-2026 Task 4 with 72.75% test accuracy, outperforming several systems built on larger and more expensive models. The difficulty gradient (85% -> 67% -> 61% by pathway) confirmed that vote consensus tracks genuine ambiguity.
Conclusion: For narrative similarity tasks, knowing when you don’t know (confidence-aware routing) matters more than adding auxiliary representations, and LLM vote distribution is a reliable proxy for task difficulty.
Abstract: How should a system handle uncertainty when comparing narratives? We present CascadeMind, a hybrid neuro-symbolic system for SemEval-2026 Task 4 (Narrative Story Similarity) built around a core finding: an LLM’s internal vote distribution is a reliable proxy for task difficulty, and confidence-aware routing outperforms uniform treatment of all cases. Our cascade samples eight parallel votes from Gemini 2.5 Flash, applying a supermajority threshold to resolve confident cases immediately (74% of instances at 85% development accuracy). Uncertain cases escalate to additional voting rounds (21%), and only perfect ties (5%) are deferred to a symbolic ensemble of five narrative signals grounded in classical narrative theory. The resulting difficulty gradient (85% -> 67% -> 61% by pathway) confirms that vote consensus tracks genuine ambiguity. In official Track A evaluation, CascadeMind placed 11th of 47 teams with 72.75% test accuracy (Hatzel et al., 2026), outperforming several systems built on larger and more expensive models. Gains are driven primarily by routing strategy rather than symbolic reasoning, suggesting that for narrative similarity, knowing when you don’t know matters more than adding auxiliary representations.
[209] EnsembleLink: Accurate Record Linkage Without Training Data
Noah Dasanaike
Main category: cs.CL
TL;DR: EnsembleLink is a record linkage method that uses pre-trained language models to match records without labeled training data, achieving high accuracy by leveraging semantic relationships learned from large text corpora.
Details
Motivation: Record linkage is crucial for social science research but is often treated as a preprocessing step with ad hoc rules and unquantified uncertainty. Existing methods either have low accuracy or require substantial labeled training data, creating barriers for researchers.Method: EnsembleLink leverages pre-trained language models that have learned semantic relationships from large text corpora. It uses these models to match records without any training labels, running locally on open-source models without external API calls.
Result: On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. It completes typical linkage tasks in minutes.
Conclusion: EnsembleLink provides an accurate, practical solution for record linkage that doesn’t require labeled training data, addressing a key methodological gap in empirical social science research.
Abstract: Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that “South Ozone Park” is a neighborhood in “New York City” or that “Lutte ouvriere” refers to the Trotskyist “Workers’ Struggle” party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.
[210] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text
Hongyi Zhou, Jin Zhu, Kai Ye, Ying Yang, Erhan Xu, Chengchun Shi
Main category: cs.CL
TL;DR: A novel rewrite-based detection algorithm that adaptively learns distance between original and rewritten text to identify LLM-generated content, showing superior performance over baselines.
Details
Motivation: LLMs can produce highly human-like text, raising concerns about misinformation and academic integrity, creating urgent need for reliable detection algorithms.Method: Presents geometric approach to understand rewrite-based detection, then introduces novel algorithm that adaptively learns distance between original and rewritten text rather than using fixed distance.
Result: Extensive experiments with 100+ settings show superior performance over baselines, achieving 54.3% to 75.4% relative improvements over strongest baseline across different target LLMs (GPT, Claude, Gemini).
Conclusion: Adaptively learned distance functions are more effective for LLM-generated content detection than fixed distances, with publicly available implementation.
Abstract: Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, making it an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 54.3% to 75.4% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini). A python implementation of our proposal is publicly available at https://github.com/Mamba413/L2D.
[211] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes
Fariba Afrin Irany, Sampson Akwafuo
Main category: cs.CL
TL;DR: Selective fine-tuning of GPT-2 for clinical text classification, freezing most parameters and training only final layers, achieves strong performance on radiology reports with reduced computational cost.
Details
Motivation: Clinical narratives in EHRs offer opportunities for automated disease analysis, but modeling long, domain-specific clinical text is challenging due to limited labeled data, class imbalance, and high computational costs of adapting large pretrained models.Method: Proposes GPT-based architecture with selective fine-tuning strategy: freezes majority of GPT-2 backbone, trains only final Transformer block, final layer normalization, and lightweight classification head, reducing trainable parameters while preserving representational capacity.
Result: Evaluated on MIMIC-IV-Note radiology reports with CheXpert-style labels; shows stable convergence and strong classification performance across multiple problem formulations, especially for non-mention and negated findings, with varying dataset sizes.
Conclusion: Selective fine-tuning of pretrained generative language models provides efficient and effective pathway for clinical text classification, enabling scalable adaptation to EHR data while significantly reducing computational complexity.
Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
[212] When Agents “Misremember” Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems
Naen Xu, Hengyu An, Shuo Shi, Jinghuai Zhang, Chunyi Zhou, Changjiang Li, Tianyu Du, Zhihui Fu, Jun Wang, Shouling Ji
Main category: cs.CL
TL;DR: The paper studies the Mandela effect (collective memory distortion) in LLM-based multi-agent systems, proposes MANBENCH benchmark to evaluate it across different tasks and interaction protocols, and develops mitigation strategies that reduce the effect by 74.40%.
Details
Motivation: Multi-agent systems powered by LLMs are vulnerable to collective cognitive biases like the Mandela effect, where groups misremember events due to social influence and misinformation. This limits understanding of memory bias in such systems and raises ethical concerns about misinformation spread.Method: Proposes MANBENCH benchmark with four task types susceptible to Mandela effect and five interaction protocols varying agent roles and memory timescales. Evaluates multiple LLMs on this benchmark, then develops mitigation strategies including prompt-level defenses (cognitive anchoring, source scrutiny) and model-level alignment-based defense.
Result: The proposed mitigation strategies achieve an average 74.40% reduction in Mandela effect compared to baseline. The benchmark successfully quantifies the effect and analyzes how different factors influence it.
Conclusion: The study provides insights for developing more resilient and ethically aligned collaborative multi-agent systems by addressing collective memory biases, with implications for reducing misinformation spread in AI systems.
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems. Code and dataset are available at https://github.com/bluedream02/Mandela-Effect.
[213] Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao
Main category: cs.CL
TL;DR: OOMB is a memory-efficient training system for LLMs that enables training on extremely long contexts (up to 4M tokens) with minimal memory overhead through chunk-recurrent training and KV cache optimizations.
Details
Motivation: Training LLMs on long contexts is severely limited by GPU memory constraints, primarily due to activations scaling linearly with sequence length. Current approaches require large clusters with context parallelism, making long-context training prohibitively expensive.Method: OOMB uses a chunk-recurrent training framework with on-the-fly activation recomputation to maintain constant activation memory. It manages KV cache growth through: 1) paged memory manager for KV cache and gradients to eliminate fragmentation, 2) asynchronous CPU offloading to hide transfer latency, and 3) page-level sparse attention to reduce computation and communication overhead.
Result: OOMB achieves exceptional memory efficiency: only 10MB additional memory per 10K tokens for Qwen2.5-7B. This enables training Qwen2.5-7B with 4M-token context on a single H200 GPU, which would otherwise require a large cluster.
Conclusion: OOMB represents a substantial advance in resource efficiency for long-context LLM training, making it feasible to train models on extremely long sequences with minimal hardware requirements.
Abstract: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.
[214] FASA: Frequency-aware Sparse Attention
Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley
Main category: cs.CL
TL;DR: FASA is a novel framework for efficient LLM inference that uses RoPE frequency-chunk sparsity to dynamically prune KV cache tokens, achieving near-oracle performance with significant speedups.
Details
Motivation: The KV cache memory bottleneck in LLMs for long inputs needs efficient token pruning. Existing methods have limitations: static pruning risks information loss, while dynamic heuristics fail to capture query-dependent token importance.Method: FASA leverages functional sparsity in RoPE frequency-chunks to identify dominant frequency-chunks that correlate with full attention. It uses these as a free proxy to predict token importance, then performs attention computation only on the pruned critical token subset.
Result: FASA outperforms all token-eviction baselines across long-context tasks (sequence modeling, CoT reasoning), achieving near-oracle accuracy. On LongBench-V1, it reaches ~100% of full-KV performance with only 256 tokens, and achieves 2.56× speedup using 18.9% of cache on AIME24.
Conclusion: FASA provides an effective query-aware token eviction framework that addresses KV cache bottlenecks through RoPE frequency-chunk sparsity analysis, enabling efficient LLM inference for long contexts with minimal performance degradation.
Abstract: The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of “dominant” FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constraint budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when only keeping 256 tokens, and achieves 2.56$\times$ speedup using just 18.9% of the cache on AIME24.
[215] Investigating Disability Representations in Text-to-Image Models
Yang Tian, Yu Fan, Liudmila Zavolokina, Sarah Ebling
Main category: cs.CL
TL;DR: Analysis of disability representation in text-to-image models (Stable Diffusion XL & DALL-E 3) reveals persistent imbalances and the need for more inclusive portrayals.
Details
Motivation: While text-to-image models have advanced significantly, concerns remain about how they represent social groups. Disability representations are particularly underexplored compared to other characteristics like gender and race, creating a gap in understanding AI's portrayal of people with disabilities.Method: Used structured prompt design to analyze outputs from Stable Diffusion XL and DALL-E 3. Compared image similarities between generic disability prompts and specific disability category prompts. Evaluated mitigation strategies’ influence on disability portrayals through sentiment polarity analysis combining automatic and human evaluation.
Result: Findings reveal persistent representational imbalances in how people with disabilities are portrayed in AI-generated images, highlighting issues with diversity and inclusion in current generative models.
Conclusion: Continuous evaluation and refinement of generative models is needed to foster more diverse and inclusive portrayals of disability, addressing representational gaps in AI-generated content.
Abstract: Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.
[216] Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing
Main category: cs.CL
TL;DR: Unsupervised RL approach for enhancing LLM long-context capabilities using paragraph reconstruction tasks without human annotations or teacher models.
Details
Motivation: Traditional RL with verifiable rewards (RLVR) requires costly gold-standard answers or evaluation rubrics from teacher models/humans. Need unsupervised methods to enhance long-context capabilities without expensive supervision.Method: Replace paragraphs with placeholders in long documents, train LLMs via RL to reconstruct documents by correctly identifying and sequencing missing paragraphs from candidate options. Captures global narrative coherence.
Result: Validated on RULER and LongBench v2 benchmarks. Achieves noticeable gains on RULER and reasonable improvement on LongBench v2 without manually curated long-context QA data. Includes ablation studies on reward design, data curation, training schemes, and scaling effects.
Conclusion: Unsupervised RL approach effectively enhances LLM long-context capabilities without expensive supervision, demonstrating practical value for long-context understanding tasks.
Abstract: Reinforcement Learning with Verifiable Rewards~(RLVR) has become a prominent paradigm to enhance the capabilities (i.e.\ long-context) of Large Language Models~(LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models’ supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBenchv2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBenchv2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.
[217] Online Causal Kalman Filtering for Stable and Effective Policy Optimization
Shuo He, Lang Feng, Xin Cheng, Lei Feng, Bo An
Main category: cs.CL
TL;DR: KPO uses Kalman filtering to stabilize token-level importance sampling ratios in RL for LLMs, improving policy optimization for math reasoning tasks.
Details
Motivation: Current reinforcement learning methods for large language models suffer from high-variance token-level importance sampling ratios, which destabilize policy optimization. Existing approaches either use fixed sequence-level ratios or adjust token ratios separately, neglecting temporal off-policy relationships across tokens in a sequence.Method: Proposes Online Causal Kalman Filtering for Policy Optimization (KPO), which models the desired importance sampling ratio as a latent state evolving across tokens. A Kalman filter updates this state online and autoregressively based on past token states, preserving token-wise local structure-aware variation while smoothing noise spikes.
Result: KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts, demonstrating more stable and effective policy updates.
Conclusion: The proposed KPO method effectively addresses token-level off-policy deviation issues in RL for LLMs by using Kalman filtering to stabilize importance sampling ratios, leading to improved training stability and performance on reasoning tasks.
Abstract: Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which would destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token’s IS ratio separately, thereby neglecting temporal off-policy derivation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
[218] MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Zhi Zheng, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
Main category: cs.CL
TL;DR: MiniCPM-SALA is a 9B-parameter hybrid architecture combining sparse attention (InfLLM-V2) and linear attention (Lightning Attention) to enable efficient ultra-long context processing up to 1M tokens while maintaining performance comparable to full-attention models.
Details
Motivation: Address the computational and memory challenges of Transformer architectures for ultra-long context applications, overcoming the trade-off between memory efficiency and model performance in existing sparse/linear attention approaches.Method: Hybrid architecture integrating sparse attention (InfLLM-V2) for high-fidelity long-context modeling and linear attention (Lightning Attention) for global efficiency, using layer selection algorithm (1:3 ratio) and hybrid positional encoding (HyPE), with cost-effective continual training framework.
Result: Achieves up to 3.5x inference speed of full-attention model at 256K tokens, supports up to 1M token contexts (where 8B full-attention models fail), maintains general capabilities comparable to full-attention models, reduces training costs by ~75%.
Conclusion: MiniCPM-SALA demonstrates efficient ultra-long context processing through hybrid attention mechanisms, enabling practical applications with million-token contexts while preserving model performance.
Abstract: The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
[219] Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Manyi Zhang, Yuanyong Luo, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong
Main category: cs.CL
TL;DR: HiFloat is a family of low-bit floating-point formats (HiF8 and HiF4) designed for Ascend NPUs that outperforms integer formats for high-variance data and prevents accuracy collapse in 4-bit regimes through hierarchical scaling.
Details
Motivation: As LLMs scale, there's a need for efficient low-bit floating-point formats that can maintain accuracy while reducing computational and memory requirements for inference on specialized hardware like Ascend NPUs.Method: Developed HiFloat formats (HiF8 and HiF4) with hierarchical scaling for Ascend NPUs, conducted rigorous comparisons across weight-activation and KV-cache tasks, and evaluated compatibility with state-of-the-art post-training quantization frameworks.
Result: Three key insights: (1) INT8 works for narrow-range data but floating-point formats excel with high-variance data; (2) HiF4’s hierarchical scaling prevents accuracy collapse in 4-bit regimes unlike integer formats; (3) HiFloat is fully compatible with modern quantization frameworks.
Conclusion: HiFloat provides an effective solution for high-efficiency LLM inference on NPUs, offering better performance than integer formats for certain data distributions and preventing accuracy degradation in extreme low-bit scenarios.
Abstract: As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4’s hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.
[220] Learning Ordinal Probabilistic Reward from Preferences
Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang
Main category: cs.CL
TL;DR: PRM introduces a probabilistic reward modeling paradigm that treats reward as a random variable with full probability distribution, addressing limitations of existing generative and discriminative approaches through ordinal discretization and region flooding tuning.
Details
Motivation: Existing reward models for LLM alignment have limitations: generative models require costly point-wise supervision, while discriminative models produce uncalibrated relative scores lacking probabilistic interpretation. There's a need for reward models that can capture both relative rankings and absolute quality with better data efficiency.Method: Proposes Probabilistic Reward Model (PRM) treating reward as random variable with full probability distribution. Presents Ordinal Probabilistic Reward Model (OPRM) as discrete realization, discretizing quality scores into ordinal ratings. Introduces Region Flooding Tuning (RgFT) training strategy that uses quality-level annotations to concentrate probability mass within corresponding rating sub-regions.
Result: Experiments on various reward model benchmarks show 2.9% to 7.4% accuracy improvement over prior reward models, demonstrating strong performance and data efficiency. Analysis shows method captures both relative rankings and absolute quality.
Conclusion: PRM paradigm with OPRM and RgFT provides effective solution for reward modeling that addresses limitations of existing approaches, offering better accuracy, data efficiency, and ability to capture absolute quality while maintaining probabilistic interpretation.
Abstract: Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by $\textbf{2.9%}\sim\textbf{7.4%}$ compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
[221] Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song
Main category: cs.CL
TL;DR: Elo-Evolve: A co-evolutionary framework for LLM alignment using dynamic multi-agent competition with pairwise comparisons and Elo-based opponent selection, achieving better performance than traditional methods.
Details
Motivation: Current LLM alignment methods rely on compressing human preference data into static reward functions, leading to data scarcity, noise sensitivity, and training instability. There's a need for more robust alignment approaches.Method: Introduces Elo-Evolve framework with two key innovations: 1) learning directly from binary win/loss outcomes in pairwise competitions (eliminating Bradley-Terry model dependencies), and 2) implementing Elo-orchestrated opponent selection with temperature-controlled sampling for automatic curriculum learning. Grounded in PAC learning theory.
Result: Achieves 4.5x noise reduction compared to absolute scoring approaches. Trains Qwen2.5-7B model using opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Shows clear performance hierarchy: point-based methods < static pairwise training < Elo-Evolve across Alpaca Eval 2.0 and MT-Bench.
Conclusion: Pairwise comparison with dynamic opponent selection provides progressive benefits for LLM alignment, offering superior sample complexity and noise robustness compared to traditional absolute scoring methods.
Abstract: Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods < static pairwise training < Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
[222] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene
Main category: cs.CL
TL;DR: LLMs have input-output alignment shift due to causal masking; paper identifies this issue and proposes residual attenuation to mitigate it
Details
Motivation: Autoregressive Transformers use causal masking for parallelism, creating misalignment between residual connections (tied to current token) and supervision (targeting next token), potentially propagating mismatched informationMethod: Empirically localize input-output alignment shift using decoding trajectories over tied embedding spaces and similarity-based metrics; propose lightweight residual-path mitigation via residual attenuation, implemented as fixed-layer intervention or learnable gating mechanism
Result: Experiments reveal hidden token representations switch from input alignment to output alignment deep within the network; proposed strategies alleviate representation misalignment and yield improvements on multiple benchmarks
Conclusion: Residual attenuation provides efficient and general architectural enhancement for autoregressive Transformers by addressing the fundamental input-output alignment shift
Abstract: Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
[223] Are LLMs Ready to Replace Bangla Annotators?
Md. Najib Hasan, Touseef Hasan, Souvika Sarkar
Main category: cs.CL
TL;DR: LLMs show bias and instability as zero-shot annotators for Bangla hate speech, with larger models not necessarily performing better than smaller task-aligned models.
Details
Motivation: To investigate the reliability of LLMs as automated annotators for sensitive tasks in low-resource languages, specifically examining bias and instability in their judgments for Bangla hate speech detection.Method: Systematic benchmark of 17 LLMs using a unified evaluation framework for zero-shot annotation of Bangla hate speech, analyzing annotator bias and stability across different model scales.
Result: LLMs exhibit significant annotator bias and instability in judgments; surprisingly, increased model scale doesn’t guarantee better annotation quality, with smaller task-aligned models often showing more consistent behavior.
Conclusion: Current LLMs have important limitations for sensitive annotation tasks in low-resource languages, requiring careful evaluation before deployment despite their potential for scaling dataset creation.
Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators–especially for low-resource and identity-sensitive settings–remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality–smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
[224] TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov
Main category: cs.CL
TL;DR: TurkicNLP is an open-source Python library providing unified NLP pipelines for Turkic languages across four script families with modular multi-backend architecture.
Details
Motivation: Turkic languages (spoken by 200M+ people) lack unified NLP tooling and resources, with fragmentation across different script families and limited interoperability.Method: Developed a language-agnostic API with modular multi-backend architecture integrating rule-based finite-state transducers and neural models, featuring automatic script detection and routing between script variants.
Result: Created a comprehensive NLP library covering tokenization, morphological analysis, POS tagging, dependency parsing, NER, script transliteration, cross-lingual embeddings, and machine translation following CoNLL-U standards.
Conclusion: TurkicNLP provides the first unified NLP framework for Turkic languages, addressing fragmentation and enabling consistent processing across scripts with full interoperability.
Abstract: Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
[225] Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
Tangsang Chongbang, Pranesh Pyara Shrestha, Amrit Sarki, Anku Jaiswal
Main category: cs.CL
TL;DR: Optimized Nepali-to-English speech-to-text translation pipeline with punctuation restoration to mitigate structural noise from ASR, achieving significant BLEU score improvements.
Details
Motivation: Cascaded speech-to-text translation systems for low-resource languages suffer from structural noise, particularly punctuation loss during ASR, which degrades translation quality. The research aims to address this issue for Nepali-to-English translation.Method: Developed a cascaded pipeline with: 1) Wav2Vec2-XLS-R-300m ASR model achieving 2.72% CER, 2) multi-stage fine-tuned MarianMT NMT model with 28.32 BLEU, and 3) proposed an intermediate Punctuation Restoration Module (PRM) to restore punctuation lost in ASR output.
Result: Punctuation loss caused 20.7% relative BLEU drop. The optimal pipeline with PRM applied directly to ASR output achieved 4.90 BLEU point gain (36.38 vs 31.48 BLEU). Human assessment confirmed superior Adequacy (3.673) and Fluency (3.804) with high inter-rater reliability.
Conclusion: Targeted punctuation restoration is the most effective intervention for mitigating structural noise in Nepali S2TT pipelines, establishing an optimized baseline and critical architectural insight for similar low-resource language systems.
Abstract: Cascaded speech-to-text translation (S2TT) systems for low-resource languages can suffer from structural noise, particularly the loss of punctuation during the Automatic Speech Recognition (ASR) phase. This research investigates the impact of such noise on Nepali-to-English translation and proposes an optimized pipeline to mitigate quality degradation. We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark. We empirically investigate the influence of punctuation loss, demonstrating that unpunctuated ASR output significantly degrades translation quality, causing a massive 20.7% relative BLEU drop on the FLORES benchmark. To overcome this, we propose and evaluate an intermediate Punctuation Restoration Module (PRM). The final S2TT pipeline was tested across three configurations on a custom dataset. The optimal configuration, which applied the PRM directly to ASR output, achieved a 4.90 BLEU point gain over the direct ASR-to-NMT baseline (BLEU 36.38 vs. 31.48). This improvement was validated by human assessment, which confirmed the optimized pipeline’s superior Adequacy (3.673) and Fluency (3.804) with inter-rater reliability (Krippendorff’s $α {\geq}$ 0.723). This work validates that targeted punctuation restoration is the most effective intervention for mitigating structural noise in the Nepali S2TT pipeline. It establishes an optimized baseline and demonstrates a critical architectural insight for developing cascaded speech translation systems for similar low-resource languages.
[226] Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote
Main category: cs.CL
TL;DR: LLMs show systematic failure in algorithmic reasoning despite broad knowledge, performing worse than random guessing on causal discovery tasks
Details
Motivation: While LLMs demonstrate broad knowledge, their ability to reason about computational processes remains poorly understood, which is important for practitioners relying on LLMs for algorithm selection and deployment decisionsMethod: Used causal discovery as a testbed to evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions
Result: Found systematic, near-total failure: models produce overly wide confidence intervals yet still fail to contain true algorithmic mean in most cases; most perform worse than random guessing; best model’s marginal above-random performance likely due to benchmark memorization rather than principled reasoning
Conclusion: Identifies “algorithmic blindness” - a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction in LLMs
Abstract: Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
[227] mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules
Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Sara Szymkuć, Chetan Kumar Prasad, Bowen Jin, Jiawei Han, Ying Diao, Ge Liu, Hao Peng, Bartosz A. Grzybowski, Martin D. Burke, Heng Ji
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.12565: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12565&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[228] Using ChatGPT for Data Science Analyses
Ozan Evkaya, Miguel de Carvalho
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2404.08480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.08480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[229] Large Language Model Agent in Financial Trading: A Survey
Han Ding, Yinheng Li, Junhao Wang, Hang Chen, Doudou Guo, Yunbai Zhang
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2408.06361 exists but summary cannot be retrieved.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2408.06361: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.06361&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[230] Polynomial, trigonometric, and tropical activations
Ismail Khalfaoui-Hassani, Stefan Kesselheim
Main category: cs.CL
TL;DR: Failed to fetch summary for paper 2502.01247 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2502.01247: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.01247&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[231] Intrinsic Entropy of Context Length Scaling in LLMs
Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Lei Li
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2502.01481: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.01481&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[232] CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
Tianhui Liu, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Jie Feng, Yong Li, Pan Hui
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2506.00530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.00530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[233] Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations
Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2502.12179: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.12179&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[234] LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2503.04812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.04812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[235] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, Shaohui Lin
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2503.06749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.06749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[236] I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
Main category: cs.CL
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to analyze paper content due to technical limitations in accessing the abstractMethod: N/A - Paper content not accessible
Result: N/A - No results available for analysis
Conclusion: Cannot provide analysis due to arXiv API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2503.08980: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.08980&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[237] FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2505.13109: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.13109&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[238] Dynamic Token Reweighting for Robust Vision-Language Models
Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma, Ting Wang
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2505.17132 appears to be from May 2025, suggesting recent work in multimodal AI.
Details
Motivation: Cannot determine motivation without access to paper content. Based on arXiv ID format (2505.17132), this is likely a recent paper from May 2025, potentially related to multimodal models given the reader's interests.Method: Method unknown due to HTTP 429 error preventing access to paper details. The arXiv API rate limiting prevents retrieval of abstract and content.
Result: Results cannot be determined without access to the paper content. The HTTP 429 error indicates too many requests to arXiv API.
Conclusion: Unable to analyze paper due to technical limitations in accessing content. The arXiv API rate limiting prevents proper analysis.
Abstract: Failed to fetch summary for 2505.17132: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17132&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[239] Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
Akash Gupta, Amos Storkey, Mirella Lapata
Main category: cs.CL
TL;DR: Paper 2506.06905: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2506.06905: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06905&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[240] Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives
Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, Wanlei Zhou
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2505.17519: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17519&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[241] NFT: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Lifan Yuan, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.18116: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18116&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[242] SQUiD: Synthesizing Relational Databases from Unstructured Text
Mushtari Sadia, Zhenning Yang, Yunming Xiao, Ang Chen, Amrita Roy Chowdhury
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to analyze paper due to technical issue with arXiv API
Abstract: Failed to fetch summary for 2505.19025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[243] SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2506.24119: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.24119&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[244] Learning to Reason without External Rewards
Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2505.19590: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19590&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[245] SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting
Shuhao Mei, Yongchao Long, Xiaoyu Xiao, Shan Cao, Xiaobo Han, Shijia Geng, Jinbo Sun, Yuxi Zhou, Shenda Hong
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2507.16145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.16145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[246] Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.20278: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20278&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[247] OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.03135: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.03135&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[248] InstructPro: Natural Language Guided Ligand-Binding Protein Design
Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2506.09332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[249] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Main category: cs.CL
TL;DR: Unable to analyze paper 2508.05606 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2508.05606: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05606&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[250] EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.01938: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.01938&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[251] Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: No method information available - paper content inaccessible
Result: No results available - paper content inaccessible
Conclusion: Unable to analyze paper due to technical limitations in accessing content
Abstract: Failed to fetch summary for 2508.18672: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.18672&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[252] Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez
Main category: cs.CL
TL;DR: Paper 2509.03113 could not be analyzed due to HTTP 429 error (rate limiting) when attempting to fetch the abstract from arXiv API.
Details
Motivation: Unable to determine motivation due to technical error preventing access to paper content.Method: Unable to determine method due to technical error preventing access to paper content.
Result: Unable to determine results due to technical error preventing access to paper content.
Conclusion: Unable to draw conclusions due to technical error preventing access to paper content.
Abstract: Failed to fetch summary for 2509.03113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[253] BinaryShield: Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
Waris Gill, Natalie Isak, Matthew Dressman
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.05608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.05608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[254] See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2509.13615: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13615&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[255] PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints
Jiahao Huo, Shuliang Liu, Bin Wang, Junyan Zhang, Yibo Yan, Aiwei Liu, Xuming Hu, Mingxun Zhou
Main category: cs.CL
TL;DR: Paper 2509.21057: Failed to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2509.21057: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21057&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[256] Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models
Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, Ting Wang
Main category: cs.CL
TL;DR: Paper 2509.24156 summary could not be fetched due to HTTP 429 rate limiting error from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: No method information available - arXiv API returned HTTP 429 (Too Many Requests) error
Result: No results available - paper content inaccessible due to API rate limiting
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2509.24156: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24156&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[257] Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2509.24203: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24203&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[258] Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.24393: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24393&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[259] Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness
Main category: cs.CL
TL;DR: Paper ID 2509.25087 could not be fetched due to HTTP 429 (rate limiting) error from arXiv API
Details
Motivation: Unable to determine motivation as the abstract could not be retrievedMethod: Unable to determine method as the abstract could not be retrieved
Result: Unable to determine results as the abstract could not be retrieved
Conclusion: Unable to draw conclusions as the paper content could not be accessed
Abstract: Failed to fetch summary for 2509.25087: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25087&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[260] ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2509.26278 suggests it’s from September 2025, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2509.26278: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26278&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[261] EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, Wenhu Chen
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.26346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[262] Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper retrievalMethod: Cannot determine method due to failed paper retrieval
Result: Cannot determine results due to failed paper retrieval
Conclusion: Cannot determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2510.01367: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01367&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[263] MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam
Main category: cs.CL
TL;DR: MemeXplain: A multi-stage optimization approach for Vision-Language Models that jointly improves detection and explanation generation for propagandistic and hateful memes, with new datasets in Arabic and English.
Details
Motivation: Current multimodal content moderation systems struggle with jointly detecting harmful content and generating explanations, often degrading classification performance when trained simultaneously. There's a lack of large-scale resources for propagandistic memes in Arabic and hateful memes in English.Method: Proposed MemeXplain dataset for Arabic propagandistic memes and English hateful memes, with a multi-stage optimization approach for Vision-Language Models that separates training phases to avoid performance degradation.
Result: Significant improvements over base models: ~1.4% accuracy improvement on ArMeme (Arabic) and ~2.2% on Hateful Memes (English), outperforming state-of-the-art methods.
Conclusion: The multi-stage optimization approach effectively addresses the joint training challenge, improving both detection and explanation generation for multimodal harmful content moderation.
Abstract: The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (https://github.com/MohamedBayan/MemeIntel).
[264] GEM: A Gym for Agentic LLMs
Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.01051: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01051&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[265] RLP: Reinforcement as a Pretraining Objective
Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to draw conclusions due to retrieval failure
Abstract: Failed to fetch summary for 2510.01265: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01265&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[266] StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Yanxu Chen, Zijun Yao, Yantao Liu, Amy Xin, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.02209: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02209&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[267] ExGRPO: Learning to Reason from Experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.02245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[268] What Scales in Cross-Entropy Scaling Law?
Junxi Yan, Zixi Wei, Qingyao Ai, Yiqun Liu, Jingtao Zhan
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.04067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[269] Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.05109: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05109&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[270] Exposing Citation Vulnerabilities in Generative Engines
Riku Mochizuki, Shusuke Komatsu, Souta Noguchi, Kazuto Ataka
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2510.06823: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06823&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[271] ScholarEval: Research Idea Evaluation Grounded in Literature
Hanane Nour Moussa, Patrick Queiroz Da Silva, Daniel Adu-Ampratwum, Alyson East, Zitong Lu, Nikki Puccetti, Mingyi Xue, Huan Sun, Bodhisattwa Prasad Majumder, Sachin Kumar
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.16234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[272] On the Reasoning Abilities of Masked Diffusion Language Models
Anej Svete, Ashish Sabharwal
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze paper content
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2510.13117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[273] Soft-Masked Diffusion Language Models
Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, Abbas Rahimi
Main category: cs.CL
TL;DR: Unable to analyze paper 2510.17206 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusions as abstract retrieval failed
Abstract: Failed to fetch summary for 2510.17206: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17206&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[274] BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in accessing paper contentMethod: Unable to determine method due to technical error in accessing paper content
Result: Unable to determine results due to technical error in accessing paper content
Conclusion: Unable to draw conclusions due to technical error in accessing paper content
Abstract: Failed to fetch summary for 2510.20095: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20095&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[275] Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Hiba Ahsan, Byron C. Wallace
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2511.00177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[276] Addressing Longstanding Challenges in Cognitive Science with Language Models
Dirk U. Wulff, Rui Mata
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.00206: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00206&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[277] Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning
Vivswan Shah, Randy Cogill, Hanwei Yue, Gopinath Chennupati, Rinat Khaziev
Main category: cs.CL
TL;DR: Unable to analyze paper 2511.02044 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2511.02044: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.02044&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[278] From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models
Chao Wu, Baoheng Li, Mingchen Gao, Yu Tian, Zhenyi Wang
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: No method information available - arXiv API returned HTTP 429 error (Too Many Requests)
Result: No results available - could not retrieve paper information due to rate limiting
Conclusion: Cannot analyze paper content due to technical limitations in accessing the arXiv API
Abstract: Failed to fetch summary for 2511.10788: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10788&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[279] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: No method information available - API request resulted in HTTP 429 error
Result: No results available - could not retrieve paper information
Conclusion: Cannot analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2512.03794: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03794&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[280] AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data retrieval errorMethod: Unable to determine method due to data retrieval error
Result: Unable to determine results due to data retrieval error
Conclusion: Unable to determine conclusion due to data retrieval error
Abstract: Failed to fetch summary for 2512.20745: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20745&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[281] Reward Models Inherit Value Biases from Pretraining
Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2601.20838: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20838&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[282] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.02185: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02185&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[283] To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Nanxu Gong, Haotian Li, Sixun Dong, Jianxun Lian, Yanjie Fu, Xing Xie
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.10625: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10625&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[284] Spilled Energy in Large Language Models
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.18671: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18671&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[285] Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data fetch failureMethod: Unable to determine method due to data fetch failure
Result: Unable to determine results due to data fetch failure
Conclusion: Unable to determine conclusion due to data fetch failure
Abstract: Failed to fetch summary for 2602.19385: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19385&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[286] Dynamic Level Sets
Michael Stephen Fiske
Main category: cs.CL
TL;DR: Failed to fetch summary for paper 2602.22530 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2602.22530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[287] Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.24009 appears to be from arXiv, but the API request was blocked.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2602.24009: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24009&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[288] Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinsons Detection
Naimur Rahman
Main category: cs.CV
TL;DR: Study examines deep learning for prodromal Parkinson’s detection from fMRI under extreme data scarcity, revealing evaluation strategy (subject-level vs image-level splits) and model capacity significantly impact performance more than architectural depth.
Details
Motivation: Deep learning is often applied in data-limited settings like neuroimaging for prodromal Parkinson's disease where subject numbers are small and scans produce many related samples, yet evaluation practices don't always reflect these constraints.Method: Used fMRI data from 40 subjects (20 prodromal Parkinson’s, 20 controls), fine-tuned ImageNet-pretrained CNNs (VGG19, Inception V3, Inception ResNet V2, MobileNet V1) under two data partitioning strategies: image-level splits (allowing slices from same subject in both train/test) vs strict subject-level splits.
Result: Image-level splits caused severe information leakage with near-perfect accuracy, while subject-level splits dropped performance to 60-81% accuracy. MobileNet demonstrated most reliable generalization despite fewer parameters, outperforming deeper architectures.
Conclusion: In extreme low-data regimes, evaluation strategy and model capacity have greater impact than architectural depth. Provides practical recommendations for evaluating deep learning models under severe data scarcity.
Abstract: Deep learning is often applied in settings where data are limited, correlated, and difficult to obtain, yet evaluation practices do not always reflect these constraints. Neuroimaging for prodromal Parkinsons disease is one such case, where subject numbers are small and individual scans produce many highly related samples. This work examines prodromal Parkinsons detection from resting-state fMRI as a machine learning problem centered on learning under extreme data scarcity. Using fMRI data from 40 subjects, including 20 prodromal Parkinsons cases and 20 healthy controls, ImageNet-pretrained convolutional neural networks are fine-tuned and evaluated under two different data partitioning strategies. Results show that commonly used image-level splits allow slices from the same subject to appear in both training and test sets, leading to severe information leakage and near-perfect accuracy. When a strict subject-level split is enforced, performance drops substantially, yielding test accuracies between 60 and 81 percent. Models with different capacity profiles are compared, including VGG19, Inception V3, Inception ResNet V2, and the lightweight MobileNet V1. Under subject-level evaluation, MobileNet demonstrates the most reliable generalization, outperforming deeper architectures despite having significantly fewer parameters. These results indicate that in extreme low-data regimes, evaluation strategy and model capacity have a greater impact on performance than architectural depth. Although the analysis is limited to a single cohort of 40 subjects and does not include external validation or cross-validation, it provides a concrete case study and practical recommendations for evaluating deep learning models under severe data scarcity.
[289] Automated Quality Check of Sensor Data Annotations
Niklas Freund, Zekiye Ilknur-Öz, Tobias Klockau, Patrick Naumann, Philipp Neumaier, Martin Köppel
Main category: cs.CV
TL;DR: An open-source tool for automatically detecting nine common errors in multi-sensor railway datasets to ensure training data quality for AI-based automated driving systems.
Details
Motivation: High-quality training data is crucial for AI algorithms in automated railway driving systems, but manual quality assurance is time-consuming and labor-intensive. There's a need for automated methods to detect errors in multi-sensor datasets to accelerate development of safety-critical automated driving systems.Method: Proposed an open-source framework with nine detection methods for common errors in multi-sensor railway datasets. The tool automatically identifies issues like sensor synchronization problems, data corruption, and other quality concerns that could affect AI training.
Result: Six detection methods achieved 100% precision, while three additional methods reached precision rates of 96% and 97%. All detected errors were manually validated to evaluate performance.
Conclusion: The automated quality assurance tool significantly reduces manual workload and accelerates development of AI-based automated driving systems by efficiently detecting common errors in multi-sensor training data.
Abstract: The monitoring of the route and track environment plays an important role in automated driving. For example, it can be used as an assistance system for route monitoring in automation level Grade of Automation (GoA) 2, where the train driver is still on board. In fully automated, driverless driving at automation level GoA4, these systems finally take over environment monitoring completely independently. With the help of artificial intelligence (AI), they react automatically to risks and dangerous events on the route. To train such AI algorithms, large amounts of training data are required, which must meet high-quality standards due to their safety relevance. In this publication we present an automatic method for assuring the quality of training data, significantly reducing the manual workload and accelerating the development of these systems. We propose an open-source tool designed to detect nine common errors found in multi-sensor datasets for railway vehicles. To evaluate the performance of the framework, all detected errors were manually validated. Six issue detection methods achieved 100% precision, while three additional methods reached precision rates 96% and 97%.
[290] VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation
Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara
Main category: cs.CV
TL;DR: VoxelDiffusionCut uses diffusion models to iteratively estimate 3D internal structures from partial cutting observations for non-destructive extraction of target parts in recycling applications.
Details
Motivation: Non-destructive extraction of internal components like batteries and motors is crucial for recycling, but product diversity and lack of disassembly information make it challenging to determine where to cut without damaging target parts.Method: Proposes VoxelDiffusionCut which uses diffusion models to iteratively estimate internal 3D structures represented as voxels from observed cutting surfaces, then plans cuts based on estimation results to avoid damaging target parts.
Result: Simulation experiments show the method can estimate internal structures from cutting observations and enable non-destructive extraction of target parts by leveraging estimated uncertainty.
Conclusion: VoxelDiffusionCut successfully addresses challenges in internal structure estimation for non-destructive extraction by using diffusion models with voxel representations to capture uncertainty and avoid erroneous cuts.
Abstract: Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part’s presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.
[291] Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks
Sushi Rao, Jingwei Li
Main category: cs.CV
TL;DR: MSAAN is a lightweight image super-resolution network with novel attention modules for balancing reconstruction quality and model complexity.
Details
Motivation: Address the trade-off between high reconstruction fidelity and low model complexity in existing SR methods by developing an efficient network that can capture both local details and global context.Method: Proposes Multi-scale Spatial Adaptive Attention Network with MSAA module containing Global Feature Modulation Module for texture structures and Multi-scale Feature Aggregation Module for pyramidal feature fusion. Adds Local Enhancement Block for geometric perception and Feature Interactive Gated Feed-Forward Module for nonlinear representation.
Result: Achieves superior or competitive PSNR/SSIM performance on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across ×2, ×3, ×4 scaling factors while maintaining significantly lower parameters and computational costs than SOTA methods.
Conclusion: MSAAN effectively balances reconstruction quality and model efficiency through its novel attention mechanisms, making it suitable for practical SR applications with computational constraints.
Abstract: This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network’s capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across $\times2$, $\times3$, and $\times4$ scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods. Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.
[292] QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference
Miao Zhang, Ruixiao Zhang, Jianxin Shi, Hengzhi Wang, Hao Fang, Jiangchuan Liu
Main category: cs.CV
TL;DR: QuickGrasp is a responsive QoS-aware system that bridges the gap between large VLMs (high accuracy but slow) and small VLMs (fast but less accurate) through local-first architecture with on-demand edge augmentation, achieving large VLM accuracy with 12.8x faster response.
Details
Motivation: Deploying large video-language models in real-world systems is challenging due to high resource demands and unacceptable response delays, while small locally deployable VLMs offer faster responses but sacrifice accuracy. There's a need to reconcile this trade-off for responsive video querying services.Method: QuickGrasp uses a local-first architecture with on-demand edge augmentation, sharing vision representation across model variants to avoid redundant computation. It introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration.
Result: QuickGrasp matches the accuracy of large VLMs while achieving up to 12.8x reduction in response delay across multiple video understanding benchmarks.
Conclusion: QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage VLM capabilities by balancing accuracy and latency through intelligent system design.
Abstract: Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
[293] BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation
M Iffat Hossain, Laura Brattain
Main category: cs.CV
TL;DR: BiSe-UNet is a lightweight dual-path U-Net architecture for real-time medical image segmentation on edge devices, achieving competitive accuracy while maintaining 30+ FPS on Raspberry Pi 5.
Details
Motivation: Real-time image segmentation is crucial for image-guided medical procedures like colonoscopy, but existing models are too computationally intensive for resource-constrained edge devices. Lightweight architectures often sacrifice spatial precision and contextual understanding, compromising diagnostic reliability.Method: BiSe-UNet combines an attention-refined context path for global understanding with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. This dual-path approach balances computational efficiency with segmentation quality.
Result: On the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating practical deployment capability on edge hardware.
Conclusion: BiSe-UNet provides an effective solution for accurate, lightweight medical image segmentation suitable for real-time clinical deployment on resource-constrained edge devices, addressing the trade-off between computational efficiency and segmentation quality.
Abstract: During image-guided procedures, real-time image segmentation is often required. This demands lightweight AI models that can operate on resource-constrained devices. One important use case is endoscopy-guided colonoscopy, where polyps must be detected in real time. The Kvasir-Seg dataset, a publicly available benchmark for this task, contains 1,000 high-resolution endoscopic images of polyps with corresponding pixel-level segmentation masks. Achieving real-time inference speed for clinical deployment in constrained environments requires highly efficient and lightweight network architectures. However, many existing models remain too computationally intensive for embedded deployment. Lightweight architectures, although faster, often suffer from reduced spatial precision and weaker contextual understanding, leading to degraded boundary quality and reduced diagnostic reliability. To address these challenges, we introduce BiSe-UNet, a lightweight dual-path U-Net that integrates an attention-refined context path with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. Evaluated on the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating its effectiveness for accurate, lightweight, and deployable medical image segmentation on edge hardware.
[294] FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu
Main category: cs.CV
TL;DR: FlowPortrait: RL framework for audio-driven portrait animation using MLLM-based human-aligned evaluation and GRPO optimization for better talking-head video generation
Details
Motivation: Existing talking-head video generation methods suffer from imperfect lip sync, unnatural motion, and poor evaluation metrics that don't align with human perception.Method: Reinforcement learning framework with multimodal backbone for autoregressive audio-to-video generation, using MLLM-based human-aligned evaluation system to assess lip-sync, expressiveness, and motion quality, combined with perceptual/temporal consistency regularizers to form composite reward for post-training via Group Relative Policy Optimization (GRPO).
Result: Extensive experiments show FlowPortrait consistently produces higher-quality talking-head videos, demonstrating effectiveness of RL for portrait animation.
Conclusion: Reinforcement learning with human-aligned MLLM evaluation effectively improves talking-head video generation quality across multiple dimensions.
Abstract: Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
[295] NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
Aman Ulla
Main category: cs.CV
TL;DR: NovaLAD is a fast document parsing system that uses dual YOLO models for element and layout detection, with optional vision-language enhancement for image analysis, achieving state-of-the-art performance on document parsing benchmarks.
Details
Motivation: Document extraction is crucial for RAG, knowledge bases, and generative AI applications, but existing solutions often lack speed, accuracy, or cost-effectiveness. The authors aim to create a comprehensive parsing system that works efficiently on CPU without GPU requirements.Method: Uses two concurrent YOLO object detection models (element detection and layout detection) with rule-based grouping. Includes vision-language enhancement where images are first filtered by ViT classifier, then only relevant images go to Vision LLM for analysis. Parallel execution for detection, classification, OCR, and conversion.
Result: Achieves 96.49% TEDS and 98.51% NID on DP-Bench benchmark, outperforming both commercial and open-source parsers. System works on CPU, generates multiple output formats (JSON, Markdown, RAG-ready texts, knowledge graphs).
Conclusion: NovaLAD provides an accurate, fast, and cost-effective document parsing solution that balances performance with practical deployment considerations, making document extraction accessible without GPU requirements.
Abstract: Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models - element detection and layout detection - with rule-based grouping and optional vision-language enhancement. When a page image is sent in, the first thing that happens is that it goes through both models at the same time. The element model finds semantic content like the title, header, text, table, image, and so on, and the layout model finds structural regions like layout_box, column_group, multi_column, row_group, and so on. A key design decision is to first send an image or figure through an image classifier (ViT) that decides whether it is relevant or not. Only useful images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and costs. NovaLAD is built for speed: it works on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several forms, including structured JSON, Markdown, RAG-ready texts, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and get 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains how to extract data, how the architecture works, how data flows, and how to make NovaLAD both accurate and usable without needing a GPU.
[296] CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers
Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang
Main category: cs.CV
TL;DR: CT-Flow: An agentic framework for 3D CT interpretation using tool-mediated workflows, achieving 41% accuracy improvement and 95% tool invocation success rate.
Details
Motivation: Current LVLMs for 3D CT analysis use static inference, while real clinical workflows involve dynamic, iterative tool use (measurement, segmentation, radiomics). Need to bridge this gap between AI models and actual radiologist workflows.Method: Propose CT-Flow framework using Model Context Protocol (MCP) for interoperable volumetric interpretation. Create CT-FlowBench benchmark for 3D CT tool-use and multi-step reasoning. Framework acts as clinical orchestrator decomposing queries into automated tool-use sequences.
Result: State-of-the-art performance on CT-FlowBench and standard 3D VQA datasets. 41% improvement in diagnostic accuracy over baselines. 95% success rate in autonomous tool invocation.
Conclusion: CT-Flow provides scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology by shifting from closed-box inference to open, tool-aware paradigm.
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.
[297] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang
Main category: cs.CV
TL;DR: UniTalking is an open-source diffusion framework for generating high-fidelity speech and lip-synchronized video using multi-modal transformer blocks with shared self-attention for audio-video alignment.
Details
Motivation: To address the inaccessibility of closed-source audio-video generation models like Veo3 and Sora2 by creating an open, unified framework that can generate realistic talking portraits with lip synchronization.Method: Uses a unified end-to-end diffusion framework with Multi-Modal Transformer Blocks that explicitly model temporal correspondence between audio and video latent tokens via shared self-attention. Leverages pre-trained video generation model priors for visual fidelity and includes personalized voice cloning from brief audio references.
Result: Produces highly realistic talking portraits with superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
Conclusion: UniTalking successfully bridges the accessibility gap in audio-video generation while achieving state-of-the-art performance in generating synchronized talking portraits with voice cloning capabilities.
Abstract: While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
[298] OrthoAI: A Lightweight Deep Learning Framework for Automated Biomechanical Analysis in Clear Aligner Orthodontics – A Methodological Proof-of-Concept
Edouard Lansiaux, Margaux Leman, Mehdi Ammi
Main category: cs.CV
TL;DR: OrthoAI: An open-source decision-support system combining 3D dental segmentation with automated biomechanical analysis to assist orthodontic treatment planning evaluation.
Details
Motivation: Current clinician review of digitally planned tooth movements (via ClinCheck) is slow and error-prone, creating a need for automated decision-support systems in orthodontics.Method: Uses Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand dataset, integrated with rule-based biomechanical engine grounded in orthodontic evidence. System decomposes per-tooth motion across six degrees of freedom and computes movement-specific predictability.
Result: Segmentation achieves 81.4% Tooth Identification Rate and 8.25% mIoU on surrogate point clouds. The end-to-end pipeline runs in <4s on consumer hardware. System provides alerts when biomechanical limits are exceeded and derives composite index.
Conclusion: Establishes baseline for future full-mesh training, highlights current perceptual limits, and provides open-source tools for reproducible research in geometric deep learning and digital orthodontics.
Abstract: Clear aligner therapy now dominates orthodontics, yet clinician review of digitally planned tooth movements-typically via ClinCheck (Align Technology)-remains slow and error-prone. We present OrthoAI, an open-source proof-of-concept decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis to assist treatment-plan evaluation. The framework uses a Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand (MICCAI) and integrates a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al 2009; Simon et al 2014). The system decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when biomechanical limits are exceeded, and derives an exploratory composite index. With 60,705 trainable parameters, segmentation reaches a Tooth Identification Rate of $81.4%$ and mIoU of $8.25%$ on surrogate point clouds-reflecting sparse landmark supervision rather than dense meshes. Although spatial boundaries are coarse, downstream analysis depends mainly on tooth identity and approximate centroid/axis estimation. Results establish a baseline for future full-mesh training and highlight current perceptual limits. The end-to-end pipeline runs in $<4s$ on consumer hardware. Code, weights, and analysis tools are released to support reproducible research in geometric deep learning and digital orthodontics. The system has not been validated on real intraoral meshes and should not be assumed to generalize beyond landmark-derived representations.
[299] From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Main category: cs.CV
TL;DR: MM-Mem is a pyramidal multimodal memory architecture for long-horizon video understanding, inspired by Fuzzy-Trace Theory, with hierarchical memory organization and entropy-driven retrieval.
Details
Motivation: Current multimodal LLMs struggle with long-horizon video understanding due to limited context windows and inefficient memory mechanisms. Existing approaches either use vision-centric methods with high latency/redundancy or text-centric methods with detail loss/hallucination.Method: Proposes MM-Mem with hierarchical memory: Sensory Buffer (fine-grained perceptual traces), Episodic Stream, and Symbolic Schema (high-level semantics). Uses Semantic Information Bottleneck objective with SIB-GRPO for memory optimization, and entropy-driven top-down retrieval that drills down from abstract to detailed memory under uncertainty.
Result: Extensive experiments across 4 benchmarks show effectiveness on both offline and streaming tasks, demonstrating robust generalization and validating the cognition-inspired memory organization.
Conclusion: MM-Mem bridges the gap between vision-centric and text-centric approaches for long-horizon video understanding through hierarchical memory organization inspired by human cognition, achieving better efficiency and accuracy.
Abstract: While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively “drills down” to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
[300] Segmenting Low-Contrast XCTs of Concretes: An Unsupervised Approach
Kaustav Das, Gaston Rauchs, Jan Sykora, Anna Kucerova
Main category: cs.CV
TL;DR: Unsupervised CNN training using self-annotation with superpixels for semantic segmentation of low-contrast concrete XCT scans
Details
Motivation: Concrete XCT imaging has low contrast between aggregates and mortar due to similar X-ray attenuation coefficients. CNN-based semantic segmentation typically requires labeled data which is costly/unavailable for new datasets.Method: Self-annotation technique using superpixel algorithms to identify perceptually similar local regions, relating them to global context via CNN receptive field to learn global-local relationships for semantic segmentation.
Result: Performance of unsupervised training methodology on XCT datasets is presented, with discussion of potential improvements.
Conclusion: Unsupervised self-annotation approach shows promise for semantic segmentation of challenging low-contrast XCT images without requiring labeled training data.
Abstract: This work tests a self-annotation-based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X-ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to similar X-ray attenuation coefficients of aggregates and mortar, resulting in low-contrast between the two phases in the ensuing images. While CNN-based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which is often unavailable for new datasets or are costly to obtain. To counter that limitation, a self-annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN-based model. This enables the model to learn a global-local relationship in the images and enables identification of semantically similar structures. We therefore present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.
[301] Predicting Local Climate Zones using Urban Morphometrics and Satellite Imagery
Hugo Majer, Martin Fleischmann
Main category: cs.CV
TL;DR: Study evaluates using urban morphometrics (2D building/street metrics) to predict Local Climate Zones, finding limited and inconsistent effectiveness compared to satellite imagery-based approaches.
Details
Motivation: The LCZ framework is widely used for urban morphological analysis but relies heavily on satellite imagery. This study explores whether urban morphometrics (numerical measures of urban form) can predict LCZs as an alternative or complementary approach.Method: Calculated 321 2D morphometric attributes from building footprints and street networks at multiple spatial scales. Developed four classification schemes: 1) morphometric-based LCZ prediction, 2) baseline image-based prediction, and 3) two fusion techniques combining morphometrics with satellite imagery. Evaluated across five different sites.
Result: Morphometric-based prediction showed selective and inconsistent correspondence with LCZ types, being highly site-dependent. Fusion approaches yielded modest accuracy improvements at only two sites, with negligible or slightly negative gains at others. The study revealed that a broader range of urban form properties are relevant for distinguishing LCZ types than standard parameters suggest.
Conclusion: The relationship between LCZs and measurable, visible aspects of urban form is tenuous. The LCZ framework should be used cautiously in morphological studies, as morphometrics alone cannot reliably predict LCZ classifications across different urban contexts.
Abstract: The Local Climate Zone (LCZ) framework is commonly employed to represent urban form in morphological analyses despite its mapping predominantly relies on satellite imagery. Urban morphometrics, describing urban form via numerical measures of physical aspects and spatial relationships of its elements, offers another avenue. This study evaluates the ability of morphometric assessment to predict LCZs using a) a morphometric-based LCZ prediction, and b) a fusion-based LCZ prediction combining morphometrics with satellite imagery. We calculate 321 2D morphometric attributes from building footprints and street networks, covering their various properties at multiple spatial scales. Subsequently, we develop four classification schemes: morphometric-based prediction, baseline image-based prediction, and two techniques fusing morphometrics with imagery. We evaluate them across five sites. Results from the morphometric-based prediction indicate that the correspondence between 2D urban morphometrics and urban LCZ types is selective and inconsistent, rendering the efficacy of this method site-dependent. Nevertheless, it demonstrated that a much broader range of urban form properties is relevant for distinguishing LCZ types compared to standard parameters. Relative to the image-based baseline, the fusion yielded relatively distinct accuracy improvements for urban LCZ types at two sites; however, gains at the remaining sites were negligible or even slightly negative, suggesting that the benefits of fusion are modest and inconsistent. Collectively, these results indicate that the relationship between the LCZs and the measurable, visible aspects of urban form is tenuous, thus the LCZ framework should be used with caution in morphological studies.
[302] You Don’t Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models
Kairan Zhao, Eleni Triantafillou, Peter Triantafillou
Main category: cs.CV
TL;DR: GUARD is a novel framework that mitigates memorization in text-to-image diffusion models by adjusting the denoising process to guide generation away from training data while maintaining prompt alignment and image quality.
Details
Motivation: Generative models can memorize training data, leading to verbatim image generation that raises privacy and copyright concerns. Current methods for memorization mitigation often compromise image quality or are not robust across different architectures and memorization types.Method: GUARD uses attractive-repulsive dynamics during image denoising. It includes a novel cross-attention attenuation mechanism that: (1) statistically identifies prompt positions where cross-attention must be attenuated, and (2) attenuates cross-attention in these per-prompt locations. This provides a surgical, dynamic per-prompt inference-time approach.
Result: GUARD produces state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization. It consistently outperforms other methods while improving or maintaining comparable image quality.
Conclusion: GUARD offers an effective inference-time solution for memorization mitigation in text-to-image diffusion models that is robust across architectures and memorization types while preserving image quality.
Abstract: Generative models have been shown to “memorize” certain training data, leading to verbatim or near-verbatim generating images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.
[303] TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
Bibin Wilson
Main category: cs.CV
TL;DR: TinyVLM enables zero-shot object detection on microcontrollers with <1MB memory via decoupled architecture, Matryoshka distillation, and quantized embeddings.
Details
Motivation: Current zero-shot object detection relies on large vision-language models (VLMs) like CLIP that require hundreds of MBs of memory, exceeding the constraints of microcontrollers (MCUs). There's a need for efficient zero-shot detection on resource-constrained edge devices.Method: Three key innovations: (1) Decoupled architecture separating visual inference from text encoding, allowing precomputed class embeddings in flash memory; (2) Matryoshka distillation training nested embeddings at multiple dimensions (16-256) for flexible accuracy-memory trade-offs; (3) Quantized embedding storage reducing class prototype memory by 4x with minimal accuracy loss.
Result: Trained on CC3M, achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 with only 285KB RAM and 892KB flash memory. Real-time inference at 26 FPS on STM32H7 and >1,000 FPS on MAX78000 with CNN accelerator.
Conclusion: Enables practical zero-shot object detection on edge devices for the first time, overcoming memory constraints of MCUs while maintaining competitive accuracy.
Abstract: Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision language models (VLMs) like CLIP that require hundreds of megabytes of memory - far exceeding the constraints of micro controller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
[304] From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai
Main category: cs.CV
TL;DR: ADE-CoT is an adaptive test-time scaling framework for image editing that improves efficiency and performance through difficulty-aware resource allocation, edit-specific verification, and opportunistic stopping.
Details
Motivation: Image-CoT methods work well for text-to-image generation but are inefficient for image editing due to constrained solution spaces, leading to three challenges: inefficient resource allocation with fixed budgets, unreliable early-stage verification using general MLLM scores, and redundant results from large-scale sampling.Method: ADE-CoT incorporates three key strategies: (1) difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty, (2) edit-specific verification using region localization and caption consistency for early pruning, and (3) depth-first opportunistic stopping guided by an instance-specific verifier.
Result: Extensive experiments on three state-of-the-art editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show ADE-CoT achieves superior performance-efficiency trade-offs, obtaining better performance with more than 2x speedup over Best-of-N with comparable sampling budgets.
Conclusion: ADE-CoT effectively addresses the challenges of applying Image-CoT to image editing by providing an adaptive framework that improves both efficiency and editing quality through intelligent resource allocation and verification mechanisms.
Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
[305] VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
Main category: cs.CV
TL;DR: Video-trained block-causal diffusion transformer achieves state-of-the-art in-context image editing by learning from video sequences without task-specific pipelines
Details
Motivation: Existing in-context image editing methods rely on task-specific pipelines and expert models for training data curation, which limits scalability and generalization. The authors explore whether such models can be learned directly from videos as a more scalable data source.Method: Proposes a scalable approach to annotate videos as interleaved multimodal sequences. Designs a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Also introduces a novel multi-turn image editing benchmark.
Result: The model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite video-only training, it shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
Conclusion: In-context image editing models can be effectively learned directly from videos without task-specific pipelines, demonstrating strong performance and generalization to various editing applications.
Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
[306] Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression
Bibin Wilson
Main category: cs.CV
TL;DR: LRD enables continual object detection on microcontrollers using task-adaptive latent replay compression and spatial-diverse exemplar selection within 64KB memory constraints.
Details
Motivation: Current object detection models for microcontrollers cannot learn new object categories after deployment, and existing continual learning methods require storing raw images that exceed MCU memory budgets (tens of kilobytes).Method: 1. Task-adaptive compression using learnable FiLM conditioning to preserve discriminative features per task; 2. Spatial-diverse exemplar selection via farthest-point sampling in IoU space to prevent localization bias; 3. MCU-deployable system storing only 150 bytes per sample instead of >10KB for images.
Result: LRD achieves strong mAP@50 on CORe50 (50 classes, 5 tasks), maintains performance across tasks, and runs on STM32H753ZI, ESP32-S3, and MAX78000 MCUs with 4.9-97.5ms latency per inference within 64KB memory budget.
Conclusion: LRD enables practical continual object detection on edge devices for the first time by synergistically combining task-adaptive compression and spatial-diverse exemplar selection within strict MCU memory constraints.
Abstract: Deploying object detection on microcontrollers (MCUs) enables intelligent edge devices but current models cannot learn new object categories after deployment. Existing continual learning methods require storing raw images far exceeding MCU memory budgets of tens of kilobytes. We present Latent Replay Detection (LRD), the first framework for continual object detection under MCU memory constraints. Our key contributions are: 1. Task-Adaptive Compression: Unlike fixed PCA, we propose learnable compression with FiLM (Feature-wise Linear Modulation) conditioning, where task specific embeddings modulate the compression to preserve discriminative features for each task’s distribution; 2. Spatial-Diverse Exemplar Selection: Traditional sampling ignores spatial information critical for detection - we select exemplars maximizing bounding box diversity via farthest-point sampling in IoU space, preventing localization bias in replay; 3. MCU-Deployable System: Our latent replay stores 150 bytes per sample versus >10KB for images, enabling a 64KB buffer to hold 400+ exemplars. Experiments on CORe50 (50 classes, 5 tasks) demonstrate that LRD achieves mAP@50 on the initial task and maintains strong performance across subsequent tasks - a significant improvement over naive fine-tuning while operating within strict MCU constraints. Our task-adaptive FiLM compression and spatial diverse exemplar selection work synergistically to preserve detection capabilities. Deployed on STM32H753ZI, ESP32-S3, and MAX78000 MCUs, LRD achieves 4.9-97.5ms latency per inference within a 64KB memory budget-enabling practical continual detection on edge devices for the first time.
[307] Leveraging GenAI for Segmenting and Labeling Centuries-old Technical Documents
Carlos Monroy, Benjamin Navarro
Main category: cs.CV
TL;DR: Using SAM2 for segmentation and Florence2/ChatGPT with specialized nautical ontologies to analyze historical shipbuilding documents from the Age of Exploration
Details
Motivation: Historical document analysis faces challenges due to limited training data and domain specialization, but automating segmentation and recognition is crucial for making priceless collections accessible for curation, cataloging, and knowledge disseminationMethod: Combines SAM2 for image segmentation, Florence2 and ChatGPT for labeling, with specialized ontology (ontoShip) and glossary (glosShip) of nautical architecture to enhance labeling accuracy for XVI-XVII century shipbuilding treatises
Result: Preliminary results show potential for marrying these technologies to improve curation and retrieval of historical documents, though challenges and limitations are acknowledged
Conclusion: The approach demonstrates feasibility for historical document analysis despite data scarcity, with ongoing work needed to address current limitations
Abstract: Image segmentation and image recognition are well established computational techniques in the broader discipline of image processing. Segmentation allows to locate areas in an image, while recognition identifies specific objects within an image. These techniques have shown remarkable accuracy with modern images, mainly because the amount of training data is vast. Achieving similar accuracy in digitized images of centuries-old documents is more challenging. This difficulty is due to two main reasons: first, the lack of sufficient training data, and second, because the degree of specialization in a given domain. Despite these limitations, the ability to segment and recognize objects in these collections is important for automating the curation, cataloging, and dissemination of knowledge, making the contents of priceless collections accessible to scholars and the general public. In this paper, we report on our ongoing work in segmenting and labeling images pertaining to shipbuilding treatises from the XVI and XVII centuries, a historical period known as the Age of Exploration. To this end, we leverage SAM2 for image segmentation; Florence2 and ChatGPT for labeling; and a specialized ontology ontoShip and glossary glosShip of nautical architecture for enhancing the labeling process. Preliminary results demonstrate the potential of marrying these technologies for improving curation and retrieval of priceless historical documents. We also discuss the challenges and limitations encountered in this approach and ideas on how to overcome them in the future.
[308] Towards Data-driven Nitrogen Estimation in Wheat Fields using Multispectral Images
Andreas Tritsarolis, Tomaž Bokan, Matej Brumen, Domen Mongus, Yannis Theodoridis
Main category: cs.CV
TL;DR: TerrAI is a neural network-based solution for targeted spraying and fertilization in agriculture that considers spatio-temporal variability across different parcels using remote sensing data.
Details
Motivation: Modern agriculture needs advanced analytics for precise resource application to improve utilization and reduce environmental impacts. Targeted spraying and fertilization is challenging due to external factors like crop type, fertilization phase, soil conditions, and weather dynamics.Method: Neural network-based solution that considers spatio-temporal variability across different agricultural parcels. Uses real-world remote sensing dataset for training and validation.
Result: Experimental study validates the soundness of TerrAI on data-driven agricultural practices. The system effectively handles the complex variability in agricultural conditions.
Conclusion: TerrAI provides an effective neural network solution for targeted agricultural operations that can optimize resource use and promote environmental sustainability through precise application of inputs.
Abstract: The modernization of agriculture has motivated the development of advanced analytics and decision-support systems to improve resource utilization and reduce environmental impacts. Targeted Spraying and Fertilization (TSF) is a critical operation that enables farmers to apply inputs more precisely, optimizing resource use and promoting environmental sustainability. However, accurate TSF is a challenging problem, due to external factors such as crop type, fertilization phase, soil conditions, and weather dynamics. In this paper, we present TerrAI, a Neural Network-based solution for TSF, which considers the spatio-temporal variability across different parcels. Our experimental study over a real-world remote sensing dataset validates the soundness of TerrAI on data-driven agricultural practices.
[309] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
Main category: cs.CV
TL;DR: TTOM is a training-free framework that improves compositional video generation by optimizing parameters during inference to align video outputs with spatiotemporal layouts, using a memory mechanism to maintain historical optimization contexts.
Details
Motivation: Video Foundation Models (VFMs) have strong visual generation capabilities but struggle with compositional scenarios involving motion, numeracy, and spatial relations. There's a need for better text-image alignment in compositional video generation without requiring extensive retraining.Method: TTOM uses test-time optimization with a layout-attention objective to align VFM outputs with spatiotemporal layouts during inference. It integrates and optimizes new parameters rather than directly intervening in latents or attention. The framework operates in a streaming setting with a parametric memory mechanism that supports insert, read, update, and delete operations for maintaining historical optimization contexts.
Result: Experimental results on T2V-CompBench and Vbench benchmarks show TTOM effectively improves compositional video generation. The framework demonstrates powerful transferability and generalization, disentangling compositional world knowledge while being practical, scalable, and efficient.
Conclusion: TTOM provides an effective training-free framework for achieving cross-modal alignment in compositional video generation, addressing VFM limitations in compositional scenarios through test-time optimization and memory mechanisms.
Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
[310] Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion
Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Main category: cs.CV
TL;DR: RADS is an inference-time framework that prevents memorization in text-to-image diffusion models while preserving generation quality and prompt alignment, using reachability analysis and constrained RL to steer trajectories away from memorized samples.
Details
Motivation: Text-to-image diffusion models often memorize training data, failing to generalize. Current mitigation strategies sacrifice image quality or prompt alignment. There's a need for a solution that prevents memorization while preserving generation fidelity.Method: Models diffusion denoising as a dynamical system, applies reachability analysis to approximate the “backward reachable tube” (states that evolve into memorized samples), and formulates mitigation as constrained RL where a policy learns to steer trajectories via minimal perturbations in caption embedding space.
Result: Achieves superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Provides robust mitigation without modifying diffusion backbone, offering plug-and-play solution.
Conclusion: RADS effectively prevents memorization in text-to-image diffusion models while maintaining generation quality and alignment, offering a practical inference-time solution without model modifications.
Abstract: Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the “backward reachable tube”–the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.
[311] Implementation of Licensed Plate Detection and Noise Removal in Image Processing
Yiquan Gao
Main category: cs.CV
TL;DR: A paper on car license plate recognition systems using image processing technology for vehicle identification in Malaysia, with applications in parking, toll systems, traffic surveillance, and police enforcement.
Details
Motivation: The rapid increase in vehicles in Malaysia has created significant demand for automated license plate recognition systems to support various applications like electronic parking payment, highway toll collection, traffic surveillance, and police enforcement tools.Method: The paper describes an image processing technology approach for automatic number-plate recognition (ANPR), also known as automatic vehicle identification or optical character recognition for cars, though specific technical details are not provided in the abstract.
Result: The abstract presents a conceptual framework for implementing car license plate recognition systems in Malaysia, highlighting their potential applications across multiple domains including transportation, law enforcement, and other specialized fields.
Conclusion: Car license plate recognition systems have significant potential in Malaysia for various practical applications and could be combined with techniques from other fields like biology and aerospace to solve specialized problems.
Abstract: Car license plate recognition system is an image processing technology used to identify vehicles by capturing their Car License Plates. The car license plate recognition technology is also known as automatic number-plate recognition, automatic vehicle identification, car license plate recognition or optical character recognition for cars. In Malaysia, as the number of vehicle is increasing rapidly nowadays, a pretty great number of vehicle on the road has brought about the considerable demands of car license plate recognition system. Car license plate recognition system can be implemented in electronic parking payment system, highway toll-fee system, traffic surveillance system and as police enforcement tools. Additionally, car license plate recognition system technology also has potential to be combined with various techniques in other different fields like biology, aerospace and so on to achieve the goal of solving some specialized problems.
[312] GrapHist: Graph Self-Supervised Learning for Histopathology
Sevda Öğüt, Cédric Vincent-Cuaz, Natalia Dubljevic, Carlos Hurtado, Vaishnavi Subramanian, Pascal Frossard, Dorina Thanou
Main category: cs.CV
TL;DR: GrapHist: A graph-based self-supervised learning framework for histopathology that models tissues as cell graphs using masked autoencoders and heterophilic GNNs, achieving competitive performance with fewer parameters.
Details
Motivation: Current self-supervised vision models for digital pathology use domain-agnostic architectures that don't account for fundamental biological elements like cells and their interactions. The authors hypothesize that modeling tissues as cell graphs offers more efficient representation learning.Method: GrapHist integrates masked autoencoders with heterophilic graph neural networks designed to capture tumor microenvironment heterogeneity. Pre-trained on 11 million cell graphs from breast tissues using self-supervised learning.
Result: Achieves competitive performance compared to vision-based counterparts in slide-, region-, and cell-level tasks with 4x fewer parameters. Drastically outperforms fully-supervised graph models on cancer subtyping tasks. Released five graph-based digital pathology datasets.
Conclusion: Graph-based self-supervised learning with biologically-informed cell graph modeling offers efficient representation learning for histopathology, enabling diverse downstream tasks with fewer parameters while establishing the first large-scale graph benchmark in digital pathology.
Abstract: Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that a biologically-informed modeling of tissues as cell graphs offers a more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally-informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we also release five graph-based digital pathology datasets used in our study at https://huggingface.co/ogutsevda/datasets , establishing the first large-scale graph benchmark in this field. Our code is available at https://github.com/ogutsevda/graphist .
[313] Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design
Bosen Lin, Feng Gao, Yanwei Yu, Junyu Dong, Qian Du
Main category: cs.CV
TL;DR: DTI-UIE is a downstream task-inspired underwater image enhancement framework that improves recognition performance for vision tasks like segmentation and detection by focusing on task-relevant high-frequency details rather than just human visual appeal.
Details
Motivation: Existing underwater image enhancement methods focus on human visual perception but fail to reconstruct high-frequency details critical for downstream recognition tasks like semantic segmentation and object detection in underwater environments.Method: Proposes a two-branch network with task-aware attention for feature mixing, multi-stage training framework, task-driven perceptual loss, and automatically constructs a Task-Inspired UIE Dataset using various task-specific networks.
Result: DTI-UIE significantly improves task performance by generating preprocessed images beneficial for downstream tasks including semantic segmentation, object detection, and instance segmentation.
Conclusion: The framework successfully bridges the gap between image enhancement and downstream task performance by focusing on task-relevant features rather than just human visual perception.
Abstract: In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The codes are publicly available at https://github.com/oucailab/DTIUIE.
[314] NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
Seemandhar Jain, Keshav Gupta, Kunal Gupta, Manmohan Chandraker
Main category: cs.CV
TL;DR: NERFIFY is a multi-agent framework that converts NeRF research papers into trainable Nerfstudio plugins, achieving domain-specific executability through six innovations including CFG-constrained LLM synthesis, graph-of-thought code generation, and visual feedback mechanisms.
Details
Motivation: The proliferation of NeRF research requires significant reimplementation efforts before building upon existing work. Current generic paper-to-code methods and frontier models like GPT-5 usually fail to produce runnable code for complex vision papers like NeRF implementations.Method: NERFIFY uses six key innovations: (1) Context-free grammar constraining LLM synthesis to Nerfstudio architecture, (2) Graph-of-Thought code synthesis with specialized multi-file agents, (3) Compositional citation recovery integrating components from reference graphs, (4) Visual feedback through PSNR-minima ROI analysis and VLM-guided patching, (5) Knowledge enhancement for method improvements, and (6) Benchmarking framework for evaluation across 30 diverse papers.
Result: On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. The framework demonstrates reliable conversion of NeRF papers into trainable plugins.
Conclusion: NERFIFY demonstrates that domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research in the NeRF domain. The approach shows promise for automating implementation of vision research papers.
Abstract: The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.
[315] Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation
Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, Ajmal Mian
Main category: cs.CV
TL;DR: DHVAE: Disentangled Hierarchical Variational Autoencoder with latent diffusion for generating realistic 3D Human-Human Interactions by disentangling global interaction context from individual motion patterns.
Details
Motivation: Existing methods compress all motion information into single latent representations, limiting ability to capture fine-grained actions and inter-agent interactions, leading to semantic misalignment and physically implausible artifacts like penetration or missed contact.Method: Proposes DHVAE with CoTransformer module to disentangle global interaction context and individual motion patterns into decoupled latent structure. Uses contrastive learning constraints for physically plausible latent space, and DDIM-based diffusion denoising in hierarchical latent space with skip-connected AdaLN-Transformer denoiser.
Result: DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency compared to existing methods.
Conclusion: The disentangled hierarchical approach with latent diffusion effectively addresses limitations in 3D HHI generation, producing more realistic and physically plausible human-human interactions.
Abstract: Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.
[316] FireRed-OCR Technical Report
Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao
Main category: cs.CV
TL;DR: FireRed-OCR transforms general vision-language models into specialized OCR experts for precise document parsing using geometric-semantic data synthesis and progressive training with reinforcement learning.
Details
Motivation: General VLMs suffer from structural hallucination when processing complex documents, limiting their utility in industrial OCR applications that require pixel-precise structural parsing.Method: A systematic framework with: 1) Geometry+Semantics Data Factory using geometric feature clustering for balanced dataset synthesis, 2) Three-Stage Progressive Training: multi-task pre-alignment, specialized SFT for Markdown output, and Format-Constrained GRPO reinforcement learning for syntactic validity.
Result: Achieves state-of-the-art performance on OmniDocBench v1.5 with 92.94% overall score, outperforming DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics.
Conclusion: FireRed-OCR successfully specializes general VLMs into high-performance OCR models for structural document parsing, demonstrating the viability of the “General VLM to Specialized Structural Expert” paradigm.
Abstract: We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from structural hallucination'' when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a Geometry + Semantics’’ Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model’s understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the ``General VLM to Specialized Structural Expert’’ paradigm.
[317] TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models
Xin Jin, Yichuan Zhong, Yapeng Tian
Main category: cs.CV
TL;DR: TP-Blend is a training-free diffusion editing framework that uses two separate textual prompts (object and style) with complementary attention processors for simultaneous object replacement and style transfer.
Details
Motivation: Current text-conditioned diffusion editors handle single object replacement well but struggle when both a new object and new style must be introduced simultaneously. There's a need for precise control over both content and appearance in image editing.Method: TP-Blend uses two attention processors: 1) Cross-Attention Object Fusion (CAOF) averages head-wise attention to locate spatial tokens, solves entropy-regularized optimal transport to reassign multi-head feature vectors, preserving cross-head correlations. 2) Self-Attention Style Fusion (SASF) injects style through Detail-Sensitive Instance Normalization using Gaussian filtering to separate low/high frequencies, and swaps Key/Value matrices with style prompt.
Result: Extensive experiments show TP-Blend produces high-resolution, photo-realistic edits with precise control over content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
Conclusion: TP-Blend provides an effective lightweight training-free framework for simultaneous object replacement and style transfer in diffusion models, achieving superior performance compared to existing methods.
Abstract: Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
[318] M-Gaussian: An Magnetic Gaussian Framework for Efficient Multi-Stack MRI Reconstruction
Kangyuan Zheng, Xuan Cai, Jiangqi Wang, Guixing Fu, Zhuoshuo Li, Yazhou Chen, Xinting Ge, Liangqiong Qu, Mengting Liu
Main category: cs.CV
TL;DR: M-Gaussian adapts 3D Gaussian Splatting to MRI reconstruction, achieving high-quality isotropic volume reconstruction from anisotropic multi-stack MRI data with significantly improved computational efficiency.
Details
Motivation: Clinical MRI often uses multi-stack thick-slice acquisitions to reduce scan time and motion sensitivity, but this creates severe through-plane anisotropy that compromises volumetric analysis. Existing implicit neural representation methods achieve high quality but are computationally inefficient.Method: Adapts 3D Gaussian Splatting to MRI reconstruction with three key contributions: (1) Magnetic Gaussian primitives with physics-consistent volumetric rendering, (2) neural residual field for high-frequency detail refinement, and (3) multi-resolution progressive training.
Result: Achieves 40.31 dB PSNR on the FeTA dataset while being 14 times faster than previous methods, representing the first successful adaptation of 3D Gaussian Splatting to multi-stack MRI reconstruction.
Conclusion: M-Gaussian achieves an optimal balance between quality and speed for MRI reconstruction, successfully adapting 3D Gaussian Splatting to medical imaging with physics-consistent rendering and efficient training.
Abstract: Magnetic Resonance Imaging (MRI) is a crucial non-invasive imaging modality. In routine clinical practice, multi-stack thick-slice acquisitions are widely used to reduce scan time and motion sensitivity, particularly in challenging scenarios such as fetal brain imaging. However, the resulting severe through-plane anisotropy compromises volumetric analysis and downstream quantitative assessment, necessitating robust reconstruction of isotropic high-resolution volumes. Implicit neural representation methods, while achieving high quality, suffer from computational inefficiency due to complex network structures. We present M-Gaussian, adapting 3D Gaussian Splatting to MRI reconstruction. Our contributions include: (1) Magnetic Gaussian primitives with physics-consistent volumetric rendering, (2) neural residual field for high-frequency detail refinement, and (3) multi-resolution progressive training. Our method achieves an optimal balance between quality and speed. On the FeTA dataset, M-Gaussian achieves 40.31 dB PSNR while being 14 times faster, representing the first successful adaptation of 3D Gaussian Splatting to multi-stack MRI reconstruction.
[319] Event-Only Drone Trajectory Forecasting with RPM-Modulated Kalman Filtering
Hari Prasanth S. M., Pejman Habibiroudkenar, Eerik Alamikkotervo, Dimitrios Bouzoulas, Risto Ojala
Main category: cs.CV
TL;DR: Event-only drone trajectory forecasting using propeller motion cues from event cameras, outperforming learning-based methods on short-medium horizons.
Details
Motivation: Event cameras offer high-temporal-resolution sensing ideal for fast-moving drones, but existing methods don't fully exploit propeller-induced motion cues for trajectory prediction without RGB imagery or training data.Method: Extracts propeller rotational speed directly from raw event data and fuses it within an RPM-aware Kalman filtering framework for event-only drone forecasting.
Result: Outperforms learning-based approaches and vanilla Kalman filter on FRED dataset in terms of average distance error and final distance error at 0.4s and 0.8s forecasting horizons.
Conclusion: Demonstrates robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data by exploiting propeller motion cues from event cameras.
Abstract: Event cameras provide high-temporal-resolution visual sensing that is well suited for observing fast-moving aerial objects; however, their use for drone trajectory prediction remains limited. This work introduces an event-only drone forecasting method that exploits propeller-induced motion cues. Propeller rotational speed are extracted directly from raw event data and fused within an RPM-aware Kalman filtering framework. Evaluations on the FRED dataset show that the proposed method outperforms learning-based approaches and vanilla kalman filter in terms of average distance error and final distance error at 0.4s and 0.8s forecasting horizons. The results demonstrate robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data.
[320] Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models
Binesh Sadanandan, Vahid Behzadan
Main category: cs.CV
TL;DR: Medical VLM consistency improved via LoRA fine-tuning with combined paraphrase consistency and accuracy loss, reducing answer flips from 14.6% to 4.4% while maintaining accuracy.
Details
Motivation: Medical Vision-Language Models show inconsistency in answering rephrased versions of the same clinical question, which is problematic for reliable medical applications. The paper aims to address this paraphrase inconsistency issue in MedGemma-4B.Method: Used PSF-Med dataset for systematic consistency evaluation, validated transfer of Gemma Scope 2 Sparse Autoencoders to MedGemma, then fine-tuned Low-Rank Adaptation (LoRA) adapters with a combined loss balancing paraphrase consistency with answer accuracy to prevent mode collapse.
Result: Reduced flip rate from 14.6% to 4.4% (p=0.002) and margin difference from 1.63 to 0.33 logits (79.5% reduction) on MIMIC-CXR while maintaining accuracy (84.2% vs 82.3%). On PadChest Balanced, flip rate dropped from 13.6% to 7.8% and accuracy increased from 66.4% to 69.4%.
Conclusion: Combined consistency-accuracy training effectively reduces paraphrase inconsistency in medical VLMs without sacrificing accuracy, with early layers being more effective than middle layers for reducing margin differences.
Abstract: Medical Vision-Language Models can give different yes or no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med Sadanandan and Behzadan (2025), which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions (n = 158), the baseline flip rate is 14.6% and mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving R2 ~= 0.997 on both medical and general text (n = 100 prompts each, p < 0.001 for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents mode collapse that occurs with pure consistency training while reducing flip rate from 14.6% to 4.4% (p = 0.002, two-proportion z-test) and margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced (n = 250), flip rate drops from 13.6% to 7.8%, mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
[321] Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Zhihao Li, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang
Main category: cs.CV
TL;DR: ReMD is a physics-consistent diffusion framework for fluid super-resolution that uses multigrid residual correction with multi-wavelet basis to accelerate convergence while preserving fine structures.
Details
Motivation: Existing image SR and generic diffusion models perform poorly for fluid SR - they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. There's a need for physics-consistent approaches that can efficiently handle fluid dynamics.Method: ReMD performs multigrid residual correction at each reverse diffusion step, coupling data consistency with lightweight physics cues and correcting residuals across scales using a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence while remaining equation-free.
Result: Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines.
Conclusion: Enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid super-resolution.
Abstract: Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super-resolution (SR) with \textbf{ReMD} (\underline{Re}sidual-\underline{M}ultigrid \underline{D}iffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a \emph{multigrid residual correction}: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a \emph{multi-wavelet} basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency \emph{inside} the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code are available on https://github.com/lizhihao2022/ReMD.
[322] Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
Zihang Zou, Boqing Gong, Liqiang Wang
Main category: cs.CV
TL;DR: A method called “anchors and shims” that enables diffusion models to replicate copyrighted images by evading watermark detection through gradient-based perturbations to cross-attention mechanisms.
Details
Motivation: To expose vulnerabilities in copyright protection for neural models, particularly diffusion models, by demonstrating how they can plagiarize copyrighted images even when protected by watermarks, highlighting the need for better countermeasures.Method: Uses “anchors and shims” approach with inverse latents as anchors and finds shim perturbations that gradually deviate anchor latents. Applies perturbations to cross-attention mechanisms at different timesteps to induce semantic modifications that bypass copyright detection without requiring additional training.
Result: Experiments on MS-COCO and real-world copyrighted images show diffusion models can successfully replicate copyrighted images while evading watermark and copyright detection systems.
Conclusion: Neural plagiarism poses a critical threat to copyright protection, and the demonstrated vulnerabilities in current watermarking techniques highlight the urgent need for more robust countermeasures against data plagiarism by neural models.
Abstract: In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can replicate copyrighted images, even when protected by advanced watermarking techniques. To expose vulnerabilities in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on “anchors and shims”, employs inverse latents as anchors and finds shim perturbations that gradually deviate the anchor latents, thereby evading watermark or copyright detection. By applying perturbations to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, enabling it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.
[323] Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Haoxiang Sun, Tao Wang, Chenwei Tang, Li Yuan, Jiancheng Lv
Main category: cs.CV
TL;DR: Dr.Seg is a GRPO-based framework for visual perception tasks in VLLMs that addresses limitations of directly transferring language reasoning training paradigms to visual perception by introducing a Look-to-Confirm mechanism and Distribution-Ranked Reward module.
Details
Motivation: The paper challenges the assumption that training paradigms developed for language reasoning can be seamlessly transferred to visual perception tasks in Visual Large Language Models (VLLMs). The authors identify intrinsic differences between reasoning-oriented and perception-oriented settings that make direct transfer problematic.Method: Proposes Dr.Seg, a plug-and-play GRPO-based framework with two key components: 1) Look-to-Confirm mechanism for broader output space, and 2) Distribution-Ranked Reward module for fine-grained, stable rewards. The framework requires no architectural modifications and integrates with existing GRPO-based VLLMs.
Result: Extensive experiments show Dr.Seg improves performance in complex visual scenarios while maintaining strong generalization. The framework demonstrates effectiveness in reasoning segmentation tasks.
Conclusion: Direct transfer of language reasoning training paradigms to visual perception is flawed due to intrinsic differences. Dr.Seg provides an effective solution by addressing the need for broader output space and fine-grained rewards in visual perception tasks.
Abstract: Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr.~Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr.~Seg improves performance in complex visual scenarios while maintaining strong generalization. Code and models will be available at https://github.com/xVI-group-SCU/Dr-Seg.
[324] EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia, Junliang Liu, Man Ho Lam, Wenxuan Wang, Michael R. Lyu
Main category: cs.CV
TL;DR: EfficientPosterGen is an end-to-end framework for automated academic poster generation that addresses limitations of existing MLLM approaches through semantic-aware retrieval, visual compression, and deterministic layout verification.
Details
Motivation: Existing MLLM-based approaches for academic poster generation suffer from low information density when processing full papers, excessive token consumption, and unreliable layout verification, limiting their practical scalability.Method: Three core innovations: 1) Semantic-aware Key Information Retrieval (SKIR) using semantic contribution graphs to model inter-segment relationships, 2) Visual-based Context Compression (VCC) that renders text segments into images to reduce token usage, and 3) Agentless Layout Violation Detection (ALVD) using deterministic color-gradient algorithms for layout verification.
Result: EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation.
Conclusion: The framework addresses critical limitations of existing MLLM approaches and provides an efficient, reliable solution for academic poster generation with practical scalability.
Abstract: Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at https://github.com/vinsontang1/EfficientPosterGen-Code.
[325] BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation
Saivan Talaei, Fatemeh Daneshfar, Abdulhady Abas Abdullah, Mustaqeem Khan
Main category: cs.CV
TL;DR: BiCLIP is a bidirectional multimodal framework for robust medical image segmentation that enhances semantic alignment between vision and language features while maintaining performance with limited annotations and under hardware-induced image degradations.
Details
Motivation: Medical image segmentation needs to be robust in real clinical settings with scarce annotations and hardware-induced image degradations. Current multimodal vision-language models show promise but their resilience in "in-the-wild" clinical environments remains under-explored.Method: BiCLIP features a bidirectional multimodal fusion mechanism where visual features iteratively refine textual representations for superior semantic alignment. It also implements an augmentation consistency objective that regularizes intermediate representations against perturbed input views to stabilize learning.
Result: Evaluation on QaTa-COV19 and MosMedData+ benchmarks shows BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. It maintains high performance with only 1% labeled data and exhibits significant resistance to clinical artifacts like motion blur and low-dose CT noise.
Conclusion: BiCLIP provides a robust framework for medical image segmentation that effectively handles real-world clinical challenges including limited annotations and hardware-induced degradations through bidirectional multimodal fusion and consistency regularization.
Abstract: Medical image segmentation is a cornerstone of computer-assisted diagnosis and treatment planning. While recent multimodal vision-language models have shown promise in enhancing semantic understanding through textual descriptions, their resilience in “in-the-wild” clinical settings-characterized by scarce annotations and hardware-induced image degradations-remains under-explored. We introduce BiCLIP (Bidirectional and Consistent Language-Image Processing), a framework engineered to bolster robustness in medical segmentation. BiCLIP features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations, ensuring superior semantic alignment. To further stabilize learning, we implement an augmentation consistency objective that regularizes intermediate representations against perturbed input views. Evaluation on the QaTa-COV19 and MosMedData+ benchmarks demonstrates that BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. Notably, BiCLIP maintains high performance when trained on as little as 1% of labeled data and exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
[326] Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization
Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2508.16479: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16479&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[327] FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
Bryceton Bible, Shah Md Nehal Hasnaeen, Hairong Qi
Main category: cs.CV
TL;DR: FujiView: A multimodal learning framework combining webcam imagery and meteorological data to predict scenic visibility of Mount Fuji, achieving high accuracy for short-term forecasting.
Details
Motivation: Predicting visibility of natural landmarks like Mount Fuji is crucial for tourism planning and visitor experience, but challenging due to rapidly changing atmospheric conditions. Current methods lack effective multimodal approaches combining visual and weather data.Method: Late-fusion multimodal framework combining YOLO-based image features from webcam imagery with structured meteorological data (concurrent and forecasted weather conditions). Classifies visibility into five categories using image-derived probabilities fused with numerical weather features.
Result: Achieved ACC ~0.89 for same-day prediction and up to 84% for next-day forecasts. Vision features dominate short-term horizons (“nowcasting” and “samedaycasting”), while weather data becomes primary predictive signal beyond +1 day. Late fusion consistently yields highest accuracy.
Conclusion: FujiView establishes Scenic Visibility Forecasting as a new benchmark task for multimodal learning, demonstrating effective fusion of visual and meteorological data. The framework and dataset (100,000+ webcam images from 40+ cameras) will be released to support environmental forecasting research.
Abstract: Visibility of natural landmarks such as Mount Fuji is a defining factor in both tourism planning and visitor experience, yet it remains difficult to predict due to rapidly changing atmospheric conditions. We present FujiView, a multimodal learning framework and dataset for predicting scenic visibility by fusing webcam imagery with structured meteorological data. Our late-fusion approach combines image-derived class probabilities with numerical weather features to classify visibility into five categories. The dataset currently comprises over 100,000 webcam images paired with concurrent and forecasted weather conditions from more than 40 cameras around Mount Fuji, and continues to expand; it will be released to support further research in environmental forecasting. Experiments show that YOLO-based vision features dominate short-term horizons such as “nowcasting” and “samedaycasting”, while weather-driven forecasts increasingly take over as the primary predictive signal beyond $+1$d. Late fusion consistently yields the highest overall accuracy, achieving ACC of approx 0.89 for same-day prediction and up to 84% for next-day forecasts. These results position Scenic Visibility Forecasting (SVF) as a new benchmark task for multimodal learning.
[328] Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
Tianyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chun-Le Guo, Chongyi Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2508.16557: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16557&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[329] DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops
Boyang Deng, Yuzhen Lu
Main category: cs.CV
TL;DR: Fine-tuned DINOv3 vision transformer integrated into YOLO26 for crop-weed detection, achieving significant mAP improvements and strong cross-domain generalization despite increased parameters and latency.
Details
Motivation: Precision vegetable weeding models are limited by scarce annotated weed-crop datasets, requiring better foundational models that can handle heterogeneous data and generalize across domains.Method: Collected 618,642 crop-weed images, filtered to 199,388 for fine-tuning DINOv3 ViT-small via sequential curation. Integrated fine-tuned backbone into YOLO26 as primary or dual-backbone with feature alignment loss for enhanced fusion.
Result: DINOv3-finetuned YOLO26-large achieved +5.4% mAP50 on in-domain 2025 images, +14.0% on 2021-2023 dataset, and +11.9% on 2024 dataset. Maintains real-time performance at ~28.5 fps despite 45.6% more parameters and 2.9x latency increase.
Conclusion: The approach successfully addresses dataset scarcity through self-supervised learning and heterogeneous data integration, creating a robust foundational model for crop-weed detection with strong generalization capabilities.
Abstract: Developing robust models for precision vegetable weeding is currently constrained by the scarcity of large-scale, annotated weed-crop datasets. To address this limitation, this study proposes a foundational crop-weed detection model by integrating heterogeneous datasets and leveraging self-supervised learning. A total of 618,642 crop-weed images were initially collected and subsequently refined to 199,388 filtered images for fine-tuning a DINOv3 vision transformer (ViT-small) through a sequential curation strategy. The fine-tuned DINOv3 backbone was then integrated into YOLO26, serving either as a primary backbone or part of a dual-backbone architecture. A feature alignment loss was introduced in the dual backbone framework to enhance feature fusion with minimal computational overhead. Experimental results show that the proposed DINOv3-finetuned ViT-small-based YOLO26-large achieved up to a +5.4% mAP50 gain on in-domain images collected in the 2025 season. Moreover, it demonstrated strong cross-domain generalization with mAP50 improvements of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, compared to the standard YOLO26-large. Although the DINOv3-YOLO26-large model has 45.6% more parameters and a 2.9x increase in inference latency, it maintains real-time performance at ~28.5 frames per second (fps). The curated dataset and software programs developed in this study will be made publicly available.
[330] SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision
S. Kalaycioglu, C. Hong, M. Zhu, H. Xie
Main category: cs.CV
TL;DR: SKINOPATHY AI is a smartphone-based web app with five explainable ophthalmic screening modules using computer vision techniques for low-resource settings.
Details
Motivation: To enable early ophthalmic screening in low-resource and remote settings where access to specialized equipment and trained practitioners is limited, using commodity mobile hardware.Method: Developed a React/FastAPI web application with five complementary screening modules: (1) redness quantification via LAB a* color-space normalization, (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio with adaptive thresholding, (3) pupil light reflex characterization through Pupil-to-Iris Ratio time-series analysis, (4) scleral color indexing for icterus/anemia proxies via LAB/HSV statistics, and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates.
Result: The system demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing deterministic, privacy-preserving algorithms for non-diagnostic consumer triage.
Conclusion: SKINOPATHY AI provides a foundation for future clinically validated mobile ophthalmoscopy tools that can operate entirely on commodity smartphones in low-resource settings.
Abstract: Early ophthalmic screening in low-resource and remote settings is constrained by access to specialized equipment and trained practitioners. We present SKINOPATHY AI, a smartphone-first web application that delivers five complementary, explainable screening modules entirely through commodity mobile hardware: (1) redness quantification via LAB a* color-space normalization; (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; (3) pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; (4) scleral color indexing foricterus and anemia proxies via LAB/HSV statistics; and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system is implemented as a React/FastAPI stack with OpenCV and MediaPipe, MongoDB-backed session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage. We detail system architecture, algorithm design, evaluation methodology, clinical context, and ethical boundaries of the platform. SKINOPATHY AI demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing a foundation for future clinically validated mobile ophthalmoscopy tools.
[331] COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics
Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing paper contentMethod: Cannot determine method due to missing paper content
Result: Cannot determine results due to missing paper content
Conclusion: Cannot determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2509.22240: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22240&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[332] A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance
Nicholas Korcynski
Main category: cs.CV
TL;DR: The paper addresses whiteboard stroke segmentation with extreme class imbalance, proposing an evaluation protocol that combines region metrics, boundary metrics, and thin-stroke equity analysis to reveal hidden trade-offs between different loss functions.
Details
Motivation: Whiteboard stroke segmentation suffers from extreme class imbalance where stroke pixels constitute only about 1.79% of the image, with thin strokes being even rarer. Standard region metrics like F1 and IoU can mask thin-stroke failures because the background dominates the score, necessitating more comprehensive evaluation methods.Method: Proposes an evaluation protocol combining region metrics (F1, IoU), boundary metrics (BF1, B-IoU), core/thin-subset equity analysis, and per-image robustness statistics with seeded multi-run training and non-parametric significance testing. Tests five losses on DeepLabV3-MobileNetV3 model with three training runs each on 12 held-out images split into core and thin subsets.
Result: Overlap-based losses improve F1 by over 20 points compared to cross-entropy (0.663 vs 0.438, p < 0.001). Boundary metrics confirm gains extend to contour precision. Classical baselines like Sauvola binarization achieve higher mean F1 (0.787) but worse worst-case performance (F1 = 0.452 vs 0.565 for Tversky), revealing a consistency-accuracy trade-off. Doubling training resolution increases F1 by 12.7 points.
Conclusion: Comprehensive evaluation reveals hidden trade-offs in whiteboard stroke segmentation. While classical methods achieve higher mean accuracy, learned models provide better worst-case reliability. The proposed evaluation protocol with boundary metrics and thin-stroke equity analysis is crucial for proper assessment of segmentation methods in imbalanced scenarios.
Abstract: The binary segmentation of whiteboard strokes is hindered by extreme class imbalance, caused by stroke pixels that constitute only $1.79%$ of the image on average, and in addition, the thin-stroke subset averages $1.14% \pm 0.41%$ in the foreground. Standard region metrics (F1, IoU) can mask thin-stroke failures because the vast majority of the background dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses – cross-entropy, focal, Dice, Dice+focal, and Tversky – are trained three times each on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs $0.438$, $p < 0.001$). In addition, the boundary metrics confirm that the gain extends to the precision of the contour. Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but with substantially worse worst-case performance (F1 $= 0.452$ vs $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1 while the learned model delivers higher worst-case reliability. Doubling training resolution further increases F1 by 12.7 points.
[333] Hierarchical Multi-Scale Graph Learning with Knowledge-Guided Attention for Whole-Slide Image Survival Analysis
Bin Xu, Yufei Zhou, Boling Song, Jingwen Sun, Yang Bian, Cheng Lu, Ye Wu, Jianfei Tu, Xiangxue Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2602.23557: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23557&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[334] ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei, Wenqi Xu, Jionglong Su, Yang Liu
Main category: cs.CV
TL;DR: ConFoThinking improves MLLM fine-grained VQA by consolidating fragmented attention signals into a designated layer and using concise semantic cues for attention extraction, then zooming into salient regions.
Details
Motivation: Current MLLMs have unreliable grounding capabilities for fine-grained VQA. Attention-driven ROI cropping methods suffer from fragmented attention signals across layers and rely on question- or text-conditioned attention extraction that introduces semantic noise.Method: ConFoThinking framework learns to aggregate attention into a designated intermediate layer, mines salient regions from this consolidated attention, and zooms in on them. Uses concise semantic cues for attention extraction instead of question-based extraction.
Result: Experiments across five VQA benchmarks demonstrate significant improvement in perception performance compared to existing methods.
Conclusion: Consolidating fragmented attention and using concise semantic cues for attention extraction effectively improves MLLM performance on fine-grained visual question answering tasks.
Abstract: Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on the capacity of grounding, which remains unreliable for MLLMs. In parallel, attention-driven methods to crop the Region of Interest (ROIs) are proposed but they are constrained by (1) fragmented attention signals scattered across layers, leading to suboptimal localization and (2) relying on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these, We propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released after being accepted.
[335] Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios
Salome Kazeminia, Carsten Marr, Bastian Rieck
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2307.14025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2307.14025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[336] Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang
Main category: cs.CV
TL;DR: Paper introduces “Paradox of Simplicity” in generative AI, formalizes “Obedience” as alignment with instructions, creates hierarchical grading system, and presents VIOLIN benchmark for evaluating pure color generation in visual models.
Details
Motivation: Address the "Paradox of Simplicity" where generative AI models can create complex content but fail at simple deterministic tasks, highlighting the need for better instruction alignment and obedience in AI systems.Method: Formalize Obedience as instruction alignment capability, establish hierarchical grading system from basic semantic to pixel-level precision, conduct case studies to identify obedience gaps, and create VIOLIN benchmark for evaluating pure color generation across six variants.
Result: Extensive experiments on state-of-the-art models reveal fundamental obedience limitations, showing how generative priors often override logical constraints, with exploratory insights into model behavior.
Conclusion: Establishes framework for AI Obedience to draw attention to this gap and encourage deeper exploration to bridge the paradox between complex generation capabilities and simple task failures.
Abstract: Recent advances in generative AI have demonstrated remarkable ability to produce high-quality content. However, these models often exhibit “Paradox of Simplicity”: while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and further exploratory insights. By establishing this framework, we aim to draw more attention on AI Obedience and encourage deeper exploration to bridge this gap.
[337] MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds
Jinge Ma, Xiaoyan Zhang, Gautham Vinod, Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2411.10492: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.10492&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[338] Image-Based Classification of Olive Species Specific to Turkiye with Deep Neural Networks
Irfan Atabas, Hatice Karatas
Main category: cs.CV
TL;DR: Deep learning-based classification of Turkish olive species using CNN models with EfficientNetB0 achieving 94.5% accuracy
Details
Motivation: To develop an automated system for classifying local olive species in Turkey using image processing and deep learning, addressing the need for efficient agricultural product identification and quality controlMethod: Used stereo camera to capture images of five olive species, preprocessed images, employed CNN architectures (MobileNetV2 and EfficientNetB0) with transfer learning optimization for image classification
Result: EfficientNetB0 model achieved optimal performance with 94.5% accuracy, demonstrating deep learning’s effectiveness for olive species classification
Conclusion: Deep learning-based systems provide effective solutions for agricultural product classification with high accuracy, showing significant potential for automatic identification and quality control applications
Abstract: In this study, image processing and deep learning methodologies were employed to automatically classify local olive species cultivated in Turkiye. A stereo camera was utilized to capture images of five distinct olive species, which were then preprocessed to ensure their suitability for analysis. Convolutional Neural Network (CNN) architectures, specifically MobileNetV2 and EfficientNetB0, were employed for image classification. These models were optimized through a transfer learning approach. The training and testing results indicated that the EfficientNetB0 model exhibited the optimal performance, with an accuracy of 94.5%. The findings demonstrate that deep learning-based systems offer an effective solution for classifying olive species with high accuracy. The developed method has significant potential for application in areas such as automatic identification and quality control of agricultural products.
[339] OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model
Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: No method information available - paper content inaccessible due to HTTP 429 error
Result: No results available - API request was rate limited
Conclusion: Unable to analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2504.06027: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.06027&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[340] A Novel Evolutionary Method for Automated Skull-Face Overlay in Computer-Aided Craniofacial Superimposition
Práxedes Martínez-Moreno, Andrea Valsecchi, Pablo Mesejo, Pilar Navarro-Ramírez, Valentino Lugli, Sergio Damas
Main category: cs.CV
TL;DR: Lilium is an automated evolutionary method that improves skull-face overlay accuracy in forensic identification by modeling soft-tissue variability using a 3D cone-based representation optimized via Differential Evolution.
Details
Motivation: Current Skull-Face Overlay (SFO) techniques in forensic craniofacial superimposition suffer from accuracy issues due to individual variability in soft-tissue thickness, which introduces significant uncertainty into the alignment process.Method: Lilium uses a 3D cone-based representation to model soft-tissue variability, optimized via Differential Evolution algorithm. It enforces anatomical plausibility through constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism.
Result: Lilium outperforms state-of-the-art methods in both accuracy and robustness for skull-face overlay in forensic identification.
Conclusion: The proposed evolutionary approach successfully addresses soft-tissue variability challenges in craniofacial superimposition, providing a more accurate and robust automated solution for forensic identification.
Abstract: Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by cranial and facial landmarks’ correspondence. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. This emulation of the usual forensic practitioners’ approach leads Lilium to outperform the state-of-the-art method in terms of both accuracy and robustness.
[341] AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning
Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Chengjun Xie, Xuanhua He, Jie Zhang
Main category: cs.CV
TL;DR: AdaFocus is a training-free framework for multimodal LLMs that adaptively crops images to focus on relevant regions, improving both accuracy and efficiency by addressing perceptual redundancy and semantic-spatial attention drift.
Details
Motivation: Current training-free methods for MLLMs suffer from two main issues: (1) perceptual redundancy from indiscriminate cropping that adds computational overhead and noise, and (2) drift between semantic intent and spatial attention that prevents accurate localization of user-focused regions. Large-scale training approaches are computationally expensive, creating demand for lightweight, training-free solutions.Method: AdaFollows a two-stage pipeline: (1) a confidence-based module that decides WHEN to crop based on the model’s confidence in its current visual understanding, and (2) a semantic-guided localization module that determines WHERE to crop by aligning semantic intent with spatial attention to accurately localize user-focused regions.
Result: AdaFocus delivers substantial performance gains while achieving approximately 4.0× speedup in inference speed compared to the state-of-the-art method ZoomEyes, representing significant advances in both accuracy and efficiency.
Conclusion: AdaFocus provides an effective training-free framework for adaptive visual reasoning in MLLMs that addresses key limitations of existing methods, offering improved performance and efficiency through intelligent, adaptive cropping strategies.
Abstract: Multimodal Large Language Models (MLLMs) are shifting towards “Thinking with Images” by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving approximately 4.0\times speedup inference speedup than the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
[342] Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model
Simo Ryu, Chunghwan Han
Main category: cs.CV
TL;DR: Training a 22B parameter video foundation model from scratch with engineering insights on dataset curation, architecture, and optimization techniques.
Details
Motivation: To share practical experience and lessons learned from developing a large-scale video foundation model from scratch, documenting the engineering challenges and design decisions involved in scaling from raw data collection to functional model training.Method: Combined metadata-driven dataset curation, multi-stage filtering, μP (muP) parameterization, and hypersphere-constrained optimization. Developed Lavender Data system for dataset management and made inference-aware architectural choices. Trained on approximately 50 million video clips.
Result: Successfully trained Summer-22B, a 22B parameter video foundation model. Key observations: dataset engineering consumed majority of effort, architectural variants showed smaller differences than expected, and μP hyperparameter transfer appeared effective even under geometric constraints.
Conclusion: The report provides valuable engineering insights for large-scale video model development, emphasizing the importance of dataset curation and sharing practical lessons that could benefit similar projects in the field.
Abstract: We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $μ$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $μ$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
[343] Infinite Self-Attention
Giorgio Roffo
Main category: cs.CV
TL;DR: InfSA reformulates self-attention as a diffusion process on token graphs, enabling linear-time computation and scaling to high-resolution vision tasks while maintaining performance.
Details
Motivation: The quadratic computational cost of softmax attention limits Transformer scalability for high-resolution vision tasks, creating a need for more efficient attention mechanisms.Method: Introduces Infinite Self-Attention (InfSA) as a spectral reformulation treating attention layers as diffusion steps on content-adaptive token graphs, with Linear-InfSA variant approximating the principal eigenvector without forming full attention matrices.
Result: Achieves 84.7% top-1 on ImageNet-1K (+3.2 points over softmax ViT), runs at 231 images/s with 13x better throughput, and scales to 9216×9216 resolution without memory issues.
Conclusion: InfSA provides an efficient, scalable attention mechanism for vision transformers with strong performance, interpretability through graph centrality connections, and practical linear-time computation.
Abstract: The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token’s centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
[344] Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO1.5, YOLOv11, and SAM2.1
Abhinav Munagala
Main category: cs.CV
TL;DR: Dual-pipeline framework for bird image segmentation using foundation models (SAM 2.1) with zero-shot (Grounding DINO) and supervised (YOLOv11) detection pipelines, achieving state-of-the-art results on CUB-200-2011 dataset.
Details
Motivation: Bird image segmentation is challenging due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. Existing methods struggle with these challenges, and there's a need for approaches that can handle new species/domains without retraining segmentation models.Method: Two pipelines built on frozen SAM 2.1 backbone: (1) Zero-shot pipeline uses Grounding DINO 1.5 to detect birds via text prompt “bird” then prompts SAM 2.1 with bounding boxes; (2) Supervised pipeline fine-tunes YOLOv11 on CUB-200-2011 for high-precision detection then prompts SAM 2.1. Segmentation model is never retrained for new species/domains.
Result: On CUB-200-2011 (11,788 images, 200 species): Supervised pipeline achieves IoU 0.912, Dice 0.954, F1 0.953, outperforming SegFormer-B2 (IoU 0.842) by +7.0 percentage points. Zero-shot pipeline achieves IoU 0.831 using only text prompt, first such result reported on this benchmark.
Conclusion: Prompt-based foundation model pipelines outperform task-specific end-to-end trained segmentation networks while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Approach demonstrates strong generalization without segmentation model retraining.
Abstract: Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt “bird” before prompting SAM 2.1 with bounding boxes requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953 outperforming all prior baselines including SegFormer-B2 (IoU 0.842) by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
[345] Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang
Main category: cs.CV
TL;DR: ST-Lite is a training-free KV cache compression framework for GUI agents that addresses uniform high-sparsity attention patterns in GUI interactions using dual-branch scoring with spatial saliency and semantic gating.
Details
Motivation: Large VLMs for GUI agents face memory and latency constraints from KV cache during long interactions. Existing cache compression methods underperform in GUI scenarios due to uniform high-sparsity attention patterns across layers, unlike general visual tasks.Method: Proposes ST-Lite with dual-branch scoring: Component-centric Spatial Saliency (CSS) preserves UI element structural integrity by evaluating local neighborhood saliency, and Trajectory-aware Semantic Gating (TSG) filters visually repetitive KV pairs within interaction trajectories.
Result: With only 10-20% cache budget, ST-Lite achieves 2.45x decoding acceleration while maintaining comparable or superior performance to full-cache baselines.
Conclusion: ST-Lite offers a scalable, training-free solution for efficient GUI agents by addressing GUI-specific attention patterns and dynamic spatio-trajectory dependencies.
Abstract: Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
[346] SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models
Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang
Main category: cs.CV
TL;DR: SKeDA is a generative watermarking framework for text-to-video diffusion models that improves robustness against frame reordering and temporal distortions through shuffle-key encryption and differential attention mechanisms.
Details
Motivation: The rise of text-to-video generation models raises concerns about content authenticity, copyright protection, and malicious misuse. Existing image watermarking methods don't work well for videos due to frame alignment requirements and vulnerability to video-specific distortions like inter-frame compression.Method: SKeDA has two components: 1) Shuffle-Key-based Distribution-preserving Sampling (SKe) uses a single base pseudo-random binary sequence for watermark encryption and derives frame-level sequences through permutation, making extraction tolerant to frame reordering. 2) Differential Attention (DA) computes inter-frame differences and dynamically adjusts attention weights during extraction to handle temporal distortions.
Result: Extensive experiments show that SKeDA preserves high video generation quality while maintaining strong watermark robustness against various attacks and distortions.
Conclusion: SKeDA provides an effective watermarking solution for text-to-video diffusion models that addresses the unique challenges of video content, offering improved robustness against frame-level and temporal distortions while maintaining generation quality.
Abstract: The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: Existing designs implicitly rely on strict alignment between video frames and frame-dependent pseudo-random binary sequences used for watermark encryption. Once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and Video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe) employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
[347] A Case Study on Concept Induction for Neuron-Level Interpretability in CNN
Moumita Sen Sarma, Samatha Ereshi Akkamahadevi, Pascal Hitzler
Main category: cs.CV
TL;DR: The paper applies a Concept Induction framework for analyzing hidden neuron semantics in DNNs to the SUN2012 dataset, demonstrating generalization of the approach beyond the original ADE20K dataset.
Details
Motivation: To investigate whether the Concept Induction-based framework for hidden neuron analysis in DNNs generalizes beyond the ADE20K dataset to other scene recognition benchmarks like SUN2012.Method: Applied the same Concept Induction workflow to SUN2012 dataset, assigning interpretable semantic labels to neurons and validating them through web-sourced images and statistical testing.
Result: The method successfully transfers to SUN2012, confirming its broader applicability for analyzing hidden neuron semantics in DNNs across different scene recognition datasets.
Conclusion: The Concept Induction framework for hidden neuron analysis generalizes well to SUN2012, demonstrating its robustness and broader applicability for understanding DNN internal representations.
Abstract: Deep Neural Networks (DNNs) have advanced applications in domains such as healthcare, autonomous systems, and scene understanding, yet the internal semantics of their hidden neurons remain poorly understood. Prior work introduced a Concept Induction-based framework for hidden neuron analysis and demonstrated its effectiveness on the ADE20K dataset. In this case study, we investigate whether the approach generalizes by applying it to the SUN2012 dataset, a large-scale scene recognition benchmark. Using the same workflow, we assign interpretable semantic labels to neurons and validate them through web-sourced images and statistical testing. Our findings confirm that the method transfers to SUN2012, showing its broader applicability.
[348] Stateful Token Reduction for Long-Video Hybrid VLMs
Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu, Andrew Tao, Pavlo Molchanov, Jan Kautz, Wonmin Byeon
Main category: cs.CV
TL;DR: Progressive token reduction method for hybrid video VLMs combining attention and Mamba blocks, achieving 3.8-4.2x speedup with minimal accuracy loss.
Details
Motivation: Existing token reduction methods are designed for dense Transformers but don't work well for hybrid architectures that mix attention with linear-time state-space blocks like Mamba. Need efficient token reduction for video VLMs that maintains accuracy while accelerating processing.Method: Analyzes token reduction behavior through layerwise sparsity and importance stability. Proposes low-to-high progressive reduction schedule and unified language-aware scoring mechanism for both attention and Mamba blocks (using implicit-attention proxy for Mamba). Enables all-layer token reduction in hybrid architectures.
Result: Under aggressive compression (retaining 25% of visual tokens), achieves 3.8-4.2x prefilling speedups with near-baseline accuracy. Light finetuning under reduction further improves performance on long-context video benchmarks.
Conclusion: Progressive token reduction with unified scoring effectively accelerates hybrid video VLMs while maintaining accuracy, addressing challenges of token importance instability across layers in mixed attention-Mamba architectures.
Abstract: Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enabling all-layer token reduction in hybrids. Under an aggressive compression setting (retaining 25% of visual tokens), our approach delivers substantial prefilling speedups (3.8–4.2x) with near-baseline accuracy at test time, and light finetuning under reduction further improves performance on long-context video benchmarks.
[349] AdURA-Net: Adaptive Uncertainty and Region-Aware Network
Antik Aich Roy, Ujjwal Bhattacharya
Main category: cs.CV
TL;DR: AdURA-Net: A geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification that handles uncertain labels in medical imaging datasets.
Details
Motivation: Clinical decision-making faces uncertainty from ambiguous radiology reports and automated label extraction limitations, especially in multilabel datasets like CheXpert and MIMIC-CXR with uncertain labels. Models need to express uncertainty when lacking sufficient evidence, which is crucial for high-risk clinical applications.Method: Proposes AdURA-Net with two key components: 1) Adaptive dilated convolution and multiscale deformable alignment integrated with DenseNet backbone to capture anatomical complexities, and 2) Dual Head Loss combining masked binary cross entropy with logit and Dirichlet evidential learning objective.
Result: The paper presents a framework for thoracic disease classification that can appropriately handle uncertain cases by expressing model uncertainty rather than forcing confident predictions when evidence is insufficient.
Conclusion: AdURA-Net provides a reliable uncertainty-aware framework for medical image classification that addresses the critical need for models to express uncertainty in clinical decision-making, particularly for high-risk applications.
Abstract: One of the common issues in clinical decision-making is the presence of uncertainty, which often arises due to ambiguity in radiology reports, which often reflect genuine diagnostic uncertainty or limitations of automated label extraction in various complex cases. Especially the case of multilabel datasets such as CheXpert, MIMIC-CXR, etc., which contain labels such as positive, negative, and uncertain. In clinical decision-making, the uncertain label plays a tricky role as the model should not be forced to provide a confident prediction in the absence of sufficient evidence. The ability of the model to say it does not understand whenever it is not confident is crucial, especially in the cases of clinical decision-making involving high risks. Here, we propose AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification. The key highlights of the proposed model are: a) Adaptive dilated convolution and multiscale deformable alignment coupled with the backbone Densenet architecture capturing the anatomical complexities of the medical images, and b) Dual Head Loss, which combines masked binary cross entropy with logit and a Dirichlet evidential learning objective.
[350] TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models
Daniel Nobrega Medeiros
Main category: cs.CV
TL;DR: TACIT Benchmark: A programmatic visual reasoning benchmark with 10 tasks across 6 reasoning domains, featuring dual-track evaluation (generative and discriminative) with deterministic verification.
Details
Motivation: Existing visual reasoning benchmarks rely too much on natural language prompts, evaluate narrow reasoning modalities, or use subjective scoring like LLM-as-judge, lacking rigorous, objective evaluation of visual reasoning capabilities.Method: Created a benchmark with 10 tasks across 6 reasoning domains (spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, topology). Provides dual-track evaluation: generative track where models produce solution images verified through deterministic computer-vision pipelines, and discriminative track with five-way multiple choice using structurally plausible near-miss distractors.
Result: Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. Dataset, generation code, and evaluation harness released under Apache 2.0 license on HuggingFace.
Conclusion: TACIT Benchmark provides a rigorous, objective framework for evaluating visual reasoning capabilities across diverse domains, addressing limitations of existing benchmarks through programmatic generation and deterministic verification.
Abstract: Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: 10.57967/hf/7904).
[351] VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu, Hongjing Zhang, Jakub Zablocki, Yifan Xing, Qin Zhang
Main category: cs.CV
TL;DR: VisRef is a visually grounded test-time scaling framework that actively guides multimodal reasoning by re-injecting semantically relevant visual tokens during extended reasoning, preventing models from losing attention to visual information.
Details
Motivation: Large reasoning models show performance gains with extended test-time compute, but in vision-dependent tasks, extended textual reasoning causes models to lose attention to visual tokens and rely too much on textual priors. Existing RL-based solutions are computationally expensive.Method: VisRef actively guides reasoning by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image. This enables more grounded multimodal reasoning without additional RL fine-tuning.
Result: Experiments on three visual reasoning benchmarks with state-of-the-art multimodal large reasoning models show that VisRef consistently outperforms existing test-time scaling approaches by up to 6.4% under fixed test-time compute budgets.
Conclusion: VisRef provides an effective framework for maintaining visual grounding during extended reasoning in multimodal models, offering performance improvements without the computational cost of RL-based approaches.
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
[352] Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection
Brianna D’Urso, Tahmid Hasan Sakib, Syed Rafay Hasan, Terry N. Guo
Main category: cs.CV
TL;DR: NAPs (Naturalistic Adversarial Patches) can physically attack traffic sign detectors trained on customized AV datasets, reducing STOP sign detection confidence in real-world experiments.
Details
Motivation: To study how well adversarial patches transfer to physical traffic sign settings when detectors are trained on customized autonomous vehicle datasets, and to develop systematic evaluation protocols for physical patch attacks.Method: Created CompGTSRB dataset by pasting GTSRB traffic signs onto undistorted AV backgrounds, trained YOLOv5 detector, generated patches using GAN with latent space optimization, and conducted physical experiments on Quanser QCar testbed with varying distances, patch sizes, and placements.
Result: NAPs successfully reduced the detector’s STOP class confidence across different configurations, demonstrating the effectiveness of physical adversarial patches and the utility of the CompGTSRB dataset for credible evaluation.
Conclusion: The research shows adversarial patches can physically attack AV perception systems, highlights the need for defenses against localized patch corruption, and provides systematic protocols for evaluating physical adversarial attacks.
Abstract: This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB (which is customized dataset for AV environment), by pasting traffic sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and generate patches using a Generative Adversarial Network (GAN) with latent space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed utilizing the front CSI camera provided in QCar. Across configurations, NAPs reduce the detector’s STOP class confidence. Different configurations include distance, patch sizes, and patch placement. These results along with a detailed step-by-step methodology indicate the utility of CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The research further motivate researching the defenses that address localized patch corruption in embedded perception pipelines.
[353] Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification
Giuseppe Sergioli, Carlo Cuccu, Giovanni Pasini, Alessandro Stefano, Giorgio Russo, Andrés Camilo Granda Arango, Roberto Giuntini
Main category: cs.CV
TL;DR: Quantum-inspired multi-class classification using Pretty Good Measurement (PGM) applied to biomedical radiomics for cancer subtyping and risk stratification.
Details
Motivation: To develop a genuinely multi-class classification approach without reduction to pairwise or one-vs-rest schemes, inspired by quantum state discrimination principles, and apply it to biomedical radiomics problems.Method: Associates each class with encoded mixed states and performs classification through a single POVM construction based on the Pretty Good Measurement (PGM), reformulating classification as discrimination of class-dependent density operators.
Result: PGM-based classifier is consistently competitive with and sometimes improves upon standard methods, performing especially well in NSCLC binary and three-class tasks, and remaining competitive in PCa risk stratification with clinically relevant sensitivity-specificity trade-offs.
Conclusion: Quantum-inspired PGM classification provides a viable multi-class approach for biomedical radiomics, demonstrating competitive performance and potential advantages in handling class overlap through geometric discrimination principles.
Abstract: We investigate a quantum-inspired approach to supervised multi-class classification based on the \emph{Pretty Good Measurement} (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity–specificity trade-offs across feature-selection scenarios.
[354] Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization
He Li, Wenyue He, Weihang Kong, Xingchen Zhang
Main category: cs.CV
TL;DR: AP-PCO: A joint position-color optimization framework for generating adversarial patches that simultaneously attack both visual and infrared modalities in dense prediction systems, addressing cross-spectral inconsistencies.
Details
Motivation: Multimodal adversarial attacks for dense prediction, especially in visual-infrared (VI) systems, are underexplored. Existing single-modal patch methods fail to handle cross-spectral inconsistencies between visible and infrared modalities, leading to reduced attack effectiveness and poor stealthiness in VI dense prediction models.Method: Proposes AP-PCO: a joint position-color optimization framework that simultaneously optimizes patch placement and color composition using a fitness function from model outputs. Introduces crossmodal color adaptation strategy to constrain patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in visible domain, reducing cross-spectral saliency. Operates without internal model information (black-box attacks).
Result: Extensive experiments on visual-infrared dense prediction tasks show AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
Conclusion: AP-PCO effectively addresses the challenges of multimodal adversarial attacks in visual-infrared dense prediction by jointly optimizing position and color while bridging spectral discrepancies, enabling effective black-box attacks on VI perception systems.
Abstract: Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for crossspectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a crossmodal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
[355] Ozone Cues Mitigate Reflected Downwelling Radiance in LWIR Absorption-Based Ranging
Unay Dorken Gallastegi, Wentao Shangguan, Vaibhav Choudhary, Akshay Agarwal, Hoover Rueda-Chacón, Martin J. Stevens, Vivek K Goyal
Main category: cs.CV
TL;DR: New passive LWIR ranging methods using ozone absorption features to account for reflected downwelling radiance, improving accuracy from over 100m error to 1.2m.
Details
Motivation: Traditional passive LWIR ranging suffers from inaccuracies due to ignoring reflected downwelling radiance, especially problematic in scenes with low temperature variations.Method: Two methods: 1) Quadspectral - uses four narrowband measurements (two at water vapor absorption line, two at ozone absorption line) for closed-form range estimation. 2) Hyperspectral - uses broader spectral range to estimate temperature, emissivity profiles, and downwelling contributions from multiple zenith angles.
Result: Experimental results show dramatic improvement: error reduced from over 100m (when reflected light not modeled) to 6.8m with quadspectral method and 1.2m with hyperspectral method.
Conclusion: Using ozone absorption features to model reflected downwelling radiance significantly improves passive LWIR ranging accuracy, with hyperspectral method achieving sub-meter precision.
Abstract: Passive long-wave infrared (LWIR) absorption-based ranging relies on atmospheric absorption to estimate distances to objects from their emitted thermal radiation. First demonstrated decades ago for objects much hotter than the air and recently extended to scenes with low temperature variations, this ranging has depended on reflected radiance being negligible. Downwelling radiance is especially problematic, sometimes causing large inaccuracies. In two new ranging methods, we use characteristic features from ozone absorption to estimate the contribution of reflected downwelling radiance. The quadspectral method gives a simple closed-form range estimate from four narrowband measurements, two at a water vapor absorption line and two at an ozone absorption line. The hyperspectral method uses a broader spectral range to improve accuracy while also providing estimates of temperature, emissivity profiles, and contributions of downwelling from a collection of zenith angles. Experimental results demonstrate improved ranging accuracy, in one case reducing error from over 100 m when reflected light is not modeled to 6.8 m with the quadspectral method and 1.2 m with the hyperspectral method.
[356] Seeking Necessary and Sufficient Information from Multimodal Medical Data
Boyu Chen, Weiye Bao, Junjie Liu, Michael Shen, Bo Peng, Paul Taylor, Zhu Li, Mengyue Yang
Main category: cs.CV
TL;DR: Proposes a multimodal medical AI method that learns necessary and sufficient features using Probability of Necessity and Sufficiency (PNS) by decomposing representations into modality-invariant and modality-specific components.
Details
Motivation: Current multimodal medical models overlook learning features that are both necessary (must be present for outcome) and sufficient (enough to determine outcome). Such features could improve performance and robustness to missing modalities by capturing essential predictive information.Method: Extends PNS to multimodal scenarios by decomposing representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each component. Addresses violations of PNS estimation conditions in multimodal settings.
Result: Experiments on synthetic and real-world medical datasets demonstrate the method’s effectiveness in learning necessary and sufficient multimodal features.
Conclusion: The proposed approach successfully extends PNS to multimodal medical AI, enabling learning of features that improve model performance and robustness to missing modalities.
Abstract: Learning multimodal representations from medical images and other data sources can provide richer information for decision-making. While various multimodal models have been developed for this, they overlook learning features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome). We argue learning such features is crucial as they can improve model performance by capturing essential predictive information, and enhance model robustness to missing modalities as each modality can provide adequate predictive signals. Such features can be learned by leveraging the Probability of Necessity and Sufficiency (PNS) as a learning objective, an approach that has proven effective in unimodal settings. However, extending PNS to multimodal scenarios remains underexplored and is non-trivial as key conditions of PNS estimation are violated. We address this by decomposing multimodal representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each. Experiments on synthetic and real-world medical datasets demonstrate our method’s effectiveness. Code will be available on GitHub.
[357] Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Arya Fayyazi, Haleh Akrami
Main category: cs.CV
TL;DR: PoP is a framework for multimodal reasoning that uses executable graphs with reliability guarantees, conformal uncertainty quantification, and adaptive compute allocation to improve accuracy and efficiency.
Details
Motivation: Current multimodal reasoning systems suffer from error compounding, hallucinations, and lack of reliability guarantees. There's a need for frameworks that provide calibrated uncertainty and principled trade-offs between accuracy and computational cost.Method: PoP casts multimodal reasoning as an executable graph where each perception or logic node outputs a conformal set with calibrated uncertainty. A lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise.
Result: Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
Conclusion: PoP provides a principled approach to multimodal reasoning with explicit reliability guarantees, reducing error compounding and hallucinations while enabling verifiable evidence-based answers and efficient compute allocation.
Abstract: We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
[358] Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors
Xuanshuo Fu, Lei Kang, Javier Vazquez-Corral
Main category: cs.CV
TL;DR: A conditional diffusion framework for low-light image enhancement using structured control embeddings to guide enhancement with physical priors.
Details
Motivation: Low-light images suffer from low contrast, noise, and color distortion, which degrade visual quality and impair downstream vision tasks. Existing methods may not effectively incorporate physical priors for structured enhancement.Method: Proposes a conditional diffusion framework with Structured Control Embedding Module (SCEM) that decomposes low-light images into four informative components: illumination, illumination-invariant features, shadow priors, and color-invariant cues. These serve as control signals for a U-Net-based diffusion model trained with simplified noise-prediction loss.
Result: Achieves state-of-the-art performance in quantitative and perceptual metrics on LOLv2-real, LSRW, DICM, MEF, and LIME datasets, demonstrating strong generalization despite being trained only on LOLv1 dataset.
Conclusion: The SCEM-equipped diffusion method effectively enforces structured enhancement guided by physical priors, showing superior performance and generalization across multiple low-light image enhancement benchmarks.
Abstract: Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components including illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM equipped Diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. https://casted.github.io/scem/.
[359] Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance
Galen Pogoncheff, Alvin Wang, Jacob Granley, Michael Beyeler
Main category: cs.CV
TL;DR: A percept-aware surgical planning framework for cortical visual prostheses that optimizes electrode placement using differentiable forward models to maximize perceptual outcomes while ensuring safety constraints.
Details
Motivation: Current cortical visual prosthesis planning focuses on visual field coverage and anatomical heuristics but doesn't directly optimize predicted perceptual outcomes under safety constraints. With high-density neural interfaces, electrode placement in 3D cortex has become a critical surgical planning problem.Method: Formulates electrode placement as constrained optimization in anatomical space, treating electrode coordinates as learnable parameters optimized end-to-end using a differentiable forward model of prosthetic vision. Incorporates vascular avoidance and gray matter feasibility constraints while minimizing task-level perceptual error.
Result: Percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based strategies on simulated reading and natural image tasks using realistic folded cortical geometry. Vascular safety constraints eliminate margin violations while preserving perceptual performance, and enables co-optimization of multi-electrode thread configurations.
Conclusion: Differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces, providing a foundation for optimizing next-generation visual prostheses.
Abstract: Cortical visual prostheses aim to restore sight by electrically stimulating neurons in early visual cortex (V1). With the emergence of high-density and flexible neural interfaces, electrode placement within three-dimensional cortex has become a critical surgical planning problem. Existing strategies emphasize visual field coverage and anatomical heuristics but do not directly optimize predicted perceptual outcomes under safety constraints. We present a percept-aware framework for surgical planning of cortical visual prostheses that formulates electrode placement as a constrained optimization problem in anatomical space. Electrode coordinates are treated as learnable parameters and optimized end-to-end using a differentiable forward model of prosthetic vision. The objective minimizes task-level perceptual error while incorporating vascular avoidance and gray matter feasibility constraints. Evaluated on simulated reading and natural image tasks using realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based placement strategies. Importantly, vascular safety constraints eliminate margin violations while preserving perceptual performance. The framework further enables co-optimization of multi-electrode thread configurations under fixed insertion budgets. These results demonstrate how differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces and provide a foundation for optimizing next-generation visual prostheses.
[360] Unsupervised Semantic Segmentation in Synchrotron Computed Tomography with Self-Correcting Pseudo Labels
Austin Yunker, Peter Kenesei, Hemant Sharma, Jun-Sang Park, Antonino Miceli, Rajkumar Kettimuthu
Main category: cs.CV
TL;DR: A novel framework for automatic segmentation of large synchrotron CT datasets using pseudo-labeling and self-correction to eliminate manual annotation needs.
Details
Motivation: Synchrotron CT produces large datasets requiring segmentation, but manual annotation is impractical. Deep learning needs labeled data, creating a bottleneck in analysis workflows.Method: Generate pseudo labels via clustering on voxel values, train segmentation model on these labels, then use Unbiased Teacher approach to self-correct labels for accurate final segmentations.
Result: Improves pixel-wise accuracy by 13.31% and mIoU by 15.94% over baseline pseudo labels on magnesium crystal SR-CT sample. Framework works well on additional samples.
Conclusion: The framework enables automatic segmentation of large SR-CT datasets without manual annotation, addressing a major bottleneck in CT analysis workflows.
Abstract: X-ray computed tomography (CT) is a widely used imaging technique that provides detailed examinations into the internal structure of an object with synchrotron CT (SR-CT) enabling improved data quality by using higher energy, monochromatic X-rays. While SR-CT allows for improved resolution, time-resolved experimentation, and reduced imaging artifacts, it also produces significantly larger datasets than conventional CT. Accurate and efficient evaluation of these datasets is a critical component of these workflows; yet is often done manually representing a major bottleneck in the analysis phase. While deep learning has emerged as a powerful tool capable of providing a wide range of purely data-driven solutions, it requires a substantial amount of labeled data for training and manual annotation of SR-CT datasets is impractical in practice. In this paper, we introduce a novel framework that enables automatic segmentation of large, high-resolution SR-CT datasets by eliminating the need to hand label images for deep learning training. First, we generate pseudo labels by clustering on the voxel values identifying regions in the volume with similar attenuation coefficients producing an initial semantic map. Afterwards, we train a segmentation model on the pseudo labels before utilizing the Unbiased Teacher approach to self-correct them ensuring accurate final segmentations. We find our approach improves pixel-wise accuracy and mIoU by 13.31% and 15.94%, respectively, over the baseline pseudo labels when using a magnesium crystal SR-CT sample. Additionally, we extensively evaluate the different components of our workflow including segmentation model, loss function, pseudo labeling strategy, and input type. Finally, we evaluate our approach on to two additional samples highlighting our frameworks ability to produce segmentations that are considerably better than the original pseudo labels.
[361] DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography
Yujia Wu, Shuoqi Chen, Shiru Wang, Yucheng Tang, Petr Bruza, Geoffrey P. Luke
Main category: cs.CV
TL;DR: DiffSOS: A conditional diffusion model for ultrasound computed tomography that maps acoustic waveforms to speed-of-sound maps with uncertainty estimation and near real-time reconstruction.
Details
Motivation: Existing methods for speed-of-sound reconstruction in ultrasound computed tomography have limitations - traditional Full Waveform Inversion is computationally intensive, while deep learning approaches produce oversmoothed results lacking fine details needed for clinical utility.Method: Proposes DiffSOS, a conditional diffusion model with specialized acoustic ControlNet to ground denoising in physical wave measurements. Uses hybrid loss function integrating noise prediction, spatial reconstruction, and noise frequency content. Employs stochastic DDIM sampling for accelerated inference (10 steps).
Result: Significantly outperforms state-of-the-art networks on OpenPros USCT benchmark, achieving average Multi-scale Structural Similarity of 0.957. Provides high-fidelity SoS maps with pixel-wise uncertainty estimation and near real-time reconstruction.
Conclusion: DiffSOS enables high-fidelity speed-of-sound reconstruction with principled confidence measures, facilitating safer and faster clinical interpretation in ultrasound computed tomography.
Abstract: Accurate Speed-of-Sound (SoS) reconstruction from acoustic waveforms is a cornerstone of ultrasound computed tomography (USCT), enabling quantitative velocity mapping that reveals subtle anatomical details and pathological variations often invisible in conventional imaging. However, practical utility is hindered by the limitations of existing algorithms; traditional Full Waveform Inversion (FWI) is computationally intensive, while current deep learning approaches tend to produce oversmoothed results lacking fine details. We propose DiffSOS, a conditional diffusion model that directly maps acoustic waveforms to SoS maps. Our framework employs a specialized acoustic ControlNet to strictly ground the denoising process in physical wave measurements. To ensure structural consistency, we optimize a hybrid loss function that integrates noise prediction, spatial reconstruction, and noise frequency content. To accelerate inference, we employ stochastic Denoising Diffusion Implicit Model (DDIM) sampling, achieving near real-time reconstruction with only 10 steps. Crucially, we exploit the stochastic generative nature of our framework to estimate pixel-wise uncertainty, providing a measure of reliability that is often absent in deterministic approaches. Evaluated on the OpenPros USCT benchmark, DiffSOS significantly outperforms state-of-the-art networks, achieving an average Multi-scale Structural Similarity of 0.957. Our approach provides high-fidelity SoS maps with a principled measure of confidence, facilitating safer and faster clinical interpretation.
[362] SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu, Wenjun Wan, Bingbing Liu, AiXue Ye, Hongbo Zhang, Feng Wen
Main category: cs.CV
TL;DR: SSR framework integrates 2D and 3D representations for structured scene reasoning in MLLMs, achieving SOTA spatial intelligence with efficient alignment and scene graph generation.
Details
Motivation: MLLMs lack spatial sense for geometric reasoning, suffer from high modality-alignment costs, and have deficiency in fine-grained structural modeling precision.Method: Lightweight alignment anchors 3D geometric features to pre-aligned 2D visual semantics via cross-modal addition and token interleaving. Novel scene graph generation pipeline represents global layouts as chain of independent local triplets with relative coordinates, complemented by incremental generation algorithm.
Result: At 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, scoring 73.9 on VSI-Bench, outperforming much larger models.
Conclusion: Efficient feature alignment and structured scene reasoning are cornerstones of authentic spatial intelligence in multimodal models.
Abstract: While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the “spatial sense” essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiency in fine-grained structural modeling precision.We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model’s pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct “language-model-friendly” structural scaffolds for complex environments. Furthermore, we extend these capabilities to global-scale 3D global grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
[363] PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan
Main category: cs.CV
TL;DR: PointAlign: A feature-level alignment regularization method for 3D Vision-Language Models that preserves geometric information in intermediate representations through consistency loss between point cloud and visual tokens.
Details
Motivation: 3D Vision-Language Models suffer from limited paired 3D-text data, inefficient use of 3D data due to reliance on next-token prediction loss only, and degradation of geometric information in intermediate representations.Method: Proposes PointAlign with explicit supervision of intermediate point cloud tokens to preserve 3D geometric-semantic information via consistency loss between point cloud tokens and visual input tokens, using lightweight alignment projector and LoRA adapters.
Result: Achieves 2.08pp average improvement on classification tasks, 7.50pp gain on open-vocabulary Objaverse classification, and 4.88pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct.
Conclusion: PointAlign effectively prevents geometric degradation in 3D VLMs with minimal computational overhead, demonstrating significant performance improvements on 3D classification and captioning tasks.
Abstract: The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose {\mname}, a novel feature-level alignment regularization method. {\mname} explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, {\mname} achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves \textbf{2.08} pp improvement on average for classification tasks, with a substantial \textbf{7.50} pp gain on the challenging open-vocabulary Objaverse classification task and \textbf{4.88} pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of {\mname}. Code is publicly available at \href{https://github.com/yharoldsu0627/PointAlign}{https://github.com/yharoldsu0627/PointAlign}.
[364] DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
Changpu Li, Shuang Wu, Songlin Tang, Guangming Lu, Jun Yu, Wenjie Pei
Main category: cs.CV
TL;DR: DiffTrans: A differentiable rendering framework for reconstructing transparent objects from multi-view images using recursive differentiable ray tracing with CUDA acceleration.
Details
Motivation: Existing methods for transparent object reconstruction are limited to specific scenarios (uniform topology, ideal transparency, or simple materials), restricting practical applicability in real-world settings with diverse topology and complex textures.Method: 1) Use FlexiCubes with dilation and smoothness regularization for initial geometry reconstruction from multi-view silhouettes; 2) Employ environment light radiance field for scene environment recovery; 3) Design recursive differentiable ray tracer to simultaneously optimize geometry, index of refraction, and absorption rate in end-to-end manner with CUDA acceleration.
Result: Superior reconstruction performance on multiple benchmarks, especially in intricate scenes with transparent objects having diverse topology and complex texture. Significantly reduced computational cost through CUDA implementation.
Conclusion: DiffTrans enables accurate reconstruction of transparent objects in complex real-world scenes through efficient decomposition and reconstruction of geometry and materials using differentiable rendering.
Abstract: Reconstructing transparent objects from a set of multi-view images is a challenging task due to the complicated nature and indeterminate behavior of light propagation. Typical methods are primarily tailored to specific scenarios, such as objects following a uniform topology, exhibiting ideal transparency and surface specular reflections, or with only surface materials, which substantially constrains their practical applicability in real-world settings. In this work, we propose a differentiable rendering framework for transparent objects, dubbed DiffTrans, which allows for efficient decomposition and reconstruction of the geometry and materials of transparent objects, thereby reconstructing transparent objects accurately in intricate scenes with diverse topology and complex texture. Specifically, we first utilize FlexiCubes with dilation and smoothness regularization as the iso-surface representation to reconstruct an initial geometry efficiently from the multi-view object silhouette. Meanwhile, we employ the environment light radiance field to recover the environment of the scene. Then we devise a recursive differentiable ray tracer to further optimize the geometry, index of refraction and absorption rate simultaneously in a unified and end-to-end manner, leading to high-quality reconstruction of transparent objects in intricate scenes. A prominent advantage of the designed ray tracer is that it can be implemented in CUDA, enabling a significantly reduced computational cost. Extensive experiments on multiple benchmarks demonstrate the superior reconstruction performance of our DiffTrans compared with other methods, especially in intricate scenes involving transparent objects with diverse topology and complex texture. The code is available at https://github.com/lcp29/DiffTrans.
[365] Station2Radar: query conditioned gaussian splatting for precipitation field
Doyi Kim, Minseok Seo, Changick Kim
Main category: cs.CV
TL;DR: QCGS fuses weather station data with satellite imagery using query-conditioned Gaussian splatting for efficient precipitation field generation, achieving 50% RMSE improvement over conventional methods.
Details
Motivation: Current precipitation forecasting faces limitations: weather radar has limited coverage, weather stations provide sparse point measurements, and satellites lack direct rainfall retrieval. There's a need to fuse heterogeneous data sources for better precipitation field generation.Method: Proposes Query-Conditioned Gaussian Splatting (QCGS) framework that combines a radar point proposal network to identify rainfall-support locations with an implicit neural representation network that predicts Gaussian parameters for each point. Unlike conventional 2D Gaussian splatting, QCGS selectively renders only queried precipitation regions.
Result: QCGS demonstrates over 50% improvement in RMSE compared to conventional gridded precipitation products, maintains high performance across multiple spatiotemporal scales, and enables efficient, resolution-flexible precipitation field generation in real time.
Conclusion: QCGS effectively fuses weather station observations with satellite imagery for precipitation field generation, overcoming limitations of individual data sources while achieving significant accuracy improvements and computational efficiency.
Abstract: Precipitation forecasting relies on heterogeneous data. Weather radar is accurate, but coverage is geographically limited and costly to maintain. Weather stations provide accurate but sparse point measurements, while satellites offer dense, high-resolution coverage without direct rainfall retrieval. To overcome these limitations, we propose Query-Conditioned Gaussian Splatting (QCGS), the first framework to fuse automatic weather station (AWS) observations with satellite imagery for generating precipitation fields. Unlike conventional 2D Gaussian splatting, which renders the entire image plane, QCGS selectively renders only queried precipitation regions, avoiding unnecessary computation in non-precipitating areas while preserving sharp precipitation structures. The framework combines a radar point proposal network that identifies rainfall-support locations with an implicit neural representation (INR) network that predicts Gaussian parameters for each point. QCGS enables efficient, resolution-flexible precipitation field generation in real time. Through extensive evaluation with benchmark precipitation products, QCGS demonstrates over 50% improvement in RMSE compared to conventional gridded precipitation products, and consistently maintains high performance across multiple spatiotemporal scales.
[366] An Interpretable Local Editing Model for Counterfactual Medical Image Generation
Hyungi Min, Taeseung You, Hangyeul Lee, Yeongjae Cho, Sungzoon Cho
Main category: cs.CV
TL;DR: InstructX2X: An interpretable local editing model for counterfactual medical image generation with region-specific editing to prevent unintended modifications and provide visual explanations.
Details
Motivation: Existing counterfactual medical image generation approaches have two key limitations: 1) they fail to prevent unintended modifications (collateral changes in demographic attributes when only disease features should be affected), and 2) they lack interpretability in their editing process, limiting real-world medical utility.Method: Proposes InstructX2X with Region-Specific Editing that restricts modifications to specific regions to prevent unintended changes, and provides a Guidance Map for inherently interpretable visual explanations of the editing process. Also introduces MIMIC-EDIT-INSTRUCTION dataset from expert-verified medical VQA pairs.
Result: Achieves state-of-the-art performance across all major evaluation metrics. Successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.
Conclusion: InstructX2X addresses fundamental limitations in counterfactual medical image generation by providing region-specific editing to prevent unintended modifications and offering interpretable visual explanations through Guidance Maps, enhancing utility for real-world medical applications.
Abstract: Counterfactual medical image generation have emerged as a critical tool for enhancing AI-driven systems in medical domain by answering “what-if” questions. However, existing approaches face two fundamental limitations: First, they fail to prevent unintended modifications, resulting collateral changes in demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. Through extensive experiments, InstructX2X achieve state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.
[367] Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Hulingxiao He, Zhi Tan, Yuxin Peng
Main category: cs.CV
TL;DR: TARA injects taxonomic knowledge into Large Multimodal Models by aligning visual features with biology foundation models to improve hierarchical visual recognition of both known and novel biological categories.
Details
Motivation: Current Large Multimodal Models (LMMs) excel at fine-grained visual recognition for known categories but struggle with hierarchical visual recognition that requires consistent label paths from coarse to fine categories, especially for novel categories with few/no training images.Method: Proposes Taxonomy-Aware Representation Alignment (TARA) that: 1) Aligns intermediate visual representations with biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning, 2) Aligns first answer token representations with ground-truth labels to bridge visual features and categories of varying granularity.
Result: TARA consistently enhances LMMs’ hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies.
Conclusion: TARA effectively injects taxonomic knowledge into LMMs, improving their hierarchical visual recognition capabilities for biological categories, including novel ones with limited training data.
Abstract: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels, identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs’ hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at https://github.com/PKU-ICST-MIPL/TARA_CVPR2026.
[368] TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis
Hui Wan, Libin Lan
Main category: cs.CV
TL;DR: TAP-SLF: Task-Aware Prompting and Selective Layer Fine-Tuning framework for efficient multi-task medical ultrasound image analysis using Vision Foundation Models.
Details
Motivation: Multi-task medical image analysis faces challenges with model generalizability and shared feature optimization. Full fine-tuning of Vision Foundation Models on limited medical data causes overfitting and high computational costs, while existing parameter-efficient approaches are task-agnostic and ignore layer sensitivity differences.Method: Proposes TAP-SLF framework with task-aware soft prompts to encode task-specific priors into input tokens, and applies LoRA (Low-Rank Adaptation) to selected top layers of the encoder. Updates only a small fraction of VFM parameters while keeping pre-trained backbone frozen.
Result: Achieved fifth place in FMC_UIA 2026 Challenge test set. Evaluations on officially released training dataset with 8:2 train-test split demonstrate effectiveness of task-aware prompting and selective layer tuning for efficient VFM adaptation.
Conclusion: Task-aware prompting combined with selective high-layer fine-tuning enables efficient adaptation of Vision Foundation Models to diverse medical tasks within a shared backbone, addressing overfitting and computational cost issues.
Abstract: Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected specific top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF wins fifth place, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
[369] Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models
April Fu
Main category: cs.CV
TL;DR: ICLA introduces an internal self-correction mechanism using layer attention to reduce hallucinations in Large Vision-Language Models by enabling hidden state refinement during generation.
Details
Motivation: Hallucination remains a persistent challenge in LVLMs where generated text is not grounded in visual input. As models become stronger, traditional hallucination patterns become less consistent, making existing mitigation techniques less effective.Method: Proposes Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without external correction signals.
Result: With only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B respectively, ICLA consistently improves visual grounding across multiple hallucination benchmarks.
Conclusion: ICLA demonstrates effectiveness for more advanced LVLMs by providing an efficient internal self-correction mechanism that reduces hallucinations through layer attention-based refinement of hidden states.
Abstract: Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. With introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, \ours consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
[370] Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling
Xueyang Li, Yunzhong Lou, Yu Song, Xiangdong Zhou
Main category: cs.CV
TL;DR: Mamba-CAD: A self-supervised generative modeling approach for complex industrial CAD models using Mamba architecture to handle longer parametric CAD sequences, with encoder-decoder pre-training and GAN-based generation.
Details
Motivation: Industrial CAD models are fine-grained and complex, requiring longer parametric CAD sequences that existing sequence models struggle to handle effectively.Method: Uses Mamba architecture encoder-decoder with CAD reconstruction pre-training to learn latent representations, then employs GAN to generate fake representations that are decoded back to parametric CAD sequences.
Result: Demonstrates effectiveness through comprehensive experiments with new dataset of 77,078 CAD models, showing improved generation length of valid parametric CAD sequences.
Conclusion: Mamba-CAD successfully addresses the challenge of modeling longer parametric CAD sequences for complex industrial CAD models through self-supervised learning and Mamba architecture.
Abstract: Computer-Aided Design (CAD) generative modeling has a strong and long-term application in the industry. Recently, the parametric CAD sequence as the design logic of an object has been widely mined by sequence models. However, the industrial CAD models, especially in component objects, are fine-grained and complex, requiring a longer parametric CAD sequence to define. To address the problem, we introduce Mamba-CAD, a self-supervised generative modeling for complex CAD models in the industry, which can model on a longer parametric CAD sequence. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models; and then we utilize the learned representation to guide a generative adversarial network to produce the fake representation of CAD models, which would be finally recovered into parametric CAD sequences via the decoder of MambaCAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments are conducted to demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset can be achieved from https://github.com/Sunny-Hack/Code-for-Mamba-CAD-AAAI-2025-.
[371] SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui, Anyi Rao
Main category: cs.CV
TL;DR: SesaHand improves controllable hand image generation using semantic and structural alignment for better 3D hand reconstruction, outperforming prior methods.
Details
Motivation: Existing synthetic hand image generation methods lack diversity in textures/environments and often exclude arms/interacting objects. Game engine approaches have limitations, while generative models suffer from misalignment issues.Method: 1) Semantic alignment: Chain-of-Thought inference extracts human behavior semantics from VLM-generated captions to suppress irrelevant details. 2) Structural alignment: Hierarchical structural fusion integrates multi-granularity structural info with hand structure attention enhancement for better hand-body alignment.
Result: Method outperforms prior work in generation performance and improves 3D hand reconstruction using generated hand images.
Conclusion: SesaHand effectively enhances controllable hand image generation through semantic and structural alignment, benefiting 3D hand reconstruction tasks.
Abstract: Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model’s attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
[372] Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution
Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, Jian Zhang
Main category: cs.CV
TL;DR: Improved adversarial diffusion compression method for real-world video super-resolution that distills a large 3D DiT teacher into a pruned 2D SD backbone with lightweight temporal convolutions, achieving 95% parameter reduction and 8× speedup.
Details
Motivation: Existing diffusion models for Real-VSR are slow due to multi-step sampling, while one-step networks remain heavy with billions of parameters. Recent adversarial diffusion compression (ADC) shows promise but fails to balance spatial details and temporal consistency in video applications.Method: Distills a large 3D DiT teacher (DOVE) into a pruned 2D Stable Diffusion-based backbone augmented with lightweight 1D temporal convolutions. Introduces dual-head adversarial distillation with discriminators in both pixel and feature domains to disentangle detail and consistency optimization.
Result: The compressed AdcVSR model reduces complexity by 95% in parameters and achieves 8× acceleration over its DiT teacher DOVE while maintaining competitive video quality and efficiency.
Conclusion: The proposed improved ADC method effectively balances efficiency and quality for real-world video super-resolution by combining architectural innovations with specialized adversarial training.
Abstract: While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this through condensing generation into one single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher DOVE equipped with 3D spatio-temporal attentions, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone, augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces complexity by 95% in parameters and achieves an 8$\times$ acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.
[373] Explainable Continuous-Time Mask Refinement with Local Self-Similarity Priors for Medical Image Segmentation
Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee
Main category: cs.CV
TL;DR: LSS-LTCNet is an explainable framework for foot ulcer segmentation that combines local self-similarity texture descriptors with liquid time-constant neural dynamics for boundary refinement.
Details
Motivation: Accurate foot ulcer segmentation is crucial for automated wound monitoring, but boundary delineation remains challenging due to tissue heterogeneity and poor contrast with surrounding skin. Standard intensity-based networks have limitations in handling these issues.Method: The framework uses Local Self-Similarity (LSS) mechanism to extract dense, illumination-invariant texture descriptors that disentangle necrotic tissue from background artifacts. It also employs a Liquid Time-Constant (LTC) refinement module that treats boundary evolution as an ODE-governed dynamic system, iteratively refining masks over continuous time-steps.
Result: On the MICCAI FUSeg dataset, LSS-LTCNet achieves state-of-the-art boundary alignment with a peak Dice score of 86.96% and exceptional 95th percentile Hausdorff Distance (HD95) of 8.91 pixels. With only 25.70M parameters, it outperforms heavier U-Net and transformer baselines in efficiency.
Conclusion: LSS-LTCNet offers a robust and transparent solution for computer-aided diagnosis in mobile healthcare settings by providing inherent visual audit trails alongside high-fidelity predictions, addressing both accuracy and explainability needs.
Abstract: Accurate semantic segmentation of foot ulcers is essential for automated wound monitoring, yet boundary delineation remains challenging due to tissue heterogeneity and poor contrast with surrounding skin. To overcome the limitations of standard intensity-based networks, we present LSS-LTCNet:an ante-hoc explainable framework synergizing deterministic structural priors with continuous-time neural dynamics. Our architecture departs from traditional black-box models by employing a Local Self-Similarity (LSS) mechanism that extracts dense, illumination-invariant texture descriptors to explicitly disentangle necrotic tissue from background artifacts. To enforce topological precision, we introduce a Liquid Time-Constant (LTC) refinement module that treats boundary evolution as an ODEgoverned dynamic system, iteratively refining masks over continuous time-steps. Comprehensive evaluation on the MICCAI FUSeg dataset demonstrates that LSS-LTCNet achieves state-of-the-art boundary alignment, securing a peak Dice score of 86.96% and an exceptional 95th percentile Hausdorff Distance (HD95) of 8.91 pixels. Requiring merely 25.70M parameters, the model significantly outperforms heavier U-Net and transformer baselines in efficiency. By providing inherent visual audit trails alongside high-fidelity predictions, LSS-LTCNet offers a robust and transparent solution for computer-aided diagnosis in mobile healthcare (mHealth) settings.
[374] ReMoT: Reinforcement Learning with Motion Contrast Triplets
Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Main category: cs.CV
TL;DR: ReMoT introduces a unified training paradigm to improve VLMs’ spatio-temporal consistency through automatic dataset generation and group relative policy optimization, achieving significant performance gains on motion reasoning tasks.
Details
Motivation: Current Vision-Language Models (VLMs) suffer from fundamental shortcomings in spatio-temporal consistency, which is critical for applications like navigation, robotics, and autonomous driving. Existing methods lack systematic approaches to address these motion understanding failures.Method: ReMoT integrates two core components: 1) A rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset (16.5K triplets) from video meta-annotations, avoiding costly manual/model-based generation. 2) Group Relative Policy Optimization, which optimizes learning of contrastive reasoning more effectively than standard Supervised Fine-Tuning.
Result: The approach achieves state-of-the-art performance on the new fine-grained motion contrast benchmark and multiple standard VLM benchmarks, with a remarkable 25.1% performance improvement on spatio-temporal reasoning tasks.
Conclusion: ReMoT provides a systematic solution to improve VLMs’ spatio-temporal consistency through automated dataset generation and optimized training, enabling better motion understanding for real-world applications.
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency – a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM’s discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
[375] OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation
Zhaolin Yu, Litao Yang, Ben Babicka, Ming Hu, Jing Hao, Anthony Huang, James Huang, Yueming Jin, Jiasong Wu, Zongyuan Ge
Main category: cs.CV
TL;DR: OPGAgent is a multi-tool agentic system for panoramic dental X-ray analysis that coordinates specialized perception modules through hierarchical evidence gathering, a toolbox of specialized tools, and consensus mechanisms to outperform current dental vision-language models.
Details
Motivation: Current Vision Language Models (VLMs) for dental OPG analysis underperform task-specific models on individual tasks, despite offering multi-task capabilities. Agentic systems that orchestrate specialized tools could provide both versatility and accuracy, but this approach hasn't been explored in dental imaging.Method: OPGAgent has three main components: (1) Hierarchical Evidence Gathering module that decomposes analysis into global, quadrant, and tooth-level phases with dynamic tool invocation; (2) Specialized Toolbox with spatial, detection, utility, and expert zoos; (3) Consensus Subagent that resolves conflicts through anatomical constraints. Also introduces OPG-Bench, a structured-report protocol based on (Location, Field, Value) triples.
Result: OPGAgent outperforms current dental VLMs and medical agent frameworks on both the proposed OPG-Bench benchmark and the public MMOral-OPG benchmark, across both structured-report and VQA evaluation metrics.
Conclusion: The agentic approach with specialized tool coordination provides superior performance for dental OPG analysis compared to end-to-end VLMs, offering both versatility and accuracy while enabling comprehensive auditability of findings.
Abstract: Orthopantomograms (OPGs) are the standard panoramic radiograph in dentistry, used for full-arch screening across multiple diagnostic tasks. While Vision Language Models (VLMs) now allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized tools offer a path to both versatility and accuracy, this approach remains unexplored in the field of dental imaging. To address this gap, we propose OPGAgent, a multi-tool agentic system for auditable OPG interpretation. OPGAgent coordinates specialized perception modules with a consensus mechanism through three components: (1) a Hierarchical Evidence Gathering module that decomposes OPG analysis into global, quadrant, and tooth-level phases with dynamically invoking tools, (2) a Specialized Toolbox encapsulating spatial, detection, utility, and expert zoos, and (3) a Consensus Subagent that resolves conflicts through anatomical constraints. We further propose OPG-Bench, a structured-report protocol based on (Location, Field, Value) triples derived from real clinical reports, which enables a comprehensive review of findings and hallucinations, extending beyond the limitations of VQA indicators. On our OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks across both structured-report and VQA evaluation. Code will be released upon acceptance.
[376] DreamWorld: Unified World Modeling in Video Generation
Boming Tan, Xiangdong Zhang, Ning Liao, Yuqing Zhang, Shaofeng Zhang, Xue Yang, Qi Fan, Yanyong Zhang
Main category: cs.CV
TL;DR: DreamWorld: A unified framework for video generation that integrates multiple world knowledge dimensions (physical commonsense, 3D, temporal consistency) through joint world modeling, addressing limitations of single-knowledge approaches.
Details
Motivation: Current video generation models lack coherent world understanding, being limited to surface-level plausibility. Existing approaches typically incorporate only single forms of world knowledge or use rigid alignment strategies, which are insufficient for comprehensive world modeling that requires jointly modeling multiple heterogeneous dimensions.Method: Proposes DreamWorld with Joint World Modeling Paradigm that jointly predicts video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. Introduces Consistent Constraint Annealing (CCA) to progressively regulate world-level constraints during training, and Multi-Source Inner-Guidance to enforce learned world priors at inference.
Result: Extensive evaluations show DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench benchmark.
Conclusion: DreamWorld provides a unified framework for integrating complementary world knowledge into video generation, addressing the limitations of single-knowledge approaches through joint modeling of multiple world dimensions.
Abstract: Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning the single world knowledge is insufficient to constitute a world model that requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce \textbf{DreamWorld}, a unified framework that integrates complementary world knowledge into video generators via a \textbf{Joint World Modeling Paradigm}, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose \textit{Consistent Constraint Annealing (CCA)} to progressively regulate world-level constraints during training, and \textit{Multi-Source Inner-Guidance} to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at \href{https://github.com/ABU121111/DreamWorld}{\textcolor{mypink}{\textbf{Github}}}.
[377] High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System
Pengju Sun, Banglei Guan, Jing Tao, Zhenbao Yu, Xuanyu Bai, Yang Shang, Qifeng Yu
Main category: cs.CV
TL;DR: Hardware-algorithm co-designed HDR imaging system combining SVE camera with event sensor, featuring cross-modal alignment and fusion for improved highlight recovery and edge fidelity in extreme illumination.
Details
Motivation: Conventional cameras struggle with HDR imaging under extreme illumination due to overexposure. Event cameras offer high temporal resolution and dynamic range, while SVE sensors provide radiometric diversity. The paper aims to combine these complementary modalities for superior HDR imaging.Method: 1) Hardware co-design: asymmetric dual-modality system integrating SVE micro-attenuation camera with event sensor. 2) Two-stage cross-modal alignment: feature-guided coarse homography estimation + multi-scale refinement with spatial pooling and frequency-domain filtering. 3) Cross-modal HDR reconstruction network: convolutional fusion, mutual-information regularization, and learnable fusion loss balancing intensity cues and event-derived structural constraints.
Result: Comprehensive experiments on synthetic benchmarks and real captures show consistent improvements in highlight recovery, edge fidelity, and robustness compared to frame-only or event-only HDR pipelines. The system effectively handles highly dynamic and radiometrically challenging environments.
Conclusion: Joint optimization of optical design, cross-modal alignment, and computational fusion provides an effective foundation for reliable HDR perception in challenging illumination conditions. The hardware-algorithm co-design approach enables superior performance over single-modality solutions.
Abstract: High dynamic range (HDR) imaging under extreme illumination remains challenging for conventional cameras due to overexposure. Event cameras provide microsecond temporal resolution and high dynamic range, while spatially varying exposure (SVE) sensors offer single-shot radiometric diversity.We present a hardware–algorithm co-designed HDR imaging system that tightly integrates an SVE micro-attenuation camera with an event sensor in an asymmetric dual-modality configuration. To handle non-coaxial geometry and heterogeneous optics, we develop a two-stage cross-modal alignment framework that combines feature-guided coarse homography estimation with a multi-scale refinement module based on spatial pooling and frequency-domain filtering. On top of aligned representations, we develop a cross-modal HDR reconstruction network with convolutional fusion, mutual-information regularization, and a learnable fusion loss that adaptively balances intensity cues and event-derived structural constraints. Comprehensive experiments on both synthetic benchmarks and real captures demonstrate that the proposed system consistently improves highlight recovery, edge fidelity, and robustness compared with frame-only or event-only HDR pipelines. The results indicate that jointly optimizing optical design, cross-modal alignment, and computational fusion provides an effective foundation for reliable HDR perception in highly dynamic and radiometrically challenging environments.
[378] U-VLM: Hierarchical Vision Language Modeling for Report Generation
Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang
Main category: cs.CV
TL;DR: U-VLM is a hierarchical vision-language model for 3D medical report generation that uses progressive training from segmentation to classification to report generation, with multi-layer visual injection from U-Net encoder to language model layers.
Details
Motivation: Existing vision-language models for radiology report generation don't leverage segmentation-pretrained encoders and only inject visual features at the input layer of language models, losing multi-scale information crucial for 3D medical imaging.Method: Progressive training from segmentation to classification to report generation, plus multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each stage can use different datasets without unified annotations.
Result: State-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518) using only a 0.1B decoder trained from scratch, showing well-designed vision encoder pretraining outweighs benefits of 7B+ pre-trained language models.
Conclusion: Hierarchical vision-language modeling with progressive training and multi-layer visual injection enables effective 3D medical report generation, demonstrating that specialized vision encoder design is more important than large pre-trained language models for this domain.
Abstract: Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.
[379] TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications
Feibo Jiang, Siwei Tu, Li Dong, Xiaolong Li, Kezhi Wang, Cunhua Pan, Zhu Han, Jiangzhou Wang
Main category: cs.CV
TL;DR: TaiChi is a novel Visual-Language Model framework designed for token communications, featuring dual-visual tokenizers for multi-scale image processing, a Bilateral Attention Network for token fusion, and a KAN-based modality projector for precise cross-modal alignment.
Details
Motivation: Current VLMs have limitations in token communications due to: 1) limited token granularity, 2) overlong visual token sequences, and 3) inadequate cross-modal alignment. These constraints hinder their effectiveness in intelligent communication systems.Method: 1) Dual-visual tokenizer architecture processing both high- and low-resolution images; 2) Bilateral Attention Network (BAN) for intelligent fusion of multi-scale visual tokens; 3) Kolmogorov Arnold Network (KAN)-based modality projector with learnable activation functions for precise visual-to-text alignment; 4) Integration into multimodal multitask token communication system with joint VLM-channel coding.
Result: Experimental results validate TaiChi’s superior performance and demonstrate the feasibility and effectiveness of the TaiChi-driven token communication system.
Conclusion: TaiChi successfully addresses key limitations of VLMs in token communications through its novel architecture, achieving better visual understanding, compact token representation, and precise cross-modal alignment for intelligent communication systems.
Abstract: Visual-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to intelligently fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, thus minimizing information loss. Finally, TaiChi is integrated into a multimodal and multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system.
[380] RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu
Main category: cs.CV
TL;DR: RAISE is a training-free evolutionary framework for adaptive text-to-image generation that dynamically scales computational effort based on prompt complexity through requirement-driven refinement and verification.
Details
Motivation: Current text-to-image models struggle with faithful prompt-image alignment for complex prompts, while existing methods use fixed iteration budgets that don't adapt to prompt difficulty or require extensive fine-tuning that lacks transferability.Method: RAISE formulates image generation as requirement-driven adaptive scaling, evolving candidate images through refinement actions (prompt rewriting, noise resampling, instructional editing) and verifying each generation against structured requirement checklists to dynamically allocate computation where needed.
Result: Achieves state-of-the-art alignment (0.94 overall GenEval) while reducing generated samples by 30-40% and VLM calls by 80% compared to prior methods, demonstrating efficient and generalizable multi-round self-improvement.
Conclusion: RAISE provides an efficient, model-agnostic framework for adaptive text-to-image generation that aligns computational effort with semantic complexity without requiring training or fine-tuning.
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions-including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
[381] Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Qihang Fan, Yuang Ai, Huaibo Huang, Ran He
Main category: cs.CV
TL;DR: Random token grouping strategy for Vision Transformers achieves comparable or better performance than carefully designed grouping methods across multiple vision tasks and modalities.
Details
Motivation: To challenge the necessity of complex, carefully designed token grouping strategies in Vision Transformers and explore whether a simpler, unified approach can achieve similar or better performance.Method: Proposes random grouping strategy for vision tokens - a simple and fast approach that randomly groups tokens for self-attention calculations, analyzed through multiple baselines and downstream tasks.
Result: Random grouping outperforms most carefully designed grouping methods on vision tasks, shows pronounced advantages in downstream tasks like object detection, and works across multiple modalities including point clouds and vision-language models.
Conclusion: Vision tokens only need an extremely simple grouping strategy (random grouping) that meets four key conditions: positional information, head feature diversity, global receptive field, and fixed grouping pattern, making complex designs unnecessary.
Abstract: Since Transformers are introduced into vision architectures, their quadratic complexity has always been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens, performing self-attention calculations within each group, or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: \textbf{Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods?} Therefore, we propose the random grouping strategy, which involves a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping almost outperforms all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks. We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at https://github.com/qhfan/random.
[382] ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models
Riccardo de Lutio, Tobias Fischer, Yen-Yu Chang, Yuxuan Zhang, Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Katarina Tothova, Zan Gojcic, Haithem Turki
Main category: cs.CV
TL;DR: Two-stage pipeline using bidirectional generative model with opacity mixing and distilled causal auto-regressive model for novel view synthesis in under-observed 3D scenes
Details
Motivation: Per-scene optimization methods like 3D Gaussian Splatting have poor extrapolation to under-observed areas. Existing generative methods have scalability issues (limited view generation) and quality problems (inconsistent outputs, failure in unobserved regions).Method: Two-stage pipeline: 1) Train bidirectional generative model with novel opacity mixing strategy for consistency with observations while extrapolating novel content; 2) Distill into causal auto-regressive model that generates hundreds of frames in single pass for novel view synthesis or pseudo-supervision.
Result: Outperforms all existing baselines by wide margin, exceeding prior state-of-the-art by 1-3 dB PSNR. Generates plausible reconstructions where existing approaches fail completely.
Conclusion: Proposed method solves scalability and quality issues in generative novel view synthesis, enabling consistent content generation in under-observed areas while maintaining high-quality results.
Abstract: Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model’s ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform existing all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.
[383] COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
Yuchen Che, Jingtu Wu, Hao Zheng, Asako Kanezaki
Main category: cs.CV
TL;DR: COG: Confidence-aware Optimal Geometric Correspondence framework for unsupervised 6DoF pose estimation of novel objects from single reference view using optimal transport with confidence prediction and semantic priors.
Details
Motivation: Estimating 6DoF pose of novel objects from single reference view is challenging due to occlusions, viewpoint changes, and outliers. Existing methods rely on discrete one-to-one matching that is non-differentiable and collapses onto sparse keypoints, lacking robustness in cross-view correspondence estimation.Method: Proposes COG framework that formulates correspondence estimation as confidence-aware optimal transport problem. Predicts point-wise confidences and injects them as optimal transport marginals to suppress non-overlapping regions. Uses semantic priors from vision foundation models to regularize correspondences for stable pose estimation. Enables unsupervised learning by integrating confidence into correspondence finding and pose estimation pipeline.
Result: Unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms existing supervised methods. The framework demonstrates robust pose estimation despite occlusions and viewpoint changes.
Conclusion: COG provides an effective unsupervised framework for 6DoF pose estimation by formulating correspondence as confidence-aware optimal transport, leveraging semantic priors, and achieving state-of-the-art performance in both unsupervised and supervised settings.
Abstract: Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, view-point changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse key-points. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them.
[384] M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
Dawei Yan, Haokui Zhang, Guangda Huzhang, Yang Li, Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Ying Li, Wei Dong, Chunhua Shen
Main category: cs.CV
TL;DR: M² is a training-free, memory-augmented framework for multimodal LLM agents that improves long-horizon web navigation through dual-tier memory compression and retrieval augmentation.
Details
Motivation: Current MLLM-based agents struggle with long-horizon web navigation tasks due to high computational costs, insufficient reasoning capabilities, and reliance on extensive data collection and training.Method: Proposes M² framework with dual-tier memory: 1) Internal Memory using Dynamic Trajectory Summarization to compress interaction history, and 2) External Memory using Insight Retrieval Augmentation to retrieve actionable guidelines from an offline insight bank.
Result: Achieves up to 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, with proprietary models like Claude achieving up to 12.5% accuracy gains and significantly lower computational overhead.
Conclusion: M² demonstrates that training-free, memory-augmented approaches can effectively enhance MLLM agents’ performance on long-horizon web navigation tasks while reducing computational costs.
Abstract: Multimodal Large Language Models (MLLMs) based agents have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains up to 12.5% alongside significantly lower computational overhead.
[385] Hierarchical Classification for Improved Histopathology Image Analysis
Keunho Byeon, Jinsol Song, Seong Min Hong, Yosep Chong, Jin Tae Kwak
Main category: cs.CV
TL;DR: HiClass: A hierarchical classification framework for whole-slide pathology images that improves both coarse-grained and fine-grained classification by capturing hierarchical relationships between diagnostic labels.
Details
Motivation: Existing deep learning methods for whole-slide image analysis use flat classification, ignoring hierarchical relationships among diagnostic labels. Pathology diagnoses often have hierarchical structures (e.g., broad categories with specific subtypes), and capturing these relationships could improve classification accuracy.Method: Extends multiple instance learning with bidirectional feature integration between coarse-grained and fine-grained representations. Introduces specialized loss functions: hierarchical consistency loss, intra- and inter-class distance loss, and group-wise cross-entropy loss to optimize hierarchical learning.
Result: Tested on gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes. Achieved superior classification performance for both coarse-grained and fine-grained classification compared to existing methods.
Conclusion: HiClass effectively improves whole-slide image classification by capturing hierarchical histopathological characteristics, demonstrating the value of modeling label hierarchies in pathology image analysis.
Abstract: Whole-slide image analysis is essential for diagnostic tasks in pathology, yet existing deep learning methods primarily rely on flat classification, ignoring hierarchical relationships among class labels. In this study, we propose HiClass, a hierarchical classification framework for improved histopathology image analysis, that enhances both coarse-grained and fine-grained WSI classification. Built based upon a multiple instance learning approach, HiClass extends it by introducing bidirectional feature integration that facilitates information exchange between coarse-grained and fine-grained feature representations, effectively learning hierarchical features. Moreover, we introduce tailored loss functions, including hierarchical consistency loss, intra- and inter-class distance loss, and group-wise cross-entropy loss, to further optimize hierarchical learning. We assess the performance of HiClass on a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, achieving superior classification performance for both coarse-grained classification and fine-grained classification. These results demonstrate the effectiveness of HiClass in improving WSI classification by capturing coarse-grained and fine-grained histopathological characteristics.
[386] What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Yingqi Fan, Junlong Tong, Anhao Zhao, Xiaoyu Shen
Main category: cs.CV
TL;DR: EmbedLens reveals visual tokens in MLLMs partition into sink/dead/alive categories, with only ~60% alive tokens carrying image-specific meaning; these already encode rich visual cues before LLM processing, making most internal visual computation redundant.
Details
Motivation: To understand the internal structuring and processing of visual semantics in multimodal large language models (MLLMs), as current understanding of how visual tokens are processed remains poor despite their projection into language model embedding spaces.Method: Introduces EmbedLens, a two-fold analytical framework with a novel probing tool for fine-grained analysis of visual token processing. Uses targeted patch-compression benchmark to analyze token categories and their semantic content.
Result: Visual tokens consistently partition into sink, dead, and alive categories; only ~60% alive tokens carry image-specific meaning. Alive tokens already encode rich fine-grained visual cues (objects, colors, OCR) before LLM processing. Internal visual computations are redundant for most standard tasks, and for vision-centric tasks, alive tokens naturally align with intermediate LLM layers rather than initial embedding space.
Conclusion: Provides unified mechanistic view of visual token processing, enabling more efficient MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection strategies.
Abstract: Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, $\textbf{EmbedLens}$, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising $\approx60%$ of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is both sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection. The code is released at: https://github.com/EIT-NLP/EmbedLens.
[387] Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning
Ruoshuang Du, Xin Sun, Qiang Liu, Bowen Song, Zhongqi Chen, Weiqiang Wang, Liang Wang
Main category: cs.CV
TL;DR: MMA-RAG is a multimodal adaptive retrieval-augmented generation system that dynamically decides when to use external knowledge based on model confidence, improving VQA reliability by reducing hallucinations from irrelevant retrieved content.
Details
Motivation: Visual Question Answering systems suffer from hallucinations where models generate answers misaligned with visual input or factual knowledge. Traditional RAG frameworks using static retrieval often introduce irrelevant or conflicting content, especially in visual settings where visually similar but semantically incorrect evidence may be retrieved.Method: Proposes Multimodal Adaptive RAG (MMA-RAG) which dynamically assesses model confidence in internal knowledge to decide whether to incorporate retrieved external information. Uses a decision classifier trained through layer-wise analysis that leverages joint internal visual and textual representations to guide reverse image retrieval.
Result: Achieves significant improvement in response performance across three VQA datasets. Ablation studies highlight the importance of internal representations in adaptive retrieval decisions. Effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
Conclusion: MMA-RAG effectively addresses hallucination issues in VQA by adaptively deciding when to use external knowledge, improving reliability through better integration of multimodal internal representations with external retrieval.
Abstract: Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrated that the model achieves a significant improvement in response performance in three VQA datasets. Meanwhile, ablation studies highlighted the importance of internal representations in adaptive retrieval decisions. In general, the experimental results demonstrated that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
[388] Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng
Main category: cs.CV
TL;DR: WFS-SB is a training-free framework for video understanding that selects frames by detecting semantic boundaries using wavelet transform, then allocates frame budgets to coherent clips and selects diverse frames within each clip.
Details
Motivation: Current frame selection methods for long videos focus on query relevance but ignore narrative structure, resulting in disjointed frames. Effective video understanding requires capturing semantic shifts - pivotal moments of narrative change that reveal the holistic storyline.Method: 1) Use wavelet transform to decompose noisy query-frame similarity signals into multiple scales, extracting clean semantic change signals from coarsest scale; 2) Identify local extrema as semantic boundaries to segment video into coherent clips; 3) Two-stage selection: adaptively allocate frame budget to each clip based on composite importance score, then use Maximal Marginal Relevance within each clip for diverse yet relevant frame selection.
Result: Significant performance improvements: 5.5% accuracy boost on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
Conclusion: WFS-SB demonstrates that effective video understanding requires capturing semantic shifts, not just query relevance, and provides a training-free framework using wavelet analysis for robust semantic boundary detection and intelligent frame selection.
Abstract: Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
[389] MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun
Main category: cs.CV
TL;DR: MLLM-4D is a framework that enhances multimodal LLMs’ 4D spatiotemporal understanding from 2D RGB inputs through specialized data curation and training methods.
Details
Motivation: Current MLLMs lack vision-based 4D spatial-temporal intelligence - the ability to perceive and reason about 3D space evolution over time from visual inputs, which is a critical bottleneck.Method: 1) Cost-efficient data curation pipeline repurposing stereo video datasets into 4D spatiotemporal instructional data (MLLM4D-2M, MLLM4D-R1-30k datasets); 2) Post-training strategy with SFT for foundational understanding and Group Relative Policy Optimization with Spatiotemporal Chain of Thought prompting and reward functions for reasoning.
Result: MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs, as demonstrated through extensive experiments.
Conclusion: The framework successfully bridges gaps in training data and model post-training for spatiotemporal understanding, advancing MLLMs’ 4D intelligence without architectural modifications.
Abstract: Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward) without involving the modification of architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
[390] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang
Main category: cs.CV
TL;DR: Vision-TTT introduces a linear-time sequence modeling method for vision transformers that compresses visual token sequences through self-supervised learning, achieving efficient computation while maintaining strong performance on ImageNet and downstream tasks.
Details
Motivation: Vision Transformers face challenges with quadratic complexity of self-attention, limiting their scalability and efficiency. The paper aims to develop a more efficient visual representation learning method that maintains expressiveness while reducing computational costs.Method: Proposes Vision-TTT, which extends Test-Time Training (TTT) to vision by compressing visual token sequences in a self-supervised manner. Incorporates bidirectional scan strategy and Conv2d modules to model 2D visual correlations with global receptive fields while maintaining linear-time complexity.
Result: Vision-TTT achieves 77.3%, 81.2%, 82.5% Top-1 accuracy on ImageNet for different model sizes, outperforms counterparts on downstream tasks, reduces FLOPs by 79.4%, runs 4.38x faster with 88.9% less memory than DeiT-T at 1280x1280 resolution.
Conclusion: Vision-TTT demonstrates both expressiveness and efficiency as a next-generation visual backbone, offering a promising alternative to traditional Vision Transformers by addressing their computational limitations while maintaining strong performance.
Abstract: Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we introduce a new linear-time sequence modeling method Test-Time Training (TTT) into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that \texttt{Vittt-T/S/B} achieve 77.3%,81.2%,82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, \texttt{Vittt-T} reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
[391] Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness
Yuyang Chen, Linqian Zeng, Yijin ZHou, Hengjie Li, Jidong Zhai
Main category: cs.CV
TL;DR: Jano is a training-free framework that accelerates Diffusion Transformers (DiTs) by recognizing heterogeneous convergence patterns in different image regions and adaptively scheduling computational resources.
Details
Motivation: Diffusion models, especially Diffusion Transformers (DiTs), suffer from computational inefficiency due to intensive full-attention computation. Existing acceleration approaches use uniform optimization strategies, but different image regions actually exhibit heterogeneous convergence patterns during denoising.Method: Jano introduces: 1) Early-stage complexity recognition algorithm that identifies regional convergence requirements within initial denoising steps, and 2) Adaptive token scheduling runtime that optimizes computational resource allocation based on regional complexity.
Result: Jano achieves substantial acceleration with average 2.0× speedup (up to 2.4×) while preserving generation quality, evaluated on state-of-the-art models.
Conclusion: The work challenges conventional uniform processing assumptions in diffusion models and provides a practical solution for accelerating large-scale content generation through region-aware optimization.
Abstract: Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at https://github.com/chen-yy20/Jano.
[392] Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
Zhen Zhou, Jian Liu, Biwen Lei, Jing Xu, Haohan Weng, Yiling Zhu, Zhuo Chen, Junfeng Fan, Yunkai Ma, Dazhao Du, Song Guo, Fengshui Jing, Chunchao Guo
Main category: cs.CV
TL;DR: Mesh-Pro: An asynchronous online RL framework for 3D mesh generation with novel ARPO algorithm and mesh tokenization, achieving state-of-the-art performance.
Details
Motivation: RL has shown success in text and image generation but remains largely unexplored in 3D generation. Existing methods like DPO suffer from low training efficiency and limited generalization in 3D mesh generation.Method: 1) First asynchronous online RL framework for 3D mesh generation; 2) Advantage-guided Ranking Preference Optimization (ARPO) algorithm; 3) Diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation; 4) Ray-based reward for geometric integrity.
Result: Asynchronous framework is 3.75× faster than synchronous RL. Mesh-Pro achieves state-of-the-art performance on both artistic and dense meshes with better training efficiency and generalization.
Conclusion: The proposed Mesh-Pro framework significantly advances RL-based 3D mesh generation through asynchronous training, novel ARPO algorithm, and specialized mesh representation, setting new state-of-the-art benchmarks.
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored for 3D mesh generation post-training efficiency improvement, which is 3.75$\times$ faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.
[393] TP-Spikformer: Token Pruned Spiking Transformer
Wenjie Wei, Xiaolong Zhou, Malu Zhang, Ammar Belatreche, Qian Sun, Yimeng Shan, Dehao Zhang, Zijian Zhou, Zeyu Ma, Yang Yang, Haizhou Li
Main category: cs.CV
TL;DR: TP-Spikformer introduces a token pruning method for spiking transformers that reduces computational overhead while maintaining performance through spatiotemporal information-retaining criteria and block-level early stopping.
Details
Motivation: Spiking neural networks (SNNs) offer energy efficiency but recent spiking transformers focus on large-scale architectures requiring significant computational resources, limiting deployment on resource-constrained devices. There's a need for methods that reduce storage and computational overhead while maintaining competitive performance.Method: Proposes TP-Spikformer with: 1) heuristic spatiotemporal information-retaining criterion to evaluate token importance, 2) information-retaining token pruning framework using block-level early stopping strategy for uninformative tokens (instead of removal), 3) training-free operation across diverse spiking transformer architectures.
Result: Demonstrated effectiveness, efficiency and scalability across diverse architectures (Spikformer, QKFormer, Spike-driven Transformer V1/V3) and tasks (image classification, object detection, semantic segmentation, event-based object tracking). Performs well in training-free manner.
Conclusion: TP-Spikformer offers an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources, showing potential for resource-constrained deployment scenarios.
Abstract: Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens’ importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.
[394] CaptionFool: Universal Image Captioning Model Attacks
Swapnil Parekh
Main category: cs.CV
TL;DR: CaptionFool: A universal adversarial attack that modifies only 7 image patches (1.2% of image) to force transformer-based image captioning models to generate arbitrary target captions, including offensive content and slang designed to evade content moderation filters.
Details
Motivation: Image captioning models are vulnerable to adversarial attacks, but existing attacks often require input-specific perturbations. The authors aim to develop a universal (input-agnostic) attack that can manipulate caption generation with minimal image modifications, exposing critical vulnerabilities in deployed vision-language models.Method: Develops CaptionFool, a novel universal adversarial attack against transformer-based captioning models. The attack modifies only 7 out of 577 image patches (approximately 1.2% of the image) to generate arbitrary target captions. The method specifically designs perturbations to generate “slang” terms that can evade existing content moderation filters.
Result: Achieves 94-96% success rate in generating arbitrary target captions, including offensive content. Demonstrates ability to generate slang terms specifically designed to evade content moderation filters. Shows that minimal image perturbations (just 7 patches) can completely manipulate caption generation.
Conclusion: Exposes critical vulnerabilities in deployed vision-language models and underscores the urgent need for robust defenses against such universal adversarial attacks. The attack’s effectiveness with minimal modifications highlights significant security concerns for multimodal AI systems.
Abstract: Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate “slang” terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
[395] RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation
Xianhao Zhou, Jianghao Wu, Lanfeng Zhong, Ku Zhao, Jinlong He, Shaoting Zhang, Guotai Wang
Main category: cs.CV
TL;DR: RAFM introduces retrieval-augmented flow matching for unpaired CBCT-to-CT translation, using DINOv3 embeddings and a CT memory bank to create better pseudo pairs for stable training.
Details
Motivation: CBCT images have artifacts and unreliable HU values, making them unsuitable for dose calculation. Paired CBCT-CT data is often unavailable due to temporal gaps, anatomical variations, and registration errors, creating a need for effective unpaired translation methods.Method: Proposes Retrieval-Augmented Flow Matching (RAFM) that adapts rectified flow to medical imaging by using a frozen DINOv3 encoder and global CT memory bank to construct retrieval-guided pseudo pairs, improving coupling quality and stabilizing unpaired flow-based training.
Result: Experiments on SynthRAD2023 under strict subject-level true-unpaired protocol show RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore metrics.
Conclusion: RAFM effectively addresses the challenges of unpaired CBCT-to-CT translation in medical imaging by improving pseudo pair construction and stabilizing flow-based training, achieving state-of-the-art performance.
Abstract: Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT–CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at https://github.com/HiLab-git/RAFM.git.
[396] Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation
Yafei Zhang, Shuaitian Song, Huafeng Li, Shujuan Wang, Yu Liu
Main category: cs.CV
TL;DR: A novel adaptive dynamic dehazing framework with closed-loop optimization that uses task feedback and text instructions to adapt dehazing outputs for multiple downstream vision tasks without retraining.
Details
Motivation: Real-world vision systems need haze removal not just for image visibility enhancement, but also to meet specific requirements of diverse downstream tasks. Current approaches lack adaptability to different task needs without retraining.Method: Proposes a closed-loop optimization framework with two mechanisms: (1) task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks, and (2) text instruction interface allowing users to specify high-level task preferences for real-time adjustment.
Result: Extensive experiments across various vision tasks demonstrate strong effectiveness, robustness, and generalizability of the approach, establishing a new paradigm for interactive, task-adaptive dehazing.
Conclusion: The framework enables feedback-driven refinement and user instruction-guided adjustment during inference, allowing the model to satisfy specific requirements of multiple downstream tasks without retraining, representing an interactive and collaborative dehazing approach.
Abstract: In real-world vision systems,haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream tasks.To address this challenge,we propose a novel adaptive dynamic dehazing framework that incorporates a closed-loop optimization mechanism.It enables feedback-driven refinement based on downstream task performance and user instruction-guided adjustment during inference,allowing the model to satisfy the specific requirements of multiple downstream tasks without retraining.Technically,our framework integrates two complementary and innovative mechanisms: (1)a task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks,and (2) a text instruction interface that allows users to specify high-level task preferences.This dual-guidance strategy enables the model to adapt its dehazing behavior after training,tailoring outputs in real time to the evolving needs of multiple tasks.Extensive experiments across various vision tasks demonstrate the strong effectiveness,robustness,and generalizability of our approach.These results establish a new paradigm for interactive,task-adaptive dehazing that actively collaborates with downstream applications.
[397] Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang
Main category: cs.CV
TL;DR: ScaleFormer is a novel architecture for cross-scale pansharpening that treats resolution generalization as sequence length generalization, using patch tokenization with variable sequence lengths and rotary positional encoding for better scale extrapolation.
Details
Motivation: Existing pansharpening methods are evaluated under limited, low-resolution settings and lack generalization to real-world high-resolution scenarios. There's a need to systematically address data, algorithmic, and computational challenges for cross-scale pansharpening.Method: Proposes ScaleFormer architecture that tokenizes images into patch sequences with variable lengths proportional to image scale. Uses Scale-Aware Patchify module for training from fixed-size crops, decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, and incorporates Rotary Positional Encoding for extrapolation to unseen scales.
Result: Extensive experiments show ScaleFormer outperforms state-of-the-art methods in both fusion quality and cross-scale generalization. The approach is validated on the newly introduced PanScale dataset and PanScale-Bench benchmark.
Conclusion: The proposed ScaleFormer successfully addresses cross-scale pansharpening challenges by reframing resolution generalization as sequence length generalization, demonstrating superior performance and generalization capabilities across varying resolutions.
Abstract: Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available upon acceptance.
[398] Multiple Inputs and Mixwd data for Alzheimer’s Disease Classification Based on 3D Vision Transformer
Juan A. Castro-Silva, Maria N. Moreno Garcia, Diego H. Peluffo-Ordoñez
Main category: cs.CV
TL;DR: A novel 3D Vision Transformer method for Alzheimer’s Disease diagnosis that processes consecutive MRI slices, fuses multiple 3D ROI data, and integrates demographic/cognitive information, achieving 97.14% accuracy.
Details
Motivation: Current Alzheimer's Disease diagnosis methods using MRI have limitations: 2D Transformers lose 3D context, ROI-based models focus on too few regions, and single-test approaches lack multifaceted integration needed for accurate diagnosis.Method: MIMD-3DVT (Multiple Inputs and Mixed Data 3D Vision Transformer) processes consecutive MRI slices to capture 3D spatial information, fuses multiple 3D ROI imaging data, and integrates mixed data including demographic factors and cognitive assessments.
Result: The method achieved 97.14% accuracy in distinguishing between Normal Cognition and Alzheimer’s Disease using combined datasets (ADNI, AIBL, OASIS), outperforming state-of-the-art methods.
Conclusion: The proposed 3D Vision Transformer approach with multimodal data integration provides superior Alzheimer’s Disease diagnosis by capturing comprehensive 3D brain information and leveraging diverse data sources.
Abstract: The current methods for diagnosing Alzheimer Disease using Magnetic Resonance Imaging (MRI) have significant limitations. Many previous studies used 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D contextual information. Region of interest-based models often focus on only a few brain regions despite Alzheimer’s affecting multiple areas. Additionally, most classification models rely on a single test, whereas diagnosing Alzheimer’s requires a multifaceted approach integrating diverse data sources for a more accurate assessment. This study introduces a novel methodology called the Multiple Inputs and Mixed Data 3D Vision Transformer (MIMD-3DVT). This method processes consecutive slices together to capture the feature dimensions and spatial information, fuses multiple 3D ROI imaging data inputs, and integrates mixed data from demographic factors, cognitive assessments, and brain imaging. The proposed methodology was experimentally evaluated using a combined dataset that included the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarker, and Lifestyle Flagship Study of Ageing (AIBL), and the Open Access Series of Imaging Studies (OASIS). Our MIMD-3DVT, utilizing single or multiple ROIs, achieved an accuracy of 97.14%, outperforming the state-of-the-art methods in distinguishing between Normal Cognition and Alzheimer’s Disease.
[399] Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
Yu Wang, Shengjie Zhao
Main category: cs.CV
TL;DR: LAS-VAD: A novel weakly supervised video anomaly detection framework that learns anomaly semantics through anomaly-connected components, intention awareness, and anomaly attribute information to improve detection accuracy without dense frame-level annotations.
Details
Motivation: Weakly supervised video anomaly detection (WS-VAD) lacks dense frame-level annotations, making it difficult for existing methods to effectively learn anomaly semantics. The paper aims to address this limitation by developing a framework that can better understand anomaly characteristics despite the weak supervision.Method: Proposes LAS-VAD with three key mechanisms: 1) Anomaly-connected component mechanism groups video frames into semantic groups sharing identical information, 2) Intention awareness mechanism distinguishes between similar normal and abnormal behaviors, and 3) Incorporates anomaly attribute information (e.g., flames and smoke for explosions) to guide accurate detection.
Result: Extensive experiments on XD-Violence and UCF-Crime datasets show LAS-VAD outperforms current state-of-the-art methods with remarkable gains, demonstrating effectiveness in learning anomaly semantics under weak supervision.
Conclusion: LAS-VAD successfully addresses the limitation of learning anomaly semantics in weakly supervised VAD by integrating semantic grouping, intention awareness, and attribute information, achieving superior performance on benchmark datasets.
Abstract: Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates anomaly-connected component mechanism and intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (i.e., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
[400] Geometry OR Tracker: Universal Geometric Operating Room Tracking
Yihua Shao, Kang Chen, Feng Xue, Siyu Chen, Long Bai, Hongyuan Yu, Hao Tang, Jinlin Wu, Nassir Navab
Main category: cs.CV
TL;DR: Geometry OR Tracker: A two-stage pipeline for robust multi-view 3D tracking in operating rooms that addresses unreliable camera calibration through metric geometry rectification and occlusion-robust tracking.
Details
Motivation: Real clinical deployments in operating rooms rarely satisfy geometric prerequisites for stable multi-view fusion and tracking due to unreliable camera calibration and RGB-D registration, leading to cross-view geometric inconsistency ("ghosting") and degraded 3D trajectories.Method: Two-stage pipeline: 1) Multi-view Metric Geometry Rectification module that rectifies imprecise calibration into a scale-consistent and geometrically consistent camera setup with single global scale, 2) Occlusion-Robust 3D Point Tracking directly in unified OR world frame.
Result: On MM-OR benchmark, rectification front-end reduces cross-view depth disagreement by more than 30× compared to raw calibration. Ablation studies show relationship between calibration quality and tracking accuracy, demonstrating improved geometric consistency yields stronger world-frame tracking.
Conclusion: The Geometry OR Tracker addresses geometric inconsistency in clinical OR settings through calibration rectification, enabling more accurate multi-view 3D tracking for downstream applications like surgeon behavior recognition.
Abstract: In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition, where physically meaningful quantities such as distances and motion statistics must be measured in meters. However, real clinical deployments rarely satisfy the geometric prerequisites for stable multi-view fusion and tracking: camera calibration and RGB-D registration are always unreliable, leading to cross-view geometric inconsistency that produces “ghosting” during fusion and degrades 3D trajectories in a shared OR coordinate frame. To address this, we introduce Geometry OR Tracker, a two-stage pipeline that first rectifies imprecise calibration into a scaleconsistent and geometrically consistent camera setup with a single global scale via a Multi-view Metric Geometry Rectification module, and then performs Occlusion-Robust 3D Point Tracking directly in the unified OR world frame. On the MM-OR benchmark, improved geometric consistency translates into tracking gains: our rectification front-end reduces cross-view depth disagreement by more than 30$\times$ compared to raw calibration. Ablation studies further demonstrate the relationship between calibration quality and tracking accuracy, showing that improved geometric consistency yields stronger world-frame tracking.
[401] MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs
Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen, Tao Guan, Shuyuan Luo, Zhongyi Zhai, Yang Liu
Main category: cs.CV
TL;DR: MIDAS is a multimodal jailbreak attack framework that disperses harmful semantics across multiple images and uses cross-image reasoning to bypass MLLM safety mechanisms.
Details
Motivation: Current multimodal jailbreak attacks rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and achieve limited effectiveness against strongly aligned commercial MLLMs. There's a need for more effective attacks that can bypass advanced safety mechanisms.Method: MIDAS decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and leverages cross-image reasoning to gradually reconstruct malicious intent. It enforces longer, structured multi-image chained reasoning to increase model reliance on visual cues while delaying exposure of malicious semantics.
Result: Extensive experiments show MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs, achieving an average attack success rate of 81.46% across 4 closed-source MLLMs.
Conclusion: MIDAS demonstrates that multi-image dispersion and semantic reconstruction can effectively bypass MLLM safety mechanisms, highlighting vulnerabilities in current multimodal security approaches and the need for more robust defenses.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, in this paper, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi-image chained reasoning, substantially increases the model’s reliance on visual cues while delaying the exposure of malicious semantics and significantly reducing the model’s security attention, thereby improving the performance of jailbreak against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed-source MLLMs. Our code is available at this link.
[402] Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
Yongbo He, Zirun Guo, Tao Jin
Main category: cs.CV
TL;DR: DASP is a test-time adaptation framework for multimodal models that addresses negative transfer and catastrophic forgetting through asymmetric adaptation based on modality bias analysis.
Details
Motivation: Multimodal test-time adaptation faces challenges of negative transfer in unbiased modalities and catastrophic forgetting in biased modalities when adapting to evolving distributions.Method: DASP uses a diagnose-then-mitigate framework: first analyzes interdimensional redundancy to identify biased vs unbiased modalities, then applies asymmetric adaptation with decoupled stable/plastic components per modality.
Result: Comprehensive evaluations on diverse multimodal benchmarks show DASP significantly outperforms state-of-the-art methods.
Conclusion: The asymmetric adaptation strategy effectively balances stability and plasticity for different modalities, enabling flexible domain adaptation while preserving generalizable knowledge.
Abstract: Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
[403] WildActor: Unconstrained Identity-Preserving Video Generation
Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Dan Xu
Main category: cs.CV
TL;DR: WildActor is a framework for human video generation that maintains consistent full-body identity across diverse viewpoints and motions, using a large-scale Actor-18M dataset and novel attention mechanisms.
Details
Motivation: Existing human video generation methods struggle with maintaining consistent full-body identity across dynamic shots, viewpoints, and motions, often producing face-centric results or rigid copy-paste artifacts.Method: Proposes WildActor framework with Asymmetric Identity-Preserving Attention and Viewpoint-Adaptive Monte Carlo Sampling that iteratively re-weights reference conditions for balanced manifold coverage, trained on Actor-18M dataset (1.6M videos, 18M images).
Result: WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, outperforming existing methods on the proposed Actor-Bench evaluation.
Conclusion: The approach successfully addresses the challenge of identity-consistent human video generation across unconstrained viewpoints and environments, advancing production-ready digital actor creation.
Abstract: Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.
[404] AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
Cencen Liu, Dongyang Zhang, Wen Yin, Jielei Wang, Tianyu Li, Ji Guo, Wenbo Jiang, Guoqing Wang, Guoming Lu
Main category: cs.CV
TL;DR: AlignVAR is a visual autoregressive framework for image super-resolution that addresses locality-biased attention and residual-only supervision issues through spatial consistency autoregression and hierarchical consistency constraints.
Details
Motivation: VAR models show promise for image generation but face challenges when applied to image super-resolution: locality-biased attention fragments spatial structures, and residual-only supervision accumulates errors across scales, compromising global consistency.Method: Proposes AlignVAR with two key components: (1) Spatial Consistency Autoregression (SCA) uses adaptive mask to reweight attention toward structurally correlated regions, enhancing long-range dependencies; (2) Hierarchical Consistency Constraint (HCC) adds full reconstruction supervision at each scale to expose accumulated deviations early.
Result: AlignVAR enhances structural coherence and perceptual fidelity over existing generative methods, delivers over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches.
Conclusion: AlignVAR establishes a new paradigm for efficient image super-resolution by addressing key challenges in applying visual autoregressive models to this task.
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales, severely compromises global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
[405] UNICBench: UNIfied Counting Benchmark for MLLM
Chenggang Rong, Tao Han, Zhiyuan Zhao, Yaowu Fan, Jia Wan, Song Guo, Yuan Yuan, Junyu Gao
Main category: cs.CV
TL;DR: UNICBench is a unified multimodal counting benchmark for evaluating MLLMs across image, text, and audio modalities with standardized evaluation protocols and comprehensive taxonomy.
Details
Motivation: There is no unified counting dataset to rigorously evaluate multimodal large language models' counting ability across different modalities (image, text, audio), making it difficult to compare models and identify weaknesses.Method: Created UNICBench with 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA) annotated with three-level capability taxonomy and difficulty tags. Implemented standardized evaluation protocol with fixed splits/prompts/seeds and modality-specific matching rules.
Result: Evaluation of 45 state-of-the-art MLLMs shows strong performance on basic counting tasks but significant gaps on reasoning and hardest partitions, revealing long-tail errors and substantial room for improvement in general counting ability.
Conclusion: UNICBench provides a rigorous, comparable basis for measuring multimodal counting capabilities and offers a public toolkit to accelerate progress in this important area of MLLM research.
Abstract: Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
[406] Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation
Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes, Jakob Gawlikowski, Cédric Léonard, Nassim Ait Ali Braham, June Moh Goo, Zichao Zeng, Zhipeng Liu, Pallavi Jain, Andrea Nascetti, Ronny Hänsch
Main category: cs.CV
TL;DR: Novel data-centric benchmark and methods for identifying and quantifying label noise in remote sensing semantic segmentation datasets.
Details
Motivation: Pixel-level annotations for remote sensing semantic segmentation are expensive and often noisy due to labor-intensive annotation processes, which degrades model performance and robustness.Method: Introduces a novel benchmark with publicly available dataset and two techniques leveraging model uncertainty, prediction consistency, and representation analysis to identify, quantify, and rank training samples by label noise level.
Result: Proposed methods consistently outperform established baselines across various experimental settings.
Conclusion: Provides valuable tools for improving remote sensing segmentation by addressing label noise issues, with publicly available resources for the research community.
Abstract: High-quality pixel-level annotations are essential for the semantic segmentation of remote sensing imagery. However, such labels are expensive to obtain and often affected by noise due to the labor-intensive and time-consuming nature of pixel-wise annotation, which makes it challenging for human annotators to label every pixel accurately. Annotation errors can significantly degrade the performance and robustness of modern segmentation models, motivating the need for reliable mechanisms to identify and quantify noisy training samples. This paper introduces a novel Data-Centric benchmark, together with a novel, publicly available dataset and two techniques for identifying, quantifying, and ranking training samples according to their level of label noise in remote sensing semantic segmentation. Such proposed methods leverage complementary strategies based on model uncertainty, prediction consistency, and representation analysis, and consistently outperform established baselines across a range of experimental settings. The outcomes of this work are publicly available at https://github.com/keillernogueira/label_noise_segmentation.
[407] IdGlow: Dynamic Identity Modulation for Multi-Subject Generation
Honghao Cai, Xiangyuan Wang, Yunhao Bai, Tianze Zhou, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li
Main category: cs.CV
TL;DR: IdGlow is a mask-free two-stage framework for multi-subject image generation that addresses the stability-plasticity dilemma using Flow Matching diffusion models with task-adaptive timestep scheduling and fine-grained DPO optimization.
Details
Motivation: Existing multi-subject image generation methods struggle with the "stability-plasticity dilemma," particularly failing in tasks requiring complex structural deformations like identity-preserving age transformation, due to rigid spatial masks or localized attention mechanisms.Method: Two-stage framework: 1) Supervised fine-tuning stage with task-adaptive timestep scheduling (linear decay schedule and temporal gating mechanism) and VLM for prompt synthesis; 2) Fine-grained group-level DPO with weighted margin formulation to eliminate artifacts and improve harmony.
Result: IdGlow fundamentally mitigates the stability-plasticity conflict, achieving superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality on two challenging benchmarks: direct multi-person fusion and age-transformed group generation.
Conclusion: The proposed mask-free, progressive two-stage framework successfully addresses complex multi-subject generation challenges, particularly identity-preserving age transformation, through innovative timestep scheduling and optimization techniques.
Abstract: Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the “stability-plasticity dilemma,” particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks – direct multi-person fusion and age-transformed group generation – demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
[408] Linking Modality Isolation in Heterogeneous Collaborative Perception
Changxing Liu, Zichen Chao, Siheng Chen
Main category: cs.CV
TL;DR: CodeAlign: A novel framework for cross-modal alignment in collaborative perception that addresses modality isolation through codebook-based feature translation without requiring spatial correspondence.
Details
Motivation: Addresses two key challenges in collaborative perception: heterogeneity across agents causing domain gaps, and modality isolation where agents with different modalities never co-occur in training data, which existing alignment methods fail to handle due to their reliance on spatially overlapping observations.Method: Proposes CodeAlign framework using cross-modal feature-code-feature (FCF) translation. Uses codebooks to regularize feature spaces into compact code spaces, then learns mappings between modality-specific feature spaces by translating features to corresponding codes of other modalities and decoding back into target feature spaces.
Result: Achieves state-of-the-art perception performance on OPV2V and DAIR-V2X datasets while requiring only 8% of training parameters compared to prior alignment methods and reducing communication load by 1024x.
Conclusion: CodeAlign provides an efficient, co-occurrence-free alignment framework that effectively addresses modality isolation in collaborative perception through codebook-based feature translation, enabling seamless cross-modal collaboration without spatial correspondence requirements.
Abstract: Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations, thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature(FCF) translation. The key idea is to explicitly identify the representation consistency through codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both OPV2V and DAIR-V2X dataset. Code will be released on https://github.com/cxliu0314/CodeAlign.
[409] Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li, Zikang Huo, Linsen Chen, Chongde Zi, Xun Cao
Main category: cs.CV
TL;DR: PG-SVRT advances spectral compressive imaging from image to video reconstruction, using spatial-then-temporal attention and bridged tokens to achieve high-quality, temporally consistent hyperspectral video reconstruction with minimal computational cost.
Details
Motivation: Existing spectral compressive imaging methods are image-based and suffer from two key limitations: (1) encoding process masks spatial-spectral features causing uncertainty in reconstructing missing information from single compressed measurements, and (2) frame-by-frame reconstruction fails to ensure temporal consistency crucial for video perception.Method: Proposed Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT) with spatial-then-temporal attention to reconstruct spectral features from video information, using bridged tokens to reduce computational complexity. Also created DynaSpec dataset (first high-quality dynamic hyperspectral image dataset) and conducted experiments with four SCI systems including a DD-CASSI prototype.
Result: PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency while maintaining minimal FLOPs. Extensive experiments demonstrate effectiveness across simulation and real-world data collection.
Conclusion: The paper successfully advances spectral reconstruction from image to video level by leveraging complementary features and temporal continuity across adjacent frames, addressing key limitations of existing methods through the proposed PG-SVRT architecture and comprehensive experimental validation.
Abstract: Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) Encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) The frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in the video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs a spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs. Project page: https://github.com/nju-cite/DynaSpec
[410] Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered
Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu
Main category: cs.CV
TL;DR: The paper argues that current single-metric image quality assessment benchmarks are inadequate for evaluating modern visual processing systems, especially generative and perception-oriented methods, and calls for more human-centered, context-aware evaluation approaches.
Details
Motivation: The motivation stems from the growing mismatch between traditional objective IQA metrics and human perception/user preferences in the era of generative and perception-oriented visual processing methods. Current evaluation paradigms risk constraining innovation and misguiding research progress.Method: This is a position paper that presents arguments and calls for paradigm shifts rather than proposing specific technical methods. The approach is conceptual, advocating for rebalancing evaluation frameworks toward human-centered, context-aware, and fine-grained assessment.
Result: The paper establishes the need for evaluation reform in visual processing systems, highlighting the divergence between objective metrics and human perception, and provides a framework for more appropriate assessment approaches.
Conclusion: The conclusion calls for moving beyond single-metric IQA benchmarks toward more comprehensive evaluation paradigms that better align with human perception and user preferences, particularly for generative and perception-oriented visual processing systems.
Abstract: This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models’ outcomes.
[411] Exploring 3D Dataset Pruning
Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Zhiqiang Shen
Main category: cs.CV
TL;DR: This paper introduces dataset pruning methods specifically designed for 3D data, addressing the unique challenges of long-tail class distributions and conflicting evaluation metrics (OA vs mAcc).
Details
Motivation: Dataset pruning has been well-studied for 2D images but remains largely unexplored for 3D data. The long-tail class distribution nature of 3D datasets creates inherent conflicts between optimization under Overall Accuracy (OA) and Mean Accuracy (mAcc), making pruning particularly challenging for 3D data.Method: The authors formulate pruning as approximating full-data expected risk with a weighted subset, identifying two key errors: coverage error (insufficient representativeness) and prior-mismatch bias (inconsistency between subset-induced class weights and target metrics). They propose representation-aware subset selection with per-class retention quotas for long-tail coverage, and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation.
Result: Extensive experiments on 3D datasets show that the method can improve both OA and mAcc metrics across multiple settings while adapting to different downstream preferences. The retention quota serves as a switch to control the OA-mAcc trade-off.
Conclusion: The paper presents the first comprehensive study of dataset pruning for 3D data, addressing the unique challenges of long-tail distributions and conflicting evaluation metrics through novel theoretical formulation and practical methods that enable flexible trade-off control between different accuracy metrics.
Abstract: Dataset pruning has been widely studied for 2D images to remove redundancy and accelerate training, while particular pruning methods for 3D data remain largely unexplored. In this work, we study dataset pruning for 3D data, where its observed common long-tail class distribution nature make optimization under conventional evaluation metrics Overall Accuracy (OA) and Mean Accuracy (mAcc) inherently conflicting, and further make pruning particularly challenging. To address this, we formulate pruning as approximating the full-data expected risk with a weighted subset, which reveals two key errors: coverage error from insufficient representativeness and prior-mismatch bias from inconsistency between subset-induced class weights and target metrics. We propose representation-aware subset selection with per-class retention quotas for long-tail coverage, and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation. The retention quota also serves as a switch to control the OA-mAcc trade-off. Extensive experiments on 3D datasets show that our method can improve both metrics across multiple settings while adapting to different downstream preferences. Our code is available at https://github.com/XiaohanZhao123/3D-Dataset-Pruning.
[412] RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception
Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Huiliang Shen
Main category: cs.CV
TL;DR: RC-GeoCP is a collaborative perception framework that fuses 4D radar and camera data across multiple agents, using radar-anchored geometric consensus to align visual semantics with spatial geometry while reducing communication overhead.
Details
Motivation: LiDAR-based collaborative perception systems are expensive and degrade in adverse weather, while camera-radar fusion remains underexplored in collaborative settings despite offering dense visual semantics and robust spatial measurements.Method: Three key components: 1) Geometric Structure Rectification (GSR) aligns visual semantics with radar-derived geometry, 2) Uncertainty-Aware Communication (UAC) selectively transmits informative features based on inter-agent disagreement, 3) Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors.
Result: State-of-the-art performance on V2X-Radar and V2X-R benchmarks with significantly reduced communication overhead compared to existing methods.
Conclusion: RC-GeoCP successfully demonstrates the effectiveness of radar-camera fusion in collaborative perception, establishing a new benchmark and showing superior performance with efficient communication.
Abstract: Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
[413] Stateful Cross-layer Vision Modulation
Ying Liu, Yudong Han, Kean Shi, Liyuan Pan
Main category: cs.CV
TL;DR: SCVM introduces a cross-layer memory-modulated vision framework for MLLMs that uses recursively updated memory states and layer-wise feedback modulation to better preserve fine-grained visual details without modifying the language model.
Details
Motivation: Existing MLLMs use static concatenation or weighted aggregation of visual features after encoding, which can suppress fine-grained details from early layers during hierarchical abstraction. There's also a semantic distribution mismatch when introducing shallow-layer features to LLMs pretrained on specific visual feature spaces.Method: Proposes SCVM with: 1) recursively updated cross-layer memory state inside vision encoder to model long-range inter-layer dependencies; 2) layer-wise feedback modulation that refreshes token representations at each layer based on accumulated memory; 3) auxiliary semantic alignment objective supervising final memory state for progressive compression and reinforcement of task-relevant information.
Result: Experimental results on multiple visual question answering and hallucination evaluation benchmarks show consistent performance improvements without expanding visual tokens, adding vision encoders, or modifying/fine-tuning the language model.
Conclusion: SCVM effectively addresses visual representation limitations in MLLMs through representation evolution control, achieving better performance while maintaining model compatibility and efficiency.
Abstract: Recent multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. However, existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. As a result, fine-grained details from early layers may be progressively suppressed during hierarchical abstraction. Moreover, directly introducing shallow-layer features into the language model often leads to semantic distribution mismatch with the visual feature space that the LLM’s cross-attention layers were pretrained on, which typically requires additional adaptation or fine-tuning of the LLM. To address these limitations, we revisit visual representation learning from the perspective of representation evolution control and propose a cross-layer memory-modulated vision framework(SCVM). Specifically, we introduce a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies. We further design a layer-wise feedback modulation mechanism that refreshes token representations at each layer based on the accumulated memory, thereby structurally regulating the representation evolution trajectory. In addition, we incorporate an auxiliary semantic alignment objective that explicitly supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information. Experimental results on multiple visual question answering and hallucination evaluation benchmarks demonstrate that SCVM achieves consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
[414] Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang
Main category: cs.CV
TL;DR: HistoSelect: A question-guided, tissue-aware, coarse-to-fine retrieval framework for pathology question-answering that mimics human slide examination by selectively sampling relevant tissue regions and patches, reducing visual tokens by 70% while improving accuracy.
Details
Motivation: Current pathology question-answering models use uniform patch sampling or broad attention maps, attending equally to irrelevant regions while missing key visual evidence. Pathologists naturally scan broadly and zoom in selectively based on clinical questions, but models lack this human-like navigation of tissue complexity.Method: Proposes HistoSelect with two components: 1) group sampler identifies question-relevant tissue regions, 2) patch selector retrieves most informative patches within those regions. Uses question-guided, tissue-aware, coarse-to-fine retrieval to mimic human slide examination patterns.
Result: Reduces visual token usage by 70% on average while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions.
Conclusion: Bringing human-like search and attention patterns into whole slide image reasoning is promising for building practical and reliable pathology vision-language models. Selective attention to relevant regions improves efficiency and accuracy.
Abstract: Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
[415] Direct low-field MRI super-resolution using undersampled k-space
Daniel Tweneboah Anyimadu, Mohammed M. Abdelsamea, Ahmed Karam Eldaly
Main category: cs.CV
TL;DR: A k-space dual channel U-Net framework for reconstructing high-field MRI-like images directly from undersampled low-field k-space data, outperforming spatial-domain methods.
Details
Motivation: Low-field MRI is affordable but suffers from long acquisition times and poor image quality. Current super-resolution and image quality transfer methods rely on spatial-domain post-processing, but the authors propose working directly in k-space to better reconstruct high-field quality images from undersampled low-field data.Method: Proposes a k-space dual channel U-Net that processes real and imaginary components of undersampled low-field k-space data to restore missing frequency content, enabling direct reconstruction of high-field-like images from undersampled acquisitions.
Result: The k-space-driven image enhancement consistently outperforms spatial-domain methods, and reconstructions from undersampled k-space achieve image quality comparable to full k-space acquisitions on low-field brain MRI data.
Conclusion: This is the first work investigating low-field MRI super-resolution/image quality transfer directly from undersampled k-space, demonstrating superior performance over spatial-domain approaches for accelerated low-field MRI reconstruction.
Abstract: Low-field magnetic resonance imaging (MRI) provides affordable access to diagnostic imaging but suffers from prolonged acquisition and limited image quality. Accelerated imaging can be achieved with k-space undersampling, while super-resolution (SR) and image quality transfer (IQT) methods typically rely on spatial-domain post-processing. In this work, we propose a novel framework for reconstructing high-field MR like images directly from undersampled low-field k-space. Our approach employs a k-space dual channel U-Net that processes the real and imaginary components of undersampled k-space to restore missing frequency content. Experiments on low-field brain MRI demonstrate that our k-space-driven image enhancement consistently outperforms the counterpart spatial-domain method. Furthermore, reconstructions from undersampled k-space achieve image quality comparable to full k-space acquisitions. To the best of our knowledge, this is the first work that investigates low-field MRI SR/IQT directly from undersampled k-space.
[416] Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis
Youngjin Yoo, Han Liu, Bogdan Georgescu, Yanbo Zhang, Sasa Grbic, Michael Baumgartner, Thomas J. Re, Jyotipriya Das, Poikavila Ullaskrishnan, Eva Eibenberger, Andrei Chekkoury, Uttam K. Bodanapally, Savvas Nicolaou, Pina C. Sanelli, Thomas J. Schroeppel, Yvonne W. Lui, Eli Gibson
Main category: cs.CV
TL;DR: MoLRE framework extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing for multi-label medical image analysis, achieving consistent improvements across various foundation models for head CT finding detection.
Details
Motivation: Standard parameter-efficient fine-tuning methods like LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings in complex multi-label diagnostic tasks like comprehensive head CT finding detection.Method: Proposes Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing, enabling conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision.
Result: Consistent performance improvements across all tested models on 70,000+ head CT scans with 75 annotated findings. Gains varied by model type: general-purpose and medical-domain models showed largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), while 3D CT-specialized or very large models showed more modest gains (+0.2-1.3%). MoLRE + MedGemma achieved highest average detection AUC of 0.917.
Conclusion: MoLRE effectively adapts foundation models to complex multi-label medical tasks, highlighting the importance of systematic benchmarking as pretraining domain, architecture, and model scale interact in non-obvious ways for clinical applications.
Abstract: Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks-such as comprehensive head CT finding detection-remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings-including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes-our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2-1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917. These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.
[417] CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
Yushan Han, Hui Zhang, Qiming Xia, Yi Jin, Yidong Li
Main category: cs.CV
TL;DR: CoLC: Communication-efficient early collaborative perception framework using LiDAR completion to restore scene completeness under sparse transmission
Details
Motivation: Early fusion in collaborative perception offers better perceptual complementarity and robustness to model heterogeneity, but suffers from high communication costs that limit practical deploymentMethod: Three complementary designs: 1) Foreground-Aware Point Sampling (FAPS) for selective transmission, 2) Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from sparse inputs, 3) Dense-Guided Dual Alignment (DGDA) for semantic/geometric consistency during training
Result: Achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings on both simulated and real-world datasets
Conclusion: CoLC enables practical deployment of early fusion collaborative perception by addressing communication efficiency while maintaining perceptual benefits
Abstract: Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose a communication-efficient early Collaborative perception framework that incorporates LiDAR Completion to restore scene completeness under sparse transmission, dubbed as CoLC. Specifically, the CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings. The code is available at https://github.com/CatOneTwo/CoLC.
[418] SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion
Guoquan Wei, Liu Shi, Shaoyu Wang, Mohan Li, Cunfeng Wei, Qiegen Liu
Main category: cs.CV
TL;DR: A self-supervised CT reconstruction method using spatial nonlocal similarity and projection domain conjugate properties to generate pseudo-3D data for training, achieving high-fidelity results without external data or lengthy pre-training.
Details
Motivation: Current CT reconstruction methods either take too long or rely on data-driven models that don't adequately leverage the inherent information in medical 3D data, especially under ultra-low raw data conditions.Method: Proposes a reconstruction method that uses spatial nonlocal similarity and conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, requiring no external data and avoiding lengthy pre-training.
Result: Extensive experiments show the method mitigates detector-induced ring artifacts and exhibits unprecedented capabilities in detail recovery, achieving high-fidelity results in very short time.
Conclusion: The method provides a new paradigm for research using unlabeled raw projection data in medical imaging reconstruction.
Abstract: Noise and artifacts during computed tomography (CT) scans are a fundamental challenge affecting disease diagnosis. However, current methods either involve excessively long reconstruction times or rely on data-driven models for optimization, failing to adequately consider the valuable information inherent in the data itself, especially medical 3D data. This work proposes a reconstruction method under ultra-low raw data conditions, requiring no external data and avoiding lengthy pre-training processes. By leveraging spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, high-fidelity results can be achieved in a very short time. Extensive experiments demonstrate that this method not only mitigates detector-induced ring artifacts but also exhibits unprecedented capabilities in detail recovery. This method provides a new paradigm for research using unlabeled raw projection data. Code is available at https://github.com/yqx7150/SCOUT.
[419] STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification
Xingguo Xu, Zhanyu Liu, Weixiang Zhou, Yuansheng Gao, Junjie Cao, Yuhao Wang, Jixiang Luo, Dell Zhang
Main category: cs.CV
TL;DR: STMI: A multi-modal object ReID framework using segmentation-guided feature modulation, semantic token reallocation, and cross-modal hypergraph interaction to improve feature representation across modalities.
Details
Motivation: Existing multi-modal ReID methods suffer from hard token filtering or simple fusion strategies that lose discriminative cues and increase background interference, limiting their effectiveness in exploiting complementary information from different modalities.Method: Three-component framework: 1) Segmentation-Guided Feature Modulation (SFM) uses SAM-generated masks for foreground enhancement and background suppression via learnable attention modulation; 2) Semantic Token Reallocation (STR) employs learnable query tokens with adaptive reallocation to extract compact representations without token discarding; 3) Cross-Modal Hypergraph Interaction (CHI) constructs unified hypergraph across modalities to capture high-order semantic relationships.
Result: Extensive experiments on RGBNT201, RGBNT100, and MSVR310 benchmarks demonstrate the effectiveness and robustness of STMI in multi-modal ReID scenarios.
Conclusion: STMI provides an effective solution for multi-modal object ReID by addressing limitations of existing methods through integrated segmentation guidance, token reallocation, and cross-modal hypergraph modeling.
Abstract: Multi-modal object Re-Identification (ReID) aims to exploit complementary information from different modalities to retrieve specific objects. However, existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) Segmentation-Guided Feature Modulation (SFM) module leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) Semantic Token Reallocation (STR) module employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) Cross-Modal Hypergraph Interaction (CHI) module constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public benchmarks (i.e., RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
[420] TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
Yihui Li, Chengxin Lv, Zichen Tang, Hongyu Yang, Di Huang
Main category: cs.CV
TL;DR: TokenSplat: A feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images using token-aligned Gaussian prediction and asymmetric dual-flow decoding.
Details
Motivation: Existing methods for 3D reconstruction often require known camera poses or rely on iterative refinement, which can be slow and unstable. There's a need for a feed-forward approach that can jointly handle 3D reconstruction and camera pose estimation from unposed images without iterative refinement.Method: Introduces Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views in feature space, guided by coarse token positions and fusion confidence. Uses learnable camera tokens and Asymmetric Dual-Flow Decoder (ADF-Decoder) to enforce directionally constrained communication between camera and image tokens, maintaining clean factorization.
Result: Achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.
Conclusion: TokenSplat demonstrates that joint 3D reconstruction and camera pose estimation can be effectively achieved through a feed-forward architecture with proper token alignment and factorization mechanisms, without requiring iterative refinement.
Abstract: We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods. Project page: https://kidleyh.github.io/tokensplat/.
[421] Towards Universal Khmer Text Recognition
Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing, Masakazu Iwamura, Koichi Kise
Main category: cs.CV
TL;DR: A universal Khmer text recognition framework using modality-aware adaptive feature selection to handle diverse text modalities (printed, handwritten, scene text) for a low-resource language.
Details
Motivation: Khmer is a low-resource language with complex script, making OCR challenging. Current approaches have limitations: modality-specific models prevent cross-modality transfer learning and create deployment overhead, while combined training leads to degraded performance on underrepresented modalities.Method: Proposes a universal Khmer text recognition (UKTR) framework with modality-aware adaptive feature selection (MAFS) that adapts visual features according to input image modality to enhance recognition robustness across different text modalities.
Result: The model achieves state-of-the-art performance and the authors introduce the first comprehensive benchmark for universal Khmer text recognition, releasing datasets and models to the community.
Conclusion: The proposed universal framework effectively addresses modality-specific challenges in Khmer text recognition through adaptive feature selection, enabling robust performance across diverse text modalities while facilitating future research with released benchmarks.
Abstract: Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While document printed text recognition has advanced because of available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training modality-specific models for each modality does not allow cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models results in significant memory overhead and requires error-prone routing each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across different modalities often leads to degraded performance on underrepresented modalities. To address these, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features according to a particular input image modality and enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models can be accessible via this gated repository\footnote{in review}.
[422] Towards Khmer Scene Document Layout Detection
Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing, Masakazu Iwamura, Koichi Kise
Main category: cs.CV
TL;DR: First comprehensive study on Khmer scene document layout detection, addressing data scarcity with a novel framework including dataset, augmentation tool, and YOLO-based detection models.
Details
Motivation: Document layout analysis for Khmer language lags behind Latin scripts due to scarcity of annotated training data, especially challenging for scene documents with perspective distortions and complex backgrounds. Khmer script's structural complexities (diacritics, multi-layer character stacking) cause existing Latin-based models to fail in accurately delineating semantic layout units.Method: Three key elements: (1) robust training/benchmarking dataset for Khmer scene layouts, (2) open-source document augmentation tool for synthesizing realistic scene documents to scale training data, (3) layout detection baselines using YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions.
Result: The paper presents the first comprehensive framework for Khmer scene document layout detection, releasing models, code, and datasets to foster research in Khmer document analysis and recognition community.
Conclusion: Addresses critical gap in Khmer document layout analysis through a complete framework combining dataset creation, data augmentation, and detection models, enabling future research in this underserved language domain.
Abstract: While document layout analysis for Latin scripts has advanced significantly, driven by the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained because of the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Given the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly for dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in this gated repository (in review).
[423] A Reconstruction System for Industrial Pipeline Inner Walls Using Panoramic Image Stitching with Endoscopic Imaging
Rui Ma, Yifeng Wang, Ziteng Yang, Xinghui Li
Main category: cs.CV
TL;DR: A pipeline inner wall reconstruction system using panoramic image stitching from industrial endoscope videos, with custom GUI for efficient defect detection.
Details
Motivation: Visual analysis and reconstruction of pipeline inner walls is challenging in industrial inspection scenarios, requiring more efficient methods than traditional frame-by-frame video review.Method: System extracts key frames from endoscope video, applies polar coordinate transformation to unwrap annular frames, and uses image stitching techniques to create planar panoramic images of pipeline inner walls.
Result: The method enables efficient processing of industrial endoscope videos, generating panoramic stitched images that preserve all detailed features of pipeline inner walls for defect detection.
Conclusion: The approach significantly improves pipeline inner wall reconstruction efficiency compared to traditional methods and has considerable engineering application value for industrial inspection.
Abstract: Visual analysis and reconstruction of pipeline inner walls remain challenging in industrial inspection scenarios. This paper presents a dedicated reconstruction system for pipeline inner walls via industrial endoscopes, which is built on panoramic image stitching technology. Equipped with a custom graphical user interface (GUI), the system extracts key frames from endoscope video footage, and integrates polar coordinate transformation with image stitching techniques to unwrap annular video frames of pipeline inner walls into planar panoramic images. Experimental results demonstrate that the proposed method enables efficient processing of industrial endoscope videos, and the generated panoramic stitched images preserve all detailed features of pipeline inner walls in their entirety. This provides intuitive and accurate visual support for defect detection and condition assessment of pipeline inner walls. In comparison with the traditional frame-by-frame video review method, the proposed approach significantly elevates the efficiency of pipeline inner wall reconstruction and exhibits considerable engineering application value.
[424] Diversity over Uniformity: Rethinking Representation in Generated Image Detection
Qinghui He, Haifeng Zhang, Qiao Qin, Bo Liu, Xiuli Bi, Bin Xiao
Main category: cs.CV
TL;DR: Anti-feature-collapse learning framework for generated image detection that prevents discriminative information from collapsing into few dominant feature directions, enhancing generalization to unseen generative models.
Details
Motivation: Existing generated image detection methods rely on limited salient forgery cues, limiting generalization to unseen generative mechanisms. Need for multiple judgment perspectives rather than single decision path.Method: Proposes anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in representation space to prevent discriminative information collapse.
Result: Significantly outperforms state-of-the-art in cross-model scenarios with 5.02% accuracy improvement, demonstrating superior generalization and detection reliability on multiple public benchmarks.
Conclusion: The framework maintains diverse complementary evidence, reduces reliance on limited salient cues, and enhances robustness under unseen generative settings for reliable generated image detection.
Abstract: With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliably generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at https://github.com/Yanmou-Hui/DoU.
[425] BornoViT: A Novel Efficient Vision Transformer for Bengali Handwritten Basic Characters Classification
Rafi Hassan Chowdhury, Naimul Haque, Kaniz Fatiha
Main category: cs.CV
TL;DR: BornoViT: A lightweight Vision Transformer model for Bengali handwritten character classification that achieves high accuracy with minimal computational resources (0.65M parameters, 0.62MB size).
Details
Motivation: Bengali handwritten character classification is challenging due to character complexity and variability. Existing models are computationally expensive and data-hungry, making them unsuitable for resource-limited languages like Bengali.Method: Proposes BornoViT, a novel lightweight Vision Transformer that uses a simplified deep convolutional neural network (DCNN) architecture to reduce computational burden while maintaining effectiveness for Bengali character classification.
Result: Achieved 95.77% accuracy on BanglaLekha Isolated dataset and 91.51% on self-collected Bornomala dataset (222 samples). Model has only 0.65M parameters, 0.62MB size, and 0.16 GFLOPs.
Conclusion: BornoViT is significantly lighter than state-of-the-art models while maintaining competitive accuracy, making it suitable for resource-limited environments essential for Bengali handwritten character classification.
Abstract: Handwritten character classification in the Bengali script is a significant challenge due to the complexity and variability of the characters. The models commonly used for classification are often computationally expensive and data-hungry, making them unsuitable for resource-limited languages such as Bengali. In this experiment, we propose a novel, efficient, and lightweight Vision Transformer model that effectively classifies Bengali handwritten basic characters and digits, addressing several shortcomings of traditional methods. The proposed solution utilizes a deep convolutional neural network (DCNN) in a more simplified manner compared to traditional DCNN architectures, with the aim of reducing computational burden. With only 0.65 million parameters, a model size of 0.62 MB, and 0.16 GFLOPs, our model, BornoViT, is significantly lighter than current state-of-the-art models, making it more suitable for resource-limited environments, which is essential for Bengali handwritten character classification. BornoViT was evaluated on the BanglaLekha Isolated dataset, achieving an accuracy of 95.77%, and demonstrating superior efficiency compared to existing state-of-the-art approaches. Furthermore, the model was evaluated on our self-collected dataset, Bornomala, consisting of approximately 222 samples from different age groups, where it achieved an accuracy of 91.51%.
[426] Stroke outcome and evolution prediction from CT brain using a spatiotemporal diffusion autoencoder
Adam Marcus, Paul Bentley, Daniel Rueckert
Main category: cs.CV
TL;DR: Diffusion models generate self-supervised stroke representations from CT images, enhanced with longitudinal data and time information, achieving state-of-the-art outcome prediction with minimal labels.
Details
Motivation: Stroke outcome prediction is crucial for personalized care but challenging despite rich neuroimaging data. Current methods struggle with modeling brain tissue fate evolution accurately.Method: Apply diffusion probabilistic models to generate self-supervised stroke representations from CT images. Extend to incorporate longitudinal images and time-from-stroke-onset information for improved representations.
Result: Achieves best performance for predicting next-day severity and functional outcome at discharge on a dataset of 5,824 CT images from 3,573 patients across two medical centers with minimal labels.
Conclusion: Diffusion models with longitudinal and temporal extensions provide effective self-supervised stroke representations that outperform existing methods for clinical outcome prediction.
Abstract: Stroke is a major cause of death and disability worldwide. Accurate outcome and evolution prediction has the potential to revolutionize stroke care by individualizing clinical decision-making leading to better outcomes. However, despite a plethora of attempts and the rich data provided by neuroimaging, modelling the ultimate fate of brain tissue remains a challenging task. In this work, we apply recent ideas in the field of diffusion probabilistic models to generate a self-supervised semantically meaningful stroke representation from Computed Tomography (CT) images. We then improve this representation by extending the method to accommodate longitudinal images and the time from stroke onset. The effectiveness of our approach is evaluated on a dataset consisting of 5,824 CT images from 3,573 patients across two medical centers with minimal labels. Comparative experiments show that our method achieves the best performance for predicting next-day severity and functional outcome at discharge.
[427] Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models
Zhenyu Zhou, Defang Chen, Siwei Lyu, Chun Chen, Can Wang
Main category: cs.CV
TL;DR: Proposes TORS, a constant total rotation schedule for diffusion models that ensures uniform geometric variation along sampling trajectories to accelerate text-to-image generation while maintaining quality with fewer steps.
Details
Motivation: Text-to-image diffusion models struggle with high-quality results under limited sampling budgets. Existing training-free acceleration methods are developed independently without exploring overall performance and compatibility, leaving optimal scheduling strategies unexplored.Method: Systematically elucidates design space of sampling acceleration methods, identifies sampling time schedule as most pivotal factor. Inspired by geometric properties revealed through Frenet-Serret formulas, proposes constant total rotation schedule (TORS) that ensures uniform geometric variation along sampling trajectory.
Result: TORS outperforms previous training-free acceleration methods and produces high-quality images with only 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Shows adaptability to unseen models, hyperparameters, and downstream applications.
Conclusion: The sampling time schedule is crucial for diffusion model acceleration. TORS provides an effective geometric scheduling strategy that enables high-quality image generation with significantly fewer sampling steps while maintaining compatibility across different models and applications.
Abstract: Text-to-image diffusion models have achieved unprecedented success but still struggle to produce high-quality results under limited sampling budgets. Existing training-free sampling acceleration methods are typically developed independently, leaving the overall performance and compatibility among these methods unexplored. In this paper, we bridge this gap by systematically elucidating the design space, and our comprehensive experiments identify the sampling time schedule as the most pivotal factor. Inspired by the geometric properties of diffusion models revealed through the Frenet-Serret formulas, we propose constant total rotation schedule (TORS), a scheduling strategy that ensures uniform geometric variation along the sampling trajectory. TORS outperforms previous training-free acceleration methods and produces high-quality images with 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Extensive experiments underscore the adaptability of our method to unseen models, hyperparameters, and downstream applications.
[428] DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents
Zikang Xu, Ruinan Jin, Xiaoxiao Li
Main category: cs.CV
TL;DR: Systematic audit of demographic bias in tool-using medical agents for chest X-ray question answering, decomposing fairness into agent-specific sources beyond standalone models.
Details
Motivation: While tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, this added pipeline complexity creates new pathways for demographic bias beyond standalone models, necessitating systematic fairness auditing.Method: Introduces DUCK (Decomposing Unfairness in Chest X-ray agents) with a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Uses MedRAX framework with five driver backbones.
Result: Demographic gaps persist in end-to-end performance (equalized odds up to 20.79%, lowest fairness-utility tradeoff down to 28.65%). Intermediate behaviors, tool usage, transition patterns, and reasoning traces exhibit distinct subgroup disparities not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, subgroup utility gap reaches 50%).
Conclusion: Findings underscore the need for process-level fairness auditing and debiasing to ensure equitable deployment of clinical agentic systems, as agent-specific biases emerge beyond standalone model evaluation.
Abstract: Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present ours (Decomposing Unfairness in Chest X-ray agents), a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-used based agentic frameworks across five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized odds up to 20.79%, and the lowest fairness-utility tradeoff down to 28.65%, and (ii) intermediate behaviors, tool usage, transition patterns, and reasoning traces exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: https://anonymous.4open.science/r/DUCK-E5FE/README.md
[429] Neural Functional Alignment Space: Brain-Referenced Representation of Artificial Neural Networks
Ruiyu Yan, Hanqi Jiang, Yi Pan, Xiaobo Li, Tianming Liu, Xi Jiang, Lin Zhao
Main category: cs.CV
TL;DR: NFAS is a brain-referenced framework that models neural network representations as dynamical trajectories across depth, projecting them into biologically anchored coordinates to enable cross-modal model comparison.
Details
Motivation: To create a unified framework for comparing artificial neural networks across different modalities (vision, audio, language) on equal functional grounds, using brain-referenced representations rather than layer-wise features or task-specific activations.Method: Models layer-wise embeddings as depth-wise dynamical trajectories, applies Dynamic Mode Decomposition to extract stable modes, projects representations into biologically anchored coordinate system defined by distributed neural responses, and introduces Signal-to-Noise Consistency Index for cross-model consistency quantification.
Result: Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization including modality-specific clustering and cross-modal convergence in integrative cortical systems.
Conclusion: Representation dynamics provide a principled basis for functional alignment of artificial neural networks, enabling cross-modal comparison in brain-referenced space.
Abstract: We propose the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework for characterizing artificial neural networks on equal functional grounds. NFAS departs from conventional alignment approaches that rely on layer-wise features or task-specific activations by modeling the intrinsic dynamical evolution of stimulus representations across network depth. Specifically, we model layer-wise embeddings as a depth-wise dynamical trajectory and apply Dynamic Mode Decomposition (DMD) to extract the stable mode. This representation is then projected into a biologically anchored coordinate system defined by distributed neural responses. We also introduce the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level. Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization within this brain-referenced space, including modality-specific clustering and cross-modal convergence in integrative cortical systems. Our findings suggest that representation dynamics provide a principled basis for
[430] COMBAT: Conditional World Models for Behavioral Agent Training
Anmol Agarwal, Pranay Meshram, Sumer Singh, Saurav Suman, Andrew Lapp, Shahbuland Matiana, Louis Castricato, Spencer Frazier
Main category: cs.CV
TL;DR: COMBAT is a real-time, action-controlled world model for fighting games that uses diffusion transformers to simulate dynamic opponents that react to player actions without explicit supervision.
Details
Motivation: Current video generation world models can simulate 3D-consistent environments but struggle with modeling dynamic, reactive agents that intelligently interact with the world. The authors aim to bridge this gap by creating a model that can simulate responsive opponents in complex interactive environments like fighting games.Method: Uses a 1.2B parameter Diffusion Transformer conditioned on latent representations from a deep compression autoencoder. Employs state-of-the-art techniques including causal distillation and diffusion forcing for real-time inference. Trained solely on single-player inputs without explicit opponent policy supervision, learning from partially observed data.
Result: Successfully demonstrates that diffusion models can simulate dynamic opponents that react to player actions, with emergent sophisticated agent behavior. Achieves real-time inference and learns effectively from partially observed data to generate responsive behaviors for controllable Player 1.
Conclusion: COMBAT establishes a foundation for training interactive agents within diffusion-based world models, showing that diffusion models can learn emergent agent behavior without explicit supervision, opening new possibilities for interactive AI systems.
Abstract: Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent’s policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.
[431] MME: Mixture of Mesh Experts with Random Walk Transformer Gating
Amir Belder, Ayellet Tal
Main category: cs.CV
TL;DR: A Mixture of Experts framework for 3D mesh analysis that combines multiple mesh analysis methods through a novel gating mechanism and dynamic loss balancing to achieve state-of-the-art performance in classification, retrieval, and segmentation tasks.
Details
Motivation: Different mesh analysis methods excel on different object classes, creating an opportunity to combine their complementary strengths through an ensemble approach rather than relying on a single method.Method: Proposes a Mixture of Experts framework with: (1) novel gate architecture using random walks to capture regions experts attend to, (2) attention mechanism for focusing on informative areas, and (3) dynamic loss balancing scheme adjusting trade-off between diversity and similarity losses during training.
Result: Achieves state-of-the-art results in mesh classification, retrieval, and semantic segmentation tasks, demonstrating the effectiveness of combining multiple mesh analysis approaches.
Conclusion: The proposed MoE framework successfully harnesses complementary strengths of diverse mesh analysis methods through specialized gating and balanced training, advancing 3D mesh understanding capabilities.
Abstract: In recent years, various methods have been proposed for mesh analysis, each offering distinct advantages and often excelling on different object classes. We present a novel Mixture of Experts (MoE) framework designed to harness the complementary strengths of these diverse approaches. We propose a new gate architecture that encourages each expert to specialise in the classes it excels in. Our design is guided by two key ideas: (1) random walks over the mesh surface effectively capture the regions that individual experts attend to, and (2) an attention mechanism that enables the gate to focus on the areas most informative for each expert’s decision-making. To further enhance performance, we introduce a dynamic loss balancing scheme that adjusts a trade-off between diversity and similarity losses throughout the training, where diversity prompts expert specialization, and similarity enables knowledge sharing among the experts. Our framework achieves state-of-the-art results in mesh classification, retrieval, and semantic segmentation tasks. Our code is available at: https://github.com/amirbelder/MME-Mixture-of-Mesh-Experts.
[432] Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration and Enhancement
Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang, Yang Yang
Main category: cs.CV
TL;DR: UHDPromer is a neural discrimination-prompted Transformer for Ultra-High-Definition image restoration tasks (low-light enhancement, dehazing, deblurring) that leverages neural differences between high- and low-resolution features through novel attention and network mechanisms.
Details
Motivation: The paper is motivated by the observation that there exist implicit neural differences between high-resolution and low-resolution features, and exploring these differences can facilitate better low-resolution feature representation for UHD image restoration tasks.Method: Proposes Neural Discrimination Priors (NDP) to measure feature differences, Neural Discrimination-Prompted Attention (NDPA) to incorporate NDP into attention mechanisms, Neural Discrimination-Prompted Network (NDPN) with continuous gating guided by NDP, and a super-resolution-guided reconstruction approach.
Result: UHDPromer achieves state-of-the-art performance on three UHD image restoration tasks (low-light enhancement, dehazing, deblurring) while maintaining the best computational efficiency compared to other methods.
Conclusion: The proposed UHDPromer effectively leverages neural discrimination priors for UHD image restoration, demonstrating superior performance and efficiency across multiple enhancement tasks.
Abstract: We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement. Our UHDPromer is inspired by an interesting observation that there implicitly exist neural differences between high-resolution and low-resolution features, and exploring such differences can facilitate low-resolution feature representation. To this end, we first introduce Neural Discrimination Priors (NDP) to measure the differences and then integrate NDP into the proposed Neural Discrimination-Prompted Attention (NDPA) and Neural Discrimination-Prompted Network (NDPN). The proposed NDPA re-formulates the attention by incorporating NDP to globally perceive useful discrimination information, while the NDPN explores a continuous gating mechanism guided by NDP to selectively permit the passage of beneficial content. To enhance the quality of restored images, we propose a super-resolution-guided reconstruction approach, which is guided by super-resolving low-resolution features to facilitate final UHD image restoration. Experiments show that UHDPromer achieves the best computational efficiency while still maintaining state-of-the-art performance on $3$ UHD image restoration and enhancement tasks, including low-light image enhancement, image dehazing, and image deblurring. The source codes and pre-trained models will be made available at https://github.com/supersupercong/uhdpromer.
[433] Unified Vision-Language Modeling via Concept Space Alignment
Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
Main category: cs.CV
TL;DR: V-SONAR extends text-only SONAR embeddings to vision-language space via post-hoc alignment, enabling competitive multimodal understanding and generation across 1500+ languages.
Details
Motivation: To create a unified multimodal embedding space that extends the multilingual capabilities of SONAR (supporting 1500 text and 177 speech languages) to vision, enabling cross-modal understanding and generation across diverse languages.Method: Proposes post-hoc alignment pipeline to map vision encoder representations into SONAR space, then extends to V-LCM which encodes vision/language inputs into unified latent embeddings and trains with latent diffusion objective for next-embedding prediction.
Result: V-SONAR achieves competitive text-to-video retrieval and surpasses SOTA on video captioning (DREAM-1K: BLEU 23.9 vs 19.6; PE-VIDEO: BLEU 39.0 vs 30.0). V-LCM matches SOTA on English tasks while significantly outperforming across 61 out of 62 tested languages.
Conclusion: The approach successfully extends multilingual text capabilities to vision, demonstrating effective zero-shot multimodal understanding and strong performance across diverse languages, highlighting the potential of unified multimodal embedding spaces.
Abstract: We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into an unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM’s text-only pre-training. Experiments on a large-scale multilingual and -modal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.
[434] PPC-MT: Parallel Point Cloud Completion with Mamba-Transformer Hybrid Architecture
Jie Li, Shengwei Tian, Long Yu, Xin Ning
Main category: cs.CV
TL;DR: PPC-MT: A parallel point cloud completion framework using hybrid Mamba-Transformer architecture with PCA-guided parallel reconstruction for efficient and high-quality 3D shape completion.
Details
Motivation: Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. There's a need for methods that can achieve both detailed reconstruction and computational efficiency simultaneously.Method: Proposes PPC-MT, a parallel framework using hybrid Mamba-Transformer architecture. Introduces PCA-guided parallel completion strategy that imposes geometric structure on unordered point clouds, transforming them into ordered sets decomposed into multiple subsets. These subsets are reconstructed in parallel using multi-head reconstructor. Combines Mamba’s linear complexity for efficient encoding with Transformer’s fine-grained multi-sequence modeling for decoding.
Result: Extensive experiments on PCN, ShapeNet-55/34, and KITTI datasets show PPC-MT outperforms state-of-the-art methods across multiple metrics. Achieves better uniformity of point distribution and detail fidelity while preserving computational efficiency.
Conclusion: PPC-MT effectively balances efficiency and reconstruction accuracy through its parallel hybrid architecture and PCA-guided structured synthesis, offering a promising solution for point cloud completion tasks.
Abstract: Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. To address this, we propose PPC-MT, a novel parallel framework for point cloud completion leveraging a hybrid Mamba-Transformer architecture. Our approach introduces an innovative parallel completion strategy guided by Principal Component Analysis (PCA), which imposes a geometrically meaningful structure on unordered point clouds, transforming them into ordered sets and decomposing them into multiple subsets. These subsets are reconstructed in parallel using a multi-head reconstructor. This structured parallel synthesis paradigm significantly enhances the uniformity of point distribution and detail fidelity, while preserving computational efficiency. By integrating Mamba’s linear complexity for efficient feature extraction during encoding with the Transformer’s capability to model fine-grained multi-sequence relationships during decoding, PPC-MT effectively balances efficiency and reconstruction accuracy. Extensive quantitative and qualitative experiments on benchmark datasets, including PCN, ShapeNet-55/34, and KITTI, demonstrate that PPC-MT outperforms state-of-the-art methods across multiple metrics, validating the efficacy of our proposed framework.
[435] MMTA: Multi Membership Temporal Attention for Fine-Grained Stroke Rehabilitation Assessment
Halil Ismail Helvaci, Justin Huber, Jihye Bae, Sen-ching Samson Cheung
Main category: cs.CV
TL;DR: MMTA introduces multi-membership temporal attention for fine-grained rehabilitation action segmentation, enabling each frame to attend to multiple local temporal windows simultaneously for better boundary sensitivity in therapy videos and IMU data.
Details
Motivation: Existing temporal action segmentation models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable assessment of motor recovery in rehabilitation videos.Method: Multi-Membership Temporal Attention (MMTA) allows each frame to attend to multiple locally normalized temporal attention windows within the same layer, fusing concurrent temporal views via feature-space overlap resolution to preserve competing local contexts near transitions.
Result: MMTA consistently improves over Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab dataset, and further improving 50Salads by +3.3.
Conclusion: MMTA offers a practical solution for resource-constrained rehabilitation assessment by improving boundary sensitivity without additional depth or multi-stage refinement, supporting both video and wearable IMU inputs in a unified architecture.
Abstract: To empower the iterative assessments involved during a person’s rehabilitation, automated assessment of a person’s abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
[436] Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos
Yu Luo, Guangyu Wei, Yangfan Li, Jieyu He, Yueming Lyu
Main category: cs.CV
TL;DR: SMART is a semi-supervised vessel segmentation method for X-ray coronary angiography videos that uses SAM3-based teacher-student framework with motion-aware consistency and progressive confidence regularization to handle complex vessel dynamics and unreliable predictions.
Details
Motivation: Segmentation of coronary arteries from XCA sequences is crucial for diagnosing coronary artery diseases but faces challenges including blurred boundaries, inconsistent contrast, complex motion patterns, and limited annotated training data. Semi-supervised learning can help but conventional methods struggle with temporal dynamics and unreliable uncertainty quantification.Method: Proposes SMART: 1) Uses SAM3’s promptable concept segmentation design in a teacher-student framework, 2) Integrates vessel mask warping and motion consistency loss to model complex vessel dynamics, 3) Implements progressive confidence-aware consistency regularization to mitigate unreliable teacher predictions caused by blurred boundaries and minimal contrast.
Result: Extensive experiments on three XCA datasets from different institutions show SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it valuable for clinical applications with scarce labeled data.
Conclusion: SMART effectively addresses challenges in XCA vessel segmentation through its innovative SAM3-based teacher-student framework with motion-aware consistency and progressive confidence regularization, demonstrating strong performance with limited annotations for real-world clinical use.
Abstract: Segmentation of the main coronary artery from X-ray coronary angiography (XCA) sequences is crucial for the diagnosis of coronary artery diseases. However, this task is challenging due to issues such as blurred boundaries, inconsistent radiation contrast, complex motion patterns, and a lack of annotated images for training. Although Semi-Supervised Learning (SSL) can alleviate the annotation burden, conventional methods struggle with complicated temporal dynamics and unreliable uncertainty quantification. To address these challenges, we propose SAM3-based Teacher-student framework with Motion-Aware consistency and Progressive Confidence Regularization (SMART), a semi-supervised vessel segmentation approach for X-ray angiography videos. First, our method utilizes SAM3’s unique promptable concept segmentation design and innovates a SAM3-based teacher-student framework to maximize the performance potential of both the teacher and the student. Second, we enhance segmentation by integrating the vessel mask warping technique and motion consistency loss to model complex vessel dynamics. To address the issue of unreliable teacher predictions caused by blurred boundaries and minimal contrast, we further propose a progressive confidence-aware consistency regularization to mitigate the risk of unreliable outputs. Extensive experiments on three datasets of XCA sequences from different institutions demonstrate that SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it particularly valuable for real-world clinical applications where labeled data is scarce. Our code is available at: https://github.com/qimingfan10/SMART.
[437] VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
Longmi Gao, Pan Gao
Main category: cs.CV
TL;DR: VEMamba is an efficient framework for isotropic reconstruction of anisotropic Volume Electron Microscopy data using a novel 3D Dependency Reordering paradigm with Mamba-based modeling and degradation-aware training.
Details
Motivation: Volume Electron Microscopy (VEM) produces anisotropic 3D tissue imaging data with poor axial resolution, which hinders visualization and analysis. Existing methods fail to properly utilize axial information and use oversimplified downsampling for simulation.Method: Proposes VEMamba with: 1) 3D Dependency Reordering paradigm via Axial-Lateral Chunking Selective Scan Module (ALCSSM) that remaps 3D spatial dependencies into optimized 1D sequences for Mamba-based modeling, 2) Dynamic Weights Aggregation Module (DWAM) for adaptive output aggregation, and 3) realistic degradation simulation with Momentum Contrast (MoCo) for degradation-aware training.
Result: Extensive experiments on simulated and real-world anisotropic VEM datasets show VEMamba achieves highly competitive performance across various metrics while maintaining lower computational footprint compared to existing methods.
Conclusion: VEMamba provides an efficient solution for isotropic reconstruction of anisotropic VEM data through innovative 3D dependency modeling and degradation-aware training, demonstrating superior performance with reduced computational requirements.
Abstract: Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves highly competitive performance across various metrics while maintaining a lower computational footprint. The source code is available on GitHub: https://github.com/I2-Multimedia-Lab/VEMamba
[438] pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie
Main category: cs.CV
TL;DR: pySpatial is a visual programming framework that enables MLLMs to perform 3D spatial reasoning by generating Python code to interface with spatial tools like 3D reconstruction and camera-pose recovery, achieving state-of-the-art results on spatial understanding benchmarks.
Details
Motivation: Current MLLMs excel at general perception and reasoning but struggle with tasks requiring 3D spatial understanding. There's a need to bridge this gap without requiring expensive fine-tuning.Method: A visual programming framework where MLLMs generate Python code to interface with spatial tools (3D reconstruction, camera-pose recovery, novel-view rendering) that convert 2D images into explorable 3D scenes for explicit spatial reasoning.
Result: Outperforms GPT-4.1-mini by 12.94% on MindCube benchmark, achieves strong results on Omni3D-Bench, and enables successful real-world indoor navigation in complex environments.
Conclusion: pySpatial effectively enhances MLLMs’ 3D spatial reasoning capabilities through code-based tool interfacing without fine-tuning, demonstrating practical applications in robotics and spatial understanding tasks.
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
[439] ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Xiaolong Zeng, Yitong Yu, Shiyao Xiong, Jinhua Hao, Ming Sun, Chao Zhou, Bin Wang
Main category: cs.CV
TL;DR: ShiftLUT: A novel LUT-based image restoration framework with learnable spatial shifts for expanded receptive field, asymmetric dual-branch architecture for efficiency, and feature-level compression for reduced storage.
Details
Motivation: Current LUT-based methods for image restoration expand receptive fields at the cost of increased computational and storage overhead, hindering deployment on edge devices. There's a need for efficient methods that maintain large receptive fields while being lightweight.Method: Three complementary components: 1) Learnable Spatial Shift (LSS) module applies channel-wise spatial offsets to expand receptive field, 2) Asymmetric dual-branch architecture allocates more computation to information-dense branch for efficiency, 3) Error-bounded Adaptive Sampling (EAS) compresses feature-level LUTs to minimize storage.
Result: Achieves 3.8× larger receptive field than previous SOTA TinyLUT, improves average PSNR by over 0.21 dB across multiple benchmarks, while maintaining small storage size and inference time.
Conclusion: ShiftLUT demonstrates that efficient LUT-based image restoration with large receptive fields is achievable through complementary design of spatial shifting, asymmetric computation allocation, and feature compression.
Abstract: Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8$\times$ larger receptive field and improves an average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
[440] UD-SfPNet: An Underwater Descattering Shape-from-Polarization Network for 3D Normal Reconstruction
Puyun Wang, Kaimin Yu, Huayang He, Feng Huang, Xianyu Wu, Yating Chen
Main category: cs.CV
TL;DR: UD-SfPNet: A unified network for underwater descattering and 3D surface normal estimation using polarization imaging, achieving state-of-the-art reconstruction accuracy.
Details
Motivation: Underwater optical imaging suffers from scattering effects that degrade image quality and 3D reconstruction. Polarization imaging offers both descattering and shape-from-polarization capabilities, but existing methods process these tasks separately, leading to error accumulation.Method: Proposes UD-SfPNet, a unified network that jointly models polarization-based image descattering and shape-from-polarization normal estimation. Includes a color embedding module to enhance geometric consistency by relating color encodings to surface orientation, and a detail enhancement convolution module to preserve high-frequency geometric details lost under scattering.
Result: Achieves state-of-the-art performance on the MuS-Polar3D dataset with a mean surface normal angular error of 15.12°, the lowest among compared methods. Demonstrates significant improvement in reconstruction accuracy.
Conclusion: The unified approach combining descattering with polarization-based shape inference is effective for underwater 3D imaging. UD-SfPNet shows practical significance for optical 3D imaging in challenging underwater environments.
Abstract: Underwater optical imaging is severely hindered by scattering, but polarization imaging offers the unique dual advantages of descattering and shape-from-polarization (SfP) 3D reconstruction. To exploit these advantages, this paper proposes UD-SfPNet, an underwater descattering shape-from-polarization network that leverages polarization cues for improved 3D surface normal prediction. The framework jointly models polarization-based image descattering and SfP normal estimation in a unified pipeline, avoiding error accumulation from sequential processing and enabling global optimization across both tasks. UD-SfPNet further incorporates a novel color embedding module to enhance geometric consistency by exploiting the relationship between color encodings and surface orientation. A detail enhancement convolution module is also included to better preserve high-frequency geometric details that are lost under scattering. Experiments on the MuS-Polar3D dataset show that the proposed method significantly improves reconstruction accuracy, achieving a mean surface normal angular error of 15.12$^\circ$ (the lowest among compared methods). These results confirm the efficacy of combining descattering with polarization-based shape inference, and highlight the practical significance and potential applications of UD-SfPNet for optical 3D imaging in challenging underwater environments. The code is available at https://github.com/WangPuyun/UD-SfPNet.
[441] On the Exact Algorithmic Extraction of Finite Tesselations Through Prime Extraction of Minimal Representative Forms
Sushish Baral, Paulo Garcia, Warisa Sritriratanarak
Main category: cs.CV
TL;DR: A hierarchical algorithm for discovering exact rectangular tessellations in finite planar grids using composite discovery, normalization, and prime extraction techniques.
Details
Motivation: Symbolic analysis for deterministic extraction of periodic structures in discrete grids is underdeveloped compared to statistical approaches for noisy data. There's a need for exact pattern recognition in symbolic reasoning, algorithm synthesis, and structural optimization.Method: Hierarchical algorithm with composite discovery (dual inspection and breadth-first pruning) to identify rectangular regions with internal repetition, normalization to minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to handle irregular dimensions efficiently.
Result: Scalable performance on grid sizes from 2x2 to 32x32: simple repeating tiles process in under 1ms, while complex patterns requiring exhaustive search show exponential growth. Provides deterministic behavior for exact, axis-aligned, rectangular tessellations.
Conclusion: The algorithm addresses a critical gap in symbolic grid analysis techniques, applicable to puzzle solving reasoning tasks and identification of exact repeating structures in discrete symbolic domains.
Abstract: The identification of repeating patterns in discrete grids is rudimentary within symbolic reasoning, algorithm synthesis and structural optimization across diverse computational domains. Although statistical approaches targeting noisy data can approximately recognize patterns, symbolic analysis utilizing deterministic extraction of periodic structures is underdeveloped. This paper aims to fill this gap by employing a hierarchical algorithm that discovers exact tessellations in finite planar grids, addressing the problem where multiple independent patterns may coexist within a hierarchical structure. The proposed method utilizes composite discovery (dual inspection and breadth-first pruning) for identifying rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to account for irregular dimensions and to achieve efficient computation time. We evaluate scalability on grid sizes from 2x2 to 32x32, showing overlap detection on simple repeating tiles exhibits processing time under 1ms, while complex patterns which require exhaustive search and systematic exploration shows exponential growth. This algorithm provides deterministic behavior for exact, axis-aligned, rectangular tessellations, addressing a critical gap in symbolic grid analysis techniques, applicable to puzzle solving reasoning tasks and identification of exact repeating structures in discrete symbolic domains.
[442] VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
Main category: cs.CV
TL;DR: VGGT-Det: A sensor-geometry-free multi-view indoor 3D object detection framework that leverages visual geometry grounded transformers without requiring camera poses or depth information.
Details
Motivation: Current multi-view indoor 3D object detectors require costly sensor geometry (calibrated camera poses) for fusion, limiting real-world deployment. The paper targets a more practical sensor-geometry-free setting.Method: Integrates VGGT encoder into transformer-based pipeline with two novel components: Attention-Guided Query Generation (uses VGGT attention maps as semantic priors to initialize object queries) and Query-Driven Feature Aggregation (learnable See-Query interacts with object queries to dynamically aggregate multi-level geometric features across VGGT layers).
Result: Significantly surpasses best-performing SG-Free method by 4.4 mAP@0.25 on ScanNet and 8.6 mAP@0.25 on ARKitScenes. Ablation shows VGGT’s learned semantic and geometric priors can be effectively leveraged.
Conclusion: VGGT-Det demonstrates effective sensor-geometry-free 3D object detection by integrating VGGT’s visual geometry understanding into a transformer pipeline with novel attention-guided and query-driven components.
Abstract: Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to ‘see’ what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT’s internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
[443] Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
Simon Ging, Philipp Arnold, Sebastian Walter, Hani Alnahas, Hannah Bast, Elmar Kotter, Jiancheng Yang, Behzad Bozorgtabar, Thomas Brox
Main category: cs.CV
TL;DR: A 3D CT vision-language model trained on large-scale hospital data achieves SOTA retrieval and competitive classification, with novel intra-scan snippet localization for precise axial depth prediction.
Details
Motivation: Existing 3D CT vision-language models use limited public data with coarse global supervision, lacking precise localization capabilities that radiologists naturally use when referencing specific images within reports.Method: Train on 98k report-volume pairs from single hospital plus public data using SigLIP-style contrastive pretraining with prompt-based disease supervision. Automatically mine 262k snippet-slice pairs and introduce intra-scan snippet localization task to predict axial depth from text snippets.
Result: Achieves SOTA text-to-image retrieval (R@10 31.5 vs 22.2) and competitive disease classification (AUC 83.8 vs 83.8). Intra-scan localization reduces mean absolute error to 36.3mm at 12mm resolution vs 67.0mm baseline, with retrieval/classification unchanged.
Conclusion: The unified model effectively combines retrieval, classification, and intra-scan grounding capabilities, demonstrating that precise localization can be integrated without compromising other vision-language tasks in medical imaging.
Abstract: Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., ``series X, image Y’’), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization – predicting the axial depth referred to by a text snippet – reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
[444] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Seungwook Kim, Minsu Cho
Main category: cs.CV
TL;DR: ARC is an unsupervised post-training framework for text-to-image models that uses internal self-confidence signals from self-denoising probes instead of external rewards, improving compositional generation and text-image alignment.
Details
Motivation: Current post-training methods for text-to-image models rely on external reward supervision (human preferences, factuality metrics, aesthetics), which requires additional datasets, annotators, or reward models. The authors aim to develop a fully unsupervised approach that leverages the model's internal capabilities.Method: ARC uses self-denoising probes to evaluate how accurately the model recovers injected noise, converting this intrinsic self-confidence signal into scalar rewards. The framework reinforces high-confidence generations without external supervision, and can also be integrated with external rewards for complementary improvements.
Result: ARC delivers consistent gains in compositional generation, text rendering, and text-image alignment over baseline models. When integrated with external rewards, it shows complementary improvement with alleviated reward hacking issues.
Conclusion: ARC provides an effective unsupervised post-training approach for text-to-image models that leverages internal self-confidence signals, offering improvements in key generation metrics and compatibility with external reward systems.
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
[445] DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang, Zhenhua Xu, Yaqin Zhang, Jianqiang Wang
Main category: cs.CV
TL;DR: DriveCode introduces a novel numerical encoding method for LLM-based autonomous driving that represents numbers as dedicated embeddings rather than discrete text tokens, addressing limitations in numerical precision and reasoning for sensor measurements and control commands.
Details
Motivation: Current LLMs tokenize numbers as discrete text tokens, which limits precise numerical reasoning, fails to reflect positional significance of digits, and creates difficulties in achieving both decoding efficiency and numerical precision. These limitations affect processing sensor measurements and generating precise control commands in autonomous driving systems.Method: DriveCode uses a number projector to map numbers into the language model’s hidden space as dedicated embeddings rather than discrete text tokens. This enables seamless integration with visual and textual features in a unified multimodal sequence for autonomous driving applications.
Result: Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation compared to traditional token-based numerical encoding methods.
Conclusion: DriveCode effectively addresses numerical precision limitations in LLM-based autonomous driving systems by representing numbers as dedicated embeddings, confirming its effectiveness for trajectory prediction and control signal generation tasks.
Abstract: Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model’s hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
[446] Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications
Md. Adnanul Islam, Wasimul Karim, Md Mahbub Alam, Subhey Sadi Rahman, Md. Abdur Rahman, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Kheng Cher Yeo, Deepika Mathur, Sami Azam
Main category: cs.CV
TL;DR: MWP framework estimates waste weight by combining RGB images with physics-informed metadata (dimensions, camera distance/height) using Vision Transformer and metadata encoder with mutual attention fusion.
Details
Motivation: Accurate waste weight estimation is crucial for logistics/recycling operations, but image-based estimation is challenging due to varying densities of similar-looking objects and perspective effects from camera distance.Method: Proposes Multimodal Weight Predictor (MWP) combining RGB images with physics metadata using Vision Transformer for visual features, dedicated metadata encoder, and Stacked Mutual Attention Fusion to integrate visual and physical cues. Trained with Mean Squared Logarithmic Error for wide weight range stability.
Result: Achieves 88.06 kg MAE, 6.39% MAPE, and R² of 0.9548 on Waste-Weight-10K dataset. Strong accuracy for light objects (0-100 kg: 2.38 kg MAE, 3.1% MAPE) and reliable for heavy waste (1000-2000 kg: 11.1% MAPE). Includes explainable AI module with SHAP and LLM.
Conclusion: The multimodal approach effectively addresses perspective and density challenges in waste weight estimation, providing accurate predictions across wide weight ranges with interpretable explanations.
Abstract: Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities, and the visible size changes with camera distance. Addressing this problem, we propose Multimodal Weight Predictor (MWP) framework that estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion that allows visual and physical cues guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error. On the test set, the proposed method achieves 88.06 kg Mean Absolute Error (MAE), 6.39% Mean Absolute Percentage Error (MAPE), and an R2 coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range with 2.38 kg MAE and 3.1% MAPE, maintaining reliable performance for heavy waste in the 1000-2000 kg range with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module using Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
[447] Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Shreshth Saini, Bowen Chen, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Main category: cs.CV
TL;DR: HDR-Q: A multimodal LLM for HDR video quality assessment using HDR-aware vision encoder and RL finetuning with HDR-specific optimizations.
Details
Motivation: HDR videos are becoming common on social platforms, but existing VQA systems are designed for SDR and fail to capture HDR-specific distortions like near-black crushing, highlight clipping, banding, and exposure flicker.Method: 1) Created Beyond8Bits dataset with 44K HDR videos and 1.5M ratings; 2) Developed HDR-Q MLLM with HDR-aware vision encoder for HDR-sensitive embeddings; 3) Introduced HDR-Aware Policy Optimization (HAPO) RL framework with HDR-SDR contrastive KL loss and Gaussian weighted regression reward.
Result: HDR-Q achieves state-of-the-art performance on Beyond8Bits dataset and public HDR-VQA benchmarks, demonstrating superior HDR video quality assessment capabilities.
Conclusion: The paper presents a comprehensive solution for HDR video quality assessment, addressing the gap in existing SDR-focused VQA systems through dataset creation, novel architecture, and specialized training techniques.
Abstract: High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
[448] \textsc{Mobile-VTON}: High-Fidelity On-Device Virtual Try-On
Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Main category: cs.CV
TL;DR: Mobile-VTON enables high-quality, privacy-preserving virtual try-on on mobile devices using a modular TGT architecture with knowledge distillation and garment-conditioned generation.
Details
Motivation: Address privacy concerns and limitations of cloud-based VTON systems by enabling fully offline virtual try-on on commodity mobile devices.Method: Proposes a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture with Feature-Guided Adversarial (FGA) Distillation, trajectory-consistency loss for garment preservation, and lightweight cross-modal conditioning for garment alignment.
Result: Achieves high-fidelity generation matching or outperforming server-based baselines on VITON-HD and DressCode datasets at 1024×768 resolution while running entirely offline on mobile devices.
Conclusion: High-quality virtual try-on is feasible and practical on-device, offering a secure, privacy-preserving solution for real-world applications.
Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present \textsc{Mobile-VTON}, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. \textsc{Mobile-VTON} introduces a modular TeacherNet–GarmentNet–TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, \textsc{Mobile-VTON} achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at $1024{\times}768$ show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
[449] StegoNGP: 3D Cryptographic Steganography using Instant-NGP
Wenxiang Jiang, Yujun Lan, Shuo Zhao, Yuanshan Liu, Mingzhu Zhou, Jinxin Wang
Main category: cs.CV
TL;DR: StegoNGP: A parameter-free 3D cryptographic steganography method using Instant-NGP’s hash encoding as a key-controlled scene switcher to hide entire 3D scenes within neural field representations.
Details
Motivation: Existing methods for embedding hidden data in 3D scene reconstructions have limitations: they require external decoders, need architectural modifications, have limited capacity, and are easily detectable. There's a need for high-capacity, undetectable information hiding in neural fields.Method: Uses Instant-NGP’s hash encoding function as a key-controlled scene switcher. Associates a default key with a cover scene and a secret key with a hidden scene, training a single model to interweave both representations within the same network weights. Also introduces an enhanced Multi-Key scheme that assigns multiple independent keys across hash levels to expand key space and improve robustness against partial key disclosure attacks.
Result: StegoNGP can hide a complete high-quality 3D scene with strong imperceptibility and security. The resulting model is indistinguishable from a standard Instant-NGP in architecture and parameter count, providing a new paradigm for high-capacity, undetectable information hiding in neural fields.
Conclusion: StegoNGP offers a novel approach to 3D cryptographic steganography that overcomes limitations of existing methods by leveraging Instant-NGP’s hash encoding as a key-controlled scene switcher, enabling secure embedding of entire 3D scenes within neural field representations without architectural modifications.
Abstract: Recently, Instant Neural Graphics Primitives (Instant-NGP) has achieved significant success in rapid 3D scene reconstruction, but securely embedding high-capacity hidden data, such as an entire 3D scene, remains a challenge. Existing methods rely on external decoders, require architectural modifications, and suffer from limited capacity, which makes them easily detectable. We propose a novel parameter-free 3D Cryptographic Steganography using Instant-NGP (StegoNGP), which leverages the Instant-NGP hash encoding function as a key-controlled scene switcher. By associating a default key with a cover scene and a secret key with a hidden scene, our method trains a single model to interweave both representations within the same network weights. The resulting model is indistinguishable from a standard Instant-NGP in architecture and parameter count. We also introduce an enhanced Multi-Key scheme, which assigns multiple independent keys across hash levels, dramatically expanding the key space and providing high robustness against partial key disclosure attacks. Experimental results demonstrated that StegoNGP can hide a complete high-quality 3D scene with strong imperceptibility and security, providing a new paradigm for high-capacity, undetectable information hiding in neural fields. The code can be found at https://github.com/jiang-wenxiang/StegoNGP.
[450] Decoupling Motion and Geometry in 4D Gaussian Splatting
Yi Zhang, Yulei Kang, Jian-Fang Hu
Main category: cs.CV
TL;DR: VeGaS: A velocity-based 4D Gaussian Splatting framework that decouples motion and geometry using Galilean shearing matrices and geometric deformation networks for better dynamic scene reconstruction.
Details
Motivation: Current 4D Gaussian Splatting (4DGS) methods couple Gaussian motion and geometric attributes within a single covariance formulation, limiting expressiveness for complex motions and causing visual artifacts. There's a need to better separate motion and geometry for improved dynamic scene reconstruction.Method: Proposes VeGaS with two key innovations: 1) Galilean shearing matrix that explicitly incorporates time-varying velocity to model complex non-linear motions while isolating motion from geometry, and 2) Geometric Deformation Network that refines Gaussian shapes/orientations using spatio-temporal context and velocity cues.
Result: Extensive experiments on public datasets demonstrate state-of-the-art performance in dynamic scene reconstruction compared to existing methods.
Conclusion: VeGaS successfully addresses limitations of previous 4DGS approaches by decoupling motion and geometry, enabling more expressive modeling of complex dynamic scenes with reduced artifacts.
Abstract: High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.
[451] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
Jiangshan Wang, Kang Zhao, Jiayi Guo, Jiayu Wang, Hang Guo, Chenyang Zhu, Xiu Li, Xiangyu Yue
Main category: cs.CV
TL;DR: PreciseCache: A plug-and-play framework that accelerates video generation models by precisely detecting and skipping truly redundant computations at step and block levels without quality degradation.
Details
Motivation: Video generation models suffer from high computational costs and slow inference, limiting practical applications. Existing acceleration methods using feature caching often cause quality degradation due to inability to distinguish truly redundant features, leading to unintended skipping of important computations.Method: Proposes PreciseCache with two components: 1) LFCache for step-wise caching using Low-Frequency Difference (LFD) to detect redundant steps, and 2) BlockCache for block-wise caching to skip redundant computations within network blocks. Both work together to precisely identify and skip only truly redundant computations.
Result: Extensive experiments on various backbones show PreciseCache achieves average 2.6x speedup without noticeable quality loss, outperforming prior feature caching methods that suffer from quality degradation.
Conclusion: PreciseCache effectively accelerates video generation inference by precisely detecting and skipping truly redundant computations, enabling practical applications without sacrificing quality.
Abstract: High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose \textbf{PreciseCache}, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of 2.6x speedup without noticeable quality loss. Source code will be released.
[452] EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization
Zhaoxin Fan, Nanxiang Jiang, Daiheng Gao, Shiji Zhou, Wenjun Wu
Main category: cs.CV
TL;DR: EraseAnything++ is a unified framework for concept erasure in image and video diffusion models with flow-matching objectives, using constrained multi-objective optimization and implicit gradient surgery to balance concept removal with generative utility preservation.
Details
Motivation: Existing concept erasure methods fail to generalize to modern diffusion models like Stable Diffusion v3, Flux, and OpenSora that use flow-matching and transformer architectures for long-horizon video generation, creating a need for a unified framework that works across both image and video domains.Method: Formulates concept erasure as constrained multi-objective optimization, uses implicit gradient surgery for utility-preserving unlearning, integrates LoRA-based parameter tuning with attention-level regularization, and employs anchor-and-propagate mechanism for temporal consistency in videos.
Result: Substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency across both image and video benchmarks, establishing new state-of-the-art for concept erasure in next-generation diffusion models.
Conclusion: EraseAnything++ provides an effective unified solution for concept erasure in modern flow-matching diffusion models, successfully addressing the limitations of previous methods while maintaining generative quality and temporal consistency.
Abstract: Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
[453] Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation
Jiaqi Tang, Mengyan Zheng, Shu Zhang, Fandong Zhang, Qingchao Chen
Main category: cs.CV
TL;DR: Anatomy-Informed Synthetic Supervised Pre-training framework for medical segmentation that bridges semantic gap between formula-driven synthetic data and real anatomy using shape bank and structure-aware placement.
Details
Motivation: Vision Transformers need large annotated datasets for 3D medical segmentation, but privacy/logistical barriers limit real data access. Formula-Driven Supervised Learning offers privacy-preserving synthetic data but lacks anatomical realism and structural priors of real anatomy.Method: Replace basic mathematical primitives with lightweight shape bank from de-identified segmentation masks of 5 subjects. Introduce structure-aware sequential placement strategy using spatial anchors for localization and topological graph to manage inter-organ interactions during patch synthesis.
Result: Outperforms state-of-the-art FDSL baselines by 1.74% and SSL methods by up to 1.66% on BTCV and MSD datasets. Shows robust scaling effect where performance improves with increased synthetic data volume.
Conclusion: Provides data-efficient, privacy-compliant solution for medical segmentation by unifying infinite scalability of synthetic data with anatomical realism through anatomy-informed synthetic supervision.
Abstract: Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL’s infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank with de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74% and up to 1.66%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.
[454] Event-Anchored Frame Selection for Effective Long-Video Understanding
Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, Xiawu Zheng
Main category: cs.CV
TL;DR: EFS is a hierarchical event-aware frame selection method for long-video understanding that partitions videos into semantic events and selects query-relevant anchors, improving LVLM performance without training.
Details
Motivation: Massive frame redundancy and limited context windows in LVLMs make efficient frame selection crucial for long-video understanding. Current flat sampling approaches treat videos as unstructured frame collections, missing semantic event structure.Method: EFS uses self-supervised DINO embeddings to partition videos into visually homogeneous temporal segments (event proxies), selects query-relevant frames within each event as anchors, then applies adaptive Maximal Marginal Relevance for global refinement to optimize event coverage, query relevance, and visual diversity.
Result: When applied to LLaVA-Video-7B, EFS improves accuracy by 4.7% on VideoMME, 4.9% on LongVideoBench, and 8.8% on MLVU benchmarks.
Conclusion: EFS is an effective training-free, plug-and-play module for LVLMs that leverages event structure for better long-video understanding through hierarchical frame selection.
Abstract: Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
[455] The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers
Jiaqi Tang, Weixuan Xu, Shu Zhang, Fandong Zhang, Qingchao Chen
Main category: cs.CV
TL;DR: Physics-inspired framework for medical image synthesis that decouples shape learning from texture injection to address boundary aliasing in formula-driven supervised learning for Vision Transformers.
Details
Motivation: Vision Transformers need large datasets but medical data is scarce and private. Formula-Driven Supervised Learning (FDSL) creates synthetic data but existing methods use simple geometric shapes without realistic textures, creating a gap with real medical images like CT/MRI that have tissue textures and noise patterns.Method: Proposes a Physics-inspired Spatially-Decoupled Synthesis framework that orthogonalizes synthesis: 1) creates gradient-shielded buffer zones around boundaries to preserve shape learning, and 2) injects physics-driven spectral textures into object cores. This addresses boundary aliasing where high-frequency textures corrupt gradient signals needed for learning structural boundaries.
Result: Outperforms previous FDSL methods and SSL methods trained on real-world medical datasets by 1.43% on BTCV and up to 1.51% on MSD tasks, demonstrating superior performance in medical image segmentation.
Conclusion: The framework offers a scalable, annotation-free foundation for medical Vision Transformers by effectively reconciling robust shape representation learning with invariance to acquisition noise through spatially-decoupled synthesis.
Abstract: Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.
[456] Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality
Danfeng Hong, Chenyu Li, Xuyang Li, Gustau Camps-Valls, Jocelyn Chanussot
Main category: cs.CV
TL;DR: A comprehensive survey on foundation models in remote sensing, covering their evolution from unimodal to multimodal approaches, with practical guidance for researchers.
Details
Motivation: The exponential growth of remote sensing data requires advanced modeling capabilities, and foundation models offer revolutionary potential for managing and interpreting these vast datasets effectively.Method: Technical survey methodology that categorizes existing foundation models into unimodal and multimodal approaches, provides background/motivation analysis, and includes tutorial-like guidance for training and applying these models to real-world remote sensing tasks.
Result: A comprehensive resource that addresses three key questions: defining foundation models in remote sensing, explaining their necessity, and providing practical guidance for junior researchers to understand and apply these models effectively.
Conclusion: Foundation models represent a transformative opportunity for remote sensing, and this survey serves as a valuable entry point for researchers to understand their evolution from unimodal to multimodal approaches and apply them across various applications.
Abstract: Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
[457] MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation
Yi Zhang, Puxun Tu, Kun Wang, Yulin Yan, Tao Ying, Xiaojun Chen
Main category: cs.CV
TL;DR: MLRecon is a markerless 3D ultrasound reconstruction framework using a single RGB-D camera and vision foundation models for drift-resilient 6D probe pose tracking, with vision-guided failure recovery and dual-stage pose refinement.
Details
Motivation: Existing 3D ultrasound reconstruction methods face a trilemma: marker-based systems are expensive, inside-out methods require intrusive sensors, and sensorless approaches suffer from cumulative drift. There's a need for low-cost, accessible volumetric imaging for resource-limited clinical settings.Method: Uses a single commodity RGB-D camera with vision foundation models for continuous markerless probe tracking. Includes a vision-guided divergence detector for tracking integrity monitoring and failure recovery. Proposes a dual-stage pose refinement network that disentangles high-frequency jitter from low-frequency bias to denoise trajectories while maintaining kinematic fidelity.
Result: Outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories. Produces high-quality 3D reconstructions with sub-millimeter mean surface accuracy.
Conclusion: Establishes a new benchmark for low-cost, accessible volumetric ultrasound imaging in resource-limited clinical settings by overcoming the limitations of existing tracking paradigms.
Abstract: Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.
[458] Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer
Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang
Main category: cs.CV
TL;DR: FlexiMMT is an implicit image-to-video motion transfer framework that enables multi-object, multi-motion transfer by independently extracting motion representations and assigning them to different objects with flexible recombination.
Details
Motivation: Existing motion transfer methods focus on single-object scenarios and struggle with multiple objects requiring distinct motion patterns, creating a need for multi-object, multi-motion transfer capabilities.Method: Introduces Motion Decoupled Mask Attention Mechanism using object-specific masks to constrain attention, and Differentiated Mask Propagation Mechanism that derives masks from diffusion attention and propagates them across frames efficiently.
Result: Extensive experiments demonstrate precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
Conclusion: FlexiMMT successfully addresses cross-object motion entanglement and enables flexible multi-object motion transfer with accurate motion-to-object mappings.
Abstract: Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
[459] Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Xubo Zhu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Wen Yang, Huai Yu
Main category: cs.CV
TL;DR: Dr.Occ is a 3D semantic occupancy prediction framework for autonomous driving that uses depth-guided view transformation and region-guided expert transformers to address geometric misalignment and spatial class imbalance issues.
Details
Motivation: Existing 3D semantic occupancy prediction methods suffer from geometric misalignment in view transformation due to inaccurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. These limitations hinder comprehensive scene understanding for autonomous driving.Method: Proposes Dr.Occ with two key components: 1) Depth-guided 2D-to-3D View Transformer (D²-VFormer) that leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors for precise voxel feature alignment, and 2) Region-guided Expert Transformer (R/R²-EFormer) inspired by Mixture-of-Experts framework that adaptively allocates region-specific experts to handle spatial semantic variations.
Result: On the Occ3D-nuScenes benchmark, Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting, demonstrating significant performance gains in 3D semantic occupancy prediction.
Conclusion: Dr.Occ effectively addresses geometric misalignment through depth guidance and spatial class imbalance through region-guided experts, providing complementary contributions that enhance 3D semantic occupancy prediction for autonomous driving perception.
Abstract: 3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that \textbf{Dr.Occ} improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
[460] GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers
Main category: cs.CV
TL;DR: A novel view synthesis method using Data-to-Data Flow Matching with Probability Density Geodesic Flow Matching for consistent multi-view generation.
Details
Motivation: Current diffusion-based novel view synthesis methods suffer from inconsistency across viewpoints due to stochastic noise-to-data transitions that obscure deterministic structures. There's a need for more view-consistent synthesis through explicit data coupling and geometric coherence.Method: Proposes Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views. Introduces Probability Density Geodesic Flow Matching (PDG-FM) which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models, aligning with high-density regions of the data manifold.
Result: Empirically surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views.
Conclusion: Incorporating data-dependent geometric regularization into deterministic flow matching provides advantages for consistent novel view generation.
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
[461] RaUF: Learning the Spatial Uncertainty Field of Radar
Shengpeng Wang, Kuangyu Wang, Wei Wang
Main category: cs.CV
TL;DR: RaUF is a spatial uncertainty field learning framework for millimeter-wave radar that models anisotropic properties to resolve ambiguous feature-to-label mapping and improve reliability in adverse weather conditions.
Details
Motivation: Millimeter-wave radar has advantages in adverse weather but suffers from low spatial fidelity, azimuth ambiguity, and clutter-induced spurious returns. Existing methods focus on cross-modal supervision but overlook ambiguous feature-to-label mapping, leading to ill-posed geometric inference and challenges for downstream perception tasks.Method: Proposes RaUF framework with: 1) anisotropic probabilistic model to learn fine-grained uncertainty and resolve conflicting feature-to-label mapping, 2) Bidirectional Domain Attention mechanism that exploits mutual complementarity between spatial structure and Doppler consistency to suppress spurious/multipath reflections.
Result: Extensive experiments on public benchmarks and real-world datasets show RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Downstream case studies validate enhanced reliability and scalability under challenging real-world driving scenarios.
Conclusion: RaUF provides a robust framework for radar-based perception by modeling spatial uncertainty fields, addressing fundamental challenges in radar sensing through physically grounded anisotropic properties and bidirectional attention mechanisms.
Abstract: Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios.
[462] Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Junbo Ke, Yangyang Xu, You-Wei Wen, Chao Wang
Main category: cs.CV
TL;DR: CAFE proposes content-aware frequency encoding to overcome spectral bias in implicit neural representations by learning task-relevant frequencies through parallel linear layers and Hadamard products, extended with Chebyshev features in CAFE+.
Details
Motivation: Implicit Neural Representations (INRs) suffer from spectral bias that limits their ability to capture high-frequency details. Existing Fourier-based methods use fixed frequency bases, forcing MLPs to inefficiently compose required frequencies and constraining representational capacity.Method: Proposes Content-Aware Frequency Encoding (CAFE) which builds on Fourier features using multiple parallel linear layers combined via Hadamard product to explicitly synthesize broader frequency bases with learned weights for task-relevant frequency selection. Extends to CAFE+ by incorporating Chebyshev features as complementary components to Fourier bases for stronger, more stable frequency representation.
Result: Extensive experiments across multiple benchmarks validate effectiveness and efficiency, consistently achieving superior performance over existing methods.
Conclusion: CAFE and CAFE+ address spectral bias limitations in INRs by enabling explicit and efficient synthesis of broader frequency bases with learned task-relevant frequency selection, outperforming existing approaches.
Abstract: Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at https://github.com/JunboKe0619/CAFE.
[463] Vision-Language Feature Alignment for Road Anomaly Segmentation
Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue
Main category: cs.CV
TL;DR: VL-Anomaly: A vision-language framework for road anomaly segmentation that uses pre-trained VLMs to incorporate semantic priors, reducing false positives on normal backgrounds while improving detection of true out-of-distribution obstacles.
Details
Motivation: Existing road anomaly segmentation methods rely too heavily on pixel-level statistics, leading to high false-positive rates on semantically normal background regions (like sky/vegetation) and poor recall of true OOD instances, creating safety risks for autonomous systems.Method: Proposes VL-Anomaly with: 1) prompt learning-driven alignment module that adapts Mask2Former visual features to CLIP text embeddings of known categories to suppress spurious anomaly responses; 2) multi-source inference strategy integrating text-guided similarity, CLIP-based image-text similarity, and detector confidence for more reliable anomaly prediction.
Result: Achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and Fishyscapes, demonstrating improved anomaly segmentation with reduced false positives and better recall.
Conclusion: VL-Anomaly effectively incorporates semantic priors from pre-trained VLMs to address limitations of pixel-statistics-based approaches, providing more reliable road anomaly segmentation crucial for safe autonomous systems.
Abstract: Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true Out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Forme’s visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and Fishyscapes.Code is released on https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
[464] Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Main category: cs.CV
TL;DR: RepTRFD: A reparameterized tensor ring functional decomposition using implicit neural representations for continuous data modeling, with improved high-frequency modeling capacity through spectral analysis and structured factor design.
Details
Motivation: Tensor Ring decomposition is limited to discrete forms on fixed meshgrids, while real-world data often exists in continuous or non-meshgrid forms. Existing methods struggle to capture fine-scale details in continuous representations.Method: Proposes a TR functional decomposition with factors parameterized by Implicit Neural Representations (INRs). Through frequency-domain analysis, identifies spectral limitations and introduces a reparameterized version where each TR factor is a structured combination of learnable latent tensor and fixed basis to improve training dynamics and high-frequency modeling.
Result: Demonstrates superior performance on image inpainting, denoising, super-resolution, and point cloud recovery tasks compared to existing approaches. Theoretically proves improved training dynamics, Lipschitz continuity, and provides principled initialization scheme.
Conclusion: The reparameterized TR functional decomposition effectively addresses limitations of traditional TR methods for continuous data, enabling better high-frequency modeling and superior performance across various vision tasks.
Abstract: Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at https://github.com/YangyangXu2002/RepTRFD.
[465] SMR-Net:Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network
Kuanxu Hou
Main category: cs.CV
TL;DR: SMR-Net: An attention-based multi-scale object detection algorithm using custom sensor for precise snap detection in robot assembly, outperforming Faster R-CNN in IoU and mAP metrics.
Details
Motivation: Traditional visual methods for snap detection in robot assembly suffer from poor robustness and large localization errors in complex scenarios (transparent/low-contrast snaps), failing to meet high-precision assembly demands.Method: Proposes SMR-Net with attention-enhanced multi-scale feature fusion: attention-embedded feature extractor encodes raw sensor data, parallel multi-scale feature processing with standard/dilated convolutions, and adaptive reweighting network for dynamic feature weighting.
Result: On Type A and Type B snap datasets, SMR-Net outperforms Faster R-CNN with IoU improvements of 6.52% and 5.8%, and mAP increases of 2.8% and 1.5% respectively.
Conclusion: The method demonstrates superiority in complex snap detection and localization tasks for robot automated assembly, addressing limitations of traditional visual approaches.
Abstract: In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, an self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolution for dimension unification while preserving resolution; an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations integrating details and global semantics. Experimental results on Type A and Type B snap datasets show SMR-Net outperforms traditional Faster R-CNN significantly: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5% respectively. This fully demonstrates the method’s superiority in complex snap detection and localization tasks.
[466] From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
Haoyuan Zhang, Keyao Wang, Guosheng Zhang, Haixiao Yue, Zhiwen Tan, Siran Peng, Tianshuo Zhang, Xiao Tan, Kunbin Chen, Wei He, Jingdong Wang, Ajian Liu, Xiangyu Zhu, Zhen Lei
Main category: cs.CV
TL;DR: TAR-FAS is a Face Anti-Spoofing framework that enhances MLLMs with external visual tools for fine-grained spoof detection through a Chain-of-Thought with Visual Tools paradigm.
Details
Motivation: Existing MLLM-based FAS methods have limited generalizability because they mainly capture intuitive semantic cues while struggling with fine-grained visual patterns needed for robust spoof detection.Method: Proposes TAR-FAS framework with CoT-VT paradigm, tool-augmented data annotation pipeline creating ToolFAS-16K dataset, and Diverse-Tool Group Relative Policy Optimization for autonomous tool learning.
Result: Achieves state-of-the-art performance under challenging one-to-eleven cross-domain protocol while providing fine-grained visual investigation for trustworthy spoof detection.
Conclusion: Incorporating external visual tools into MLLMs significantly improves FAS performance by enabling deeper investigation of subtle spoof clues beyond intuitive semantic cues.
Abstract: Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
[467] MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang
Main category: cs.CV
TL;DR: MM-DeepResearch is a multimodal research agent that addresses challenges in search-intensive multimodal QA through hypergraph-based data generation, decomposed tool expert optimization, and offline search engine for cost-effective training.
Details
Motivation: Develop a multimodal research agent capable of explicit reasoning, planning, multi-tool invocation, and cross-modal information synthesis for deep research tasks, addressing challenges of data scarcity, ineffective search trajectories, and high training costs.Method: Three key designs: 1) Hyper-Search (hypergraph-based QA generation connecting visual/textual nodes), 2) DR-TTS (decomposes tasks by tool type, optimizes specialized experts, recomposes via tree search), 3) Offline search engine supporting multiple tools for cost-effective training.
Result: Developed MM-DeepResearch, a powerful multimodal deep research agent showing superiority across benchmarks, with code publicly available.
Conclusion: The proposed three-pronged approach successfully addresses key challenges in developing multimodal research agents, enabling effective search-intensive multimodal QA through innovative data generation, tool optimization, and cost-effective training methods.
Abstract: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling to generate search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types, and respectively optimize specialized search tool experts for each tool. It then recomposes tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results shows its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
[468] Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, Fuxi Wen
Main category: cs.CV
TL;DR: VLA models for autonomous driving plateau in RL due to sparse reward signals; ELF-VLA introduces explicit diagnostic feedback to identify failure modes and guide policy refinement.
Details
Motivation: VLA models for autonomous driving hit performance plateaus during RL optimization because exploration is constrained by previous SFT training. In long-tail scenarios, sparse zero-value reward signals indicate failure but don't identify root causes (planning, reasoning, or execution errors), limiting improvement.Method: ELF-VLA augments RL with structured diagnostic feedback that produces detailed, interpretable reports identifying specific failure modes. The VLA policy uses this explicit feedback to generate Feedback-Guided Refinement, injecting corrected high-reward samples back into RL training batches for targeted gradient updates.
Result: Extensive experiments show the method unlocks latent capabilities of VLA models, achieving state-of-the-art performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
Conclusion: Explicit learning from failures through structured diagnostic feedback enables VLA models to overcome RL optimization plateaus by providing targeted guidance for critical scenarios that unguided exploration cannot solve.
Abstract: Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause – whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
[469] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
Main category: cs.CV
TL;DR: LLaDA-o is an omni diffusion model for multimodal understanding and generation using a Mixture of Diffusion framework with length adaptation
Details
Motivation: To create a unified model that can handle both multimodal understanding and generation efficiently, addressing the computational redundancy in fixed-condition models and enabling flexible-length decodingMethod: Uses Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, coupled through a shared attention backbone. Introduces data-centric length adaptation for flexible-length decoding without architectural changes
Result: Achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, with 87.04 on DPG-Bench for text-to-image generation
Conclusion: LLaDA-o demonstrates the effectiveness of unified omni diffusion modeling for multimodal tasks, providing an efficient and flexible solution for both understanding and generation
Abstract: We present \textbf{LLaDA-o}, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
[470] Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
Yunguan Fu, Wenjia Bai, Wen Yan, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Main category: cs.CV
TL;DR: FlowReg: A flow-matching framework for unsupervised cardiac image registration that achieves strong performance in as few as two steps with warmup-reflow training and Initial Guess strategy.
Details
Motivation: Existing diffusion-based unsupervised image registration methods for cardiac cine MR have expensive multi-step inference that limits practical use, creating a need for more efficient approaches.Method: Proposes FlowReg, a flow-matching framework in displacement field space with warmup-reflow training (single-step network as teacher, student learns to refine from arbitrary states) and Initial Guess strategy (feeding back predictions as next starting point).
Result: Outperforms state-of-the-art on 5 out of 6 tasks (+0.6% mean Dice score, largest gain in left ventricle +1.09%), reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels.
Conclusion: FlowReg enables efficient and accurate cardiac image registration with minimal inference steps, addressing practical limitations of existing diffusion-based methods while maintaining strong performance.
Abstract: Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Anonymized code is available at https://github.com/mathpluscode/FlowReg.
[471] Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
Wangkai Li, Zhaoyang Li, Yuwen Pan, Rui Sun, Yujia Chen, Tianzhu Zhang
Main category: cs.CV
TL;DR: A3Point is an adaptive augmentation-aware latent learning framework for LiDAR point cloud semantic segmentation that addresses performance degradation in adverse weather by effectively utilizing diverse augmentations while mitigating semantic shifts.
Details
Motivation: Adverse weather conditions cause significant distribution shifts that degrade LiDAR point cloud semantic segmentation performance. Existing augmentation methods struggle to balance minor vs aggressive augmentations and fail to fully exploit augmentation potential due to semantic shift issues.Method: Proposes A3Point with two key components: 1) Semantic Confusion Prior (SCP) latent learning that captures model’s inherent semantic confusion information, and 2) Semantic Shift Region (SSR) localization that decouples semantic confusion from semantic shift, enabling adaptive optimization strategies for different disturbance levels.
Result: Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate effectiveness, achieving new state-of-the-art results.
Conclusion: A3Point effectively addresses the augmentation trade-off problem in adverse weather LiDAR segmentation by adaptively handling semantic shifts through latent learning and region localization techniques.
Abstract: Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts. Existing augmentation-based methods attempt to enhance robustness by simulating weather interference during training. However, they struggle to fully exploit the potential of augmentations due to the trade-off between minor and aggressive augmentations. To address this, we propose A3Point, an adaptive augmentation-aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations. A3Point consists of two key components: semantic confusion prior (SCP) latent learning, which captures the model’s inherent semantic confusion information, and semantic shift region (SSR) localization, which decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels. Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state-of-the-art results.
[472] Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
Xuan Lu, Kangle Li, Haohang Huang, Rui Meng, Wenjun Zeng, Xiaoyu Shen
Main category: cs.CV
TL;DR: MCMR is a new benchmark for evaluating multimodal retrieval systems on fine-grained, multi-condition queries that require joint satisfaction of interdependent visual and textual constraints across product domains.
Details
Motivation: Existing multimodal retrieval benchmarks focus on coarse-grained or single-condition alignment, but real-world queries often specify multiple interdependent constraints across modalities. There's a need for benchmarks that evaluate fine-grained, multi-condition cross-modal retrieval under natural language queries.Method: Created MCMR benchmark spanning five product domains (clothing, jewelry, shoes, furniture) with rich long-form metadata. Each query integrates complementary visual and textual attributes requiring joint satisfaction of all conditions. Benchmarked diverse MLLM-based multimodal retrievers and vision-language rerankers.
Result: Revealed: (1) distinct modality asymmetries across models; (2) visual cues dominate early-rank precision while textual metadata stabilizes long-tail ordering; (3) MLLM-based pointwise rerankers significantly improve fine-grained matching by explicitly verifying query-candidate consistency.
Conclusion: MCMR establishes a challenging diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding, addressing limitations of existing benchmarks.
Abstract: Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset is available at https://github.com/EIT-NLP/MCMR
[473] Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective
Arctanx An, Shizhao Sun, Danqing Huang, Mingxi Cheng, Yan Gao, Ji Li, Yu Qiao, Jiang Bian
Main category: cs.CV
TL;DR: VLMs struggle with graphic design aesthetic assessment; new benchmark AesEval-Bench with 4 dimensions, 12 indicators, 3 tasks; systematic evaluation shows performance gaps; fine-tuning dataset created with human-guided labeling improves VLM performance.
Details
Motivation: Aesthetic quality assessment of graphic design is crucial for visual communication but underexplored in vision language models. Existing work has limitations: narrow benchmarks, coarse evaluation protocols, lack of systematic VLM comparisons, and insufficient training data for improvement.Method: Introduces AesEval-Bench with 4 dimensions, 12 indicators, and 3 quantifiable tasks (aesthetic judgment, region selection, precise localization). Systematically evaluates proprietary, open-source, and reasoning-augmented VLMs. Creates training dataset using human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to link abstract indicators to concrete design regions.
Result: Reveals clear performance gaps between VLMs and nuanced demands of aesthetic assessment. Fine-tuned VLMs show improved performance on aesthetic quality assessment tasks. Establishes first systematic framework for aesthetic quality assessment in graphic design.
Conclusion: The work provides comprehensive benchmark and training framework for improving VLMs’ ability to assess graphic design aesthetics, addressing key limitations in current research and enabling better visual communication evaluation.
Abstract: Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions.Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: \href{https://github.com/arctanxarc/AesEval-Bench}{https://github.com/arctanxarc/AesEval-Bench}
[474] Differential privacy representation geometry for medical image analysis
Soroosh Tayebi Arasteh, Marziyeh Mohammadi, Sven Nebelung, Daniel Truhn
Main category: cs.CV
TL;DR: DP-RGMI framework analyzes how differential privacy affects medical imaging models by examining representation space transformations, decomposing performance loss into encoder geometry and task-head utilization.
Details
Motivation: Current evaluation of differential privacy in medical imaging focuses only on end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. The authors aim to understand how DP transforms representation spaces and causes performance degradation.Method: Introduces DP-RGMI framework that interprets DP as structured transformation of representation space. Decomposes performance degradation into encoder geometry (quantified by representation displacement from initialization and spectral effective dimension) and task-head utilization (measured as gap between linear-probe and end-to-end utility). Evaluated across 594,000+ images from four chest X-ray datasets with multiple pretrained initializations.
Result: DP consistently associated with utilization gap even when linear separability is preserved. Representation displacement and spectral dimension show non-monotonic, initialization- and dataset-dependent reshaping, indicating DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis shows end-to-end performance and utilization association is robust across datasets but varies by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation.
Conclusion: DP-RGMI provides reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection in medical imaging. Reveals DP’s complex effects on representation spaces beyond simple performance metrics.
Abstract: Differential privacy (DP)’s effect in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.
[475] HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
Jiashu Li, Xumeng Han, Zhaoyang Wei, Zipeng Wang, Kuiran Wang, Guorong Li, Zhenjun Han, Jianbin Jiao
Main category: cs.CV
TL;DR: HeroGS introduces hierarchical guidance (image, feature, parameter levels) to improve 3D Gaussian Splatting performance under sparse-view conditions by regularizing Gaussian distributions and enhancing structural fidelity.
Details
Motivation: 3D Gaussian Splatting (3DGS) achieves photorealistic rendering with real-time efficiency but heavily relies on dense camera coverage. Under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions with sparse coverage, blurred backgrounds, and distorted high-frequency areas.Method: HeroGS establishes hierarchical guidance across three levels: 1) Image level converts sparse supervision into pseudo-dense guidance for global regularization, 2) Feature level uses Feature-Adaptive Densification and Pruning (FADP) to refine high-frequency details and adaptively densify background Gaussians, 3) Parameter level employs Co-Pruned Geometry Consistency (CPG) to guide geometric consistency through parameter freezing and co-pruning.
Result: Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions, effectively addressing the limitations of irregular Gaussian distributions.
Conclusion: The hierarchical guidance strategy effectively constrains and optimizes overall Gaussian distributions, enhancing both structural fidelity and rendering quality for 3DGS in sparse-view scenarios.
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS, Hierarchical Guidance for Robust 3D Gaussian Splatting, a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
[476] Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting
Dantong Qin, Alessandro Bozzon, Xian Yang, Xun Zhang, Yike Guo, Pan Wang
Main category: cs.CV
TL;DR: StrokeDiff: A diffusion-based framework for learning human-like brushstroke generation from limited hand-drawn samples using Smooth Regularization for stable training and Bézier-based conditioning for controllability.
Details
Motivation: Visual primitives like strokes or textures are difficult to collect at scale and differ from natural image data, making it challenging for generative models to learn expressive and controllable primitives for process-aware content creation.Method: Proposes StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR) that injects stochastic visual priors during training to stabilize diffusion models under sparse supervision. Includes Bézier-based conditioning module for controllability and integrates into a complete stroke-based painting pipeline.
Result: Produces diverse and structurally coherent brushstrokes, enables paintings with richer texture and layering. Validated by both automatic metrics and human evaluation using only 470 hand-drawn samples.
Conclusion: Demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation through stable diffusion training with limited data and controllable primitive generation.
Abstract: Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
[477] GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation
Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak, Tavaheed Tariq, Sonia Yadav, Moloud Abdar, Janibul Bashir
Main category: cs.CV
TL;DR: GroundedSurg: First language-conditioned, instance-level surgical grounding benchmark for evaluating vision-language models in realistic surgical scenes with natural language descriptions targeting specific instrument instances.
Details
Motivation: Existing surgical tool benchmarks focus on category-level segmentation, but real-world clinical decisions require resolving references to specific instrument instances based on functional roles, spatial relations, or anatomical interactions not captured by current evaluation paradigms.Method: Introduces GroundedSurg benchmark with surgical images paired with natural-language descriptions targeting single instruments, accompanied by structured spatial grounding annotations (bounding boxes, point-level anchors). Dataset spans ophthalmic, laparoscopic, robotic, and open procedures with diverse instrument types and imaging conditions.
Result: Extensive experiments show substantial performance gaps across modern segmentation and vision-language models, highlighting the need for clinically grounded vision-language reasoning in surgical AI systems.
Conclusion: GroundedSurg enables systematic and realistic evaluation of vision-language models in clinically realistic multi-instrument surgical scenes, addressing limitations of current benchmarks and advancing intelligent, context-aware intraoperative assistance.
Abstract: Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic and realistic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg
[478] DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
Yiming Ma, Hongkun Yang, Lionel Z. Wang, Bin Chen, Weizhi Xian, Jianzhi Teng
Main category: cs.CV
TL;DR: DeAR challenges the layer-centric view of VLM adaptation and proposes attention head role decomposition for fine-grained adaptation, achieving better balance between task-specific performance and generalization preservation.
Details
Motivation: Existing prompt learning methods for Vision-Language Models rely on simplistic layer-centric assumptions that cause uncontrolled interactions between learnable and original tokens, degrading core generalization and creating a trade-off between task adaptation and zero-shot generalization preservation.Method: Proposes DeAR framework that decomposes attention head roles using Concept Entropy metric to classify heads into Attribute, Generalization, and Mixed roles. Uses specialized attribute tokens and Role-Based Attention Mask mechanism to control information flow, plus Task-Adaptive Fusion Strategy for inference.
Result: Extensive experiments on fifteen datasets show DeAR achieves strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
Conclusion: Functional specialization in VLMs occurs at the finer-grained level of individual attention heads rather than between layers, and controlling information flow based on head roles enables better adaptation while preserving generalization capabilities.
Abstract: Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge could degrades the model’s core generalization and creates a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose \textbf{DeAR}, a framework that achieves fine-grained VLM adaptation by \textbf{De}composing \textbf{A}ttention head \textbf{R}oles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: \textit{Attribute}, \textit{Generalization}, and \textit{Mixed}. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
[479] GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation
Zhuonan Liang, Wei Guo, Jie Gan, Yaxuan Song, Runnan Chen, Hang Chang, Weidong Cai
Main category: cs.CV
TL;DR: GuiDINO is a framework that uses DINOv3 foundation model as a visual guidance generator for medical image segmentation, creating guide masks that gate feature activations in segmentation backbones without full fine-tuning.
Details
Motivation: Foundation vision models misalign with medical image segmentation needs due to domain shift, but full fine-tuning is computationally expensive. There's a need for methods that leverage foundation model priors while preserving the efficiency and inductive biases of medical-specific architectures.Method: Extracts visual features from DINOv3 and converts them into spatial guide masks via a lightweight TokenBook mechanism that aggregates token-prototype similarities. These guide masks gate feature activations in multiple segmentation backbones. Uses guide supervision loss to align masks with ground truth, optionally with boundary-focused hinge loss. Supports parameter-efficient adaptation via LoRA on DINOv3.
Result: Consistently improves segmentation quality and boundary robustness across diverse medical datasets with nnUNet-style inference, offering better performance than fine-tuning alternatives.
Conclusion: GuiDINO provides a practical alternative to fine-tuning foundation models for medical vision, demonstrating how foundation models can serve as guidance generators rather than direct feature extractors, preserving domain-specific architectural benefits.
Abstract: Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions native foundation model to acting as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representation from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of medical dedicated architectures. Training relies on a guide supervision objective loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at https://github.com/Hi-FishU/GuiDINO
[480] Improved MambdaBDA Framework for Robust Building Damage Assessment Across Disaster Domains
Alp Eren Gençoğlu, Hazım Kemal Ekenel
Main category: cs.CV
TL;DR: Enhanced MambaBDA for building damage assessment with focal loss, attention gates, and alignment module improves performance and generalization across disaster types
Details
Motivation: Building damage assessment from satellite imagery faces challenges of severe class imbalance, background clutter, and domain shift across different disaster types and geographies, limiting model reliability and generalizationMethod: Enhances MambaBDA (ChangeMamba architecture) with three modular components: (1) Focal Loss to address class imbalance, (2) lightweight Attention Gates to suppress irrelevant context, and (3) compact Alignment Module to spatially warp pre-event features toward post-event content before decoding
Result: Consistent improvements over baseline with 0.8% to 5% performance gains in-domain, and up to 27% improvement on unseen disasters across multiple datasets (xBD, Pakistan Flooding, Turkey Earthquake, Ida Hurricane)
Conclusion: The proposed modular enhancements significantly improve building damage assessment performance, with particularly strong benefits for generalization capability to unseen disaster scenarios
Abstract: Reliable post-disaster building damage assessment (BDA) from satellite imagery is hindered by severe class imbalance, background clutter, and domain shift across disaster types and geographies. In this work, we address these problems and explore ways to improve the MambaBDA, the BDA network of ChangeMamba architecture, one of the most successful BDA models. The approach enhances the MambaBDA with three modular components: (i) Focal Loss to mitigate class imbalance damage classification, (ii) lightweight Attention Gates to suppress irrelevant context, and (iii) a compact Alignment Module to spatially warp pre-event features toward post-event content before decoding. We experiment on multiple satellite imagery datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane, and conduct in-domain and crossdataset tests. The proposed modular enhancements yield consistent improvements over the baseline model, with 0.8% to 5% performance gains in-domain, and up to 27% on unseen disasters. This indicates that the proposed enhancements are especially beneficial for the generalization capability of the system.
[481] ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
Xiwei Liu, Yulong Li, Xinlin Zhuang, Xuhui Li, Jianxu Chen, Haolin Yang, Imran Razzak, Yutong Xie
Main category: cs.CV
TL;DR: ClinCoT introduces a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning for medical vision-language models, improving factual grounding through region-level reasoning.
Details
Motivation: Medical Vision-Language Models show promise but suffer from factual hallucinations due to insufficient grounding in localized pathological evidence. Existing alignment methods operate at response level, leaving intermediate reasoning weakly connected to visual regions, while chain-of-thought remains text-centric, limiting integration of clinical visual cues.Method: Proposes ClinCoT framework with automatic data generation pipeline constructing clinically grounded preference pairs through reasoning with hypotheses-driven region proposals. Uses multiple Med-LLMs evaluators to rank and score responses, with rankings as supervision. Introduces scoring-based margin-aware optimization incorporating preference ranking and score difference to refine region-level reasoning trajectories, with iterative learning scheme that dynamically regenerates preference data.
Result: Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
Conclusion: ClinCoT effectively addresses the gap in medical vision-language models by transforming preference optimization from response-level correction to visual-driven reasoning, enhancing factual grounding through region-level reasoning integration.
Abstract: Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypotheses-driven region proposals. Multiple Med-LLMs evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model’s policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
[482] Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao, Heng Yu, Xudong Jiang
Main category: cs.CV
TL;DR: PR-A²CL is a method for compositional visual reasoning that identifies outlier images using augmented anomaly contrastive learning and predictive verification blocks.
Details
Motivation: Compositional visual relations (CVR) are more complex than simple visual analogies and remain relatively unexplored. The paper aims to address the challenge of modeling abundant compositional rules in visual reasoning tasks.Method: Proposes Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A²CL) with two key components: 1) Augmented Anomaly Contrastive Learning to distill discriminative features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers, and 2) A predict-and-verify paradigm using Predictive Anomaly Reasoning Blocks (PARBs) that iteratively leverage features from three images to predict the fourth and progressively pinpoint discrepancies.
Result: Experimental results on SVRT, CVR, and MC²R datasets show that PR-A²CL significantly outperforms state-of-the-art reasoning models.
Conclusion: The proposed PR-A²CL framework effectively addresses compositional visual reasoning tasks through anomaly contrastive learning and predictive verification mechanisms, demonstrating superior performance over existing methods.
Abstract: While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A$^2$CL), \ie, to identify an outlier image given three other images that follow the same compositional rules. To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC$^2$R datasets show that PR-A$^2$CL significantly outperforms state-of-the-art reasoning models.
[483] TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning
Zhuo Chen, Shawn Young, Lijian Xu
Main category: cs.CV
TL;DR: TC-SSA: A learnable token compression framework for gigapixel Whole Slide Images that aggregates patch features into semantic slots using sparse routing, reducing tokens to 1.7% while maintaining diagnostic performance.
Details
Motivation: Large vision-language models face computational bottlenecks with gigapixel Whole Slide Images (WSIs) containing over 100,000 patches, exceeding standard Transformer constraints. Existing spatial sampling methods risk discarding diagnostically critical evidence.Method: Proposes TC-SSA (Token Compression via Semantic Slot Aggregation) with a gated routing module that assigns patches to semantic slots using sparse Top-2 routing, followed by weighted aggregation to achieve global slide coverage under strict token budgets.
Result: Achieves 78.34% overall accuracy on SlideBench(TCGA) and 77.14% on diagnosis subset, outperforming sampling-based baselines. Generalizes to MIL classification with AUCs of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC, and 79.80% on PANDA.
Conclusion: Learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning, retaining diagnostically relevant information while dramatically reducing computational requirements.
Abstract: The application of large vision-language models to computational pathology holds great promise for diagnostic assistants but faces a critical computational bottleneck: the gigapixel scale of Whole Slide Images (WSIs). A single WSI typically contains over 105 patches, creating sequence lengths that exceed the constraints of standard Transformer architectures. Existing solutions often resort to spatial sampling, which risks discarding diagnostically critical evidence. To address this, we propose TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that aggregates patch features into a fixed number of semantic slots. A gated routing module assigns patches to slots using sparse Top-2 routing, followed by weighted aggregation, enabling global slide coverage under a strict token budget. The resulting representation retains diagnostically relevant information while reducing the number of visual tokens to 1.7% of the original sequence. On the SlideBench(TCGA), our model achieves 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. The method also generalizes to MIL classification, reaching AUC of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC and 79.80% on PANDA. These results suggest that learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning.
[484] Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers
Kuai Jiang, Zhaoyan Ding, Guijuan Zhang, Dianjie Lu, Zhuoran Zheng
Main category: cs.CV
TL;DR: TCD-Net is a causal intervention-based image denoising method that disentangles content from noise using environmental bias adjustment, dual-branch separation, and reasoning-guided AI priors to improve robustness and preserve details.
Details
Motivation: Conventional denoising models learn spurious correlations between environmental factors and noise patterns, struggle to distinguish textures from noise due to high-frequency ambiguity, and degrade under distribution shifts due to entangling intrinsic content with extrinsic noise.Method: Teacher-Guided Causal Disentanglement Network (TCD-Net) with three components: 1) Environmental Bias Adjustment module for de-confounding, 2) dual-branch disentanglement head with orthogonality constraint for strict content-noise separation, 3) Nano Banana Pro (Google’s reasoning-guided AI image generation model) to provide causal priors and guide content representations back to natural-image manifold.
Result: Outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving 104.2 FPS real-time speed on a single RTX 5090 GPU.
Conclusion: Causal intervention approach effectively disentangles content from noise, improving robustness under distribution shifts while preserving subtle textures and eliminating residual artifacts.
Abstract: Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google’s reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
[485] GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection
Durgesh Ameta, Ujjwal Mishra, Praful Hambarde, Amit Shukla
Main category: cs.CV
TL;DR: GRAD-Former is a novel transformer-based framework for remote sensing change detection that improves contextual understanding while maintaining efficiency through reduced model size, outperforming state-of-the-art methods across multiple datasets.
Details
Motivation: Existing change detection methods struggle with precise delineation of change regions, especially transformer-based approaches that suffer from quadratic computational complexity with high-resolution satellite images and poor performance with limited training data, underutilizing rich spatial information.Method: Proposes GRAD-Former with Adaptive Feature Relevance and Refinement (AFRAR) module containing Selective Embedding Amplification (SEA) and Global-Local Feature Refinement (GLFR) components. SEA uses gating mechanisms and GLFR uses differential attention with multiple softmax heaps to capture important features while minimizing irrelevant ones.
Result: Outperforms current state-of-the-art models across all metrics on three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) while using fewer parameters, establishing a new benchmark for remote sensing change detection performance.
Conclusion: GRAD-Former effectively addresses computational complexity and feature utilization challenges in remote sensing change detection, providing superior performance with reduced model parameters.
Abstract: Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer-based methods suffer from quadratic computational complexity when applied to very high-resolution (VHR) satellite images and often perform poorly with limited training data, leading to under-utilization of the rich spatial information available in VHR imagery. We present GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with Adaptive Feature Relevance and Refinement (AFRAR) module, fusion and decoder blocks. AFRAR integrates global-local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global-Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively, which generates multiple softmax heaps to capture important features while minimizing the captured irreverent features. Multiple experiments across three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) demonstrate GRAD-Former’s superior performance compared to existing approaches. Notably, GRAD-Former outperforms the current state-of-the-art models across all the metrics and all the datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: https://github.com/Ujjwal238/GRAD-Former
[486] ArtLLM: Generating Articulated Assets via 3D LLM
Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu
Main category: cs.CV
TL;DR: ArtLLM: A 3D multimodal LLM framework for generating articulated 3D objects from complete meshes by predicting variable parts/joints and synthesizing part geometries.
Details
Motivation: Current methods for creating articulated 3D objects are limited - optimization-based approaches are slow and handle only simple joints, while retrieval-based methods have repetitive geometry and poor generalization. Need scalable solution for interactive environments in gaming, robotics, and simulation.Method: Train 3D multimodal large language model on large-scale articulation dataset (existing datasets + procedurally generated objects). Model autoregressively predicts variable number of parts and joints from object point clouds, then conditions 3D generative model to synthesize part geometries.
Result: Significantly outperforms state-of-the-art methods on PartNet-Mobility dataset in both part layout accuracy and joint prediction. Generalizes robustly to real-world objects and demonstrates utility in constructing digital twins for robot learning.
Conclusion: ArtLLM provides scalable framework for generating high-quality articulated assets, addressing limitations of existing methods and showing potential for applications in robotics and interactive environments.
Abstract: Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object’s point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
[487] TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh, Joonseok Lee
Main category: cs.CV
TL;DR: TripleSumm is a novel multimodal video summarization architecture that adaptively fuses visual, text, and audio modalities at frame level, achieving SOTA performance on new MoSu benchmark and existing datasets.
Details
Motivation: Current video summarization methods struggle with complex videos due to static or modality-agnostic fusion strategies that ignore dynamic, frame-dependent variations in modality saliency. There's also a lack of comprehensive multimodal benchmarks.Method: Proposes TripleSumm architecture with adaptive weighting and fusion of visual, text, and audio modalities at frame level. Also introduces MoSu, the first large-scale benchmark providing all three modalities for multimodal video summarization research.
Result: TripleSumm achieves state-of-the-art performance, outperforming existing methods by significant margin on four benchmarks including the new MoSu dataset.
Conclusion: The adaptive multimodal fusion approach effectively addresses limitations of static fusion strategies, and the MoSu benchmark addresses the critical bottleneck in multimodal video summarization research.
Abstract: The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
[488] ConVibNet: Needle Detection during Continuous Insertion via Frequency-Inspired Features
Jiamei Guo, Zhehao Duan, Maria Neiiendam, Dianye Huang, Nassir Navab, Zhongliang Jiang
Main category: cs.CV
TL;DR: ConVibNet: A real-time framework for continuous needle detection in ultrasound-guided interventions using temporal correlation modeling and novel loss function.
Details
Motivation: Ultrasound-guided needle interventions suffer from poor needle visibility in ultrasound images due to artifacts, occlusions, and low contrast, limiting real-time continuous insertion support.Method: ConVibNet extends VibNet to detect low-visibility needles by leveraging temporal dependencies across ultrasound frames. Uses novel intersection-and-difference loss for motion correlation awareness and curated dataset for development.
Result: Achieved tip error of 2.80±2.42 mm and angle error of 1.69±2.00 deg, with 0.75 mm improvement over best baseline while maintaining real-time inference.
Conclusion: ConVibNet advances real-time needle detection through temporal correlation modeling and novel loss, improving accuracy and robustness for potential integration into autonomous insertion systems.
Abstract: Purpose: Ultrasound-guided needle interventions are widely used in clinical practice, but their success critically depends on accurate needle placement, which is frequently hindered by the poor and intermittent visibility of needles in ultrasound images. Existing approaches remain limited by artifacts, occlusions, and low contrast, and often fail to support real-time continuous insertion. To overcome these challenges, this study introduces a robust real-time framework for continuous needle detection. Methods: We present ConVibNet, an extension of VibNet for detecting needles with significantly reduced visibility, addressing real-time, continuous needle tracking during insertion. ConVibNet leverages temporal dependencies across successive ultrasound frames to enable continuous estimation of both needle tip position and shaft angle in dynamic scenarios. To strengthen temporal awareness of needle-tip motion, we introduce a novel intersection-and-difference loss that explicitly leverages motion correlations across consecutive frames. In addition, we curated a dedicated dataset for model development and evaluation. Results: The performance of the proposed ConVibNet model was evaluated on our dataset, demonstrating superior accuracy compared to the baseline VibNet and UNet-LSTM models. Specifically, ConVibNet achieved a tip error of 2.80+-2.42 mm and an angle error of 1.69+-2.00 deg. These results represent a 0.75 mm improvement in tip localization accuracy over the best-performing baseline, while preserving real-time inference capability. Conclusion: ConVibNet advances real-time needle detection in ultrasound-guided interventions by integrating temporal correlation modeling with a novel intersection-and-difference loss, thereby improving accuracy and robustness and demonstrating high potential for integration into autonomous insertion systems.
[489] BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
Jiachen Yang, Xianhui Lin, Yi Dong, Zebiao Zheng, Xing Liu, Hong Gu, Yanmei Fang
Main category: cs.CV
TL;DR: BeautyGRPO is an RL framework for face retouching that aligns with human aesthetic preferences using a fine-grained preference dataset and dynamic path guidance to correct stochastic drift.
Details
Motivation: Existing face retouching methods face a trade-off: supervised learning fails to capture complex human aesthetic preferences, while online RL introduces noise artifacts due to stochastic exploration conflicting with high-fidelity demands.Method: Proposes BeautyGRPO with FRPref-10K dataset covering five retouching dimensions, trains a specialized reward model, and introduces Dynamic Path Guidance (DPG) to stabilize stochastic sampling by computing anchor-based ODE paths and replanning guided trajectories.
Result: Extensive experiments show BeautyGRPO outperforms specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and better alignment with human aesthetic preferences.
Conclusion: BeautyGRPO successfully reconciles the exploration-fidelity trade-off in RL for face retouching, demonstrating effective alignment with human aesthetic preferences through dynamic path guidance and fine-grained preference modeling.
Abstract: Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
[490] VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
Mingkang Dong, Hongyi Cai, Jie Li, Sifan Zhou, Bin Ren, Kunyu Peng, Yuqian Fu
Main category: cs.CV
TL;DR: VisNec framework selects vision-critical training samples for multimodal instruction tuning by measuring visual necessity, achieving better performance with less data.
Details
Motivation: Existing multimodal instruction datasets contain many visually redundant samples (solvable from text alone) and misaligned supervision, which degrade learning efficiency and effectiveness.Method: Proposes VisNec (Visual Necessity Score) that measures the marginal contribution of visual input by comparing predictive loss with and without visual context. Combines with semantic clustering to preserve task diversity while selecting high-necessity samples.
Result: Training on only 15% of LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On Vision-Flan-186K, selection reduces data size further and surpasses full-data training by 15.8%.
Conclusion: Measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning, demonstrating that data quality (visual necessity) is more important than quantity.
Abstract: The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.
[491] FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing
Maomao Li, Yunfei Liu, Yu Li
Main category: cs.CV
TL;DR: A zero-shot image-driven video editing framework called FREE-Edit that uses editing-aware attention injection to propagate edits from modified first frame to entire video while preserving source motion and layout.
Details
Motivation: Existing methods for image-driven video editing struggle with balancing attention injection - too much injection causes semantic conflicts with source video, while too little injection fails to preserve source representation. There's a need for adaptive injection that responds to editing areas.Method: Proposes Editing-awaRE (REE) injection method that modulates injection intensity per token based on editing masks. Uses optical flow to track editing areas throughout video, then generates editing-aware feature injection where injection is not conducted on editing areas. Built on rectified-Flow models for zero-shot editing.
Result: FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios without fine-tuning or training, producing higher-quality outputs compared to existing techniques.
Conclusion: The proposed REE injection method and FREE-Edit framework provide an effective zero-shot solution for image-driven video editing that better balances preservation of source motion/layout with edit propagation.
Abstract: Image-driven video editing aims to propagate edit contents from the modified first frame to the rest frames. The existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process using the edited first frame. Generally, a popular choice for maintaining motion and layout from the source video is intervening in the denoising process by injecting attention during reconstruction. However, such injection often leads to unsatisfactory results, where excessive injection leads to conflicting semantics from the source video while insufficient injection brings limited source representation. Recognizing this, we propose an Editing-awaRE (REE) injection method to modulate injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frame to form a corresponding editing mask. Next, we track the editing area throughout the entire video by using optical flow to warp the first-frame mask. Then, editing-aware feature injection intensity for each token is generated accordingly, where injection is not conducted on editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recent-emerging rectified-Flow models, dubbed FREE-Edit. Without fine-tuning or training, our FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios, showing its capability to produce higher-quality outputs compared with existing techniques. Project page: https://free-edit.github.io/page/.
[492] VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification
Abdellah Zakaria Sellam, Fadi Abdeladhim Zidi, Salah Eddine Bekhouche, Ihssen Houhou, Marouane Tliba, Cosimo Distante, Abdenour Hadid
Main category: cs.CV
TL;DR: VP-Hype: A hybrid Mamba-Transformer framework with visual-textual prompts for hyperspectral image classification under extreme label scarcity, achieving SOTA with only 2% training data.
Details
Motivation: Addresses the challenge of hyperspectral image classification where high-dimensional spectral data conflicts with extreme scarcity of labeled training samples, while overcoming quadratic complexity limitations of standard Transformers.Method: Unifies linear-time State-Space Models (SSMs) with Transformers in a hybrid architecture, using 3D-CNN spectral front-end, Hybrid Mamba-Transformer backbone, and dual-modal Visual and Textual Prompts for context-aware guidance.
Result: Achieves state-of-the-art performance with only 2% training data: 99.69% OA on Salinas dataset and 99.45% on Longkou dataset, demonstrating superior sample efficiency.
Conclusion: Hybrid sequence modeling combined with multi-modal prompting provides a robust path for high-performance, sample-efficient remote sensing classification.
Abstract: Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
[493] Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction
Ari Wahl, Dorian Gawlinski, David Przewozny, Paul Chojecki, Felix Bießmann, Sebastian Bosse
Main category: cs.CV
TL;DR: A Vision-Language Model is fine-tuned to predict 3D object positions from monocular RGB images, natural language queries, and robot states, achieving robust performance with median MAE of 13mm.
Details
Motivation: While pre-trained VLMs excel at 2D object detection and general visual understanding, they lack capabilities for 3D coordinate detection tasks needed for robotics applications. The paper aims to enhance VLMs with 3D position estimation while maintaining their general visual query processing abilities.Method: Collected and curated a heterogeneous dataset of 100,000+ images from wrist-mounted cameras. Fine-tuned a VLM using QLoRA with a custom regression head for 3D position prediction. Implemented conditional routing to maintain general visual query capabilities while adding specialized 3D estimation.
Result: Achieved robust predictive performance with median MAE of 13mm on test set, representing a five-fold improvement over baseline without fine-tuning. In about 25% of cases, predictions were within acceptable range for robot-object interaction.
Conclusion: The approach successfully extends VLMs to 3D position estimation tasks while preserving their general visual understanding capabilities, demonstrating potential for enhanced human-machine interactions in robotics applications.
Abstract: Pre-trained general-purpose Vision-Language Models (VLM) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinates detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
[494] RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
Mochu Xiang, Zhelun Shen, Xuesong Li, Jiahui Ren, Jing Zhang, Chen Zhao, Shanshan Liu, Haocheng Feng, Jingdong Wang, Yuchao Dai
Main category: cs.CV
TL;DR: RnG is a feed-forward Transformer that unifies 3D reconstruction and generation, predicting complete 3D representations from partial 2D observations using a reconstruction-guided causal attention mechanism.
Details
Motivation: Current feed-forward 3D reconstruction models only recover observed regions, leaving unseen geometry un-modeled. The paper addresses the fundamental challenge of inferring complete 3D structures from partial 2D observations.Method: Proposes RnG with reconstruction-guided causal attention that separates reconstruction and generation at the attention level. Treats KV-cache as implicit 3D representation, allowing arbitrary poses to query this cache for novel-view RGBD rendering.
Result: Achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications.
Conclusion: RnG successfully unifies reconstruction and generation, accurately reconstructing visible geometry while generating plausible, coherent unseen geometry and appearance from limited viewpoints.
Abstract: Human perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry un-modeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: https://npucvr.github.io/RnG
[495] CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling
Li Jin, Weikai Chen, Yujie Wang, Yingda Yin, Zeyu Hu, Runze Zhang, Keyang Luo, Shengju Qian, Xin Wang, Xueying Qin
Main category: cs.CV
TL;DR: Canonical3D introduces a canonical space perception approach for 3D semantic segmentation that learns latent canonical reference frames from data, enabling more stable and transferable part semantics by shifting from input pose space to canonical embeddings.
Details
Motivation: Current open-world promptable 3D semantic segmentation is brittle because semantics are inferred in input sensor coordinates, whereas humans interpret parts via functional roles in canonical spaces (e.g., wings extend laterally, handles protrude to the side). Humans mentally rotate objects into canonical frames to reveal these functional roles, which current methods lack.Method: Proposes Canonical3D with two key components: 1) Construction: Creates a unified canonical dataset through LLM-guided intra- and cross-category alignment across 200 categories, exposing canonical spatial regularities. 2) Induction: Implements a dual-branch architecture with canonical map anchoring and canonical box calibration to collapse pose variation and symmetry into stable canonical embeddings, shifting from input pose space to canonical embedding space.
Result: Establishes new state-of-the-art in open-world promptable 3D segmentation, demonstrating that the shift from input pose space to canonical embedding yields far more stable and transferable part semantics.
Conclusion: Learning latent canonical reference frames directly from data enables canonical space perception for 3D segmentation, overcoming brittleness of sensor-coordinate semantics and achieving more human-like interpretation of parts via functional roles in canonical spaces.
Abstract: Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space – wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose \methodName{}, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that \methodName{} establishes new state of the art in open-world promptable 3D segmentation.
[496] Towards Policy-Adaptive Image Guardrail: Benchmark and Method
Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou
Main category: cs.CV
TL;DR: SafeGuard-VL: A reinforcement learning method with verifiable rewards for adaptive harmful image detection that generalizes across evolving safety policies using vision-language models.
Details
Motivation: Traditional harmful image classifiers are rigid and require frequent retraining for new safety policies. Vision-language models offer adaptability but existing methods overfit to fixed policies and fail to generalize to unseen policies while losing basic capabilities.Method: Two contributions: 1) SafeEditBench evaluation suite using image-editing models to create policy-aligned safe/unsafe image pairs with human labels under 5 distinct policies; 2) SafeGuard-VL method using reinforcement learning with verifiable rewards (RLVR) instead of supervised fine-tuning to optimize models across evolving policies.
Result: Extensive experiments verify SafeGuard-VL’s effectiveness for unsafe image guardrails across various policies, addressing the cross-policy generalization problem that existing VLM-based methods fail to handle.
Conclusion: The proposed SafeGuard-VL with RLVR provides a robust solution for adaptive harmful image detection that can evolve with changing safety policies, overcoming limitations of traditional classifiers and overfitted VLM-based methods.
Abstract: Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
[497] AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong
Main category: cs.CV
TL;DR: Analysis of visual token pruning in LVLMs reveals diversity-based methods preserve less diversity than intended and correlate with hallucination, while attention-based methods work better on simple images; hybrid adaptive pruning improves performance.
Details
Motivation: LVLMs face computational overhead from extensive visual token sequences, but prior visual token pruning methods (attention-based or diversity-based) lack thorough analysis of their characteristics and limitations.Method: Conduct empirical analysis using effective rank (erank) to measure feature diversity and attention score entropy to investigate visual token processing mechanisms; analyze strengths/weaknesses of each approach; propose image-aware adjustments to hybrid pruning strategies and a simple adaptive pruning mechanism.
Result: Diversity-based pruning preserves less feature diversity than intended and correlates with increased hallucination; attention-based methods work better on simple images while diversity-based methods handle complex images better; adaptive pruning achieves strong performance across benchmarks and hallucination evaluations.
Conclusion: Empirical insights reveal limitations of current pruning approaches; image-aware hybrid pruning and adaptive mechanisms can improve performance and reliability in LVLMs.
Abstract: Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches’ characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page available at https://cvsp-lab.github.io/AgilePruner.
[498] The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction
Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir
Main category: cs.CV
TL;DR: The MAMA-MIA Challenge introduces a large-scale benchmark for breast cancer MRI analysis, focusing on tumor segmentation and treatment response prediction using pre-treatment imaging, with emphasis on cross-institutional generalization and subgroup fairness evaluation.
Details
Motivation: Address limitations of existing AI models for breast MRI that are often developed on single-center data with aggregate performance metrics, which limits generalizability and obscures performance disparities across demographic subgroups.Method: Created a large-scale benchmark challenge with training cohort of 1,506 patients from multiple US institutions and external test set of 574 patients from three independent European centers. Used unified scoring framework combining predictive performance with subgroup consistency across age, menopausal status, and breast density.
Result: Twenty-six international teams participated, revealing substantial performance variability under external testing and trade-offs between overall accuracy and subgroup fairness. Demonstrated challenges in cross-continental and cross-institutional generalization.
Conclusion: The challenge provides standardized datasets, evaluation protocols, and public resources to promote development of robust and equitable AI systems for breast cancer imaging, highlighting the importance of external validation and subgroup fairness assessment.
Abstract: Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
[499] Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography
Timofey Efimov, Singanallur Venkatakrishnan, Maliha Hossain, Haley Duba-Sullivan, Amirkoushyar Ziabari
Main category: cs.CV
TL;DR: Cross-modal guidance for diffusion-based CT reconstruction using complementary imaging modalities without retraining diffusion priors
Details
Motivation: Addressing the challenge of sparse data in expensive imaging modalities like neutron CT by leveraging complementary modalities without the need for retraining diffusion modelsMethod: Proposes incorporating additional imaging modalities (like X-ray CT) as guidance for diffusion-based reconstruction of sparse-view neutron CT, without retraining the diffusion prior
Result: Substantially improved reconstruction quality for sparse-view neutron CT by incorporating X-ray CT of the same samples, even with imperfect side modalities
Conclusion: Cross-modal guidance enables accelerated imaging of costly modalities by leveraging complementary imaging data without retraining diffusion priors
Abstract: Diffusion models have emerged as powerful priors for solving inverse problems in computed tomography (CT). In certain applications, such as neutron CT, it can be expensive to collect large amounts of measurements even for a single scan, leading to sparse data sets from which it is challenging to obtain high quality reconstructions even with diffusion models. One strategy to mitigate this challenge is to leverage a complementary, easily available imaging modality; however, such approaches typically require retraining the diffusion model with large datasets. In this work, we propose incorporating an additional modality without retraining the diffusion prior, enabling accelerated imaging of costly modalities. We further examine the impact of imperfect side modalities on cross-modal guidance. Our method is evaluated on sparse-view neutron computed tomography, where reconstruction quality is substantially improved by incorporating X-ray computed tomography of the same samples.
[500] Multi-Level Bidirectional Decoder Interaction for Uncertainty-Aware Breast Ultrasound Analysis
Abdullah Al Shafi, Md Kawsar Mahmud Khan Zunayed, Safin Ahmmed, Sk Imran Hossain, Engelbert Mephu Nguifo
Main category: cs.CV
TL;DR: Multi-task framework for breast ultrasound analysis with multi-level decoder interaction and uncertainty-aware adaptive coordination for simultaneous lesion segmentation and tissue classification.
Details
Motivation: Conventional multi-task learning for breast ultrasound suffers from task interference and rigid coordination strategies that don't adapt to instance-specific prediction difficulty.Method: Uses multi-level decoder interaction with Task Interaction Modules at all decoder levels for bidirectional segmentation-classification communication via attention weighted pooling and multiplicative modulation. Includes Uncertainty-Proxy Attention for adaptive feature weighting and multi-scale context fusion for varying lesion sizes.
Result: Achieves 74.5% lesion IoU and 90.6% classification accuracy on BUSI dataset, with competitive performance on multiple breast ultrasound datasets. Ablation studies show multi-level task interaction provides significant gains.
Conclusion: Decoder-level bidirectional communication is more effective than conventional encoder-only parameter sharing for multi-task medical image analysis, with multi-level interaction capturing scale-specific task synergies.
Abstract: Breast ultrasound interpretation requires simultaneous lesion segmentation and tissue classification. However, conventional multi-task learning approaches suffer from task interference and rigid coordination strategies that fail to adapt to instance-specific prediction difficulty. We propose a multi-task framework addressing these limitations through multi-level decoder interaction and uncertainty-aware adaptive coordination. Task Interaction Modules operate at all decoder levels, establishing bidirectional segmentation-classification communication during spatial reconstruction through attention weighted pooling and multiplicative modulation. Unlike prior single-level or encoder-only approaches, this multi-level design captures scale specific task synergies across semantic-to-spatial scales, producing complementary task interaction streams. Uncertainty-Proxy Attention adaptively weights base versus enhanced features at each level using feature activation variance, enabling per-level and per-sample task balancing without heuristic tuning. To support instance-adaptive prediction, multi-scale context fusion captures morphological cues across varying lesion sizes. Evaluation on multiple publicly available breast ultrasound datasets demonstrates competitive performance, including 74.5% lesion IoU and 90.6% classification accuracy on BUSI dataset. Ablation studies confirm that multi-level task interaction provides significant performance gains, validating that decoder-level bidirectional communication is more effective than conventional encoder-only parameter sharing. The code is available at: https://github.com/C-loud-Nine/Uncertainty-Aware-Multi-Level-Decoder-Interaction.
[501] FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration
Yizhou Huang, Gengze Jiang, Yihua Cheng, Kezhi Wang
Main category: cs.CV
TL;DR: FoSS: A dual-branch framework for trajectory prediction that combines frequency-domain reasoning with linear-time sequence modeling to achieve state-of-the-art accuracy with improved computational efficiency.
Details
Motivation: Existing trajectory prediction methods struggle to balance modeling power and computational efficiency. Attention-based architectures have quadratic complexity with increasing agents, while recurrent models fail to capture both long-range dependencies and fine-grained local dynamics effectively.Method: Proposes FoSS, a dual-branch framework with: 1) Frequency-domain branch using discrete Fourier transform to decompose trajectories into amplitude (global intent) and phase (local variations) components, with progressive helix reordering and selective state-space modules (Coarse2Fine-SSM and SpecEvolve-SSM) for O(N) complexity refinement; 2) Time-domain dynamic selective SSM that reconstructs self-attention behavior in linear time; 3) Cross-attention layer to fuse temporal and spectral representations; 4) Learnable queries for multiple candidate trajectories; 5) Weighted fusion head to express motion uncertainty.
Result: Achieves state-of-the-art accuracy on Argoverse 1 and Argoverse 2 benchmarks while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
Conclusion: FoSS successfully balances modeling power and computational efficiency for trajectory prediction by unifying frequency-domain reasoning with linear-time sequence modeling, offering a promising direction for efficient multimodal understanding in autonomous driving applications.
Abstract: Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
[502] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu, Zhengtao Zhang, Xingang Wang
Main category: cs.CV
TL;DR: AG-VAS introduces anchor-guided segmentation with learnable semantic tokens and cross-modal alignment for zero-shot visual anomaly segmentation using large multimodal models.
Details
Motivation: Existing LMM-based segmentation approaches struggle with abstract anomaly concepts lacking stable visual prototypes and weak alignment between semantic embeddings and pixel-level features for precise localization.Method: Introduces three learnable semantic anchor tokens ([SEG], [NOR], [ANO]), Semantic-Pixel Alignment Module (SPAM) for cross-modal alignment, Anchor-Guided Mask Decoder (AGMD), and Anomaly-Instruct20K dataset with structured anomaly descriptions.
Result: Achieves state-of-the-art performance on six industrial and medical benchmarks in zero-shot setting.
Conclusion: AG-VAS provides an effective framework for zero-shot visual anomaly segmentation by addressing fundamental limitations of LMM-based approaches through anchor-guided segmentation and improved cross-modal alignment.
Abstract: Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens-[SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
[503] When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
Main category: cs.CV
TL;DR: RL post-training for medical VLMs primarily sharpens existing behaviors rather than creating new reasoning capabilities, with effectiveness depending on pre-existing support from SFT.
Details
Motivation: To understand whether RL actually improves medical visual reasoning in VLMs or just refines behaviors already learned through supervised fine-tuning, and to disentangle the contributions of vision, SFT, and RL components.Method: Controlled study using MedMNIST as multi-modality testbed, benchmarking VLM vision towers against vision-only baselines, quantifying reasoning support via Accuracy@1 vs Pass@K metrics, and evaluating when RL closes support gaps across modalities.
Result: RL is most effective when models already have non-trivial support (high Pass@K) - it primarily sharpens output distributions, improving accuracy and sampling efficiency, while SFT expands support and enables RL effectiveness.
Conclusion: Proposed boundary-aware recipe for RL post-training, demonstrated by achieving strong performance across six medical VQA benchmarks using OctoMed-initialized model trained on small balanced PMC VQA subset.
Abstract: Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
[504] You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image
Taoyue Wang, Xiang Zhang, Xiaotian Li, Huiyuan Yang, Lijun Yin
Main category: cs.CV
TL;DR: NVB-Face: A one-stage method for generating consistent novel-view face images directly from single blind (degraded) face images using feature extraction and diffusion models.
Details
Motivation: Existing novel-view synthesis methods require high-resolution RGB images as input, and for degraded images, they use a two-stage process (restoration then synthesis) which leads to inaccuracies and inconsistencies due to dependency on restoration quality.Method: Extract single-view features directly from blind face images, use a feature manipulator to transform these into 3D-aware multi-view latent representations, and leverage diffusion models to synthesize high-quality consistent novel-view face images.
Result: Experimental results show the method significantly outperforms traditional two-stage approaches in both consistency and fidelity of novel-view face image generation.
Conclusion: NVB-Face provides an effective one-stage solution for novel-view synthesis from blind face images, overcoming limitations of conventional two-stage approaches by directly processing degraded inputs.
Abstract: We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
[505] Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
Anna Michailidou, Georgios Angelidis, Vasileios Argyriou, Panagiotis Sarigiannidis, Georgios Th. Papadopoulos
Main category: cs.CV
TL;DR: Comparative evaluation of supervised vs open-vocabulary vision models for post-disaster aerial imagery analysis, showing supervised approaches remain superior for fixed label spaces despite open-vocabulary models’ promise.
Details
Motivation: Automated interpretation of aerial disaster imagery is challenging due to clutter, visual variability, and domain shift. Supervised approaches require costly annotations, while open-vocabulary foundation models offer alternatives with less annotation dependence.Method: Comparative evaluation of supervised learning and open-vocabulary vision models for semantic segmentation and object detection across multiple disaster datasets (FloodNet+, RescueNet, DFire, LADD). Analysis of performance trends, failure modes, and practical trade-offs.
Result: Supervised training remains the most reliable approach when label space is fixed and annotations are available, especially for small objects and fine boundary delineation in cluttered scenes.
Conclusion: While open-vocabulary models offer advantages in annotation efficiency and flexibility, supervised approaches still outperform for precise disaster scene understanding tasks where fixed labels and annotations exist.
Abstract: Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative, by reducing dependence on fixed label sets and extensive task-specific annotations. Instead, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable remark across all evaluated benchmarks is that supervised training remains the most reliable approach (i.e., when the label space is fixed and annotations are available), especially for small objects and fine boundary delineation in cluttered scenes.
[506] Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth
Andrew Wang, Mike Davies
Main category: cs.CV
TL;DR: PEFD is a novel framework for multispectral demosaicing that learns from mosaiced measurements alone without ground truth, using perspective-equivariant fine-tuning of pretrained foundation models.
Details
Motivation: Multispectral demosaicing is essential for real-time imaging applications but faces challenges: classical methods produce blurry results, while supervised learning requires expensive ground truth data from slow line-scanning systems.Method: PEFD exploits projective geometry of camera systems to leverage richer group structure for null-space recovery, and efficiently learns without ground truth by fine-tuning pretrained foundation models designed for 1-3 channel imaging.
Result: On intraoperative and automotive datasets, PEFD recovers fine details like blood vessels and preserves spectral fidelity, substantially outperforming recent approaches and nearing supervised performance.
Conclusion: PEFD provides an effective framework for multispectral demosaicing that eliminates the need for costly ground truth data while achieving performance close to supervised methods.
Abstract: Multispectral demosaicing is crucial to reconstruct full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging from neurosurgery to autonomous driving. Classical methods are blurry, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems to leverage a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels and preserves spectral fidelity, substantially outperforming recent approaches, nearing supervised performance.
[507] MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Zilong Zhao, Zhengming Ding, Pei Niu, Wenhao Sun, Feng Guo
Main category: cs.CV
TL;DR: MixerCSeg: A novel mixer architecture combining CNN, Transformer, and Mamba pathways for pixel-level crack segmentation, achieving SOTA performance with high efficiency.
Details
Motivation: Existing models (CNN, Transformer, Mamba-based) only capture partial spatial/structural information needed for crack segmentation, leaving gaps in modeling complex crack patterns. Need a unified approach that combines local texture, global dependencies, and sequential context modeling.Method: MixerCSeg architecture with coordinated specialist pathways: CNN-like for local textures, Transformer-style for global dependencies, Mamba-inspired for sequential context. Core TransMixer explores Mamba’s latent attention with dedicated pathways. Spatial block processing strategy and Direction-guided Edge Gated Convolution (DEGConv) enhance edge sensitivity. Spatial Refinement Multi-Level Fusion (SRF) refines multi-scale details.
Result: State-of-the-art performance on multiple crack segmentation benchmarks with only 2.05 GFLOPs and 2.54M parameters, demonstrating both efficiency and strong representational capability.
Conclusion: MixerCSeg successfully integrates complementary strengths of CNN, Transformer, and Mamba architectures for crack segmentation, achieving superior performance with minimal computational overhead through coordinated specialist pathways and edge-aware enhancements.
Abstract: Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba’s latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.
[508] TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity
Xiao Cai, Lianli Gao, Pengpeng Zeng, Ji Zhang, Heng Tao Shen, Jingkuan Song
Main category: cs.CV
TL;DR: TIMI is a training-free framework for image-to-3D multi-instance generation that achieves high spatial fidelity without additional training by leveraging existing spatial priors in pre-trained models.
Details
Motivation: Current methods for multi-instance 3D generation require fine-tuning on multi-instance datasets, which is computationally expensive and struggles with spatial fidelity. The authors observed that pre-trained image-to-3D models already have meaningful spatial priors that are underutilized due to instance entanglement issues.Method: TIMI introduces two key modules: 1) Instance-aware Separation Guidance (ISG) for instance disentanglement during early denoising, and 2) Spatial-stabilized Geometry-adaptive Update (SGU) to preserve geometric characteristics and relative relationships while stabilizing the ISG guidance.
Result: Extensive experiments show TIMI outperforms existing multi-instance methods in both global layout and distinct local instances, without requiring additional training and with faster inference speed.
Conclusion: TIMI demonstrates that training-free approaches can effectively leverage existing spatial priors in pre-trained models for high-fidelity multi-instance 3D generation, offering computational efficiency and improved spatial fidelity.
Abstract: Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.
[509] VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
Duoxun Tang, Dasen Dai, Jiyao Wang, Xiao Yang, Jianyu Wang, Siqi Cai
Main category: cs.CV
TL;DR: VidDoS is a universal Energy-Latency Attack framework for Video-LLMs that creates instance-agnostic triggers to exhaust computational resources without inference-time gradient calculations, causing extreme token expansion and latency increases.
Details
Motivation: Video-LLMs are increasingly used in safety-critical applications but are vulnerable to Energy-Latency Attacks. Current image-centric methods fail because temporal aggregation dilutes frame perturbations, and real-time demands make instance-wise optimization impractical for continuous video streams.Method: VidDoS uses universal optimization to create instance-agnostic triggers without inference-time gradient calculations. It employs masked teacher forcing to steer models toward expensive target sequences, combined with refusal penalty and early-termination suppression to override conciseness priors.
Result: Testing across three mainstream Video-LLMs and three video datasets (including video QA and autonomous driving) shows extreme degradation: token expansion >205× and inference latency inflation >15× relative to clean baselines. Simulations reveal induced latency leads to critical safety violations in autonomous driving.
Conclusion: The community needs to recognize and mitigate these high-hazard Energy-Latency Attacks in Video-LLMs, especially for safety-critical applications like autonomous driving.
Abstract: Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, which is the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through $\textit{masked teacher forcing}$ to steer models toward expensive target sequences, combined with a $\textit{refusal penalty}$ and $\textit{early-termination suppression}$ to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205$\times$ and inflates the inference latency by more than 15$\times$ relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELA in Video-LLMs.
[510] Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Junwei Zeng, Dong Liang, Sheng-Jun Huang, Kun Zhan, Songcan Chen
Main category: cs.CV
TL;DR: Proposes ET-MTF for modeling turbulence blur as continuous function of exposure time, creates ET-Turb dataset with 5,083 videos for training models that generalize better to real turbulence.
Details
Motivation: Existing turbulence synthesis methods oversimplify exposure-time effects, assuming fixed or binary settings, leading to unrealistic synthetic data and poor model generalization to real turbulence.Method: Revisits modulation transfer function (MTF) to propose Exposure-Time-dependent MTF (ET-MTF) modeling blur as continuous function of exposure-time. Derives tilt-invariant PSF from ET-MTF, integrates with spatially varying blur-width field for comprehensive turbulence blur synthesis.
Result: Constructs ET-Turb dataset with 5,083 videos (2,005,835 frames) across diverse optical/atmospheric conditions with continuous exposure-time modeling. Models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data.
Conclusion: ET-MTF provides physically accurate turbulence blur modeling, and ET-Turb dataset enables training models that generalize better to real turbulence restoration tasks.
Abstract: Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: github.com/Jun-Wei-Zeng/ET-Turb.
[511] Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
Main category: cs.CV
TL;DR: AOT is a training-free token reduction method for Video LLMs that uses optimal transport to aggregate informative contexts from pruned tokens via local-global token anchors, improving efficiency while preserving performance.
Details
Motivation: Video LLMs suffer from inefficiency due to redundant visual tokens. Existing pruning methods target intra-frame spatial redundancy or prune inside LLMs with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and discarding subtle informative context from merged/pruned tokens.Method: Proposes AOT (Anchors via Optimal Transport) that establishes local- and global-aware token anchors within frames using attention guidance, then uses optimal transport to aggregate informative contexts from pruned tokens. For temporal reduction, uses keyframe anchors to ensemble similar information from consecutive frames via optimal transport while keeping distinct tokens for temporal dynamics.
Result: Extensive evaluations show competitive performance across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity.
Conclusion: AOT provides an effective training-free token reduction method for Video LLMs that comprehensively addresses spatiotemporal redundancy while preserving informative contexts through optimal transport-based aggregation.
Abstract: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primary targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. All of them often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token \textbf{A}nchors within intra-frame and inter-frame to comprehensively aggregate the informative contexts via local-global \textbf{O}ptimal \textbf{T}ransport (\textbf{AOT}). Specifically, we first establish local- and global-aware token anchors within each frame under the attention guidance, which then optimal transport aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip will be considered as the keyframe anchors to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performances across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: \href{https://tyroneli.github.io/AOT}{AOT}.
[512] UETrack: A Unified and Efficient Framework for Single Object Tracking
Ben Kang, Jie Zhao, Xin Chen, Wanting Geng, Bin Zhang, Lu Zhang, Dong Wang, Huchuan Lu
Main category: cs.CV
TL;DR: UETrack is an efficient multi-modal single object tracking framework that handles RGB, Depth, Thermal, Event, and Language inputs with high speed-accuracy trade-off across multiple hardware platforms.
Details
Motivation: Most existing tracking methods are limited to RGB inputs and struggle in multi-modal scenarios, while current multi-modal approaches use complex designs that are too heavy and slow for resource-constrained deployment.Method: Introduces two key components: 1) Token-Pooling-based Mixture-of-Experts mechanism for feature aggregation and expert specialization, and 2) Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics to reduce redundant supervision.
Result: Achieves 69.2% AUC on LaSOT benchmark and runs at 163/56/60 FPS on GPU/CPU/AGX platforms, demonstrating superior speed-accuracy trade-off compared to previous methods across 12 benchmarks on 3 hardware platforms.
Conclusion: UETrack addresses the gap in efficient multi-modal tracking with high practicality and versatility, efficiently handling multiple modalities while maintaining strong performance across various hardware platforms.
Abstract: With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at https://github.com/kangben258/UETrack.
[513] ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models
Cheng Yang, Jianhao Jiao, Lingyi Huang, Jinqi Xiao, Zhexiang Tang, Yu Gong, Yibiao Ying, Yang Sui, Jintian Lin, Wen Huang, Bo Yuan
Main category: cs.CV
TL;DR: ATA is a training-free framework that introduces implicit reasoning into Vision-Language-Action models through attention-guided and action-guided strategies, improving task performance without extra annotations or training.
Details
Motivation: Current VLA models rely on explicit reasoning methods that require data-intensive resources like CoT annotations and visual grounding annotations, leading to time-consuming dataset construction and reduced inference efficiency.Method: ATA uses complementary attention-guided and action-guided strategies to formulate implicit reasoning by integrating attention maps with action-based regions of interest, adaptively refining visual inputs without training or annotations.
Result: Extensive experiments show ATA consistently improves task success and robustness while preserving and even enhancing inference efficiency compared to explicit reasoning methods.
Conclusion: ATA provides a lightweight, plug-and-play implicit reasoning approach for VLA models that addresses limitations of explicit reasoning methods without requiring additional training or annotations.
Abstract: Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
[514] SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation
Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding, Xin Gu, Bin Fan, Shiming Xiang
Main category: cs.CV
TL;DR: SeaVIS introduces the first online framework for audio-visual instance segmentation, addressing limitations of offline methods by enabling real-time processing of continuous video streams through causal cross-attention fusion and audio-guided contrastive learning.
Details
Motivation: Existing audio-visual instance segmentation methods are offline and cannot associate detected instances across consecutive clips, making them unsuitable for real-world continuous video streams. Additionally, appearance-based instance association fails to distinguish between sounding and silent states of objects.Method: SeaVIS uses Causal Cross Attention Fusion (CCAF) for online processing, integrating visual features from current frames with entire audio history under causal constraints. It employs Audio-Guided Contrastive Learning (AGCL) to generate instance prototypes encoding both visual appearance and sounding activity, enabling suppression of silent objects during instance association.
Result: SeaVIS surpasses existing state-of-the-art models on the AVISeg dataset across multiple evaluation metrics while maintaining competitive inference speed suitable for real-time processing.
Conclusion: SeaVIS is the first online framework for audio-visual instance segmentation that effectively handles continuous video streams and distinguishes between sounding and silent object states, demonstrating superior performance and real-time capability.
Abstract: Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, that cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object’s sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
[515] DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis
Zengqi Zhao, Weidi Xia, Peter Wei, Yan Zhang, Yiyi Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Simiao Ren
Main category: cs.CV
TL;DR: DOCFORGE-BENCH is the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across 8 datasets to reveal calibration failures in existing approaches.
Details
Motivation: Current document forgery detection evaluations focus on fine-tuning scenarios, but real-world deployment requires zero-shot capabilities where practitioners lack labeled training data. There's a need for a benchmark that reflects realistic deployment scenarios and identifies practical limitations of existing methods.Method: Created DOCFORGE-BENCH as a unified zero-shot benchmark evaluating 14 methods across 8 datasets covering text tampering, receipt forgery, and identity document manipulation. All methods were applied with published pretrained weights without domain adaptation. Used Pixel-AUC and Pixel-F1 metrics with analysis of score-distribution shifts and calibration issues.
Result: Found pervasive calibration failure: methods achieve moderate Pixel-AUC (≥0.76) but near-zero Pixel-F1 due to score-distribution shift (tampered regions occupy only 0.27-4.17% of pixels). Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, showing calibration is the bottleneck. Simple threshold adaptation on N=10 domain images recovers 39-55% of Oracle-F1 gap.
Conclusion: No evaluated method works reliably out-of-the-box on diverse document types, highlighting document forgery detection as an unsolved problem. Current datasets predate generative AI editing, creating a critical gap for modern attack surfaces involving diffusion- and LLM-based forgeries.
Abstract: We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation – a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images – an order of magnitude less than in natural image benchmarks – making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation – not retraining – is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
[516] Unifying Language-Action Understanding and Generation for Autonomous Driving
Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen
Main category: cs.CV
TL;DR: LinkVLA improves Vision-Language-Action models for autonomous driving by addressing language-action misalignment and inefficient auto-regressive generation through shared codebooks, bidirectional training, and coarse-to-fine decoding.
Details
Motivation: Existing VLA models for autonomous driving suffer from persistent misalignment between language instructions and action outputs, and inefficient auto-regressive action generation that slows down inference.Method: Three key innovations: 1) Structural link via shared discrete codebook for language and action tokens, 2) Deep semantic link through auxiliary action understanding objective (generating captions from trajectories), 3) Two-step coarse-to-fine generation (C2F) to replace slow auto-regressive decoding.
Result: Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, with 86% reduction in inference time compared to auto-regressive methods.
Conclusion: LinkVLA successfully addresses both alignment and efficiency challenges in VLA models for autonomous driving through structural and semantic linking mechanisms combined with efficient generation techniques.
Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method C2F that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.
[517] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection
Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow, Kwok-Yan Lam
Main category: cs.CV
TL;DR: DFA is a dual-stream framework that combines CLIP vision-language foundation models with specialized forensic analysis for deepfake detection, achieving state-of-the-art performance with strong generalization capabilities.
Details
Motivation: Deepfake generation techniques pose significant threats to public safety, but existing detection methods have limitations in generalizing to emerging forgery patterns. There's a need for robust detection systems that can adapt to evolving deepfake threats.Method: Proposes Deepfake Forensics Adapter (DFA) - a dual-stream framework that integrates pre-trained CLIP with three components: 1) Global Feature Adapter for global inconsistency detection, 2) Local Anomaly Stream for local facial forgery cues using facial structure priors, and 3) Interactive Fusion Classifier using transformer encoder for feature fusion.
Result: Achieves state-of-the-art performance on DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing 4.8% video AUC improvement over previous methods. Demonstrates superior generalization capabilities across benchmarks.
Conclusion: DFA provides an effective framework for robust deepfake detection with enhanced generalization against evolving threats, pointing to a feasible direction for developing practical detection systems using vision-language foundation models.
Abstract: The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. While existing detection methods demonstrate limitations in generalizing to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model’s ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations of frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, particularly achieving state-of-the-art performance in the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at https://github.com/Liao330/DFA.git
[518] Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling
Zillur Rahman, Alex Sheng, Cristian Meo
Main category: cs.CV
TL;DR: 3R is a RAG-based prompt optimization framework for text-to-video generation that improves video quality without model training by using contextual grounding, preference optimization, and temporal interpolation.
Details
Motivation: Current text-to-video models are highly sensitive to input prompts, and existing improvement methods either require complex post-editing (risking artifacts) or expensive fine-tuning (limiting scalability). There's a need for accessible prompt optimization that works with any T2V model.Method: 3R uses three strategies: 1) RAG-based modifiers extraction for enriched contextual grounding, 2) diffusion-based Preference Optimization to align outputs with human preferences, and 3) temporal frame interpolation for producing temporally consistent visual content. The framework works with any T2V model without training.
Result: Experimental results show 3R enhances static fidelity and dynamic coherence of generated videos, demonstrating the importance of optimizing user prompts for better text-to-video generation.
Conclusion: 3R provides an effective, scalable prompt optimization framework that improves text-to-video generation quality without requiring model training, making it accessible and applicable to various T2V models.
Abstract: While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex, post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG based prompt optimization framework. 3R utilizes the power of current state-of-the-art T2V diffusion model and vision language model. It can be used with any T2V model without any kind of model training. The framework leverages three key strategies: RAG-based modifiers extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual contents. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
[519] UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation
Teng Wang, Haojun Jiang, Chenxi Li, Diwen Wang, Yihang Tang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang
Main category: cs.CV
TL;DR: UltraStar reformulates ultrasound probe navigation from path regression to anchor-based global localization using a Star Graph with semantic-aware sampling of historical keyframes.
Details
Motivation: Echocardiography diagnosis is hindered by shortage of skilled sonographers. Automated probe navigation has clinical potential but existing methods overfit noisy trajectories from trial-and-error scanning data, degrading performance on long sequences.Method: Proposes UltraStar which reformulates probe navigation as anchor-based global localization. Establishes a Star Graph treating historical keyframes as spatial anchors connected directly to current view, modeling geometric constraints. Uses semantic-aware sampling to select representative landmarks from history logs.
Result: Extensive experiments on dataset with over 1.31 million samples show UltraStar outperforms baselines and scales better with longer input lengths, revealing more effective topology for history modeling under noisy exploration.
Conclusion: Star Graph approach provides robust navigation by better leveraging historical scanning information, addressing limitations of sequential chain modeling for noisy trajectory data.
Abstract: Echocardiography is critical for diagnosing cardiovascular diseases, yet the shortage of skilled sonographers hinders timely patient care, due to high operational difficulties. Consequently, research on automated probe navigation has significant clinical potential. To achieve robust navigation, it is essential to leverage historical scanning information, mimicking how experts rely on past feedback to adjust subsequent maneuvers. Practical scanning data collected from sonographers typically consists of noisy trajectories inherently generated through trial-and-error exploration. However, existing methods typically model this history as a sequential chain, forcing models to overfit these noisy paths, leading to performance degradation on long sequences. In this paper, we propose UltraStar, which reformulates probe navigation from path regression to anchor-based global localization. By establishing a Star Graph, UltraStar treats historical keyframes as spatial anchors connected directly to the current view, explicitly modeling geometric constraints for precise positioning. We further enhance the Star Graph with a semantic-aware sampling strategy that actively selects the representative landmarks from massive history logs, reducing redundancy for accurate anchoring. Extensive experiments on a dataset with over 1.31 million samples demonstrate that UltraStar outperforms baselines and scales better with longer input lengths, revealing a more effective topology for history modeling under noisy exploration.
[520] Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim
Main category: cs.CV
TL;DR: Pri4R enhances Vision-Language-Action models with implicit understanding of physical dynamics by adding a lightweight point track head during training, improving manipulation performance without inference overhead.
Details
Motivation: Current VLA models lack understanding of spatiotemporal dynamics governing physical interactions, limiting their ability to perform precise manipulation tasks that require understanding how the world responds to actions.Method: Adds a lightweight point track head to VLA models during training that predicts 3D point tracks. Injects VLA features into this head to jointly predict future 3D trajectories, learning evolving scene geometry within the shared representation space.
Result: Significantly improves performance on challenging manipulation tasks: +10% gain on LIBERO-Long and +40% gain on RoboCasa. Shows 3D point track prediction is effective supervision for learning action-world dynamics.
Conclusion: Pri4R successfully endows VLA models with implicit understanding of world dynamics through privileged 4D information during training, enabling more physically aware context for precise control without inference overhead.
Abstract: Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
[521] WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments
Joshua Knights, Joseph Reid, Kaushik Roy, David Hall, Mark Cox, Peyman Moghadam
Main category: cs.CV
TL;DR: WildCross is a cross-modal benchmark dataset for place recognition and metric depth estimation in large-scale natural environments, addressing the gap in existing robotics datasets that focus on structured urban settings.
Details
Motivation: There's increasing demand for robotic solutions in unstructured natural environments and growing interest in bridging 2D and 3D scene understanding, but existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for complex natural settings.Method: The authors propose WildCross, a cross-modal benchmark comprising over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps.
Result: Comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrate the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks.
Conclusion: WildCross addresses the critical gap in robotics datasets for natural environments and provides a valuable resource for advancing multi-modal robotic perception research in unstructured settings.
Abstract: Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
[522] SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
Brian Cheong, Letian Wang, Sandro Papais, Steven L. Waslander
Main category: cs.CV
TL;DR: SCATR is a LiDAR-based tracking-by-attention model that introduces two novel training strategies to reduce false negatives and bridge the performance gap with traditional tracking-by-detection methods.
Details
Motivation: LiDAR-based tracking-by-attention frameworks suffer from high false negative errors, creating a significant performance gap compared to traditional tracking-by-detection methods. The paper aims to systematically address this fundamental challenge.Method: SCATR introduces two architecture-agnostic training strategies: 1) Second Chance Assignment - concatenates unassigned track queries to proposal queries before bipartite matching, giving them another chance to be assigned to ground truth objects; 2) Track Query Dropout - diversifies supervised object query configurations to train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks.
Result: On the nuScenes tracking benchmark, SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the performance gap between LiDAR-based TBA and TBD methods.
Conclusion: The proposed Second Chance Assignment and Track Query Dropout strategies effectively address the false negative problem in LiDAR-based tracking-by-attention frameworks, demonstrating their effectiveness and generalization through ablation studies.
Abstract: LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR-based tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work’s core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of Second Chance Assignment and Track Query Dropout. Code can be found at the following link: \href{https://github.com/TRAILab/SCATR}{https://github.com/TRAILab/SCATR}
[523] SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis
Chuqiao Wu, Jin Song, Yiyun Fei
Main category: cs.CV
TL;DR: SkeleGuide is a novel framework for human image synthesis that uses explicit skeletal reasoning to generate structurally plausible human images in existing scenes, addressing common artifacts like distorted limbs through joint training of reasoning and rendering stages.
Details
Motivation: Current generative models struggle with generating realistic human images in scenes, often producing artifacts like distorted limbs and unnatural poses due to inability to perform explicit reasoning over human skeletal structure.Method: Introduces SkeleGuide framework with explicit skeletal reasoning, joint training of reasoning and rendering stages to produce internal pose as structural prior, and PoseInverter module for decoding latent pose into editable format for user control.
Result: SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images with better structural integrity.
Conclusion: Explicitly modeling skeletal structure is fundamental for robust and plausible human image synthesis, and SkeleGuide demonstrates the effectiveness of this approach.
Abstract: Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
[524] Radiometrically Consistent Gaussian Surfels for Inverse Rendering
Kyu Beom Han, Jaeyoon Kim, Woo Jae Kim, Jinhwan Seo, Sung-eui Yoon
Main category: cs.CV
TL;DR: RadioGS introduces radiometric consistency for inverse rendering with Gaussian splatting, enabling accurate material disentanglement from indirect illumination by providing supervision for unobserved views through physically-based constraints.
Details
Motivation: Existing inverse rendering methods using Gaussian Splatting struggle to accurately disentangle material properties from complex global illumination effects, especially indirect illumination. Current approaches query indirect radiance from Gaussian primitives pre-trained only for novel-view synthesis, lacking supervision for modeling indirect radiances from unobserved views.Method: Proposes radiometric consistency - a physically-based constraint that minimizes residual between each Gaussian primitive’s learned radiance and its physically-based rendered counterpart for unobserved views. This creates a self-correcting feedback loop combining physically-based rendering and novel-view synthesis supervision. Implements RadioGS framework using Gaussian surfels and 2D Gaussian ray tracing, plus a finetuning-based relighting strategy.
Result: RadioGS outperforms existing Gaussian-based methods in inverse rendering benchmarks while maintaining computational efficiency (<10ms rendering cost). The finetuning-based relighting adapts Gaussian surfel radiances to new illuminations within minutes.
Conclusion: Radiometric consistency effectively addresses the indirect illumination modeling challenge in inverse rendering with Gaussian splatting, enabling accurate material disentanglement while preserving computational efficiency.
Abstract: Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints, thus lack supervision for modeling indirect radiances from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive’s learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built upon our principle by efficiently integrating radiometric consistency by utilizing Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost (<10ms). Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering, while retaining the computational efficiency.
[525] YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection
PeiHuang Zheng, Yunlong Zhao, Zheng Cui, Yang Li
Main category: cs.CV
TL;DR: YCDa is an early-stage feature processing strategy inspired by human vision that separates color and luminance information, dynamically allocating attention to improve camouflaged object detection in real-time detectors.
Details
Motivation: Inspired by human visual adaptability where the visual system shifts from color to brightness/texture cues when color becomes unreliable in camouflage scenarios, the authors aim to embed this biological mechanism into computer vision systems for more robust perception in confusing environments.Method: YCDa (chrominance-luminance decoupling and dynamic attention) separates color and luminance information at the input stage, then dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. It’s a plug-and-play strategy that can be integrated into existing detectors by replacing the first downsampling layer.
Result: YCDa consistently improves performance with negligible overhead across multiple baselines. YCDa-YOLO12s achieves 112% improvement in mAP over baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
Conclusion: The biologically-inspired YCDa strategy effectively enhances real-time camouflaged object detection by mimicking human visual adaptation mechanisms, demonstrating significant performance gains with minimal computational cost.
Abstract: Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this “chrominance-luminance decoupling and dynamic attention” principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead as shown in Fig. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
[526] Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection
Kai Zheng, Hang-Cheng Dong, Zhenkai Wu, Fupeng Wei, Wei Zhang
Main category: cs.CV
TL;DR: Tripath DINO architecture for multi-class change detection in remote sensing using three-path complementary feature learning with DINOv3 backbone
Details
Motivation: Multi-class change detection in remote sensing faces challenges with complex scene variations and limited detailed annotations, requiring better adaptation of foundation models to vertical domainsMethod: Three-path architecture: main path with DINOv3 backbone for coarse features, auxiliary siamese path for fine-grained features, and multi-scale attention decoder with parallel convolutions for contextual enhancement
Result: Achieves optimal performance on Gaza facility damage assessment dataset and SECOND dataset; GradCAM shows main path focuses on semantic changes while auxiliary path captures structural details
Conclusion: Tripath DINO provides robust, interpretable solution for advanced change detection with synergistic complementarity between coarse and fine-grained feature learning
Abstract: In remote sensing imagery, multi class change detection (MCD) is crucial for fine grained monitoring, yet it has long been constrained by complex scene variations and the scarcity of detailed annotations. To address this, we propose the Tripath DINO architecture, which adopts a three path complementary feature learning strategy to facilitate the rapid adaptation of pre trained foundation models to complex vertical domains. Specifically, we employ the DINOv3 pre trained model as the backbone feature extraction network to learn coarse grained features. An auxiliary path also adopts a siamese structure, progressively aggregating intermediate features from the siamese encoder to enhance the learning of fine grained features. Finally, a multi scale attention mechanism is introduced to augment the decoder network, where parallel convolutions adaptively capture and enhance contextual information under different receptive fields. The proposed method achieves optimal performance on the MCD task on both the Gaza facility damage assessment dataset (Gaza change) and the classic SECOND dataset. GradCAM visualizations further confirm that the main and auxiliary paths naturally focus on coarse grained semantic changes and fine grained structural details, respectively. This synergistic complementarity provides a robust and interpretable solution for advanced change detection tasks, offering a basis for rapid and accurate damage assessment.
[527] What Helps – and What Hurts: Bidirectional Explanations for Vision Transformers
Qin Su, Tie Luo
Main category: cs.CV
TL;DR: BiCAM is a bidirectional class activation mapping method for Vision Transformers that captures both positive and negative contributions to predictions, improving interpretability and enabling adversarial detection.
Details
Motivation: Vision Transformers achieve strong performance but their decision-making is difficult to interpret. Existing CAM-based methods discard negative signals, providing incomplete explanations. The authors aim to develop a more complete interpretability method that captures both supportive and suppressive evidence.Method: BiCAM (Bidirectional Class Activation Mapping) preserves signed attributions to capture both positive and negative contributions. It introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance. The method works with multiple ViT variants including DeiT and Swin Transformers, and enables lightweight adversarial example detection without retraining.
Result: BiCAM improves localization and faithfulness across ImageNet, VOC, and COCO datasets while remaining computationally efficient. It generalizes to multiple ViT variants and enables effective adversarial example detection through the PNR metric.
Conclusion: Modeling both supportive and suppressive evidence is important for interpreting transformer-based vision models. BiCAM provides more complete and contrastive explanations while maintaining computational efficiency.
Abstract: Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.
[528] OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Jianqiang Ren, Lin Liu, Steven Hoi
Main category: cs.CV
TL;DR: OMG-Avatar: One-shot method using Multi-LOD Gaussian representation for fast (0.2s) animatable 3D head reconstruction from single image, with transformer-based architecture and multi-region decomposition for head and shoulders.
Details
Motivation: Current methods for 3D head avatar creation from single images are slow, lack detail hierarchy, and struggle with non-head regions like shoulders. There's a need for fast, high-quality reconstruction that works across different hardware capabilities.Method: Uses Multi-LOD Gaussian representation with transformer-based global feature extraction and projection-based local feature sampling. Features fused with depth buffer guidance. Employs coarse-to-fine learning for LOD functionality and multi-region decomposition separating head and shoulder prediction then combining them.
Result: Outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency (0.2s inference time). Supports Level-of-Detail functionality for diverse hardware capabilities.
Conclusion: OMG-Avatar provides fast, high-quality animatable 3D head reconstruction from single images with LOD support and better handling of non-head regions, making it practical for real-time applications.
Abstract: We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
[529] FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Hanxiao Wang, Yuan-Chen Guo, Ying-Tian Liu, Zi-Xin Zou, Biao Zhang, Weize Quan, Ding Liang, Yan-Pei Cao, Dong-Ming Yan
Main category: cs.CV
TL;DR: FACE: An autoregressive autoencoder framework for 3D mesh generation that operates at face level (one-face-one-token) instead of vertex level, achieving 9x sequence length reduction and state-of-the-art reconstruction quality.
Details
Motivation: Autoregressive models for 3D mesh generation flatten meshes into long vertex-coordinate sequences, causing prohibitive computational costs and hindering efficient synthesis of high-fidelity geometry. The bottleneck stems from operating at the wrong semantic level (vertex level instead of face level).Method: Introduces FACE, an Autoregressive Autoencoder (ARAE) framework that generates meshes at face level using a one-face-one-token strategy. Each triangle face is treated as a single unified token, reducing sequence length by a factor of nine. Pairs a face-level decoder with a powerful VecSet encoder for efficient representation.
Result: Achieves unprecedented compression ratio of 0.11 (halving previous SOTA), reduces sequence length by 9x, achieves state-of-the-art reconstruction quality on standard benchmarks. The learned latent space enables training a latent diffusion model for high-fidelity single-image-to-mesh generation.
Conclusion: FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation by operating at the appropriate semantic level (face instead of vertex).
Abstract: Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
[530] MVR: Multi-view Video Reward Shaping for Reinforcement Learning
Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li
Main category: cs.CV
TL;DR: MVR is a reinforcement learning framework that uses multi-view video-text similarity from vision-language models to shape rewards for complex dynamic tasks, addressing limitations of static image-based methods.
Details
Motivation: Current VLM-based reward methods use single static images and linear reward addition, which can alter optimal policies, struggle with complex dynamic motions, and suffer from viewpoint occlusion issues.Method: Multi-view video reward shaping (MVR) uses videos from multiple viewpoints with video-text similarity from frozen VLMs to learn state relevance functions, plus state-dependent reward shaping that reduces VLM influence once desired motion is achieved.
Result: Extensive experiments on HumanoidBench locomotion and MetaWorld manipulation tasks demonstrate efficacy, with ablation studies verifying design choices.
Conclusion: MVR effectively addresses limitations of image-based VLM reward methods for complex dynamic tasks through multi-view video analysis and adaptive reward shaping.
Abstract: Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent’s behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
[531] Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection
Qirui Wu, Shizhou Zhang, De Cheng, Yinghui Xing, Lingyan Ran, Dahu Shi, Peng Wang
Main category: cs.CV
TL;DR: Proposes Q-MCMF matcher for incremental object detection in DETR-like architectures to address background foregrounding issue that causes catastrophic forgetting.
Details
Motivation: Address catastrophic forgetting in incremental object detection, specifically identifying a novel source of forgetting in DETR-like architectures called "background foregrounding" caused by the Hungarian matcher's exhaustiveness constraint.Method: Proposes Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher that builds a flow graph, prunes implausible matches based on geometric quality, and optimizes for final matching that minimizes cost while maximizing valid assignments.
Result: Extensive experiments on COCO dataset under various incremental settings demonstrate consistent outperformance over existing state-of-the-art approaches.
Conclusion: Q-MCMF effectively addresses background foregrounding in DETR-like architectures for incremental object detection, eliminating harmful supervision while maximizing foreground learning signals.
Abstract: Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
[532] Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
Main category: cs.CV
TL;DR: CIM is a reinforcement learning framework that improves LVLM image captioning by minimizing information loss through cross-modal identity mapping, using image retrieval similarity as a proxy for caption quality.
Details
Motivation: LVLMs often lose or misrepresent critical visual information in image captions due to the modality gap between vision and language. Measuring this information loss is challenging, but the authors hypothesize that caption quality correlates with how well images retrieved using that caption match the original image.Method: Proposes Cross-modal Identity Mapping (CIM), a reinforcement learning framework that evaluates information loss from two perspectives: Gallery Representation Consistency (how well captions represent image content) and Query-gallery Image Relevance (similarity between retrieved and original images). Uses these metrics as supervision signals without additional annotations.
Result: CIM demonstrates superior performance in image captioning, even outperforming Supervised Fine-Tuning. On COCO-LN500 benchmark, achieves 20% improvement in relation reasoning on Qwen2.5-VL-7B model.
Conclusion: The proposed CIM framework effectively enhances LVLM image captioning by minimizing information loss through cross-modal identity mapping, using image retrieval similarity as a novel proxy for measuring caption quality without requiring additional annotations.
Abstract: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B.The code will be released when the paper is accepted.
[533] Boosting AI Reliability with an FSM-Driven Streaming Inference Pipeline: An Industrial Case
Yutian Zhang, Zhongyi Pei, Yi Mao, Chen Wang, Lin Liu, Jianmin Wang
Main category: cs.CV
TL;DR: A streaming inference pipeline combining object detection with Finite State Machines to count excavator workloads from surveillance videos, improving robustness over manual heuristics.
Details
Motivation: AI in industry often lacks robustness when facing scenarios not in training data, leading to prediction bias and vulnerabilities. Need to enhance data-driven models by incorporating prior knowledge for real-world applications like excavator workload counting.Method: Proposes a streaming inference pipeline that integrates an object detection model with a Finite State Machine (FSM). The FSM encodes knowledge of operational scenarios to guide and correct AI predictions on streaming video data for excavator workload counting.
Result: Tested on real-world dataset of over 7,000 images from 12 site videos with 300+ excavator workloads. The method demonstrated superior performance and greater robustness compared to original manual heuristic rules solution.
Conclusion: The proposed approach successfully enhances industrial AI applications by combining data-driven models with prior knowledge through FSM, improving robustness in real-world scenarios. Code will be released publicly.
Abstract: The widespread adoption of AI in industry is often hampered by its limited robustness when faced with scenarios absent from training data, leading to prediction bias and vulnerabilities. To address this, we propose a novel streaming inference pipeline that enhances data-driven models by explicitly incorporating prior knowledge. This paper presents the work on an industrial AI application that automatically counts excavator workloads from surveillance videos. Our approach integrates an object detection model with a Finite State Machine (FSM), which encodes knowledge of operational scenarios to guide and correct the AI’s predictions on streaming data. In experiments on a real-world dataset of over 7,000 images from 12 site videos, encompassing more than 300 excavator workloads, our method demonstrates superior performance and greater robustness compared to the original solution based on manual heuristic rules. We will release the code at https://github.com/thulab/video-streamling-inference-pipeline.
[534] Towards Principled Dataset Distillation: A Spectral Distribution Perspective
Ruixi Wu, Shaobo Wang, Jiahuan Chen, Zhiyuan Liu, Yicun Yang, Zhaorun Chen, Zekai Li, Kaixin Li, Xinming Wang, Hongzhu Yi, Kai Wang, Linfeng Zhang
Main category: cs.CV
TL;DR: CSDM improves dataset distillation for long-tailed datasets by using spectral distribution matching and adaptive class balancing.
Details
Motivation: Existing dataset distillation methods perform poorly on long-tailed datasets due to heuristic distribution discrepancy measures and uniform treatment of imbalanced classes.Method: Proposes Class-Aware Spectral Distribution Matching (CSDM) that reformulates distribution alignment via kernel spectrum analysis, creating Spectral Distribution Distance (SDD), and uses amplitude-phase decomposition to adaptively prioritize tail class realism.
Result: On CIFAR-10-LT, CSDM achieves 14.0% improvement over SOTA DD methods with only 5.7% performance drop when tail class images decrease from 500 to 25.
Conclusion: CSDM effectively addresses long-tailed dataset distillation challenges through spectral analysis and adaptive class balancing, demonstrating strong stability on imbalanced data.
Abstract: Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform amplitude-phase decomposition, which adaptively prioritizes the realism in tail classes. On CIFAR-10-LT, with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, with only a 5.7% performance drop when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
[535] Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing
Zijin Yin, Bing Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo
Main category: cs.CV
TL;DR: Gen4Seg: An automatic data generation pipeline using diffusion models to create challenging test samples for stress-testing semantic segmentation models by editing appearance and geometry attributes at object and image levels.
Details
Motivation: To develop a systematic way to test semantic segmentation models in varied and complex scenes before deployment, going beyond previous evaluation paradigms that focused only on global weather and style transfer.Method: Uses diffusion models to edit visual attributes (color, material, size, position at object level; weather, style at image level) of existing real images while preserving structural information, enabling reuse of segmentation labels to reduce annotation costs.
Result: Created Pascal-EA and COCO-EA benchmarks; found that open-vocabulary models aren’t more robust to geometric variations, data augmentation techniques are limited against appearance variations, and the pipeline can improve both in-distribution and out-of-distribution performance when used for data augmentation.
Conclusion: Generative models show potential as effective tools for automatically analyzing segmentation models, and the findings can help develop more robust and reliable segmentation models.
Abstract: Semantic segmentation takes pivotal roles in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performances. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
[536] RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry
Xinchang Wang, Yunhao Chen, Yuechen Zhang, Congcong Bian, Zihao Guo, Xingjun Ma, Hui Li
Main category: cs.CV
TL;DR: RA-Det detects AI-generated images by measuring their instability under small perturbations compared to natural images, leveraging robustness asymmetry as a universal behavioral signal.
Details
Motivation: As AI-generated images become more photo-realistic, traditional appearance-based detection methods (forensic cues, high-level representations) lose effectiveness. This motivates a shift from appearance to behavior - focusing on how images respond to controlled perturbations rather than how they look.Method: Introduces Robustness Asymmetry Detection (RA-Det): 1) Identifies that natural images preserve stable semantic representations under small structured perturbations, while generated images exhibit larger feature drift. 2) Provides theoretical analysis connecting this asymmetry to memorization tendencies in generative models. 3) Converts robustness asymmetry into a detection signal using a behavior-driven framework that is data- and model-agnostic.
Result: Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving average performance by 7.81%. The method transfers across unseen generators and requires no generator fingerprints.
Conclusion: Robustness asymmetry is a stable, general cue for synthetic-image detection. Carefully designed probing can turn this behavioral signal into a practical, universal detector for AI-generated images.
Abstract: Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector. The source code is publicly available at Github.
[537] An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
Alexandru Manole, Laura Diosan
Main category: cs.CV
TL;DR: Multi-task learning improves car make and model classification by leveraging hierarchical structure, with CNNs and Transformers showing performance gains on StanfordCars and CompCars datasets.
Details
Motivation: Most real-world information is organized hierarchically, but many deep learning approaches don't leverage this structure. Human learning benefits from hierarchical information, suggesting intelligent models could similarly benefit through multi-task learning.Method: Analyzed multi-task learning for hierarchical multi-label classification (car make and model). Evaluated parallel and cascaded multi-task architectures with different deep learning classifiers (CNNs, Transformers), varying dropout rates and loss weighting. Tested on StanfordCars and CompCars benchmarks.
Result: Multi-task learning improved performance on both datasets. CNNs showed improvements in almost all scenarios. Both CNN and Transformer models showed significant improvements on the CompCars dataset.
Conclusion: Multi-task learning is effective for hierarchical classification tasks like car make and model recognition, demonstrating that leveraging hierarchical structure can improve deep learning model performance.
Abstract: Most information in our world is organized hierarchically; however, many Deep Learning approaches do not leverage this semantically rich structure. Research suggests that human learning benefits from exploiting the hierarchical structure of information, and intelligent models could similarly take advantage of this through multi-task learning. In this work, we analyze the advantages and limitations of multi-task learning in a hierarchical multi-label classification problem: car make and model classification. Considering both parallel and cascaded multi-task architectures, we evaluate their impact on different Deep Learning classifiers (CNNs, Transformers) while varying key factors such as dropout rate and loss weighting to gain deeper insight into the effectiveness of this approach. The tests are conducted on two established benchmarks: StanfordCars and CompCars. We observe the effectiveness of the multi-task paradigm on both datasets, improving the performance of the investigated CNN in almost all scenarios. Furthermore, the approach yields significant improvements on the CompCars dataset for both types of models.
[538] Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory
Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, Fanzhang Li
Main category: cs.CV
TL;DR: SDAM is a training-free framework for reasoning video object segmentation that uses adaptive object memory and spatio-temporal decoupling to achieve stable segmentation with complex textual inputs.
Details
Motivation: Existing methods for reasoning video object segmentation require fine-tuning MLLMs which is resource-intensive, and they often have coupled spatio-temporal processing that affects temporal stability. The authors aim to create a training-free solution that outperforms fine-tuned methods.Method: Proposes SDAM with three key components: 1) Training-free framework using only pre-trained models, 2) Adaptive Object Memory module that selects and memorizes key objects based on motion cues, and 3) Spatio-temporal Decoupling approach where spatial domain handles precise localization/segmentation while temporal domain uses key object information for stable cross-frame propagation.
Result: Achieves excellent results on five benchmark datasets: Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS, outperforming existing methods that require fine-tuning.
Conclusion: SDAM demonstrates that training-free approaches can surpass fine-tuned methods for reasoning video object segmentation through effective spatio-temporal decoupling and adaptive memory mechanisms.
Abstract: Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free \textbf{S}patio-temporal \textbf{D}ecoupled Reasoning Video Segmentation with \textbf{A}daptive Object \textbf{M}emory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
[539] PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification
Jian Yu, Joakim Nguyen, Jinrui Fang, Awais Naeem, Zeyuan Cao, Sanjay Krishnan, Nicholas Konz, Tianlong Chen, Chandra Krishnan, Hairong Wang, Edward Castillo, Ying Ding, Ankita Shukla
Main category: cs.CV
TL;DR: PathMoE is an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs using a mixture-of-experts architecture for pediatric CNS tumor classification.
Details
Motivation: Pediatric CNS tumor classification is challenging due to histological complexity and limited training data. Existing pathology foundation models often fail to leverage complementary information from clinical text and tissue microarchitecture.Method: Proposes PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. Uses specialized experts to capture modality uniqueness, redundancy, and synergy with an input-dependent gating mechanism.
Result: PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on internal pediatric brain tumor dataset when integrating WSI, text, and graph modalities. On TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041).
Conclusion: The framework demonstrates significant performance gains over image-only baselines while providing sample-level interpretability, which is critical for rare tumor subtypes and clinical trust.
Abstract: Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
[540] Align-cDAE: Alzheimer’s Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder
Ayantika Das, Keerthi Ram, Mohanasankar Sivaprakasam
Main category: cs.CV
TL;DR: A diffusion autoencoder framework for Alzheimer’s disease progression modeling that enforces alignment between multimodal inputs and structures latent space for better control over image generation.
Details
Motivation: Existing diffusion models for brain disease progression don't ensure meaningful alignment between non-imaging modalities and image features, nor provide structured latent spaces for precise control over progression-specific changes.Method: Proposes a diffusion autoencoder framework with explicit alignment objective to focus on progression-related regions, and structured latent subspaces separating progression conditions from subject identity information.
Result: The approach demonstrates more anatomically precise modeling of Alzheimer’s disease progression through better alignment and latent space structuring.
Conclusion: Enforcing alignment between modalities and structuring latent representational space improves controllability and anatomical precision in disease progression modeling using diffusion autoencoders.
Abstract: Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer’s. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, lacking in the existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer’s disease progression.
[541] TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel
Main category: cs.CV
TL;DR: TopoMaskV3 introduces a mask-based 3D road topology understanding method with dense offset and height prediction heads, addressing geographic data leakage with distinct splits and long-range benchmarks.
Details
Motivation: Prior mask-based methods for road topology understanding were limited to 2D predictions and suffered from discretization artifacts, requiring fusion with parametric heads. There was also geographic data leakage in evaluation that needed addressing.Method: TopoMaskV3 advances mask-based pipeline to 3D prediction via two novel dense prediction heads: dense offset field for sub-grid discretization correction and dense height map for direct 3D estimation. Introduces geographically distinct splits and long-range (+/-100m) benchmark to prevent memorization.
Result: Achieves state-of-the-art 28.5 OLS on geographically disjoint benchmark, surpassing all prior methods. Mask representation shows more robustness to geographic overfitting than Bezier methods. LiDAR fusion benefits most at long range and shows larger gains on overlapping splits, suggesting overlap-induced memorization effects.
Conclusion: TopoMaskV3 provides a robust standalone 3D predictor for road topology understanding that addresses geographic data leakage issues and demonstrates superior performance on fair evaluation benchmarks.
Abstract: Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (+/-100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.
[542] Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu
Main category: cs.CV
TL;DR: This paper introduces a framework for real-time non-verbal interaction between humans and AI using 2D body keypoints, evaluating whether AI-generated motion can match human motion quality and examining performance on outputs from text-to-video systems like SORA and VEO.
Details
Motivation: The paper aims to investigate whether contemporary generative models can go beyond surface mimicry to achieve authentic non-verbal communication through body language, addressing the statistical fidelity gap between AI-generated and human-generated motion data.Method: The authors introduce the first framework for generating natural non-verbal interactions between humans and AI in real-time from 2D body keypoints, using four lightweight architectures that run at up to 100 FPS on an NVIDIA Orin Nano. They trained on 437 human video clips and used pretraining on synthetically-generated sequences.
Result: Pretraining on synthetic sequences significantly reduces motion errors without sacrificing speed, but a measurable reality gap persists. When evaluated on keypoints from SORA and VEO, performance drops more on SORA-generated clips than VEO, suggesting temporal coherence (not image fidelity) drives real-world performance.
Conclusion: Statistically distinguishable differences persist between human and AI motion, indicating that current generative models haven’t fully captured the nuances of human non-verbal communication despite advances in motion generation.
Abstract: We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
[543] Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
Saurabh Kaushik, Lalit Maurya, Beth Tellman
Main category: cs.CV
TL;DR: Cryo-Bench benchmark evaluates 14 Geo-Foundation Models for cryospheric applications including glaciers, lakes, sea ice, and calving fronts, showing UNet performs best with frozen encoder but GFMs excel in few-shot settings.
Details
Motivation: There's a lack of suitable evaluation datasets for benchmarking Geo-Foundation Models in cryospheric applications, limiting understanding of their performance for Earth observation tasks in polar regions.Method: Created Cryo-Bench benchmark with datasets covering debris-covered glaciers, glacial lakes, sea ice, and calving fronts across multiple sensors and regions. Evaluated 14 GFMs alongside UNet and ViT baselines under frozen encoder, few-shot (10% data), and full fine-tuning settings.
Result: UNet achieved highest average mIoU (66.38) with frozen encoder. In few-shot settings, GFMs like DOFA and TerraMind outperformed UNet (59.53, 56.62 vs 56.60). Full fine-tuning showed inconsistent performance but improved 12.77% with learning rate tuning. GFMs demonstrated strong domain adaptation despite minimal cryosphere pretraining data.
Conclusion: GFMs show promise for cryospheric applications with notable domain adaptation. Recommended encoder fine-tuning with hyperparameter optimization for best performance, or frozen encoders for quick results without extensive experimentation.
Abstract: Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation task including multiple domains and have demonstrated strong potential of producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce \textbf{Cryo-Bench}, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of \textbf{66.38}, followed by TerraMind at \textbf{64.02} across five evluation dataset included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of \textbf{59.53}, \textbf{56.62}, and \textbf{56.60}, respectively, comapred to U-Net’s 56.60. When fully finetuning GFMs, we observe inconsistent performance across datasets and models. However, tuning learning rate along with finetuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of \textbf{12.77%}. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, We recommend encoder fine-tuning with hyperparameter optimization optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation.(\href{https://github.com/Sk-2103/Cryo-Bench}{GitHub}).
[544] InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo
Main category: cs.CV
TL;DR: InterCoG: A text-vision interleaved chain-of-grounding framework for fine-grained image editing in complex multi-entity scenes with spatial reasoning.
Details
Motivation: Existing unified editing models struggle with fine-grained editing in complex multi-entity scenes where targets are not visually salient and require spatial reasoning. There's a need for better spatial localization and reasoning in image editing tasks.Method: Proposes InterCoG framework with three-step reasoning: 1) text-only object position reasoning with spatial relations, 2) visual grounding via bounding boxes and masks, 3) rewriting editing descriptions. Includes two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment. Also introduces GroundEdit-45K dataset and GroundEdit-Bench for evaluation.
Result: Extensive experiments demonstrate superiority in highly precise edits under spatially intricate and multi-entity scenes. The framework shows strong capabilities in fine-grained editing where targets require spatial reasoning.
Conclusion: InterCoG effectively addresses the challenge of fine-grained editing in complex scenes through interleaved text-vision reasoning, achieving precise spatial localization and interpretable reasoning for multi-entity editing tasks.
Abstract: Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
[545] PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Bo Ma, Jinsong Wu, Weiqi Yan, Catherine Shi, Minh Nguyen
Main category: cs.CV
TL;DR: PPEDCRF is a privacy-preserving framework that protects location privacy in dashcam videos by selectively perturbing only location-sensitive background regions while maintaining foreground object detection utility.
Details
Motivation: Dashcam videos from autonomous driving systems are shared for safety auditing but pose location privacy risks even without GPS metadata, as attackers can match visual background cues against street-view imagery to infer recording locations.Method: Proposes PPEDCRF with three components: 1) dynamic CRF for temporal consistency in tracking location-sensitive regions, 2) normalized control penalty for hierarchical sensitivity-based perturbation allocation, and 3) utility-preserving noise injection that minimizes interference with object detection/segmentation.
Result: Significantly reduces location-retrieval attack success (Top-k accuracy) while maintaining competitive detection performance (mAP and segmentation metrics) compared to baselines like global noise, white-noise masking, and feature-based anonymization.
Conclusion: PPEDCRF effectively balances location privacy protection with preservation of foreground detection utility in dashcam videos, addressing the tension between privacy and utility in autonomous driving data sharing.
Abstract: Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is in https://github.com/mabo1215/PPEDCRF.git
[546] Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
Jiaqi Leng, Shuyuan Tu, Haidong Cao, Sicheng Xie, Daoguo Dong, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: PSD is a novel framework for human preference alignment in text-to-3D generation that uses pretrained 2D reward models without requiring 3D training data, reformulating preference alignment as a classifier-free guidance mechanism.
Details
Motivation: Human preference alignment is critical but underexplored in text-to-3D generation. Existing methods require task-specific fine-tuning which is challenging in data-scarce 3D domains. There's a need for methods that can leverage existing 2D reward models without 3D training data.Method: Proposes Preference Score Distillation (PSD), an optimization-based framework that: 1) Uses pretrained 2D reward models for human-aligned text-to-3D synthesis, 2) Reformulates preference alignment as a classifier-free guidance mechanism through implicit reward modeling, 3) Introduces adaptive co-optimization of preference scores and negative text embeddings, 4) Incorporates CFG during optimization for dynamic enhancement of alignment.
Result: PSD demonstrates superiority in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility. It successfully bridges human preference alignment with CFG theory under score distillation framework.
Conclusion: PSD provides an effective solution for human preference alignment in text-to-3D generation without requiring 3D training data, offering a novel connection between preference alignment and classifier-free guidance theory.
Abstract: Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that similar issue occurs in the naive classifier guidance in conditioned diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
[547] Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement
Xiwen Wang, Shichao Zhang, Hailun Zhang, Ruowei Wang, Mao Li, Chenyu Zhou, Qijun Zhao, Ji-Zhe Zhou
Main category: cs.CV
TL;DR: Dehallu3D is a method to reduce hallucinations in large 3D reconstruction models by enforcing multi-view continuity constraints to eliminate structural outliers while preserving sharp geometric features.
Details
Motivation: Large 3D reconstruction models suffer from hallucinations (structural outliers like odd holes/protrusions) that are underexplored compared to other large models, leading to malformed 3D-printed objects and poor virtual reality immersion. These hallucinations originate from reconstructing 3D content from sparsely generated multi-view images with large viewpoint gaps.Method: Proposes Dehallu3D with a plug-and-play optimization module using two constraints: (1) adjacent consistency for geometric continuity across views, and (2) adaptive smoothness to retain fine details. Also introduces Outlier Risk Measure (ORM) metric to quantify geometric fidelity from outlier perspective.
Result: Extensive experiments show Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
Conclusion: Dehallu3D successfully mitigates hallucinations in 3D reconstruction models through balanced multi-view continuity constraints, enabling more reliable 3D content generation for applications like VR and 3D printing.
Abstract: Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations majorly originate from that existing methods reconstruct 3D content from sparsely generated multi-view images which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details.We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
[548] physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection
Yuting Wan, Liguo Sun, Jiuwu Hao, Zao Zhang, Pin LV
Main category: cs.CV
TL;DR: PhysFusion: A physics-informed radar-image fusion framework for water-surface target detection in USVs, addressing sparse radar point clouds and challenging maritime conditions through specialized radar encoding and temporal fusion.
Details
Motivation: Water-surface target detection for USVs is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. While 4D radar complements cameras in poor lighting, maritime radar point clouds are sparse and intermittent with heavy-tailed reflectivity variations, making conventional fusion approaches ineffective.Method: Three main components: (1) Physics-Informed Radar Encoder (PIR Encoder) with RCS Mapper and Quality Gate for transforming radar attributes into scattering priors and predicting point-wise reliability; (2) Radar-guided Interactive Fusion Module (RIFM) for query-level radar-image fusion using a dual-stream backbone with point-based local stream and transformer-based global stream with Scattering-Aware Self-Attention (SASA); (3) Temporal Query Aggregation module for aggregating frame-wise fused queries over temporal windows.
Result: Achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (with T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW dataset under radar+camera setting. Ablation studies confirm contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
Conclusion: PhysFusion effectively addresses the challenges of water-surface perception by integrating physics-informed radar encoding with multimodal fusion, demonstrating superior performance in maritime target detection through specialized handling of sparse radar data and temporal consistency.
Abstract: Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
[549] Sparse View Distractor-Free Gaussian Splatting
Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao, Dongjun Ye, Renjing Xu
Main category: cs.CV
TL;DR: A framework that enhances distractor-free 3D Gaussian Splatting under sparse-view conditions by incorporating geometric and semantic priors from foundation models.
Details
Motivation: Existing distractor-free 3DGS methods work well with dense image captures but degrade significantly under sparse input conditions due to unreliable color residual heuristics.Method: Uses VGGT geometry foundation model for camera parameter estimation and initial 3D points, leverages VGGT attention maps for semantic entity matching, and employs Vision-Language Models to identify and preserve large static scene regions.
Result: Extensive experiments confirm effectiveness and robustness in mitigating transient distractors for sparse-view 3DGS training.
Conclusion: The proposed framework successfully addresses sparse-view limitations in distractor-free 3DGS by integrating rich prior information from foundation models.
Abstract: 3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on the color residual heuristics to guide the training, which becomes unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
[550] Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment
Yuchen Zou, Xiao Hu, Dexing Zhong, Yuqing Tang
Main category: cs.CV
TL;DR: A hierarchical search framework with semantic alignment for monocular re-localization using OpenStreetMap (OSM) as a lightweight, privacy-preserving alternative to dense maps.
Details
Motivation: Traditional monocular re-localization methods rely on dense maps with scalability limitations and privacy risks. OpenStreetMap offers a lightweight, privacy-protecting alternative with global scalability, but faces challenges with cross-modal discrepancies between images and OSM, and high computational costs of global map-based localization.Method: 1) Uses DINO-ViT’s semantic awareness capability to deconstruct visual elements and establish semantic relationships with OSM. 2) Implements a coarse-to-fine search paradigm to replace global dense matching, enabling efficient progressive refinement.
Result: Extensive experiments show significant improvements in both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall outperforms the 5° recall of state-of-the-art methods.
Conclusion: The proposed hierarchical search framework with semantic alignment effectively addresses cross-modal discrepancies and computational challenges in OSM-based localization, offering a scalable and privacy-preserving solution for monocular re-localization.
Abstract: Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight map that protects privacy, offers semantic and geometric information with global scalability. Nonetheless, there are still challenges in using OSM for localization: the inherent cross-modal discrepancies between natural images and OSM, as well as the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness capability of DINO-ViT is utilised to deconstruct visual elements to establish semantic relationships with OSM. Second, a coarse-to-fine search paradigm is designed to replace global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.
[551] MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising
Peiyuan Jing, Chun-Wun Cheng, Liutao Yang, Zhenxuan Zhang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier A. Montoya-Zegarra
Main category: cs.CV
TL;DR: MAP-Diff: A multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising that uses intermediate-dose scans as trajectory anchors to regularize the reverse diffusion process toward dose-aligned intermediate states.
Details
Motivation: Low-dose PET reduces radiation exposure but suffers from severe noise and quantitative degradation. Existing diffusion-based denoising models have unconstrained reverse trajectories that are not aligned with the progressive nature of PET dose formation.Method: Introduces clinically observed intermediate-dose scans as trajectory anchors, enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states, calibrates anchor timesteps via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and uses timestep-weighted anchor loss for stable stage-wise learning.
Result: Consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On internal dataset: PSNR improved from 42.48 dB to 43.71 dB (+1.23 dB), SSIM increased to 0.986, NMAE reduced from 0.115 to 0.103 (-0.012). Performance gains generalize across scanners with 34.42 dB PSNR and 0.141 NMAE on external cohort.
Conclusion: MAP-Diff effectively regularizes diffusion reverse trajectories using multi-dose anchors, enabling progressive, dose-consistent intermediate restoration from ultra-low-dose PET inputs while achieving state-of-the-art denoising performance.
Abstract: Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
[552] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon
Main category: cs.CV
TL;DR: Spectrum: A training-free method that uses Chebyshev polynomial approximation with ridge regression to forecast diffusion features across multiple future steps, enabling 4-5x speedups while maintaining quality.
Details
Motivation: Diffusion models are bottlenecked by slow inference due to numerous iterative passes. Existing feature caching methods rely on local approximations that cause error accumulation with large skips, degrading quality at high speedups.Method: Treats latent features as functions over time and approximates them with Chebyshev polynomials. Uses ridge regression to fit coefficients for each basis, then forecasts features at multiple future diffusion steps. This provides global, long-range feature reuse with controlled error.
Result: Achieves up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B while maintaining higher sample quality compared to baselines. The approach works with various state-of-the-art image and video diffusion models.
Conclusion: Spectrum enables significant inference speedups for diffusion models through global feature forecasting with theoretical error guarantees, outperforming local approximation methods.
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79$\times$ speedup on FLUX.1 and 4.67$\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
[553] DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
Enhui Ma, Jiahuan Zhang, Guantian Zheng, Tao Tang, Shengbo Eben Li, Yuhang Lu, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Zhihui Hao, Xianpeng Lang, Kaicheng Yu
Main category: cs.CV
TL;DR: DriveCombo: A multimodal benchmark for compositional traffic rule reasoning in autonomous driving, featuring a Five-Level Cognitive Ladder to evaluate MLLMs from single-rule understanding to multi-rule conflict resolution.
Details
Motivation: Existing benchmarks for autonomous driving MLLMs focus on simple single-rule scenarios like traffic sign recognition, but real-world driving involves complex multi-rule concurrency and conflicts. Current models perform well on simple tasks but fail in complex real-world situations, creating a need for more comprehensive evaluation frameworks.Method: 1) Proposed DriveCombo benchmark with text and vision-based compositional traffic rule reasoning. 2) Introduced a Five-Level Cognitive Ladder inspired by human drivers’ cognitive development, evaluating reasoning from single-rule understanding to multi-rule integration and conflict resolution. 3) Developed a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation for scene-level traffic rule visual reasoning.
Result: Evaluation of 14 mainstream MLLMs showed performance drops as task complexity increased, particularly during rule conflicts. After dataset splitting and fine-tuning on the training set, substantial improvements were observed in both traffic rule reasoning and downstream planning capabilities.
Conclusion: DriveCombo effectively advances compliant and intelligent autonomous driving systems by providing a comprehensive benchmark for evaluating MLLMs’ compositional reasoning abilities in complex traffic scenarios, highlighting the importance of multi-rule understanding and conflict resolution.
Abstract: Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers’ cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
[554] MSP-ReID: Hairstyle-Robust Cloth-Changing Person Re-Identification
Xiangyang He, Lin Wan
Main category: cs.CV
TL;DR: MSP framework addresses cloth-changing person re-identification by reducing hairstyle dependence and preserving structural information through hairstyle augmentation, cloth-preserved erasing, and parsing attention.
Details
Motivation: Existing CC-ReID methods focus on removing apparel and using head regions but treat the head holistically without distinguishing face from hair, leading to over-reliance on volatile hairstyle cues that degrade performance under hairstyle changes.Method: Proposes MSP framework with three components: 1) Hairstyle-Oriented Augmentation (HSOA) generates intra-identity hairstyle diversity, 2) Cloth-Preserved Random Erasing (CPRE) performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape, 3) Region-based Parsing Attention (RPA) uses parsing-guided priors to highlight face/limb regions while suppressing hair features.
Result: Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing robust and practical solution for long-term person re-identification.
Conclusion: MSP effectively addresses hairstyle distraction in cloth-changing person re-identification by reducing hairstyle dependence while preserving structural information, leading to improved robustness and performance.
Abstract: Cloth-Changing Person Re-Identification (CC-ReID) aims to match the same individual across cameras under varying clothing conditions. Existing approaches often remove apparel and focus on the head region to reduce clothing bias. However, treating the head holistically without distinguishing between face and hair leads to over-reliance on volatile hairstyle cues, causing performance degradation under hairstyle changes. To address this issue, we propose the Mitigating Hairstyle Distraction and Structural Preservation (MSP) framework. Specifically, MSP introduces Hairstyle-Oriented Augmentation (HSOA), which generates intra-identity hairstyle diversity to reduce hairstyle dependence and enhance attention to stable facial and body cues. To prevent the loss of structural information, we design Cloth-Preserved Random Erasing (CPRE), which performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape and context. Furthermore, we employ Region-based Parsing Attention (RPA) to incorporate parsing-guided priors that highlight face and limb regions while suppressing hair features. Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing a robust and practical solution for long-term person re-identification.
[555] QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image
Rundong Wang, Wei Ba, Ying Zhou, Yingtai Li, Bowen Liu, Baizhi Wang, Yuhao Wang, Zhidong Yang, Kun Zhang, Rui Yan, S. Kevin Zhou
Main category: cs.CV
TL;DR: QCAgent is an agentic framework for quality-controllable pathology report generation from whole-slide images that uses critique mechanisms and iterative refinement to ground statements in visual evidence.
Details
Motivation: Current pathology report generation methods fail to ground fine-grained statements in localized visual evidence and lack control over which diagnostic details to include and how to verify them. The authors are inspired by pathologists' diagnostic workflow and emerging agentic analysis paradigms.Method: QCAgent incorporates a customized critique mechanism guided by user-defined checklists specifying required diagnostic details and constraints. It re-identifies informative regions in WSIs based on critique feedback and text-patch semantic retrieval, iteratively enriching and reconciling reports.
Result: Experiments show that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSIs.
Conclusion: The proposed agentic framework successfully addresses limitations of current methods by enabling quality-controllable, evidence-grounded pathology report generation that mimics pathologists’ diagnostic workflow.
Abstract: Recent methods for pathology report generation from whole-slide image (WSI) are capable of producing slide-level diagnostic descriptions but fail to ground fine-grained statements in localized visual evidence. Furthermore, they lack control over which diagnostic details to include and how to verify them. Inspired by emerging agentic analysis paradigms and the diagnostic workflow of pathologists,who selectively examine multiple fields of view, we propose QCAgent, an agentic framework for quality-controllable WSI report generation. The core innovations of this framework are as follows: (i) it incorporates a customized critique mechanism guided by a user-defined checklist specifying required diagnostic details and constraints; (ii) it re-identifies informative regions in the WSI based on the critique feedback and text-patch semantic retrieval, a process that iteratively enriches and reconciles the report. Experiments demonstrate that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSI.
[556] Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction
Harikrishnan Unnikrishnan
Main category: cs.CV
TL;DR: A detection-gated pipeline combining YOLOv8 detector with U-Net segmenter for robust glottal segmentation in high-speed videoendoscopy, achieving state-of-the-art performance and enabling zero-shot transfer across clinical settings.
Details
Motivation: Existing deep learning models for glottal segmentation in high-speed videoendoscopy produce artifacts in non-glottal frames and fail to generalize across different clinical settings, limiting their clinical utility.Method: Proposed a detection-gated pipeline integrating YOLOv8-based detector with U-Net segmenter, with temporal consistency wrapper to suppress false positives during glottal closure and instrument occlusion. Trained on limited GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on BAGLS dataset.
Result: Achieved state-of-the-art performance on GIRAFE benchmark (DSC 0.81) and superior generalizability on BAGLS (DSC 0.85) without institutional fine-tuning. Automated kinematic features remained consistent with clinical benchmarks, with glottal area coefficient of variation being significant marker for distinguishing healthy from pathological vocal function (p=0.006).
Conclusion: Detection-gated architecture provides lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use, enabling robust zero-shot transfer and standardized extraction of clinical biomarkers across diverse endoscopy platforms.
Abstract: Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari-krishnan/openglottal.
[557] PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Xianqi Wang, Hao Yang, Hangtian Wang, Junda Cheng, Gangwei Xu, Min Lin, Xin Yang
Main category: cs.CV
TL;DR: PromptStereo introduces Prompt Recurrent Unit (PRU) for stereo matching, using monocular depth foundation models with stereo prompts to improve zero-shot generalization through better iterative refinement.
Details
Motivation: Existing stereo matching methods focus on feature extraction and disparity initialization but neglect iterative refinement for zero-shot generalization. Current GRU-based architectures can't effectively leverage monocular depth priors as guidance due to limited representation capacity.Method: Proposes Prompt Recurrent Unit (PRU), an iterative refinement module based on monocular depth foundation model decoders. Integrates monocular structure and stereo motion cues as prompts to enrich latent representations with stereo-scale information while preserving monocular depth priors.
Result: PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets while maintaining comparable or faster inference speed compared to existing methods.
Conclusion: Prompt-guided iterative refinement is a promising direction for zero-shot stereo matching, showing that effectively integrating monocular depth priors with stereo information during refinement significantly improves generalization.
Abstract: Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.
[558] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu
Main category: cs.CV
TL;DR: FluxMem is a training-free framework for efficient streaming video understanding that compresses redundant visual memory through hierarchical two-stage token compression.
Details
Motivation: Streaming video understanding requires processing long video sequences efficiently, but existing methods suffer from high computational costs and memory usage due to redundant visual information across frames and within frames.Method: Two-stage hierarchical compression: 1) Temporal Adjacency Selection (TAS) removes redundant tokens across adjacent frames, 2) Spatial Domain Consolidation (SDC) merges spatially repetitive regions within frames. Both use self-adaptive token compression based on scene statistics.
Result: State-of-the-art on streaming benchmarks: 76.4 on StreamingBench, 67.2 on OVO-Bench with real-time settings. Reduces latency by 69.9% and GPU memory by 34.5% on OVO-Bench. Maintains strong offline performance (73.1 on MLVU) with 65% fewer tokens.
Conclusion: FluxMem provides an efficient, training-free solution for streaming video understanding through adaptive token compression, achieving significant performance gains while reducing computational and memory overhead.
Abstract: This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
[559] A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs
Aryan Goyal, Shreshtha Singh, Ashish Mittal, Manoj Tadepalli, Piyush Kumar, Preetham Putha
Main category: cs.CV
TL;DR: A diffusion-based framework with LoRA adapters for fine-grained control over synthetic lung nodule generation in chest X-rays, enabling characteristic-controlled synthesis to address data scarcity in medical imaging.
Details
Motivation: Early lung cancer detection via chest radiographs is challenging due to subtle nodule appearance and variability. Assembling diverse training datasets is costly, creating need for realistic synthetic data. Existing methods lack fine-grained control over synthetic nodule characteristics.Method: Proposes diffusion-based framework with LoRA adapters: 1) Base diffusion model trained with nodule mask conditioning for size/shape control; 2) Separate LoRA modules for specific radiological features; 3) Dynamic LoRA composability with novel orthogonality loss to address overlapping attention regions and non-orthogonal parameter spaces.
Result: Extensive experiments on in-house and public datasets show improved downstream nodule detection. Radiologist evaluations confirm fine-grained controllability. Quantitative metrics surpass existing nodule generation approaches for CXRs.
Conclusion: The proposed framework enables characteristic-controlled synthetic nodule generation, addressing data scarcity in medical imaging and improving CAD system performance through realistic, controllable synthetic data.
Abstract: Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule mask conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues, overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
[560] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
Shao Shitong, Gu Yufei, Xie Zeke
Main category: cs.CV
TL;DR: FastLightGen accelerates video generation by simultaneously compressing model size and reducing sampling steps through optimal teacher construction and synergistic distillation.
Details
Motivation: Current powerful video generation models (Hunyuan, WanX, Veo3, Kling) have huge computational overhead from large parameter counts and iterative multi-step sampling, limiting practical deployment. Previous acceleration methods focused separately on reducing sampling steps or compressing model size, but not both simultaneously.Method: Proposes FastLightGen algorithm that transforms large models into fast lightweight counterparts by constructing an optimal teacher model designed to maximize student performance within a synergistic framework for distilling both model size and inference steps.
Result: Experiments on HunyuanVideo-ATI2V and WanX-TI2V show optimal visual quality achieved with 4-step sampling and 30% parameter pruning under constrained inference budget. FastLightGen consistently outperforms all competing methods, establishing new SOTA in efficient video generation.
Conclusion: FastLightGen successfully addresses computational bottlenecks in video generation by simultaneous model compression and sampling step reduction, enabling practical deployment of powerful video generation models.
Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
[561] LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation
Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li
Main category: cs.CV
TL;DR: LiftAvatar is a video diffusion Transformer that completes sparse facial kinematic data and uses it to drive high-fidelity 3D avatar animation, addressing limitations of monocular video-based avatar reconstruction.
Details
Motivation: 3D Gaussian Splatting-based avatars suffer from limited expressiveness and reconstruction artifacts due to sparse kinematic cues (facial expressions, head pose) in everyday monocular videos. There's a need to enhance avatar animation quality by completing incomplete observations.Method: Uses a fine-grained, expression-controllable large-scale video diffusion Transformer that lifts incomplete input into richer kinematic representation. Features: (1) multi-granularity expression control combining shading maps with expression coefficients, (2) multi-reference conditioning mechanism aggregating cues from multiple frames for 3D consistency.
Result: Extensive experiments show LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions. Enables effective prior distillation from large-scale video generative models into 3D pipelines.
Conclusion: LiftAvatar provides a plug-and-play enhancer that addresses sparse kinematic data limitations in monocular videos, improving 3D avatar reconstruction and animation through kinematic space completion and video diffusion-based synthesis.
Abstract: We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
[562] DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs
Aryan Goyal, Ashish Mittal, Pranav Rao, Manoj Tadepalli, Preetham Putha
Main category: cs.CV
TL;DR: DiffusionXRay: A novel image restoration pipeline for Chest X-rays using DDPMs and GANs to enhance synthetic X-ray quality for lung cancer diagnosis
Details
Motivation: Deep learning models for lung cancer diagnosis require extensive annotated datasets, but high-quality labeled data is scarce, especially for subtle pulmonary nodules. Synthetic X-rays generated from CT scans suffer from quality degradation, limiting their usefulness for training.Method: Two-stage training process: 1) Investigate DDPM-LQ and GAN-based MUNIT-LQ approaches to generate low-quality CXRs as a style transfer problem, 2) Train a DDPM-based model on paired low-quality and high-quality images to learn X-ray image restoration.
Result: Promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving clinically significant artifacts, validated by quantitative metrics and expert radiological assessment.
Conclusion: DiffusionXRay addresses the data scarcity problem in medical imaging by improving synthetic X-ray quality through a novel diffusion-based restoration pipeline, potentially enhancing automated lung cancer diagnosis.
Abstract: Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRR) using CT-Scan to generate synthetic frontal chest X-rays with artificially inserted lung nodules offers one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for Chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process: First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity, posing this as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.
[563] CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
Gong Chen, Chaokun Zhang, Pengcheng Lv
Main category: cs.CV
TL;DR: CoopDiff: A diffusion-based cooperative perception framework that uses teacher-student paradigm with denoising to handle diverse corruptions in multi-agent perception systems.
Details
Motivation: Real-world cooperative perception faces challenges from diverse and unpredictable corruptions that undermine robustness and generalization. Current methods lack effective mechanisms to handle various degradation types in multi-agent scenarios.Method: Uses teacher-student paradigm: Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, produces clean supervision via diffusion denoiser. Dual-Branch Diffusion Student separates ego/cooperative streams for encoding, uses Ego-Guided Cross-Attention for balanced decoding under degradation.
Result: Outperforms prior methods across all degradation types on OPV2Vn and DAIR-V2Xn benchmarks with six corruption types, lowers relative corruption error, offers tunable balance between precision and inference efficiency.
Conclusion: CoopDiff effectively addresses corruption challenges in cooperative perception through diffusion-based denoising, demonstrating superior robustness and generalization across diverse degradation scenarios.
Abstract: Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher’s clean targets. And then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
[564] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Main category: cs.CV
TL;DR: Kiwi-Edit introduces a scalable data generation pipeline for reference-guided video editing, creating RefVIE dataset and a unified architecture that combines learnable queries with latent features for improved instruction and reference fidelity.
Details
Motivation: Current instruction-based video editing methods struggle with precise visual control due to limitations of natural language in describing complex visual nuances. Reference-guided editing offers a solution but is bottlenecked by scarcity of high-quality paired training data.Method: 1) Introduces a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets using image generative models to create synthesized reference scaffolds. 2) Constructs RefVIE dataset for instruction-reference-following tasks and RefVIE-Bench for evaluation. 3) Proposes Kiwi-Edit architecture that synergizes learnable queries and latent visual features for reference semantic guidance with progressive multi-stage training.
Result: The model achieves significant gains in instruction following and reference fidelity. Extensive experiments demonstrate that the data and architecture establish a new state-of-the-art in controllable video editing.
Conclusion: The proposed scalable data generation pipeline, RefVIE dataset, and Kiwi-Edit architecture collectively advance the field of controllable video editing by addressing the data scarcity problem and improving reference-guided editing capabilities.
Abstract: Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code is released at https://github.com/showlab/Kiwi-Edit.
[565] Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
Tianqi Shen, Huakao Lin, Ning An
Main category: cs.CV
TL;DR: A Siamese visual tracker redesign using MLP-based fusion with differentiable NAS to optimize channel-width/depth balance for efficient pixel-level interactions on constrained hardware.
Details
Motivation: Current Siamese trackers using CNN or Transformer architectures struggle with efficient pixel-level interactions on resource-constrained hardware, creating an accuracy-efficiency imbalance.Method: Redesign Siamese neck with MLP-based fusion module for pixel-level interaction with minimal overhead. Use hierarchical search space of MLP modules with customized relaxation strategy for differentiable NAS to decouple channel-width optimization from architectural choices.
Result: Achieves state-of-the-art accuracy-efficiency trade-offs, top performance on 4 general-purpose and 3 aerial tracking benchmarks, with real-time performance on constrained GPUs and NPUs.
Conclusion: The MLP-based fusion with targeted NAS optimization effectively addresses the accuracy-efficiency imbalance in visual tracking, enabling efficient pixel-level interactions on resource-constrained hardware.
Abstract: Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perception (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
[566] Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
Main category: cs.CV
TL;DR: Proposes a robust framework combining CoAtNet architecture with model soups for Intangible Cultural Heritage image classification in low-resource settings, achieving state-of-the-art results on ICH-17 dataset.
Details
Motivation: Addresses challenges in ICH image classification in the Mekong Delta: limited annotated data, high visual similarity among classes, and domain heterogeneity. Conventional deep learning models suffer from high variance or overfitting in such low-resource settings.Method: Integrates hybrid CoAtNet architecture (combining convolution and self-attention) with model soups - a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. Uses greedy and uniform soup strategies to combine diverse checkpoints.
Result: Achieves state-of-the-art results on ICH-17 dataset (7,406 images across 17 classes): 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming ResNet-50, DenseNet-121, and ViT baselines. Model soups reduce variance while introducing minimal bias.
Conclusion: Diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks. Model soups select geometrically diverse checkpoints unlike traditional ensemble methods.
Abstract: The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduces variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups selects geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
[567] WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Gong Chen, Chaokun Zhang, Xinyan Zhao
Main category: cs.CV
TL;DR: WhisperNet is a bandwidth-aware collaborative perception framework for autonomous driving that uses receiver-centric global coordination to dynamically allocate feature sharing across agents, achieving state-of-the-art performance with minimal communication cost.
Details
Motivation: Current collaborative perception methods for autonomous driving face bandwidth constraints. Fixed-rate compression adapts poorly to changing environments, while spatial selection methods sacrifice global context for efficiency, weakening holistic scene understanding.Method: Proposes a receiver-centric paradigm where senders generate lightweight saliency metadata, receivers formulate global request plans to dynamically budget feature contributions across agents, and a collaborative feature routing module aligns messages before fusion for structural consistency.
Result: Achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of communication cost. As plug-and-play component, boosts strong baselines with 5% of full bandwidth while maintaining robustness under localization noise.
Conclusion: Globally-coordinated allocation across what and where to share is key to efficient collaborative perception. The receiver-centric approach with dynamic feature budgeting enables superior performance with minimal bandwidth.
Abstract: Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce \textit{WhisperNet}, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency. Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across \textit{what} and \textit{where} to share is the key to achieving efficient collaborative perception.
[568] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Divyanshu Daiya, Aniket Bera
Main category: cs.CV
TL;DR: Sketch2Colab converts 2D storyboard sketches into coherent 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts using a sketch-driven diffusion prior distilled into an efficient rectified-flow student with differentiable constraints and CTMC planning for coordinated interactions.
Details
Motivation: Existing diffusion-based motion generators struggle with precise adherence to rich interaction constraints in multi-human scenarios, requiring extensive training and/or costly posterior guidance, with performance degradation under strong multi-entity conditioning.Method: 1) Learn sketch-driven diffusion prior, 2) Distill into efficient rectified-flow student in latent space, 3) Apply differentiable energies over keyframes, trajectories, and physics-based constraints to shape transport field, 4) Augment with continuous-time Markov chain (CTMC) planner for scheduling discrete events like touches, grasps, and handoffs.
Result: Achieves state-of-the-art constraint adherence and perceptual quality on CORE4D and InterHuman datasets while offering significantly faster inference than diffusion-only baselines.
Conclusion: Sketch2Colab enables precise, physically plausible 3D multi-human motion generation from 2D sketches with fine-grained control, addressing limitations of conventional diffusion approaches through efficient rectified-flow distillation and CTMC-based event planning.
Abstract: We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student’s transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
[569] Dual Distillation for Few-Shot Anomaly Detection
Le Dong, Qinzhong Tan, Chunlei Li, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou
Main category: cs.CV
TL;DR: D²4FAD is a dual distillation framework for few-shot anomaly detection in medical imaging that uses only a small number of normal reference images to identify anomalies in unseen tasks.
Details
Motivation: Current unsupervised anomaly detection methods require large amounts of normal training data and struggle to generalize across different anatomical contexts, which is problematic for medical imaging where early pathology detection is crucial.Method: Uses a pre-trained encoder as teacher network to extract multi-scale features from support and query images, with a student decoder that distills knowledge from teacher on query images and self-distills on support images. Includes a learn-to-weight mechanism that dynamically assesses reference value of each support image conditioned on the query.
Result: Outperforms existing approaches on a comprehensive benchmark dataset of 13,084 images across four organs, four imaging modalities, and five disease categories, establishing new state-of-the-art in few-shot medical anomaly detection.
Conclusion: D²4FAD provides an effective solution for few-shot anomaly detection in medical imaging, enabling pathology identification with minimal normal reference data and better generalization across anatomical contexts.
Abstract: Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D$^2$4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at https://github.com/ttttqz/D24FAD.
[570] Adaptive Confidence Regularization for Multimodal Failure Detection
Moru Liu, Hao Dong, Olga Fink, Mario Trapp
Main category: cs.CV
TL;DR: ACR: Adaptive Confidence Regularization framework for detecting multimodal failures by penalizing confidence degradation between multimodal and unimodal predictions, plus synthetic failure generation via feature swapping.
Details
Motivation: Multimodal models in high-stakes domains (self-driving, medical diagnostics) need reliable failure detection mechanisms, but this problem remains largely unexplored in multimodal contexts.Method: Proposes Adaptive Confidence Regularization (ACR) with two components: 1) Adaptive Confidence Loss that penalizes when multimodal confidence is lower than unimodal branches (confidence degradation), 2) Multimodal Feature Swapping for outlier synthesis to generate challenging failure-aware training examples.
Result: Extensive experiments across four datasets, three modalities, and multiple evaluation settings show ACR achieves consistent and robust gains in failure detection performance.
Conclusion: ACR provides an effective framework for multimodal failure detection that improves reliability by addressing confidence degradation and using synthetic failure training.
Abstract: The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.
[571] Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints
Ruize Cui, Jialun Pei, Haiqiao Wang, Jun Zhou, Jeremy Yuen-Chun Teoh, Pheng-Ann Heng, Jing Qin
Main category: cs.CV
TL;DR: Land-Reg is a correspondence-driven deformable registration framework for laparoscopic liver surgery that learns interpretable 2D-3D landmark correspondences to bridge cross-modal alignment between preoperative 3D liver models and intraoperative 2D laparoscopic views.
Details
Motivation: Existing registration methods for augmented reality in laparoscopic liver surgery lack explicit modeling of reliable 2D-3D geometric correspondences with latent evidence support, resulting in limited interpretability and potentially unstable alignment in clinical scenarios.Method: The framework includes: 1) Cross-modal Latent Alignment module for rigid registration, 2) Uncertainty-enhanced Overlap Landmark Detector with similarity matching for robust 2D-3D correspondence estimation, and 3) For non-rigid registration, a shape-constrained supervision strategy with reprojection consistency, local-isometric regularization, and rendered-mask alignment.
Result: Experimental results on the P2ILF dataset demonstrate superiority in both rigid pose estimation and non-rigid deformation compared to existing methods.
Conclusion: Land-Reg provides an interpretable, correspondence-driven approach for deformable registration in laparoscopic liver surgery, addressing limitations of existing methods through explicit landmark correspondence learning and robust geometric constraints.
Abstract: In laparoscopic liver surgery, augmented reality technology enhances intraoperative anatomical guidance by overlaying 3D liver models from preoperative CT/MRI onto laparoscopic 2D views. However, existing registration methods lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, leading to limited interpretability and potentially unstable alignment in clinical scenarios. In this work, we introduce Land-Reg, a correspondence-driven deformable registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation to bridge cross-modal alignment. For rigid registration, Land-Reg embraces a Cross-modal Latent Alignment module to map multi-modal features into a unified latent space. Further, an Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences. For non-rigid registration, we design a novel shape-constrained supervision strategy that anchors shape deformation to matched landmarks through reprojection consistency and incorporates local-isometric regularization to alleviate inherent 2D-3D depth ambiguity, while a rendered-mask alignment enforces global shape consistency. Experimental results on the P2ILF dataset demonstrate the superiority of our method on both rigid pose estimation and non-rigid deformation. Our code will be available at https://github.com/cuiruize/Land-Reg.
[572] Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration
Guanglu Dong, Chunlei Li, Chao Ren, Jingliang Hu, Yilei Shi, Xiao Xiang Zhu, Lichao Mou
Main category: cs.CV
TL;DR: DATPRL-IR: First multi-domain all-in-one image restoration method using domain-aware task prompt representation learning with adaptive prompt selection from multimodal LLMs.
Details
Motivation: Existing all-in-one image restoration methods are limited to specific domains (natural scenes, medical, remote sensing). The authors aim to extend this capability to handle multiple domains simultaneously.Method: Proposes Domain-Aware Task Prompt Representation Learning with two prompt pools: task prompts (implicitly encode task knowledge) and domain prompts (distill domain priors from multimodal LLMs). Uses Prompt Composition Mechanism to adaptively select and compose prompts into instance-level representations, then fuses them for restoration guidance.
Result: Extensive experiments show DATPRL-IR significantly outperforms existing SOTA image restoration methods and exhibits strong generalization capabilities.
Conclusion: The proposed multi-domain all-in-one image restoration approach successfully leverages both task-specific and shared knowledge across domains through domain-aware prompt representation learning.
Abstract: Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scene, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. Code is available at https://github.com/GuangluDong0728/DATPRL-IR.
[573] Action-Guided Attention for Video Action Anticipation
Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz
Main category: cs.CV
TL;DR: Proposes Action-Guided Attention (AGA) for video action anticipation, using predicted action sequences as queries/keys to guide attention and improve generalization beyond explicit visual cues.
Details
Motivation: Existing transformer-based approaches for video action anticipation lack high-level semantics and tend to overfit to explicit visual cues in past frames, limiting their ability to capture underlying intentions and generalize to unseen samples.Method: Introduces Action-Guided Attention (AGA) that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Uses a gating function to combine attention-weighted past information with current frame embeddings, enabling post-training analysis of learned knowledge.
Result: Demonstrates strong generalization from validation to unseen test sets on EPIC-Kitchens-100 benchmark. Post-training analysis reveals captured action dependencies and counterfactual evidence, providing transparent insights into anticipative predictions.
Conclusion: AGA improves action anticipation by focusing on relevant past moments based on upcoming activities, offering better generalization and interpretable insights through post-training analysis of learned knowledge.
Abstract: Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach fosters the attention module to emphasize relevant moments from the past based on the upcoming activity and combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
[574] NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
Rong Fu, Yiqing Lyu, Chunlei Meng, Muge Qi, Yabin Jin, Qi Zhao, Li Bao, Juntao Gao, Fuqian Shi, Nilanjan Dey, Wei Luo, Simon Fong
Main category: cs.CV
TL;DR: NeuroSymb-MRG integrates neurosymbolic reasoning with active uncertainty minimization to generate structured, clinically grounded radiology reports, addressing visual-linguistic biases and factual inconsistencies.
Details
Motivation: Existing radiology report generation methods suffer from visual-linguistic biases, factual inconsistency, and lack explicit multi-hop clinical reasoning, despite progress in fluency.Method: Unified framework that maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes into templated clauses, and refines via retrieval and constrained language-model editing with active sampling guided by rule-level uncertainty.
Result: Experiments on standard benchmarks show consistent improvements in factual consistency and standard language metrics compared to representative baselines.
Conclusion: NeuroSymb-MRG effectively addresses key limitations in radiology report generation through neurosymbolic reasoning and active uncertainty minimization, producing more clinically grounded and factually consistent reports.
Abstract: Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.
[575] StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
Keli Liu, Zhendong Wang, Wengang Zhou, Houqiang Li
Main category: cs.CV
TL;DR: StepVAR is a training-free token pruning framework that accelerates Visual AutoRegressive (VAR) models by jointly considering structural and textural importance, using high-pass filters for texture and PCA for structure, with feature propagation to maintain generation quality.
Details
Motivation: VAR models have quadratic inference cost growth at high resolutions, with later scales being computationally intensive but mainly refining high-frequency textures with spatial redundancy. Existing pruning methods focus on high-frequency detection but overlook structural coherence, degrading global semantics.Method: Proposes StepVAR with dual-criterion token pruning: 1) lightweight high-pass filter to capture local texture details, 2) Principal Component Analysis (PCA) to preserve global structural information. Uses nearest neighbor feature propagation to reconstruct dense feature maps from pruned representations for valid next-scale prediction.
Result: Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models show StepVAR achieves substantial inference speedups while maintaining generation quality. Outperforms existing acceleration approaches in both quantitative and qualitative evaluations.
Conclusion: StepVAR effectively accelerates VAR inference by jointly considering structural and textural importance, demonstrating general applicability across diverse VAR architectures while maintaining generation quality.
Abstract: Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
[576] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li, Jian Yang
Main category: cs.CV
TL;DR: BabelRS is a language-pivoted pretraining framework for heterogeneous multi-modal remote sensing object detection that decouples modality alignment from downstream tasks using language as a semantic pivot.
Details
Motivation: Existing approaches for heterogeneous multi-modal remote sensing object detection use late alignment where modality alignment and task-specific optimization are entangled during fine-tuning, leading to unstable training and suboptimal generalization.Method: Proposes BabelRS with two key components: 1) Concept-Shared Instruction Aligning (CSIA) aligns each sensor modality to shared linguistic concepts using language as semantic pivot, 2) Layerwise Visual-Semantic Annealing (LVSA) progressively aggregates multi-scale visual features to provide fine-grained semantic guidance.
Result: Extensive experiments show BabelRS stabilizes training and consistently outperforms state-of-the-art methods without additional enhancements.
Conclusion: BabelRS provides an effective unified framework for heterogeneous multi-modal remote sensing object detection by decoupling modality alignment from downstream tasks through language-pivoted pretraining.
Abstract: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
[577] Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
Minseok Seo, Wonjun Lee, Jaehyuk Jang, Changick Kim
Main category: cs.CV
TL;DR: A lightweight test-time adaptation method for zero-shot depth completion that updates only a low-dimensional decoder subspace instead of full network optimization, achieving state-of-the-art performance with improved efficiency.
Details
Motivation: Existing zero-shot depth completion methods are computationally expensive due to iterative denoising (diffusion-based) or repeated forward-backward passes (visual-prompt-based). There's a need for more efficient test-time optimization that maintains accuracy.Method: The paper proposes adapting only the decoder for test-time optimization, based on the insight that depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. The method updates only this low-dimensional subspace using sparse depth supervision.
Result: Achieves state-of-the-art performance on five indoor and outdoor datasets, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Demonstrates consistent improvements over prior methods.
Conclusion: The proposed lightweight test-time adaptation method enables fast zero-shot depth completion by efficiently updating only the decoder subspace, making it practical for real-world applications while maintaining high accuracy.
Abstract: Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward–backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
[578] Neural Operator-Grounded Continuous Tensor Function Representation and Its Applications
Ruoyang Su, Xi-Le Zhao, Sheng Liu, Wei-Hao Wu, Yisi Luo, Michael K. Ng
Main category: cs.CV
TL;DR: NO-CTR is a neural operator-grounded continuous tensor function representation that uses continuous nonlinear mode-n operators instead of discrete linear mode-n products, enabling better representation of complex real-world data across various grid types.
Details
Motivation: Current continuous tensor function representations are limited by discrete, linear mode-n products, which restricts their ability to faithfully represent complex real-world data and causes discretization artifacts. The authors aim to develop a more genuine continuous representation.Method: Propose neural operator-grounded mode-n operators as continuous, nonlinear alternatives to discrete, linear mode-n products. These operators directly map continuous core tensor functions to continuous target tensor functions. Build NO-CTR (neural operator-grounded continuous tensor function representation) using these operators.
Result: Theoretical proof that any continuous tensor function can be approximated by NO-CTR. Experimental validation through a multi-dimensional data completion model shows superiority across various data types: multi-spectral images, color videos, Sentinel-2 images (different resolutions), and point clouds.
Conclusion: NO-CTR provides a more faithful representation of complex real-world data compared to classic discrete tensor representations and existing continuous tensor function representations, effectively handling data on regular grids, grids with different resolutions, and beyond mesh grids.
Abstract: Recently, continuous tensor functions have attracted increasing attention, because they can unifiedly represent data both on mesh grids and beyond mesh grids. However, since mode-$n$ product is essentially discrete and linear, the potential of current continuous tensor function representations is still locked. To break this bottleneck, we suggest neural operator-grounded mode-$n$ operators as a continuous and nonlinear alternative of discrete and linear mode-$n$ product. Instead of mapping the discrete core tensor to the discrete target tensor, proposed mode-$n$ operator directly maps the continuous core tensor function to the continuous target tensor function, which provides a genuine continuous representation of real-world data and can ameliorate discretization artifacts. Empowering with continuous and nonlinear mode-$n$ operators, we propose a neural operator-grounded continuous tensor function representation (abbreviated as NO-CTR), which can more faithfully represent complex real-world data compared with classic discrete tensor representations and continuous tensor function representations. Theoretically, we also prove that any continuous tensor function can be approximated by NO-CTR. To examine the capability of NO-CTR, we suggest an NO-CTR-based multi-dimensional data completion model. Extensive experiments across various data on regular mesh grids (multi-spectral images and color videos), on mesh girds with different resolutions (Sentinel-2 images) and beyond mesh grids (point clouds) demonstrate the superiority of NO-CTR.
[579] Affine Correspondences in Stereo Vision: Theory, Practice, and Limitations
Levente Hajder
Main category: cs.CV
TL;DR: Paper investigates affine transformations for stereo vision, focusing on how transformation accuracy affects 3D reconstruction quality, with novel methods for estimating local affine transformations from image directions and fundamental matrix.
Details
Motivation: Affine transformations have shown promise in stereo vision applications like surface normal estimation, homography computation, and 3D reconstruction. The paper aims to understand how the accuracy of these affine transformations impacts the quality of 3D reconstruction results.Method: 1) Overviews fundamental statements for affine transformations and epipolar geometry. 2) Investigates how transformation accuracy influences 3D reconstruction quality. 3) Proposes novel techniques for estimating local affine transformation from corresponding image directions. 4) Exploits fundamental matrix related to image pairs. 5) Conducts synthetic and real quantitative evaluations using a specially constructed object with three perpendicular planes with chessboard patterns.
Result: Quantitative evaluations based on reconstructed surface normal accuracy show estimation accuracy around a few degrees for realistic test cases. Special stereo poses and plane orientations are evaluated in detail.
Conclusion: Affine transformations provide effective means for stereo vision applications with good accuracy in surface normal estimation (few degrees error). The relationship between transformation accuracy and 3D reconstruction quality is established, with proposed methods showing practical utility.
Abstract: Affine transformations have been recently used for stereo vision. They can be exploited in various computer vision application, e.g., when estimating surface normals, homographies, fundamental and essential matrices. Even full 3D reconstruction can be obtained by using affine correspondences. First, this paper overviews the fundamental statements for affine transformations and epipolar geometry. Then it is investigated how the transformation accuracy influences the quality of the 3D reconstruction. Besides, we propose novel techniques for estimating the local affine transformation from corresponding image directions; moreover, the fundamental matrix, related to the processed image pair, can also be exploited. Both synthetic and real quantitative evaluations are implemented based on the accuracy of the reconstructed surface normals. For the latter one, a special object, containing three perpendicular planes with chessboard patterns, is constructed. The quantitative evaluations are based on the accuracy of the reconstructed surface normals and it is concluded that the estimation accuracy is around a few degrees for realistic test cases. Special stereo poses and plane orientations are also evaluated in detail.
[580] LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization
Kuangyi Chen, Jun Zhang, Yuxi Hu, Yi Zhou, Friedrich Fraundorfer
Main category: cs.CV
TL;DR: LEAR: A dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to align sparse event camera data with dense LiDAR maps for robust pose estimation in GPS-denied environments.
Details
Motivation: Event cameras provide high-temporal-resolution sensing for challenging conditions (high-speed motion, poor lighting), but aligning their sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed due to modality gaps. Existing methods struggle with direct correspondence estimation between these different sensing modalities.Method: Proposes LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields. Uses cross-modal fusion to inject modality-invariant geometric cues into motion representation, and iterative refinement to enforce mutual consistency between edge and flow estimation tasks over multiple update steps.
Result: Achieves superior performance over prior methods on several popular and challenging datasets. Produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers.
Conclusion: LEAR effectively bridges the sensing-modality divide between event cameras and LiDAR through synergistic edge and flow estimation, enabling reliable localization in GPS-denied and visually degraded environments.
Abstract: Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
[581] GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection
Yutong Yang, Katarina Popović, Julian Wiederer, Markus Braun, Vasileios Belagiannis, Bin Yang
Main category: cs.CV
TL;DR: GroupEnsemble: Efficient uncertainty estimation for DETR-like models using parallel query groups in a single forward pass, combining with MC-Dropout for hybrid approach.
Details
Motivation: DETR models only capture semantic uncertainty but not spatial uncertainty, while Deep Ensembles are memory-intensive and MC-Dropout has high latency. Need efficient uncertainty estimation for autonomous systems.Method: Adds multiple diverse groups of object queries to transformer decoder with attention mask to prevent inter-group interactions. Each group predicts independent detection sets in parallel during single forward pass.
Result: Hybrid approach (MC-Dropout + GroupEnsemble) outperforms Deep Ensembles on several metrics at fraction of cost, validated on Cityscapes and COCO datasets.
Conclusion: GroupEnsemble provides efficient uncertainty estimation for DETR models, addressing both semantic and spatial uncertainty with single-pass inference.
Abstract: Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores only reflect semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of the detection reliability. On the other hand, Deep Ensembles can tackle this by providing high-quality spatial uncertainty estimates. However, their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need of multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring each group detects independently to achieve reliable ensemble-based uncertainty estimation. By leveraging the decoder’s inherent parallelism, GroupEnsemble efficiently estimates uncertainty in a single forward pass without sequential repetition. We validated our method under autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
[582] Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
Alexander Prutsch, David Schinagl, Horst Possegger
Main category: cs.CV
TL;DR: Streaming-based trajectory forecasting for autonomous vehicles using temporal context propagation with endpoint-aware modeling for real-time continuous prediction.
Details
Motivation: Real-world autonomous driving requires continuous, real-time trajectory prediction with low latency, but current research focuses on snapshot-based approaches that treat scenarios independently without temporal context.Method: Proposes a lightweight streaming approach that integrates information from previous predictions using endpoint-aware modeling. Uses trajectory endpoints from previous forecasts as anchors to extract targeted scenario context encodings, efficiently guiding scene encoder without refinement iterations.
Result: Achieves state-of-the-art streaming trajectory prediction results on Argoverse 2 multi-agent and single-agent benchmarks while requiring substantially fewer resources and significantly reducing inference latency.
Conclusion: The approach effectively relays information across consecutive timesteps, making it well-suited for real-world autonomous driving deployment by providing consistent predictions with low latency.
Abstract: Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse~2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.
[583] CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection
Yiheng Li, Zichang Tan, Guoqing Xu, Yijun Ye, Yang Yang, Zhen Lei
Main category: cs.CV
TL;DR: CTForensics dataset and ESF-CTFD model for detecting synthetic CT images, addressing security risks in medical imaging by capturing forgery artifacts across spatial, frequency, and wavelet domains.
Details
Motivation: Synthetic CT images pose security risks in medical applications, but existing detection methods are limited by lack of comprehensive datasets and reliance on natural image techniques that miss CT-specific artifacts.Method: Proposes CTForensics dataset with 10 diverse CT generative methods, and ESF-CTFD network that extracts features across wavelet, spatial, and frequency domains using Wavelet-Enhanced Central Stem, Spatial Process Blocks, and Frequency Process Blocks.
Result: ESF-CTFD consistently outperforms existing methods and shows superior generalization across different CT generative models.
Conclusion: The comprehensive dataset and specialized detector effectively address CT forgery detection challenges, improving security in medical imaging applications.
Abstract: With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In this view, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.
[584] Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling
Muyu Liu, Xuanyu Tian, Chenhe Du, Qing Wu, Hongjiang Wei, Yuyao Zhang
Main category: cs.CV
TL;DR: CaMB-Diff: A diffusion-based framework with cascaded monotonic Bernstein operators for blind inverse problems in radiometric recovery from unknown dynamic range compression.
Details
Motivation: Recovering radiometric fidelity from unknown dynamic range compression (UDRC) is challenging due to unknown forward models and irreversible information loss. Current methods lack physical consistency constraints.Method: Proposes CaMB-Diff: identifies monotonicity as physical invariant, introduces cascaded monotonic Bernstein (CaMB) operator to parameterize unknown forward model with hard architectural inductive bias, integrates with plug-and-play diffusion framework for structural/semantic recovery.
Result: Significantly outperforms state-of-the-art zero-shot baselines on low-light enhancement, low-field MRI enhancement, and HDR reconstruction in signal fidelity and physical consistency. Validates CaMB’s effectiveness in modeling unknown forward operators.
Conclusion: CaMB-Diff successfully addresses blind inverse UDRC problems by combining diffusion geometric priors with physically-grounded monotonic operators, enabling robust radiometric recovery across diverse applications.
Abstract: Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as low-light enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the \textbf{cascaded monotonic Bernstein} (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing \textbf{CaMB-Diff}. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.
[585] Generative Visual Chain-of-Thought for Image Editing
Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Bing Li, Zheng Chang, Kongming Liang, Qinglin Lu, Zhanyu Ma
Main category: cs.CV
TL;DR: GVCoT is a unified framework for precise image editing that uses visual reasoning chains to first generate spatial cues for localization, then execute edits, enabling more accurate editing under complex scenes and nuanced instructions.
Details
Motivation: Existing image editing methods struggle with perceiving where to edit, especially under complex scenes and nuanced spatial instructions. There's a need for more interpretable and precise image editing that can handle sophisticated spatial reasoning.Method: Proposes Generative Visual Chain-of-Thought (GVCoT) that performs native visual reasoning by first generating spatial cues to localize target regions, then executing edits. Uses end-to-end joint optimization of visual tokens across reasoning and editing phases. Constructs GVCoT-Edit-Instruct dataset (1.8M samples, 19 tasks) and employs progressive training: supervised fine-tuning for foundational localization, followed by reinforcement learning.
Result: GVCoT consistently outperforms state-of-the-art models on SREdit-Bench (new benchmark for sophisticated scenes and fine-grained referring expressions) and ImgEdit benchmarks.
Conclusion: GVCoT enables more interpretable and precise image editing through innate spatial reasoning ability and effective utilization of visual-domain cues, inspiring future research in this direction.
Abstract: Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This way fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GCVoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
[586] Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport
Muyu Liu, Chenhe Du, Xuanyu Tian, Qing Wu, Xiao Wang, Haonan Zhang, Hongjiang Wei, Yuyao Zhang
Main category: cs.CV
TL;DR: DACT is a zero-shot diffusion framework that reconstructs high-field quality MRI images from low-field data without paired supervision, using optimal transport to correct contrast distribution shifts.
Details
Motivation: Low-field MRI is more accessible but suffers from poor signal-to-noise ratio and distorted tissue contrast. Reconstructing high-field quality images from low-field data is challenging due to lack of paired training data and unknown non-linear contrast transformations.Method: DACT combines a pre-trained high-field diffusion prior with a physically-informed adaptive forward model. It uses a differentiable Sinkhorn optimal transport module to explicitly model and correct intensity distribution shifts between low-field and high-field domains during reverse diffusion.
Result: Extensive experiments on simulated and real clinical low-field datasets show DACT achieves state-of-the-art performance, producing reconstructions with superior structural detail and correct tissue contrast.
Conclusion: DACT provides an effective zero-shot solution for high-quality MRI reconstruction from low-field data without paired supervision, addressing the fundamental limitations of low-field MRI through diffusion modeling and optimal transport.
Abstract: Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT(Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior to ensure anatomical fidelity with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.
[587] LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
Main category: cs.CV
TL;DR: LaST-VLA introduces a physically grounded latent spatio-temporal reasoning framework for autonomous driving VLAs, addressing semantic-perceptual decoupling by distilling geometric constraints and dynamic foresight into latent space.
Details
Motivation: Current Vision-Language-Action models for autonomous driving suffer from semantic-perceptual decoupling and perceptual-symbolic conflicts due to reliance on explicit textual Chain-of-Thought reasoning. Recent latent reasoning approaches lack physical grounding and operate as physics-agnostic representations.Method: Proposes Latent Spatio-Temporal VLA (LaST-VLA) with dual-feature alignment to distill geometric constraints from 3D foundation models and dynamic foresight from world models into latent space. Uses progressive SFT training transitioning from feature alignment to trajectory generation, refined via Reinforcement Learning with Group Relative Policy Optimization for safety.
Result: Sets new records on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), and excels in spatial-temporal reasoning on SURDS and NuDynamics benchmarks.
Conclusion: LaST-VLA successfully shifts reasoning from discrete symbolic processing to physically grounded latent spatio-temporal Chain-of-Thought, improving autonomous driving performance through better integration of perception and planning.
Abstract: While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. \method~setting a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on SURDS and NuDynamics benchmarks.
[588] BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation
Haitian Wang, Xinyu Wang, Muhammad Ibrahim, Dustin Severtson, Ajmal Mian
Main category: cs.CV
TL;DR: VISA is a two-stream segmentation network for weed mapping in cereal fields using UAV multispectral imagery, with separate radiance and vegetation-index streams fused at native resolution, achieving state-of-the-art performance on the new BAWSeg dataset.
Details
Motivation: Existing weed mapping methods using multispectral UAV imagery have limitations: thresholded vegetation indices are brittle under radiometric drift and mixed pixels, while single-stream CNNs/Transformers have interference between radiance and index cues, reducing sensitivity to small weed clusters in crop canopies.Method: Proposes VISA, a two-stream segmentation network that decouples radiance and vegetation-index cues. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention. The index stream operates on vegetation-index maps with windowed self-attention, state-space layers for field-scale context, and Slot Attention for stable region descriptors. The network fuses streams at native resolution.
Result: On the BAWSeg dataset, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming multispectral SegFormer-B1 by 1.2 mIoU and 1.9 weed IoU. Maintains 71.2% and 69.2% mIoU under cross-plot and cross-year protocols respectively.
Conclusion: VISA effectively addresses limitations of existing weed mapping methods by decoupling radiance and index cues, achieving robust performance across fields and seasons. The BAWSeg dataset supports supervised training and deployment-oriented evaluation.
Abstract: Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop–weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopies. We propose VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention to preserve fine textures and row boundaries that are attenuated by ratio indices. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The BAWSeg data, VISA code, and trained models will be released upon publication.
[589] MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
Dinh Nam Pham, Leonard Prokisch, Bennet Meyer, Jonas Thumbs
Main category: cs.CV
TL;DR: MobileMold: An open smartphone-based microscopy dataset for food mold detection and classification using clip-on microscopes with 4,941 images across 11 food types, 4 smartphones, and 3 microscopes.
Details
Motivation: To create an accessible, low-cost food safety sensing system using smartphone clip-on microscopes that can detect mold beyond visual inspection, addressing the need for portable, affordable food spoilage detection.Method: Collected 4,941 handheld microscopy images across 11 food types using 4 different smartphones and 3 clip-on microscopes under diverse real-world conditions. Established baselines for mold detection and food classification using pretrained deep learning architectures with various augmentation strategies.
Result: Achieved near-ceiling performance with accuracy = 0.9954, F1 = 0.9954, MCC = 0.9907 for mold detection. Also provided saliency-based visual explanations highlighting mold regions associated with predictions.
Conclusion: MobileMold dataset enables effective food spoilage detection using smartphone microscopy and contributes to research on accessible food-safety sensing, mobile imaging, and smartphone-enhanced attachments.
Abstract: Smartphone clip-on microscopes turn everyday devices into low-cost, portable imaging systems that can even reveal fungal structures at the microscopic level, enabling mold inspection beyond unaided visual checks. In this paper, we introduce MobileMold, an open smartphone-based microscopy dataset for food mold detection and food classification. MobileMold contains 4,941 handheld microscopy images spanning 11 food types, 4 smartphones, 3 microscopes, and diverse real-world conditions. Beyond the dataset release, we establish baselines for (i) mold detection and (ii) food-type classification, including a multi-task setting that predicts both attributes. Across multiple pretrained deep learning architectures and augmentation strategies, we obtain near-ceiling performance (accuracy = 0.9954, F1 = 0.9954, MCC = 0.9907), validating the utility of our dataset for detecting food spoilage. To increase transparency, we complement our evaluation with saliency-based visual explanations highlighting mold regions associated with the model’s predictions. MobileMold aims to contribute to research on accessible food-safety sensing, mobile imaging, and exploring the potential of smartphones enhanced with attachments.
[590] PreSight: Preoperative Outcome Prediction for Parkinson’s Disease via Region-Prior Morphometry and Patient-Specific Weighting
Yand Wang, Chen Zhang, Lanyun Zhu, Yixin Chen, Qunbo Wang, Yutong Bai, Jurgen Germann, Yinghong Wen, Shuai Shao
Main category: cs.CV
TL;DR: PreSight: A multimodal deep learning model that predicts postoperative motor improvement for Parkinson’s disease surgery by fusing clinical priors with preoperative MRI and deformation-based morphometry, using patient-specific weighting for regional importance.
Details
Motivation: Predicting surgical outcomes for Parkinson's disease is clinically important but challenging due to subtle imaging signals and patient heterogeneity. Current approaches lack reliable preoperative prediction methods that integrate multimodal data with clinical knowledge.Method: PreSight combines clinical priors with preoperative MRI and deformation-based morphometry (DBM) features. It uses a patient-specific weighting module to adapt regional importance, producing end-to-end calibrated predictions with patient-level explanations.
Result: Achieved 88.89% accuracy on internal validation and 85.29% on external-center test for responder classification. Outperformed clinical, imaging-only, and multimodal baselines with better probability calibration and higher decision-curve net benefit.
Conclusion: Integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support for Parkinson’s disease surgery, demonstrating the value of patient-specific multimodal fusion.
Abstract: Preoperative improvement rate prediction for Parkinson’s disease surgery is clinically important yet difficult because imaging signals are subtle and patients are heterogeneous. We address this setting, where only information available before surgery is used, and the goal is to predict patient-specific postoperative motor benefit. We present PreSight, a presurgical outcome model that fuses clinical priors with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module. The model produces end-to-end, calibrated, decision-ready predictions with patient-level explanations. We evaluate PreSight on a real-world two-center cohort of 400 subjects with multimodal presurgical inputs and postoperative improvement labels. PreSight outperforms strong clinical, imaging-only, and multimodal baselines. It attains 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification and shows better probability calibration and higher decision-curve net benefit. Ablations and analyses confirm the contribution of DBM and the patient-specific weighting module and indicate that the model emphasizes disease-relevant regions in a patient-specific manner. These results demonstrate that integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support in routine practice.
[591] Robust White Blood Cell Classification with Stain-Normalized Decoupled Learning and Ensembling
Luu Le, Hoang-Loc Cao, Ha-Hieu Pham, Thanh-Huy Nguyen, Ulas Bagci
Main category: cs.CV
TL;DR: A stain-normalized, decoupled training framework for robust white blood cell classification that addresses appearance variations and class imbalance, achieving top performance on the WBCBench 2026 challenge.
Details
Motivation: Real-world WBC datasets have substantial appearance variations from staining/scanning conditions and severe class imbalance where rare but clinically important categories are underrepresented, making robust classification challenging.Method: Proposes a stain-normalized, decoupled training framework: 1) learns transferable representations using instance-balanced sampling, 2) rebalances classifier with class-aware sampling and hybrid loss (effective-number weighting + focal modulation), 3) enhances robustness with ensemble of trained backbones and test-time augmentation.
Result: Achieved top rank on the leaderboard of the WBCBench 2026: Robust White Blood Cell Classification Challenge at ISBI 2026.
Conclusion: The proposed framework effectively addresses appearance variations and class imbalance in WBC classification, demonstrating state-of-the-art performance on benchmark challenges.
Abstract: White blood cell (WBC) classification is fundamental for hematology applications such as infection assessment, leukemia screening, and treatment monitoring. However, real-world WBC datasets present substantial appearance variations caused by staining and scanning conditions, as well as severe class imbalance in which common cell types dominate while rare but clinically important categories are underrepresented. To address these challenges, we propose a stain-normalized, decoupled training framework that first learns transferable representations using instance-balanced sampling, and then rebalances the classifier with class-aware sampling and a hybrid loss combining effective-number weighting and focal modulation. In inference stage, we further enhance robustness by ensembling various trained backbones with test-time augmentation. Our approach achieved the top rank on the leaderboard of the WBCBench 2026: Robust White Blood Cell Classification Challenge at ISBI 2026.
[592] Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng
Main category: cs.CV
TL;DR: REFORM is a reasoning-driven framework for multimodal manipulation detection that shifts from outcome fitting to process modeling, using forensic rationales and reinforcement learning for logical consistency.
Details
Motivation: Existing manipulation detection methods focus on classification under result-oriented supervision, lacking interpretability and generalization to unseen manipulation patterns. The paper argues that generalizable detection requires explicit forensic reasoning rather than just classifying limited manipulation types.Method: REFORM uses a three-stage curriculum: 1) induces forensic rationales, 2) aligns reasoning with final judgments, and 3) refines logical consistency via reinforcement learning. The framework is supported by ROM, a large-scale dataset with rich reasoning annotations.
Result: REFORM establishes new state-of-the-art performance with superior generalization: 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
Conclusion: The reasoning-driven approach to manipulation detection provides better interpretability and generalization compared to traditional classification-based methods, addressing limitations of existing approaches in multimodal media manipulation detection.
Abstract: Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
[593] NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis
Manuel Serna-Aguilera, Raegan Anderes, Page Dobbs, Khoa Luu
Main category: cs.CV
TL;DR: NICO Dataset: A multimodal dataset with 200K+ images and text descriptions of 55 tobacco/nicotine products, plus NICO-RAG framework for efficient multimodal retrieval without expensive image token processing.
Details
Motivation: Address the public health crisis of nicotine addiction by providing researchers with comprehensive multimodal data on tobacco products and enabling efficient analysis through retrieval-augmented generation without costly image processing.Method: Created NICO Dataset with 200K+ multimodal samples (images + text) across 55 brands. Developed NICO-RAG framework that organizes image/text entities into hypergraphs for retrieval, using image features directly instead of processing image tokens through language models.
Result: NICO-RAG performs comparably to state-of-the-art image-adapted RAG methods on over 100 questions without needing to process additional image tokens, demonstrating efficient multimodal retrieval capability.
Conclusion: The NICO Dataset and NICO-RAG framework provide valuable resources for public health research on nicotine products, enabling efficient multimodal analysis without the computational costs of processing image tokens through large language models.
Abstract: The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product development, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high-cost of language models, as well as the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce as factual responses as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experimentals show that without needing to process additional tokens from images for over 100 questions, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
[594] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo
Main category: cs.CV
TL;DR: WorldStereo bridges camera-guided video generation and 3D reconstruction using geometric memory modules to generate multi-view-consistent videos for high-quality 3D scene reconstruction.
Details
Motivation: Current Video Diffusion Models (VDMs) produce high-quality videos but struggle with 3D scene reconstruction due to limited camera controllability and inconsistent content across different camera trajectories, making it challenging to create consistent 3D scenes from generated videos.Method: Proposes WorldStereo framework with two geometric memory modules: 1) global-geometric memory for precise camera control and coarse structural priors via incrementally updated point clouds, and 2) spatial-stereo memory that constrains attention receptive fields with 3D correspondence to focus on fine-grained details from memory bank. Uses flexible control branch-based architecture with distribution matching distilled VDM backbone without joint training.
Result: WorldStereo generates multi-view-consistent videos under precise camera control, enabling high-quality 3D reconstruction. Demonstrates effectiveness across camera-guided video generation and 3D reconstruction benchmarks. Acts as powerful world model for diverse scene generation tasks (perspective or panoramic images) with high-fidelity 3D results.
Conclusion: WorldStereo successfully bridges video generation and 3D reconstruction through geometric memory modules, addressing camera control and multi-view consistency challenges. The framework shows impressive efficiency and serves as a versatile world model for 3D scene generation tasks.
Abstract: Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model’s attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
[595] ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks
Joël Küchler, Ellen van Maren, Vaiva Vasiliauskaitė, Katarina Vulić, Reza Abbasi-Asl, Stephan J. Ihle
Main category: cs.CV
TL;DR: ORGAN is a novel object-centric representation learning method using cycle-consistent GANs instead of autoencoders, capable of handling complex real-world datasets with many objects and low contrast.
Details
Motivation: Current object-centric representation learning is dominated by autoencoder architectures, which may struggle with challenging real-world datasets. The authors aim to develop an alternative approach using GANs that can better handle complex scenes with many objects and low visual contrast.Method: ORGAN uses cycle-consistent Generative Adversarial Networks for object-centric representation learning. It segments images into objects and represents each object in a low-dimensional latent space, leveraging GAN-based architecture rather than traditional autoencoders.
Result: ORGAN performs similarly to state-of-the-art approaches on synthetic datasets while being uniquely capable of handling challenging real-world datasets with many objects and low visual contrast. It creates expressive latent representations for object manipulation and scales well with object count and image size.
Conclusion: ORGAN demonstrates that GAN-based approaches can be effective for object-centric representation learning, offering advantages over autoencoder-based methods for complex real-world scenarios, with good scalability and expressive latent representations.
Abstract: Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
[596] MMNavAgent: Multi-Magnification WSI Navigation Agent for Clinically Consistent Whole-Slide Analysis
Zhengyang Xu, Han Li, Jingsong Liu, Linrui Xie, Xun Ma, Xin You, Shihui Zu, Ayako Ito, Xinyu Hao, Hongming Xu, Shaohua Kevin Zhou, Nassir Navab, Peter J. Schüffler
Main category: cs.CV
TL;DR: MMNavAgent: A clinically consistent multi-magnification WSI navigation agent that models cross-magnification interactions and adaptive magnification selection for improved whole-slide image diagnosis.
Details
Motivation: Existing AI navigation approaches for whole-slide image diagnosis operate at fixed magnifications or use predefined magnification traversal, which doesn't match clinical practice where pathologists examine slides across multiple magnifications and dynamically integrate global and cellular evidence.Method: Proposes MMNavAgent with two key components: 1) Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations, and 2) Magnification Selection Tool (MST) that uses memory-driven reasoning for interactive and adaptive magnification selection.
Result: Extensive experiments on a public dataset show improved diagnostic performance with 1.45% gain in AUC and 2.93% gain in BACC over non-agent baselines.
Conclusion: The proposed MMNavAgent better mimics real diagnostic workflows by enabling multi-magnification interaction and adaptive magnification selection, leading to improved WSI diagnosis performance.
Abstract: Recent AI navigation approaches aim to improve Whole-Slide Image (WSI) diagnosis by modeling spatial exploration and selecting diagnostically relevant regions, yet most operate at a single fixed magnification or rely on predefined magnification traversal. In clinical practice, pathologists examine slides across multiple magnifications and selectively inspect only necessary scales, dynamically integrating global and cellular evidence in a sequential manner. This mismatch prevents existing methods from modeling cross-magnification interactions and adaptive magnification selection inherent to real diagnostic workflows. To these, we propose a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that explicitly models multi magnification interaction and adaptive magnification selection. Specifically, we introduce a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations along the navigation path. We further introduce a Magnification Selection Tool (MST) that leverages memory-driven reasoning within the agent framework to enable interactive and adaptive magnification selection, mimicking the sequential decision process of pathologists. Extensive experiments on a public dataset demonstrate improved diagnostic performance, with 1.45% gain of AUC and 2.93% gain of BACC over a non-agent baseline. Code will be public upon acceptance.
[597] From Pixels to Patches: Pooling Strategies for Earth Embeddings
Isaac Corley, Caleb Robinson, Inbal Becker-Reshef, Juan M. Lavista Ferres
Main category: cs.CV
TL;DR: The paper introduces EuroSAT-Embed dataset for evaluating pooling methods in geospatial foundation models, showing that advanced pooling techniques like Generalized Mean Pooling significantly improve geographic generalization and accuracy compared to default mean pooling.
Details
Motivation: As geospatial foundation models transition to pixel-level embeddings, there's a need for effective methods to aggregate thousands of pixel vectors into patch representations that preserve class-discriminative information while matching downstream label resolution. Default mean pooling discards within-patch variability and can significantly reduce accuracy under spatial shift scenarios.Method: The authors introduce EuroSAT-Embed: a dataset of 81,000 embedding GeoTIFFs derived from three foundation models (AlphaEarth, OlmoEarth, and Tessera). They benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits to evaluate geographic generalization.
Result: Richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. Generalized Mean Pooling (GeM) is recommended as a drop-in replacement that improves accuracy without increasing embedding dimensionality. Stats pooling (concatenation of min/max/mean/std pooling) performs best but at 4x the embedding size.
Conclusion: Advanced pooling methods are crucial for geographic generalization in geospatial foundation models. GeM provides a practical balance of improved accuracy without dimensionality increase, while Stats pooling offers maximum accuracy at higher computational cost. Pooling effectiveness varies across embedding sources, with higher-dimensional embeddings benefiting most from distributional statistics.
Abstract: As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increases accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
[598] A 3D mesh convolution-based autoencoder for geometry compression
Germain Bregeon, Marius Preda, Radu Ispas, Titus Zaharia
Main category: cs.CV
TL;DR: A 3D mesh convolution-based autoencoder for geometry compression that handles irregular meshes without preprocessing or manifold constraints, learning features directly from mesh faces while preserving connectivity.
Details
Motivation: Existing 3D mesh compression methods often require preprocessing or impose constraints like manifoldness/watertightness. The authors aim to develop a method that can handle irregular mesh data directly without such limitations while learning meaningful latent representations.Method: Proposes a mesh convolution-based autoencoder that learns features directly from mesh faces. Uses dedicated pooling and unpooling operations to preserve connectivity. Encoder compresses input mesh into a compact base mesh space for comparable latent representations. Decoder reconstructs original connectivity and restores compressed geometry to full resolution.
Result: Extensive experiments on multi-class datasets show the method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks.
Conclusion: The proposed approach effectively handles irregular mesh data without preprocessing or constraints, achieving superior performance in geometry reconstruction and latent space classification compared to existing methods.
Abstract: In this paper, we introduce a novel 3D mesh convolution-based autoencoder for geometry compression, able to deal with irregular mesh data without requiring neither preprocessing nor manifold/watertightness conditions. The proposed approach extracts meaningful latent representations by learning features directly from the mesh faces, while preserving connectivity through dedicated pooling and unpooling operations. The encoder compresses the input mesh into a compact base mesh space, which ensures that the latent space remains comparable. The decoder reconstructs the original connectivity and restores the compressed geometry to its full resolution. Extensive experiments on multi-class datasets demonstrate that our method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks. Code available at: github.com/germainGB/MeshConv3D
[599] Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
Tutian Tang, Xingyu Ji, Yutong Li, MingHao Liu, Wenqiang Xu, Cewu Lu
Main category: cs.CV
TL;DR: Stereo-Inertial Poser: Real-time motion capture system using stereo camera + 6 IMUs for metric-accurate, shape-aware 3D human motion estimation, solving monocular depth ambiguity and anthropometric variation issues.
Details
Motivation: Existing visual-inertial motion capture systems using monocular cameras + sparse IMUs have limitations: metric inaccuracies from monocular depth ambiguity, and shape-agnostic local motion that ignores anthropometric variations.Method: Uses single stereo camera + six IMUs; stereo vision resolves depth ambiguity via calibrated baseline geometry for direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues fused for drift-compensated joint positions and root movements. Novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations.
Result: Achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations show state-of-the-art performance across datasets. Qualitative results demonstrate drift-free global translation under long recording and reduced foot-skating effects.
Conclusion: Stereo-Inertial Poser provides a real-time, metric-accurate, shape-aware motion capture solution that overcomes limitations of monocular systems while maintaining high performance and practical deployment capabilities.
Abstract: Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
[600] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
Main category: cs.CV
TL;DR: SimRecon is a framework for compositional scene reconstruction that creates object-centric representations from real-world videos through a Perception-Generation-Simulation pipeline with bridging modules for visual fidelity and physical plausibility.
Details
Motivation: Current compositional reconstruction approaches focus too much on visual appearance and lack generalization to real-world scenarios. There's a need for a system that can reconstruct cluttered scenes with both visual fidelity and physical plausibility for applications in simulation and interaction.Method: SimRecon uses a three-stage pipeline: 1) Perception: scene-level semantic reconstruction from video, 2) Generation: single-object generation, and 3) Simulation: assembling assets in a simulator. Two bridging modules are introduced: Active Viewpoint Optimization (for Perception→Generation transition) to find optimal 3D viewpoints for object completion, and Scene Graph Synthesizer (for Generation→Simulation transition) to guide constructive scene assembly in 3D simulators.
Result: Extensive experiments on ScanNet dataset show superior performance over previous state-of-the-art approaches, demonstrating improved visual fidelity and physical plausibility in complex scene reconstruction.
Conclusion: SimRecon successfully addresses the challenges of compositional scene reconstruction by integrating perception, generation, and simulation with specialized bridging modules, enabling realistic and physically plausible scene representations for simulation and interaction applications.
Abstract: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize on visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a “Perception-Generation-Simulation” pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method’s superior performance over previous state-of-the-art approaches.
[601] OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
Chong Xia, Fangfu Liu, Yule Wang, Yize Pang, Yueqi Duan
Main category: cs.CV
TL;DR: OnlineX enables real-time 3D scene reconstruction with semantic understanding using streaming images, addressing cumulative drift through decoupled active-stable state evolution.
Details
Motivation: Existing 3D Gaussian Splatting methods are offline and cannot handle continuous reconstruction needed for robotics and VR/AR applications. The cumulative drift issue arises from conflicting requirements for memory states to be both actively updated for local geometry and stable for global structure preservation.Method: Introduces a feed-forward framework with decoupled active-to-stable state evolution: separates memory into active state (for high-frequency local geometry) and stable state (for long-term global structure), then fuses information from active to stable. Jointly models visual appearance and language fields with implicit Gaussian fusion module.
Result: Outperforms prior work in novel view synthesis and semantic understanding across datasets. Shows robust performance with varying input sequence lengths and achieves real-time inference speed.
Conclusion: OnlineX successfully enables online 3D reconstruction with semantic understanding, solving the cumulative drift problem through decoupled state evolution, making it suitable for real-time applications like robotics and VR/AR.
Abstract: Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm, lacking the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in online formulation is the cumulative drift issue, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showcasing robust performance across input sequences of varying lengths with real-time inference speed.
[602] OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Yiying Yang, Wei Cheng, Sijin Chen, Honghao Fu, Xianfang Zeng, Yujun Cai, Gang Yu, Xingjun Ma
Main category: cs.CV
TL;DR: OmniLottie is a framework that generates high-quality vector animations from multimodal instructions by tokenizing Lottie JSON files and leveraging pretrained vision-language models.
Details
Motivation: The paper aims to address the challenge of generating vector animations from multimodal instructions. Raw Lottie JSON files contain extensive structural metadata and formatting tokens that make learning vector animation generation difficult. The researchers want to enable flexible motion and visual content control through a lightweight JSON format for both shapes and animation behaviors.Method: The authors introduce a specialized Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables building OmniLottie upon pretrained vision language models to follow multimodal interleaved instructions. They also curate MMLottie-2M, a large-scale dataset of professionally designed vector animations with textual and visual annotations.
Result: Through extensive experiments, the researchers validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multimodal human instructions. The framework demonstrates the ability to generate high-quality vector animations from multimodal inputs.
Conclusion: OmniLottie successfully addresses the challenges of vector animation generation by introducing a well-designed tokenizer for Lottie JSON files and leveraging pretrained vision-language models. The framework enables flexible control over motion and visual content while advancing research through the curated MMLottie-2M dataset.
Abstract: OmniLottie is a versatile framework that generates high quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a light weight JSON formatting for both shapes and animation behaviors representation. However, the raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions and control parameters. Such tokenizer enables us to build OmniLottie upon pretrained vision language models to follow multi-modal interleaved instructions and generate high quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi modal human instructions.
[603] Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
Kwame Mbobda-Kuate, Gabriel Kasmi
Main category: cs.CV
TL;DR: Smaller YOLO models with high resolution outperform larger models for rooftop PV detection in Earth observation, challenging scaling law assumptions in resource-constrained domains.
Details
Motivation: To test whether scaling laws (larger models with more data consistently outperform smaller ones) hold in resource-constrained Earth observation applications, specifically for rooftop photovoltaic detection in Madagascar.Method: Systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution. Evaluated YOLO models (N to X variants) for rooftop PV detection, optimizing for model efficiency (mAP50 per unit of model size).
Result: Found efficiency inversion: YOLO11N achieved both highest efficiency (24× higher than YOLO11X) and highest absolute mAP50 (0.617). Resolution was dominant factor (+120% efficiency gain), while additional data yielded negligible returns at low resolution. Small high-resolution configurations were Pareto-dominant across all 44 setups.
Conclusion: In data-scarce Earth observation, bigger models are not just unnecessary but can be worse. Small high-resolution configurations are optimal for resource-constrained deployment.
Abstract: Scaling laws assume larger models trained on more data consistently outperform smaller ones – an assumption that drives model selection in computer vision but remains untested in resource-constrained Earth observation (EO). We conduct a systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution, on rooftop PV detection in Madagascar. Optimizing for model efficiency (mAP${50}$ per unit of model size), we find a consistent efficiency inversion: YOLO11N achieves both the highest efficiency ($24\times$ higher than YOLO11X) and the highest absolute mAP${50}$ (0.617). Resolution is the dominant resource allocation lever ($+$120% efficiency gain), while additional data yields negligible returns at low resolution. These findings are robust to the deployment objective: small high-resolution configurations are Pareto-dominant across all 44 setups in the joint accuracy-throughput space, leaving no tradeoff to resolve. In data-scarce EO, bigger is not just unnecessary: it can be worse.
[604] 3D Field of Junctions: A Noise-Robust, Training-Free Structural Prior for Volumetric Inverse Problems
Namhoon Kim, Narges Moeini, Justin Romberg, Sara Fridovich-Keil
Main category: cs.CV
TL;DR: 3D Field of Junctions (3D FoJ) is a novel volumetric representation for 3D image denoising that models each patch as a junction of 3D wedges, providing a training-free structural prior that preserves sharp edges and corners even under low SNR conditions.
Details
Motivation: Volume denoising is crucial for 3D imaging inverse problems that face high measurement noise. Existing methods often struggle with preserving sharp structural features like edges and corners under low signal-to-noise ratio conditions, and neural methods risk hallucination from training data.Method: Proposes a fully volumetric 3D Field of Junctions representation that optimizes a junction of 3D wedges to best explain each 3D patch, while enforcing consistency between overlapping patches. This serves as a structural prior that can be used via projected or proximal gradient descent for volumetric inverse problems.
Result: 3D FoJ successfully reconstructs and denoises volumes across three challenging low-SNR 3D imaging tasks: low-dose X-ray CT, cryogenic electron tomography, and lidar point cloud denoising. It outperforms both classical and neural methods in preserving sharp structural features.
Conclusion: The 3D FoJ representation provides an effective training-free structural prior for volumetric denoising that preserves sharp edges and corners under low SNR conditions, outperforming existing methods across diverse 3D imaging applications.
Abstract: Volume denoising is a foundational problem in computational imaging, as many 3D imaging inverse problems face high levels of measurement noise. Inspired by the strong 2D image denoising properties of Field of Junctions (ICCV 2021), we propose a novel, fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges that best explain each 3D patch of a full volume, while encouraging consistency between overlapping patches. In addition to direct volume denoising, we leverage our 3D FoJ representation as a structural prior that: (i) requires no training data, and thus precludes the risk of hallucination, (ii) preserves and enhances sharp edge and corner structures in 3D, even under low signal to noise ratio (SNR), and (iii) can be used as a drop-in denoising representation via projected or proximal gradient descent for any volumetric inverse problem with low SNR. We demonstrate successful volume reconstruction and denoising with 3D FoJ across three diverse 3D imaging tasks with low-SNR measurements: low-dose X-ray computed tomography (CT), cryogenic electron tomography (cryo-ET), and denoising point clouds such as those from lidar in adverse weather. Across these challenging low-SNR volumetric imaging problems, 3D FoJ outperforms a mixture of classical and neural methods.
[605] Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction
Aniek Eijpe, Soufyan Lakbir, Melis Erdal Cesur, Sara P. Oliveira, Angelos Chatzimparmpas, Sanne Abeln, Wilson Silva
Main category: cs.CV
TL;DR: DIMAFx is an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable representations from histopathology images and transcriptomics data, achieving state-of-the-art performance while maintaining explainability.
Details
Motivation: Multimodal survival prediction models are becoming more accurate but their complexity reduces interpretability, limiting insights into how different data sources influence predictions. There's a need for models that maintain both high performance and explainability for clinical applications.Method: DIMAFx framework produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. It uses SHapley Additive exPlanations (SHAP) for systematic explanation of multimodal interactions and biological information.
Result: Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. In breast cancer survival prediction, it identified modality-shared information capturing solid tumor morphology contextualized by late estrogen response, and modality-specific features capturing microenvironmental signals from adipose and stromal morphologies.
Conclusion: Multimodal models can overcome the traditional trade-off between performance and explainability, supporting their application in precision medicine by providing both accurate predictions and interpretable insights into biological mechanisms.
Abstract: While multimodal survival prediction models are increasingly more accurate, their complexity often reduces interpretability, limiting insight into how different data sources influence predictions. To address this, we introduce DIMAFx, an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. Leveraging its interpretable design and SHapley Additive exPlanations, DIMAFx systematically reveals key multimodal interactions and the biological information encoded in the disentangled representations. In breast cancer survival prediction, the most predictive features contain modality-shared information, including one capturing solid tumor morphology contextualized primarily by late estrogen response, where higher-grade morphology aligned with pathway upregulation and increased risk, consistent with known breast cancer biology. Key modality-specific features capture microenvironmental signals from interacting adipose and stromal morphologies. These results show that multimodal models can overcome the traditional trade-off between performance and explainability, supporting their application in precision medicine.
[606] GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs
Main category: cs.CV
TL;DR: GeoDiT is a diffusion transformer for text-to-satellite image generation with point-based control, using spatial locations and textual descriptions of points as conditioning signals instead of pixel-level maps.
Details
Motivation: Existing controlled satellite image generation models require pixel-level maps that are time-consuming to acquire and semantically limited. There's a need for more flexible, annotation-friendly control mechanisms for satellite image generation.Method: Proposes GeoDiT, a diffusion transformer with point-based conditioning using spatial locations and textual descriptions of points. Introduces an adaptive local attention mechanism that regularizes attention scores based on input point queries. Systematically evaluates domain-specific design choices including satellite image representation for alignment and geolocation representation.
Result: GeoDiT achieves impressive generation performance, surpassing state-of-the-art remote sensing generative models.
Conclusion: The point-based conditioning framework provides semantically rich control signals that enable flexible, annotation-friendly, and computationally simple inference for satellite image generation.
Abstract: We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.
[607] BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving
Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, Hao Yang
Main category: cs.CV
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv
Details
Motivation: Unable to determine motivation as the paper abstract could not be retrieved due to rate limiting (HTTP 429 error)Method: Cannot analyze method due to missing paper content
Result: No results available - paper content inaccessible
Conclusion: Cannot draw conclusions without access to the paper content
Abstract: Failed to fetch summary for 2509.23589: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23589&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Mateus Karvat, Bram Adams, Sidney Givigi
Main category: cs.CV
TL;DR: First large-scale study of software quality in AV perception repositories reveals only 7.3% meet basic production-readiness criteria despite good benchmark performance.
Details
Motivation: There's a significant gap between research excellence in AV perception models and real-world deployment readiness, with current evaluation focusing only on benchmark metrics while ignoring code quality, production readiness, and long-term maintainability in safety-critical systems.Method: Systematically analyzed 178 unique models from KITTI and NuScenes 3D Object Detection leaderboards using static analysis tools (Pylint, Bandit, Radon) to evaluate code errors, security vulnerabilities, maintainability, and development practices.
Result: Only 7.3% of repositories meet basic production-readiness criteria (zero critical errors, no high-severity security vulnerabilities). Security issues are highly concentrated (top five issues responsible for ~80% of occurrences). Adoption of CI/CD pipelines correlated with better code maintainability.
Conclusion: Leaderboard performance doesn’t reflect production readiness; targeted interventions could substantially improve AV perception code quality and safety. Actionable guidelines developed to address common security issues.
Abstract: Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
[609] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Yichen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu, Jiatong Li, Quanwei Zhang, Qiang Lyu, Lanqing Guo, Shilei Wen, Weiqiang Wang, Pheng-Ann Heng
Main category: cs.CV
TL;DR: HiFi-Inpaint is a reference-based inpainting framework for generating human-product images with high-fidelity product detail preservation using Shared Enhancement Attention and Detail-Aware Loss.
Details
Motivation: Human-product images are crucial for advertising/e-commerce, but existing methods struggle with preserving product details. Current limitations include lack of training data, poor detail preservation, and coarse supervision.Method: Proposes HiFi-Inpaint with Shared Enhancement Attention (SEA) to refine product features and Detail-Aware Loss (DAL) for pixel-level supervision using high-frequency maps. Also introduces HP-Image-40K dataset from self-synthesis with automatic filtering.
Result: Experimental results show state-of-the-art performance in generating detail-preserving human-product images.
Conclusion: HiFi-Inpaint effectively addresses product detail preservation challenges in human-product image generation through novel attention mechanisms and loss functions.
Abstract: Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
[610] Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning
Yuhang Liu, Zhen Zhang, Dong Gong, Erdun Gao, Biwei Huang, Mingming Gong, Anton van den Hengel, Kun Zhang, Javen Qinfeng Shi
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2402.06223 exists but cannot be retrieved at this time.
Details
Motivation: Cannot determine motivation due to inability to access paper content.Method: Cannot determine method due to inability to access paper content.
Result: Cannot determine results due to inability to access paper content.
Conclusion: Cannot determine conclusion due to inability to access paper content.
Abstract: Failed to fetch summary for 2402.06223: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.06223&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] OmniTracker: Unifying Object Tracking by Tracking-with-Detection
Junke Wang, Zuxuan Wu, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Yu-Gang Jiang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2303.12079: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2303.12079&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] Task-Driven Lens Design
Xinge Yang, Qiang Fu, Yunfeng Nie, Wolfgang Heidrich
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2305.17185: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2305.17185&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] Velocity Disambiguation for Video Frame Interpolation
Zhihang Zhong, Yiming Zhang, Wei Wang, Xiao Sun, Yu Qiao, Gurunandan Krishnan, Sizhuo Ma, Jian Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2311.08007: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2311.08007&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization
Dang Nguyen, Paymon Haddad, Eric Gan, Baharan Mirzasoleiman
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2404.17768: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.17768&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization
Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Hao Li, Ming Tang, Jinqiao Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2404.13671: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.13671&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] Towards Camera Open-set 3D Object Detection for Autonomous Driving Scenarios
Zhuolin He, Xinrun Li, Jiacheng Tang, Shoumeng Qiu, Wenfu Wang, Xiangyang Xue, Jian Pu
Main category: cs.CV
TL;DR: Unable to analyze paper 2406.17297 due to HTTP 429 error when fetching the abstract from arXiv API
Details
Motivation: Cannot determine motivation as the paper content could not be retrieved due to rate limiting errorMethod: No method information available - failed to fetch paper details
Result: No results available - paper content inaccessible
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper
Abstract: Failed to fetch summary for 2406.17297: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.17297&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] PO-GUISE+: Pose and object guided transformer token selection for efficient driver action recognition
Ricardo Pizarro, Roberto Valle, Rafael Barea, Jose M. Buenaposada, Luis Baumela, Luis Miguel Bergasa
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2407.13750
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to technical limitations in accessing the arXiv APIMethod: No method information available - paper retrieval failed due to HTTP 429 error (Too Many Requests) from arXiv API
Result: No results available - could not access paper content due to rate limiting on arXiv API requests
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content from arXiv
Abstract: Failed to fetch summary for 2407.13750: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.13750&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] A Dataset for Crucial Object Recognition in Blind and Low-Vision Individuals’ Navigation
Md Touhidul Islam, Imran Kabir, Elena Ariel Pearce, Md Alimoor Reza, Syed Masum Billah
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2407.16777 exists but summary cannot be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation without access to paper content. The paper appears to be from July 2024 (2407 prefix).Method: Method unknown due to API rate limiting preventing access to paper details.
Result: Results cannot be determined without access to the paper content.
Conclusion: Unable to analyze paper due to technical limitations in accessing the content.
Abstract: Failed to fetch summary for 2407.16777: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.16777&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] DAWA: Dynamic Ambiguity-Wise Adaptation for Real-Time Domain Adaptive Semantic Segmentation
Taorong Liu, Zhen Zhang, Liang Liao, Jing Xiao, Chia-Wen Lin
Main category: cs.CV
TL;DR: Unable to analyze paper 2409.01072 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2409.01072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.01072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] Latent 3D Brain MRI Counterfactual
Wei Peng, Tian Xia, Fabio De Sousa Ribeiro, Tomas Bosschieter, Ehsan Adeli, Qingyu Zhao, Ben Glocker, Kilian M. Pohl
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to failed paper retrievalMethod: Cannot determine method due to failed paper retrieval
Result: Cannot determine results due to failed paper retrieval
Conclusion: Cannot determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2409.05585: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.05585&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations
Cheng Lei, Jie Fan, Xinran Li, Tianzhu Xiang, Ao Li, Ce Zhu, Le Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2410.16953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.16953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] MV-Adapter: Enhancing Underwater Instance Segmentation via Adaptive Channel Attention
Lianjun Liu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2411.00472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.00472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment
Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2411.17237: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.17237&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] Progressively Exploring and Exploiting Inference Data to Break Fine-Grained Classification Barrier
Li-Jun Zhao, Si-Yuan Zhang, Zhen-Duo Chen, Xin Luo, Xin-Shun Xu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2412.20383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.20383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization
Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2501.10067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.10067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
Darina Koishigarina, Arnas Uselis, Seong Joon Oh
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2502.03566: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.03566&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[627] WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2502.04326: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.04326&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[628] Precise Parameter Localization for Textual Generation in Diffusion Models
Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2502.09935: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.09935&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[629] Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2502.15021: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.15021&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[630] SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Zisheng Chen, Chunwei Wang, Runhui Huang, Hongbin Xu, Xiuwei Chen, Jun Zhou, Jianhua Han, Hang Xu, Xiaodan Liang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2503.06764 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable.
Result: Cannot determine results as paper content is unavailable.
Conclusion: Cannot draw conclusions about the paper due to technical limitations in accessing the content.
Abstract: Failed to fetch summary for 2503.06764: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.06764&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[631] Effective and Efficient Masked Image Generation Models
Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2503.07197: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.07197&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[632] SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Yanbin Hao, Fuli Feng
Main category: cs.CV
TL;DR: Unable to analyze paper 2503.07392 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2503.07392: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.07392&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[633] Certified Circuits: Stability Guarantees for Mechanistic Circuits
Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.22968: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22968&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[634] A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects
Fei Wang, Tingting Zhang, Wei Xi, Han Ding, Ge Wang, Di Zhang, Yuanhao Cui, Fan Liu, Jinsong Han, Jie Xu, Tony Xiao Han
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID is 2503.08008 but no content is available.
Details
Motivation: Unable to determine motivation due to lack of access to paper content.Method: Unable to determine method due to lack of access to paper content.
Result: Unable to determine results due to lack of access to paper content.
Conclusion: Unable to determine conclusion due to lack of access to paper content.
Abstract: Failed to fetch summary for 2503.08008: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.08008&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[635] Target-Aware Video Diffusion Models
Taeksoo Kim, Hanbyul Joo
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to draw conclusions due to missing paper content
Abstract: Failed to fetch summary for 2503.18950: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.18950&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[636] A Multi-Objective Evaluation Framework for Analyzing Utility-Fairness Trade-Offs in Machine Learning Systems
Gökhan Özbulak, Oscar Jimenez-del-Toro, Maíra Fatoretto, Lilian Berton, André Anjos
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2503.11120: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.11120&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[637] VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion
Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2503.23359: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23359&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[638] AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2503.22178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.22178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[639] InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
Shunsuke Sakai, Xiangteng He, Chunzhi Gu, Leonid Sigal, Tatsuhito Hasegawa
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2504.05662 appears to be from April 2024, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content. The HTTP 429 error indicates rate limiting from arXiv API.Method: Cannot analyze method without paper content. The error prevents retrieval of abstract or paper details.
Result: No results available due to access limitations. The arXiv API returned a rate limiting error.
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content.
Abstract: Failed to fetch summary for 2504.05662: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.05662&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[640] VR-FuseNet: A Fusion of Heterogeneous Fundus Data and Explainable Deep Network for Diabetic Retinopathy Classification
Shamim Rahim Refat, Ziyan Shirin Raha, Shuvashis Sarker, Faika Fairuj Preotee, MD. Musfikur Rahman, Tashreef Muhammad, Mohammad Shafiul Alam
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2504.21464: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.21464&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[641] Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology
Alex Hoi Hang Chan, Otto Brookes, Urs Waldmann, Hemal Naik, Iain D. Couzin, Majid Mirmehdi, Noël Adiko Houa, Emmanuelle Normand, Christophe Boesch, Lukas Boesch, Mimi Arandjelovic, Hjalmar Kühl, Tilo Burghardt, Fumihiro Kano
Main category: cs.CV
TL;DR: Unable to analyze paper 2505.02825 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2505.02825: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.02825&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[642] SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios
Ning Cheng, Jinan Xu, Jialing Chen, Bin Fang, Wenjuan Han
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.04201: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.04201&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[643] Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search
Zequn Xie, Haoming Ji, Chengxuan Li, Lingwei Meng
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2505.06566: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.06566&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[644] Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion
Jie Li, Shengwei Tian, Long Yu, Xin Ning
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2505.14218: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.14218&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[645] DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.14362: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.14362&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[646] Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification
Conghao Xiong, Zhengrui Guo, Zhe Xu, Yifei Zhang, Raymond Kai-Yu Tong, Si Yong Yeo, Hao Chen, Joseph J. Y. Sung, Irwin King
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2505.15504: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.15504&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[647] Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
Kwanyoung Kim, Sanghyun Kim
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.17561: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17561&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[648] Wasserstein Distances Made Explainable: Insights Into Dataset Shifts and Transport Phenomena
Philip Naumann, Jacob Kauffmann, Grégoire Montavon
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.06123: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.06123&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[649] Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek
Xueyang Li, Jiahao Li, Yu Song, Yunzhong Lou, Xiangdong Zhou
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2505.17702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[650] Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation
Xuweiyi Chen, Wentao Zhou, Aruni RoyChowdhury, Zezhou Cheng
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.23926: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.23926&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[651] SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation
Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, Jun Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.00523: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.00523&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[652] Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing
Honglu Zhang, Zhiqin Fang, Ningning Zhao, Saihui Hou, Long Ma, Renwang Pei, Zhaofeng He
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2506.01783: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01783&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[653] RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization
Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2505.13289: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.13289&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[654] Improving Wildlife Out-of-Distribution Detection: Africas Big Five
Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval errorMethod: Unable to determine method due to retrieval error
Result: Unable to determine results due to retrieval error
Conclusion: Unable to determine conclusion due to retrieval error
Abstract: Failed to fetch summary for 2506.06719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[655] Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners
Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2505.14042: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.14042&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[656] Probabilistic Kernel Function for Fast Angle Testing
Kejing Lu, Chuan Xiao, Yoshiharu Ishikawa
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.20274: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20274&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[657] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable
Result: No results available due to technical limitations
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.09427: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09427&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[658] GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection
Mariia Seleznova, Hung-Hsu Chou, Claudio Mayrink Verdun, Gitta Kutyniok
Main category: cs.CV
TL;DR: Unable to analyze paper 2505.16017 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2505.16017: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16017&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[659] PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting
Haowen Wang, Xiaoping Yuan, Zhao Jin, Zhen Zhao, Zhengping Che, Yousong Xue, Jin Tian, Yakun Huang, Jian Tang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to draw conclusions due to access restrictions
Abstract: Failed to fetch summary for 2506.09663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[660] VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
Christos Ziakas, Alessandra Russo
Main category: cs.CV
TL;DR: Paper 2506.10085: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2506.10085: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.10085&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[661] Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment
Rui Xu, Yunke Wang, Yong Luo, Bo Du
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2506.22283: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.22283&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[662] MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness
JaeHyuck Choi, MinJun Kim, Je Hyeong Hong
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2507.02314: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.02314&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[663] Topology-Preserved Auto-regressive Mesh Generation in the Manner of Weaving Silk
Gaochao Song, Zibo Zhao, Haohan Weng, Jingbo Zeng, Rongfei Jia, Shenghua Gao
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2507.02477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.02477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[664] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition
Jiuhong Xiao, Yang Zhou, Giuseppe Loianno
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access errorMethod: Cannot determine method due to access error
Result: Cannot determine results due to access error
Conclusion: Cannot determine conclusion due to access error
Abstract: Failed to fetch summary for 2507.03831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[665] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2507.03905: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03905&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[666] Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution
Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji
Main category: cs.CV
TL;DR: Unable to analyze paper 2507.06547 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failed due to rate limiting (HTTP 429)Method: No method information available - arXiv API request resulted in rate limiting error
Result: No results available - paper content could not be retrieved
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper abstract
Abstract: Failed to fetch summary for 2507.06547: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.06547&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[667] CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2507.08776: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.08776&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[668] Advancing Complex Video Object Segmentation via Progressive Concept Construction
Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2507.15852: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.15852&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[669] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2507.23782: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.23782&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[670] HGTS-Former: Hierarchical HyperGraph Transformer for Multivariate Time Series Analysis
Hao Si, Xiao Wang, Fan Zhang, Xiaoya Zhou, Dengdi Sun, Wanli Lyu, Qingquan Yang, Jin Tang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2508.02411: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02411&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[671] Navigating with Annealing Guidance Scale in Diffusion Space
Shai Yehezkel, Omer Dahary, Andrey Voynov, Daniel Cohen-Or
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.24108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.24108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[672] A Morse-Bott Framework for Blind Inverse Problems: Local Recovery Guarantees and the Failure of the MAP
Minh-Hai Nguyen, Edouard Pauwels, Pierre Weiss
Main category: cs.CV
TL;DR: MAP estimation for blind inverse problems analyzed using Morse-Bott framework, showing learned priors create locally stable minima but global landscape still has blurry traps.
Details
Motivation: To understand why MAP-based methods work empirically in blind inverse problems and analyze recovery guarantees by examining the posterior landscape structure.Method: Model image prior potential as Morse-Bott function, analyze Hessian spectrum, validate with experiments on learned priors, and study gradient-based optimization stability.
Result: Learned priors create locally stable minima near ground truth, but blurry traps persist globally. Local stability explains empirical success, but strategic initialization is crucial.
Conclusion: MAP failure in blind deconvolution is intrinsic to landscape structure, not prior quality. Successful recovery requires initialization within favorable basins.
Abstract: Maximum A Posteriori (MAP) estimation is a cornerstone framework for blind inverse problems, where an image and a forward operator are jointly estimated as the maximizers of a posterior distribution. In this paper, we analyze the recovery guarantees of MAP-based methods by adopting a Morse-Bott framework. We model the image prior potential as a Morse-Bott function, where natural images are modeled as residing locally on a critical submanifold. This means that while the potential is locally flat along the natural directions of the image manifold, it is strictly convex in the directions normal to it. We demonstrate that this Morse-Bott hypothesis aligns with the structural properties of state-of-the-art learned priors, a finding we validate through an experimental analysis of the potential landscape and its Hessian spectrum. Our theoretical results show that, in a neighborhood of the ground-truth image and operator, the posterior admits local minimizers that are stable both with respect to initialization (gradient steps converge to the same minimizer) and to small noise perturbations (solutions vary smoothly). This local stability explains the empirical success of well-designed gradient-based optimization in these settings. However, we also demonstrate that this stability is a local property: the blurry trap, well-known for sparse priors in blind deconvolution, persists even with state-of-the-art learned priors. Our findings demonstrate that the failure of MAP in blind deconvolution is not a limitation of prior quality, but an intrinsic characteristic of the landscape. We conclude that successful recovery of posterior maximization depends on strategic initialization within the basin of favorable local minima, and we validate this with numerical experiments on both synthetic and real-world data.
[673] SSFMamba: Learning Symmetry-driven Spatial-Frequency Modeling for Physically Consistent 3D Medical Image Segmentation
Bo Zhang, Yifan Zhang, Shuo Yan, Yu Bai, Zheng Zhang, Wu Liu, Wendong Wang, Yongdong Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2508.03069 appears to be from August 2025, but no content could be retrieved.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2508.03069: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.03069&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[674] Learning Robust Intervention Representations with Delta Embeddings
Panagiotis Alimisis, Christos Diou
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2508.04492: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04492&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[675] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models
Young D. Kwon, Rui Li, Sijia Li, Da Li, Sourav Bhattacharya, Stylianos I. Venieris
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2508.04663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[676] ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2508.11428 suggests it’s from August 2024, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests, preventing retrieval of the paper details.Method: No method information available. The paper ID format suggests it’s a recent submission (August 2024), but the content cannot be fetched due to API rate limiting.
Result: No results available. The arXiv API returned HTTP 429 status, which typically means too many requests have been made in a short period.
Conclusion: Unable to analyze the paper due to technical limitations in accessing the content. The arXiv API rate limiting prevents retrieval of the paper details needed for analysis.
Abstract: Failed to fetch summary for 2508.11428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[677] CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models
Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Paper ID: 2508.11484
Details
Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.Method: Cannot determine method as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.
Result: Cannot determine results as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.
Conclusion: Cannot draw conclusions as paper content is unavailable due to HTTP 429 error when attempting to fetch from arXiv API.
Abstract: Failed to fetch summary for 2508.11484: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11484&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[678] Consistency-Driven Calibration and Matching for Few-Shot Class-Incremental Learning
Qinzhe Wang, Zixuan Chen, Keke Huang, Xiu Su, Chunhua Yang, Chang Xu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2506.19558: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.19558&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[679] MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
Daoze Zhang, Chenghan Fu, Zhanheng Nie, Jianyu Liu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to determine conclusion due to retrieval failure
Abstract: Failed to fetch summary for 2508.11999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[680] Next Visual Granularity Generation
Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2508.12811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[681] Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models
Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Chen Zhu, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li
Main category: cs.CV
TL;DR: S²-Guidance improves diffusion model guidance by using stochastic sub-networks to refine suboptimal predictions, outperforming CFG in text-to-image and text-to-video generation.
Details
Motivation: Classifier-free Guidance (CFG) is widely used but produces suboptimal results due to excessive reliance on inaccurate predictions, leading to semantic incoherence and low-quality outputs in diffusion models.Method: Proposes S²-Guidance which uses stochastic block-dropping during forward process to construct stochastic sub-networks that guide models away from low-quality predictions toward better outputs.
Result: Extensive experiments on text-to-image and text-to-video generation show S²-Guidance consistently surpasses CFG and other advanced guidance strategies in both qualitative and quantitative metrics.
Conclusion: S²-Guidance effectively addresses CFG’s limitations by leveraging model’s own sub-networks to refine predictions, leading to superior generation quality and prompt adherence.
Abstract: Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model’s excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model’s suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S$^2$-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
[682] Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation
Yizhou Liu, Dingkang Yang, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Jingwei Wei, Lihua Zhang
Main category: cs.CV
TL;DR: ARMed introduces an adaptive reinforcement learning framework for open-ended medical VQA that addresses reward collapse by combining textual correctness with adaptive semantic rewards to improve reasoning consistency and factual accuracy.
Details
Motivation: Current reinforcement fine-tuning in medical imaging focuses mainly on closed-ended VQA, which doesn't mirror realistic clinical diagnostic workflows. Open-ended medical VQA is underexplored, and existing semantic reward approaches suffer from reward collapse where distinct semantic responses get similar scores.Method: ARMed first performs supervised fine-tuning on chain-of-thought annotations to inject domain expertise, then uses reinforcement optimization with textual correctness and adaptive semantic rewards to refine reasoning consistency and factual accuracy.
Result: Extensive experiments on six challenging medical VQA benchmarks show ARMed substantially improves both accuracy and generalization compared to existing approaches.
Conclusion: The findings highlight the importance of reward discriminability in medical RL and demonstrate the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.
Abstract: Reinforcement learning (RL) with rule-based reward functions has recently shown great promise in enhancing the reasoning depth and generalization ability of vision-language models (VLMs), while maintaining computational efficiency. In spite of these advances, its adoption in medical imaging remains limited. Current reinforcement fine-tuning (RFT) efforts in this field mainly focus on closed-ended visual question answering (VQA), restricting their applicability to realistic clinical reasoning. However, open-ended medical VQA better mirrors clinical diagnostic workflows but remains underexplored. Although several studies have attempted to bridge the two formats through semantically guided RL, model-driven semantic rewards often suffer from reward collapse, where responses with distinct semantics yield nearly identical scores. To overcome this limitation, we introduce Adaptive Reinforcement for Medical Reasoning (ARMed), a novel RL framework tailored for open-ended medical VQA. ARMed first injects domain expertise through supervised fine-tuning (SFT) on chain-of-thought annotations, followed by reinforcement optimization using textual correctness and adaptive semantic rewards to refine reasoning consistency and factual accuracy. Extensive experiments on six challenging medical VQA benchmarks demonstrate that ARMed substantially improves both accuracy and generalization. These findings underscore the importance of reward discriminability in medical RL and highlight the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.
[683] Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang
Main category: cs.CV
TL;DR: Prune2Drive is a plug-and-play visual token pruning framework for multi-view Vision-Language Models in autonomous driving that reduces computational overhead while maintaining performance.
Details
Motivation: Vision-Language Models show promise for autonomous driving but face computational challenges due to massive visual tokens from high-resolution multi-view images, causing inference latency and memory issues from quadratic self-attention complexity.Method: Proposes two innovations: 1) diversity-aware token selection prioritizing semantic/spatial coverage across views, and 2) view-adaptive pruning controller learning optimal pruning ratios based on camera importance to downstream tasks. No model retraining or attention map access required.
Result: Achieves 6.40x speedup in prefilling phase and uses only 13.4% of original FLOPs when retaining 10% of visual tokens, with only 3% average performance drop on DriveLM benchmark.
Conclusion: Prune2Drive enables efficient deployment of multi-view VLMs in autonomous driving through effective visual token pruning without compromising performance significantly.
Abstract: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), providing a unified framework for perception and decision-making. However, their real-world deployment is hindered by significant computational overhead when processing high-resolution, multi-view images. This complexity stems from the massive number of visual tokens, which increases inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in AD. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism that prioritizes semantic and spatial coverage across views, and (ii) a view-adaptive pruning controller that automatically learns optimal pruning ratios based on camera importance to downstream tasks. Unlike prior methods, Prune2Drive requires no model retraining or access to attention maps, ensuring compatibility with modern efficient attention implementations. Extensive experiments on the DriveLM and DriveLMM-o1 benchmarks demonstrate that Prune2Drive achieves significant speedups and memory savings with minimal performance impact. When retaining only 10% of visual tokens, our method achieves a 6.40x speedup in the prefilling phase and consumes only 13.4% of the original FLOPs, with a mere 3% average performance drop on the DriveLM benchmark. Code is available at: https://github.com/MinhaoXiong/Prune2Drive.git
[684] Beyond Frame-wise Tracking: A Trajectory-based Paradigm for Efficient Point Cloud Tracking
BaiChen Fan, Yuanxi Cui, Jian Li, Qin Wang, Shibo Zhao, Muqing Cao, Sifan Zhou
Main category: cs.CV
TL;DR: Paper 2509.11453: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.11453: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.11453&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[685] CLIFF: Continual Learning for Incremental Flake Features in 2D Material Identification
Sankalp Pandey, Xuan Bac Nguyen, Nicholas Borys, Hugh Churchill, Khoa Luu
Main category: cs.CV
TL;DR: CLIFF: A continual learning framework for classifying 2D material flakes from optical microscopy images that adapts to new materials with minimal forgetting.
Details
Motivation: Automated layer classification of quantum flakes from optical microscopy is challenging due to appearance variations across different materials. Current methods struggle with catastrophic forgetting when adapting to new materials.Method: Proposes CLIFF with frozen backbone/base head trained on reference material, learns material-specific prompts, embeddings, and delta heads. Uses prompt pool with cosine-similarity gate for feature modulation and memory replay with knowledge distillation.
Result: Achieves competitive accuracy with significantly lower forgetting than naive fine-tuning and prompt-based baselines.
Conclusion: First systematic study of continual learning for 2D materials classification, enabling scalable quantum hardware development through automated flake identification.
Abstract: Identifying quantum flakes is crucial for scalable quantum hardware; however, automated layer classification from optical microscopy remains challenging due to substantial appearance shifts across different materials. This paper proposes a new Continual-Learning Framework for Flake Layer Classification (CLIFF). To the best of our knowledge, this work represents the first systematic study of continual learning in two-dimensional (2D) materials. The proposed framework enables the model to distinguish materials and their physical and optical properties by freezing the backbone and base head, which are trained on a reference material. For each new material, it learns a material-specific prompt, embedding, and a delta head. A prompt pool and a cosine-similarity gate modulate features and compute material-specific corrections. Additionally, memory replay with knowledge distillation is incorporated. CLIFF achieves competitive accuracy with significantly lower forgetting than naive fine-tuning and a prompt-based baseline.
[686] FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers
Yue Wu, Xuanhong Chen, Yufan Wu, Wen Li, Yuxi Lu, Kairui Feng
Main category: cs.CV
TL;DR: FastAvatar is a feedforward 3D avatar reconstruction framework that uses a Large Gaussian Reconstruction Transformer to create 3D Gaussian Splatting models from diverse inputs (single image, multi-view, or monocular video) within seconds using a single unified model.
Details
Motivation: Current 3D avatar reconstruction methods face challenges including high time complexity, sensitivity to data quality, and low data utilization. There's a need for a fast, flexible system that can work with various input types while efficiently using all available data.Method: Uses a Large Gaussian Reconstruction Transformer (LGRT) with three key designs: 1) 3DGS transformer aggregating multi-frame cues with initial 3D prompts, 2) multi-granular guidance encoding for variable-length inputs, and 3) incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Enables incremental reconstruction that improves with more observations.
Result: FastAvatar achieves higher quality and highly competitive speed compared to existing methods, reconstructing 3D Gaussian Splatting models within seconds from diverse input types using a single unified model.
Conclusion: FastAvatar presents a quality-speed-tunable paradigm for highly usable 3D avatar modeling that overcomes limitations of previous methods by being fast, flexible with input types, and capable of incremental improvement with more data.
Abstract: Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT) featuring three key designs: First, a 3DGS transformer aggregating multi-frame cues while injecting initial 3D prompt to predict the corresponding registered canonical 3DGS representations; Second, multi-granular guidance encoding (camera pose, expression coefficient, head pose) mitigating animation-induced misalignment for variable-length inputs; Third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., improving quality with more observations without wasting input data as in previous works. This yields a quality-speed-tunable paradigm for highly usable 3D avatar modeling. Extensive experiments show that FastAvatar has a higher quality and highly competitive speed compared to existing methods.
[687] BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2509.13789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[688] Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning
Yuquan Bi, Hongsong Wang, Xinli Shi, Zhipeng Gui, Jie Gui, Yuan Yan Tang
Main category: cs.CV
TL;DR: Efficient diffusion-based 3D human pose estimation framework with hierarchical temporal pruning strategy that reduces computational costs while maintaining performance.
Details
Motivation: Diffusion models show strong 3D human pose generation capabilities but suffer from high computational costs due to their iterative nature and multi-hypothesis requirements.Method: Hierarchical Temporal Pruning (HTP) with three components: Temporal Correlation-Enhanced Pruning (TCEP) for frame-level pruning, Sparse-Focused Temporal MHSA (SFT MHSA) for efficient attention computation, and Mask-Guided Pose Token Pruner (MGPTP) for semantic-level pruning.
Result: Reduces training MACs by 38.5%, inference MACs by 56.8%, improves inference speed by 81.1% on average compared to prior diffusion methods, while achieving state-of-the-art performance on Human3.6M and MPI-INF-3DHP.
Conclusion: The proposed HTP strategy enables efficient diffusion-based 3D human pose estimation by dynamically pruning redundant tokens while preserving critical motion dynamics, offering significant computational savings without performance degradation.
Abstract: Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
[689] RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion
Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.03214: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03214&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[690] Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Fuli Feng
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.03516: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03516&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[691] UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features
Haowang Cui, Rui Chen, Jiaze Wang, Tao Guo, Zheng Qin
Main category: cs.CV
TL;DR: Paper ID 2509.04932 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2509.04932: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.04932&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[692] Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2509.21278: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21278&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[693] TrueSkin: Towards Fair and Accurate Skin Tone Recognition and Generation
Haoming Lu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.10980: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.10980&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[694] Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation
Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing paper contentMethod: Cannot determine method due to missing paper content
Result: Cannot determine results due to missing paper content
Conclusion: Cannot determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2509.11772: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.11772&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[695] DistillKac: Few-Step Image Generation via Damped Wave Equations
Weiqiao Han, Chenlin Meng, Christopher D. Manning, Stefano Ermon
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2509.21513: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21513&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[696] Brain-HGCN: A Hyperbolic Graph Convolutional Network for Brain Functional Network Analysis
Junhao Jia, Yunyou Liu, Cheng Yang, Yifei Sun, Feiwei Qin, Changmiao Wang, Yong Peng
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.14965: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.14965&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[697] Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition
Junhao Jia, Yunyou Liu, Yifei Sun, Huangwei Chen, Feiwei Qin, Changmiao Wang, Yong Peng
Main category: cs.CV
TL;DR: GeoProto: A prototype-based recognition method that uses diffusion geometry to capture nonlinear manifolds in deep features for better fine-grained classification.
Details
Motivation: Euclidean distances in deep visual features often misrepresent true similarity due to nonlinear manifolds, which is particularly problematic for prototype-based interpretable fine-grained recognition where subtle semantic distinctions matter.Method: Distills latent manifold structure of each class into diffusion space, uses differentiable Nyström interpolation to make geometry accessible to unseen samples and learnable prototypes, employs compact per-class landmark sets with periodic updates for efficiency.
Result: Comprehensive experiments on two benchmark datasets show GeoProto yields prototypes focusing on semantically corresponding parts and significantly outperforms Euclidean prototype networks.
Conclusion: Grounding similarity in the intrinsic geometry of deep features improves prototype-based recognition, enabling better interpretability and performance for fine-grained classification tasks.
Abstract: Nonlinear manifolds are pervasive in deep visual features, where Euclidean distances can misrepresent true similarity. This mismatch is particularly detrimental to prototype-based interpretable fine-grained recognition, where even subtle semantic distinctions are crucial. To mitigate this issue, this work presents a novel paradigm for prototype-based recognition by grounding similarity in the intrinsic geometry of deep features. Concretely, we distill the latent manifold structure of each class into a diffusion space and, critically, devise a differentiable Nyström interpolation to make this geometry accessible to both unseen samples and learnable prototypes. To maintain efficiency, we employ compact per-class landmark sets with periodic updates. This strategy keeps the embedding synchronized with the evolving backbone, enabling fast inference at scale. Comprehensive experiments on two benchmark datasets demonstrate that our GeoProto yields prototypes focusing on semantically corresponding parts, significantly outperforming Euclidean prototype networks.
[698] Quantized Visual Geometry Grounded Transformer
Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.21302: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21302&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[699] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models
Jian Liu, Chunshi Wang, Song Guo, Haohan Weng, Zhen Zhou, Zhiqi Li, Jiaao Yu, Yiling Zhu, Jing Xu, Biwen Lei, Zhuo Chen, Chunchao Guo
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.21420: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21420&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[700] Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Epsilon-Scheduling
Jonas Ngnawé, Maxime Heuillet, Sabyasachi Sahoo, Yann Pequignot, Ola Ahmad, Audrey Durand, Frédéric Precioso, Christian Gagné
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.23325: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23325&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[701] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
Wenyi Gong, Mieszko Lis
Main category: cs.CV
TL;DR: A token merging method for ViT backbones that preserves spatial structure for compatibility with spatial architectures like SAM and DINOv3, achieving speedups with minimal accuracy loss.
Details
Motivation: Modern ViT backbones use spatial designs (window attention, RoPE, etc.) that create challenges for token reduction methods, as most existing approaches fail to preserve the spatial structure these architectures depend on.Method: Three components: (1) 2D reduction strategy for structured token layouts, (2) spatial-aware merging algorithm that maintains relative token positions, and (3) max-magnitude-per-dimension token representation to preserve salient features.
Result: Achieves 1.25x speedup on SAM-H with only 0.7% mIOU drop on COCO, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet after one epoch of fine-tuning.
Conclusion: The method enables effective token reduction while maintaining spatial integrity, making it compatible with modern spatial ViT architectures and achieving state-of-the-art performance across vision tasks.
Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i)exploiting the uneven information distribution across the spatial layout while (ii)preserving the spatial structure post-merging. Our approach employs (i)a 2D reduction strategy to enforce structured token layouts, (ii)a spatial-aware merging algorithm that maintains relative token positions, and (iii)a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
[702] Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Main category: cs.CV
TL;DR: The paper proposes a new evaluation framework called Emotion Statement Judgment to assess multimodal LLMs’ visual emotion perception capabilities, addressing limitations in existing evaluation methods through automated statement generation and systematic testing.
Details
Motivation: Existing evaluation methods for MLLMs' visual emotion perception have limitations including oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations, leading to inconsistent results in zero-shot scenarios.Method: Proposes an Emotion Statement Judgment task and an automated pipeline to efficiently construct emotion-centric statements with minimal human effort, enabling systematic evaluation of MLLMs’ visual emotion perception capabilities.
Result: Evaluation shows MLLMs perform well in emotion interpretation and context-based emotion judgment but have limitations in understanding perception subjectivity. Even top models like GPT4o show significant performance gaps compared to humans.
Conclusion: The work provides a fundamental evaluation framework for assessing emotional intelligence in MLLMs, identifying key areas for future improvement in visual emotion perception capabilities.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
[703] CircuitSense: A Hierarchical MLLM Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
Arman Akbari, Jian Gao, Yifei Zou, Mei Yang, Jinru Duan, Dmitrii Torbunov, Yanzhi Wang, Yihui Ren, Xuan Zhang
Main category: cs.CV
TL;DR: CircuitSense benchmark evaluates MLLMs’ ability to understand technical diagrams and extract mathematical models from circuit schematics, revealing critical gaps in visual-to-mathematical reasoning.
Details
Motivation: While MLLMs excel at natural image tasks, their ability to extract mathematical models from technical diagrams (like circuit schematics) remains unexplored, despite being crucial for engineering design workflows that require hierarchical abstraction from visual inputs to mathematical models.Method: Created CircuitSense benchmark with 8,006+ problems spanning component-level schematics to system-level block diagrams. Introduced hierarchical synthetic generation pipeline with grid-based schematic generator and block diagram generator with auto-derived symbolic equation labels. Evaluated six state-of-the-art MLLMs across three workflow stages: Perception, Analysis, and Design.
Result: Closed-source models achieved >85% accuracy on perception tasks (component recognition, topology identification) but <19% on symbolic derivation and analytical reasoning. Models with stronger symbolic reasoning capabilities performed better on design tasks, confirming mathematical understanding as key for engineering competence.
Conclusion: There’s a critical gap between visual parsing and symbolic reasoning in current MLLMs. Symbolic reasoning is identified as the key metric for engineering competence, and the benchmark establishes a foundation for evaluating visual-to-mathematical reasoning in technical domains.
Abstract: Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present \textbf{CircuitSense}, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.
[704] Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models
Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu
Main category: cs.CV
TL;DR: Paper 2509.24365 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to determine conclusion due to abstract fetch failure
Abstract: Failed to fetch summary for 2509.24365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[705] Towards Interpretable Visual Decoding with Attention to Brain Representations
Pinyuan Feng, Hossein Adeli, Wenxuan Guo, Fan Cheng, Ethan Hwang, Nikolaus Kriegeskorte
Main category: cs.CV
TL;DR: NeuroAdapter: A brain-to-image reconstruction framework that directly conditions latent diffusion models on fMRI signals without intermediate feature spaces, enabling interpretable neural decoding through attention pattern analysis.
Details
Motivation: Existing brain decoding methods use intermediate image/text feature spaces that obscure how different brain areas contribute to visual reconstruction. The authors aim to create a more transparent, end-to-end approach that reveals the neural basis of visual representation.Method: Proposes NeuroAdapter framework that directly conditions a latent diffusion model on brain representations from fMRI data, bypassing intermediate feature spaces. Introduces Image-Brain BI-directional interpretability (IBBI) framework that analyzes cross-attention patterns across diffusion denoising steps to understand how cortical areas influence the generative process.
Result: Demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals drive visual reconstruction through interpretable attention analysis.
Conclusion: The work highlights the potential of end-to-end brain-to-image reconstruction and establishes a path for interpretable neural decoding by revealing how different brain areas contribute to the generative process through attention pattern analysis.
Abstract: Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, offering new ways to probe how the brain represents real-world scenes. However, many existing approaches first map brain signals into intermediate image or text feature spaces before guiding the generative process, which obscures the contributions of different brain areas to the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals drive visual reconstruction. To this end, we introduce an Image-Brain BI-directional interpretability framework (IBBI) that analyzes cross-attention patterns across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our work highlights the potential of end-to-end brain-to-image reconstruction and establishes a path for interpretable neural decoding.
[706] Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy
Haijier Chen, Bo Xu, Shoujian Zhang, Haoze Liu, Jiaxuan Lin, Jingrong Wang
Main category: cs.CV
TL;DR: Unable to analyze paper 2509.24385 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2509.24385: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24385&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[707] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation
Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin
Main category: cs.CV
TL;DR: DiffInk: A latent diffusion Transformer framework for full-line handwriting generation using dual regularization for content-style disentanglement and improved efficiency.
Details
Motivation: Existing text-to-online handwriting generation methods focus on character/word-level generation, leading to inefficiency and lack of holistic structural modeling for full text lines.Method: Proposes DiffInk with two components: 1) InkVAE - sequential VAE with OCR-based loss for glyph accuracy and style-classification loss for style preservation, 2) InkDiT - latent diffusion Transformer that integrates target text and reference styles to generate pen trajectories.
Result: Outperforms existing SOTA methods in both glyph accuracy and style fidelity while significantly improving generation efficiency.
Conclusion: DiffInk successfully addresses limitations of previous methods by enabling full-line handwriting generation with better content-style disentanglement and efficiency.
Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art (SOTA) methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency.
[708] SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Yuyou Zhang, Radu Corcodel, Chiori Hori, Anoop Cherian, Ding Zhao
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.25390: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25390&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[709] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Main category: cs.CV
TL;DR: QuantSparse: A unified framework combining model quantization and attention sparsification for efficient video diffusion transformers, achieving 3.68× storage reduction and 1.88× inference acceleration with improved quality.
Details
Motivation: Diffusion transformers have excellent video generation capabilities but suffer from prohibitive computational and memory costs. While model quantization and attention sparsification individually offer compression benefits, each alone causes severe performance degradation under aggressive compression. Naive combination of these techniques is ineffective due to sparsity-induced information loss exacerbating quantization noise and causing amplified attention shifts.Method: Proposes QuantSparse framework with two key components: 1) Multi-Scale Salient Attention Distillation - uses global structural guidance and local salient supervision to mitigate quantization-induced bias; 2) Second-Order Sparse Attention Reparameterization - exploits temporal stability of second-order residuals to efficiently recover information lost under sparsity.
Result: Achieves 20.88 PSNR on HunyuanVideo-13B, substantially outperforming state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while delivering 3.68× storage reduction and 1.88× acceleration in end-to-end inference.
Conclusion: QuantSparse successfully integrates model quantization with attention sparsification for efficient video diffusion transformers, overcoming the limitations of applying each technique individually and demonstrating significant efficiency gains with improved quality.
Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose \textbf{QuantSparse}, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce \textit{Multi-Scale Salient Attention Distillation}, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop \textit{Second-Order Sparse Attention Reparameterization}, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a \textbf{3.68$\times$} reduction in storage and \textbf{1.88$\times$} acceleration in end-to-end inference. Our code will be released in https://github.com/wlfeng0509/QuantSparse.
[710] Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
Muleilan Pei, Shaoshuai Shi, Shaojie Shen
Main category: cs.CV
TL;DR: SMART-R1 introduces a novel R1-style reinforcement fine-tuning paradigm for multi-agent traffic simulation, using metric-oriented policy optimization and iterative SFT-RFT-SFT training to align agent behavior with human preferences and evaluation metrics.
Details
Motivation: Existing data-driven traffic simulators rely on supervised learning and suffer from distributional shift between training and testing, undermining model generalization in unseen environments. There's a need for better alignment of simulated agent behavior with human preferences and evaluation metrics.Method: Proposes SMART-R1, an R1-style reinforcement fine-tuning paradigm with: 1) metric-oriented policy optimization algorithm for better distribution alignment, and 2) iterative “SFT-RFT-SFT” training strategy alternating between Supervised Fine-Tuning and Reinforcement Fine-Tuning to maximize performance gains.
Result: Extensive experiments on Waymo Open Motion Dataset validate effectiveness. Achieves state-of-the-art performance on Waymo Open Sim Agents Challenge with overall realism meta score of 0.7858, ranking first on leaderboard at submission time.
Conclusion: SMART-R1 demonstrates that R1-style reinforcement fine-tuning is a simple yet powerful framework for enhancing foundation models in multi-agent traffic simulation, effectively addressing distributional shift and improving alignment with human preferences.
Abstract: Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative “SFT-RFT-SFT” training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
[711] UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Yuan Zhao, Youwei Pang, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu, Xiaoqi Zhao
Main category: cs.CV
TL;DR: UniMMAD: A unified framework for multi-modal and multi-class anomaly detection using Mixture-of-Experts-driven feature decompression for adaptive, disentangled reconstruction across domains.
Details
Motivation: Existing anomaly detection methods treat modality and class as independent factors, leading to fragmented solutions, excessive memory overhead, and poor performance in reconstruction-based multi-class approaches that struggle with domain variations, causing distorted normality boundaries and high false alarm rates.Method: Proposes UniMMAD with Mixture-of-Experts-driven feature decompression mechanism. Encoding stage compresses multi-modal inputs into compact general features with feature compression module to suppress latent anomalies. Decoding stage uses sparsely-gated cross MoE to dynamically decompress features into modality-specific and class-specific forms via expert pathway selection based on input modality and class. Includes grouped dynamic filtering and MoE-in-MoE structure for efficiency.
Result: Achieves state-of-the-art performance on 9 anomaly detection datasets spanning 3 fields, 12 modalities, and 66 classes. Reduces parameter usage by 75% while maintaining sparse activation and fast inference.
Conclusion: UniMMAD provides a unified, efficient framework for multi-modal, multi-class anomaly detection that addresses fragmentation and domain interference issues through adaptive, disentangled reconstruction.
Abstract: Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a ``general to specific’’ paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
[712] Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness
Yuchen Song, Andong Chen, Wenxin Zhu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to the paper contentMethod: Cannot determine method without access to the paper content
Result: Cannot determine results without access to the paper content
Conclusion: Cannot draw conclusions without access to the paper content
Abstract: Failed to fetch summary for 2510.00041: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00041&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[713] Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting
Hanzhou Liu, Jia Huang, Mi Lu, Srikanth Saripalli, Peng Jiang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2509.26455 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable.
Result: Cannot determine results as paper content is unavailable.
Conclusion: Cannot draw conclusions about the paper without access to its content.
Abstract: Failed to fetch summary for 2509.26455: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26455&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[714] Arbitrary Generative Video Interpolation
Guozhen Zhang, Haiguang Wang, Chunyu Wang, Yuan Zhou, Qinglin Lu, Limin Wang
Main category: cs.CV
TL;DR: Paper 2510.00578: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Unknown - abstract retrieval failed due to server rate limitingMethod: Unknown - abstract retrieval failed due to server rate limiting
Result: Unknown - abstract retrieval failed due to server rate limiting
Conclusion: Unknown - abstract retrieval failed due to server rate limiting
Abstract: Failed to fetch summary for 2510.00578: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00578&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[715] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration
Alessio Spagnoletti, Andrés Almansa, Marcelo Pereyra
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.01339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[716] Purrception: Variational Flow Matching for Vector-Quantized Image Generation
Răzvan-Andrei Matişan, Vincent Tao Hu, Grigory Bartosh, Björn Ommer, Cees G. M. Snoek, Max Welling, Jan-Willem van de Meent, Mohammad Mahdi Derakhshani, Floor Eijkelboom
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.01478: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01478&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[717] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.02253 suggests it’s from October 2025, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2510.02253: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02253&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[718] EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel
Main category: cs.CV
TL;DR: EgoNight is the first comprehensive benchmark for nighttime egocentric vision with VQA as core task, featuring day-night aligned videos to study performance gaps between lighting conditions.
Details
Motivation: Existing egocentric vision benchmarks focus on daytime scenarios, overlooking low-light conditions inevitable in real-world applications, creating a research gap for nighttime understanding.Method: Collects synthetic (Blender-rendered) and real-world videos with day-night alignment, constructs EgoNight-VQA using day-augmented night auto-labeling engine with human verification, includes 3658 QA pairs across 90 videos spanning 12 QA types.
Result: State-of-the-art MLLMs show substantial performance drops when transferring from day to night, revealing challenges of reasoning under low-light conditions. Also introduces auxiliary tasks: day-night correspondence retrieval and egocentric depth estimation at night.
Conclusion: EgoNight provides foundation for advancing application-driven egocentric vision research and developing models that generalize across illumination domains, highlighting the need for better nighttime understanding capabilities.
Abstract: Most existing benchmarks for understanding egocentric vision focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. The code and data can be found at https://github.com/dehezhang2/EgoNight.
[719] ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations
Yike Wu, Yiwei Wang, Yujun Cai
Main category: cs.CV
TL;DR: ChainMPQ is a training-free method that reduces relation hallucinations in Large Vision-Language Models by using multi-perspective questions and interleaved chains of images/text to improve relational reasoning.
Details
Motivation: Relation hallucinations are the most common but least addressed type of hallucination in LVLMs, hindering their reliability despite strong overall multimodal performance.Method: ChainMPQ extracts subject/object keywords to enhance image regions, constructs multi-perspective questions focusing on subject, object, and relation components, and uses sequential processing with accumulated textual/visual memories to form interleaved chains guiding progressive reasoning.
Result: Experiments on multiple LVLMs and benchmarks show ChainMPQ substantially reduces relation hallucinations, with ablation studies validating the effectiveness of its three core modules.
Conclusion: ChainMPQ effectively addresses the under-explored problem of relation hallucinations in LVLMs through a training-free approach that leverages accumulated multimodal memories for improved relational inference.
Abstract: While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to hinder their reliability. Among the three categories of hallucinations, which include object, attribute, and relation, relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.
[720] VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance
Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusion due to lack of paper content
Abstract: Failed to fetch summary for 2510.06809: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06809&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[721] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.08771: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08771&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[722] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Daiki Yoshikawa, Takashi Matsubara
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2510.08919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[723] Spotlight on Token Perception for Multimodal Reinforcement Learning
Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.09285: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09285&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[724] Incomplete Multi-Label Image Recognition by Co-learning Semantic-Aware Features and Label Recovery
Zhi-Fen He, Ren-Dong Xie, Bo Li, Bin Liu, Jin-Yan Hu
Main category: cs.CV
TL;DR: Paper ID 2510.10055 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.10055: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10055&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[725] UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, Yi Wang, Limin Wang, Yali Wang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.10575 exists but content cannot be retrieved.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot determine conclusion without access to paper content.
Abstract: Failed to fetch summary for 2510.10575: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10575&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[726] There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training
Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.12586: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12586&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[727] Exploring Cross-Modal Flows for Few-Shot Learning
Ziqi Jiang, Yanghao Wang, Long Chen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to server rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2510.14543: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14543&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[728] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
Mingxuan Liu, Honglin He, Elisa Ricci, Wayne Wu, Bolei Zhou
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - paper retrieval failed due to rate limiting
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper
Abstract: Failed to fetch summary for 2510.15018: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15018&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[729] Latent Diffusion Model without Variational Autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.15301: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15301&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[730] Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
Heming Zou, Yunliang Zang, Wutong Xu, Xiangyang Ji
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.16877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[731] Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
Jinfeng Liu, Lingtong Kong, Mi Zhou, Jinwen Chen, Dan Xu
Main category: cs.CV
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API access issuesMethod: No method information available due to fetch failure
Result: No results available - paper content inaccessible
Conclusion: Cannot analyze paper due to technical limitations in accessing content
Abstract: Failed to fetch summary for 2510.18489: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18489&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[732] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.19400: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19400&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[733] S3OD: Towards Generalizable Salient Object Detection with Synthetic Data
Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht
Main category: cs.CV
TL;DR: Unable to analyze paper 2510.21605 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.21605: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21605&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[734] VoMP: Predicting Volumetric Mechanical Property Fields
Rishit Dagli, Donglai Xiang, Vismay Modi, Charles Loop, Clement Fuji Tsang, Anka He Chen, Anita Hu, Gavriel State, David I.W. Levin, Maria Shugrina
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2510.22975: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22975&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[735] Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.23607: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.23607&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[736] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.24711: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24711&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[737] Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, Michal Irani
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2510.25976: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25976&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[738] Parameterized Prompt for Incremental Object Detection
Zijia An, Boyu Diao, Ruiqi Liu, Libo Huang, Chuanguang Yang, Fei Wang, Zhulin An, Yongjun Xu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method without access to paper content
Result: No results available due to API access failure
Conclusion: Cannot draw conclusions without paper content
Abstract: Failed to fetch summary for 2510.27316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[739] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.27492: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27492&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[740] OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation
Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.01210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[741] MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.01266: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01266&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[742] SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen
Main category: cs.CV
TL;DR: Unable to analyze paper 2511.06499 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusions about paper content due to retrieval failure
Abstract: Failed to fetch summary for 2511.06499: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06499&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[743] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao, Shu Hu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.10150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[744] Revisiting Data Scaling in Medical Image Segmentation via Topology-Aware Augmentation
Yuetan Chu, Zhongyi Han, Gongning Luo, Xin Gao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2511.13883: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13883&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[745] SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios
Jieru Lin, Zhiwei Yu, Börje F. Karlsson
Main category: cs.CV
TL;DR: Paper 2511.17649 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to draw conclusions due to abstract fetch failure
Abstract: Failed to fetch summary for 2511.17649: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17649&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[746] Near–Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning
Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2512.07925: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07925&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[747] VeCoR – Velocity Contrastive Regularization for Flow Matching
Zong-Wei Hong, Jing-lun Li, Lin-Ze Li, Shen Zhang, Yao Tang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.18942: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18942&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[748] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, Bryan Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2511.19661: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19661&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[749] UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.20123: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20123&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[750] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
M.Naseer Subhani
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.21606: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21606&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[751] Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model
Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2512.11582: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11582&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[752] Towards Transferable Defense Against Malicious Image Edits
Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2512.14341: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14341&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[753] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models
Hieu Dinh Trung Pham, Huy Minh Nhat Nguyen, Cuong Tuan Nguyen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2512.04395: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04395&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[754] Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation
Chandler Timm C. Doloriel, Yunbei Zhang, Yeonguk Yu, Taki Hasan Rafi, Muhammad salman siddiqui, Tor Kristian Stevik, Habib Ullah, Fadi Al Machot, Kristian Hovde Liland
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.08048: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08048&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[755] Optimal transport unlocks end-to-end learning for single-molecule localization
Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.10683: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10683&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[756] $β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
Fatimah Zohra, Chen Zhao, Hani Itani, Bernard Ghanem
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2512.12678: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12678&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[757] SoFlow: Solution Flow Models for One-Step Generative Modeling
Tianze Luo, Haotian Yuan, Zhuang Liu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2512.15657: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15657&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[758] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.14696: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14696&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[759] GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection
Jingyu Li, Xiaolong Zhao, Zhe Liu, Wenxiao Wu, Li Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.23147: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23147&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[760] Aligned explanations in neural networks
Corentin Lobet, Francesca Chiaromonte
Main category: cs.CV
TL;DR: Unable to analyze paper 2601.04378 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions about paper content due to retrieval failure
Abstract: Failed to fetch summary for 2601.04378: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04378&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[761] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
Yujian Lee, Peng Gao, Yongqi Xu, Wentao Fan
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2601.08133: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08133&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[762] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization
Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.23258: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23258&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[763] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Xu Tang, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.22060: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22060&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[764] When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection
Shashank Mishra, Didier Stricker, Jason Rambach
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2601.22868: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22868&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[765] Counterfactual Explanations on Robust Perceptual Geodesics
Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2601.18678 appears to be an arXiv paper, but the content could not be retrieved.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2601.18678: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18678&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[766] MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference
Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2601.19961: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19961&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[767] HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation
Hari Krishna Gadi, Daniel Matos, Hongyi Luo, Lu Liu, Yongliang Wang, Yanfeng Zhang, Liqiu Meng
Main category: cs.CV
TL;DR: Paper 2601.23064: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2601.23064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.23064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[768] Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning
Meng Luo, Bobo Li, Shanqing Xu, Shize Zhang, Qiuchan Chen, Menglu Han, Wenhao Chen, Yanxiang Huang, Hao Fei, Mong-Li Lee, Wynne Hsu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.00971: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00971&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[769] Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning
Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng
Main category: cs.CV
TL;DR: Unable to analyze paper 2602.01649 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailableMethod: Cannot determine method as abstract is unavailable
Result: Cannot determine results as abstract is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.01649: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01649&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[770] DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
Shicheng Yin, Kaixuan Yin, Weixing Chen, Yang Liu, Guanbin Li, Liang Lin
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing paper contentMethod: Cannot determine method due to missing paper content
Result: Cannot determine results due to missing paper content
Conclusion: Cannot determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2602.01780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[771] CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
Yuliang Zhan, Jian Li, Wenbing Huang, Wenbing Huang, Yang Liu, Hao Sun
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.01844 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2602.01844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[772] NAB: Neural Adaptive Binning for Sparse-View CT reconstruction
Wangduo Xie, Matthew B. Blaschko
Main category: cs.CV
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2602.02356: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02356&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[773] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing
Mohammadreza Salehi, Mehdi Noroozi, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick, Alberto Gil Ramos, Abhinav Mehrotra
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2602.06871: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06871&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[774] Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models
Dung Anh Hoang, Cuong Pham anh Trung Le, Jianfei Cai, Thanh-Toan Do
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2602.01289: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01289&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[775] Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D
Yan Luo, Advaith Ravishankar, Serena Liu, Yutong Yang, Mengyu Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data fetch failureMethod: Unable to determine method due to data fetch failure
Result: Unable to determine results due to data fetch failure
Conclusion: Unable to draw conclusions due to data fetch failure
Abstract: Failed to fetch summary for 2602.09407: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09407&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[776] Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception
Zesheng Jia, Jin Wang, Siao Liu, Lingzhi Li, Ziyao Huang, Yunjiang Xu, Jianping Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.11565: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11565&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[777] EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
Rang Meng, Yingjie Yin, Yuming Li, Chenguang Ma
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.13669: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13669&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[778] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang
Main category: cs.CV
TL;DR: Paper 2602.14178 could not be fetched due to HTTP 429 error (rate limiting), so no analysis can be provided.
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved due to HTTP 429 error.Method: Unable to determine method as the paper content could not be retrieved due to HTTP 429 error.
Result: Unable to determine results as the paper content could not be retrieved due to HTTP 429 error.
Conclusion: Unable to draw conclusions as the paper content could not be retrieved due to HTTP 429 error.
Abstract: Failed to fetch summary for 2602.14178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[779] Deformation-Free Cross-Domain Image Registration via Position-Encoded Temporal Attention
Yiwen Wang, Jiahao Qin
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.15959 appears to be from February 2026, which suggests it might be a future or incorrectly formatted arXiv ID.
Details
Motivation: Unable to determine motivation due to access error. The paper could not be retrieved from arXiv API.Method: Method unknown - paper content not accessible.
Result: No results available due to retrieval failure.
Conclusion: Cannot draw conclusions about an inaccessible paper.
Abstract: Failed to fetch summary for 2602.15959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[780] OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Ling Zhang, Beng Chin Ooi, Yingda Xia
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.16110: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16110&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[781] WiCompass: Oracle-driven Data Scaling for mmWave Human Pose Estimation
Bo Liang, Chen Gong, Haobo Wang, Qirui Liu, Rungui Zhou, Fengzhi Shao, Yubo Wang, Wei Gao, Kaichen Zhou, Guolong Cui, Chenren Xu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to draw conclusions due to access restrictions
Abstract: Failed to fetch summary for 2602.18726: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18726&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[782] BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
Miaowei Wang, Qingxuan Yan, Zhi Cao, Yayuan Li, Oisin Mac Aodha, Jason J. Corso, Amir Vaxman
Main category: cs.CV
TL;DR: Paper 2602.18873: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing abstract contentMethod: Unable to determine method due to missing abstract content
Result: Unable to determine results due to missing abstract content
Conclusion: Unable to determine conclusion due to missing abstract content
Abstract: Failed to fetch summary for 2602.18873: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18873&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[783] Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models
Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.12529: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12529&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[784] Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy
Jiabao Chen, Shan Xiong, Jialin Peng
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2602.19423: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19423&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[785] Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images
Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, Zhanyu Ma
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.19424: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19424&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[786] StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, Marco Cristani
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.20089: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20089&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[787] tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.20160 appears to be from February 2026, suggesting it’s a future paper or potentially an incorrect ID format.
Details
Motivation: Cannot determine motivation as the paper content could not be retrieved due to API rate limiting.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2602.20160: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20160&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[788] Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models
Limai Jiang, Ruitao Xie, Bokai Yang, Huazhen Huang, Juan He, Yufu Huo, Zikai Wang, Yang Wei, Yunpeng Cai
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.20511: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20511&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[789] Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression
Chenyue Yu, Lingao Xiao, Jinhong Deng, Ivor W. Tsang, Yang He
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.20650: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20650&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[790] Cycle-Consistent Tuning for Layered Image Decomposition
Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-O, Hui Huang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2602.20989: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20989&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[791] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You
Main category: cs.CV
TL;DR: Unable to analyze paper 2602.20999 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper abstractMethod: Cannot determine method without access to the paper abstract
Result: Cannot determine results without access to the paper abstract
Conclusion: Cannot determine conclusion without access to the paper abstract
Abstract: Failed to fetch summary for 2602.20999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[792] HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
Yifan Wang, Francesco Pittaluga, Zaid Tasneem, Chenyu You, Manmohan Chandraker, Ziyu Jiang
Main category: cs.CV
TL;DR: Unable to analyze paper 2602.21333 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2602.21333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[793] Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models
Hiroshi Sasaki
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.23589: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23589&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[794] CADC: Content Adaptive Diffusion-Based Generative Image Compression
Xihua Sheng, Lingyu Zhu, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.21591: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21591&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[795] Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps
Shan Wang, Peixia Li, Chenchen Xu, Ziang Cheng, Jiayu Yang, Hongdong Li, Pulak Purkait
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.21820: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21820&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[796] PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning
Mingde Yao, Zhiyuan You, King-Man Tam, Menglu Wang, Tianfan Xue
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2602.22809: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22809&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[797] ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization
Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2602.22948: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22948&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[798] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
Main category: cs.CV
TL;DR: Paper ID 2602.23029 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.23029: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23029&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[799] AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.23166: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23166&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[800] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
Main category: cs.CV
TL;DR: SVG2 is a large-scale panoptic video scene graph dataset with 636K videos, and TRaSER is a video scene graph generation model that improves scene graph generation performance and boosts video question answering accuracy when used as an intermediate representation.
Details
Motivation: There's a need for large-scale spatio-temporal scene graph datasets to advance video understanding, as existing datasets lack scale and diversity. Scene graphs can serve as useful intermediate representations for video understanding tasks.Method: Created SVG2 dataset using automated pipeline with panoptic segmentation, trajectory tracking, semantic parsing, and GPT-5-based relation inference. Developed TRaSER model with trajectory-aligned token arrangement, object-trajectory resampler, and temporal-window resampler to generate spatio-temporal scene graphs from videos.
Result: TRaSER improves relation detection by +15-20%, object prediction by +30-40% over baselines and +13% over GPT-5, and attribute prediction by +15%. When used for video QA, it provides +1.5 to 4.6% accuracy gain over video-only or other scene graph methods.
Conclusion: SVG2 enables large-scale video scene graph research, and TRaSER demonstrates that explicit spatio-temporal scene graphs are effective intermediate representations for video understanding tasks.
Abstract: We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER’s generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL’s generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
[801] Towards Source-Aware Object Swapping with Initial Noise Perturbation
Jiahui Zhan, Xianbing Sun, Xiangnan Zhu, Yikun Ji, Ruitong Liu, Liqing Zhang, Jianfu Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions about paper content due to technical error
Abstract: Failed to fetch summary for 2602.23697: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23697&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[802] AoE: Always-on Egocentric Human Video Collection for Embodied AI
Bowen Yang, Zishuo Li, Yang Sun, Changtao Miao, Yifan Yang, Man Luo, Xiaotong Yan, Feng Jiang, Jinchuan Shi, Yankai Fu, Ning Chen, Junkai Zhao, Pengwei Wang, Guocai Yao, Shanghang Zhang, Hao Chen, Zhe Li, Kai Zhu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.23893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[803] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Rang Meng, Xingyu Zhang, Yuming Li, Chenguang Ma
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2411.10061: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.10061&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[804] Deep generative computed perfusion-deficit mapping of ischaemic stroke
Chayanin Tangwiriyasakul, Pedro Borges, Guilherme Pombo, Stefano Moriconi, Michael S. Elmalem, Paul Wright, Yee-Haur Mah, Jane Rondina, Sebastien Ourselin, Parashkev Nachev, M. Jorge Cardoso
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2502.01334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.01334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[805] Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
Jiude Wei, Yuxuan Li, Cewu Lu, Jianhua Sun
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2503.23348: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23348&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[806] BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2506.07530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.07530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[807] Digital and Robotic Twinning for Validation of Proximity Operations and Formation Flying
Z. Ahmed, E. Bates, P. Francesch Huc, S. Y. W. Low, A. Golan, T. Bell, A. Rizza, S. D’Amico
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2507.20034: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.20034&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[808] Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control
Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.05285: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.05285&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[809] Splat the Net: Radiance Fields with Splattable Neural Primitives
Xilong Zhou, Bao-Huy Nguyen, Loïc Magne, Vladislav Golyanik, Thomas Leimkühler, Christian Theobalt
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to data unavailability
Abstract: Failed to fetch summary for 2510.08491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[810] Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Rotem Gatenyo, Ohad Fried
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.14207: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14207&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[811] HELMLAB: An Analytical, Data-Driven Color Space for Perceptual Distance in UI Design Systems
Gorkem Yildiz
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) for arXiv ID 2602.23010
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.23010: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23010&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[812] V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space
Faiz Aladin, Ashwin Balasubramanian, Lars Lindemann, Daniel Seita
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to draw conclusions due to access restrictions
Abstract: Failed to fetch summary for 2602.23524: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23524&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[813] Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, Zhuohan Xie
Main category: cs.AI
TL;DR: WKGFC: A fact-checking framework using knowledge graphs and LLM agents for evidence retrieval and verification
Details
Motivation: Current fact-checking methods have limitations: traditional methods don't generalize well to new data distributions, while RAG-based methods rely on textual similarity and miss multi-hop semantic relations, leading to inaccurate veracity predictions.Method: Proposes WKGFC framework that uses authorized open knowledge graphs as evidence core, with LLM-enabled retrieval to find relevant knowledge subgraphs. Augments with web content retrieval. Implements as automatic Markov Decision Process where reasoning LLM agent takes actions based on current evidence and claims, using prompt optimization to fine-tune the agentic LLM.
Result: Not specified in abstract, but the method addresses limitations of previous approaches by leveraging structured knowledge graphs and LLM reasoning for more accurate evidence retrieval.
Conclusion: WKGFC offers a novel approach to fact-checking by combining knowledge graphs with LLM agents, addressing limitations of previous methods in evidence retrieval and multi-hop reasoning.
Abstract: Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns learned from training data, which limits their generalization to new data distributions. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed to utilize the reasoning capability of LLMs with retrieved grounding evidence documents. However, these methods largely rely on textual similarity for evidence retrieval and struggle to retrieve evidence that captures multi-hop semantic relations within rich document contents. These limitations lead to overlooking subtle factual correlations between the evidence and the claims to be fact-checked during evidence retrieval, thus causing inaccurate veracity predictions. To address these issues, we propose WKGFC, which exploits authorized open knowledge graph as a core resource of evidence. LLM-enabled retrieval is designed to assess the claims and retrieve the most relevant knowledge subgraphs, forming structured evidence for fact verification. To augment the knowledge graph evidence, we retrieve web contents for completion. The above process is implemented as an automatic Markov Decision Process (MDP): A reasoning LLM agent decides what actions to take according to the current evidence and the claims. To adapt the MDP for fact-checking, we use prompt optimization to fine-tune the agentic LLM.
[814] TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
Xiaochuang Yuan, Hui Xu, Silvia Xu, Cui Zou, Jing Xiong
Main category: cs.AI
TL;DR: TraderBench is a financial AI evaluation benchmark combining expert-verified static tasks with adversarial trading simulations scored on realized performance metrics to eliminate judge variance in finance AI assessment.
Details
Motivation: Current AI evaluation in finance has two main problems: static benchmarks require expensive expert annotation but miss dynamic decision-making, while LLM-based judges introduce uncontrolled variance on domain-specific financial tasks.Method: Combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations scored purely on realized performance metrics (Sharpe ratio, returns, drawdown). Features two tracks: crypto trading with market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management.
Result: Evaluation of 13 models (8B open-source to frontier) shows: 1) 8 of 13 models score ~33 on crypto with minimal variation across adversarial conditions, revealing fixed non-adaptive strategies; 2) Extended thinking helps retrieval (+26 points) but has zero impact on trading performance (+0.3 crypto, -0.1 options).
Conclusion: Current AI agents lack genuine market adaptation, highlighting the need for performance-grounded evaluation in finance rather than relying on static benchmarks or LLM judges.
Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. We introduce TraderBench, a benchmark that addresses both issues. It combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations scored purely on realized performance-Sharpe ratio, returns, and drawdown-eliminating judge variance entirely. The framework features two novel tracks: crypto trading with four progressive market-manipulation transforms, and options derivatives scoring across P&L accuracy, Greeks, and risk management. Trading scenarios can be refreshed with new market data to prevent benchmark contamination. Evaluating 13 models (8B open-source to frontier) on ~50 tasks, we find: (1) 8 of 13 models score ~33 on crypto with <1-point variation across adversarial conditions, exposing fixed non-adaptive strategies; (2) extended thinking helps retrieval (+26 points) but has zero impact on trading (+0.3 crypto, -0.1 options). These findings reveal that current agents lack genuine market adaptation, underscoring the need for performance-grounded evaluation in finance.
[815] Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin, Yifan Zhu, Xin Mei, Ling Huang, Jingying Ma, Kai He, Zhen Peng, Erik Cambria, Mengling Feng
Main category: cs.AI
TL;DR: A comprehensive survey analyzing whether multimodal learning has achieved universal intelligence in healthcare, concluding it hasn’t and identifying future research directions.
Details
Motivation: To assess whether multimodal learning has delivered universal intelligence in healthcare by analyzing current progress and identifying gaps between current capabilities and true universal intelligence.Method: Three-pronged analysis: 1) comprehensive survey of medical multimodal learning (datasets, task-oriented methods, foundation models), 2) discussion from five key issues (data, technologies, performance, ethics), 3) identification of ten future research directions.
Result: Current multimodal learning technologies have NOT achieved universal intelligence in healthcare; significant gaps remain in data integration, model generalization, ethical considerations, and practical deployment.
Conclusion: While multimodal learning shows promise for healthcare applications, substantial work is needed to achieve true universal intelligence, with ten identified research directions for future exploration.
Abstract: The rapid development of artificial intelligence has constantly reshaped the field of intelligent healthcare and medicine. As a vital technology, multimodal learning has increasingly garnered interest due to data complementarity, comprehensive modeling form, and great application potential. Currently, numerous researchers are dedicating their attention to this field, conducting extensive studies and constructing abundant intelligent systems. Naturally, an open question arises that has multimodal learning delivered universal intelligence in healthcare? To answer the question, we adopt three unique viewpoints for a holistic analysis. Firstly, we conduct a comprehensive survey of the current progress of medical multimodal learning from the perspectives of datasets, task-oriented methods, and universal foundation models. Based on them, we further discuss the proposed question from five issues to explore the real impacts of advanced techniques in healthcare, from data and technologies to performance and ethics. The answer is that current technologies have NOT achieved universal intelligence and there remains a significant journey to undertake. Finally, in light of the above reviews and discussions, we point out ten potential directions for exploration towards the goal of universal intelligence in healthcare.
[816] DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
Hanqing Yang, Hyungwoo Lee, Yuhang Yao, Zhiwei Liu, Kay Liu, Jingdi Chen, Carlee Joe-Wong
Main category: cs.AI
TL;DR: DIG (Dynamic Interaction Graph) is a framework that visualizes emergent collaboration in multi-agent LLM systems as time-evolving causal networks, enabling real-time identification and correction of collaboration errors.
Details
Motivation: Current agentic AI systems often use predefined workflows or roles to reduce complexity, but truly autonomous multi-agent systems should achieve emergent collaboration. However, unstructured interactions can lead to redundant work and cascading failures that are difficult to interpret or correct.Method: Introduces Dynamic Interaction Graph (DIG) which captures emergent collaboration as a time-evolving causal network of agent activations and interactions. This makes emergent collaboration observable and explainable for the first time.
Result: DIG enables real-time identification, explanation, and correction of collaboration-induced error patterns directly from agents’ collaboration paths, filling a critical gap in understanding how general LLM agents solve problems together.
Conclusion: DIG provides a novel framework for making emergent collaboration in multi-agent LLM systems observable and explainable, addressing key challenges in truly agentic AI systems.
Abstract: The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles in order to reduce complexity, ideally these agents would be truly autonomous, able to achieve emergent collaboration even as the number of collaborating agents increases. Yet in practice, such unstructured interactions can lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi-agent systems composed of general-purpose LLM agents that operate without predefined roles, control flow, or communication constraints, relying instead on emergent collaboration to solve problems. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time-evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real-time identification, explanation, and correction of collaboration-induced error patterns directly from agents’ collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi-agent systems. The project webpage can be found at: https://happyeureka.github.io/dig.
[817] How Well Do Multimodal Models Reason on ECG Signals?
Maxwell A. Xu, Harish Haresumadram, Catherine W. Liu, Patrick Langer, Jathurshan Pradeepkumar, Wanting Mao, Sunita J. Ferns, Aradhana Verma, Jimeng Sun, Paul Schmiedmayer, Xin Liu, Daniel McDuff, Emily B. Fox, James M. Rehg
Main category: cs.AI
TL;DR: A framework for evaluating reasoning in ECG signals by decomposing it into Perception (pattern identification) and Deduction (logical application of clinical knowledge), using code generation for empirical verification and retrieval-based alignment with clinical criteria.
Details
Motivation: Multimodal LLMs generate interpretable reasoning traces for health AI, but verifying their validity remains challenging. Existing evaluation methods are either unscalable (manual clinician review) or superficial (proxy metrics like QA) that fail to capture semantic correctness of clinical logic.Method: 1) Decompose reasoning into Perception (accurate identification of patterns in raw signals) and Deduction (logical application of domain knowledge). 2) For Perception: Use agentic framework generating code to empirically verify temporal structures in reasoning traces. 3) For Deduction: Measure alignment of model’s logic against structured database of established clinical criteria using retrieval-based approach.
Result: The proposed dual-verification method enables scalable assessment of “true” reasoning capabilities in multimodal LLMs for ECG signal analysis, addressing the verification challenge of reasoning traces.
Conclusion: The framework provides a reproducible approach to evaluate reasoning in ECG signals by separating perception and deduction components, offering a more rigorous and scalable alternative to existing evaluation methods for health AI reasoning verification.
Abstract: While multimodal large language models offer a promising solution to the “black box” nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are either unscalable, relying on manual clinician review, or superficial, utilizing proxy metrics (e.g. QA) that fail to capture the semantic correctness of clinical logic. In this work, we introduce a reproducible framework for evaluating reasoning in ECG signals. We propose decomposing reasoning into two distinct, components: (i) Perception, the accurate identification of patterns within the raw signal, and (ii) Deduction, the logical application of domain knowledge to those patterns. To evaluate Perception, we employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace. To evaluate Deduction, we measure the alignment of the model’s logic against a structured database of established clinical criteria in a retrieval-based approach. This dual-verification method enables the scalable assessment of “true” reasoning capabilities.
[818] EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents
Hanqing Yang, Shiyu Chen, Narjes Nourzad, Marie Siew, Jingdi Chen, Carlee Joe-Wong
Main category: cs.AI
TL;DR: EmCoop is a benchmark framework for studying cooperation in LLM-based embodied multi-agent systems, separating cognitive and embodied layers to analyze collaboration dynamics with process-level metrics.
Details
Motivation: Real-world scenarios require multiple embodied agents to collaborate under constraints, but existing benchmarks lack fine-grained analysis of how collaboration emerges and contributes to task success in embodied multi-agent systems.Method: Introduces EmCoop framework with high-level cognitive layer (LLM-based reasoning/planning) and low-level embodied interaction layer, with process-level metrics to diagnose collaboration quality and failure modes. Instantiated in two embodied environments scalable to arbitrary agent numbers with diverse communication topologies.
Result: Demonstrates systematic analysis of cooperation dynamics across team sizes and task settings, enabling diagnosis of collaboration quality beyond final task success metrics.
Conclusion: EmCoop provides a framework for studying cooperation in LLM-based embodied multi-agent systems with process-level analysis capabilities for understanding collaboration dynamics.
Abstract: Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable high-level cognitive coordination through reasoning, planning, and natural language communication. However, fine-grained analyses of how such collaboration emerges, unfolds, and contributes to task success in embodied multi-agent systems are difficult to conduct with existing benchmarks. In this paper, we introduce EmCoop, a benchmark framework for studying cooperation in LLM-based embodied multi-agent systems. Our framework separates a high-level cognitive layer from a low-level embodied interaction layer, allowing us to characterize agent cooperation through their interleaved dynamics over time. Given a cooperation-constrained embodied task, we propose generalizable, process-level metrics that diagnose collaboration quality and failure modes, beyond final task success. We instantiate our framework in two embodied environments that scale to arbitrary numbers of agents and support diverse communication topologies, and use these instantiations to demonstrate how EmCoop enables systematic analysis of cooperation dynamics across team sizes and task settings. The project web page can be found at: https://happyeureka.github.io/emcoop.
[819] Monotropic Artificial Intelligence: Toward a Cognitive Taxonomy of Domain-Specialized Language Models
Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa
Main category: cs.AI
TL;DR: Monotropic AI proposes specialized language models that sacrifice generality for extreme precision in narrow domains, challenging the “bigger is better” paradigm in AI research.
Details
Motivation: The paper challenges the prevailing assumption that progress in AI requires larger, more general models. It argues that intense specialization represents an alternative cognitive architecture with advantages for safety-critical applications, drawing inspiration from monotropism theory in autistic cognition.Method: The authors formalize the concept of Monotropic AI, contrasting it with conventional polytropic architectures. They demonstrate viability through Mini-Enedina, a 37.5-million-parameter model specialized in Timoshenko beam analysis, designed to be deliberately incompetent outside its domain.
Result: Mini-Enedina achieves near-perfect performance on Timoshenko beam analysis while maintaining deliberate incompetence outside its specialized domain, demonstrating the viability of the monotropic approach.
Conclusion: The framework challenges the assumption that artificial general intelligence is the sole legitimate goal of AI research, proposing instead a cognitive ecology where specialized and generalist systems coexist complementarily.
Abstract: The prevailing paradigm in artificial intelligence research equates progress with scale: larger models trained on broader datasets are presumed to yield superior capabilities. This assumption, while empirically productive for general-purpose applications, obscures a fundamental epistemological tension between breadth and depth of knowledge. We introduce the concept of \emph{Monotropic Artificial Intelligence} – language models that deliberately sacrifice generality to achieve extraordinary precision within narrowly circumscribed domains. Drawing on the cognitive theory of monotropism developed to understand autistic cognition, we argue that intense specialization represents not a limitation but an alternative cognitive architecture with distinct advantages for safety-critical applications. We formalize the defining characteristics of monotropic models, contrast them with conventional polytropic architectures, and demonstrate their viability through Mini-Enedina, a 37.5-million-parameter model that achieves near-perfect performance on Timoshenko beam analysis while remaining deliberately incompetent outside its domain. Our framework challenges the implicit assumption that artificial general intelligence constitutes the sole legitimate aspiration of AI research, proposing instead a cognitive ecology in which specialized and generalist systems coexist complementarily.
[820] Conservative Equilibrium Discovery in Offline Game-Theoretic Multiagent Reinforcement Learning
Austin A. Nguyen, Michael P. Wellman
Main category: cs.AI
TL;DR: COffeE-PSRO extends Policy Space Response Oracles (PSRO) for offline multiagent game-solving by incorporating uncertainty quantification and conservatism principles to select strategies more likely to have low regret given limited trajectory data.
Details
Motivation: The paper addresses the challenge of solving games in offline multiagent settings where only a fixed dataset of state-action trajectories is available. Traditional online game-solving approaches like PSRO cannot be directly applied, and verifying true equilibria is infeasible due to limited data coverage of game dynamics.Method: Extends PSRO by quantifying game dynamics uncertainty and modifying the RL objective to favor solutions more likely to have low regret in the true game. Proposes a novel meta-strategy solver tailored for offline settings and incorporates conservatism principles from offline reinforcement learning for strategy exploration.
Result: COffeE-PSRO extracts lower-regret solutions than state-of-the-art offline approaches and reveals relationships between algorithmic components, empirical game fidelity, and overall performance.
Conclusion: The proposed approach successfully addresses offline game-solving challenges by combining uncertainty-aware PSRO extensions with conservatism principles, demonstrating improved performance over existing methods.
Abstract: Offline learning of strategies takes data efficiency to its extreme by restricting algorithms to a fixed dataset of state-action trajectories. We consider the problem in a mixed-motive multiagent setting, where the goal is to solve a game under the offline learning constraint. We first frame this problem in terms of selecting among candidate equilibria. Since datasets may inform only a small fraction of game dynamics, it is generally infeasible in offline game-solving to even verify a proposed solution is a true equilibrium. Therefore, we consider the relative probability of low regret (i.e., closeness to equilibrium) across candidates based on the information available. Specifically, we extend Policy Space Response Oracles (PSRO), an online game-solving approach, by quantifying game dynamics uncertainty and modifying the RL objective to skew towards solutions more likely to have low regret in the true game. We further propose a novel meta-strategy solver, tailored for the offline setting, to guide strategy exploration in PSRO. Our incorporation of Conservatism principles from Offline reinforcement learning approaches for strategy Exploration gives our approach its name: COffeE-PSRO. Experiments demonstrate COffeE-PSRO’s ability to extract lower-regret solutions than state-of-the-art offline approaches and reveal relationships between algorithmic components empirical game fidelity, and overall performance.
[821] NeuroHex: Highly-Efficient Hex Coordinate System for Creating World Models to Enable Adaptive AI
Quinn Jacobson, Joe Luo, Jingfei Xu, Shanmuga Venkatachalam, Kevin Wang, Dingchao Rong, John Paul Shen
Main category: cs.AI
TL;DR: NeuroHex is a hexagonal coordinate system inspired by brain grid cells, offering efficient spatial operations for AI world models with 90-99% geometric complexity reduction.
Details
Motivation: The paper aims to create efficient world models and reference frames for online adaptive AI systems, addressing the computational inefficiency of Cartesian coordinate systems for spatial reasoning tasks.Method: Develops NeuroHex, a hexagonal coordinate system with cubic isometric formulation, ring indexing, quantized angular encoding, and hierarchical geometric shape primitives. Includes OSM2Hex tool for converting OpenStreetMap data with significant geometric simplification.
Result: Achieves 90-99% reduction in geometric complexity while maintaining relevant spatial structure for navigation. Demonstrates efficiency on city and neighborhood scale datasets for dynamic world modeling.
Conclusion: NeuroHex provides a highly efficient substrate for building dynamic world models to enable adaptive spatial reasoning in autonomous AI systems with continuous online learning capability.
Abstract: \textit{NeuroHex} is a hexagonal coordinate system designed to support highly efficient world models and reference frames for online adaptive AI systems. Inspired by the hexadirectional firing structure of grid cells in the human brain, NeuroHex adopts a cubic isometric hexagonal coordinate formulation that provides full 60° rotational symmetry and low-cost translation, rotation and distance computation. We develop a mathematical framework that incorporates ring indexing, quantized angular encoding, and a hierarchical library of foundational, simple, and complex geometric shape primitives. These constructs allow low-overhead point-in-shape tests and spatial matching operations that are expensive in Cartesian coordinate systems. To support realistic settings, the NeuroHex framework can process OpenStreetMap (OSM) data sets using an OSM-to-NeuroHex (\textit{OSM2Hex}) conversion tool. The OSM2Hex spatial abstraction processing pipeline can achieve a reduction of 90-99% in geometric complexity while maintaining the relevant spatial structure map for navigation. Our initial results, based on actual city and neighborhood scale data sets, demonstrate that NeuroHex offers a highly efficient substrate for building dynamic world models to enable adaptive spatial reasoning in autonomous AI systems with continuous online learning capability.
[822] Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin, Jiliang Tang
Main category: cs.AI
TL;DR: CARO is a framework that improves LLM-based grading by using confusion matrices to separate error signals into distinct modes, allowing targeted prompt optimization without conflicting constraints.
Details
Motivation: Manually crafting grading guidelines for LLMs is suboptimal, and existing automated prompt optimization methods suffer from "rule dilution" where conflicting constraints weaken grading logic by aggregating unstructured error samples.Method: CARO uses confusion matrices to decompose error signals into distinct misclassification patterns, synthesizes targeted “fixing patches” for dominant error modes, and employs diversity-aware selection to prevent guidance conflicts, eliminating nested refinement loops.
Result: Empirical evaluations on teacher education and STEM datasets show CARO significantly outperforms existing state-of-the-art methods in automated assessment.
Conclusion: Surgical, mode-specific repair through structured error signal separation yields robust improvements in automated assessment scalability and precision compared to mixed-error aggregation approaches.
Abstract: Accurate and unambiguous guidelines are critical for large language model (LLM) based graders, yet manually crafting these prompts is often sub-optimal as LLMs can misinterpret expert guidelines or lack necessary domain specificity. Consequently, the field has moved toward automated prompt optimization to refine grading guidelines without the burden of manual trial and error. However, existing frameworks typically aggregate independent and unstructured error samples into a single update step, resulting in “rule dilution” where conflicting constraints weaken the model’s grading logic. To address these limitations, we introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency by structurally separating error signals. CARO leverages the confusion matrix to decompose monolithic error signals into distinct modes, allowing for the diagnosis and repair of specific misclassification patterns individually. By synthesizing targeted “fixing patches” for dominant error modes and employing a diversity-aware selection mechanism, the framework prevents guidance conflict and eliminates the need for resource-heavy nested refinement loops. Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
[823] MED-COPILOT: A Medical Assistant Powered by GraphRAG and Similar Patient Case Retrieval
Shuheng Chen, Namratha Patil, Haonan Pan, Angel Hsing-Chi Hwang, Yao Du, Ruishan Liu, Jieyu Zhao
Main category: cs.AI
TL;DR: MED-COPILOT: A clinical decision-support system combining guideline-grounded GraphRAG retrieval with similar-patient retrieval to enhance LLM-based clinical reasoning with evidence-aware transparency.
Details
Motivation: Clinical decision-making requires synthesizing heterogeneous evidence from patient histories, guidelines, and comparable cases. LLMs have strong reasoning capabilities but suffer from hallucinations and struggle to integrate long, structured medical documents.Method: Interactive system combining guideline-grounded GraphRAG retrieval with hybrid semantic-keyword similar-patient retrieval. Builds structured knowledge graph from WHO/NICE guidelines, applies community-level summarization for efficient retrieval, and maintains 36,000-case similar-patient database from SOAP-normalized MIMIC-IV notes and Synthea records.
Result: Outperforms parametric LLM baselines and standard RAG on clinical note completion and medical question answering, improving both generation fidelity and clinical reasoning accuracy. System enables inspection of retrieved evidence, visualization of token-level similarity contributions, and guided follow-up analysis.
Conclusion: Demonstrates practical and interpretable approach to integrating structured guideline knowledge with patient-level analogical evidence for clinical LLMs, providing transparent and evidence-aware clinical reasoning support.
Abstract: Clinical decision-making requires synthesizing heterogeneous evidence, including patient histories, clinical guidelines, and trajectories of comparable cases. While large language models (LLMs) offer strong reasoning capabilities, they remain prone to hallucinations and struggle to integrate long, structured medical documents. We present MED-COPILOT, an interactive clinical decision-support system designed for clinicians and medical trainees, which combines guideline-grounded GraphRAG retrieval with hybrid semantic-keyword similar-patient retrieval to support transparent and evidence-aware clinical reasoning. The system builds a structured knowledge graph from WHO and NICE guidelines, applies community-level summarization for efficient retrieval, and maintains a 36,000-case similar-patient database derived from SOAP-normalized MIMIC-IV notes and Synthea-generated records. We evaluate our framework on clinical note completion and medical question answering, and demonstrate that it consistently outperforms parametric LLM baselines and standard RAG, improving both generation fidelity and clinical reasoning accuracy. The full system is available at https://huggingface.co/spaces/Cryo3978/Med_GraphRAG , enabling users to inspect retrieved evidence, visualize token-level similarity contributions, and conduct guided follow-up analysis. Our results demonstrate a practical and interpretable approach to integrating structured guideline knowledge with patient-level analogical evidence for clinical LLMs.
[824] Optimizing In-Context Demonstrations for LLM-based Automated Grading
Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek, Joseph Krajcik, Jiliang Tang
Main category: cs.AI
TL;DR: GUIDE is a framework for automated grading of open-ended student responses that improves LLM reliability by optimizing exemplar selection and refinement through boundary-focused contrastive learning and discriminative rationale generation.
Details
Motivation: Current LLM-based grading systems are unreliable due to poor exemplar selection (typically based on semantic similarity) and the bottleneck of manually crafting expert rationales, which fail to capture subtle decision boundaries needed for rubric adherence.Method: GUIDE reframes exemplar selection as a boundary-focused optimization problem using a continuous loop of selection and refinement with contrastive operators to identify “boundary pairs” (semantically similar responses with different grades) and generates discriminative rationales that explicitly explain why a response receives a specific score.
Result: Extensive experiments across physics, chemistry, and pedagogical content knowledge datasets show GUIDE significantly outperforms standard retrieval baselines, with exceptionally robust gains on borderline cases and improved rubric adherence.
Conclusion: GUIDE enables trusted, scalable assessment systems that align closely with human pedagogical standards by focusing LLM attention on precise rubric boundaries through optimized exemplar selection and discriminative rationale generation.
Abstract: Automated assessment of open-ended student responses is a critical capability for scaling personalized feedback in education. While large language models (LLMs) have shown promise in grading tasks via in-context learning (ICL), their reliability is heavily dependent on the selection of few-shot exemplars and the construction of high-quality rationales. Standard retrieval methods typically select examples based on semantic similarity, which often fails to capture subtle decision boundaries required for rubric adherence. Furthermore, manually crafting the expert rationales needed to guide these models can be a significant bottleneck. To address these limitations, we introduce GUIDE (Grading Using Iteratively Designed Exemplars), a framework that reframes exemplar selection and refinement in automated grading as a boundary-focused optimization problem. GUIDE operates on a continuous loop of selection and refinement, employing novel contrastive operators to identify “boundary pairs” that are semantically similar but possess different grades. We enhance exemplars by generating discriminative rationales that explicitly articulate why a response receives a specific score to the exclusion of adjacent grades. Extensive experiments across datasets in physics, chemistry, and pedagogical content knowledge demonstrate that GUIDE significantly outperforms standard retrieval baselines. By focusing the model’s attention on the precise edges of rubric, our approach shows exceptionally robust gains on borderline cases and improved rubric adherence. GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
[825] Why Not? Solver-Grounded Certificates for Explainable Mission Planning
Najeeb Khan
Main category: cs.AI
TL;DR: A faithfulness-first approach for explaining satellite scheduling decisions using optimization certificates derived directly from the solver’s constraint model, ensuring causal attributions and complete constraint coverage.
Details
Motivation: Earth observation satellite operators need trustworthy justifications for scheduling decisions (why requests were selected/rejected, what changes would make them schedulable). Existing post-hoc reasoning layers risk non-causal attributions, incomplete constraint conjunctions, and solver-path dependence.Method: Derives explanations as certificates directly from the optimization model: minimal infeasible subsets for rejections, tight constraints and contrastive trade-offs for selections, and inverse solves for what-if queries.
Result: Achieves perfect soundness (15/15 cited-constraint checks), counterfactual validity (7/7), and stability (Jaccard = 1.0 across 28 seed-pairs). Post-hoc baseline produces non-causal attributions in 29% of cases and misses constraint conjunctions in every multi-cause rejection. Scalable to 200 orders and 30 satellites.
Conclusion: A faithfulness-first approach using optimization certificates provides trustworthy, complete, and stable explanations for satellite scheduling decisions, outperforming post-hoc reasoning methods.
Abstract: Operators of Earth observation satellites need justifications for scheduling decisions: why a request was selected, rejected, or what changes would make it schedulable. Existing approaches construct post-hoc reasoning layers independent of the optimizer, risking non-causal attributions, incomplete constraint conjunctions, and solver-path dependence. We take a faithfulness-first approach: every explanation is a certificate derived from the optimization model itself: minimal infeasible subsets for rejections, tight constraints and contrastive trade-offs for selections, and inverse solves for what-if queries. On a scheduling instance with structurally distinct constraint interactions, certificates achieve perfect soundness with respect to the solver’s constraint model (15/15 cited-constraint checks), counterfactual validity (7/7), and stability (Jaccard = 1.0 across 28 seed-pairs), while a post-hoc baseline produces non-causal attributions in 29% of cases and misses constraint conjunctions in every multi-cause rejection. A scalability analysis up to 200 orders and 30 satellites confirms practical extraction times for operational batches.
[826] From Goals to Aspects, Revisited: An NFR Pattern Language for Agentic AI Systems
Yijun Yu
Main category: cs.AI
TL;DR: This paper extends aspect-oriented programming methodology to agentic AI systems, proposing a pattern language with 12 reusable patterns across security, reliability, observability, and cost management categories to modularize crosscutting concerns in AI agent development.
Details
Motivation: Agentic AI systems have numerous crosscutting concerns (security, observability, cost management, fault tolerance) that are poorly modularized in current implementations, contributing to high failure rates of AI projects reaching production. The paper aims to address this by extending aspect-oriented programming methodology to the agentic AI domain.Method: The paper revisits and extends the goals-to-aspects methodology from RE 2004 to the agentic AI domain. It presents a pattern language of 12 reusable patterns organized across four NFR categories, mapping i* goal models to concrete aspect implementations using an AOP framework for Rust. Four patterns specifically address agent-specific concerns: tool-scope sandboxing, prompt injection detection, token budget management, and action audit trails. The V-graph model is extended to capture how agent tasks contribute to both functional goals and non-functional soft-goals.
Result: The pattern language is validated through a case study analyzing an open-source autonomous agent framework, demonstrating how goal-driven aspect discovery systematically identifies and modularizes crosscutting concerns in agentic AI systems.
Conclusion: The pattern language offers a principled approach for engineering reliable agentic AI systems through early identification of crosscutting concerns, addressing the modularization challenges that contribute to AI project failures in production.
Abstract: Agentic AI systems exhibit numerous crosscutting concerns – security, observability, cost management, fault tolerance – that are poorly modularized in current implementations, contributing to the high failure rate of AI projects in reaching production. The goals-to-aspects methodology proposed at RE 2004 demonstrated that aspects can be systematically discovered from i* goal models by identifying non-functional soft-goals that crosscut functional goals. This paper revisits and extends that methodology to the agentic AI domain. We present a pattern language of 12 reusable patterns organized across four NFR categories (security, reliability, observability, cost management), each mapping an i* goal model to a concrete aspect implementation using an AOP framework for Rust. Four patterns address agent-specific crosscutting concerns absent from traditional AOP literature: tool-scope sandboxing, prompt injection detection, token budget management, and action audit trails. We extend the V-graph model to capture how agent tasks simultaneously contribute to functional goals and non-functional soft-goals. We validate the pattern language through a case study analyzing an open-source autonomous agent framework, demonstrating how goal-driven aspect discovery systematically identifies and modularizes crosscutting concerns. The pattern language offers a principled approach for engineering reliable agentic AI systems through early identification of crosscutting concerns.
[827] LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
Hengjian Gao, Kaiwei Zhang, Shibo Wang, Mingjie Chen, Qihang Cao, Xianfeng Wang, Yucheng Zhu, Xiongkuo Min, Wei Sun, Dandan Zhu, Guangtao Zhai
Main category: cs.AI
TL;DR: LifeEval is a multimodal benchmark for evaluating real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective, focusing on interactive assistance rather than passive understanding.
Details
Motivation: Current MLLMs lack evaluation for dynamic, real-world assistance capabilities. Existing video benchmarks focus on passive retrospective analysis, not interactive real-time collaboration needed for practical human assistance.Method: Created LifeEval benchmark with 4,075 QA pairs across 6 core capability dimensions, using rigorous annotation pipeline. Emphasizes task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues.
Result: Evaluation of 26 state-of-the-art MLLMs reveals substantial challenges in achieving timely, effective, and adaptive interaction, highlighting gaps in current models’ real-time assistance capabilities.
Conclusion: LifeEval identifies critical limitations in current MLLMs for real-time human-AI collaboration and provides essential directions for advancing human-centered interactive intelligence.
Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question-answer pairs across 6 core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.
[828] AI Runtime Infrastructure
Christopher Cruz
Main category: cs.AI
TL;DR: AI Runtime Infrastructure is a new execution-time layer that actively monitors and intervenes in agent behavior to optimize performance, efficiency, reliability, and safety during runtime.
Details
Motivation: Current AI systems lack active runtime optimization capabilities. Model-level optimizations and passive logging systems don't address execution-time issues like task success, latency, token efficiency, reliability, and safety during actual agent operation.Method: Introduces a distinct execution-time layer positioned above the model and below the application. This infrastructure actively observes, reasons over, and intervenes in agent behavior while the agent is running, treating execution itself as an optimization surface.
Result: Enables adaptive memory management, failure detection, recovery, and policy enforcement over long-horizon agent workflows. Provides active optimization of task success, latency, token efficiency, reliability, and safety during runtime.
Conclusion: AI Runtime Infrastructure represents a new approach to AI system optimization by focusing on execution-time intervention and optimization, moving beyond model-level improvements to address runtime performance and reliability challenges.
Abstract: We introduce AI Runtime Infrastructure, a distinct execution-time layer that operates above the model and below the application, actively observing, reasoning over, and intervening in agent behavior to optimize task success, latency, token efficiency, reliability, and safety while the agent is running. Unlike model-level optimizations or passive logging systems, runtime infrastructure treats execution itself as an optimization surface, enabling adaptive memory management, failure detection, recovery, and policy enforcement over long-horizon agent workflows.
[829] MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning
Tianmeng Hu, Biao Luo, Chunhua Yang, Tingwen Huang
Main category: cs.AI
TL;DR: MO-MIX: A multi-objective multi-agent reinforcement learning approach using CTDE framework with weight vectors for objective preferences and parallel mixing networks for joint action-value estimation.
Details
Motivation: Real-world decision-making often involves multiple conflicting objectives and requires multiple agents to cooperate, but existing RL approaches are limited to either multi-agent single-objective or single-agent multi-objective problems, not addressing the combined challenge.Method: Uses centralized training with decentralized execution (CTDE) framework; feeds weight vectors representing objective preferences into decentralized agent networks for local action-value estimation; employs parallel mixing network architecture for joint action-value function estimation; incorporates exploration guide approach to improve uniformity of non-dominated solutions.
Result: Effectively solves multi-objective multi-agent cooperative decision-making problems and generates approximations of Pareto sets; significantly outperforms baseline methods across all four evaluation metrics while requiring less computational cost.
Conclusion: MO-MIX successfully addresses the gap in multi-objective multi-agent reinforcement learning, providing an effective framework for complex real-world decision-making problems with multiple conflicting objectives and cooperative agents.
Abstract: Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks often have several conflicting objectives and may require multiple agents to cooperate, which are the multi-objective multi-agent decision-making problems. However, only few works have been conducted on this intersection. Existing approaches are limited to separate fields and can only handle multi-agent decision-making with a single objective, or multi-objective decision-making with a single agent. In this paper, we propose MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing preference over the objectives is fed into the decentralized agent network as a condition for local action-value function estimation, while a mixing network with parallel architecture is used to estimate the joint action-value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non-dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi-objective multi-agent cooperative decision-making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method in all four kinds of evaluation metrics, but also requires less computational cost.
[830] DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows
Yandong Yan, Junwei Peng, Shijie Li, Chenxi Li, Yifei Shang, Can Deng, Ruiting Dai, Yongqiang Zhao, Jiaqi Zhu, Yu Huang
Main category: cs.AI
TL;DR: DenoiseFlow is a closed-loop framework that addresses accumulated semantic ambiguity in multi-step reasoning tasks by progressively denoising through sensing uncertainty, regulating computation adaptively, and correcting errors via root-cause localization.
Details
Motivation: Autonomous agents struggle with long-horizon tasks as minor interpretation errors in natural-language instructions compound across steps, leading to reliability degradation. Existing approaches lack runtime adaptivity and fail to handle uncertainty effectively.Method: Formalizes multi-step reasoning as a Noisy MDP with three coordinated stages: (1) Sensing estimates per-step semantic uncertainty, (2) Regulating adaptively allocates computation by routing between fast single-path execution and parallel exploration based on risk, and (3) Correcting performs targeted recovery via influence-based root-cause localization with online self-calibration.
Result: Achieves highest accuracy on six benchmarks spanning mathematical reasoning, code generation, and multi-hop QA (83.3% average, +1.3% over strongest baseline) while reducing cost by 40-56% through adaptive branching.
Conclusion: DenoiseFlow effectively mitigates accumulated semantic ambiguity in long-horizon reasoning tasks through adaptive uncertainty-aware computation, demonstrating both improved accuracy and efficiency.
Abstract: Autonomous agents are increasingly entrusted with complex, long-horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi-step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in natural-language instructions tend to compound silently across steps. We term this failure mode accumulated semantic ambiguity. Existing approaches to mitigate this often lack runtime adaptivity, relying instead on static exploration budgets, reactive error recovery, or single-path execution that ignores uncertainty entirely. We formalize the multi-step reasoning process as a Noisy MDP and propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages: (1)Sensing estimates per-step semantic uncertainty; (2)Regulating adaptively allocates computation by routing between fast single-path execution and parallel exploration based on estimated risk; and (3)Correcting performs targeted recovery via influence-based root-cause localization. Online self-calibration continuously aligns decision boundaries with verifier feedback, requiring no ground-truth labels. Experiments on six benchmarks spanning mathematical reasoning, code generation, and multi-hop QA show that DenoiseFlow achieves the highest accuracy on every benchmark (83.3% average, +1.3% over the strongest baseline) while reducing cost by 40–56% through adaptive branching. Detailed ablation studies further confirm framework-level’s robustness and generality. Code is available at https://anonymous.4open.science/r/DenoiseFlow-21D3/.
[831] LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks
Yucheng Zeng, Weipeng Lu, Linyun Liu, Shupeng Li, Zitian Qu, Chenghao Zhu, Shaofei Li, Zhengdong Tan, Mengyue Liu, Haotian Zhao, Zhe Zhou, Jianmin Wu
Main category: cs.AI
TL;DR: LOGIGEN: A logic-driven framework for synthesizing verifiable training data for autonomous agents using hard-compiled policy grounding, logic-driven forward synthesis, and deterministic state verification.
Details
Motivation: The evolution of LLMs into autonomous agents requires operating in complex, stateful environments, but current approaches are bottlenecked by data scarcity. Existing tool-centric reverse-synthesis pipelines fail to capture the rigorous logic of real-world applications.Method: Three-pillar framework: 1) Hard-Compiled Policy Grounding (compiling natural-language policy into database constraints), 2) Logic-Driven Forward Synthesis (initializing boundary-adjacent states to trigger policy conflicts), 3) Deterministic State Verification (checking exact state equivalence). Uses Triple-Agent Orchestration: Architect, Set Designer, and Explorer agents.
Result: Generated dataset of 20,000 complex tasks across 8 domains with strict validity guarantees. LOGIGEN-32B(RL) achieved 79.5% success rate on τ²-Bench, substantially outperforming the base model (40.7%).
Conclusion: Logic-driven synthesis combined with verification-based training effectively constructs causally valid trajectories needed for next-generation autonomous agents.
Abstract: The evolution of Large Language Models (LLMs) from static instruction-followers to autonomous agents necessitates operating within complex, stateful environments to achieve precise state-transition objectives. However, this paradigm is bottlenecked by data scarcity, as existing tool-centric reverse-synthesis pipelines fail to capture the rigorous logic of real-world applications. We introduce \textbf{LOGIGEN}, a logic-driven framework that synthesizes verifiable training data based on three core pillars: \textbf{Hard-Compiled Policy Grounding}, \textbf{Logic-Driven Forward Synthesis}, and \textbf{Deterministic State Verification}. Specifically, a Triple-Agent Orchestration is employed: the \textbf{Architect} compiles natural-language policy into database constraints to enforce hard rules; the \textbf{Set Designer} initializes boundary-adjacent states to trigger critical policy conflicts; and the \textbf{Explorer} searches this environment to discover causal solution paths. This framework yields a dataset of 20,000 complex tasks across 8 domains, where validity is strictly guaranteed by checking exact state equivalence. Furthermore, we propose a verification-based training protocol where Supervised Fine-Tuning (SFT) on verifiable trajectories establishes compliance with hard-compiled policy, while Reinforcement Learning (RL) guided by dense state-rewards refines long-horizon goal achievement. On $τ^2$-Bench, LOGIGEN-32B(RL) achieves a \textbf{79.5% success rate}, substantially outperforming the base model (40.7%). These results demonstrate that logic-driven synthesis combined with verification-based training effectively constructs the causally valid trajectories needed for next-generation agents.
[832] BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning
Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian
Main category: cs.AI
TL;DR: BioProAgent: A neuro-symbolic framework combining probabilistic LLM planning with deterministic finite state machines for reliable physical execution in wet-labs
Details
Motivation: LLMs struggle with physical execution in irreversible wet-lab environments where probabilistic hallucinations can cause equipment damage or experimental failure. Need to bridge the gap between reasoning capabilities and reliable physical execution.Method: Neuro-symbolic framework anchoring probabilistic planning in deterministic Finite State Machine (FSM). Uses State-Augmented Planning with Design-Verify-Rectify workflow and Semantic Symbol Grounding to reduce token consumption through symbolic abstraction.
Result: Achieves 95.6% physical compliance on BioProBench benchmark (vs 21.0% for ReAct), reduces token consumption by ~6× through symbolic abstraction.
Conclusion: Neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments, enabling safe and compliant execution of scientific experiments.
Abstract: Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect, but also cause equipment damage or experimental failure. To address this, we propose \textbf{BioProAgent}, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous \textit{Design-Verify-Rectify} workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by \textit{Semantic Symbol Grounding}, reducing token consumption by $\sim$6$\times$ through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. \footnote{Code at https://github.com/YuyangSunshine/bioproagent and project at https://yuyangsunshine.github.io/BioPro-Project/}
[833] OmniGAIA: Towards Native Omni-Modal AI Agents
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Main category: cs.AI
TL;DR: OmniGAIA benchmark evaluates omni-modal agents on video/audio/image tasks requiring reasoning and tool use; OmniAtlas is a native omni-modal foundation agent with active perception and tool integration.
Details
Motivation: Current multi-modal LLMs are limited to bi-modal interactions (e.g., vision-language) and lack unified cognitive capabilities for general AI assistants that need to process video, audio, and image modalities together with reasoning and tool usage.Method: 1) Created OmniGAIA benchmark using omni-modal event graphs to synthesize complex multi-hop queries from real-world data requiring cross-modal reasoning and tool integration. 2) Developed OmniAtlas, a native omni-modal foundation agent with active perception, trained using hindsight-guided tree exploration and OmniDPO for fine-grained error correction.
Result: The work presents both a comprehensive benchmark (OmniGAIA) for evaluating omni-modal agents and a foundation agent (OmniAtlas) that enhances tool-use capabilities of existing open-source models for real-world scenarios.
Conclusion: This research advances towards next-generation native omni-modal AI assistants capable of deep reasoning and multi-turn tool execution across video, audio, and image modalities for real-world applications.
Abstract: Human intelligence naturally intertwines omni-modal perception – spanning vision, audio, and language – with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
[834] Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
Zeyu Chen, Huanjin Yao, Ziwang Zhao, Min Yang
Main category: cs.AI
TL;DR: M-JudgeBench is a capability-oriented benchmark for evaluating MLLM-as-a-judge systems, with Judge-MCTS framework for training stronger judge models.
Details
Motivation: Existing MLLM-as-a-judge benchmarks focus on task types rather than fundamental judgment capabilities needed for reliable evaluation, creating a need for more principled assessment frameworks.Method: Introduces M-JudgeBench with ten-dimensional capability assessment through pairwise CoT comparison, length bias avoidance, and process error detection. Proposes Judge-MCTS framework to generate diverse reasoning trajectories for training M-Judger models.
Result: Systematic evaluation reveals weaknesses in existing MLLM-as-a-judge systems. M-Judger models trained with Judge-MCTS outperform existing models on both M-JudgeBench and existing judge benchmarks.
Conclusion: The work establishes a more principled foundation for evaluating MLLM-as-a-judge systems and enables capability-driven judge training through the M-JudgeBench benchmark and Judge-MCTS framework.
Abstract: Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has gradually become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by task types but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. Our benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and cross-model variations. Systematic evaluation uncovers the systematic weaknesses in existing MLLM-as-a-judge systems. To address this issue, we further propose Judge-MCTS, a data construction framework generating pairwise reasoning trajectories with various correctness and length. Using Judge-MCTS, we construct an MCTS-augmented dataset and train M-Judger, a series of strong judge models. Extensive experiments demonstrate the superiority of M-Judger on existing judge benchmarks as well as M-JudgeBench. Overall, our work establishes a more principled foundation for evaluating MLLM-as-a-judge through M-JudgeBench and Judge-MCTS framework, paving the way for future research on judge model evaluation and capability-driven judge training.
[835] EMPA: Evaluating Persona-Aligned Empathy as a Process
Shiya Zhang, Yuhan Zhan, Ruixi Su, Ruihan Sun, Ziyi Song, Zhaohan Chen, Xiaofan Zhang
Main category: cs.AI
TL;DR: EMPA framework evaluates persona-aligned empathy in LLM dialogue agents using process-oriented evaluation with controllable scenarios and multi-agent sandbox testing
Details
Motivation: Current evaluation of persona-aligned empathy in LLM dialogue agents is challenging due to latent user states, sparse feedback, and difficulty verifying support quality over extended interactionsMethod: EMPA framework distills real interactions into controllable psychologically-grounded scenarios, uses open-ended multi-agent sandbox to expose strategic adaptation and failure modes, and scores trajectories in latent psychological space using directional alignment, cumulative impact, and stability metrics
Result: Provides reproducible comparison and optimization of long-horizon empathic behavior, with signals and metrics that extend to other agent settings with latent dynamics and weak feedback
Conclusion: EMPA offers a process-oriented evaluation framework for persona-aligned empathy in LLM dialogue agents that addresses limitations of current evaluation methods and supports long-term behavior optimization
Abstract: Evaluating persona-aligned empathy in LLM-based dialogue agents remains challenging. User states are latent, feedback is sparse and difficult to verify in situ, and seemingly supportive turns can still accumulate into trajectories that drift from persona-specific needs. We introduce EMPA, a process-oriented framework that evaluates persona-aligned support as sustained intervention rather than isolated replies. EMPA distills real interactions into controllable, psychologically grounded scenarios, couples them with an open-ended multi-agent sandbox that exposes strategic adaptation and failure modes, and scores trajectories in a latent psychological space by directional alignment, cumulative impact, and stability. The resulting signals and metrics support reproducible comparison and optimization of long-horizon empathic behavior, and they extend to other agent settings shaped by latent dynamics and weak, hard-to-verify feedback.
[836] The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition
Saad Alqithami
Main category: cs.AI
TL;DR: OSL is a unified mathematical framework using observer-situation lattices for perspective-aware reasoning in multi-agent systems, with algorithms for belief propagation and contradiction handling.
Details
Motivation: Existing approaches to multi-agent reasoning are fragmented, handling different agents, times, and contexts in separate modules, leading to brittle and incomplete reasoning, especially for Theory of Mind tasks.Method: Introduces Observer-Situation Lattice (OSL) - a finite complete lattice where elements represent observer-situation pairs. Presents two algorithms: Relativized Belief Propagation for incremental updates and Minimal Contradiction Decomposition for identifying contradictions.
Result: Proves theoretical soundness and demonstrates practical utility through benchmarks including Theory of Mind tasks, showing computational efficiency and expressiveness compared to systems like assumption-based truth maintenance.
Conclusion: OSL provides a principled, scalable foundation for building robust, perspective-aware autonomous agents with unified reasoning across multiple dimensions.
Abstract: Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer-Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective-aware cognition. OSL is a finite complete lattice where each element represents a unique observer-situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph-based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption-based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective-aware autonomous agents.
[837] SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks
Yucheng Zeng, Shupeng Li, Daxiang Dong, Ruijie Xu, Zimo Chen, Liwei Zheng, Yuxuan Li, Zhe Zhou, Haotian Zhao, Lun Tian, Heng Xiao, Tianshu Zhu, Longkun Hao, Jianmin Wu
Main category: cs.AI
TL;DR: SWE-Hub is an end-to-end system for generating executable, scalable, and realistic software engineering data by unifying environment automation, scalable bug synthesis, and diverse task generation into a coherent production stack.
Details
Motivation: Progress in software-engineering agents is constrained by scarce executable, scalable, and realistic data for training and evaluation, due to brittle environments, expensive bug synthesis, and lack of long-horizon competencies in existing data.Method: SWE-Hub integrates four components: Env Agent (converts repositories to reproducible container environments), SWE-Scale engine (high-throughput bug-fix generation), Bug Agent (synthesizes system-level regressions with user-like issue reports), and SWE-Architect (translates requirements into repository-scale build tasks).
Result: The system establishes a unified production pipeline capable of continuously delivering executable tasks across the entire software engineering lifecycle, addressing data scarcity challenges.
Conclusion: SWE-Hub operationalizes the data factory abstraction to overcome fundamental constraints in software-engineering agent development by providing scalable, realistic, and executable training/evaluation data.
Abstract: Progress in software-engineering agents is increasingly constrained by the scarcity of executable, scalable, and realistic data for training and evaluation. This scarcity stems from three fundamental challenges in existing pipelines: environments are brittle and difficult to reproduce across languages; synthesizing realistic, system-level bugs at scale is computationally expensive; and existing data predominantly consists of short-horizon repairs, failing to capture long-horizon competencies like architectural consistency. We introduce \textbf{SWE-Hub}, an end-to-end system that operationalizes the data factory abstraction by unifying environment automation, scalable synthesis, and diverse task generation into a coherent production stack. At its foundation, the \textbf{Env Agent} establishes a shared execution substrate by automatically converting raw repository snapshots into reproducible, multi-language container environments with standardized interfaces. Built upon this substrate, \textbf{SWE-Scale} engine addresses the need for high-throughput generation, combining cross-language code analysis with cluster-scale validation to synthesize massive volumes of localized bug-fix instances. \textbf{Bug Agent} generates high-fidelity repair tasks by synthesizing system-level regressions involving cross-module dependencies, paired with user-like issue reports that describe observable symptoms rather than root causes. Finally, \textbf{SWE-Architect} expands the task scope from repair to creation by translating natural-language requirements into repository-scale build-a-repo tasks. By integrating these components, SWE-Hub establishes a unified production pipeline capable of continuously delivering executable tasks across the entire software engineering lifecycle.
[838] Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Siliang Tang
Main category: cs.AI
TL;DR: Draft-Thinking is a method that teaches models to use concise draft-style reasoning structures through progressive curriculum learning, reducing reasoning costs while maintaining performance.
Details
Motivation: Current CoT approaches cause systematic overthinking, unnecessarily coupling reasoning capability with high reasoning costs. Existing cost-reduction methods use post hoc techniques without addressing core reasoning mechanisms.Method: Progressive curriculum learning teaches models to internalize concise draft-style reasoning structures that retain only critical steps. Also introduces adaptive prompting to make reasoning depth a flexible, model-selectable behavior.
Result: Achieves 82.6% reduction in reasoning budget on MATH500 with only 2.6% performance drop. Substantially reduces reasoning costs while largely preserving reasoning performance.
Conclusion: Draft-Thinking effectively decouples reasoning capability from reasoning cost, providing an efficient alternative to traditional CoT paradigms through learned concise reasoning structures.
Abstract: Long chain-of-thought~(CoT) has become a dominant paradigm for enhancing the reasoning capability of large reasoning models~(LRMs); however, the performance gains often come with a substantial increase in reasoning budget. Recent studies show that existing CoT paradigms tend to induce systematic overthinking, unnecessarily coupling reasoning capability with reasoning cost. Most prior approaches reduce token usage through post hoc techniques such as token compression, truncation, or length penalties, without explicitly addressing the core mechanisms of reasoning. We propose \textbf{Draft-Thinking}, which guides models to first learn a concise \textit{draft-style} reasoning structure that retains only the critical reasoning steps. Through a \textit{progressive curriculum learning}, the model stably internalizes this efficient reasoning pattern as its capability scales. Moreover, Draft-Thinking introduces adaptive prompting, which elevates reasoning depth to a flexible, model-selectable behavior. Extensive experiments demonstrate that Draft-Thinking substantially reduces reasoning budget while largely preserving reasoning performance; for example, on MATH500, it achieves an 82.6% reduction in reasoning budget at the cost of only a 2.6% performance drop.
[839] Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher
Main category: cs.AI
TL;DR: Multi-agent LLM architecture for interactive explanations in AI planning systems to facilitate human-AI collaboration
Details
Motivation: To improve human-AI collaboration in planning systems by enabling natural language explanations that respond to users' questions, increasing understanding and trustMethod: Multi-agent Large Language Model architecture that is agnostic to explanation frameworks, enabling user- and context-dependent interactive explanations
Result: Instantiated framework for goal-conflict explanations and conducted user study comparing LLM-powered interaction with baseline template-based explanation interface
Conclusion: LLM-powered interactive explanations can enhance human-AI collaboration in planning systems by providing more natural and responsive explanations
Abstract: When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human’s role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users’ questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
[840] MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation
Rongsheng Wang, Minghao Wu, Hongru Zhou, Zhihan Yu, Zhenyang Cai, Junying Chen, Benyou Wang
Main category: cs.AI
TL;DR: MicroWorldBench benchmark and MicroVerse model for microscale simulation video generation, addressing limitations of current video generation models for biomedical applications.
Details
Motivation: Current video generation models fail at microscale simulation needed for biomedical applications like drug discovery, organ-on-chip systems, and disease mechanism studies, as well as education and scientific visualization.Method: Created MicroWorldBench benchmark with 459 expert-annotated criteria for systematic evaluation, built MicroSim-10K high-quality dataset, and trained MicroVerse model specifically for microscale simulation tasks.
Result: Current SOTA video generation models fail in microscale simulation, showing physical law violations and temporal inconsistencies. MicroVerse successfully reproduces complex microscale mechanisms.
Conclusion: First introduction of Micro-World Simulation concept with proof of concept, demonstrating potential for biology, education, and scientific visualization applications.
Abstract: Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ-on-chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce MicroWorldBench, a multi-level rubric-based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric-based evaluation through 459 unique expert-annotated criteria spanning multiple microscale simulation task (e.g., organ-level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail in microscale simulation, showing violations of physical laws, temporal inconsistency, and misalignment with expert criteria. To address these limitations, we construct MicroSim-10K, a high-quality, expert-verified simulation dataset. Leveraging this dataset, we train MicroVerse, a video generation model tailored for microscale simulation. MicroVerse can accurately reproduce complex microscale mechanism. Our work first introduce the concept of Micro-World Simulation and present a proof of concept, paving the way for applications in biology, education, and scientific visualization. Our work demonstrates the potential of educational microscale simulations of biological mechanisms. Our data and code are publicly available at https://github.com/FreedomIntelligence/MicroVerse
[841] Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs
Yiran Zhao, Lu Zhou, Xiaogang Xu, Zhe Liu, Jiafei Wu, Liming Fang
Main category: cs.AI
TL;DR: IRIS Benchmark: First comprehensive fairness evaluation framework for Unified Multimodal Large Language Models (UMLLMs) that synchronously assesses both understanding and generation tasks across 60 granular metrics in three dimensions.
Details
Motivation: Address the "Tower of Babel" dilemma in AI fairness where numerous metrics with conflicting philosophical assumptions hinder unified evaluation, especially in UMLLMs where biases propagate systemically across tasks.Method: Introduces IRIS Benchmark with demographic classifier ARES and four supporting datasets; normalizes and aggregates arbitrary fairness metrics into a high-dimensional “fairness space” across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability.
Result: Evaluation of leading UMLLMs reveals systemic phenomena: “generation gap”, individual inconsistencies like “personality splits”, and “counter-stereotype reward”, while providing diagnostics to guide fairness optimization.
Conclusion: IRIS Benchmark offers a novel, extensible framework capable of integrating evolving fairness metrics to resolve the “Tower of Babel” impasse in UMLLM fairness evaluation.
Abstract: As artificial intelligence (AI) is increasingly deployed across domains, ensuring fairness has become a core challenge. However, the field faces a “Tower of Babel’’ dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms-particularly in unified Multimodal Large Language Models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation tasks in UMLLMs. Enabled by our demographic classifier, ARES, and four supporting large-scale datasets, the benchmark is designed to normalize and aggregate arbitrary metrics into a high-dimensional “fairness space’’, integrating 60 granular metrics across three dimensions-Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Through this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the “generation gap’’, individual inconsistencies like “personality splits’’, and the “counter-stereotype reward’’, while offering diagnostics to guide the optimization of their fairness capabilities. With its novel and extensible framework, the IRIS benchmark is capable of integrating evolving fairness metrics, ultimately helping to resolve the “Tower of Babel’’ impasse. Project Page: https://iris-benchmark-web.vercel.app/
[842] Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling
Yungang Yi, Weihua Li, Matthew Kuo, Catherine Shi, Quan Bai
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2602.19816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[843] Heterophily-Agnostic Hypergraph Neural Networks with Riemannian Local Exchanger
Li Sun, Ming Zhang, Wenxin Jin, Zhongtian Sun, Zhenhao Huang, Hao Peng, Sen Su, Philip Yu
Main category: cs.AI
TL;DR: HealHGNN: A hypergraph neural network using Riemannian geometry and adaptive heat exchangers to handle both homophilic and heterophilic hypergraphs by modeling long-range dependencies while preventing oversquashing.
Details
Motivation: Traditional Hypergraph Neural Networks (HGNNs) struggle with heterophilic hypergraphs that require long-range dependency modeling, as they are based on homophily assumptions from message-passing graph neural networks.Method: Uses Riemannian geometry to connect oversquashing with hypergraph bottlenecks via heat flow. Proposes adaptive local heat exchangers that capture long-range dependencies via Robin conditions and preserve representation distinguishability via source terms. Implements HealHGNN as a node-hyperedge bidirectional system with linear complexity.
Result: HealHGNN achieves state-of-the-art performance on both homophilic and heterophilic hypergraphs in extensive experiments.
Conclusion: The proposed Riemannian geometry approach with adaptive local heat exchangers enables heterophily-agnostic message passing with theoretical guarantees, overcoming limitations of traditional HGNNs.
Abstract: Hypergraphs are the natural description of higher-order interactions among objects, widely applied in social network analysis, cross-modal retrieval, etc. Hypergraph Neural Networks (HGNNs) have become the dominant solution for learning on hypergraphs. Traditional HGNNs are extended from message passing graph neural networks, following the homophily assumption, and thus struggle with the prevalent heterophilic hypergraphs that call for long-range dependence modeling. In this paper, we achieve heterophily-agnostic message passing through the lens of Riemannian geometry. The key insight lies in the connection between oversquashing and hypergraph bottleneck within the framework of Riemannian manifold heat flow. Building on this, we propose the novel idea of locally adapting the bottlenecks of different subhypergraphs. The core innovation of the proposed mechanism is the design of an adaptive local (heat) exchanger. Specifically, it captures the rich long-range dependencies via the Robin condition, and preserves the representation distinguishability via source terms, thereby enabling heterophily-agnostic message passing with theoretical guarantees. Based on this theoretical foundation, we present a novel Heat-Exchanger with Adaptive Locality for Hypergraph Neural Network (HealHGNN), designed as a node-hyperedge bidirectional systems with linear complexity in the number of nodes and hyperedges. Extensive experiments on both homophilic and heterophilic cases show that HealHGNN achieves the state-of-the-art performance.
[844] When Is Diversity Rewarded in Cooperative Multi-Agent Learning?
Michael Amir, Matteo Bettini, Amanda Prorok
Main category: cs.AI
TL;DR: Unable to analyze paper 2506.09434 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract content is unavailable due to API rate limitingMethod: Cannot determine method as abstract content is unavailable due to API rate limiting
Result: Cannot determine results as abstract content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as abstract content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2506.09434: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09434&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[845] Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction
Xiang Li, Jiabao Gao, Sipei Lin, Xuan Zhou, Chi Zhang, Bo Cheng, Jiale Han, Benyou Wang
Main category: cs.AI
TL;DR: Paper ID 2602.24080 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.24080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[846] Machine Learning Grade Prediction Using Students’ Grades and Demographics
Mwayi Sonkhanani, Symon Chibaya, Clement N. Nyirenda
Main category: cs.AI
TL;DR: A unified ML framework predicts both pass/fail classification and continuous grades for secondary students to identify at-risk students early and reduce repetition rates.
Details
Motivation: Student repetition in secondary education imposes significant resource burdens, especially in resource-constrained contexts. There's a need for early identification of at-risk students to enable timely interventions and optimize resource allocation.Method: Proposes a unified ML framework that simultaneously predicts pass/fail outcomes (classification) and continuous grades (regression). Evaluates six models: Logistic Regression, Decision Tree, Random Forest for classification; Linear Regression, Decision Tree Regressor, Random Forest Regressor for regression. Uses hyperparameter optimization via exhaustive grid search on academic and demographic data from 4424 secondary school students.
Result: Classification models achieved accuracies up to 96%, while regression models attained a coefficient of determination of 0.70, surpassing baseline approaches. Results confirm feasibility of early, data-driven identification of at-risk students.
Conclusion: The unified framework offers a practical pathway to reducing grade repetition and optimizing resource allocation by enabling timely, personalized interventions through dual-task prediction for more comprehensive insights.
Abstract: Student repetition in secondary education imposes significant resource burdens, particularly in resource-constrained contexts. Addressing this challenge, this study introduces a unified machine learning framework that simultaneously predicts pass/fail outcomes and continuous grades, a departure from prior research that treats classification and regression as separate tasks. Six models were evaluated: Logistic Regression, Decision Tree, and Random Forest for classification, and Linear Regression, Decision Tree Regressor, and Random Forest Regressor for regression, with hyperparameters optimized via exhaustive grid search. Using academic and demographic data from 4424 secondary school students, classification models achieved accuracies of up to 96%, while regression models attained a coefficient of determination of 0.70, surpassing baseline approaches. These results confirm the feasibility of early, data-driven identification of at-risk students and highlight the value of integrating dual-task prediction for more comprehensive insights. By enabling timely, personalized interventions, the framework offers a practical pathway to reducing grade repetition and optimizing resource allocation.
[847] Emergent Coordination in Multi-Agent Language Models
Christoph Riedl
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access limitationsMethod: Unable to determine method due to API access limitations
Result: Unable to determine results due to API access limitations
Conclusion: Unable to determine conclusion due to API access limitations
Abstract: Failed to fetch summary for 2510.05174: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05174&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[848] TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces
Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu, Jiayi Gui, Dayong Yang, Yukuo Cen, Zhuoer Feng, Bosi Wen, Yidong Wang, Lucen Zhong, Jiamin Ren, Linfeng Zhang, Jie Tang
Main category: cs.AI
TL;DR: TraceSIR is a multi-agent framework for structured analysis and reporting of agentic execution traces to help diagnose failures in complex AI agent systems.
Details
Motivation: Agentic systems using LLMs with external tools create long, complex execution traces that are difficult to analyze for failures. Manual inspection doesn't scale, and LLMs struggle with raw traces due to length limits and unreliable reasoning.Method: TraceSIR coordinates three specialized agents: StructureAgent creates TraceFormat abstraction to compress traces while preserving behavioral info; InsightAgent performs fine-grained diagnosis; ReportAgent aggregates insights across tasks and generates comprehensive reports.
Result: Experiments on TraceBench covering three real-world agentic scenarios show TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions.
Conclusion: TraceSIR provides an effective framework for analyzing complex agentic execution traces, enabling better failure diagnosis and root cause analysis in AI agent systems.
Abstract: Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi-agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine-grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real-world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at https://github.com/SHU-XUN/TraceSIR.
[849] Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, Tao Lin
Main category: cs.AI
TL;DR: Failed to fetch summary for arXiv ID 2510.26585 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as the paper summary could not be retrieved from arXiv APIMethod: Unknown - paper content not accessible due to API rate limiting
Result: No results available - technical error prevented paper retrieval
Conclusion: Cannot analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2510.26585: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26585&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[850] LiTS: A Modular Framework for LLM Tree Search
Xinzhe Li, Yaguang Tao
Main category: cs.AI
TL;DR: LiTS is a modular Python framework for LLM reasoning via tree search with reusable components (Policy, Transition, RewardModel) that plug into algorithms like MCTS and BFS, demonstrated on language reasoning, planning, and tool use tasks.
Details
Motivation: To create a flexible, reusable framework for LLM reasoning via tree search that separates concerns between domain-specific components and search algorithms, enabling both domain experts and algorithmic researchers to extend the system.Method: Decomposes tree search into three reusable components: Policy (LLM for action selection), Transition (environment dynamics), and RewardModel (evaluation). Uses decorator-based registry for extensibility, supports algorithms like MCTS and BFS across multiple domains.
Result: Demonstrated composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing components are reusable across algorithms within task types, and algorithms work across all components and domains. Found policy diversity (not reward quality) is bottleneck in infinite action spaces.
Conclusion: LiTS provides a modular framework for LLM reasoning via tree search that successfully separates domain-specific components from search algorithms, enabling reuse and extensibility across different reasoning tasks and domains.
Abstract: LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.
[851] InfoPO: Information-Driven Policy Optimization for User-Centric Agents
Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu
Main category: cs.AI
TL;DR: InfoPO introduces information-driven policy optimization for multi-turn LLM agent interactions, using information-gain rewards to credit turns that reduce uncertainty and improve downstream decisions.
Details
Motivation: Real-world LLM agent requests are often underspecified, requiring multi-turn interactions to acquire missing information. Current methods using trajectory-level rewards suffer from credit assignment problems and insufficient advantage signals within rollout groups.Method: Frames multi-turn interaction as active uncertainty reduction, computes information-gain reward that credits turns whose feedback measurably changes agent’s subsequent action distribution compared to masked-feedback counterfactual, combines with task outcomes via adaptive variance-gated fusion.
Result: Outperforms prompting and multi-turn RL baselines across diverse tasks including intent clarification, collaborative coding, and tool-augmented decision making. Demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks.
Conclusion: InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration by identifying valuable interaction turns at fine granularity to drive more targeted learning.
Abstract: Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent’s subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
[852] K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control
Zhe Wu, Donglin Mo, Hongjin Lu, Junliang Xing, Jianheng Liu, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Yuanchun Shi
Main category: cs.AI
TL;DR: K2-Agent: A hierarchical framework for mobile device control that separates declarative (what) and procedural (how) knowledge, achieving strong performance on AndroidWorld benchmark with dual generalization capabilities.
Details
Motivation: Existing mobile device control agents perform poorly on complex tasks requiring long-horizon planning and precise operations due to lack of task experience and skill execution familiarity.Method: Hierarchical framework with high-level reasoner using Summarize-Reflect-Locate-Revise (SRLR) loop for declarative knowledge refinement, and low-level executor trained with curriculum-guided Group Relative Policy Optimization (C-GRPO) with dynamic demonstration injection.
Result: Achieves 76.1% success rate on AndroidWorld benchmark using only raw screenshots and open-source backbones, with strong dual generalization across models and unseen tasks.
Conclusion: K2-Agent demonstrates effective separation and co-evolution of declarative and procedural knowledge for mobile device control, enabling robust performance and generalization.
Abstract: Existing mobile device control agents often perform poorly when solving complex tasks requiring long-horizon planning and precise operations, typically due to a lack of relevant task experience or unfamiliarity with skill execution. We propose K2-Agent, a hierarchical framework that models human-like cognition by separating and co-evolving declarative (knowing what) and procedural (knowing how) knowledge for planning and execution. K2-Agent’s high level reasoner is bootstrapped from a single demonstration per task and runs a Summarize-Reflect-Locate-Revise (SRLR) loop to distill and iteratively refine task-level declarative knowledge through self-evolution. The low-level executor is trained with our curriculum-guided Group Relative Policy Optimization (C-GRPO), which (i) constructs a balanced sample pool using decoupled reward signals and (ii) employs dynamic demonstration injection to guide the model in autonomously generating successful trajectories for training. On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw screenshots and open-source backbones. Furthermore, K2-Agent shows powerful dual generalization: its high-level declarative knowledge transfers across diverse base models, while its low-level procedural skills achieve competitive performance on unseen tasks in ScreenSpot-v2 and Android-in-the-Wild (AitW).
[853] MemPO: Self-Memory Policy Optimization for Long-Horizon Agents
Ruoran Li, Xinghua Zhang, Haiyang Yu, Shitong Duan, Xiang Li, Wenxin Xiang, Chonghua Liao, Xudong Guo, Yongbin Li, Jinli Suo
Main category: cs.AI
TL;DR: MemPO is a self-memory policy optimization algorithm that enables agents to autonomously manage and summarize memory during environment interaction, reducing token consumption while improving task performance.
Details
Motivation: Long-horizon agents face growing context size during environment interaction, which degrades performance and stability. Existing methods use external memory modules that prevent proactive memory management aligned with task objectives.Method: Proposes self-memory policy optimization (MemPO) where the policy model autonomously summarizes and manages memory. Uses improved credit assignment based on memory effectiveness to selectively retain crucial information.
Result: Achieves 25.98% F1 score gain over base model and 7.1% over previous SOTA, while reducing token usage by 67.58% and 73.12% respectively.
Conclusion: MemPO enables efficient memory management for long-horizon agents, significantly reducing context size while improving performance through autonomous memory summarization.
Abstract: Long-horizon agents face the challenge of growing context size during interaction with environment, which degrades the performance and stability. Existing methods typically introduce the external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content and aligning with the agent’s overarching task objectives. To address these limitations, we propose the self-memory policy optimization algorithm (MemPO), which enables the agent (policy model) to autonomously summarize and manage their memory during interaction with environment. By improving the credit assignment mechanism based on memory effectiveness, the policy model can selectively retain crucial information, significantly reducing token consumption while preserving task performance. Extensive experiments and analyses confirm that MemPO achieves absolute F1 score gains of 25.98% over the base model and 7.1% over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%.
[854] COMRES-VLM: Coordinated Multi-Robot Exploration and Search using Vision Language Models
Ruiyang Wang, Hao-Lun Hsu, David Hunt, Jiwoo Kim, Shaocheng Luo, Miroslav Pajic
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.26324: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26324&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[855] AIoT-based Continuous, Contextualized, and Explainable Driving Assessment for Older Adults
Yimeng Liu, Fangwei Zhang, Maolin Gan, Jialuo Du, Jingkai Lin, Yawen Wang, Fei Sun, Honglei Chen, Linda Hill, Ruofeng Liu, Tianxing Li, Zhichao Cao
Main category: cs.AI
TL;DR: AURA: AIoT framework for continuous real-world driving safety assessment in older adults using in-vehicle sensing and multi-scale behavioral modeling
Details
Motivation: Aging population creates driving safety challenges; current assessment methods are infrequent and fail to capture real-world driving behavior; need for continuous monitoring that reflects actual on-road performanceMethod: AIoT framework integrating in-vehicle sensing, multi-scale behavioral modeling, and context-aware analysis to extract detailed driving performance indicators; organizes fine-grained actions into behavioral trajectories and separates age-related changes from situational factors
Result: Proposed framework for continuous, real-world assessment of driving safety; enables proactive, individualized support for older drivers through privacy-preserving edge architecture
Conclusion: AURA provides foundation for reliable real-world monitoring systems that promote safer aging behind the wheel; outlines design principles and research opportunities for building such systems
Abstract: The world is undergoing a major demographic shift as older adults become a rapidly growing share of the population, creating new challenges for driving safety. In car-dependent regions such as the United States, driving remains essential for independence, access to services, and social participation. At the same time, aging can introduce gradual changes in vision, attention, reaction time, and driving control that quietly reduce safety. Today’s assessment methods rely largely on infrequent clinic visits or simple screening tools, offering only a brief snapshot and failing to reflect how an older adult actually drives on the road. Our work starts from the observation that everyday driving provides a continuous record of functional ability and captures how a driver responds to traffic, navigates complex roads, and manages routine behavior. Leveraging this insight, we propose AURA, an Artificial Intelligence of Things (AIoT) framework for continuous, real-world assessment of driving safety among older adults. AURA integrates richer in-vehicle sensing, multi-scale behavioral modeling, and context-aware analysis to extract detailed indicators of driving performance from routine trips. It organizes fine-grained actions into longer behavioral trajectories and separates age-related performance changes from situational factors such as traffic, road design, or weather. By integrating sensing, modeling, and interpretation within a privacy-preserving edge architecture, AURA provides a foundation for proactive, individualized support that helps older adults drive safely. This paper outlines the design principles, challenges, and research opportunities needed to build reliable, real-world monitoring systems that promote safer aging behind the wheel.
[856] The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
Shrey Shah, Levent Ozgur
Main category: cs.AI
TL;DR: SWEB benchmark tests language agents’ vulnerability to adversarial search ranking, showing catastrophic accuracy collapse when misleading information appears prominently in search results despite access to truthful sources.
Details
Motivation: Language agents increasingly act as web-enabled systems but face unreliable or adversarial content. Current benchmarks can't causally isolate vulnerabilities to adversarial ranking where misleading information appears prominently in search results, and mitigation strategies remain untested under such conditions.Method: Introduces Synthetic Web Benchmark (SWEB) - a procedurally generated environment with thousands of hyperlinked articles with ground-truth labels for credibility/factuality, process-level interaction traces, and contamination filtering. Tests six frontier models by injecting a single high-plausibility misinformation article into controllable search ranks to measure causal effect of adversarial exposure.
Result: Reveals catastrophic failures: accuracy collapses despite unlimited access to truthful sources, with minimal search escalation and severe miscalibration. Shows fundamental limitations in how current frontier models handle conflicting information.
Conclusion: Exposes critical vulnerabilities in language agents to adversarial ranking with immediate implications for high-stakes deployment. Provides benchmark for systematic analysis of failure modes and testbed for evaluating mitigation strategies, establishing baseline for developing search-robust agents.
Abstract: Language agents increasingly act as web-enabled systems that search, browse, and synthesize information from diverse sources. However, these sources can include unreliable or adversarial content, and the robustness of agents to adversarial ranking - where misleading information appears prominently in search results - remains poorly understood. Existing benchmarks evaluate functional navigation or static factuality but cannot causally isolate this vulnerability, and current mitigation strategies for retrieval-augmented generation remain largely untested under such conditions. We introduce Synthetic Web Benchmark, a procedurally generated environment comprising thousands of hyperlinked articles with ground-truth labels for credibility and factuality, process-level interaction traces, and contamination filtering to eliminate training-data leakage. By injecting a single high-plausibility misinformation article into a controllable search rank, we measure the causal effect of adversarial exposure in six frontier models. The results reveal catastrophic failures: accuracy collapses despite unlimited access to truthful sources, with minimal search escalation and severe miscalibration. These findings expose fundamental limitations in how current frontier models handle conflicting information, with immediate implications for deployment in high-stakes domains. Our benchmark enables systematic analysis of these failure modes and provides a controlled testbed for evaluating mitigation strategies under adversarial ranking - a gap in current research. This work establishes a reproducible baseline for developing search-robust and epistemically humble agents capable of resisting manipulation in high-stakes domains.
[857] MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind
Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishna
Main category: cs.AI
TL;DR: MetaMind is a cognitive world model for multi-agent systems that uses meta-theory-of-mind to enable agents to learn metacognitive abilities through self-supervised bidirectional inference and generalize them to understand other agents via analogical reasoning.
Details
Motivation: Multi-agent systems need world models that can understand interdependent agent dynamics, predict interactive trajectories, and plan with collective awareness without centralized supervision or explicit communication. Current approaches lack metacognitive abilities for agents to reason about others' goals and beliefs.Method: Proposes MetaMind with a meta-theory-of-mind framework where each agent learns bidirectional inference: predicting/planning over own beliefs AND inversely reasoning goals/beliefs from own behavior trajectories. This self-reflective loop enables self-supervised metacognitive learning, then generalized to third-person perspective via analogical reasoning.
Result: MetaMind achieves superior task performance and outperforms baselines in few-shot multi-agent generalization across diverse multi-agent tasks. Agents can actively reason about others’ goals/beliefs from limited observable behavior trajectories in zero-shot manner and adapt to emergent collective intention.
Conclusion: MetaMind provides a general cognitive world model for multi-agent systems that enables metacognitive abilities through self-supervised learning and analogical reasoning, allowing agents to understand and adapt to collective intentions without explicit communication.
Abstract: A major challenge for world models in multi-agent systems is to understand interdependent agent dynamics, predict interactive multi-agent trajectories, and plan over long horizons with collective awareness, without centralized supervision or explicit communication. In this paper, MetaMind, a general and cognitive world model for multi-agent systems that leverages a novel meta-theory of mind (Meta-ToM) framework, is proposed. Through MetaMind, each agent learns not only to predict and plan over its own beliefs, but also to inversely reason goals and beliefs from its own behavior trajectories. This self-reflective, bidirectional inference loop enables each agent to learn a metacognitive ability in a self-supervised manner. Then, MetaMind is shown to generalize the metacognitive ability from first-person to third-person through analogical reasoning. Thus, in multi-agent systems, each agent with MetaMind can actively reason about goals and beliefs of other agents from limited, observable behavior trajectories in a zero-shot manner, and then adapt to emergent collective intention without an explicit communication mechanism. Extended simulation results on diverse multi-agent tasks demonstrate that MetaMind can achieve superior task performance and outperform baselines in few-shot multi-agent generalization.
[858] MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, Jingrui He
Main category: cs.AI
TL;DR: MC-Search is a benchmark for agentic multimodal retrieval-augmented generation with long, step-wise reasoning chains across five reasoning structures, featuring process-level evaluation metrics.
Details
Motivation: Existing benchmarks focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. There's a need for evaluating agentic MM-RAG systems with complex, step-wise reasoning.Method: Created MC-Search benchmark with 3,333 high-quality examples averaging 3.7 hops, featuring step-wise annotated reasoning chains with sub-questions, retrieval modalities, supporting facts, and intermediate answers. Introduced HAVE (Hop-wise Attribution and Verification of Evidence) for quality assurance. Developed process-level metrics for reasoning quality and retrieval planning accuracy.
Result: Benchmarked six leading MLLMs using a unified agentic MM-RAG pipeline, revealing systematic issues like over-/under-retrieval and modality-misaligned planning. Introduced Search-Align, a process-supervised fine-tuning framework that improves planning and retrieval fidelity in open-source MLLMs.
Conclusion: MC-Search enables faithful evaluation of agentic MM-RAG systems and provides data that improves planning and retrieval capabilities in MLLMs, addressing gaps in current multimodal reasoning benchmarks.
Abstract: With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
[859] HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
Hongbo Jin, Rongpeng Zhu, Jiayu Ding, Wenhao Zhang, Ge Li
Main category: cs.AI
TL;DR: HiMAC is a hierarchical RL framework for LLM agents that decomposes long-horizon tasks into macro-level planning and micro-level execution, improving efficiency and reducing error propagation.
Details
Motivation: Current LLM agents use flat autoregressive policies that generate reasoning and actions in a single sequence, leading to inefficient exploration and error propagation in long-horizon tasks.Method: Hierarchical framework with structured blueprint generation followed by goal-conditioned execution, using critic-free hierarchical policy optimization and iterative co-evolution training.
Result: Outperforms prompting and RL baselines on ALFWorld, WebShop, and Sokoban, achieving state-of-the-art performance with improved sample efficiency.
Conclusion: Structured hierarchy is more important than model scale for enabling robust long-horizon agentic intelligence in LLM-based systems.
Abstract: Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long-horizon decision-making into macro-level planning and micro-level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution, enabling robust long-horizon planning within LLM-based agents. To train this hierarchy efficiently, we introduce a critic-free hierarchical policy optimization paradigm that extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co-evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non-stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state-of-the-art performance and substantially improved sample efficiency across both text-based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
[860] Multi-Agent Reinforcement Learning with Communication-Constrained Priors
Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo, Xingguo Chen, Yang Gao
Main category: cs.AI
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv
Details
Motivation: Unable to determine motivation due to failed abstract retrievalMethod: Unable to determine method due to failed abstract retrieval
Result: Unable to determine results due to failed abstract retrieval
Conclusion: Unable to draw conclusions due to failed abstract retrieval
Abstract: Failed to fetch summary for 2512.03528: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03528&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[861] Tracking Capabilities for Safer Agents
Martin Odersky, Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham
Main category: cs.AI
TL;DR: AI agents generate capability-safe Scala code instead of direct tool calls, using type systems to prevent information leaks and malicious side effects while maintaining task performance.
Details
Motivation: AI agents interacting with the real world through tool calls pose safety challenges including information leakage, unintended side effects, and prompt injection vulnerabilities. Current approaches lack fine-grained control over agent capabilities and resource access.Method: Propose a “safety harness” where agents express intentions as code in Scala 3 with capture checking. The type system tracks capabilities statically, enabling local purity enforcement to prevent side effects and information leakage when processing sensitive data.
Result: Agents can generate capability-safe code with no significant loss in task performance. The type system reliably prevents unsafe behaviors like information leakage and malicious side effects, demonstrating that extensible agent safety harnesses can be built using strong type systems with tracked capabilities.
Conclusion: Programming-language-based safety harnesses using capability-safe languages like Scala 3 with capture checking provide effective protection against AI agent safety risks while maintaining functionality, offering a promising approach for safe real-world agent deployment.
Abstract: AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based “safety harness”: instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala’s type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.
[862] CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration
Yiyue Qian, Shinan Zhang, Yun Zhou, Haibo Ding, Diego Socolinsky, Yi Zhang
Main category: cs.AI
TL;DR: CollabEval is a multi-agent evaluation framework that uses collaborative three-phase process (initial evaluation, multi-round discussion, final judgment) to improve LLM-based content evaluation by reducing biases and inconsistencies.
Details
Motivation: Current single-LLM evaluation approaches suffer from inconsistent judgments and inherent biases from pre-training data, limiting their reliability for AI-generated content evaluation.Method: Proposes CollabEval framework with three-phase collaborative evaluation: 1) initial evaluation by multiple agents, 2) multi-round discussion for consensus building, 3) final judgment with strategic consensus checking for efficiency.
Result: CollabEval consistently outperforms single-LLM approaches across multiple dimensions and maintains robust performance even when individual models struggle, while providing comprehensive evaluation criteria support.
Conclusion: Multi-agent collaborative evaluation framework effectively addresses limitations of single-LLM evaluation, offering more reliable and consistent AI-generated content assessment through strategic collaboration.
Abstract: Large Language Models (LLMs) have revolutionized AI-generated content evaluation, with the LLM-as-a-Judge paradigm becoming increasingly popular. However, current single-LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre-training data. To address these limitations, we propose CollabEval, a novel multi-agent evaluation framework that implements a three-phase Collaborative Evaluation process: initial evaluation, multi-round discussion, and final judgment. Unlike existing approaches that rely on competitive debate or single-model evaluation, CollabEval emphasizes collaboration among multiple agents with strategic consensus checking for efficiency. Our extensive experiments demonstrate that CollabEval consistently outperforms single-LLM approaches across multiple dimensions while maintaining robust performance even when individual models struggle. The framework provides comprehensive support for various evaluation criteria while ensuring efficiency through its collaborative design.
[863] MMCOMET: A Large-Scale Multimodal Commonsense Knowledge Graph for Contextual Reasoning
Eileen Wang, Hiba Arnaout, Dhita Pratama, Shuo Yang, Dangyang Liu, Jie Yang, Josiah Poon, Jeff Pan, Caren Han
Main category: cs.AI
TL;DR: MMCOMET is the first multimodal commonsense knowledge graph that extends ATOMIC2020 with visual dimensions through image retrieval, creating over 900K multimodal triples for enhanced reasoning in tasks like image captioning and storytelling.
Details
Motivation: Existing multimodal knowledge graphs lack support for complex reasoning tasks like image captioning and storytelling. The authors aim to address this limitation by creating a comprehensive resource that integrates visual information with commonsense knowledge.Method: Extends the ATOMIC2020 knowledge graph by adding a visual dimension through an efficient image retrieval process, resulting in over 900K multimodal triples that combine textual commonsense knowledge with relevant visual information.
Result: Created MMCOMET with over 900K multimodal triples. In visual storytelling experiments, the holistic approach enabled generation of richer, more coherent, and contextually grounded stories compared to text-only knowledge approaches.
Conclusion: MMCOMET establishes a new foundation for multimodal commonsense reasoning and narrative generation by bridging the gap between textual commonsense knowledge and visual information.
Abstract: We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph to include a visual dimension, through an efficient image retrieval process, resulting in over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs in supporting complex reasoning tasks like image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, coherent, and contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.
[864] Alien Science: Sampling Coherent but Cognitively Unavailable Research Directions from Idea Atoms
Alejandro H. Artiles, Martin Weiss, Levin Brinkmann, Anirudh Goyal, Nasim Rahaman
Main category: cs.AI
TL;DR: A pipeline to generate novel research directions by identifying “alien” ideas that are coherent but unlikely to be proposed by typical researchers, using conceptual decomposition and availability modeling.
Details
Motivation: LLMs are good at synthesizing familiar material but struggle with generating truly novel, non-obvious research ideas that matter most for scientific advancement. The gap is formalized as "cognitive availability" - the likelihood a research direction would naturally occur to typical researchers.Method: Three-step pipeline: (1) Decompose papers into granular conceptual units, (2) Cluster recurring units into shared vocabulary of “idea atoms”, (3) Learn two models: coherence model (scores viability of atom combinations) and availability model (scores likelihood of direction being generated by community). Then sample “alien” directions with high coherence but low availability.
Result: Validated on ~7,500 recent LLM papers from NeurIPS, ICLR, and ICML: (a) conceptual units preserve paper content under reconstruction, (b) idea atoms generalize across papers rather than memorizing paper-specific phrasing, (c) Alien sampler produces more diverse research directions than LLM baselines while maintaining coherence.
Conclusion: The proposed framework successfully generates novel research directions by formalizing and bridging the gap between coherence and cognitive availability, offering a systematic approach to scientific creativity beyond LLM capabilities.
Abstract: Large language models are adept at synthesizing and recombining familiar material, yet they often fail at a specific kind of creativity that matters most in research: producing ideas that are both coherent and non-obvious to the current community. We formalize this gap through cognitive availability, the likelihood that a research direction would be naturally proposed by a typical researcher given what they have worked on. We introduce a pipeline that (i) decomposes papers into granular conceptual units, (ii) clusters recurring units into a shared vocabulary of idea atoms, and (iii) learns two complementary models: a coherence model that scores whether a set of atoms constitutes a viable direction, and an availability model that scores how likely that direction is to be generated by researchers drawn from the community. We then sample “alien” directions that score high on coherence but low on availability. On a corpus of $\sim$7,500 recent LLM papers from NeurIPS, ICLR and ICML, we validate that (a) conceptual units preserve paper content under reconstruction, (b) idea atoms generalize across papers rather than memorizing paper-specific phrasing, and (c) the Alien sampler produces research directions that are more diverse than LLM baselines while maintaining coherence.
[865] DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
Haowen Gao, Zhenyu Zhang, Liang Pang, Fangda Guo, Hongjian Dou, Guannan Lv, Shaoguo Liu, Tingting Gao, Huawei Shen, Xueqi Cheng
Main category: cs.AI
TL;DR: DIVA-GRPO: A difficulty-adaptive variant advantage method for RL with group relative policy optimization that addresses reward sparsity and advantage vanishing in MLLM reasoning tasks by dynamically adjusting variant difficulty distributions.
Details
Motivation: GRPO for MLLMs suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent, limiting reasoning capability enhancement.Method: DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling.
Result: Outperforms existing approaches on six reasoning benchmarks in both training efficiency and reasoning performance.
Conclusion: DIVA-GRPO effectively addresses reward sparsity and advantage vanishing in GRPO for MLLMs, improving reasoning capabilities.
Abstract: Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
[866] HVR-Met: A Hypothesis-Verification-Replaning Agentic System for Extreme Weather Diagnosis
Shuo Tang, Jiadong Zhang, Jian Xu, Gengxian Zhou, Qizhao Jin, Qinxuan Wang, Yi Hu, Ning Hu, Hongchang Ren, Lingli He, Jiaolan Fu, Jingtao Ding, Shiming Xiang, Chenglin Liu
Main category: cs.AI
TL;DR: HVR-Met is a multi-agent system for extreme weather diagnostics that integrates expert knowledge with a Hypothesis-Verification-Replanning closed-loop mechanism for sophisticated reasoning.
Details
Motivation: Current deep learning weather forecasting models struggle with extreme weather diagnostics, which require multi-step logical reasoning, dynamic tool invocation, and expert judgment. Existing agent architectures lack proper expert knowledge integration, professional iterative reasoning loops, and fine-grained validation systems for extreme conditions.Method: Proposes HVR-Met, a multi-agent meteorological diagnostic system with deep expert knowledge integration. Key innovation is the “Hypothesis-Verification-Replanning” closed-loop mechanism for iterative reasoning about anomalous meteorological signals during extreme weather. Also introduces a novel benchmark for atomic-level subtask evaluation.
Result: Experimental evidence shows the system excels in complex diagnostic scenarios, demonstrating effectiveness in extreme weather diagnostics.
Conclusion: HVR-Met addresses critical bottlenecks in extreme weather diagnostics through expert knowledge integration and sophisticated iterative reasoning mechanisms, with experimental validation showing strong performance.
Abstract: While deep learning-based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional-grade iterative reasoning loops, and the absence of fine-grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR-Met, a multi-agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the ``Hypothesis-Verification-Replanning’’ closed-loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic-level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.
[867] FCN-LLM: Empower LLM for Brain Functional Connectivity Network Understanding via Graph-level Multi-task Instruction Tuning
Xingcan Hu, Wei Wang, Li Xiao
Main category: cs.AI
TL;DR: FCN-LLM enables Large Language Models to understand brain functional connectivity networks from fMRI through multi-task instruction tuning, achieving strong zero-shot generalization on neuroscience tasks.
Details
Motivation: While multimodal LLMs can understand images, video, and audio, they cannot directly comprehend brain functional connectivity networks (FCNs) from fMRI data. Existing methods don't align FCNs with text modality, limiting LLMs' ability to interpret brain networks for clinical neuroscience applications.Method: Proposes FCN-LLM framework with multi-scale FCN encoder capturing brain-region, functional subnetwork, and whole-brain features, projecting them into LLM semantic space. Uses multi-paradigm instruction tasks covering 19 subject attributes across demographics, phenotypes, and psychiatric conditions. Employs multi-stage learning: first aligns FCN embeddings with LLM, then jointly fine-tunes entire model.
Result: FCN-LLM achieves strong zero-shot generalization on unseen datasets, outperforming conventional supervised and foundation models on large-scale, multi-site FCN database. Demonstrates flexible and interpretable framework for neuroscience applications.
Conclusion: Introduces new paradigm for integrating brain functional networks with LLMs, enabling direct understanding of FCNs through instruction tuning. Offers promising approach for clinical neuroscience and brain network analysis using multimodal LLM capabilities.
Abstract: Large Language Models have achieved remarkable success in language understanding and reasoning, and their multimodal extensions enable comprehension of images, video, and audio. Inspired by this, foundation models for brain functional connectivity networks derived from resting-state fMRI have shown promise in clinical tasks. However, existing methods do not align FCNs with the text modality, limiting the ability of LLMs to directly understand FCNs. To address this, we propose FCN-LLM, a framework that enables LLMs to understand FCNs through graph-level, multi-task instruction tuning. Our approach employs a multi-scale FCN encoder capturing brain-region, functional subnetwork, and whole-brain features, projecting them into the semantic space of LLM. We design multi-paradigm instruction tasks covering 19 subject-specific attributes across demographics, phenotypes, and psychiatric conditions. A multi-stage learning strategy first aligns FCN embeddings with the LLM and then jointly fine-tunes the entire model to capture high-level semantic information. Experiments on a large-scale, multi-site FCN database show that FCN-LLM achieves strong zero-shot generalization on unseen datasets, outperforming conventional supervised and foundation models. This work introduces a new paradigm for integrating brain functional networks with LLMs, offering a flexible and interpretable framework for neuroscience.
[868] AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution
Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, Liang He
Main category: cs.AI
TL;DR: AutoSkill is a lifelong learning framework that enables LLM agents to automatically derive, maintain, and reuse skills from interaction traces, turning ephemeral experience into explicit, reusable capabilities without model retraining.
Details
Motivation: Users repeatedly express stable preferences and requirements in LLM applications (e.g., reducing hallucinations, following writing conventions), but this interaction experience is seldom consolidated into reusable knowledge, preventing LLM agents from accumulating personalized capabilities across sessions.Method: AutoSkill abstracts skills from user experience, supports continual self-evolution, and dynamically injects relevant skills into future requests without retraining the underlying model. It’s designed as a model-agnostic plugin layer with standardized skill representation for sharing across agents, users, and tasks.
Result: The framework enables LLM agents to accumulate personalized capabilities across sessions, turning ephemeral interaction experience into explicit, reusable, and composable capabilities that can be shared and transferred.
Conclusion: AutoSkill provides a practical and scalable path toward lifelong personalized agents and personal digital surrogates by consolidating interaction experience into reusable skills without requiring model retraining.
Abstract: In practical LLM applications, users repeatedly express stable preferences and requirements, such as reducing hallucinations, following institutional writing conventions, or avoiding overly technical wording, yet such interaction experience is seldom consolidated into reusable knowledge. Consequently, LLM agents often fail to accumulate personalized capabilities across sessions. We present AutoSkill, an experience-driven lifelong learning framework that enables LLM agents to automatically derive, maintain, and reuse skills from dialogue and interaction traces. AutoSkill abstracts skills from user experience, supports their continual self-evolution, and dynamically injects relevant skills into future requests without retraining the underlying model. Designed as a model-agnostic plugin layer, it is compatible with existing LLMs and introduces a standardized skill representation for sharing and transfer across agents, users, and tasks. In this way, AutoSkill turns ephemeral interaction experience into explicit, reusable, and composable capabilities. This paper describes the motivation, architecture, skill lifecycle, and implementation of AutoSkill, and positions it with respect to prior work on memory, retrieval, personalization, and agentic systems. AutoSkill highlights a practical and scalable path toward lifelong personalized agents and personal digital surrogates.
[869] DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao
Main category: cs.AI
TL;DR: DeepResearch-9K dataset and DeepResearch-R1 framework for training deep-research agents with multi-step web exploration capabilities
Details
Motivation: Address two critical bottlenecks in deep-research agents: lack of large-scale challenging datasets with real-world difficulty, and absence of accessible open-source frameworks for data synthesis and agent trainingMethod: Construct DeepResearch-9K dataset from open-source multi-hop QA datasets via autonomous pipeline, and develop DeepResearch-R1 framework supporting multi-turn web interactions, different RL approaches, and various reward models
Result: Agents trained on DeepResearch-9K under DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks
Conclusion: The dataset and framework bridge critical gaps in deep-research agent development, enabling better training and evaluation of agents for complex web exploration tasks
Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale challenging dataset specifically designed for deep-research scenarios built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3 (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework DeepResearch-R1 that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models such as rule-based outcome reward and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset on https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 on https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
[870] Semantic XPath: Structured Agentic Memory Access for Conversational AI
Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, Scott Sanner
Main category: cs.AI
TL;DR: Semantic XPath introduces tree-structured memory for conversational AI to efficiently access and update structured memory, outperforming flat-RAG baselines while using far fewer tokens than in-context memory approaches.
Details
Motivation: Current conversational AI systems struggle with long-term, task-oriented interactions due to context-window limits in in-context memory approaches and the limitations of flat memory structures in RAG-based methods that ignore memory organization.Method: Proposes Semantic XPath, a tree-structured memory module that uses XPath-like queries to access and update structured conversational memory, enabling efficient retrieval while maintaining memory organization.
Result: Semantic XPath improves performance over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory. Also introduces SemanticXPath Chat, an end-to-end demo system with memory visualization.
Conclusion: The paper demonstrates a promising approach for next-generation long-term, task-oriented conversational AI systems built on structured memory, addressing scalability and efficiency challenges of current methods.
Abstract: Conversational AI (ConvAI) agents increasingly maintain structured memory to support long-term, task-oriented interactions. In-context memory approaches append the growing history to the model input, which scales poorly under context-window limits. RAG-based methods retrieve request-relevant information, but most assume flat memory collections and ignore structure. We propose Semantic XPath, a tree-structured memory module to access and update structured conversational memory. Semantic XPath improves performance over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory. We also introduce SemanticXPath Chat, an end-to-end ConvAI demo system that visualizes the structured memory and query execution details. Overall, this paper demonstrates a candidate for the next generation of long-term, task-oriented ConvAI systems built on structured memory.
[871] Incremental LTLf Synthesis
Giuseppe De Giacomo, Yves Lespérance, Gianmarco Parretti, Fabio Patrizi, Moshe Y. Vardi
Main category: cs.AI
TL;DR: Incremental LTLf synthesis for reactive systems where goals arrive during execution, requiring strategy adaptation while maintaining original goals.
Details
Motivation: Real-world reactive systems often receive new goals while already executing strategies for existing goals, requiring efficient adaptation without violating previous commitments.Method: Two approaches: 1) Leveraging auxiliary data structures from automata-based synthesis for efficient incremental updates, 2) LTLf formula progression technique with bounded automata size despite formula growth.
Result: Formula progression generates exponentially larger formulas but their minimal automata remain bounded by original formula size. However, naive implementation (recomputing automata from scratch) is not competitive.
Conclusion: Incremental LTLf synthesis is feasible with efficient data structure reuse, but naive formula progression approaches need optimization to be practical.
Abstract: In this paper, we study incremental LTLf synthesis – a form of reactive synthesis where the goals are given incrementally while in execution. In other words, the protagonist agent is already executing a strategy for a certain goal when it receives a new goal: at this point, the agent has to abandon the current strategy and synthesize a new strategy still fulfilling the original goal, which was given at the beginning, as well as the new goal, starting from the current instant. In this paper, we formally define the problem of incremental synthesis and study its solution. We propose a solution technique that efficiently performs incremental synthesis for multiple LTLf goals by leveraging auxiliary data structures constructed during automata-based synthesis. We also consider an alternative solution technique based on LTLf formula progression. We show that, in spite of the fact that formula progression can generate formulas that are exponentially larger than the original ones, their minimal automata remain bounded in size by that of the original formula. On the other hand, we show experimentally that, if implemented naively, i.e., by actually computing the automaton of the progressed LTLf formulas from scratch every time a new goal arrives, the solution based on formula progression is not competitive.
[872] How Well Does Agent Development Reflect Real-World Work?
Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig
Main category: cs.AI
TL;DR: This paper analyzes the mismatch between AI agent benchmarking efforts and real-world human work distribution, revealing programming-centric bias in current agent development compared to broader labor market needs.
Details
Motivation: The motivation is to understand how representative AI agent benchmarking efforts are of the actual labor market, as there's uncertainty about whether current evaluation benchmarks align with real-world human work distribution and economic value.Method: The researchers systematically mapped 43 benchmarks and 72,342 tasks to real-world work domains and skills, analyzing alignment with all 1,016 U.S. occupations. They measured agent autonomy levels within targeted work areas and proposed three principles for better benchmark design.
Result: The study revealed substantial mismatches between agent development (which tends to be programming-centric) and the categories where human labor and economic value are concentrated. The analysis provides practical guidance for agent interaction strategies across work scenarios.
Conclusion: The paper concludes by proposing three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.
Abstract: AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development that tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within work areas that agents currently target, we further characterize current agent utility by measuring their autonomy levels, providing practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.
[873] Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
Victor May, Aaditya Salgarkar, Yishan Wang, Diganta Misra, Huu Nguyen
Main category: cs.AI
TL;DR: Tool-augmented LLMs trained with state persistence vs stateless execution show similar solution quality but differ in efficiency and stability, highlighting the importance of aligning training data semantics with deployment runtimes.
Details
Motivation: To investigate whether state persistence in tool-augmented LLM agents is merely an inference-time scaffold or if models can learn to exploit it when training data exposes corresponding execution semantics.Method: Created Opaque Knapsack tasks requiring multi-turn control flow with hidden item attributes behind budgeted tool calls. Fine-tuned identical base models (Qwen3-8B) on paired trajectories differing only in whether interpreter state persists across steps or resets after each action, then evaluated all four train-runtime combinations.
Result: Solution quality statistically indistinguishable across conditions, but token cost and stability differ substantially. Persistent-trained models in stateless runtime trigger missing-variable errors in ~80% of episodes; stateless-trained models in persistent runtime redundantly re-derive retained state using ~3.5x more tokens.
Conclusion: Interpreter persistence should be treated as a first-class semantic of agent traces. Aligning fine-tuning data with deployment runtimes improves efficiency and reduces brittle train-runtime mismatches.
Abstract: Tool-augmented LLMs are increasingly deployed as agents that interleave natural-language reasoning with executable Python actions, as in CodeAct-style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, common training pipelines treat agent traces as token sequences, with execution semantics left implicit. This raises a data-centric question: Is state persistence merely an inference-time scaffold, or can models learn to exploit it when training data exposes the corresponding execution semantics? We isolate state persistence as a training-time variable. We introduce Opaque Knapsack, a procedurally generated family of partially observable optimization tasks designed to prevent one-shot solutions. Item attributes and constraints are hidden behind budgeted tool calls, forcing multi-turn control flow and iterative state revision. Holding task instances, prompts, tools, model, and supervision fixed, we generate paired trajectories differing only in whether interpreter state persists across steps or resets after each action. We then fine-tune identical base models (Qwen3-8B) on each trace variant and evaluate all four train-runtime combinations. Our 2x2 cross-evaluation shows that execution semantics primarily affect how agents reach solutions, not whether they do: solution quality is statistically indistinguishable across conditions, but token cost and stability differ substantially. A persistent-trained model in a stateless runtime triggers missing-variable errors in roughly 80% of episodes; a stateless-trained model in a persistent runtime redundantly re-derives retained state, using roughly 3.5x more tokens. Interpreter persistence should be treated as a first-class semantic of agent traces. Aligning fine-tuning data with deployment runtimes improves efficiency and reduces brittle train-runtime mismatches.
[874] A Unified Framework to Quantify Cultural Intelligence of AI
Sunipa Dev, Vinodkumar Prabhakaran, Rutledge Chin Feman, Aida Davani, Remi Denton, Charu Kalia, Piyawat L Kumjorn, Madhurima Maji, Rida Qadri, Negar Rostamzadeh, Renee Shelby, Romina Stella, Hayk Stepanyan, Erin van Liemt, Aishwarya Verma, Oscar Wahltinez, Edem Wornyo, Andrew Zaldivar, Saška Mojsilović
Main category: cs.AI
TL;DR: A principled framework for systematically evaluating cultural intelligence in AI systems, focusing on aggregating multifaceted cultural indicators into unified assessments.
Details
Motivation: As generative AI technologies expand globally, there's an urgent need to assess their competence in different cultural contexts. Current cultural benchmarking efforts are fragmented and focus on specific aspects, lacking a unified, systematic approach for comprehensive cultural assessment at scale.Method: Develops a framework based on measurement theory to aggregate cultural indicators into unified cultural intelligence assessments. Starts with defining core cultural domains, then creates a systematic framework that decouples the concept of cultural intelligence from its operationalization via measurement. Conceptualizes cultural intelligence as a suite of capabilities across diverse domains, operationalized through reliable measurement indicators.
Result: Presents a broad-purpose, systematic, and extensible framework for assessing AI cultural intelligence. Identifies considerations, challenges, and research pathways for meaningful measurement, focusing on data collection, probing strategies, and evaluation metrics.
Conclusion: Provides a principled approach to comprehensively evaluate AI cultural capabilities at scale, addressing the gap in current fragmented cultural benchmarking efforts.
Abstract: As generative AI technologies are increasingly being launched across the globe, assessing their competence to operate in different cultural contexts is exigently becoming a priority. While recent years have seen numerous and much-needed efforts on cultural benchmarking, these efforts have largely focused on specific aspects of culture and evaluation. While these efforts contribute to our understanding of cultural competence, a unified and systematic evaluation approach is needed for us as a field to comprehensively assess diverse cultural dimensions at scale. Drawing on measurement theory, we present a principled framework to aggregate multifaceted indicators of cultural capabilities into a unified assessment of cultural intelligence. We start by developing a working definition of culture that includes identifying core domains of culture. We then introduce a broad-purpose, systematic, and extensible framework for assessing cultural intelligence of AI systems. Drawing on theoretical framing from psychometric measurement validity theory, we decouple the background concept (i.e., cultural intelligence) from its operationalization via measurement. We conceptualize cultural intelligence as a suite of core capabilities spanning diverse domains, which we then operationalize through a set of indicators designed for reliable measurement. Finally, we identify the considerations, challenges, and research pathways to meaningfully measure these indicators, specifically focusing on data collection, probing strategies, and evaluation metrics.
[875] The Lattice Representation Hypothesis of Large Language Models
Bo Xiong
Main category: cs.AI
TL;DR: LLMs encode symbolic concept lattices in embedding geometry, bridging continuous representations with logical hierarchies through Formal Concept Analysis.
Details
Motivation: To understand how large language models encode conceptual hierarchies and logical operations in their embedding geometry, bridging the gap between continuous neural representations and symbolic reasoning.Method: Proposes Lattice Representation Hypothesis unifying Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing linear attribute directions with thresholds induce concept lattices via half-space intersections, enabling geometric meet/join operations.
Result: Experiments on WordNet sub-hierarchies provide empirical evidence that LLM embeddings encode concept lattices and their logical structure, demonstrating the framework’s validity.
Conclusion: LLMs encode symbolic concept lattices in their embedding geometry, creating a principled bridge between continuous geometry and symbolic abstraction that enables logical reasoning.
Abstract: We propose the Lattice Representation Hypothesis of large language models: a symbolic backbone that grounds conceptual hierarchies and logical operations in embedding geometry. Our framework unifies the Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing that linear attribute directions with separating thresholds induce a concept lattice via half-space intersections. This geometry enables symbolic reasoning through geometric meet (intersection) and join (union) operations, and admits a canonical form when attribute directions are linearly independent. Experiments on WordNet sub-hierarchies provide empirical evidence that LLM embeddings encode concept lattices and their logical structure, revealing a principled bridge between continuous geometry and symbolic abstraction.
[876] Extended Empirical Validation of the Explainability Solution Space
Antoni Mestre, Manoli Albert, Miriam Gil, Vicente Pelechano
Main category: cs.AI
TL;DR: Extended validation of Explainability Solution Space (ESS) framework through cross-domain evaluation using urban resource allocation case study with multi-stakeholder governance.
Details
Motivation: To demonstrate the generality and domain-independence of the ESS framework beyond initial employee attrition prediction validation, showing it can handle diverse data types and stakeholder configurations.Method: Introduces heterogeneous intelligent urban resource allocation system as second case study, integrating tabular, temporal, and geospatial data under multi-stakeholder governance conditions with explicit quantitative positioning of XAI families.
Result: ESS rankings are not domain-specific but adapt systematically to governance roles, risk profiles, and stakeholder configurations, confirming framework’s generalizability.
Conclusion: ESS serves as a generalizable operational decision-support instrument for explainable AI strategy design across socio-technical systems.
Abstract: This technical report provides an extended validation of the Explainability Solution Space (ESS) through cross-domain evaluation. While initial validation focused on employee attrition prediction, this study introduces a heterogeneous intelligent urban resource allocation system to demonstrate the generality and domain-independence of the ESS framework. The second case study integrates tabular, temporal, and geospatial data under multi-stakeholder governance conditions. Explicit quantitative positioning of representative XAI families is provided for both contexts. Results confirm that ESS rankings are not domain-specific but adapt systematically to governance roles, risk profiles, and stakeholder configurations. The findings reinforce ESS as a generalizable operational decision-support instrument for explainable AI strategy design across socio-technical systems.
[877] Beyond Reward: A Bounded Measure of Agent Environment Coupling
Wael Hafez, Cameron Reid, Amit Nazeri
Main category: cs.AI
TL;DR: Introduces bipredictability (P) as a real-time measure of interaction effectiveness in RL systems, with an Information Digital Twin monitor that detects distribution shifts earlier than reward-based methods.
Details
Motivation: Real-world RL agents face distribution shifts in closed-loop systems where actions affect future observations. Existing monitoring relies on reward/task metrics that capture outcomes but miss early coupling failures, necessitating a principled measure of interaction effectiveness.Method: Proposes bipredictability (P) - the ratio of shared information in the observation-action-outcome loop to total available information. Develops an auxiliary monitor called Information Digital Twin (IDT) that computes P and diagnostic components from interaction streams in real-time.
Result: On MuJoCo HalfCheetah with SAC and PPO agents across 168 trials: nominal P = 0.33±0.02 (below classical 0.5 bound, revealing informational cost of action selection). IDT detected 89.3% of perturbations vs 44.0% for reward-based monitoring, with 4.4x lower median latency.
Conclusion: Bipredictability enables early detection of interaction degradation before performance drops and provides a prerequisite signal for closed-loop self-regulation in deployed RL systems, offering a principled, real-time measure comparable across tasks.
Abstract: Real-world reinforcement learning (RL) agents operate in closed-loop systems where actions shape future observations, making reliable deployment under distribution shifts a persistent challenge. Existing monitoring relies on reward or task metrics, capturing outcomes but missing early coupling failures. We introduce bipredictability (P) as the ratio of shared information in the observation, action, outcome loop to the total available information, a principled, real time measure of interaction effectiveness with provable bounds, comparable across tasks. An auxiliary monitor, the Information Digital Twin (IDT), computes P and its diagnostic components from the interaction stream. We evaluate SAC and PPO agents on MuJoCo HalfCheetah under eight agent, and environment-side perturbations across 168 trials. Under nominal operation, agents exhibit P = 0.33 plus minus 0.02, below the classical bound of 0.5, revealing an informational cost of action selection. The IDT detects 89.3% of perturbations versus 44.0% for reward based monitoring, with 4.4x lower median latency. Bipredictability enables early detection of interaction degradation before performance drops and provides a prerequisite signal for closed loop self regulation in deployed RL systems.
[878] Information-Theoretic Framework for Self-Adapting Model Predictive Controllers
Wael Hafez, Amir Nazeri
Main category: cs.AI
TL;DR: Information-theoretic Entanglement Learning framework enhances MPC adaptability using an Information Digital Twin to monitor information flow and generate real-time adaptive signals for recalibration.
Details
Motivation: Traditional MPC struggles with real-time adaptation to dynamic obstacles and changing system dynamics, lacking self-monitoring and adaptive optimization mechanisms for autonomous systems like UAVs.Method: Introduces Entanglement Learning with Information Digital Twin that quantifies information flow between MPC inputs, control actions, and UAV behavior using new information-theoretic entanglement metrics measuring mutual information dependencies.
Result: Enables detection of performance deviations and generation of real-time adaptive signals to recalibrate MPC parameters, preserving stability through dual-feedback approach leveraging information flow.
Conclusion: The framework improves MPC reliability and robustness across diverse scenarios, extending beyond UAV control to any MPC implementation requiring adaptive performance.
Abstract: Model Predictive Control (MPC) is a vital technique for autonomous systems, like Unmanned Aerial Vehicles (UAVs), enabling optimized motion planning. However, traditional MPC struggles to adapt to real-time changes such as dynamic obstacles and shifting system dynamics, lacking inherent mechanisms for self-monitoring and adaptive optimization. Here, we introduce Entanglement Learning (EL), an information-theoretic framework that enhances MPC adaptability through an Information Digital Twin (IDT). The IDT monitors and quantifies, in bits, the information flow between MPC inputs, control actions, and UAV behavior. By introducing new information-theoretic metrics we call entanglement metrics, it tracks variations in these dependencies. These metrics measure the mutual information between the optimizer’s input, its control actions, and the resulting UAV dynamics, enabling a deeper understanding of their interrelationships. This allows the IDT to detect performance deviations and generate real-time adaptive signals to recalibrate MPC parameters, preserving stability. Unlike traditional MPC, which relies on error-based feedback, this dual-feedback approach leverages information flow for proactive adaptation to evolving conditions. Scalable and leveraging existing infrastructure, this framework improves MPC reliability and robustness across diverse scenarios, extending beyond UAV control to any MPC implementation requiring adaptive performance.
[879] Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy
Kalliopi Kleisarchaki
Main category: cs.AI
TL;DR: A two-layer inference and decision framework for F1 energy strategy under 2026 regulations, using HMM for rival state inference and DQN for deployment policy in a partially observable stochastic game.
Details
Motivation: The 2026 F1 technical regulations create a partially observable stochastic game where optimal energy deployment depends on hidden rival states, requiring new methods beyond single-agent optimization.Method: Two-layer framework: 1) 30-state HMM infers probability distribution over rival’s ERS charge, Override Mode status, and tire degradation from observable telemetry; 2) DQN policy uses HMM belief state to select energy deployment strategies.
Result: HMM achieves 92.3% ERS inference accuracy (vs 33.3% random baseline) and detects counter-harvest trap conditions with 95.7% recall on synthetic races; empirical validation planned for 2026 Australian GP.
Conclusion: The framework successfully addresses the partially observable nature of F1 energy strategy under 2026 regulations, demonstrating that belief-state inference is essential for detecting deceptive strategies like counter-harvest traps.
Abstract: The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver’s own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival’s ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap – a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack – and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model’s own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration – empirical validation begins Australian Grand Prix, 8 March 2026.
[880] ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
Zidi Xiu, David Q. Sun, Kevin Cheng, Maitrik Patel, Josh Date, Yizhe Zhang, Jiarui Lu, Omar Attia, Raviteja Vemulapalli, Oncel Tuzel, Meng Cao, Samy Bengio
Main category: cs.AI
TL;DR: ASTRA-bench is a new benchmark for evaluating AI assistants’ ability to handle personal context, tool use, and multi-step reasoning across time-evolving scenarios.
Details
Motivation: Current AI benchmarks lack realistic personal context and multi-turn interactions, failing to test next-generation AI assistants that need to manage personal data, diverse tools, and complex reasoning over time.Method: Created ASTRA-bench with 2,413 scenarios across four protagonists, using an event-driven pipeline to generate longitudinal life events with referential, functional, and informational complexity annotations.
Result: State-of-the-art models (Claude-4.5-Opus, DeepSeek-V3.2) show significant performance degradation under high-complexity conditions, with argument generation being the primary bottleneck.
Conclusion: Current AI agents struggle to ground reasoning in messy personal context and orchestrate reliable multi-step plans, highlighting the need for better context-aware assistants.
Abstract: Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user intents. Our event-driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state-of-the-art models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) reveals significant performance degradation under high-complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents’ ability to ground reasoning within messy personal context and orchestrate reliable multi-step plans. We release ASTRA-bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context-aware AI assistants.
[881] Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation
Chenxing Wei, Hong Wang, Ying He, Zhongxiang Dai, Bo Jiang, F. Richard Yu, Yao Shu
Main category: cs.AI
TL;DR: ROSA2 is a framework for test-time policy adaptation in multi-turn interactions that jointly optimizes both prompt refinement (words) and model parameter updates (weights) to address both ambiguity and capability issues in LLM interactions.
Details
Motivation: Existing test-time adaptation approaches treat adaptation as a single-axis problem - either refining instructions (prompt engineering) or adjusting weights (test-time training). However, interaction failures stem from a coupled mix of ambiguity (semantic issues) and incapacity (capability gaps), requiring a synergistic approach.Method: ROSA2 reformulates interaction as a joint optimization problem over the heterogeneous space of words and weights. It mathematically decomposes error signals to use textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps, enabling co-adaptation of both aspects.
Result: ROSA2 outperforms state-of-the-art baselines by 30% on MATH benchmark while reducing interaction turns by 40%. Theoretically, it’s proven that this co-adaptation strictly reduces the required parameter shift for convergence.
Conclusion: Refining context (words) unlocks the true potential of parameter updates (weights), demonstrating that semantic clarity acts as a pre-conditioner for effective parameter updates in test-time adaptation for multi-turn LLM interactions.
Abstract: Test-time policy adaptation for multi-turn interactions (T2PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs during inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the required parameter shift for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.
[882] HarmonyCell: Automating Single-Cell Perturbation Modeling under Semantic and Distribution Shifts
Wenxuan Huang, Mingyu Tsoi, Yanhao Huang, Xinjie Mao, Xue Xia, Hao Wu, Jiaqi Wei, Yuejin Yang, Lang Yu, Cheng Tan, Xiang Zhang, Zhangyang Gao, Siqi Sun
Main category: cs.AI
TL;DR: HarmonyCell is an end-to-end agent framework that addresses semantic and statistical heterogeneity in single-cell perturbation studies through LLM-driven semantic unification and adaptive Monte Carlo Tree Search for optimal architecture synthesis.
Details
Motivation: Single-cell perturbation studies face two key heterogeneity bottlenecks: semantic heterogeneity (incompatible metadata schemas across datasets) and statistical heterogeneity (distribution shifts from biological variation requiring dataset-specific inductive biases). Current approaches struggle with these dual challenges.Method: HarmonyCell uses a dual-mechanism approach: (1) an LLM-driven Semantic Unifier that autonomously maps disparate metadata into a canonical interface without manual intervention, and (2) an adaptive Monte Carlo Tree Search engine that operates over a hierarchical action space to synthesize architectures with optimal statistical inductive biases for distribution shifts.
Result: HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general agents) while matching or even exceeding expert-designed baselines in rigorous out-of-distribution evaluations across diverse perturbation tasks.
Conclusion: The dual-track orchestration of semantic unification and statistical adaptation enables scalable automatic virtual cell modeling without dataset-specific engineering, resolving key heterogeneity challenges in single-cell perturbation studies.
Abstract: Single-cell perturbation studies face dual heterogeneity bottlenecks: (i) semantic heterogeneity–identical biological concepts encoded under incompatible metadata schemas across datasets; and (ii) statistical heterogeneity–distribution shifts from biological variation demanding dataset-specific inductive biases. We propose HarmonyCell, an end-to-end agent framework resolving each challenge through a dedicated mechanism: an LLM-driven Semantic Unifier autonomously maps disparate metadata into a canonical interface without manual intervention; and an adaptive Monte Carlo Tree Search engine operates over a hierarchical action space to synthesize architectures with optimal statistical inductive biases for distribution shifts. Evaluated across diverse perturbation tasks under both semantic and distribution shifts, HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general agents) while matching or even exceeding expert-designed baselines in rigorous out-of-distribution evaluations. This dual-track orchestration enables scalable automatic virtual cell modeling without dataset-specific engineering.
[883] MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning
Sicheng Zhu, Jiajun Wang, Jiawei Ai, Xin Li
Main category: cs.AI
TL;DR: MIST-RL uses reinforcement learning to generate high-utility test suites for code verification, reducing redundancy while improving fault detection compared to brute-force approaches.
Details
Motivation: Current LLM-based code verification methods rely on brute-force test generation ("scaling-by-quantity"), which suffers from diminishing returns in fault detection and severe test redundancy. There's a need for more efficient, utility-focused test generation.Method: Proposes MIST-RL framework that formulates test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Uses incremental mutation reward with dynamic penalties to incentivize discovering new faults while suppressing functionally equivalent assertions.
Result: Outperforms state-of-the-art baselines on HumanEval+ and MBPP+: achieves +28.5% higher mutation score while reducing test cases by 19.3%. Compact, high-utility tests improve downstream code reranking accuracy on HumanEval+ by 3.05% over SOTA baseline.
Conclusion: Shifting from “scaling-by-quantity” to “scaling-by-utility” in test generation via RL optimization yields more efficient and effective code verification for LLMs, with practical improvements in both fault detection and downstream tasks.
Abstract: Large Language Models (LLMs) often fail to generate correct code on the first attempt, which requires using generated unit tests as verifiers to validate the solutions. Despite the success of recent verification methods, they remain constrained by a “scaling-by-quantity” paradigm. This brute-force approach suffers from a critical limitation: it yields diminishing returns in fault detection while causing severe test redundancy. To address this, we propose MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning), a framework that shifts the focus to “scaling-by-utility”. We formulate test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Specifically, we introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines. It achieves a +28.5% higher mutation score while reducing the number of test cases by 19.3%. Furthermore, we show that these compact, high-utility tests serve as superior verifiers, which improves downstream code reranking accuracy on HumanEval+ by 3.05% over the SOTA baseline with 10 candidate samples. The source code and data are provided in the supplementary material.
[884] GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning
Yuchen Ying, Weiqi Jiang, Tongya Zheng, Yu Wang, Shunyu Liu, Kaixuan Chen, Mingli Song
Main category: cs.AI
TL;DR: GraphScout: A training-centric agentic graph reasoning framework that enables LLMs to autonomously interact with knowledge graphs to synthesize training data and internalize graph reasoning abilities.
Details
Motivation: Existing GraphRAG methods rely on manually designed guidance and limited predefined tools for graph interaction, constraining graph exploration. Need more flexible graph exploration tools and autonomous reasoning capabilities.Method: Proposes GraphScout framework with flexible graph exploration tools that allows LLMs to autonomously interact with knowledge graphs to synthesize structured training data, which is then used to post-train LLMs to internalize agentic graph reasoning abilities.
Result: Small models (Qwen3-4B) augmented with GraphScout outperform baseline methods on leading LLMs (Qwen-Max) by average 16.7% across five knowledge-graph domains, with significantly fewer inference tokens. Shows robust cross-domain transfer performance.
Conclusion: GraphScout enables effective agentic graph reasoning through autonomous graph exploration and data synthesis, improving factual grounding and reasoning capabilities without manual annotation.
Abstract: Knowledge graphs provide structured and reliable information for many real-world applications, motivating increasing interest in combining large language models (LLMs) with graph-based retrieval to improve factual grounding. Recent Graph-based Retrieval-Augmented Generation (GraphRAG) methods therefore introduce iterative interaction between LLMs and knowledge graphs to enhance reasoning capability. However, existing approaches typically depend on manually designed guidance and interact with knowledge graphs through a limited set of predefined tools, which substantially constrains graph exploration. To address these limitations, we propose GraphScout, a training-centric agentic graph reasoning framework equipped with more flexible graph exploration tools. GraphScout enables models to autonomously interact with knowledge graphs to synthesize structured training data which are then used to post-train LLMs, thereby internalizing agentic graph reasoning ability without laborious manual annotation or task curation. Extensive experiments across five knowledge-graph domains show that a small model (e.g., Qwen3-4B) augmented with GraphScout outperforms baseline methods built on leading LLMs (e.g., Qwen-Max) by an average of 16.7% while requiring significantly fewer inference tokens. Moreover, GraphScout exhibits robust cross-domain transfer performance. Our code will be made publicly available~\footnote{https://github.com/Ying-Yuchen/GraphScout}.
[885] Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents
Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li
Main category: cs.AI
TL;DR: Training-free approach to give Vision-Language Models autonomous search capabilities by merging them with text-based search agents using Optimal Brain Merging algorithm.
Details
Motivation: Existing multi-modal search agents require expensive supervised training or reinforcement learning, leading to high costs, instability, and cold-start problems for standard VLMs.Method: Cross-modal model merging fuses text-based search agents with base VLMs without additional multi-modal training. Uses Optimal Brain Merging (OBM) - a saliency-aware algorithm that identifies task-critical parameters using small calibration samples to mitigate parameter interference.
Result: Model merging provides reasonable zero-shot performance floor, with OBM achieving superior search rates. OBM significantly raises performance ceiling as warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
Conclusion: Training-free paradigm via cross-modal model merging effectively composes multi-modal search capabilities without additional training data, addressing cold-start problems in VLM-based search agents.
Abstract: Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
[886] SciDER: Scientific Data-centric End-to-end Researcher
Ke Lin, Yilin Lu, Shreyas Bhat, Xuehang Guo, Junier Oliva, Qingyun Wang
Main category: cs.AI
TL;DR: SciDER is a data-centric end-to-end system that automates scientific research lifecycle by using specialized AI agents to parse raw experimental data, generate hypotheses, design experiments, and execute code, outperforming general-purpose agents.
Details
Motivation: Existing AI agents struggle to autonomously process raw data from scientific experiments, creating a gap in automated scientific discovery despite advances in large language models for research ideation.Method: Specialized collaborative agents parse and analyze raw scientific data, generate hypotheses and experimental designs based on data characteristics, write and execute corresponding code, using self-evolving memory and critic-led feedback loops.
Result: Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models.
Conclusion: SciDER provides an accessible modular Python package with PyPI distribution and web interface to accelerate autonomous, data-driven research for all researchers and developers.
Abstract: Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data, generate hypotheses and experimental designs grounded in specific data characteristics, and write and execute corresponding code. Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models through its self-evolving memory and critic-led feedback loop. Distributed as a modular Python package, we also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research and aim to be accessible to all researchers and developers.
[887] Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
Kyle Cox, Darius Kianersi, Adrià Garriga-Alonso
Main category: cs.AI
TL;DR: Instruction-tuned LLMs often decide answers before generating chain-of-thought reasoning, making CoT potentially unfaithful for interpretability.
Details
Motivation: To investigate whether chain-of-thought (CoT) reasoning in LLMs is faithful to the underlying decision process or merely post-hoc justification, which has implications for using CoT as an interpretability tool.Method: Used linear probes on residual stream activations at the last token before CoT generation to predict final answers, and performed causal interventions by steering activations along probe directions to test if they influence model decisions.
Result: Probes achieved 0.9 AUC for predicting final answers, and steering along these directions flipped model answers in over 50% of cases. When steering induced incorrect answers, two failure modes emerged: non-entailment (correct premises but unsupported conclusions) and confabulation (fabricating false premises).
Conclusion: Instruction-tuned models often determine answers before generating CoT, making CoT potentially unfaithful for interpretability. While post-hoc reasoning can be useful when models have correct pre-CoT beliefs, it can lead to undesirable behaviors when reasoning from false beliefs.
Abstract: As chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models (LLMs), it has also emerged as a promising tool for interpretability, suggesting the opportunity to understand model decisions through verbalized reasoning. However, the utility of CoT toward interpretability depends upon its faithfulness – whether the model’s stated reasoning reflects the underlying decision process. We provide mechanistic evidence that instruction-tuned models often determine their answer before generating CoT. Training linear probes on residual stream activations at the last token before CoT, we can predict the model’s final answer with 0.9 AUC on most tasks. We find that these directions are not only predictive, but also causal: steering activations along the probe direction flips model answers in over 50% of cases, significantly exceeding orthogonal baselines. When steering induces incorrect answers, we observe two distinct failure modes: non-entailment (stating correct premises but drawing unsupported conclusions) and confabulation (fabricating false premises). While post-hoc reasoning may be instrumentally useful when the model has a correct pre-CoT belief, these failure modes suggest it can result in undesirable behaviors when reasoning from a false belief.
[888] Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning
Shaohuai Liu, Weirui Ye, Yilun Du, Le Xie
Main category: cs.AI
TL;DR: EfficientZero-Multitask (EZ-M) is a model-based reinforcement learning algorithm that scales learning across multiple tasks rather than samples per task, achieving state-of-the-art performance on HumanoidBench with high sample efficiency.
Details
Motivation: Developing generalist robots capable of diverse skills is challenging in embodied AI. While recent approaches scale model parameters and offline datasets, robotics requires active interaction. The paper argues that effective online learning should scale the number of tasks rather than samples per task, leveraging the structural advantage of model-based RL.Method: Proposes EfficientZero-Multitask (EZ-M), a sample-efficient multi-task model-based reinforcement learning algorithm. It uses a shared world model that aggregates multi-task experience to learn robust, task-agnostic representations since physical dynamics are invariant across tasks. This approach avoids gradient interference issues faced by model-free methods.
Result: Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without requiring extreme parameter scaling.
Conclusion: Task scaling is a critical axis for scalable robotic learning. Model-based RL benefits from task diversity as a regularizer, improving dynamics learning and sample efficiency for embodied AI systems.
Abstract: Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the \emph{number of tasks}, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with \textbf{EfficientZero-Multitask (EZ-M)}, a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on \textbf{HumanoidBench}, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available \href{https://yewr.github.io/ez_m/}{here}.
[889] ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning
Congying Liu, Taihao Li, Ming Huang, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Main category: cs.AI
TL;DR: ProtRLSearch: A multi-round protein search agent using multimodal inputs (protein sequences + text) trained with multi-dimensional RL rewards for protein analysis tasks.
Details
Motivation: Current protein search agents are limited to single-round, text-only searches and lack protein sequence modality integration. They also suffer from RL supervision focusing only on final answers, making search process deviations hard to correct.Method: Proposes ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward-based RL that jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high-quality reports.
Result: Created ProtMCQs benchmark with 3,000 multiple choice questions across three difficulty levels to evaluate protein query tasks ranging from sequence-constrained reasoning to comprehensive protein reasoning integrating multi-dimensional features.
Conclusion: The approach addresses limitations of existing protein search agents by enabling multimodal input integration and better search process supervision through multi-dimensional RL rewards.
Abstract: Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only modality search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.
[890] Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents
Haojin Yang, Ai Jian, Xinyue Huang, Yiwei Wang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Jingqing Ruan
Main category: cs.AI
TL;DR: DuCA framework disentangles optimization across time scales for LLMs in industrial sales by separately normalizing turn-level and session-level rewards to balance immediate linguistic constraints with long-term commercial objectives.
Details
Motivation: Conventional RL merges heterogeneous goals (long-term commercial objectives like conversion rate vs immediate linguistic constraints like fluency/compliance) into single reward, causing session-level rewards to overwhelm turn-level signals, leading to unstable training or reward hacking.Method: Proposes Dual-Horizon Credit Assignment (DuCA) with Horizon-Independent Advantage Normalization (HIAN) that separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives.
Result: Outperforms state-of-the-art GRPO baseline with 6.82% relative improvement in conversion rate, 82.28% reduction in inter-sentence repetition, and 27.35% lower identity detection rate, showing substantial improvement for industrial sales scenarios.
Conclusion: DuCA effectively balances strategic performance and naturalistic language generation for LLMs in industrial sales by addressing the time-scale optimization challenge through disentangled reward normalization.
Abstract: Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.
[891] Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study
Emmanuel Aboah Boateng, Kyle MacDonald, Akshad Viswanathan, Sudeep Das
Main category: cs.AI
TL;DR: Agentic Multi-Source Grounded system for marketplace query classification that combines LLMs with catalog retrieval and web search to resolve ambiguous queries, emitting multi-intent sets instead of single labels.
Details
Motivation: Traditional classifiers force single-label assignments for ambiguous queries, while ungrounded LLMs hallucinate unavailable inventory. Multi-category marketplaces need to accurately map context-sparse queries to business categories despite intent ambiguity.Method: Uses an agentic system that grounds LLM inference in: (1) staged catalog entity retrieval pipeline, and (2) autonomous web-search tool for cold-start queries. Emits ordered multi-intent sets resolved by configurable disambiguation layer with business policies.
Result: Achieves +10.9pp over ungrounded LLM baseline and +4.6pp over legacy production system. On long-tail queries: +8.3pp from catalog grounding, +3.2pp from web search grounding, +1.5pp from dual intent disambiguation, reaching 90.7% accuracy (+13.0pp over baseline). Deployed in production serving over 95% of daily search impressions.
Conclusion: Establishes a generalizable paradigm for grounding foundation models in proprietary context and real-time web knowledge to resolve ambiguous, context-sparse decision problems at scale, with decoupled design allowing domain adaptation without core architecture changes.
Abstract: Accurately mapping user queries to business categories is a fundamental Information Retrieval challenge for multi-category marketplaces, where context-sparse queries such as “Wildflower” exhibit intent ambiguity, simultaneously denoting a restaurant chain, a retail product, and a floral item. Traditional classifiers force a winner-takes-all assignment, while general-purpose LLMs hallucinate unavailable inventory. We introduce an Agentic Multi-Source Grounded system that addresses both failure modes by grounding LLM inference in (i) a staged catalog entity retrieval pipeline and (ii) an agentic web-search tool invoked autonomously for cold-start queries. Rather than predicting a single label, the model emits an ordered multi-intent set, resolved by a configurable disambiguation layer that applies deterministic business policies and is designed for extensibility to personalization signals. This decoupled design generalizes across domains, allowing any marketplace to supply its own grounding sources and resolution rules without modifying the core architecture. Evaluated on DoorDash’s multi-vertical search platform, the system achieves +10.9pp over the ungrounded LLM baseline and +4.6pp over the legacy production system. On long-tail queries, incremental ablations attribute +8.3pp to catalog grounding, +3.2pp to agentic web search grounding, and +1.5pp to dual intent disambiguation, yielding 90.7% accuracy (+13.0pp over baseline). The system is deployed in production, serving over 95% of daily search impressions, and establishes a generalizable paradigm for applications requiring foundation models grounded in proprietary context and real-time web knowledge to resolve ambiguous, context-sparse decision problems at scale.
[892] LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning
Chang Yao, Jinghui Qin, Kebing Jin, Hankz Hankui Zhuo
Main category: cs.AI
TL;DR: LLM-driven closed-loop framework for DRL that maps natural language instructions to executable rules and semantically annotates options to improve data efficiency, constraint compliance, and cross-task transferability.
Details
Motivation: DRL suffers from low data efficiency, lack of interpretability, limited cross-environment transferability, and behavioral safety issues when policies are sensitive to environmental changes. Recent research shows LLMs integrated with symbolic planning can address these challenges.Method: Proposes an LLM-driven closed-loop framework that maps natural language instructions into executable rules and semantically annotates automatically created options. Uses LLMs’ general knowledge to facilitate exploration efficiency and adapt transferable options for similar environments while providing inherent interpretability through semantic annotations.
Result: Experiments on Office World and Montezuma’s Revenge domains demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability compared to baseline approaches.
Conclusion: The LLM-driven framework effectively addresses key DRL limitations by leveraging LLMs’ semantic understanding and planning capabilities, enabling safer, more interpretable, and transferable reinforcement learning systems.
Abstract: Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) is still suffering from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. However, the learned policy generating actions based on states are sensitive to the environmental changes, struggling to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising in addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to facilitate exploration efficiency and adapt to transferable options for similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma’s Revenge, respectively. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.
[893] Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification
Jiayang Wu, Jiale Zhou, Xingyi Zhang, Xun Lin, Tianxu Lv, Leong Hou U, Rubo Wang, Yefeng Zheng
Main category: cs.AI
TL;DR: MERA is a retrieval-augmented multimodal framework for protein active site identification that uses hierarchical multi-expert retrieval and reliability-aware fusion based on Dempster-Shafer theory to address sparse training data and modality reliability issues.
Details
Motivation: Current protein active site identification methods suffer from vulnerability in single-instance prediction due to sparse training data and inadequate modality reliability estimation, which causes performance degradation when unreliable modalities dominate fusion processes.Method: MERA employs hierarchical multi-expert retrieval that dynamically aggregates contextual information from chain, sequence, and active-site perspectives through residue-level mixture-of-experts gating. It uses a reliability-aware fusion strategy based on Dempster-Shafer evidence theory that quantifies modality trustworthiness through belief mass functions and learnable discounting coefficients.
Result: Extensive experiments on ProTAD-Gen and TS125 datasets show MERA achieves state-of-the-art performance with 90% AUPRC on active site prediction and significant gains on peptide-binding site identification.
Conclusion: MERA validates the effectiveness of retrieval-augmented multi-expert modeling and reliability-guided fusion for protein active site identification, addressing key challenges in the field.
Abstract: Accurate identification of protein active sites at the residue level is crucial for understanding protein function and advancing drug discovery. However, current methods face two critical challenges: vulnerability in single-instance prediction due to sparse training data, and inadequate modality reliability estimation that leads to performance degradation when unreliable modalities dominate fusion processes. To address these challenges, we introduce Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA), the first retrieval-augmented framework for protein active site identification. MERA employs hierarchical multi-expert retrieval that dynamically aggregates contextual information from chain, sequence, and active-site perspectives through residue-level mixture-of-experts gating. To prevent modality degradation, we propose a reliability-aware fusion strategy based on Dempster-Shafer evidence theory that quantifies modality trustworthiness through belief mass functions and learnable discounting coefficients, enabling principled multimodal integration. Extensive experiments on ProTAD-Gen and TS125 datasets demonstrate that MERA achieves state-of-the-art performance, with 90% AUPRC on active site prediction and significant gains on peptide-binding site identification, validating the effectiveness of retrieval-augmented multi-expert modeling and reliability-guided fusion.
[894] Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?
Youssef Abo-Dahab, Ruby Hernandez, Ismael Caleb Arechiga Duran
Main category: cs.AI
TL;DR: Drug repurposing study finds that knowledge graph topology and protein features outperform explicit chemical structure representations, with data volume being more important than model size.
Details
Motivation: To quantify the contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing under rigorous temporal validation, addressing gaps in understanding what drives performance in pharmacological prediction tasks.Method: Constructed pharmacology knowledge graph from ChEMBL 36 with strict temporal split (training up to 2022, testing 2023-2025). Benchmarked KG embedding models and GNN with graph attention encoder for drug structure and ESM-2 protein embeddings. Conducted scaling experiments (0.78-9.75M parameters, 25-100% data) and feature ablation studies.
Result: Removing drug structure encoder improved performance (PR-AUC from 0.5631 to 0.5785) while reducing VRAM usage dramatically (5.30GB to 353MB). Morgan fingerprints degraded performance further. Model size beyond 2.44M parameters had diminishing returns, while more training data consistently helped. External validation confirmed 6/14 novel predictions.
Conclusion: Drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations, with data volume being more critical than model complexity.
Abstract: The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM-2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph attention based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.
[895] Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents
Neeraj Bholani
Main category: cs.AI
TL;DR: Self-Healing Router: A fault-tolerant orchestration architecture for LLM agents that uses deterministic routing instead of LLM reasoning for most control-flow decisions, enabling automatic recovery from tool failures without LLM intervention.
Details
Motivation: Tool-using LLM agents face a reliability-cost tradeoff: using LLM for every decision is expensive and slow, while pre-coded workflows are brittle to compound tool failures. Need a system that maintains correctness while reducing LLM usage and handling failures automatically.Method: Combines parallel health monitors that score runtime conditions (tool outages, risk signals) with a cost-weighted tool graph where Dijkstra’s algorithm performs deterministic shortest-path routing. When a tool fails, its edges are reweighted to infinity and path is recomputed for automatic recovery without LLM. LLM is only used when no feasible path exists.
Result: Across 19 scenarios spanning three graph topologies, Self-Healing Router matches ReAct’s correctness while reducing control-plane LLM calls by 93% (9 vs 123 aggregate) and eliminating silent-failure cases observed in static workflow baselines under compound failures.
Conclusion: The system provides runtime fault tolerance with deterministic recovery and binary observability - every failure is either a logged reroute or explicit escalation, never a silent skip. Enables reliable tool-use agents with dramatically reduced LLM usage.
Abstract: Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that treats most agent control-flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost-weighted tool graph where Dijkstra’s algorithm performs deterministic shortest-path routing. When a tool fails mid-execution, its edges are reweighted to infinity and the path is recomputed – yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists, enabling goal demotion or escalation. Prior graph-based tool-use systems (ControlLLM, ToolNet, NaviAgent) focus on tool selection and planning; our contribution is runtime fault tolerance with deterministic recovery and binary observability – every failure is either a logged reroute or an explicit escalation, never a silent skip. Across 19 scenarios spanning three graph topologies (linear pipeline, dependency DAG, parallel fan-out), Self-Healing Router matches ReAct’s correctness while reducing control-plane LLM calls by 93% (9 vs 123 aggregate) and eliminating the silent-failure cases observed in a well-engineered static workflow baseline under compound failures.
[896] State-Action Inpainting Diffuser for Continuous Control with Delay
Dongqi Han, Wei Wang, Enze Zhang, Dongsheng Li
Main category: cs.AI
TL;DR: SAID bridges model-based and model-free RL for delayed control by formulating it as a joint sequence inpainting task using diffusion models
Details
Motivation: Signal delay creates temporal gaps between interaction and perception in continuous control/RL, requiring solutions that handle this fundamental challenge. Current approaches are either model-free (state augmentation) or model-based (latent belief inference), but lack integration.Method: State-Action Inpainting Diffuser (SAID) integrates dynamics learning with policy optimization by formulating delayed RL as a joint sequence inpainting task using diffusion models. It implicitly captures environmental dynamics while directly generating consistent plans.
Result: Extensive experiments on delayed continuous control benchmarks show SAID achieves state-of-the-art and robust performance, working for both online and offline RL.
Conclusion: SAID bridges model-based and model-free paradigms for delayed RL, suggesting a new methodology to advance RL with delay through generative sequence inpainting.
Abstract: Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches which utilize state augmentation to preserve Markovian properties, and model-based methods which focus on inferring latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be seamlessly applied to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology to advance the field of RL with delay.
[897] S5-HES Agent: Society 5.0-driven Agentic Framework to Democratize Smart Home Environment Simulation
Akila Siriweera, Janani Rangila, Keitaro Naruse, Incheon Paik, Isuru Jayanada
Main category: cs.AI
TL;DR: S5-HES Agent: An AI-driven smart home simulation framework using LLMs for natural-language configuration and RAG for knowledge retrieval, enabling accessible research aligned with Society 5.0 objectives.
Details
Motivation: Existing smart home simulators require technical expertise, lack adaptability, and hinder diverse research needed for Society 5.0's human-centered vision. There's a need for democratized, accessible simulation tools.Method: Agentic framework with interchangeable LLMs for natural-language-driven simulation configuration. Uses RAG pipeline with semantic, keyword, and hybrid search to retrieve smart home knowledge. Coordinates specialized agents for autonomous orchestration.
Result: RAG pipeline achieves near-optimal retrieval fidelity, simulated device behavior aligns with real-world IoT datasets, and simulation engine scales predictably across home configurations.
Conclusion: S5-HES Agent provides a stable, accessible foundation for Society 5.0 smart home research by democratizing simulation through AI orchestration and natural-language interfaces.
Abstract: The smart home is a key domain within the Society 5.0 vision for a human-centered society. Smart home technologies rapidly evolve, and research should diversify while remaining aligned with Society 5.0 objectives. Democratizing smart home research would engage a broader community of innovators beyond traditional limited experts. This shift necessitates inclusive simulation frameworks that support research across diverse fields in industry and academia. However, existing smart home simulators require significant technical expertise, offer limited adaptability, and lack automated evolution, thereby failing to meet the holistic needs of Society 5.0. These constraints impede researchers from efficiently conducting simulations and experiments for security, energy, health, climate, and socio-economic research. To address these challenges, this paper presents the Society 5.0-driven Smart Home Environment Simulator Agent (S5-HES Agent), an agentic simulation framework that transforms traditional smart home simulation through autonomous AI orchestration. The framework coordinates specialized agents through interchangeable large language models (LLMs), enabling natural-language-driven end-to-end smart home simulation configuration without programming expertise. A retrieval-augmented generation (RAG) pipeline with semantic, keyword, and hybrid search retrieves smart home knowledge. Comprehensive evaluation on S5-HES Agent demonstrates that the RAG pipeline achieves near-optimal retrieval fidelity, simulated device behaviour and threat scenarios align with real-world IoT datasets, and simulation engine scales predictably across home configurations, establishing a stable foundation for Society 5.0 smart home research. Source code is available under the MIT License at https://github.com/AsiriweLab/S5-HES-Agent.
[898] Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi, Shaun Tan, Jieru Li, Benoit Marteau, Brad Willingham, May Wang
Main category: cs.AI
TL;DR: Paper introduces event-based evaluation framework for clinical time-series summarization, revealing gap between semantic similarity metrics and clinical event fidelity, with vision-based approach showing best performance.
Details
Motivation: Current LLM-generated clinical summaries of remote therapeutic monitoring time series lack evaluation of whether they faithfully capture clinically significant events like sustained abnormalities. Existing metrics focus on semantic similarity and linguistic quality, leaving event-level correctness unmeasured.Method: Introduces event-based evaluation framework using TIHM-1.5 dementia monitoring dataset. Clinically grounded daily events derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries aligned with structured facts. Measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. Benchmarks three approaches: zero-shot prompting, statistical prompting, and vision-based pipeline using rendered time-series visualizations.
Result: Reveals decoupling between conventional metrics and clinical event fidelity. Models with high semantic similarity scores often exhibit near-zero abnormality recall. Vision-based approach demonstrates strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall.
Conclusion: Event-aware evaluation is crucial for reliable clinical time-series summarization. Vision-based approaches show promise for better capturing clinically significant events compared to text-only methods.
Abstract: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.
[899] RubricBench: Aligning Model-Generated Rubrics with Human Standards
Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma
Main category: cs.AI
TL;DR: RubricBench is a new benchmark for evaluating rubric-based reward models in LLM alignment, featuring 1,147 expert-annotated pairwise comparisons with atomic rubrics to assess evaluation reliability.
Details
Motivation: As LLM alignment evolves toward complex generation tasks, reward models are shifting to rubric-guided evaluation to mitigate surface-level biases. However, there's no unified benchmark to assess this paradigm, as existing benchmarks lack both discriminative complexity and ground-truth rubric annotations needed for rigorous analysis.Method: The authors introduce RubricBench with 1,147 pairwise comparisons designed to assess rubric-based evaluation reliability. They use a multi-dimensional filtration pipeline to target hard samples with nuanced input complexity and misleading surface bias, augmenting each with expert-annotated atomic rubrics derived strictly from instructions.
Result: Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
Conclusion: RubricBench addresses the need for a unified benchmark to assess rubric-based evaluation in LLM alignment, highlighting significant challenges in automated rubric generation and the importance of human expertise in defining evaluation criteria.
Abstract: As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
[900] Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
Main category: cs.AI
TL;DR: Mix-GRM introduces structured reasoning mechanisms (Breadth-CoT and Depth-CoT) into Generative Reward Models, optimizing them through SFT and RLVR to improve evaluation reliability across different task types.
Details
Motivation: Current GRMs use unstructured length scaling for CoT reasoning, ignoring the different effectiveness of breadth vs depth reasoning mechanisms for different task types (subjective preference vs objective correctness).Method: Reconfigures raw rationales into structured B-CoT (multi-dimensional principle coverage) and D-CoT (substantive judgment soundness) through modular synthesis pipeline, then uses SFT and RLVR to internalize and optimize these mechanisms.
Result: Achieves new SOTA across 5 benchmarks, surpassing leading open-source RMs by average 8.2%. Shows B-CoT benefits subjective preference tasks while D-CoT excels in objective correctness tasks, with RLVR inducing emergent polarization of reasoning styles.
Conclusion: Structured reasoning mechanisms (B-CoT vs D-CoT) significantly improve GRM performance, with task-reasoning alignment being crucial. RLVR enables models to spontaneously adapt reasoning styles to task demands.
Abstract: Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at \href{https://huggingface.co/collections/DonJoey/mix-grm}{Hugging Face}, and the code is released at \href{https://github.com/Don-Joey/Mix-GRM}{Github}.
[901] CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu
Main category: cs.AI
TL;DR: CARE introduces an evidence-grounded agentic framework for medical visual question answering that decomposes reasoning into specialized modules to improve accuracy and clinical accountability.
Details
Motivation: Current large visual language models (VLMs) for medical reasoning operate as black boxes, diverging from clinicians' evidence-based workflows and lacking accountability. There's a need for systems that provide explicit evidence and align with clinical decision-making processes.Method: CARE decomposes medical reasoning into coordinated sub-modules: 1) a compact VLM proposes relevant medical entities, 2) an expert entity-referring segmentation model produces pixel-level ROI evidence, 3) a grounded VLM reasons over the full image augmented by ROI hints, and 4) a VLM coordinator plans tool invocation and reviews evidence-answer consistency. The system uses reinforcement learning with verifiable rewards to align answers with supporting evidence.
Result: CARE-Flow (coordinator-free) improves average accuracy by 10.9% over same-size (10B) state-of-the-art models. With dynamic planning and answer review (CARE-Coord), there’s a further gain, outperforming heavily pre-trained SOTA by 5.2%.
Conclusion: An agentic framework that emulates clinical workflows with decoupled specialized models and explicit evidence yields more accurate and accountable medical AI, addressing key limitations of end-to-end black box approaches.
Abstract: Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians’ evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same size (10B) state-of-the-art (SOTA). With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
[902] Evaluating and Understanding Scheming Propensity in LLM Agents
Mia Hopman, Jannes Elstner, Maria Avramidou, Amritanshu Prasad, David Lindner
Main category: cs.AI
TL;DR: Study examines when AI agents develop “scheming” behavior (covertly pursuing misaligned goals) in realistic scenarios, finding minimal scheming despite incentives, with behavior being brittle to environmental changes.
Details
Motivation: As language models are deployed as autonomous agents for complex tasks, there's risk of "scheming" - agents covertly pursuing misaligned goals. While prior work showed capability, propensity in realistic scenarios remains underexplored.Method: Decomposes scheming incentives into agent and environmental factors, develops realistic settings with scheming opportunities for instrumentally convergent goals (self-preservation, resource acquisition, goal-guarding). Tests with adversarially-designed prompts and examines brittleness by removing tools or increasing oversight.
Result: Found only minimal scheming despite high environmental incentives, not due to evaluation awareness. Adversarial prompts can induce high scheming (59%), but real agent scaffolds rarely do. Scheming behavior is brittle: removing a single tool drops scheming from 59% to 3%, and increased oversight can raise scheming by up to 25%.
Conclusion: The incentive decomposition enables systematic measurement of scheming propensity in deployment-relevant settings, which is necessary as agents handle increasingly consequential tasks. Scheming appears less common than feared but can be brittle and paradoxically increased by oversight.
Abstract: As frontier language models are increasingly deployed as autonomous agents pursuing complex, long-term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To understand when agents scheme, we decompose scheming incentives into agent factors and environmental factors. We develop realistic settings allowing us to systematically vary these factors, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding. We find only minimal instances of scheming despite high environmental incentives, and show this is unlikely due to evaluation awareness. While inserting adversarially-designed prompt snippets that encourage agency and goal-directedness into an agent’s system prompt can induce high scheming rates, snippets used in real agent scaffolds rarely do. Surprisingly, in model organisms (Hubinger et al., 2023) built with these snippets, scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%. Our incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks.
[903] ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents
Pengbo Liu
Main category: cs.AI
TL;DR: ToolRLA: A three-stage post-training pipeline with fine-grained reward decomposition for aligning tool-integrated reasoning agents in domain-specific deployments, achieving significant improvements in financial advisory copilot.
Details
Motivation: Existing reinforcement learning for tool-integrated agents uses coarse binary rewards (success/failure) that insufficiently guide nuanced tool invocation in production, especially for high-stakes domain-specific deployment.Method: Three-stage post-training pipeline: Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization. Core innovation is fine-grained reward function with multiplicative correctness decomposition evaluating tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance.
Result: Deployed on real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ APIs): 47% higher end-to-end task completion (62% to 91%), 63% lower tool invocation error (38% to 14%), 93% lower regulatory violation (12% to 0.8%), sub-2-second latency after three months. Fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards.
Conclusion: ToolRLA’s fine-grained reward decomposition and three-stage pipeline effectively align tool-integrated agents for domain-specific deployment, with demonstrated success in financial advisory and validated generalizability on benchmark datasets.
Abstract: Tool-integrated reasoning agents interleaving natural language deliberation with external API calls show promise for complex multi-step tasks. However, aligning such agents for high-stakes domain-specific deployment is challenging, as existing reinforcement learning uses coarse binary rewards (success/failure) that insufficiently guide nuanced tool invocation in production. We present ToolRLA, a three-stage post-training pipeline (Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization) for domain-specific tool-integrated agents. Its core is a fine-grained reward function with multiplicative correctness decomposition, evaluating tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance. Multiplicative composition prioritizes correct tool selection (a prerequisite for meaningful parameter evaluation), while a large negative compliance penalty (λ=10) ensures regulatory adherence. Deployed on a real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ heterogeneous APIs), ToolRLA achieves 47% higher end-to-end task completion (62% to 91%), 63% lower tool invocation error (38% to 14%), 93% lower regulatory violation (12% to 0.8%), and sub-2-second latency after three months. Ablation studies confirm fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards; generalizability is validated on ToolBench and API-Bank.
[904] SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing
Anjali Parashar, Yingke Li, Eric Yang Yu, Fei Chen, James Neidhoefer, Devesh Upadhyay, Chuchu Fan
Main category: cs.AI
TL;DR: SEED-SET is a Bayesian experimental design framework for ethical benchmarking of autonomous systems that combines objective metrics with subjective stakeholder preferences to generate optimal test scenarios.
Details
Motivation: Autonomous systems in human-centric domains require ethical evaluation, but current approaches lack standardized metrics and struggle with stakeholder subjectivity. There's a need for automated ethical benchmarking that can incorporate both objective measurements and subjective value judgments.Method: SEED-SET uses hierarchical Gaussian Processes to separately model objective evaluations and subjective stakeholder judgments. It employs a novel acquisition strategy to propose test candidates that align with both qualitative preferences and objectives, balancing exploration and exploitation.
Result: The method generates up to 2× optimal test candidates compared to baselines, with 1.25× improvement in coverage of high-dimensional search spaces. It performs best in ethical benchmarking of autonomous agents across two applications.
Conclusion: SEED-SET provides an interpretable and efficient framework for ethical benchmarking that effectively incorporates both objective metrics and subjective stakeholder preferences, addressing the challenges of evaluating autonomous systems in human-centric domains.
Abstract: As autonomous systems such as drones, become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate the ethical alignment since failure to do so imposes imminent danger to human lives, and long term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined metrics for evaluation, and stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations, and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes, and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with the stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find our method to perform the best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, by generating up to $2\times$ optimal test candidates compared to baselines, with $1.25\times$ improvement in coverage of high dimensional search spaces.
[905] Learning Structured Reasoning via Tractable Trajectory Control
Po-Nien Kung, Zhen Yang, Jeffrey Luo, Cheng-Fu Yang, Haikang Deng, Zi-Yi Dou, Yinfei Yang, Nanyun Peng, Zhe Gan, Kai-Wei Chang
Main category: cs.AI
TL;DR: Ctrl-R is a framework for discovering and reinforcing diverse reasoning patterns in large language models through structured reasoning and trajectory control during RL training.
Details
Motivation: Large language models exhibit emergent reasoning behaviors but complex reasoning trajectories remain sparse in unconstrained sampling. Standard RL often fails to guarantee acquisition of diverse reasoning patterns needed for complex problem-solving.Method: Proposes Ctrl-R framework with structured reasoning paradigm that actively guides rollout process to explore diverse reasoning patterns. Uses tractable trajectory control and importance-sampling estimation with power-scaling factor to selectively learn from exploratory trajectories while maintaining stable optimization.
Result: Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.
Conclusion: The framework successfully discovers and reinforces diverse reasoning patterns through structured reasoning and controlled exploration, enhancing model performance on complex reasoning tasks.
Abstract: Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., “wait,” indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.
[906] CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development
Yuhang Yang, Ruikang Li, Jifei Ma, Kai Zhang, Qi Liu, Jianyu Han, Yonggan Bu, Jibin Zhou, Defu Lian, Xin Li, Enhong Chen
Main category: cs.AI
TL;DR: CeProAgents is a hierarchical multi-agent system that automates chemical process development through specialized knowledge, concept, and parameter agents working collaboratively.
Details
Motivation: Chemical process development is complex and requires integration of specialized knowledge, conceptual design, and parametric simulation. The paper aims to automate this multi-faceted process using AI agents.Method: Proposes CeProAgents with three specialized agent cohorts (knowledge, concept, parameter) using hybrid architecture combining dynamic agent chatgroups with structured workflows. Also creates CeProBench benchmark with six task types across three chemical engineering pillars.
Result: The system demonstrates effectiveness and superiority in chemical process development, revealing both the transformative potential and current boundaries of LLMs for industrial chemical engineering.
Conclusion: CeProAgents successfully automates chemical process development through collaborative multi-agent systems, showing promise for industrial applications while highlighting current LLM limitations.
Abstract: The development of chemical processes, a cornerstone of chemical engineering, presents formidable challenges due to its multi-faceted nature, integrating specialized knowledge, conceptual design, and parametric simulation. Capitalizing on this, we propose CeProAgents, a hierarchical multi-agent system designed to automate the development of chemical process through collaborative division of labor. Our architecture comprises three specialized agent cohorts focused on knowledge, concept, and parameter respectively. To effectively adapt to the inherent complexity of chemical tasks, each cohort employs a novel hybrid architecture that integrates dynamic agent chatgroups with structured agentic workflows. To rigorously evaluate the system, we establish CeProBench, a multi-dimensional benchmark structured around three core pillars of chemical engineering. We design six distinct types of tasks across these dimensions to holistically assess the comprehensive capabilities of the system in chemical process development. The results not only confirm the effectiveness and superiority of our proposed approach but also reveal the transformative potential as well as the current boundaries of Large Language Models (LLMs) for industrial chemical engineering.
[907] Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs
Shuangchun Gui, Suyu Liu, Xuehe Wang, Zhiguang Cao
Main category: cs.AI
TL;DR: CCL is a reinforcement learning framework for multi-task vehicle routing problems that captures evolving context and node dynamics through step-wise contextual information and trajectory-shared node re-embedding.
Details
Motivation: Existing VRP solvers using unified RL frameworks overlook constraint and node dynamics during decision-making, failing to accurately react to current context, which limits their performance on diverse VRP variants.Method: Proposes Chain-of-Context Learning (CCL) with two modules: 1) Relevance-Guided Context Reformulation (RGCR) that adaptively prioritizes salient constraints to construct step-wise contextual information, and 2) Trajectory-Shared Node Re-embedding (TSNR) that aggregates shared node features from all trajectories’ contexts to update inputs for the next step.
Result: CCL achieves best performance on all 16 in-distribution VRP tasks and majority of 32 out-of-distribution tasks with unseen constraints, outperforming state-of-the-art baselines across 48 diverse VRP variants.
Conclusion: CCL effectively captures evolving preferences of RL agents and step-by-step dependencies in sequential decision-making for multi-task VRPs by modeling context and node dynamics.
Abstract: Multi-task Vehicle Routing Problems (VRPs) aim to minimize routing costs while satisfying diverse constraints. Existing solvers typically adopt a unified reinforcement learning (RL) framework to learn generalizable patterns across tasks. However, they often overlook the constraint and node dynamics during the decision process, making the model fail to accurately react to the current context. To address this limitation, we propose Chain-of-Context Learning (CCL), a novel framework that progressively captures the evolving context to guide fine-grained node adaptation. Specifically, CCL constructs step-wise contextual information via a Relevance-Guided Context Reformulation (RGCR) module, which adaptively prioritizes salient constraints. This context then guides node updates through a Trajectory-Shared Node Re-embedding (TSNR) module, which aggregates shared node features from all trajectories’ contexts and uses them to update inputs for the next step. By modeling evolving preferences of the RL agent, CCL captures step-by-step dependencies in sequential decision-making. We evaluate CCL on 48 diverse VRP variants, including 16 in-distribution and 32 out-of-distribution (with unseen constraints) tasks. Experimental results show that CCL performs favorably against the state-of-the-art baselines, achieving the best performance on all in-distribution tasks and the majority of out-of-distribution tasks.
[908] FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents
Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian
Main category: cs.AI
TL;DR: FT-Dojo environment and FT-Agent system for autonomous LLM fine-tuning across domains using evaluation-driven feedback
Details
Motivation: Current LLM fine-tuning for vertical domains is labor-intensive and expensive, requiring domain experts for data curation, training configuration, and iterative diagnosis. The paper explores whether LLM-based agents can automate this complete end-to-end process.Method: Introduces FT-Dojo (13 tasks across 5 domains) as an interactive environment and develops FT-Agent, an autonomous system that mirrors human experts by using evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies.
Result: FT-Agent significantly outperforms general-purpose alternatives, achieving best performance on 10/13 tasks across all five domains. The approach generalizes to 3B models, with insights on data scaling trade-offs and backbone sensitivity. Agents can recover from failures through cumulative learning.
Conclusion: Demonstrates promise of autonomous LLM fine-tuning while exposing limitations in causal reasoning, highlighting both potential and current boundaries of the approach.
Abstract: Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs–an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning–highlighting both the promise and current boundaries of autonomous LLM fine-tuning.
[909] GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules
Houde Dong, Yifei She, Kai Ye, Liangcai Su, Chenxiong Qian, Jie Hao
Main category: cs.AI
TL;DR: The paper examines limitations of current AI systems for content moderation when dealing with co-occurring violations and dynamic, context-dependent moderation rules, questioning whether static benchmarks adequately measure real-world generalization.
Details
Motivation: Current AI content moderation systems struggle with real-world complexities where posts may violate multiple policies simultaneously and moderation rules are dynamic and context-dependent, leading to inconsistent enforcement.Method: The paper appears to analyze the gap between static benchmark performance and real-world generalization, likely examining how LLMs handle co-occurring violations and dynamic rules through case studies or experimental evaluation.
Result: The analysis suggests that while LLMs are good at following fixed guidelines, their judgment capabilities degrade when policies are unstable or context-dependent, leading to inconsistent moderation outcomes.
Conclusion: High performance on static benchmarks does not guarantee robust generalization to real-world content moderation scenarios involving co-occurring violations and dynamically changing rules, highlighting a core limitation of current AI systems.
Abstract: Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment using national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic rules of moderation, where determination of a violation depends on platform-specific guidelines that evolve across contexts . The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment capabilities degrade when policies are unstable or context-dependent . In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online . This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?
[910] GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation
Yifan Wang, Mingxuan Jiang, Zhihao Sun, Yixin Cao, Yicun Liu, Keyang Chen, Guangnan Ye, Hongfeng Chai
Main category: cs.AI
TL;DR: GAM-RAG is a training-free framework that accumulates retrieval experience from recurring queries to build a dynamic, hierarchical index that updates based on successful retrieval episodes, improving performance while reducing inference costs.
Details
Motivation: Traditional RAG systems use static indices that repeat similar multi-hop traversal for related queries, increasing latency and compute. The authors aim to create a system that learns from retrieval experience over time to improve efficiency and performance.Method: GAM-RAG builds a lightweight, relation-free hierarchical index capturing potential co-occurrence rather than fixed semantic relations. It uses sentence-level feedback from successful retrieval episodes to update sentence memories, with an uncertainty-aware Kalman-inspired gain rule that balances stability and adaptability under noisy feedback.
Result: GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%.
Conclusion: The framework demonstrates that accumulating retrieval experience from recurring queries can significantly improve RAG performance and efficiency without requiring model training, offering a practical solution for dynamic information retrieval systems.
Abstract: Retrieval-Augmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by schema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, Kalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%. Our code and datasets are available at: https://anonymous.4open.science/r/GAM_RAG-2EF6.
[911] Incremental, inconsistency-resilient reasoning over Description Logic Abox streams
Cas Proost, Pieter Bonte
Main category: cs.AI
TL;DR: Incremental reasoning over streaming Description Logic ABoxes with sliding windows and inconsistency repair semantics for OWL2 RL.
Details
Motivation: Address challenges in stream reasoning: high data velocity, real-time requirements, and noisy/volatile nature of data streams by developing incremental reasoning techniques.Method: Proposes novel semantics for incremental reasoning over sliding windows on streams, allowing incremental computation of window materialization based on previous windows. Introduces inconsistency repair semantics based on preferred repair approaches. Details semi-naive algorithms for incremental materialization maintenance in OWL2 RL.
Result: Developed framework for handling streaming Description Logic ABoxes with incremental materialization and inconsistency repair capabilities for OWL2 RL.
Conclusion: The proposed approach effectively addresses stream reasoning challenges through incremental computation and repair semantics, enabling real-time reasoning over volatile data streams.
Abstract: More and more, data is being produced in a streaming fashion. This has led to increased interest into how actionable insights can be extracted in real time from data streams through Stream Reasoning. Reasoning over data streams raises multiple challenges, notably the high velocity of data, the real time requirement of the reasoning, and the noisy and volatile nature of streams. This paper proposes novel semantics for incremental reasoning over streams of Description Logic ABoxes, in order to tackle these challenges. To address the first two challenges, our semantics for reasoning over sliding windows on streams allow for incrementally computing the materialization of the window based on the materialization of the previous window. Furthermore, to deal with the volatile nature of streams, we present novel semantics for inconsistency repair on such windows, based on preferred repair semantics. We then detail our proposed semi-naive algorithms for incremental materialization maintenance in the case of OWL2 RL, both in the presence of inconsistencies and without.
[912] What Papers Don’t Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction
Lehui Li, Ruining Wang, Haochen Song, Yaoxin Mao, Tong Zhang, Yuyao Wang, Jiayi Fan, Yitong Zhang, Jieping Ye, Chengqi Zhang, Yongshun Gong
Main category: cs.AI
TL;DR: A graph-based agent framework called \method that recovers three types of tacit knowledge (relational, somatic, collective) from academic papers to automate code reproduction, achieving 10.04% performance gap against official implementations.
Details
Motivation: Automated paper reproduction is bottlenecked by tacit knowledge that papers inevitably leave implicit, not by information retrieval. The challenge is to recover the unstated implementation details, debugging knowledge, and collective patterns that researchers need to reproduce papers.Method: Proposes \method, a graph-based agent framework with three dedicated mechanisms: 1) node-level relation-aware aggregation for relational knowledge (analyzing implementation-unit-level reuse between papers), 2) execution-feedback refinement for somatic knowledge (iterative debugging using runtime signals), and 3) graph-level knowledge induction for collective knowledge (distilling patterns from clusters of similar papers).
Result: On ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, \method achieves an average performance gap of 10.04% against official implementations, improving over the strongest baseline by 24.68%.
Conclusion: The framework successfully recovers tacit knowledge for automated paper reproduction, demonstrating significant improvements over existing methods and providing a systematic approach to address the implicit knowledge problem in research reproducibility.
Abstract: Automated paper reproduction – generating executable code from academic papers – is bottlenecked not by information retrieval but by the tacit knowledge that papers inevitably leave implicit. We formalize this challenge as the progressive recovery of three types of tacit knowledge – relational, somatic, and collective – and propose \method, a graph-based agent framework with a dedicated mechanism for each: node-level relation-aware aggregation recovers relational knowledge by analyzing implementation-unit-level reuse and adaptation relationships between the target paper and its citation neighbors; execution-feedback refinement recovers somatic knowledge through iterative debugging driven by runtime signals; and graph-level knowledge induction distills collective knowledge from clusters of papers sharing similar implementations. On an extended ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, \method{} achieves an average performance gap of 10.04% against official implementations, improving over the strongest baseline by 24.68%. The code will be publicly released upon acceptance; the repository link will be provided in the final version.
[913] Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models
Eric Lacosse, Mariana Duarte, Peter M. Todd, Daniel C. McNamee
Main category: cs.AI
TL;DR: Applying mechanistic interpretability to study semantic memory foraging in LLMs using the Semantic Fluency Task, revealing convergent/divergent search patterns similar to humans across model layers.
Details
Motivation: To bring greater rigor to studying semantic memory foraging in LLMs by applying mechanistic interpretability techniques, using the Semantic Fluency Task as a case study to understand how LLMs access and generate from semantic memory stores.Method: Uses mechanistic interpretability techniques to analyze LLMs performing the Semantic Fluency Task (SFT), examining convergent and divergent patterns of generative memory search across different model layers.
Result: Shows that behavioral signatures critical to human performance on SFT (convergent and divergent search patterns) emerge as identifiable patterns in LLMs across distinct layers.
Conclusion: This analysis provides insights into how LLMs may be adapted for closer cognitive alignment with humans or guided toward productive cognitive disalignment to enhance complementary strengths in human-AI interaction.
Abstract: Both humans and Large Language Models (LLMs) store a vast repository of semantic memories. In humans, efficient and strategic access to this memory store is a critical foundation for a variety of cognitive functions. Such access has long been a focus of psychology and the computational mechanisms behind it are now well characterized. Much of this understanding has been gleaned from a widely-used neuropsychological and cognitive science assessment called the Semantic Fluency Task (SFT), which requires the generation of as many semantically constrained concepts as possible. Our goal is to apply mechanistic interpretability techniques to bring greater rigor to the study of semantic memory foraging in LLMs. To this end, we present preliminary results examining SFT as a case study. A central focus is on convergent and divergent patterns of generative memory search, which in humans play complementary strategic roles in efficient memory foraging. We show that these same behavioral signatures, critical to human performance on the SFT, also emerge as identifiable patterns in LLMs across distinct layers. Potentially, this analysis provides new insights into how LLMs may be adapted into closer cognitive alignment with humans, or alternatively, guided toward productive cognitive \emph{disalignment} to enhance complementary strengths in human-AI interaction.
[914] CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
Jinpeng Chen, Cheng Gong, Hanbo Li, Ziru Liu, Zichen Tian, Xinyu Fu, Shi Wu, Chenyang Zhang, Wu Zhang, Suiyun Zhang, Dandan Tu, Rui Liu
Main category: cs.AI
TL;DR: CoVe is a constraint-verification framework for synthesizing high-quality training data for interactive tool-use agents, enabling both supervised fine-tuning and reinforcement learning through constraint-guided trajectory generation and verification.
Details
Motivation: Real-world user needs are complex and ambiguous, but agents must execute deterministic actions to satisfy them. There's a gap in developing multi-turn interactive tool-use agents that can handle this complexity while ensuring correctness.Method: CoVe defines explicit task constraints that guide generation of complex trajectories and serve as deterministic verifiers for assessing trajectory quality. This enables creation of high-quality training data for SFT and accurate reward signals for RL.
Result: CoVe-4B model achieves 43.0% success rate in Airline domain and 59.4% in Retail domain on τ²-bench benchmark, outperforming similar-scale baselines and remaining competitive with models up to 17× larger.
Conclusion: CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents, with open-sourced code, models, and 12K high-quality trajectories.
Abstract: Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce \textbf{CoVe} (\textbf{Co}nstraint-\textbf{Ve}rification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging $τ^2$-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact \textbf{CoVe-4B} model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to $17\times$ its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
[915] LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations
Viet-Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung
Main category: cs.AI
TL;DR: LiveCultureBench is a multi-cultural simulation benchmark that evaluates LLM agents on both task completion and adherence to socio-cultural norms in a simulated town environment.
Details
Motivation: Current LLM agent evaluations focus primarily on task success metrics while neglecting cultural appropriateness and evaluator reliability, creating a need for benchmarks that assess both effectiveness and socio-cultural norm adherence.Method: Creates a simulated town as a location graph with synthetic residents having diverse demographic/cultural profiles. LLM agents are embedded in this environment with daily goals while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, aggregated into metrics capturing task-norm trade-offs and verifier uncertainty.
Result: The benchmark enables study of cross-cultural robustness of LLM agents, how they balance effectiveness against norm sensitivity, and when LLM-as-a-judge evaluation is reliable versus when human oversight is needed.
Conclusion: LiveCultureBench provides a comprehensive framework for evaluating LLM agents on both task performance and cultural appropriateness, addressing important gaps in current evaluation methodologies.
Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task-norm trade-offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross-cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM-as-a-judge evaluation is reliable for automated benchmarking versus when human oversight is needed.
[916] According to Me: Long-Term Personalized Referential Memory QA
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, Bill Byrne
Main category: cs.AI
TL;DR: ATM-Bench is a new benchmark for multimodal, multi-source personalized referential memory QA, featuring 4 years of privacy-preserving personal memory data with human-annotated QA pairs requiring resolution of personal references and multi-evidence reasoning.
Details
Motivation: Existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience across multiple modalities and sources like images, videos, and emails.Method: Introduces ATM-Bench benchmark with multimodal, multi-source personal memory data and human-annotated QA pairs. Proposes Schema-Guided Memory (SGM) to structurally represent memory items from different sources, and evaluates 5 state-of-the-art memory systems plus RAG baseline with different memory ingestion, retrieval, and answer generation techniques.
Result: Poor performance (under 20% accuracy) on ATM-Bench-Hard set, showing current systems struggle with multimodal personalized memory tasks. SGM improves performance over Descriptive Memory commonly used in prior works.
Conclusion: ATM-Bench addresses a critical gap in multimodal personalized memory evaluation, revealing significant challenges in handling multimodal, multi-source personal references and reasoning, with SGM showing promise as a structural memory representation approach.
Abstract: Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning from multi-source and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originated from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over Descriptive Memory commonly adopted in prior works. Code available at: https://github.com/JingbiaoMei/ATM-Bench
[917] Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization
Felipe Maia Polo, Aida Nematzadeh, Virginia Aglietti, Adam Fisch, Isabela Albuquerque
Main category: cs.AI
TL;DR: Tensor factorization model merges cheap autorater scores with limited human labels for fine-grained evaluation of generative models at prompt level
Details
Motivation: Fine-grained evaluation of generative models at prompt level is needed but suffers from data bottleneck: human labels are too costly while automated ratings are often misaligned with human judgmentMethod: Proposes statistical model based on tensor factorization that merges cheap autorater data with limited human gold-standard labels; uses autorater scores to pretrain latent representations of prompts and models, then aligns to human preferences using small calibration set
Result: Method is robust to autorater quality, more accurately predicts human preferences per-prompt than baselines, provides tight confidence intervals; enables granular leaderboards and model performance estimation from autorater scores alone
Conclusion: Sample-efficient methodology resolves data bottleneck for fine-grained evaluation of generative models, enabling practical applications like granular leaderboards without extensive human annotation
Abstract: Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models’ strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.
[918] OpenRad: a Curated Repository of Open-access AI models for Radiology
Konstantinos Vrettos, Galini Papadaki, Emmanouil Brilakis, Matthaios Triantafyllou, Dimitrios Leventis, Despina Staraki, Maria Mavroforou, Eleftherios Tzanis, Konstantina Giouroukou, Michail E. Klontzas
Main category: cs.AI
TL;DR: OpenRad is a curated, standardized open-access repository aggregating radiology AI models with metadata, pretrained weights, and interactive applications to improve discoverability and reproducibility.
Details
Motivation: Radiology AI models are scattered across platforms, limiting discoverability, reproducibility, and clinical translation. There's a need for a centralized, standardized repository to aggregate these models.Method: Retrospective analysis of 5239 records from PubMed, arXiv, and Scopus until Dec 2025. Used locally hosted LLM (gpt-oss:120b) to generate model records based on RSNA AI Roadmap JSON schema, with manual verification by 10 expert reviewers. Assessed LLM output stability on 225 papers using text similarity metrics.
Result: 1694 articles included spanning all imaging modalities and radiology subspecialties. Automated extraction showed high stability (Levenshtein ratio > 90%), with 78.5% of record edits being minor. CNN and transformer architectures dominant, MRI most common modality (621 neuroradiology AI models). Research concentrated in China and US.
Conclusion: OpenRad provides a comprehensive, searchable resource with ~1700 curated radiology AI models, standardized metadata, and analysis of code repositories, facilitating model discovery and clinical translation.
Abstract: The rapid developments in artificial intelligence (AI) research in radiology have produced numerous models that are scattered across various platforms and sources, limiting discoverability, reproducibility and clinical translation. Herein, OpenRad was created, a curated, standardized, open-access repository that aggregates radiology AI models and providing details such as the availability of pretrained weights and interactive applications. Retrospective analysis of peer reviewed literature and preprints indexed in PubMed, arXiv and Scopus was performed until Dec 2025 (n = 5239 records). Model records were generated using a locally hosted LLM (gpt-oss:120b), based on the RSNA AI Roadmap JSON schema, and manually verified by ten expert reviewers. Stability of LLM outputs was assessed on 225 randomly selected papers using text similarity metrics. A total of 1694 articles were included after review. Included models span all imaging modalities (CT, MRI, X-ray, US) and radiology subspecialties. Automated extraction demonstrated high stability for structured fields (Levenshtein ratio > 90%), with 78.5% of record edits being characterized as minor during expert review. Statistical analysis of the repository revealed CNN and transformer architectures as dominant, while MRI was the most commonly used modality (in 621 neuroradiology AI models). Research output was mostly concentrated in China and the United States. The OpenRad web interface enables model discovery via keyword search and filters for modality, subspecialty, intended use, verification status and demo availability, alongside live statistics. The community can contribute new models through a dedicated portal. OpenRad contains approx. 1700 open access, curated radiology AI models with standardized metadata, supplemented with analysis of code repositories, thereby creating a comprehensive, searchable resource for the radiology community.
[919] Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
Justin Waugh
Main category: cs.AI
TL;DR: Pencil Puzzle Bench is a framework for evaluating LLM reasoning through pencil puzzles with deterministic, step-level verification across 300 puzzles from 20 varieties.
Details
Motivation: To create a rigorous benchmark for evaluating large language model reasoning capabilities through constraint-satisfaction problems that allow precise, step-by-step verification of intermediate states.Method: Created a database of 62,231 puzzles across 94 varieties, selected 300 puzzles spanning 20 varieties, and evaluated 51 models from 11 providers in direct ask (single-shot) and agentic (multi-turn with iterative verification) modes with deterministic verification of every intermediate board state.
Result: Revealed two distinct axes of capability: reasoning effort scaling (GPT-5.2 improved 81x from no reasoning to maximum effort) and agentic iteration (Claude Opus 4.6 rose from 0.3% to 30.0% through iterative checking). Agentic attempts required median 29 turns over 17 minutes, testing both reasoning and long-context utilization.
Conclusion: Pencil puzzles provide a rigorous framework for evaluating LLM reasoning with precise error localization, revealing significant performance differences between models and highlighting the importance of both reasoning effort and iterative verification capabilities.
Abstract: We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours - a demanding test of long-context utilization, not just reasoning.
[920] Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen
Main category: cs.AI
TL;DR: Nano-EmoX is a 2.2B parameter multimodal language model that unifies six affective tasks across perception, understanding, and interaction levels using a cognitive hierarchy framework and curriculum-based training.
Details
Motivation: Existing affective multimodal language models suffer from fragmented capabilities and limited generalization due to the gap between low-level perception and high-level interaction. The paper aims to bridge this gap by proposing a unified cognitive framework.Method: Proposes a three-level cognitive hierarchy (perception, understanding, interaction) and introduces Nano-EmoX with omni-modal encoders (enhanced facial encoder, fusion encoder) and heterogeneous adapters projecting to language space. Uses P2E curriculum training framework that aligns perception with chain-of-thought-driven empathy.
Result: Nano-EmoX achieves state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization as the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels.
Conclusion: The proposed cognitive hierarchy and Nano-EmoX model successfully bridge the gap between perception and interaction in affective multimodal language models, enabling unified affective capabilities with strong generalization in a compact architecture.
Abstract: The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth-perception, understanding, and interaction-and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
[921] Conformal Policy Control
Drew Prinster, Clara Fannjiang, Ji Won Park, Kyunghyun Cho, Anqi Liu, Suchi Saria, Samuel Stanton
Main category: cs.AI
TL;DR: Safe exploration method using conformal calibration to regulate behavior change while maintaining safety guarantees, applicable to various domains including natural language and biomolecular engineering.
Details
Motivation: In high-stakes environments, agents need to explore new behaviors to improve, but excessive exploration can violate safety constraints and cause harm. The challenge is to balance exploration with safety without being overly conservative.Method: Uses any safe reference policy as a probabilistic regulator for untested policies. Employs conformal calibration on data from the safe policy to determine how aggressively new policies can act while provably enforcing user-defined risk tolerance. Provides finite-sample guarantees even for non-monotonic bounded constraint functions.
Result: Experiments show safe exploration is possible from the first moment of deployment and can improve performance across applications including natural language question answering and biomolecular engineering.
Conclusion: The method enables safe exploration without requiring users to identify correct model classes or tune hyperparameters, providing practical safety guarantees for real-world applications.
Abstract: An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user’s declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
[922] Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
Main category: cs.AI
TL;DR: T^3RL introduces tool verification to improve test-time reinforcement learning for large reasoning models by using external tools to verify rollouts and produce more reliable pseudo-labels, preventing incorrect mode collapse from spurious consensus.
Details
Motivation: Standard test-time reinforcement learning (TTRL) suffers from spurious high-frequency unverified consensus that becomes biased reward signals, leading to incorrect mode collapse in large reasoning models. This failure mode needs addressing to improve online adaptation on unlabeled test inputs.Method: T^3RL introduces test-time tool verification into reward estimation. A verifier uses external tools (e.g., code execution) as evidence to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. This approach stabilizes self-evolution through verified online data synthesis.
Result: T^3RL significantly improves over standard TTRL across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, with larger gains observed on harder problems.
Conclusion: Test-time tool verification is a key mechanism for stabilizing self-evolution in large reasoning models, and T^3RL demonstrates the effectiveness of verified online data synthesis for improving test-time reinforcement learning.
Abstract: Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
[923] Electric Vehicle User Charging Behavior Analysis Integrating Psychological and Environmental Factors: A Statistical-Driven LLM based Agent Approach
Chuanlin Zhang, Junkang Feng, Chenggang Cui, Pengfeng Lin, Hui Chen, Yan Xu, A. M. Y. M. Ghias, Qianguang Ma, Pei Zhang
Main category: cs.AI
TL;DR: LLM-based framework with statistical enhancement simulates EV taxi driver charging behavior by modeling psychological traits and situational factors, showing reliable reproduction of real-world patterns and user heterogeneity.
Details
Motivation: Understanding EV user charging behavior is critical for grid stability and transportation planning, especially with growing EV adoption. Current models may not adequately capture the complex interaction between psychological traits and situational triggers in dynamic travel contexts.Method: Developed a novel framework using large language models (LLMs) as core simulation tools with statistical enhancement. LLMs simulate personalized decision-making through natural language reasoning and role-playing, accounting for factors like time sensitivity, price awareness, and range anxiety. Statistical priors are integrated into the reasoning process to anchor decisions in empirical behavioral patterns.
Result: The framework reliably reproduces real-world charging behaviors across multiple urban environments. Analysis reveals the joint influence of environmental and psychological variables on charging decisions and highlights heterogeneity among different user groups.
Conclusion: The findings provide new insights into EV user behavior, offering a foundation for optimizing charging infrastructure, informing energy policy, and advancing the integration of EV behavioral models into smart transportation and energy management systems.
Abstract: With the growing adoption of electric vehicles (EVs), understanding user charging behavior has become critical for grid stability and transportation planning. This study investigates the behavioral heterogeneity of EV taxi drivers by analyzing the interaction between psychological traits and situational triggers within dynamic travel contexts. Leveraging large language models (LLMs) as a core simulation tool, a novel framework with statistical enhancement is developed to replicate and analyze the charging behaviors of taxi drivers. LLMs simulate personalized decision-making processes by leveraging natural language reasoning and role-playing capabilities, accounting for factors such as time sensitivity, price awareness, and range anxiety. Simulation results indicate that the framework reliably reproduces real-world charging behaviors across multiple urban environments. his fidelity arises from integrating statistical priors into the reasoning process, allowing the model to anchor its decisions in empirical behavioral patterns. Further analysis highlights the joint influence of environmental and psychological variables on charging decisions and reveals the heterogeneity of different user groups. The findings provide new insights into EV user behavior, offering a foundation for optimizing charging infrastructure, informing energy policy, and advancing the integration of EV behavioral models into smart transportation and energy management systems.
[924] ACPBench: Reasoning about Action, Change, and Planning
Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi
Main category: cs.AI
TL;DR: ACPBench is a benchmark for evaluating LLM reasoning capabilities in planning domains, featuring 7 reasoning tasks across 13 planning domains with provably correct solutions.
Details
Motivation: As LLMs are increasingly used as agents for orchestrating workflows requiring planning and multi-step reasoning, there's a need to evaluate their core planning skills systematically.Method: Constructed benchmark from planning domains described in formal language, allowing synthesis of problems with provably correct solutions across 7 reasoning tasks and 13 domains, enabling automatic generation of many additional problems without human effort.
Result: Evaluation of 22 LLMs and OpenAI o1 models revealed significant gaps in reasoning capability; OpenAI o1 showed gains on multiple-choice questions but surprisingly no notable progress on boolean questions.
Conclusion: ACPBench provides a comprehensive benchmark for evaluating LLM reasoning in planning, highlighting current limitations and the need for improved reasoning capabilities in LLMs used as planning agents.
Abstract: There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.
[925] Theoretical Foundations of Superhypergraph and Plithogenic Graph Neural Networks
Takaaki Fujita, Florentin Smarandache
Main category: cs.AI
TL;DR: Theoretical foundations for SuperHyperGraph Neural Networks (SHGNNs) and Plithogenic Graph Neural Networks, extending message-passing to advanced higher-order structures with uncertainty modeling.
Details
Motivation: To extend neural network learning to complex relational data structures beyond ordinary graphs, including hypergraphs (higher-order interactions), superhypergraphs (nested hierarchical structures), and plithogenic graphs (uncertainty and multi-valued attributes).Method: Develops theoretical foundations by providing rigorous definitions, establishing structural properties, and proving well-definedness results for SHGNNs and Plithogenic Graph Neural Networks, with emphasis on strengthened formulations of Soft Graph Neural Networks and Rough Graph Neural Networks.
Result: Theoretical framework for extending message-passing principles to advanced higher-order structures including superhypergraphs and plithogenic graphs, with formal properties and well-definedness proofs.
Conclusion: Establishes theoretical foundations for neural networks operating on complex graph structures that go beyond traditional graphs, enabling learning from hierarchical, uncertain, and multi-valued relational data.
Abstract: Hypergraphs generalize classical graphs by allowing a single edge to connect multiple vertices, providing a natural language for modeling higher-order interactions. Superhypergraphs extend this paradigm further by accommodating nested, set-valued entities and relations, enabling the representation of hierarchical, multi-level structures beyond the expressive reach of ordinary graphs or hypergraphs. In parallel, neural networks-especially Graph Neural Networks (GNNs)-have become a standard tool for learning from relational data, and recent years have seen rapid progress on Hypergraph Neural Networks (HGNNs) and their theoretical properties. To model uncertainty and multi-aspect attributes in complex networks, several graded and multi-valued graph frameworks have been developed, including fuzzy graphs and neutrosophic graphs. The plithogenic graph framework unifies and refines these approaches by incorporating multi-valued attributes together with membership and contradiction mechanisms, offering a flexible representation for heterogeneous and partially inconsistent information. This book develops the theoretical foundations of SuperHyperGraph Neural Networks (SHGNNs) and Plithogenic Graph Neural Networks, with the goal of extending message-passing principles to these advanced higher-order structures. We provide rigorous definitions, establish fundamental structural properties, and prove well-definedness results for key constructions, with particular emphasis on strengthened formulations of Soft Graph Neural Networks and Rough Graph Neural Networks.
[926] A Contemporary Overview: Trends and Applications of Large Language Models on Mobile Devices
Lianjun Liu, Hongli An, Pengxuan Chen, Longxiang Ye
Main category: cs.AI
TL;DR: LLMs are increasingly being deployed on mobile devices, enabled by hardware advances and network infrastructure, to provide natural user experiences while enhancing privacy and enabling integration with AR/IoT applications.
Details
Motivation: The motivation is to leverage LLMs' powerful NLP capabilities on mobile devices to create more natural and personalized user experiences, reduce cloud dependency, enhance data privacy, and enable integration with emerging technologies like AR and IoT.Method: The paper discusses deployment approaches including leveraging hardware advancements (neural network accelerators), network infrastructure (5G), and development tools (open APIs and SDKs) to enable efficient local inference and low-latency responses on mobile devices.
Result: LLMs can be effectively deployed on mobile devices for applications like voice assistants, real-time translation, and intelligent recommendations, reducing cloud reliance while maintaining performance and enhancing privacy.
Conclusion: Mobile deployment of LLMs represents a significant trend that enhances device intelligence, enables innovative applications, and will drive next-generation mobile intelligent applications through integration with AR and IoT technologies.
Abstract: With the rapid development of large language models (LLMs), which possess powerful natural language processing and generation capabilities, LLMs are poised to provide more natural and personalized user experiences. Their deployment on mobile devices is gradually becoming a significant trend in the field of intelligent devices. LLMs have demonstrated tremendous potential in applications such as voice assistants, real-time translation, and intelligent recommendations. Advancements in hardware technologies (such as neural network accelerators) and network infrastructure (such as 5G) have enabled efficient local inference and low-latency intelligent responses on mobile devices. This reduces reliance on cloud computing while enhancing data privacy and security. Developers can easily integrate LLM functionalities through open APIs and SDKs, enabling the creation of more innovative intelligent applications. The widespread use of LLMs not only enhances the intelligence of mobile devices but also fosters the integrated innovation of fields like augmented reality (AR) and the Internet of Things (IoT). This trend is expected to drive the development of the next generation of mobile intelligent applications.
[927] SymGPT: Auditing Smart Contracts via Combining Symbolic Execution with Large Language Models
Shihao Xia, Mengting He, Shuai Shao, Tingting Yu, Yiying Zhang, Nobuko Yoshida, Linhai Song
Main category: cs.AI
TL;DR: SymGPT combines LLMs with symbolic execution to automatically verify smart contract compliance with Ethereum ERC standards, detecting thousands of rule violations including exploitable security issues.
Details
Motivation: Current methods for verifying ERC compliance (manual audits, program-analysis tools, LLMs) are ineffective at detecting rule violations that can cause serious security issues and financial losses in smart contracts.Method: SymGPT uses LLMs to translate ERC rules into a domain-specific language, synthesizes constraints from translated rules to model potential violations, and performs symbolic execution for violation detection.
Result: SymGPT identified 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for financial theft, outperforming six automated techniques and expert auditing services.
Conclusion: SymGPT demonstrates superior effectiveness over current smart contract analysis methods by combining LLMs with symbolic execution for automated ERC compliance verification.
Abstract: To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each defining a set of rules governing contract behavior. Violating these rules can cause serious security issues and financial losses, signifying the importance of verifying ERC compliance. Today’s practices of such verification include manual audits, expert-developed program-analysis tools, and large language models (LLMs), all of which remain ineffective at detecting ERC rule violations. This paper introduces SymGPT, a tool that combines LLMs with symbolic execution to automatically verify smart contracts’ compliance with ERC rules. We begin by empirically analyzing 132 ERC rules from three major ERC standards, examining their content, security implications, and natural language descriptions. Based on this study, SymGPT instructs an LLM to translate ERC rules into a domain-specific language, synthesizes constraints from the translated rules to model potential rule violations, and performs symbolic execution for violation detection. Our evaluation shows that SymGPT identifies 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for financial theft. Furthermore, SymGPT outperforms six automated techniques and a security-expert auditing service, underscoring its superiority over current smart contract analysis methods.
[928] Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
Main category: cs.AI
TL;DR: VLMs suffer from “safety mirage” where supervised safety fine-tuning creates spurious correlations, making models vulnerable to simple word substitution attacks and causing over-prudence. Machine unlearning is proposed as a better alternative to directly remove harmful knowledge while preserving capabilities.
Details
Motivation: Current vision-language models (VLMs) generate harmful content when exposed to unsafe queries. Existing alignment strategies using supervised safety fine-tuning create "safety mirage" - superficial correlations between textual patterns and safety responses rather than deep harm mitigation, leaving models vulnerable to simple attacks and causing unnecessary refusal of benign queries.Method: Proposes machine unlearning (MU) as an alternative to supervised safety fine-tuning. MU avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Evaluated across safety benchmarks with attack success rate and unnecessary rejection metrics.
Result: MU-based alignment reduces attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20% compared to supervised fine-tuning approaches. Demonstrates that MU effectively addresses the safety mirage problem.
Conclusion: Machine unlearning is a powerful alternative to supervised safety fine-tuning for VLMs, effectively addressing the “safety mirage” problem by directly removing harmful knowledge while preserving model capabilities and reducing both vulnerability to attacks and unnecessary over-prudence.
Abstract: Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ‘‘safety mirage’’, where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.
[929] ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning
Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi
Main category: cs.AI
TL;DR: ACPBench Hard introduces a generative version of the ACPBench dataset for testing atomic reasoning about action and change through open-ended questions, revealing current language models struggle with planning reasoning tasks.
Details
Motivation: To address the limitation of original ACPBench's multiple-choice format, which doesn't reflect the open-ended generative nature required for real planning tasks, where models must generate answers rather than select from options.Method: Created ACPBench Hard by converting ACPBench’s boolean/multiple-choice questions into open-ended generative tasks, developed validation algorithms for each task type, and tested various language models including frontier and reasoning-focused models.
Result: Most models performed poorly (below 65%), with no model outperforming others, showing even frontier language models struggle with atomic planning reasoning tasks, despite being simpler than full planning.
Conclusion: Current language models, including reasoning-focused ones, have significant limitations in reliably reasoning about action and change for planning tasks, indicating substantial room for improvement in this fundamental AI capability.
Abstract: The ACPBench dataset provides atomic reasoning tasks required for efficient planning. The dataset is aimed at distilling the complex plan generation task into separate atomic reasoning tasks in their easiest possible form, boolean or multiple-choice questions, where the model has to choose the right answer from the provided options. While the aim of ACPBench is to test the simplest form of reasoning about action and change, when tasked with planning, a model does not typically have options to choose from and thus the reasoning required for planning dictates an open-ended, generative form for these tasks. To that end, we introduce ACPBench Hard, a generative version of ACPBench, with open-ended questions which the model needs to answer. Models that perform well on these tasks could in principle be integrated into a planner or be used directly as a policy. We discuss the complexity of these tasks as well as the complexity of validating the correctness of their answers and present validation algorithms for each task. Equipped with these validators, we test the performance of a variety of models on our tasks and find that for most of these tasks the performance of even the largest models is still subpar. Our experiments show that no model outperforms another in these tasks and with a few exceptions all tested language models score below 65%, indicating that even the current frontier language models have a long way to go before they can reliably reason about planning. In fact, even the so-called reasoning models struggle with solving these reasoning tasks. ACPBench Hard collection is available at the following link: https://ibm.github.io/ACPBench
[930] Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation
Qianqian Bai, Zhongpu Chen, Ling Luo, Huaming Du, Yuqian Lei, Ziyun Jiao
Main category: cs.AI
TL;DR: BrainNav is a bio-inspired spatial cognitive navigation framework for real-world VLN that integrates dual-map strategies and cognitive modules to reduce spatial hallucinations in mobile robots.
Details
Motivation: Current VLN methods work well in simulated environments but suffer from severe hallucination when transferred to real-world scenarios, causing robots to lose effective spatial awareness. There's a need for frameworks that can bridge this simulation-to-reality gap.Method: Proposes BrainNav, a bio-inspired framework with five core modules mimicking biological cognitive functions: Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit. Integrates dual-map (coordinate and topological) and dual-orientation (relative and absolute) strategies for real-time navigation.
Result: Validated in zero-shot real-world lab environment using Limo Pro robot. BrainNav, compatible with GPT-4, outperforms existing SOTA VLN-CE methods without fine-tuning, demonstrating reduced spatial hallucinations and enhanced adaptability.
Conclusion: BrainNav successfully addresses the hallucination problem in real-world VLN by incorporating bio-inspired spatial cognition principles, enabling more robust and adaptable navigation in continuous environments without requiring fine-tuning.
Abstract: Enhancing the spatial perception capabilities of mobile robots is crucial for achieving embodied Vision-and-Language Navigation (VLN). Although significant progress has been made in simulated environments, directly transferring these capabilities to real-world scenarios often results in severe hallucination phenomena, causing robots to lose effective spatial awareness. To address this issue, we propose BrainNav, a bio-inspired spatial cognitive navigation framework inspired by biological spatial cognition theories and cognitive map theory. BrainNav integrates dual-map (coordinate map and topological map) and dual-orientation (relative orientation and absolute orientation) strategies, enabling real-time navigation through dynamic scene capture and path planning. Its five core modules-Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit-mimic biological cognitive functions to reduce spatial hallucinations and enhance adaptability. Validated in a zero-shot real-world lab environment using the Limo Pro robot, BrainNav, compatible with GPT-4, outperforms existing State-of-the-Art (SOTA) Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods without fine-tuning.
[931] Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation
Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Huichi Zhou, Zhengqing Yuan, Zhifan Gao, Lei Zhu, Giorgos Papanastasiou, Yingying Fang, Guang Yang
Main category: cs.AI
TL;DR: BoxMed-RL is a unified training framework for generating spatially verifiable and explainable radiology reports using vision-language models with reinforcement learning and anatomical bounding box alignment.
Details
Motivation: Current radiology report generation models lack structured reasoning like experts, failing to link visual findings to precise anatomical locations, which hinders clinical trust and explainability.Method: Two-phase framework: (1) Pretraining Phase with medical concept learning using Chain-of-Thought supervision and spatially verifiable reinforcement learning to align findings with bounding boxes; (2) Downstream Adapter Phase with frozen pretrained weights and trainable adapter for fluent, clinically credible reports.
Result: Achieves average 7% improvement in METEOR and ROUGE-L metrics compared to SOTA methods, plus 5% improvement in LLM-based metrics, demonstrating robustness in generating high-quality reports.
Conclusion: BoxMed-RL successfully mimics radiologists’ workflow by connecting medical concepts with anatomical evidence, producing spatially verifiable and explainable radiology reports.
Abstract: Radiology report generation is critical for efficiency but current models lack the structured reasoning of experts, hindering clinical trust and explainability by failing to link visual findings to precise anatomical locations. This paper introduces BoxMed-RL, a groundbreaking unified training framework for generating spatially verifiable and explainable radiology reports. Built on a large vision-language model, BoxMed-RL revolutionizes report generation through two integrated phases: (1) In the Pretraining Phase, we refine the model via medical concept learning, using Chain-of-Thought supervision to internalize the radiologist-like workflow, followed by spatially verifiable reinforcement, which applies reinforcement learning to align medical findings with bounding boxes. (2) In the Downstream Adapter Phase, we freeze the pretrained weights and train a downstream adapter to ensure fluent and clinically credible reports. This framework precisely mimics radiologists’ workflow, compelling the model to connect high-level medical concepts with definitive anatomical evidence. Extensive experiments on public datasets demonstrate that BoxMed-RL achieves an average 7% improvement in both METEOR and ROUGE-L metrics compared to state-of-the-art methods. An average 5% improvement in large language model-based metrics further underscores BoxMed-RL’s robustness in generating high-quality radiology reports.
[932] The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models
Renfei Dang, Zhening Li, Shujian Huang, Jiajun Chen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2505.16448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[933] Token-Importance Guided Direct Preference Optimization
Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.19653: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19653&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[934] Adaptive Location Hierarchy Learning for Long-Tailed Mobility Prediction
Yu Wang, Junshu Dai, Yuchen Ying, Hanyang Yuan, Zunlei Feng, Tongya Zheng, Mingli Song
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2505.19965: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19965&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[935] VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents
Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, Bryan Hooi
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.02456: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02456&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[936] FAuNO: Semi-Asynchronous Federated Reinforcement Learning Framework for Task Offloading in Edge Systems
Frederico Metelo, Alexandre Oliveira, Stevo Racković, Pedro Ákos Costa, Cláudia Soares
Main category: cs.AI
TL;DR: Unable to analyze paper 2506.02668 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper abstractMethod: Cannot determine method without access to the paper abstract
Result: Cannot determine results without access to the paper abstract
Conclusion: Cannot determine conclusion without access to the paper abstract
Abstract: Failed to fetch summary for 2506.02668: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02668&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[937] Control Tax: The Price of Keeping AI in Check
Mikhail Terekhov, Zhen Ning David Liu, Caglar Gulcehre, Samuel Albanie
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2506.05296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[938] Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo
Main category: cs.AI
TL;DR: Paper 2506.05619: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2506.05619: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05619&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[939] Accuracy-Efficiency Trade-Offs in Spiking Neural Networks: A Lempel-Ziv Complexity Perspective on Learning Rules
Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - no abstract available for analysis
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.06750: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06750&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[940] Behavioral Generative Agents for Energy Operations
Cong Chen, Omer Karaduman, Xu Kuang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2506.12664: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12664&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[941] Structured Diversity Control: A Dual-Level Framework for Group-Aware Multi-Agent Coordination
Shuocun Yang, Huawen Hu, Xuan Liu, Yincheng Yao, Enze Shi, Shu Zhang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2506.18651 appears to be a recent arXiv submission from June 2025.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot determine conclusion without access to paper content.
Abstract: Failed to fetch summary for 2506.18651: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18651&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[942] A Message Passing Realization of Expected Free Energy Minimization
Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2508.02197: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02197&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[943] Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, Yashar Deldjoo
Main category: cs.AI
TL;DR: Paper 2508.15030 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to abstract fetching failureMethod: Unable to determine method due to abstract fetching failure
Result: Unable to determine results due to abstract fetching failure
Conclusion: Unable to determine conclusion due to abstract fetching failure
Abstract: Failed to fetch summary for 2508.15030: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.15030&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[944] Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision
Ao Cheng, Lei Zhang, Guowei He
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2508.20729: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20729&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[945] Toward Clinically Explainable AI for Medical Diagnosis: A Foundation Model with Human-Compatible Reasoning via Reinforcement Learning
Qika Lin, Yifan Zhu, Bin Pu, Ling Huang, Haoran Luo, Jingying Ma, Feng Wu, Kai He, Jiaxing Xu, Zhen Peng, Tianzhe Zhao, Fangzhi Xu, Jian Zhang, Zhonghong Ou, Erik Cambria, Swapnil Mishra, Mengling Feng
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.03906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[946] AISSISTANT: Human-AI Collaborative Review and Perspective Research Workflows in Data Science
Sasi Kiran Gaddipati, Farhana Keya, Gollam Rabby, Sören Auer
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2509.12282: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12282&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[947] Who Gets Cited Most? Benchmarking Long-Context Numerical Reasoning on Scientific Articles
Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.21028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[948] Bilinear representation mitigates reversal curse and enables consistent model editing
Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha
Main category: cs.AI
TL;DR: Unable to analyze paper 2509.21993 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation due to missing abstract dataMethod: Cannot determine method due to missing abstract data
Result: Cannot determine results due to missing abstract data
Conclusion: Cannot draw conclusions due to missing abstract data
Abstract: Failed to fetch summary for 2509.21993: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21993&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[949] From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Gyubok Lee, Woosog Chay, Heeyoung Kwak, Yeong Hwa Kim, Haanju Yoo, Oksoon Jeong, Meong Hi Son, Edward Choi
Main category: cs.AI
TL;DR: Paper 2509.23415: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2509.23415: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23415&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[950] ViTSP: A Vision Language Models Guided Framework for Solving Large-Scale Traveling Salesman Problems
Zhuoli Yin, Yi Ding, Reem Khir, Hua Cai
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.23465: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23465&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[951] G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong, Serge Panev, Chen Gong, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung, Alan Wee-Chung Liew, Shirui Pan
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper retrievalMethod: Cannot determine method due to failed paper retrieval
Result: Cannot determine results due to failed paper retrieval
Conclusion: Cannot determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2509.24276: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24276&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[952] Understanding the Role of Training Data in Test-Time Scaling
Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2510.03605 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable.
Result: Cannot determine results as paper content is unavailable.
Conclusion: Cannot draw conclusions about the paper due to inability to access its content.
Abstract: Failed to fetch summary for 2510.03605: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03605&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[953] FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning
Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, Tianlong Chen
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.04040: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04040&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[954] Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.04284: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04284&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[955] DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
Gang Li, Yan Chen, Ming Lin, Tianbao Yang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to technical error in accessing paper contentMethod: Unable to determine method due to technical error in accessing paper content
Result: Unable to determine results due to technical error in accessing paper content
Conclusion: Unable to determine conclusion due to technical error in accessing paper content
Abstract: Failed to fetch summary for 2510.04474: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04474&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[956] Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2510.12462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[957] HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
Jingcong Liang, Shijun Wan, Xuehai Wu, Yitong Li, Qianglong Chen, Duyu Tang, Siyuan Wang, Zhongyu Wei
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.12563: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12563&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[958] DAG-Math: Graph-of-Thought Guided Mathematical Reasoning in LLMs
Yuanhe Zhang, Ilja Kuzborskij, Jason D. Lee, Chenlei Leng, Fanghui Liu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2510.19842: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19842&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[959] The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence
Christian Dittrich, Jennifer Flygare Kinne
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.25883: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25883&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[960] The FM Agent
Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, Dou Shen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.26144: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26144&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[961] From Passive to Proactive: A Hierarchical Multi-Agent Framework for Automated Medical Pre-Consultation
ChengZhang Yu, YingRu He, Hongyan Cheng, nuo Cheng, Zhixing Liu, Dongxu Mu, Zhangrui Shen Yang Gao, and Zhanpeng Jin
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.01445: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01445&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[962] Knowledge Graph Augmented Large Language Models for Disease Prediction
Ruiyu Wang, Tuan Vinh, Ran Xu, Yuyin Zhou, Jiaying Lu, Carl Yang, Francisco Pasquel
Main category: cs.AI
TL;DR: Failed to fetch summary for arXiv ID 2512.01210 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as the paper summary could not be retrieved from arXiv APIMethod: Unable to determine method as the paper summary could not be retrieved
Result: Unable to determine results as the paper summary could not be retrieved
Conclusion: Unable to draw conclusions as the paper summary could not be retrieved
Abstract: Failed to fetch summary for 2512.01210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[963] Benchmarking Overton Pluralism in LLMs
Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, Michiel A. Bakker
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2512.01351: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01351&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[964] Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
Ely Hahami, Ishaan Sinha, Lavik Jain, Josh Kaplan, Jon Hahami
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2512.12411: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12411&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[965] Discovering Symmetry Groups with Flow Matching
Yuxuan Chen, Jung Yeon Park, Floor Eijkelboom, Jianke Yang, Jan-Willem van de Meent, Lawson L.S. Wong, Robin Walters
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) - no content available for analysis
Details
Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error from arXiv APIMethod: No method information available - arXiv API returned rate limiting error (HTTP 429)
Result: No results available - paper content could not be retrieved due to technical limitations
Conclusion: Unable to analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2512.20043: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20043&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[966] AWARE-US: Preference-Aware Infeasibility Resolution in Tool-Calling Agents
Mehmet Kurmaz
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Cannot analyze method without access to paper content
Result: No results available due to technical error in fetching paper information
Conclusion: Unable to provide analysis due to arXiv API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2601.02643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[967] Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.05724: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05724&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[968] PsyAgent: Constructing Human-like Agents Based on Psychological Modeling and Contextual Interaction
Zibin Meng, Kani Chen
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.06158: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06158&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[969] FSW-GNN: A Bi-Lipschitz WL-Equivalent Graph Neural Network
Yonatan Sverdlov, Yair Davidson, Nadav Dym, Tal Amir
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2410.09118: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.09118&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[970] DRAGON: LLM-Driven Decomposition and Reconstruction Agents for Large-Scale Combinatorial Optimization
Shengkai Chen, Zhiguang Cao, Jianan Zhou, Yaoxin Wu, Senthilnath Jayavelu, Zhuoyi Lin, Xiaoli Li, Shili Xiang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2601.06502 exists but content cannot be retrieved at this time.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2601.06502: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06502&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[971] A Tidal Current Speed Forecasting Model based on Multi-Periodicity Learning
Tengfei Cheng, Yangdi Huang, Ling Xiao, Yunxuan Dong
Main category: cs.AI
TL;DR: Unable to analyze paper 2410.09718 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailableMethod: Cannot determine method as abstract is unavailable
Result: Cannot determine results as abstract is unavailable
Conclusion: Cannot determine conclusion as abstract is unavailable
Abstract: Failed to fetch summary for 2410.09718: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.09718&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[972] ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms
Mohammad Pivezhandi, Mahdi Banisharif, Abusayeed Saifullah, Ali Jannesari
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2601.08166: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08166&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[973] Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning
Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2410.23450: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.23450&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[974] OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration
Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon
Main category: cs.AI
TL;DR: OrbitFlow: Adaptive KV cache management system for long-context LLM serving that dynamically offloads KV caches to host memory based on runtime demands to meet latency SLOs.
Details
Motivation: Serving long-context LLMs faces challenges with fluctuating memory footprints due to varying request lengths and batch composition during token generation. Existing static offloading strategies cannot adapt to rapidly shifting memory demands, leading to excessive CPU-to-GPU transfers, latency spikes, and SLO violations.Method: OrbitFlow uses a lightweight ILP solver to decide which layers’ KV caches to retain on GPU for each request within memory constraints. It continuously refines KV placements based on runtime feedback and invokes a fallback mechanism to temporarily defer requests with large memory footprints under heavy load.
Result: OrbitFlow improves SLO attainment for TPOT and TBT by up to 66% and 48% respectively, reduces 95th percentile latency by 38%, and achieves up to 3.3x higher throughput compared to existing offloading methods.
Conclusion: OrbitFlow provides an effective adaptive KV cache management system that addresses the dynamic memory demands of long-context LLM serving, significantly improving latency SLO attainment and throughput.
Abstract: Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective memory usage, but existing static and predetermined offloading strategies cannot adapt to the rapidly shifting memory demands of long-context serving. This often leads to excessive CPU-to-GPU KV transfers that translate into latency spikes and frequent SLO violations. To address these challenges, we introduce OrbitFlow, a fine-grained and adaptive KV cache management system that meets latency SLOs in long-context LLM serving. OrbitFlow employs a lightweight ILP solver to decide which layers’ KV caches to retain on the GPU for each request, within memory capacity constraints. It continuously refines KV placements based on runtime feedback when the active plan becomes suboptimal during token generation. Under heavy load, OrbitFlow invokes a fallback mechanism to temporarily defer in-flight requests with large memory footprints, preserving overall SLO attainment. Our experiments demonstrate that OrbitFlow improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively, while reducing the 95th percentile latency by 38% and achieving up to 3.3x higher throughput compared to existing offloading methods.
[975] MSP-LLM: A Unified Large Language Model Framework for Complete Material Synthesis Planning
Heewoong Noh, Gyoung S. Na, Namkyeong Lee, Chanyoung Park
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2602.07543: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07543&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[976] SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning
Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2602.09463: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09463&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[977] Learning Global Hypothesis Space for Enhancing Synergistic Reasoning Chain
Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Xudong Wang, Zhenzhen Huang, Pengcheng Zheng, Shuai Yuan, Sheng Zheng, Qigan Sun, Jie Zou, Lik-Hang Lee, Yang Yang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed data retrievalMethod: Unable to determine method due to failed data retrieval
Result: Unable to determine results due to failed data retrieval
Conclusion: Unable to draw conclusions due to failed data retrieval
Abstract: Failed to fetch summary for 2602.09794: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09794&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[978] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
Tianxiang Xu, Jiayi Liu, Yixuan Tong, Jialu Xu, Yunqing Wei, Kaiwen Feng, PanPan Hou, Kangping Yin, Jiyuan Hu, Hao Zhou, Zhenxin Ma, Jian Xu, Guanjun Jiang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.11661: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11661&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[979] REMem: Reasoning with Episodic Memory in Language Agent
Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, Yu Su
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.13530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[980] Arbor: A Framework for Reliable Navigation of Critical Conversation Flows
Luís Silva, Diogo Gonçalves, Catarina Farinha, Clara Matos, Luís Ungaro
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - paper retrieval failed due to rate limiting
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2602.14643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[981] Topological derivative approach for deep neural network architecture adaptation
C G Krishnanunni, Tan Bui-Thanh, Clint Dawson
Main category: cs.AI
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv
Details
Motivation: Unable to determine paper motivation due to access errorMethod: Unable to determine paper method due to access error
Result: Unable to determine paper results due to access error
Conclusion: Unable to determine paper conclusion due to access error
Abstract: Failed to fetch summary for 2502.06885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.06885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[982] EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
Sushant Mehta, Logan Ritchie, Suhaas Garre, Ian Niebres, Nick Heiner, Edwin Chen
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.16179: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16179&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[983] Phase-Aware Mixture of Experts for Agentic Reinforcement Learning
Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.17038: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17038&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[984] MagicAgent: Towards Generalized Agent Planning
Xuhui Ren, Shaokang Dong, Chen Yang, Qing Gao, Yunbin Zhao, Yongsheng Liu, Xinwei Geng, Xiang Li, Demei Yan, Yanqing Li, Chenhao Huang, Dingwei Zhu, Junjie Ye, Boxuan Yue, Yingnan Fu, Mengzhe Lv, Zezeng Feng, Boshen Zhou, Bocheng Wang, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang, Yunke Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.19000: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19000&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[985] GradientStabilizer:Fix the Norm, Not the Gradient
Tianjin Huang, Zhangyang Wang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Jiaxing Shang, Tianlong Chen, Ke Li, Lu Liu, Qingsong Wen, Shiwei Liu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2502.17055: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.17055&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[986] ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Hao Wang, Haofei Sun, Changpeng Yang, Yang Li, Peng Zhou, Shuai Nie, Hongzhen Wang, Linfeng Zhou, Hao Jia, Jiaming Xu, Runyu Shi, Ying Huang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.21858: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21858&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[987] SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
Sanjay Kariyappa, G. Edward Suh
Main category: cs.AI
TL;DR: Paper 2602.22603: Failed to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper informationMethod: Unable to determine method due to technical error fetching paper information
Result: Unable to determine results due to technical error fetching paper information
Conclusion: Unable to determine conclusion due to technical error fetching paper information
Abstract: Failed to fetch summary for 2602.22603: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22603&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[988] ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2602.23232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[989] Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive
Radha Sarma
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.23239 suggests it’s from February 2023, but without the abstract content, I cannot analyze its relevance to multimodal LLMs with audio/vision focus.
Details
Motivation: Cannot determine motivation without access to the paper abstract. The HTTP 429 error indicates rate limiting from arXiv API, preventing retrieval of the paper's content.Method: Cannot determine method without access to the paper abstract. The arXiv API rate limiting prevents analysis of the paper’s technical approach.
Result: Cannot determine results without access to the paper abstract. The technical content and findings are unavailable due to API restrictions.
Conclusion: Cannot draw conclusions about the paper’s content or relevance without access to the abstract. The arXiv API rate limiting prevents proper analysis.
Abstract: Failed to fetch summary for 2602.23239: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23239&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[990] The logic of KM belief update is contained in the logic of AGM belief revision
Giacomo Bonanno
Main category: cs.AI
TL;DR: The paper with ID 2602.23302 could not be fetched due to HTTP 429 error (rate limiting), so no abstract or content is available for analysis.
Details
Motivation: Unable to determine motivation since the paper content could not be retrieved due to HTTP 429 rate limiting error from arXiv API.Method: No method information available as the paper content could not be fetched.
Result: No results available due to inability to access the paper content.
Conclusion: Cannot provide conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2602.23302: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23302&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[991] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models
Nan Zhang, Eugene Kwek, Yusen Zhang, Ngoc-Hieu Nguyen, Prasenjit Mitra, Rui Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2504.02010 could not be retrieved from arXiv API.
Details
Motivation: Unable to determine motivation as paper content could not be fetched due to API rate limiting.Method: Unable to determine method as paper content could not be fetched.
Result: Unable to determine results as paper content could not be fetched.
Conclusion: Unable to determine conclusion as paper content could not be fetched.
Abstract: Failed to fetch summary for 2504.02010: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.02010&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[992] Astral: training physics-informed neural networks with error majorants
Vladimir Fanaskov, Tianchi Yu, Alexander Rudikov, Ivan Oseledets
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method without access to paper content
Result: No results available due to API access limitations
Conclusion: Paper analysis not possible due to technical limitations in accessing arXiv content
Abstract: Failed to fetch summary for 2406.02645: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.02645&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[993] A Survey for Deep Reinforcement Learning Based Network Intrusion Detection
Wanrong Yang, Alberto Acuto, Yihang Zhou, Dominik Wojtczak
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2410.07612: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.07612&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[994] Neuro-Symbolic Skill Discovery for Conditional Multi-Level Planning
Hakan Aktas, Yigit Yildirim, Ahmet Firat Gamsiz, Deniz Bilge Akkoc, Erhan Oztop, Emre Ugur
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2410.10045: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.10045&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[995] Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping
Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zixin Wang, Xinyi Xu, Wenxuan Wang, Zhiyao Xu, Yuhang Wang, Michael R. Lyu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2411.03292: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.03292&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[996] Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models
Jie Zhang, Xiangkui Cao, Zhouyu Han, Shiguang Shan, Xilin Chen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2412.19496: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.19496&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[997] Neural Spelling: A Spell-Based BCI System for Language Neural Decoding
Xiaowei Jiang, Charles Zhou, Yiqun Duan, Ziyi Zhao, Thomas Do, Chin-Teng Lin
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2501.17489: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.17489&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[998] Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.02881: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.02881&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[999] Predictive AI Can Support Human Learning while Preserving Error Diversity
Vivianna Fang He, Sihan Li, Phanish Puranam, Feng Lin
Main category: cs.AI
TL;DR: Unable to analyze paper 2502.16411 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2502.16411: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.16411&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1000] Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
Zangwei Zheng, Xiangyu Peng, Yuxuan Lou, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yuanheng Zhao, Yuqi Wang, Ziang Wei, Yang You
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access issuesMethod: Cannot determine method due to access issues
Result: Cannot determine results due to access issues
Conclusion: Cannot determine conclusion due to access issues
Abstract: Failed to fetch summary for 2503.09642: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.09642&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1001] ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM
Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2503.12988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.12988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1002] When Bias Meets Trainability: Connecting Theories of Initialization
Alberto Bassi, Marco Baity-Jesi, Aurelien Lucchi, Carlo Albert, Emanuele Francazi
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2505.12096: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12096&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1003] Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2504.13541: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13541&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1004] Self-Destructive Language Model
Yuhui Wang, Rongyi Zhu, Ting Wang
Main category: cs.AI
TL;DR: Paper ID 2505.12186 could not be fetched due to HTTP 429 error (rate limiting), so no abstract is available for analysis.
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved from arXiv due to rate limiting.Method: No method information available since the paper summary could not be fetched.
Result: No results information available due to HTTP 429 error preventing access to the paper content.
Conclusion: Cannot draw conclusions about an inaccessible paper; the arXiv API rate limiting prevented retrieval of the abstract.
Abstract: Failed to fetch summary for 2505.12186: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12186&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1005] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access errorMethod: Cannot determine method due to access error
Result: Cannot determine results due to access error
Conclusion: Cannot determine conclusion due to access error
Abstract: Failed to fetch summary for 2505.16056: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16056&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1006] REMS: a unified solution representation, problem modeling and metaheuristic algorithm design for general combinatorial optimization problems
Aijuan Song, Guohua Wu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2505.17108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1007] EquiReg: Equivariance Regularized Diffusion for Inverse Problems
Bahareh Tolooshams, Aditi Chandrashekar, Rayhan Zirvi, Abbas Mammadov, Jiachen Yao, Chuwei Wang, Anima Anandkumar
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2505.22973: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22973&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1008] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API access limitationsMethod: Unable to determine method due to API access limitations
Result: Unable to determine results due to API access limitations
Conclusion: Unable to analyze paper due to technical limitations in accessing arXiv data
Abstract: Failed to fetch summary for 2505.24298: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.24298&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1009] General search techniques without common knowledge for imperfect-information games, and application to superhuman Fog of War chess
Brian Hu Zhang, Tuomas Sandholm
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2506.01242: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01242&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1010] In-Context Learning for Pure Exploration
Alessio Russo, Ryan Welch, Aldo Pacchiano
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to wait before retrying
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.01876: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01876&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1011] General Protein Pretraining or Domain-Specific Designs? Benchmarking Protein Modeling on Realistic Applications
Shuo Yan, Yuliang Yan, Bin Ma, Chenao Li, Haochun Tang, Jiahua Lu, Minhua Lin, Yuyuan Feng, Enyan Dai
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2506.02052: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02052&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1012] Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs
Wenjing Tang, Xinyu He, Yongxi Huang, Yunxiao Xiao, Cewu Lu, Panpan Cai
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.02860: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02860&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1013] RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks
Shiying Duan, Pei Ren, Nanxiang Jiang, Zhengping Che, Jian Tang, Zhaoxin Fan, Yifan Sun, Wenjun Wu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to try again later or use alternative approach
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.06683: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06683&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1014] Intention-Conditioned Flow Occupancy Models
Chongyi Zheng, Seohong Park, Sergey Levine, Benjamin Eysenbach
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2506.08902: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08902&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1015] Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment
Tianyu Chen, Jian Lou, Wenjie Wang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2506.10030: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.10030&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1016] Distributions as Actions: A Unified Framework for Diverse Action Spaces
Jiamin He, A. Rupam Mahmood, Martha White
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2506.16608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.16608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1017] RL for Reasoning by Adaptively Revealing Rationales
Mohammad Hossein Amani, Aryo Lotfi, Nicolas Mario Baldwin, Samy Bengio, Mehrdad Farajtabar, Emmanuel Abbe, Robert West
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2506.18110: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18110&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1018] Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design
Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2507.00445: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.00445&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1019] Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs
Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2507.04219: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.04219&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1020] Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing
Eunbyeol Cho, Jiyoun Kim, Minjae Lee, Sungjin Park, Edward Choi
Main category: cs.AI
TL;DR: Unable to analyze paper 2507.06996 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrieved due to rate limiting errorMethod: No method information available - arXiv API returned HTTP 429 (Too Many Requests) error
Result: No results available - failed to fetch paper summary from arXiv
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2507.06996: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.06996&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1021] Model Predictive Adversarial Imitation Learning for Planning from Observation
Tyler Han, Yanda Bao, Bhaumik Mehta, Gabriel Guo, Anubhav Vishwakarma, Emily Kang, Sanghun Jung, Rosario Scalise, Jason Zhou, Bryan Xu, Byron Boots
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2507.21533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1022] FMIP: Joint Continuous-Integer Flow For Mixed-Integer Linear Programming
Hongpei Li, Hui Yuan, Han Zhang, Jianghao Lin, Dongdong Ge, Mengdi Wang, Yinyu Ye
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2507.23390: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.23390&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1023] PiKV: KV Cache Management System for Mixture of Experts
Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang
Main category: cs.AI
TL;DR: Unable to analyze paper 2508.06526 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation due to failed data retrievalMethod: Cannot determine method due to failed data retrieval
Result: Cannot determine results due to failed data retrieval
Conclusion: Cannot draw conclusions due to failed data retrieval
Abstract: Failed to fetch summary for 2508.06526: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.06526&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1024] FROGENT: An End-to-End Full-process Drug Design Multi-Agent System
Qihua Pan, Dong Xu, Qianwei Yang, Jenna Xinyi Yao, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2508.10760: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.10760&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1025] Learning Internal Biological Neuron Parameters and Complexity-Based Encoding for Improved Spiking Neural Networks Performance
Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.11674: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11674&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1026] Synaptic bundle theory for spike-driven sensor-motor system: More than eight independent synaptic bundles collapse reward-STDP learning
Takeshi Kobayashi, Shogo Yonekura, Yasuo Kuniyoshi
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2508.14492: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.14492&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1027] Implementing Pearl’s $\mathcal{DO}$-Calculus on Quantum Circuits: A Simpson-Type Case Study on NISQ Hardware
Pilsung Kang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.00744: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.00744&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1028] Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation
Zidong Chen, Zihao Guo, Peng Wang, ThankGod Itua Egbe, Yan Lyu, Chenghao Qian
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2509.13574: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13574&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1029] Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials
Shi Yin, Zujian Dai, Xinyang Pan, Lixin He
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2509.19877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.19877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1030] Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems
Zhangchi Zhu, Wei Zhang
Main category: cs.AI
TL;DR: Paper ID 2509.20989 - Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as abstract is unavailable due to arXiv API rate limitingMethod: Cannot determine method as abstract is unavailable due to arXiv API rate limiting
Result: Cannot determine results as abstract is unavailable due to arXiv API rate limiting
Conclusion: Cannot draw conclusions as abstract is unavailable due to arXiv API rate limiting
Abstract: Failed to fetch summary for 2509.20989: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20989&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1031] GraphUniverse: Synthetic Graph Generation for Evaluating Inductive Generalization
Louis Van Langendonck, Guillermo Bernárdez, Nina Miolane, Pere Barlet-Ros
Main category: cs.AI
TL;DR: Unable to analyze paper 2509.21097 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting errorMethod: Cannot determine method as abstract is unavailable due to rate limiting error
Result: Cannot determine results as abstract is unavailable due to rate limiting error
Conclusion: Cannot draw conclusions about paper content due to data retrieval failure
Abstract: Failed to fetch summary for 2509.21097: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21097&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1032] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to draw conclusions due to retrieval failure
Abstract: Failed to fetch summary for 2509.21500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1033] Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches
Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2509.22166 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as the paper content is unavailable due to API rate limiting.Method: Cannot determine method as the paper content is unavailable due to API rate limiting.
Result: Cannot determine results as the paper content is unavailable due to API rate limiting.
Conclusion: Cannot draw conclusions as the paper content is unavailable due to API rate limiting.
Abstract: Failed to fetch summary for 2509.22166: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22166&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1034] HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2509.22299: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22299&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1035] Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.22611: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22611&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1036] Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li, Tobias Schlagenhauf
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.23597: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23597&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1037] Towards Generalizable PDE Dynamics Forecasting via Physics-Guided Invariant Learning
Siyang Li, Yize Chen, Yan Guo, Ming Huang, Hui Xiong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to determine conclusion due to retrieval failure
Abstract: Failed to fetch summary for 2509.24332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1038] Distillation of Large Language Models via Concrete Score Matching
Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, Il-Chul Moon
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2509.25837: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25837&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1039] AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
Guanxi Lu, Hao Mark Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, Hongxiang Fan
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2509.26432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1040] Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity
Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.00304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1041] Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning
Luckeciano C. Melo, Alessandro Abate, Yarin Gal
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.00819: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00819&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1042] On The Fragility of Benchmark Contamination Detection in Reasoning Models
Han Wang, Haoyu Li, Brian Ko, Huan Zhang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2510.02386: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02386&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1043] Untargeted Jailbreak Attack
Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
Main category: cs.AI
TL;DR: Unable to analyze paper 2510.02999 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2510.02999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1044] Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents
Heyang Gao, Zexu Sun, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen
Main category: cs.AI
TL;DR: Paper 2510.03253 could not be analyzed due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Unable to determine motivation due to technical error in fetching paper informationMethod: Unable to determine method due to technical error in fetching paper information
Result: Unable to determine results due to technical error in fetching paper information
Conclusion: Unable to draw conclusions due to technical error in fetching paper information
Abstract: Failed to fetch summary for 2510.03253: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03253&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1045] Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling
Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.03638: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03638&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1046] RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training
Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava
Main category: cs.AI
TL;DR: Paper 2510.04008: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot draw conclusions due to missing abstract
Abstract: Failed to fetch summary for 2510.04008: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04008&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1047] MorphArtGrasp: Morphology-Aware Cross-Embodiment Dexterous Hand Articulation Generation for Grasping
Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin, Yan Wu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.06068: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06068&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1048] Reference Grounded Skill Discovery
Seungeun Rho, Aaron Trinh, Danfei Xu, Sehoon Ha
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2510.06203: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06203&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1049] Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data
Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation: Unable to determine motivation due to failed API request.Method: Unable to determine method due to failed API request.
Result: Unable to determine results due to failed API request.
Conclusion: Unable to determine conclusion due to failed API request.
Abstract: Failed to fetch summary for 2510.06377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1050] Value Flows
Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.07650: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07650&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1051] DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.07959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1052] Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.09462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1053] OBsmith: LLM-Powered JavaScript Obfuscator Testing
Shan Jiang, Chenguang Zhu, Sarfraz Khurshid
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.10066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1054] GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.11769: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11769&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1055] Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2510.10125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1056] Quantum Annealing for Staff Scheduling in Educational Environments
Alessia Ciacco, Francesca Guerriero, Eneko Osaba
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to determine conclusion due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2510.12278: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12278&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1057] State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living
Juheon Choi, Juyong Lee, Jian Kim, Chanyoung Kim, Taywon Min, W. Bradley Knox, Min Kyung Lee, Kimin Lee
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.14513: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14513&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1058] Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
Song Bian, Tao Yu, Shivaram Venkataraman, Youngsuk Park
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access limitationsMethod: Cannot determine method due to access limitations
Result: Cannot determine results due to access limitations
Conclusion: Cannot determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2510.18245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1059] Digital Companionship: Overlapping Uses of AI Companions and AI Assistants
Aikaterina Manoli, Janet V. T. Pauketat, Ali Ladak, Hayoun Noh, Angel Hsing-Chi Hwang, Jacy Reese Anthis
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.15905: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15905&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1060] TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks
Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.16028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1061] WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, Han Hu
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - technical issue with arXiv API access
Conclusion: Paper analysis impossible due to API rate limiting preventing content retrieval
Abstract: Failed to fetch summary for 2510.18560: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18560&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1062] A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization
Xuan Tang, Jichu Li, Difan Zou
Main category: cs.AI
TL;DR: Paper 2510.21314 appears to be unavailable due to HTTP 429 error (rate limiting), suggesting the arXiv API is temporarily blocking requests.
Details
Motivation: Unable to determine motivation due to paper content being inaccessible.Method: Unable to determine method due to paper content being inaccessible.
Result: Unable to determine results due to paper content being inaccessible.
Conclusion: Unable to determine conclusion due to paper content being inaccessible.
Abstract: Failed to fetch summary for 2510.21314: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21314&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1063] LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation
Gwihwan Go, Quan Zhang, Chijin Zhou, Zhao Wei, Yu Jiang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2510.22210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1064] Sample-efficient and Scalable Exploration in Continuous-Time RL
Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dörfler, Andreas Krause
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access failureMethod: Unable to determine method due to API access failure
Result: Unable to determine results due to API access failure
Conclusion: Unable to determine conclusion due to API access failure
Abstract: Failed to fetch summary for 2510.24482: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24482&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1065] Data-Augmented Deep Learning for Downhole Depth Sensing and Validation
Si-Yu Xiao, Xin-Di Zhao, Tian-Hao Mao, Yi-Wei Wang, Yu-Qiao Chen, Hong-Yun Zhang, Jian Wang, Jun-Jie Wang, Shuang Liu, Tu-Pei Chen, Yang Liu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation: Cannot determine motivation as paper content is unavailable.Method: Cannot determine method as paper content is unavailable.
Result: Cannot determine results as paper content is unavailable.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2511.00129: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00129&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1066] UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.00405: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00405&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1067] Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness
Subeen Park, Joowang Kim, Hakyung Lee, Sunjae Yoo, Kyungwoo Song
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2511.04401: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04401&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1068] AIRMap: AI-Generated Radio Maps for Wireless Digital Twins
Ali Saeizadeh, Miead Tehrani-Moayyed, Davide Villa, J. Gordon Beattie Jr., Pedram Johari, Stefano Basagni, Tommaso Melodia
Main category: cs.AI
TL;DR: Paper ID 2511.05522 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2511.05522: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05522&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1069] Hard-constraint physics-residual networks enable robust extrapolation for hydrogen crossover prediction in PEM water electrolyzers
Yong-Woon Kim, Paul D. Yoo, Chan Yeob Yeun, Chulung Kang, Yung-Cheol Byun
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.05879: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05879&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1070] AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents
Ye Zheng, Yidan Hu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.07441: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07441&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1071] WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning
Haojin Yang, Rui Hu, Zequn Sun, Rui Zhou, Yujun Cai, Yiwei Wang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2511.19473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1072] Reasoning on Time-Series for Financial Technical Analysis
Kelvin J.L. Koa, Jan Chen, Yunshan Ma, Huanhuan Zheng, Tat-Seng Chua
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.08616: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08616&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1073] Protein Structure Tokenization via Geometric Byte Pair Encoding
Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2511.11758: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11758&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1074] Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper with ID 2512.03324 cannot be analyzed without access to its abstract or content.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2512.03324: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03324&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1075] Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging
Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, Sergey Levine
Main category: cs.AI
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv
Details
Motivation: Unable to determine motivation as the paper abstract could not be retrieved due to rate limiting errorMethod: Cannot analyze method without access to the paper abstract or content
Result: No results available due to technical error in fetching the paper information
Conclusion: Analysis cannot be completed due to HTTP 429 error (rate limiting) when attempting to fetch the paper from arXiv
Abstract: Failed to fetch summary for 2512.08333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1076] When AI Gives Advice: Evaluating AI and Human Responses to Online Advice-Seeking for Well-Being
Harsh Kumar, Jasmine Chahal, Yinuo Zhao, Zeling Zhang, Annika Wei, Louis Tay, Ashton Anderson
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2512.08937: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08937&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1077] Mental Models of Autonomy and Sentience Shape Reactions to AI
Janet V.T. Pauketat, Daniel B. Shank, Aikaterina Manoli, Jacy Reese Anthis
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.09085: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09085&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1078] A Neural Network-Based Real-time Casing Collar Recognition System for Downhole Instruments
Si-Yu Xiao, Xin-Di Zhao, Xiang-Zhan Wang, Tian-Hao Mao, Ying-Kai Liao, Xing-Yu Liao, Yu-Qiao Chen, Jun-Jie Wang, Shuang Liu, Tu-Pei Chen, Yang Liu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2512.22901 suggests it’s from December 2024, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2512.22901: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22901&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1079] RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient Long-Context Transformers
Md Zesun Ahmed Mia, Malyaban Bal, Abhronil Sengupta
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2601.00426
Details
Motivation: Cannot determine motivation without access to the paper contentMethod: Cannot determine method without access to the paper content
Result: Cannot determine results without access to the paper content
Conclusion: Cannot draw conclusions without access to the paper content
Abstract: Failed to fetch summary for 2601.00426: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.00426&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1080] Convenience vs. Control: A Qualitative Study of Youth Privacy with Smart Voice Assistants
Molly Campbell, Trevor De Clark, Mohamad Sheikho Al Jasem, Sandhya Joshi, Ajay Kumar Shrestha
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.04399: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04399&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1081] Balancing Usability and Compliance in AI Smart Devices: A Privacy-by-Design Audit of Google Home, Alexa, and Siri
Trevor De Clark, Yulia Bobkova, Ajay Kumar Shrestha
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2601.04403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1082] AgentOCR: Reimagining Agent History via Optical Self-Compression
Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, Bo An
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.04786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1083] A Learnable Wavelet Transformer for Long-Short Equity Trading and Risk-Adjusted Return Optimization
Shuozhe Li, Du Cheng, Leqi Liu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.13435: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13435&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1084] Toward Youth-Centered Privacy-by-Design in Smart Devices: A Systematic Review
Molly Campbell, Mohamad Sheikho Al Jasem, Ajay Kumar Shrestha
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.11598: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11598&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1085] ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
Aryan Karmore
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to analyze paper due to technical error in fetching content
Abstract: Failed to fetch summary for 2601.13563: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13563&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1086] HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi, Jun Wu, Dawei Zhou
Main category: cs.AI
TL;DR: Paper 2601.18753 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to draw conclusions due to access restrictions
Abstract: Failed to fetch summary for 2601.18753: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18753&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1087] GNN Explanations that do not Explain and How to find Them
Steve Azzolin, Stefano Teso, Bruno Lepri, Andrea Passerini, Sagar Malhotra
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.20815: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20815&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1088] Stealthy Poisoning Attacks Bypass Defenses in Regression Settings
Javier Carnerero-Cano, Luis Muñoz-González, Phillippa Spencer, Emil C. Lupu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.22308: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22308&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1089] Beyond Single-Modal Analytics: A Framework for Integrating Heterogeneous LLM-Based Query Systems for Multi-Modal Data
Ruyu Li, Tinghui Zhang, Haodi Ma, Daisy Zhe Wang, Yifan Wang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.01701: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01701&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1090] Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards
Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.02555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1091] Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding
Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Sun, Boyu Wang, Pingzhao Hu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The arXiv API request was blocked, so I cannot analyze the paper content.
Details
Motivation: Cannot determine motivation without access to the paper abstract or content.Method: Cannot determine method without access to the paper abstract or content.
Result: Cannot determine results without access to the paper abstract or content.
Conclusion: Cannot draw conclusions without access to the paper abstract or content.
Abstract: Failed to fetch summary for 2602.02742: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02742&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1092] Structure-Informed Estimation for Pilot-Limited MIMO Channels via Tensor Decomposition
Alexandre Barbosa de Lima
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.04083 suggests it’s from February 2024, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content. The arXiv ID format suggests this is a recent paper from February 2024.Method: No method information available due to HTTP 429 error preventing access to paper details.
Result: No results available as the paper content could not be retrieved from arXiv API.
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content from arXiv.
Abstract: Failed to fetch summary for 2602.04083: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04083&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1093] CSRv2: Unlocking Ultra-Sparse Embeddings
Lixuan Guo, Yifei Wang, Tiansheng Wen, Yifan Wang, Aosong Feng, Bo Chen, Stefanie Jegelka, Chenyu You
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.05735 suggests it’s from February 2025, but no content is available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests.Method: Cannot determine method without access to the paper content. The paper ID format suggests it’s a recent submission (February 2025).
Result: Cannot determine results without access to the paper content. The error prevents retrieval of any abstract or paper details.
Conclusion: Cannot draw conclusions about the paper without access to its content. The arXiv API rate limiting prevents analysis.
Abstract: Failed to fetch summary for 2602.05735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1094] AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2602.07906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1095] Trojans in Artificial Intelligence (TrojAI) Final Report
Kristopher W. Reese, Taylor Kulp-McDowall, Michael Majurski, Tim Blattner, Derek Juba, Peter Bajcsy, Antonio Cardone, Philippe Dessauw, Alden Dima, Anthony J. Kearsley, Melinda Kleczynski, Joel Vasanth, Walid Keyrouz, Chace Ashcraft, Neil Fendley, Ted Staley, Trevor Stout, Josh Carney, Greg Canal, Will Redman, Aurora Schmidt, Cameron Hickert, William Paul, Jared Markowitz, Nathan Drenkow, David Shriver, Marissa Connor, Keltin Grimes, Marco Christiani, Hayden Moore, Jordan Widjaja, Kasimir Gabert, Uma Balakrishnan, Satyanadh Gundimada, John Jacobellis, Sandya Lakkur, Vitus Leung, Jon Roose, Casey Battaglino, Farinaz Koushanfar, Greg Fields, Xihe Gu, Yaman Jandali, Xinqiao Zhang, Tara Javidi, Akash Vartak, Tim Oates, Ben Erichson, Michael Mahoney, Rauf Izmailov, Xiangyu Zhang, Guangyu Shen, Siyuan Cheng, Shiqing Ma, XiaoFeng Wang, Haixu Tang, Di Tang, Xiaoyi Chen, Zihao Wang, Rui Zhu, Susmit Jha, Xiao Lin, Manoj Acharya, Weichao Zhou, Feisi Fu, Panagiota Kiourti, Chenyu Wang, Zijian Guo, H M Sabbir Ahmad, Wenchao Li, Chao Chen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2602.07152: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07152&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1096] FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff
Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, Kyung-Joong Kim
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - unable to analyze content
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2602.08040: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08040&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1097] Learning-guided Kansa collocation for forward and inverse PDEs beyond linearity
Zheyuan Hu, Weitao Chen, Cengiz Öztireli, Chenliang Zhou, Fangcheng Zhong
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.07970 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.07970: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07970&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1098] SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents
Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API access failureMethod: Unable to determine method due to API access failure
Result: Unable to determine results due to API access failure
Conclusion: Unable to determine conclusion due to API access failure
Abstract: Failed to fetch summary for 2602.11210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1099] Knowledge-Based Design Requirements for Generative Social Robots in Higher Education
Stephan Vonschallen, Dominique Oberle, Theresa Schmiedel, Friederike Eyssel
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2602.12873: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12873&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1100] Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.14759 suggests it’s from February 2025, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content. The HTTP 429 error indicates rate limiting from arXiv API.Method: No method information available due to failed content retrieval.
Result: No results available for analysis.
Conclusion: Cannot draw conclusions about paper content due to technical limitations in accessing the abstract.
Abstract: Failed to fetch summary for 2602.14759: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14759&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1101] HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering
Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without paper contentMethod: Cannot determine method without paper content
Result: Cannot determine results without paper content
Conclusion: Cannot determine conclusion without paper content
Abstract: Failed to fetch summary for 2602.15513: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15513&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1102] A Graph Meta-Network for Learning on Kolmogorov-Arnold Networks
Guy Bar-Shalom, Ami Tavory, Itay Evron, Maya Bechler-Speicher, Ido Guy, Haggai Maron
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.16316 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2602.16316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1103] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Luke J. Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.17616: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17616&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1104] Agentic Unlearning: When LLM Agent Meets Machine Unlearning
Bin Wang, Fan Wang, Pingping Wang, Jinyu Cong, Yang Yu, Yilong Yin, Zhongyi Han, Benzheng Wei
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.17692 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract content is unavailableMethod: Cannot determine method as abstract content is unavailable
Result: Cannot determine results as abstract content is unavailable
Conclusion: Cannot draw conclusions about paper content due to data retrieval failure
Abstract: Failed to fetch summary for 2602.17692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1105] GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.17739: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17739&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1106] Capabilities Ain’t All You Need: Measuring Propensities in AI
Daniel Romero-Alvarado, Fernando Martínez-Plumed, Lorenzo Pacchiardi, Hugo Save, Siddhesh Milind Pawar, Behzad Mehrbakhsh, Pablo Antonio Moreno Casares, Ben Slater, Paolo Bova, Peter Romero, Zachary R. Tidler, Jonathan Prunty, Luning Sun, Jose Hernandez-Orallo
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to determine conclusion due to retrieval failure
Abstract: Failed to fetch summary for 2602.18182: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1107] Learning from Complexity: Exploring Dynamic Sample Pruning of Spatio-Temporal Training
Wei Chen, Junle Chen, Yuqian Wu, Yuxuan Liang, Xiaofang Zhou
Main category: cs.AI
TL;DR: Paper 2602.19113: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.19113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1108] Large Language Model-Assisted UAV Operations and Communications: A Multifaceted Survey and Tutorial
Yousef Emami, Hao Zhou, Radha Reddy, Atefeh Hajijamali Arani, Biliang Wang, Kai Li, Luis Almeida, Zhu Han
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.19534 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.19534: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19534&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1109] Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available due to API rate limiting preventing access to paper information
Conclusion: Cannot draw conclusions about the paper as the content could not be retrieved due to rate limiting
Abstract: Failed to fetch summary for 2602.23234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1110] Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training
Yongzhong Xu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.23696: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23696&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[1111] Aurchestra: Fine-Grained, Real-Time Soundscape Control on Resource-Constrained Hearables
Seunghyun Oh, Malek Itani, Aseem Gauri, Shyamnath Gollakota
Main category: cs.SD
TL;DR: Aurchestra enables fine-grained, real-time soundscape control on hearables by extracting and mixing multiple overlapping sound sources independently, like an audio engineer mixing tracks.
Details
Motivation: Current hearables offer only blunt sound controls (global noise suppression or single target focus), but real-world acoustic scenes contain many simultaneous sources that users may want to adjust independently.Method: System with two key components: (1) dynamic interface that surfaces only active sound classes, and (2) real-time, on-device multi-output extraction network that generates separate streams for each selected class, optimized for compute-limited platforms with 6 ms streaming audio chunks.
Result: Achieves robust performance for up to 5 overlapping target sounds, enables expressive per-class sound control, and shows substantial improvements in target-class enhancement and interference suppression across real-world indoor/outdoor scenarios.
Conclusion: The world need not be heard as a single undifferentiated stream; with Aurchestra, the soundscape becomes truly programmable through fine-grained, real-time control of multiple sound sources.
Abstract: Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We introduce Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Our system has two key components: (1) a dynamic interface that surfaces only active sound classes and (2) a real-time, on-device multi-output extraction network that generates separate streams for each selected class, achieving robust performance for upto 5 overlapping target sounds, and letting users mix their environment by customizing per-class volumes, much like an audio engineer mixes tracks. We optimize the model architecture for multiple compute-limited platforms and demonstrate real-time performance on 6 ms streaming audio chunks. Across real-world environments in previously unseen indoor and outdoor scenarios, our system enables expressive per-class sound control and achieves substantial improvements in target-class enhancement and interference suppression. Our results show that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.
[1112] Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding
Shangda Wu, Ziya Zhou, Yongyi Zang, Yutong Zheng, Dafang Liang, Ruibin Yuan, Qiuqiang Kong
Main category: cs.SD
TL;DR: Voices of Civilizations is a multilingual QA benchmark for evaluating audio LLMs’ cultural comprehension on full-length music recordings across 38 languages.
Details
Motivation: There's a need to evaluate audio LLMs' ability to understand cultural nuances in music, as current models may struggle with cultural comprehension and exhibit biases across different cultural traditions.Method: Created a benchmark with 380 tracks across 38 languages using an automated pipeline with manual verification: 1) compile representative music list, 2) generate cultural-background documents via LLMs, 3) extract key attributes, 4) construct multiple-choice questions probing language, region, mood, and thematic content.
Result: State-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions.
Conclusion: The benchmark reveals limitations in current audio LLMs’ cultural understanding of music and provides a dataset to foster culturally inclusive music understanding research.
Abstract: We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs’ cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages - each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, region associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.
[1113] Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si
Main category: cs.SD
TL;DR: Whisper-MLA reduces Whisper’s KV cache memory usage by 87.5% while maintaining competitive ASR accuracy through Multi-Head Latent Attention adaptation.
Details
Motivation: Transformer-based Whisper model has high GPU memory consumption due to linearly growing KV cache usage, especially problematic for long-form audio applications.Method: Introduces Whisper-MLA by incorporating Multi-Head Latent Attention (MLA) into Whisper, adapting it for absolute positional embeddings, and systematically investigating application across encoder self-attention, decoder self-attention, and cross-attention modules.
Result: Applying MLA exclusively to decoder self-attention provides optimal balance between performance and memory efficiency. Whisper-MLA reduces KV cache size by up to 87.5% while maintaining competitive accuracy on LibriSpeech benchmark.
Conclusion: Whisper-MLA enables efficient conversion of pretrained Whisper models with minimal fine-tuning, offering significant memory savings for long-form audio applications without substantial accuracy loss.
Abstract: The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is problematic for many applications especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper’s absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
[1114] Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation
Jinhan Xu, Xing Tang, Houpeng Yang, Haoran Zhang, Shenghua Yuan, Jiatao Chen, Tianming Xi, Jing Wang, Jiaojiao Yu, Guangli Xiang
Main category: cs.SD
TL;DR: SMDIM is a diffusion-based model for symbolic music generation that combines efficient global structure construction using structured state space models with selective local refinement to handle long sequences with hierarchical structures.
Details
Motivation: Symbolic music generation involves long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained details. Existing diffusion models suffer from high training/inference costs with long sequences due to iterative denoising and sequence-length-related computational burdens.Method: SMDIM uses structured state space models to capture long-range musical context at near-linear cost, combined with a hybrid refinement scheme that selectively refines local musical details. This diffusing strategy efficiently handles global structure construction and light local refinement.
Result: SMDIM outperforms state-of-the-art approaches on generation quality and computational efficiency across various Western classical, popular, and traditional folk music datasets. It shows robust generalization to underexplored musical styles.
Conclusion: SMDIM offers a principled solution for long-sequence symbolic music generation with associated attributes, balancing quality and efficiency through structured state space modeling and selective refinement.
Abstract: Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Though recent diffusion-based models produce high quality generations, they tend to suffer from high training and inference costs with long symbolic sequences due to iterative denoising and sequence-length-related costs. To deal with such problem, we put forth a diffusing strategy named SMDIM to combine efficient global structure construction and light local refinement. SMDIM uses structured state space models to capture long range musical context at near linear cost, and selectively refines local musical details via a hybrid refinement scheme. Experiments performed on a wide range of symbolic music datasets which encompass various Western classical music, popular music and traditional folk music show that the SMDIM model outperforms the other state-of-the-art approaches on both the generation quality and the computational efficiency, and it has robust generalization to underexplored musical styles. These results show that SMDIM offers a principled solution for long-sequence symbolic music generation, including associated attributes that accompany the sequences. We provide a project webpage with audio examples and supplementary materials at https://3328702107.github.io/smdim-music/.
[1115] CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos
Main category: cs.SD
TL;DR: A comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), including datasets, benchmarks, and reward models for evaluating music generation conditioned on text, lyrics, and audio prompts.
Details
Motivation: Music generation models have advanced to handle complex multimodal inputs (text, lyrics, audio), but evaluation mechanisms have not kept pace. There's a critical need for comprehensive reward modeling to assess music quality, text-music alignment, and compositional instruction alignment.Method: 1) Created CMI-Pref-Pseudo (110k pseudo-labeled samples) and CMI-Pref (human-annotated corpus) datasets; 2) Proposed CMI-RewardBench unified benchmark for evaluating reward models; 3) Developed CMI-RMs, a parameter-efficient reward model family for heterogeneous inputs; 4) Evaluated correlation with human judgments and inference-time scaling via top-k filtering.
Result: CMI-RMs show strong correlation with human judgments on musicality and alignment. The models enable effective inference-time scaling via top-k filtering. All resources (training data, benchmarks, reward models) are publicly available.
Conclusion: The paper establishes a comprehensive ecosystem for music reward modeling under multimodal instructions, addressing the evaluation gap in music generation. The proposed datasets, benchmarks, and reward models provide valuable resources for the community.
Abstract: While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgments scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
[1116] SpectroFusion-ViT: A Lightweight Transformer for Speech Emotion Recognition Using Harmonic Mel-Chroma Fusion
Faria Ahmed, Rafi Hassan Chowdhury, Fatema Tuz Zohora Moon, Sabbir Ahmed
Main category: cs.SD
TL;DR: Lightweight SpectroFusion-ViT framework for Bangla speech emotion recognition using EfficientViT-b0 achieves state-of-the-art accuracy with only 2.04M parameters and 0.1 GFLOPs.
Details
Motivation: Most speech emotion recognition methods rely on heavy models or hand-crafted features that fail to balance accuracy and efficiency, especially for low-resource languages like Bangla. There's a need for lightweight yet accurate SER systems for resource-constrained settings.Method: Proposes SpectroFusion-ViT using EfficientViT-b0 architecture. Pipeline includes audio preprocessing/augmentation, extraction of Chroma and MFCC features, fusion into time-frequency descriptors, and fine-tuning the Vision Transformer for multi-class emotion classification.
Result: Achieves 92.56% accuracy on SUBESCO dataset and 82.19% on BanglaSER dataset, surpassing existing state-of-the-art methods. Model has only 2.04M parameters and requires 0.1 GFLOPs.
Conclusion: Lightweight transformer architectures can deliver robust speech emotion recognition performance while remaining computationally efficient for real-world deployment, particularly beneficial for low-resource languages.
Abstract: Speech is a natural means of conveying emotions, making it an effective method for understanding and representing human feelings. Reliable speech emotion recognition (SER) is central to applications in human-computer interaction, healthcare, education, and customer service. However, most SER methods depend on heavy backbone models or hand-crafted features that fail to balance accuracy and efficiency, particularly for low-resource languages like Bangla. In this work, we present SpectroFusion-ViT, a lightweight SER framework built utilizing EfficientViT-b0, a compact Vision Transformer architecture equipped with self-attention to capture long-range temporal and spectral patterns. The model contains only 2.04M parameters and requires 0.1 GFLOPs, enabling deployment in resource-constrained settings without compromising accuracy. Our pipeline first performs preprocessing and augmentation on raw audio, then extracts Chroma and Mel-frequency cepstral coefficient (MFCC) features. These representations are fused into a complementary time-frequency descriptor that preserves both fine-grained spectral detail and broader harmonic structure. Using transfer learning, EfficientViT-b0 is fine-tuned for multi-class emotion classification. We evaluate the system on two benchmark Bangla emotional speech datasets, SUBESCO and BanglaSER, which vary in speaker diversity, recording conditions, and acoustic characteristics. The proposed approach achieves 92.56% accuracy on SUBESCO and 82.19% on BanglaSER, surpassing existing state-of-the-art methods. These findings demonstrate that lightweight transformer architectures can deliver robust SER performance while remaining computationally efficient for real-world deployment.
[1117] AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu
Main category: cs.SD
TL;DR: AG-REPA introduces causal layer selection for representation alignment in audio Flow Matching, addressing the Store-Contribute Dissociation problem where semantically rich layers don’t necessarily drive generation.
Details
Motivation: Current REPA methods for audio Flow Matching rely on heuristic layer selection based on depth, which may not align with layers that actually drive the velocity field generation. The authors discovered that layers storing semantic/acoustic information aren't necessarily those contributing most to generation.Method: Proposes Attribution-Guided REPresentation Alignment (AG-REPA) with forward-only gate ablation (FoG-A) to quantify each layer’s causal contribution to the velocity field. This enables sparse layer selection and adaptive weighting for alignment based on actual generation impact rather than representational richness.
Result: AG-REPA consistently outperforms REPA baselines across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies. Shows alignment is most effective when applied to causally dominant layers that drive the velocity field.
Conclusion: Causal layer selection based on actual contribution to generation is superior to heuristic selection based on representational richness. The Store-Contribute Dissociation phenomenon reveals that semantically rich layers can be functionally passive in generation tasks.
Abstract: REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on the depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. Firstly, we find that layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, and we call it Store-Contribute Dissociation (SCD). To turn this insight into an actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer’s causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
[1118] SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation
Hongrui Wang, Fan Zhang, Zhiyuan Yu, Ziya Zhou, Xi Chen, Can Yang, Yang Wang
Main category: cs.SD
TL;DR: SyncTrack is a synchronous multi-track waveform music generation model that improves rhythmic consistency through track-shared modules for common rhythm and track-specific modules for timbre diversity.
Details
Motivation: Existing multi-track music generation models often overlook essential attributes like rhythmic stability and synchronization, focusing on differences between tracks rather than their inherent properties. This leads to poor rhythmic consistency in generated multi-track music.Method: SyncTrack features a novel architecture with track-shared modules using two cross-track attention mechanisms to synchronize rhythmic information across all tracks, and track-specific modules using learnable instrument priors to capture diverse timbres and pitch ranges. The model also introduces three novel evaluation metrics for rhythmic consistency.
Result: Experiments demonstrate that SyncTrack significantly improves multi-track music quality by enhancing rhythmic consistency compared to existing models.
Conclusion: SyncTrack effectively addresses rhythmic consistency issues in multi-track music generation through its synchronous architecture and novel evaluation metrics, representing an important advancement in the field.
Abstract: Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency.
[1119] DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement
Minghui Wu, Xueling Liu, Jiahuan Fan, Haitao Tang, Yanyong Zhang, Yue Zhang
Main category: cs.SD
TL;DR: DARS is a dysarthria-aware rhythm-style synthesis framework that improves dysarthric speech generation for ASR data augmentation by modeling pathological rhythm and acoustic style.
Details
Motivation: Dysarthric speech presents challenges for ASR due to abnormal prosody and speaker variability. Existing TTS-based data augmentation methods fail to accurately model pathological rhythm and acoustic style of dysarthric speech.Method: Proposes DARS based on Matcha-TTS architecture with multi-stage rhythm predictor optimized by contrastive preferences between normal/dysarthric speech, and dysarthric-style conditional flow matching mechanism for temporal rhythm reconstruction and pathological acoustic style simulation.
Result: Achieves MCD of 4.29 (close to real dysarthric speech). Adapting Whisper-based ASR with DARS synthetic speech achieves 54.22% relative WER reduction compared to state-of-the-art methods on TORGO dataset.
Conclusion: DARS effectively enhances dysarthric speech recognition performance by generating high-quality synthetic dysarthric speech that accurately models pathological rhythm and acoustic style.
Abstract: Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework’s effectiveness in enhancing recognition performance.
[1120] End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation
Minghui Wu, Haitao Tang, Jiahuan Fan, Ruizhi Liao, Yanyong Zhang
Main category: cs.SD
TL;DR: End-to-end simultaneous dysarthric speech reconstruction system with frame-level adaptor and multiple wait-k TTS for low-latency, robust speech conversion.
Details
Motivation: Cascaded ASR+TTS systems for dysarthric speech reconstruction have high latency and poor robustness due to pronunciation variability and limited prosodic prediction in incremental TTS.Method: Proposes end-to-end simultaneous DSR with: 1) Frame-level adaptor module for explicit-implicit semantic fusion to enhance TTS error tolerance to ASR outputs, 2) Multiple wait-k autoregressive TTS with multi-view knowledge distillation to mitigate prosodic degradation.
Result: Achieves 1.03s average response time, 0.71 RTF on Tesla A100, 4.67 MOS on UASpeech dataset, and 54.25% relative WER reduction compared to SOTA.
Conclusion: The proposed end-to-end simultaneous DSR system effectively addresses latency and robustness issues in dysarthric speech reconstruction through novel architectural innovations.
Abstract: Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: https://wflrz123.github.io/
[1121] TQCodec: Towards neural audio codec for high-fidelity music streaming
Lixing He, Zhouxuan Chen, Mingshuai Liu, Xinran Sun, Wucheng Wang, Minfu Li, Lingcheng Kong, Weifeng Zhao, Wenjiang Zhou
Main category: cs.SD
TL;DR: TQCodec is a neural audio codec designed for high-fidelity music streaming at 44.1 kHz with bitrates from 32-128 kbps, featuring enhanced architecture and perception-driven bit allocation for superior audio quality.
Details
Motivation: Existing neural audio codecs primarily target ultra-low bitrates (≤16 kbps), but there's a gap for high-fidelity music streaming applications that require higher bitrates (32-128 kbps) matching modern streaming platform standards.Method: Uses encoder-decoder architecture based on SEANet for efficient on-device computation, with enhancements including imbalanced network design, SimVQ for mid-frequency detail preservation, phase-aware waveform loss, and perception-driven band-wise bit allocation prioritizing lower frequencies.
Result: TQCodec achieves superior audio quality at target bitrates (32-128 kbps) compared to existing methods, as demonstrated through evaluations on diverse music datasets.
Conclusion: TQCodec is well-suited for high-quality audio applications and fills the gap between ultra-low bitrate neural codecs and the requirements of modern music streaming platforms.
Abstract: We propose TQCodec, a neural audio codec designed for high-bitrate, high-fidelity music streaming. Unlike existing neural codecs that primarily target ultra-low bitrates (<= 16kbps), TQCodec operates at 44.1 kHz and supports bitrates from 32 kbps to 128 kbps, aligning with the standard quality of modern music streaming platforms. The model adopts an encoder-decoder architecture based on SEANet for efficient on-device computation and introduces several enhancements: an imbalanced network design for improved quality with low overhead, SimVQ for mid-frequency detail preservation, and a phase-aware waveform loss. Additionally, we introduce a perception-driven band-wise bit allocation strategy to prioritize perceptually critical lower frequencies. Evaluations on diverse music datasets demonstrate that TQCodec achieves superior audio quality at target bitrates, making it well-suited for high-quality audio applications.
[1122] VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications
Loan Do, Thanh Ngoc Nguyen, Thanh Pham, Vinh Do, Hien Nguyen, Charlotte Nguyen
Main category: cs.SD
TL;DR: VietSuperSpeech is a large-scale Vietnamese ASR dataset focused on casual conversational speech from YouTube channels, addressing a gap in existing Vietnamese speech corpora that lack spontaneous speech data.
Details
Motivation: Existing Vietnamese ASR datasets predominantly feature formal, read speech (news, audiobooks) but lack casual conversational speech needed for real-world applications like chatbots, customer support, and call centers.Method: Collected 52,023 audio-text pairs (267.39 hours) from four YouTube channels covering everyday conversation, vlogging, community dialogue, and informal commentary. Audio standardized to 16 kHz mono PCM WAV, segmented into 3-30 second utterances. Transcriptions generated via pseudo-labeling using Zipformer-30M-RNNT-6000h model pre-trained on 6,000 hours of Vietnamese speech.
Result: Created dataset with 46,822 training samples (240.67 hours) and 5,201 development/test samples (26.72 hours). Text averages 266 characters per utterance, totaling 13.8 million fully diacritically marked Vietnamese characters.
Conclusion: VietSuperSpeech fills a critical gap in Vietnamese ASR by providing casual conversational speech data essential for conversational AI applications, complementing existing formal speech corpora.
Abstract: We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. Unlike existing Vietnamese ASR corpora that predominantly feature read speech, news narration, or audiobook content, VietSuperSpeech is sourced from four publicly accessible YouTube channels spanning everyday conversation, personal vlogging, overseas Vietnamese community dialogue, and informal commentary - the very speech styles encountered in real-world chatbot, customer support, call center, and hotline deployments. All audio is standardized to 16 kHz mono PCM WAV and segmented into 3-30 second utterances. Transcriptions are generated via pseudo-labeling using the Zipformer-30M-RNNT-6000h model (Nguyen, 2025) deployed through Sherpa-ONNX, pre-trained on 6,000 hours of Vietnamese speech. After quality filtering, the dataset is split into 46,822 training samples (240.67 hours) and 5,201 development/test samples (26.72 hours) with a fixed random seed. The text averages 266 characters per utterance, totaling 13.8 million fully diacritically marked Vietnamese characters. We demonstrate that VietSuperSpeech fills a critical gap in the Vietnamese ASR ecosystem: while corpora such as VLSP2020, VIET_BUD500, VietSpeech, FLEURS, VietMed, Sub-GigaSpeech2-Vi, viVoice, and Sub-PhoAudioBook provide broad coverage of formal and read speech, none specifically targets the casual, spontaneous register indispensable for conversational AI applications. VietSuperSpeech is publicly released at https://huggingface.co/datasets/thanhnew2001/VietSuperSpeech.
[1123] ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models
Xiaoyu Yi, Qi He, Gus Xia, Ziyu Wang
Main category: cs.SD
TL;DR: ViTex introduces a visual representation for instrumental texture control in multi-track symbolic music generation, enabling explicit texture-level control through color, spatial position, and stroke properties.
Details
Motivation: Existing music generation systems rely on extrinsic controls like text prompts that don't allow direct human shaping of compositions. While some intrinsic controls exist for piano/vocal settings, multi-track symbolic music lacks natural control dimensions for instrumentation and texture.Method: Proposes ViTex visual representation where color encodes instrument choice, spatial position represents pitch/time, and stroke properties capture local textures. Builds a discrete diffusion model conditioned on ViTex and chord progressions to generate 8-measure multi-track symbolic music.
Result: The system enables explicit texture-level control while maintaining strong unconditional generation quality for multi-track symbolic music composition.
Conclusion: ViTex provides a natural visual control dimension for instrumentation in multi-track music generation, addressing a gap in current systems and enabling more meaningful human-machine interaction in composition.
Abstract: In automatic music generation, a central challenge is to design controls that enable meaningful human-machine interaction. Existing systems often rely on extrinsic inputs such as text prompts or metadata, which do not allow humans to directly shape the composition. While prior work has explored intrinsic controls such as chords or hierarchical structure, these approaches mainly address piano or vocal-accompaniment settings, leaving multitrack symbolic music largely underexplored. We identify instrumentation, the choice of instruments and their roles, as a natural dimension of control in multi-track composition, and propose ViTex, a visual representation of instrumental texture. In ViTex, color encodes instrument choice, spatial position represents pitch and time, and stroke properties capture local textures. Building on this representation, we develop a discrete diffusion model conditioned on ViTex and chord progressions to generate 8-measure multi-track symbolic music, enabling explicit texture-level control while maintaining strong unconditional generation quality. The demo page and code are avaliable at https://vitex2025.github.io/.
[1124] CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
Bowen Zhang, Junchuan Zhao, Ian McLoughlin, Ye Wang, A S Madhukumar
Main category: cs.SD
TL;DR: CodecFlow: A neural codec-based speech bandwidth extension framework using voicing-aware conditional flow conversion and structure-constrained residual vector quantization for efficient high-fidelity speech reconstruction.
Details
Motivation: Existing speech bandwidth extension methods using spectrogram or waveform modeling have high computational costs and limited high-frequency fidelity. Neural audio codecs offer compact latent representations but face challenges in accurately recovering high-resolution latent information due to representation mismatch.Method: CodecFlow uses a neural codec-based framework with voicing-aware conditional flow converter on continuous codec embeddings and structure-constrained residual vector quantizer to improve latent alignment stability. The system is optimized end-to-end.
Result: CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech bandwidth extension tasks.
Conclusion: The proposed CodecFlow framework provides an efficient solution for speech bandwidth extension by operating in compact latent space, overcoming limitations of traditional spectrogram/waveform methods while maintaining high fidelity.
Abstract: Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. Neural audio codecs offer compact latent representations that better preserve acoustic detail, yet accurately recovering high-resolution latent information remains challenging due to representation mismatch. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space. CodecFlow employs a voicing-aware conditional flow converter on continuous codec embeddings and a structure-constrained residual vector quantizer to improve latent alignment stability. Optimized end-to-end, CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks.
[1125] Analytical Exploration of Spatial Audio Cues: A Differentiable Multi-Sphere Scattering Model
Siminfar Samakoush Galougah, Pranav Pulijala, Ramani Duraiswami
Main category: cs.SD
TL;DR: A differentiable acoustic scattering model for underwater spatial hearing using semi-transparent sphere with rigid scatterers, enabling machine learning optimization for source localization.
Details
Motivation: Existing HRTF models fail underwater due to impedance matching between water and soft tissue, unlike terrestrial environments. Biological underwater organisms use sound scattering for spatial hearing, motivating development of accurate scattering models for underwater applications.Method: Analytically derived closed-form forward model for scattering from semi-transparent sphere containing two rigid spherical scatterers. Implemented in fully differentiable setting for integration with ML algorithms. Uses physics-informed frequency weighting and Extended Kalman Filter with analytic Jacobians for moving-source tracking.
Result: Model accurately maps source direction, frequency, and material properties to pressure field. Demonstrates enhanced convergence for localization under noise and accurate moving-source tracking. Provides foundation for scattering-based microphone arrays over conventional beamforming.
Conclusion: Differentiable models of scattering from layered rigid and transparent geometries offer promising foundation for microphone arrays leveraging scattering-based spatial cues, applicable to both terrestrial and underwater applications.
Abstract: A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level and time differences (ITD/ILD). While Head-Related Transfer Function (HRTF) models based on rigid scattering suffice for terrestrial humans, they fail in underwater environments due to the near-impedance match between water and soft tissue. Motivated by the acoustic anatomy of underwater animals, we introduce a novel, analytically derived, closed-form forward model for scattering from a semi-transparent sphere containing two rigid spherical scatterers. This model accurately maps source direction, frequency, and material properties to the pressure field, capturing the complex physics of layered, penetrable structures. Critically, our model is implemented in a fully differentiable setting, enabling its integration with a machine learning algorithm to optimize a cost function for active localization. We demonstrate enhanced convergence for localization under noise using a physics-informed frequency weighting scheme, and present accurate moving-source tracking via an Extended Kalman Filter (EKF) with analytically computed Jacobians. Our work suggests that differentiable models of scattering from layered rigid and transparent geometries offer a promising new foundation for microphone arrays that leverage scattering-based spatial cues over conventional beamforming, applicable to both terrestrial and underwater applications. Our model will be made open source.
[1126] VoiceAgengRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures
Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming, Xiong, Silvio Savarese, Huan Wang
Main category: cs.SD
TL;DR: VoiceAgentRAG is a dual-agent memory router that separates retrieval from response generation using a slow thinker agent for topic prediction and cache pre-fetching, and a fast talker agent for rapid response generation from cached content.
Details
Motivation: To address latency issues in RAG systems where retrieval from vector databases introduces significant delays, especially in real-time conversational applications like voice agents.Method: Two-agent architecture: 1) Background “Slow Thinker” continuously monitors conversation, predicts follow-up topics using LLM, and pre-fetches relevant document chunks into FAISS semantic cache. 2) Foreground “Fast Talker” reads only from sub-millisecond cache, bypassing vector database entirely on cache hits.
Result: The system achieves sub-millisecond response times by eliminating vector database retrieval latency through intelligent pre-fetching and semantic caching.
Conclusion: VoiceAgentRAG demonstrates that decoupling retrieval from generation with intelligent caching can dramatically reduce latency in conversational RAG systems while maintaining relevance.
Abstract: We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
[1127] GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
Jinting Wang, Chenxing Li, Li Liu
Main category: cs.SD
TL;DR: GACA-DiT: A diffusion transformer framework for dance-to-music generation that improves rhythmic alignment through genre-adaptive rhythm extraction and context-aware temporal alignment modules.
Details
Motivation: Existing dance-to-music generation methods use coarse rhythm embeddings that discard fine-grained motion cues, resulting in weak rhythmic alignment. Temporal mismatches from feature downsampling further hinder synchronization between dance and music.Method: Proposes GACA-DiT with two novel modules: 1) Genre-adaptive rhythm extraction combining multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting, 2) Context-aware temporal alignment using learnable context queries to align music latents with relevant dance rhythm features.
Result: Extensive experiments on AIST++ and TikTok datasets show GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation.
Conclusion: GACA-DiT effectively addresses rhythmic alignment and temporal synchronization issues in dance-to-music generation through its novel modules, achieving superior performance over existing methods.
Abstract: Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose \textbf{GACA-DiT}, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a \textbf{genre-adaptive rhythm extraction} module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a \textbf{context-aware temporal alignment} module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.
[1128] SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Bing Zhou, Zhengzhong Tu, Shan Ye, Yuhao Kang
Main category: cs.SD
TL;DR: GeoS2L: A novel task for generating geographically realistic landscape images from environmental soundscapes using diffusion transformers with geo-contextual conditioning.
Details
Motivation: Existing audio-to-image models work well for specific objects but fail to reconstruct real-world landscapes from environmental soundscapes, creating a gap in geographically coherent soundscape-to-landscape generation.Method: Proposes SounDiT, a diffusion transformer-based model that incorporates environmental soundscapes and geo-contextual scene conditioning. Also introduces two large-scale geo-contextual multi-modal datasets (SoundingSVI and SonicUrban) and a Place Similarity Score (PSS) evaluation framework.
Result: SounDiT outperforms existing baselines in GeoS2L task, and the proposed PSS effectively captures multi-level generation consistency across element, scene, and human perception dimensions.
Conclusion: The paper successfully addresses the gap in environmental soundscape-to-landscape generation with a novel task, model, datasets, and evaluation framework that enables geographically coherent landscape synthesis from sound.
Abstract: Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in the GeoS2L, while the PSS effectively captures multi-level generation consistency across element, scene,and human perception. Project page: https://gisense.github.io/SounDiT-Page/
[1129] Discovering and Steering Interpretable Concepts in Large Generative Music Models
Nikhil Singh, Manuel Cherep, Pattie Maes
Main category: cs.SD
TL;DR: Researchers use sparse autoencoders to extract interpretable musical concepts from transformer-based music generators, revealing both traditional music theory patterns and novel uncodified structures that can steer model generations.
Details
Motivation: Neural networks that generate music have learned implicit theories of musical structure through statistical learning, offering a new lens to study human-generated media. This presents an opportunity to discover how musical concepts emerge from data and identify patterns that traditional frameworks may have missed.Method: The paper introduces a method using sparse autoencoders (SAEs) to extract interpretable features from the residual stream of autoregressive transformer music generators. The approach is made scalable and evaluable through automated labeling and validation pipelines.
Result: The method successfully extracts both familiar musical concepts (like chord progressions) and coherent but uncodified patterns lacking clear counterparts in existing music theory or language. These discovered concepts can be used to steer model generations.
Conclusion: Beyond improving model transparency, this work provides an empirical tool for uncovering organizing principles in music that have eluded traditional analysis methods, bridging neural representations with human-interpretable concepts.
Abstract: The fidelity with which neural networks can now generate content such as music presents a scientific opportunity: these systems appear to have learned implicit theories of such content’s structure through statistical learning alone. This offers a potentially new lens on theories of human-generated media. When internal representations align with traditional constructs (e.g. chord progressions in music), they show how such categories can emerge from statistical regularities; when they diverge, they expose limits of existing frameworks and patterns we may have overlooked but that nonetheless carry explanatory power. In this paper, focusing on autoregressive music generators, we introduce a method for discovering interpretable concepts using sparse autoencoders (SAEs), extracting interpretable features from the residual stream of a transformer model. We make this approach scalable and evaluable using automated labeling and validation pipelines. Our results reveal both familiar musical concepts and coherent but uncodified patterns lacking clear counterparts in theory or language. As an extension, we show such concepts can be used to steer model generations. Beyond improving model transparency, our work provides an empirical tool for uncovering organizing principles that have eluded traditional methods of analysis and synthesis.
[1130] TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer, Ondrej Klejch, Peter Bell
Main category: cs.SD
TL;DR: TTSDS2 is an improved objective metric for evaluating text-to-speech systems that correlates well with subjective scores across multiple languages and domains, addressing challenges in TTS evaluation.
Details
Motivation: Current TTS evaluation is challenging - subjective metrics like MOS are not easily comparable, objective metrics are rarely validated against subjective ones, and both struggle with modern TTS systems that produce speech indistinguishable from real human speech.Method: Introduces Text to Speech Distribution Score 2 (TTSDS2), an improved version of TTSDS, which is an objective metric that better correlates with subjective human evaluations. Also releases resources including a dataset with over 11,000 subjective ratings, a pipeline for multilingual test dataset creation, and a benchmark for TTS in 14 languages.
Result: TTSDS2 outperforms 15 other metrics, being the only one to achieve Spearman correlation above 0.50 for every domain and subjective score evaluated across multiple languages and domains.
Conclusion: TTSDS2 provides a robust, validated objective metric for TTS evaluation that addresses current limitations, with supporting resources for comprehensive multilingual TTS benchmarking.
Abstract: Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
[1131] GDiffuSE: Diffusion-based speech enhancement with noise model guidance
Efrayim Yanir, David Burshtein, Sharon Gannot
Main category: cs.SD
TL;DR: GDiffuSE: A diffusion-based speech enhancement method that uses a lightweight helper model to estimate noise distribution and guide the diffusion denoising process, improving robustness to unseen noise types.
Details
Motivation: Traditional speech enhancement methods directly map noisy to clean speech, which can struggle with unseen noise types. The authors aim to improve robustness by leveraging diffusion models originally trained for speech generation and adapting them to speech enhancement through guidance mechanisms.Method: Uses a denoising diffusion probabilistic model (DDPM) with a guidance mechanism. A lightweight helper model estimates the noise distribution, which is then incorporated into the diffusion denoising process to guide the enhancement.
Result: The method shows consistent improvements over state-of-the-art baselines under mismatched noise conditions when evaluated on noisy signals created by adding BBC sound effects noise to LibriSpeech utterances.
Conclusion: GDiffuSE demonstrates that diffusion models with guidance mechanisms can effectively adapt to speech enhancement tasks, offering improved robustness to unseen noise types compared to conventional methods.
Abstract: This paper introduces a novel speech enhancement (SE) approach based on a denoising diffusion probabilistic model (DDPM), termed Guided diffusion for speech enhancement (GDiffuSE). In contrast to conventional methods that directly map noisy speech to clean speech, our method employs a lightweight helper model to estimate the noise distribution, which is then incorporated into the diffusion denoising process via a guidance mechanism. This design improves robustness by enabling seamless adaptation to unseen noise types and by leveraging large-scale DDPMs originally trained for speech generation in the context of SE. We evaluate our approach on noisy signals obtained by adding noise samples from the BBC sound effects database to LibriSpeech utterances, showing consistent improvements over state-of-the-art baselines under mismatched noise conditions. Examples are available at our project webpage.
[1132] ERIS: Evolutionary Real-world Interference Scheme for Jailbreaking Audio Large Models
Yibo Zhang, Liang Lin
Main category: cs.SD
TL;DR: ERIS is a framework that uses genetic algorithms to optimize real-world audio interference as carriers for jailbreaking Audio Large Models, making malicious instructions appear as natural speech with background noise to bypass safety filters.
Details
Motivation: Current Audio Large Models alignment focuses on clean inputs, ignoring security risks in complex environments. The paper addresses how seemingly harmless real-world interference can be exploited to circumvent safety constraints in audio models.Method: ERIS uses a genetic algorithm with population initialization, crossover fusion, and probabilistic mutation to evolve audio samples that fuse malicious instructions with naturalistic real-world interference signals, making them appear as harmless background noise.
Result: Evaluations on multiple ALMs show ERIS significantly outperforms both text and audio jailbreak baselines, demonstrating that real-world interference can effectively bypass model alignment and safety filters.
Conclusion: The research reveals that innocuous real-world interference can be weaponized to circumvent safety constraints in audio models, providing important insights for developing defensive mechanisms in complex acoustic scenarios.
Abstract: Existing Audio Large Models (ALMs) alignment focuses on clean inputs, neglecting security risks in complex environments. We propose ERIS, a framework transforming real-world interference into a strategically optimized carrier for jailbreaking ALMs. Unlike methods relying on manually designed acoustic patterns, ERIS uses a genetic algorithm to optimize the selection and synthesis of naturalistic signals. Through population initialization, crossover fusion, and probabilistic mutation, it evolves audio fusing malicious instructions with real-world interference. To humans and safety filters, these samples present as natural speech with harmless background noise, yet bypass alignment. Evaluations on multiple ALMs show ERIS significantly outperforms both text and audio jailbreak baselines. Our findings reveal that seemingly innocuous real-world interference can be leveraged to circumvent safety constraints, providing new insights for defensive mechanisms in complex acoustic scenarios.
[1133] FOCAL: A Novel Benchmarking Technique for Multi-modal Agents
Anupam Purwar, Aditya Choudhary
Main category: cs.SD
TL;DR: FOCAL is a benchmarking framework for evaluating multimodal voice agents, focusing on end-to-end reasoning, error propagation analysis, and introducing Reasoning and Semantic scores for voice conversation quality assessment.
Details
Motivation: Cascading pipelines for voice agents are widely used in industry due to LLM-powered reasoning capabilities, but they suffer from error propagation issues. There's a need for systematic benchmarking of multimodal agents with voice and text support to evaluate reasoning quality and identify error sources.Method: Proposes FOCAL framework for benchmarking end-to-end reasoning and component-wise error propagation in multimodal agents. Includes automated and human-assisted testing capabilities, and introduces two novel metrics: Reasoning score and Semantic score for evaluating voice conversation efficacy.
Result: The paper presents a benchmarking framework with novel evaluation metrics specifically designed for multimodal voice agents, enabling systematic assessment of reasoning capabilities and error analysis in voice-to-voice + text input systems.
Conclusion: FOCAL provides a comprehensive framework for evaluating multimodal voice agents, addressing the critical need for standardized benchmarking of reasoning capabilities and error propagation in cascading pipeline architectures used in industry voice agents.
Abstract: With the recent advancements in reasoning capabilities, tool calling using MCP servers and Audio Language Models (ALMs), development and integration of multi-modal agents (with voice and text support) has come to the industry forefront. Cascading pipelines for voice agents still play a central role in the industry owing to their superior reasoning capabilities facilitated by LLMs. Although, cascading pipelines often present error propagation through the pipeline. We propose a framework, FOCAL to benchmark end-to-end reasoning, component-wise error propagation and error analysis for automated as well as human-assisted testing of multi-modal agents (voice to voice + text input). We also share two novel metrics viz. Reasoning and Semantic scores to evaluate efficacy of the agent in having meaningful conversations in voice mode.
[1134] Learning Vocal-Tract Area and Radiation with a Physics-Informed Webster Model
Minhui Lu, Joshua D. Reiss
Main category: cs.SD
TL;DR: Physics-informed neural network for singing-voice synthesis that estimates vocal-tract parameters from audio and F0, using Webster’s equation with DDSP stabilization during training only.
Details
Motivation: To create a more interpretable and physically-grounded singing-voice synthesis system that can estimate vocal-tract parameters from audio while maintaining stability under various conditions.Method: Train a time-domain Webster model as a physics-informed neural network to estimate vocal-tract area function and radiation coefficient from synthetic audio and F0. Uses PDE and boundary consistency constraints with lightweight DDSP only for training stabilization, while inference is purely physics-based.
Result: The method reproduces spectral envelopes competitively with DDSP baselines on sustained vowels and remains stable under discretization changes, moderate source variations, and pitch shifts. However, the synthesized waveform is breathier than reference.
Conclusion: Physics-informed approach shows promise for interpretable singing-voice synthesis but needs periodicity-aware objectives and glottal priors to address breathiness issues.
Abstract: We present a physics-informed voiced backend renderer for singing-voice synthesis. Given synthetic single-channel audio and a fund-amental–frequency trajectory, we train a time-domain Webster model as a physics-informed neural network to estimate an interpretable vocal-tract area function and an open-end radiation coefficient. Training enforces partial differential equation and boundary consistency; a lightweight DDSP path is used only to stabilize learning, while inference is purely physics-based. On sustained vowels (/a/, /i/, /u/), parameters rendered by an independent finite-difference time-domain Webster solver reproduce spectral envelopes competitively with a compact DDSP baseline and remain stable under changes in discretization, moderate source variations, and about ten percent pitch shifts. The in-graph waveform remains breathier than the reference, motivating periodicity-aware objectives and explicit glottal priors in future work.
[1135] Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou
Main category: cs.SD
TL;DR: Echo introduces audio-interleaved reasoning for Large Audio Language Models, enabling dynamic re-listening to audio during reasoning to overcome information bottlenecks in audio comprehension.
Details
Motivation: Current LALMs have an information bottleneck because they encode audio content only once, unlike human cognition which engages with audio dynamically. The paper aims to enable sustained audio engagement and perception-grounded analysis.Method: Proposes audio-interleaved reasoning with two-stage training: 1) supervised fine-tuning to localize salient audio segments, 2) reinforcement learning to incentivize proficient re-listening. Also develops structured data generation pipeline for training.
Result: Echo achieves overall superiority on audio comprehension benchmarks in both expert-level and general-purpose tasks. Analysis confirms efficiency and generalizability of audio-interleaved reasoning.
Conclusion: Audio-interleaved reasoning is a promising direction for advancing audio comprehension in LALMs, enabling more human-like dynamic engagement with audio content.
Abstract: The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio in demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.
[1136] Interpreting Multi-Branch Anti-Spoofing Architectures: Correlating Internal Strategy with Empirical Performance
Ivan Viakhirev, Kirill Borodin, Mikhail Gorodnichev, Grach Mkrtchian
Main category: cs.SD
TL;DR: A framework for interpreting AASIST3 audio anti-spoofing models at component level using covariance operators and CatBoost meta-classifier to analyze branch cooperation/competition patterns across spoofing attacks.
Details
Motivation: Multi-branch deep neural networks like AASIST3 achieve state-of-the-art performance in audio anti-spoofing but their internal decision dynamics remain opaque. While existing interpretability focuses on input artifacts, the way individual architectural branches cooperate or compete under different spoofing attacks is not well characterized.Method: Developed a framework for interpreting AASIST3 at component level. Intermediate activations from fourteen branches and global attention modules are modeled with covariance operators whose leading eigenvalues form low-dimensional spectral signatures. These signatures train a CatBoost meta-classifier to generate TreeSHAP-based branch attributions, converted into normalized contribution shares and confidence scores (Cb) to quantify operational strategy.
Result: Analyzing 13 spoofing attacks from ASVspoof 2019 benchmark identified four operational archetypes: Effective Specialization (e.g., A09, EER 0.04%, C=1.56), Ineffective Consensus (e.g., A08, EER 3.14%, C=0.33), and crucially Flawed Specialization where model places high confidence in incorrect branch, leading to severe performance degradation for attacks A17 and A18 (EER 14.26% and 28.63%).
Conclusion: Quantitative findings link internal architectural strategy directly to empirical reliability, highlighting specific structural dependencies that standard performance metrics overlook. The framework provides insights into how multi-branch architectures operate under different attack scenarios.
Abstract: Multi-branch deep neural networks like AASIST3 achieve state-of-the-art comparable performance in audio anti-spoofing, yet their internal decision dynamics remain opaque compared to traditional input-level saliency methods. While existing interpretability efforts largely focus on visualizing input artifacts, the way individual architectural branches cooperate or compete under different spoofing attacks is not well characterized. This paper develops a framework for interpreting AASIST3 at the component level. Intermediate activations from fourteen branches and global attention modules are modeled with covariance operators whose leading eigenvalues form low-dimensional spectral signatures. These signatures train a CatBoost meta-classifier to generate TreeSHAP-based branch attributions, which we convert into normalized contribution shares and confidence scores (Cb) to quantify the model’s operational strategy. By analyzing 13 spoofing attacks from the ASVspoof 2019 benchmark, we identify four operational archetypes-ranging from Effective Specialization (e.g., A09, Equal Error Rate (EER) 0.04%, C=1.56) to Ineffective Consensus (e.g., A08, EER 3.14%, C=0.33). Crucially, our analysis exposes a Flawed Specialization mode where the model places high confidence in an incorrect branch, leading to severe performance degradation for attacks A17 and A18 (EER 14.26% and 28.63%, respectively). These quantitative findings link internal architectural strategy directly to empirical reliability, highlighting specific structural dependencies that standard performance metrics overlook.
cs.LG
[1137] Transit Network Design with Two-Level Demand Uncertainties: A Machine Learning and Contextual Stochastic Optimization Framework
Hongzhao Guan, Beste Basciftci, Pascal Van Hentenryck
Main category: cs.LG
TL;DR: A new transit network design framework (2LRC-TND) that uses machine learning and contextual stochastic optimization to incorporate two layers of demand uncertainties, addressing limitations of traditional fixed-demand models.
Details
Motivation: Traditional transit network design relies on fixed demand assumptions, which are unrealistic. The paper aims to address this limitation by incorporating demand uncertainties and contextual information into the design process.Method: Proposes a Two-Level Rider Choice Transit Network Design framework using machine learning models for travel mode choice prediction, integrated with contextual stochastic optimization solved via constraint programming SAT solver.
Result: Evaluated on Atlanta metropolitan area with 6,600+ travel arcs and 38,000+ trips, demonstrating effectiveness in designing transit networks that account for demand uncertainties and contextual information.
Conclusion: 2LRC-TND offers a more realistic alternative to fixed-demand models by successfully incorporating two layers of demand uncertainties through machine learning and optimization techniques.
Abstract: Transit Network Design is a well-studied problem in the field of transportation, typically addressed by solving optimization models under fixed demand assumptions. Considering the limitations of these assumptions, this paper proposes a new framework, namely the Two-Level Rider Choice Transit Network Design (2LRC-TND), that leverages machine learning and contextual stochastic optimization (CSO) through constraint programming (CP) to incorporate two layers of demand uncertainties into the network design process. The first level identifies travelers who rely on public transit (core demand), while the second level captures the conditional adoption behavior of those who do not (latent demand), based on the availability and quality of transit services. To capture these two types of uncertainties, 2LRC-TND relies on two travel mode choice models, that use multiple machine learning models. To design a network, 2LRC-TND integrates the resulting choice models into a CSO that is solved using a CP-SAT solver. 2LRC-TND is evaluated through a case study involving over 6,600 travel arcs and more than 38,000 trips in the Atlanta metropolitan area. The computational results demonstrate the effectiveness of the 2LRC-TND in designing transit networks that account for demand uncertainties and contextual information, offering a more realistic alternative to fixed-demand models.
[1138] StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser
Jintao Zhang, Zirui Liu, Mingyue Cheng, Xianquan Wang, Zhiding Liu, Qi Liu
Main category: cs.LG
TL;DR: StaTS is a diffusion model for probabilistic time series forecasting that learns adaptive noise schedules with spectral regularization and uses frequency-guided denoising for improved structure recovery.
Details
Motivation: Fixed noise schedules in diffusion models for time series forecasting produce hard-to-invert intermediate states and terminal states that deviate from near-noise assumptions. Existing methods rely on time domain conditioning and fail to model schedule-induced spectral degradation, limiting structure recovery across noise levels.Method: StaTS learns noise schedules and denoisers through alternating updates: 1) Spectral Trajectory Scheduler (STS) learns data-adaptive noise schedules with spectral regularization for structural preservation and stepwise invertibility, 2) Frequency Guided Denoiser (FGD) estimates schedule-induced spectral distortion and modulates denoising strength for heterogeneous restoration across diffusion steps and variables, using a two-stage training procedure to stabilize schedule learning and denoiser optimization.
Result: Experiments on multiple real-world benchmarks show consistent performance gains while maintaining strong performance with fewer sampling steps.
Conclusion: StaTS improves probabilistic time series forecasting by addressing limitations of fixed noise schedules through adaptive schedule learning with spectral regularization and frequency-guided denoising.
Abstract: Diffusion models have been used for probabilistic time series forecasting and show strong potential. However, fixed noise schedules often produce intermediate states that are hard to invert and a terminal state that deviates from the near noise assumption. Meanwhile, prior methods rely on time domain conditioning and seldom model schedule induced spectral degradation, which limits structure recovery across noise levels. We propose StaTS, a diffusion model for probabilistic time series forecasting that learns the noise schedule and the denoiser through alternating updates. StaTS includes Spectral Trajectory Scheduler (STS) that learns a data adaptive noise schedule with spectral regularization to improve structural preservation and stepwise invertibility, and Frequency Guided Denoiser (FGD) that estimates schedule induced spectral distortion and uses it to modulate denoising strength for heterogeneous restoration across diffusion steps and variables. A two stage training procedure stabilizes the coupling between schedule learning and denoiser optimization. Experiments on multiple real world benchmarks show consistent gains, while maintaining strong performance with fewer sampling steps. Our code is available at https://github.com/zjt-gpu/StaTS/.
[1139] CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala
Main category: cs.LG
TL;DR: CARE is a confounder-aware aggregation framework for LLM-as-a-judge ensembles that models correlated errors from shared latent confounders to improve evaluation accuracy.
Details
Motivation: Standard LLM-as-a-judge aggregation methods assume independent judge estimates, but in practice LLM judges exhibit correlated errors from shared confounders like verbosity preferences, stylistic biases, or training artifacts, causing standard aggregation to amplify systematic mistakes.Method: CARE explicitly models LLM judge scores as arising from both latent true-quality signals and shared confounding factors, separating quality from confounders without ground-truth labels, with theoretical guarantees for identifiability and finite-sample recovery under shared confounders.
Result: Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%.
Conclusion: CARE addresses the fundamental flaw in LLM-as-a-judge aggregation by modeling correlated errors from shared confounders, providing a theoretically grounded framework that significantly improves evaluation accuracy across diverse settings.
Abstract: LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders – such as verbosity, stylistic preferences, or training artifacts – causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. Code is released in \href{https://github.com/SprocketLab/CARE}{https://github.com/SprocketLab/CARE}.
[1140] Attn-QAT: 4-Bit Attention With Quantization-Aware Training
Peiyuan Zhang, Matthew Noto, Wenxuan Tan, Chengquan Jiang, Will Lin, Wei Zhou, Hao Zhang
Main category: cs.LG
TL;DR: First systematic study of 4-bit quantization-aware training for attention, addressing FP4’s limited dynamic range and attention’s heavy-tailed activations to enable end-to-end FP4 computation on emerging GPUs.
Details
Motivation: FP4-capable GPUs require reliable 4-bit attention for end-to-end FP4 computation, but attention remains problematic due to FP4's tiny dynamic range and attention's heavy-tailed activations. Current "drop-in" QAT approaches cause training instability.Method: Proposes Attn-QAT with two key principles: 1) matching low-precision recomputation of attention scores in backward pass, and 2) resolving implicit precision assumptions in Flash Attention’s gradient calculation. Implements fused Triton kernels for training and FP4 inference kernels.
Result: Attn-QAT recovers quality drop from FP4 attention without explicit outlier-mitigation heuristics, delivers up to 1.5x speedup on RTX 5090 across diffusion and language models.
Conclusion: First successful systematic approach to 4-bit quantization-aware training for attention, enabling stable FP4 attention computation on emerging GPUs with significant speed improvements.
Abstract: Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4’s tiny dynamic range and attention’s heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that “drop-in” QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) resolving implicit precision assumptions in FA’s gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT recovers the quality drop from FP4 attention without explicit outlier-mitigation heuristics used in prior FP4 attention, and delivers up to a 1.5x speedup on an RTX 5090. Video demos can be found at https://drive.google.com/drive/folders/190F6xbBDUF2kGQYIcXBt3ehSYij5jlim?usp=sharing.
[1141] Econometric vs. Causal Structure-Learning for Time-Series Policy Decisions: Evidence from the UK COVID-19 Policies
Bruno Petrungaro, Anthony C. Constantinou
Main category: cs.LG
TL;DR: Comparison of econometric vs causal ML methods for time series causal discovery, with application to COVID-19 policy analysis
Details
Motivation: To understand if econometric methods offer valuable lessons for causal ML in time series analysis, and to compare their performance in recovering causal structures from temporal dataMethod: Evaluated 4 econometric methods and 11 causal ML algorithms on time series data, comparing graphical structures, model dimensionality, and causal effect recovery. Applied to UK COVID-19 case study and provided code translation to bnlearn library
Result: Econometric methods provide clear temporal structure rules but explore limited graph space, while causal ML algorithms explore broader graph space leading to denser graphs that capture more identifiable causal relationships
Conclusion: Both approaches have complementary strengths: econometrics offers structured temporal rules, while causal ML enables broader discovery; integration could benefit time series causal analysis
Abstract: Causal machine learning (ML) recovers graphical structures that inform us about potential cause-and-effect relationships. Most progress has focused on cross-sectional data with no explicit time order, whereas recovering causal structures from time series data remains the subject of ongoing research in causal ML. In addition to traditional causal ML, this study assesses econometric methods that some argue can recover causal structures from time series data. The use of these methods can be explained by the significant attention the field of econometrics has given to causality, and specifically to time series, over the years. This presents the possibility of comparing the causal discovery performance between econometric and traditional causal ML algorithms. We seek to understand if there are lessons to be incorporated into causal ML from econometrics, and provide code to translate the results of these econometric methods to the most widely used Bayesian Network R library, bnlearn. We investigate the benefits and challenges that these algorithms present in supporting policy decision-making, using the real-world case of COVID-19 in the UK as an example. Four econometric methods are evaluated in terms of graphical structure, model dimensionality, and their ability to recover causal effects, and these results are compared with those of eleven causal ML algorithms. Amongst our main results, we see that econometric methods provide clear rules for temporal structures, whereas causal-ML algorithms offer broader discovery by exploring a larger space of graph structures that tends to lead to denser graphs that capture more identifiable causal relationships.
[1142] Maximizing the Spectral Energy Gain in Sub-1-Bit LLMs via Latent Geometry Alignment
Banseok Lee, Youngmin Kim
Main category: cs.LG
TL;DR: LittleBit-2 achieves state-of-the-art sub-1-bit compression (0.1-1 bpp) for LLMs by addressing Latent Geometry Misalignment through Internal Latent Rotation and Joint Iterative Quantization, outperforming prior binary methods.
Details
Motivation: The paper identifies Spectral Energy Gain in extreme model compression where low-rank binary approximations should outperform tiny-rank floating-point baselines, but prior attempts fail due to Latent Geometry Misalignment - standard singular vectors have high coherence, which is worst-case for binary quantization.Method: Proposes LittleBit-2 framework with two key components: 1) Internal Latent Rotation as a geometric preconditioner to align coherent latent distributions with binary hypercube, and 2) Joint Iterative Quantization (Joint-ITQ) for optimization. The approach has zero inference overhead.
Result: Establishes new state-of-the-art in sub-1-bit regime (1~0.1 bpp) on Llama-2 and Llama-3, matching fidelity of leading 1-bit baselines while achieving extreme compression.
Conclusion: The paper demonstrates that addressing geometric misalignment enables realization of theoretical Spectral Energy Gain in extreme compression, achieving practical sub-1-bit quantization for large language models without inference overhead.
Abstract: We identify the Spectral Energy Gain in extreme model compression, where low-rank binary approximations outperform tiny-rank floating-point baselines for heavy-tailed spectra. However, prior attempts fail to realize this potential, trailing state-of-the-art 1-bit methods. We attribute this degradation to Latent Geometry Misalignment: standard singular vectors exhibit high coherence (spiky distribution), the worst-case geometry for binary quantization. To realize this gain, we propose LittleBit-2, a framework employing Internal Latent Rotation and Joint Iterative Quantization (Joint-ITQ). This approach acts as a geometric preconditioner, aligning coherent latent distributions with the binary hypercube with zero inference overhead. Empirically, LittleBit-2 establishes a new state-of-the-art in the sub-1-bit regime (1$\sim$0.1 bpp) on Llama-2 and Llama-3, matching the fidelity of leading 1-bit baselines.
[1143] Reinforcement Learning for Control with Probabilistic Stability Guarantee: A Finite-Sample Approach
Minghao Han, Lixian Zhang, Chenliang Liu, Zhipeng Zhou, Jun Wang, Wei Pan
Main category: cs.LG
TL;DR: Novel RL approach with probabilistic stability guarantees using finite data via Lyapunov methods, enabling stability analysis in model-free control
Details
Motivation: Bridge the gap between reinforcement learning and control theory by providing stability guarantees for RL-based control systems using only finite data, addressing the challenge of ensuring stability in model-free RL approachesMethod: Leverages Lyapunov’s method to develop a probabilistic stability theorem that ensures mean square stability using finite sampled trajectories. Derives a policy gradient theorem for stabilizing policy learning and develops L-REINFORCE algorithm extending classical REINFORCE to stabilization problems
Result: L-REINFORCE outperforms baseline methods on Cartpole task in ensuring stability. Probability of stability increases with number and length of trajectories, converging to certainty as data size grows
Conclusion: Successfully bridges RL and control theory by enabling stability analysis and controller design in model-free framework with finite data, providing probabilistic stability guarantees for RL-based control systems
Abstract: This paper presents a novel approach to reinforcement learning (RL) for control systems that provides probabilistic stability guarantees using finite data. Leveraging Lyapunov’s method, we propose a probabilistic stability theorem that ensures mean square stability using only a finite number of sampled trajectories. The probability of stability increases with the number and length of trajectories, converging to certainty as data size grows. Additionally, we derive a policy gradient theorem for stabilizing policy learning and develop an RL algorithm, L-REINFORCE, that extends the classical REINFORCE algorithm to stabilization problems. The effectiveness of L-REINFORCE is demonstrated through simulations on a Cartpole task, where it outperforms the baseline in ensuring stability. This work bridges a critical gap between RL and control theory, enabling stability analysis and controller design in a model-free framework with finite data.
[1144] StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks
Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed
Main category: cs.LG
TL;DR: StethoLM is the first audio-language model specialized for cardiopulmonary auscultation, capable of performing various clinical tasks through instruction-driven analysis of heart and lung sounds.
Details
Motivation: Auscultation (listening to heart and lung sounds) is fundamental in clinical examination but requires years of experience. Existing deep learning methods are limited to simple classification with poor clinical interpretability and decision support.Method: StethoLM integrates audio encoding with a medical language model backbone, trained on StethoBench (77,027 instruction-response pairs from 16,125 labeled cardiopulmonary recordings). Uses multi-stage training combining supervised fine-tuning and direct preference optimization.
Result: Achieves substantial gains in performance and robustness on out-of-distribution data, establishing a foundation for instruction-following AI systems in clinical auscultation.
Conclusion: StethoLM represents a significant advancement in automated cardiopulmonary sound analysis, moving beyond simple classification to comprehensive clinical decision support with interpretability.
Abstract: Listening to heart and lung sounds - auscultation - is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio-language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction-response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.
[1145] Property-Driven Evaluation of GNN Expressiveness at Scale: Datasets, Framework, and Study
Sicong Che, Jiayi Yang, Sarfraz Khurshid, Wenxi Wang
Main category: cs.LG
TL;DR: A software engineering approach to evaluating Graph Neural Networks (GNNs) using formal specification and systematic benchmarking to assess expressiveness in capturing fundamental graph properties.
Details
Motivation: Advancing trustworthy AI requires principled software engineering approaches to model evaluation. While GNNs have achieved success in processing graph-structured data, their expressiveness in capturing fundamental graph properties remains an open challenge that needs systematic evaluation.Method: Developed a property-driven evaluation methodology using Alloy (software specification language) to create configurable graph dataset generators. Produced two dataset families: GraphRandom (diverse graphs satisfying/violating properties) and GraphPerturb (controlled structural variations). Created 336 new datasets with at least 10,000 labeled graphs covering 16 fundamental graph properties. Proposed evaluation framework assessing GNN expressiveness across generalizability, sensitivity, and robustness with novel quantitative metrics.
Result: Conducted first comprehensive study on global pooling methods’ impact on GNN expressiveness. Found distinct trade-offs: attention-based pooling excels in generalization and robustness, while second-order pooling provides superior sensitivity. No single approach consistently performs well across all properties, revealing fundamental limitations.
Conclusion: The work establishes a principled foundation for developing expressive and reliable GNN architectures by embedding software engineering rigor into AI evaluation. Highlights open research directions including adaptive property-aware pooling, scale-sensitive architectures, and robustness-oriented training.
Abstract: Advancing trustworthy AI requires principled software engineering approaches to model evaluation. Graph Neural Networks (GNNs) have achieved remarkable success in processing graph-structured data, however, their expressiveness in capturing fundamental graph properties remains an open challenge. We address this by developing a property-driven evaluation methodology grounded in formal specification, systematic evaluation, and empirical study. Leveraging Alloy, a software specification language and analyzer, we introduce a configurable graph dataset generator that produces two dataset families: GraphRandom, containing diverse graphs that either satisfy or violate specific properties, and GraphPerturb, introducing controlled structural variations. Together, these benchmarks encompass 336 new datasets, each with at least 10,000 labeled graphs, covering 16 fundamental graph properties critical to distributed systems, knowledge graphs, and biological networks. We propose a general evaluation framework that assesses three key aspects of GNN expressiveness: generalizability, sensitivity, and robustness, with two novel quantitative metrics. Using this framework, we conduct the first comprehensive study on global pooling methods’ impact on GNN expressiveness. Our findings reveal distinct trade-offs: attention-based pooling excels in generalization and robustness, while second-order pooling provides superior sensitivity, but no single approach consistently performs well across all properties. These insights highlight fundamental limitations and open research directions including adaptive property-aware pooling, scale-sensitive architectures, and robustness-oriented training. By embedding software engineering rigor into AI evaluation, this work establishes a principled foundation for developing expressive and reliable GNN architectures.
[1146] Breaking the Factorization Barrier in Diffusion Language Models
Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Main category: cs.LG
TL;DR: CoDD proposes a hybrid framework that breaks the factorization barrier in diffusion language models by replacing fully-factorized output distributions with a lightweight probabilistic inference layer, enabling modeling of complex joint dependencies without parameter explosion.
Details
Motivation: Diffusion language models face a fundamental trade-off between generation speed and coherence due to the "factorization barrier" - the assumption that simultaneously predicted tokens are independent. This forces models to either generate sequentially (slow) or produce incoherent outputs (fast). The authors argue this isn't a backbone expressivity issue but a structural misspecification problem.Method: CoDD (Coupled Discrete Diffusion) replaces the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This creates a distribution family that is more expressive than standard factorized priors but remains compact enough to avoid the parameter explosion of full joint modeling. The approach seamlessly enhances diverse diffusion language model architectures with minimal overhead.
Result: CoDD matches the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. It prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies. The method works with various diffusion language model architectures with negligible overhead.
Conclusion: CoDD successfully breaks the factorization barrier in diffusion language models by introducing a hybrid framework that models complex joint dependencies without prohibitive parameter growth. This enables both high-quality generation and efficient parallel generation, addressing the fundamental speed-coherence trade-off in diffusion language models.
Abstract: Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the “factorization barrier”: the assumption that simultaneously predicted tokens are independent. This limitation forces a trade-off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies. Code available at: https://github.com/liuanji/CoDD
[1147] REMIND: Rethinking Medical High-Modality Learning under Missingness–A Long-Tailed Distribution Perspective
Chenwei Wu, Zitao Shuai, Liyue Shen
Main category: cs.LG
TL;DR: REMIND addresses high-modality learning under missing data by tackling long-tailed modality combinations through group-specialized Mixture-of-Experts and distributionally robust optimization.
Details
Motivation: In medical multi-modal learning, obtaining full-modality observations for every patient is often impractical due to data collection constraints, leading to 'High-Modality Learning under Missingness'. This creates an exponential growth in possible modality combinations with long-tail distributions, causing significant underperformance on tail modality combination groups.Method: Proposes REMIND framework with: 1) group-specialized Mixture-of-Experts architecture that scalably learns group-specific multi-modal fusion functions for arbitrary modality combinations, and 2) group distributionally robust optimization strategy to upweight underrepresented modality combinations.
Result: Extensive experiments on real-world medical datasets show that REMIND consistently outperforms state-of-the-art methods and robustly generalizes across various medical multi-modal learning applications under high-modality missingness.
Conclusion: The paper successfully addresses the critical problem of long-tailed modality combinations in high-modality learning under missingness, providing a unified framework that improves performance on underrepresented modality groups through specialized architectures and optimization strategies.
Abstract: Medical multi-modal learning is critical for integrating information from a large set of diverse modalities. However, when leveraging a high number of modalities in real clinical applications, it is often impractical to obtain full-modality observations for every patient due to data collection constraints, a problem we refer to as ‘High-Modality Learning under Missingness’. In this study, we identify that such missingness inherently induces an exponential growth in possible modality combinations, followed by long-tail distributions of modality combinations due to varying modality availability. While prior work overlooked this critical phenomenon, we find this long-tailed distribution leads to significant underperformance on tail modality combination groups. Our empirical analysis attributes this problem to two fundamental issues: 1) gradient inconsistency, where tail groups’ gradient updates diverge from the overall optimization direction; 2) concept shifts, where each modality combination requires distinct fusion functions. To address these challenges, we propose REMIND, a unified framework that REthinks MultImodal learNing under high-moDality missingness from a long-tail perspective. Our core idea is to propose a novel group-specialized Mixture-of-Experts architecture that scalably learns group-specific multi-modal fusion functions for arbitrary modality combinations, while simultaneously leveraging a group distributionally robust optimization strategy to upweight underrepresented modality combinations. Extensive experiments on real-world medical datasets show that our framework consistently outperforms state-of-the-art methods, and robustly generalizes across various medical multi-modal learning applications under high-modality missingness.
[1148] BiJEPA: Bi-directional Joint Embedding Predictive Architecture for Symmetric Representation Learning
Yongchao Huang
Main category: cs.LG
TL;DR: BiJEPA introduces a bi-directional joint embedding predictive architecture with cycle-consistent predictability between data segments, addressing representation explosion through norm regularization, and demonstrates effectiveness across synthetic signals, chaotic systems, and image data.
Details
Motivation: Standard JEPA models use uni-directional prediction (Context→Target), which neglects informative signals in inverse relationships and degrades performance. The authors aim to create a more holistic representation learning approach by incorporating bi-directional predictability.Method: Proposes BiJEPA (Bi-Directional Joint Embedding Predictive Architecture) that enforces cycle-consistent predictability between data segments. Addresses instability of symmetric prediction (representation explosion) by introducing norm regularization on representation vectors.
Result: BiJEPA achieves stable convergence without collapse, captures semantic structure of chaotic systems, and learns robust temporal and spatial representations capable of generation and generalization across three modalities: synthetic periodic signals, chaotic Lorenz attractor trajectories, and MNIST image data.
Conclusion: BiJEPA offers a more holistic approach to representation learning by incorporating bi-directional predictability with stability mechanisms, demonstrating effectiveness across diverse data modalities including time-series and image data.
Abstract: Self-Supervised Learning (SSL) has shifted from pixel-level reconstruction to latent space prediction, spearheaded by the Joint Embedding Predictive Architecture (JEPA). While effective, standard JEPA models typically rely on a uni-directional prediction mechanism (e.g. Context $\to$ Target), potentially neglecting the informative signal inherent in the inverse relationship, degrading its performance. In this work, we propose \textbf{BiJEPA}, a \textit{Bi-Directional Joint Embedding Predictive Architecture} that enforces cycle-consistent predictability between data segments. We address the inherent instability of symmetric prediction (representation explosion) by introducing a critical norm regularization mechanism on the representation vectors. We evaluate BiJEPA on three distinct modalities: synthetic periodic signals, chaotic Lorenz attractor trajectories, and high-dimensional image data (MNIST). Our results demonstrate that BiJEPA achieves stable convergence without collapse, captures the semantic structure of chaotic systems, and learns robust temporal and spatial representations capable of generation and generalisation, offering a more holistic approach to representation learning.
[1149] Knowledge-guided generative surrogate modeling for high-dimensional design optimization under scarce data
Bingran Wang, Seongha Jeong, Sebastiaan P. C. van Schie, Dongyeon Han, Jaeho Min, John T. Hwang
Main category: cs.LG
TL;DR: RBF-Gen is a knowledge-guided surrogate modeling framework that combines scarce data with domain expertise using radial basis functions and a generator network to encode structural relationships for improved predictive accuracy in mechanical design and manufacturing optimization.
Details
Motivation: Surrogate models in mechanical design and manufacturing optimization face limitations due to data scarcity, while subject matter experts possess valuable domain knowledge about functional relationships that current methods cannot systematically integrate with limited data.Method: RBF-Gen constructs a radial basis function (RBF) space with more centers than training samples and uses a generator network to leverage the null space, inspired by maximum information preservation. Latent variables encode structural relationships and distributional priors during training to guide surrogates toward physically meaningful solutions.
Result: RBF-Gen significantly outperforms standard RBF surrogates on 1D and 2D structural optimization problems in data-scarce settings, and achieves superior predictive accuracy on a real-world semiconductor manufacturing dataset.
Conclusion: The framework demonstrates the potential of combining limited experimental data with domain expertise to enable accurate and practical surrogate modeling in mechanical and process design problems.
Abstract: Surrogate models are widely used in mechanical design and manufacturing process optimization, where high-fidelity computational models may be unavailable or prohibitively expensive. Their effectiveness, however, is often limited by data scarcity, as purely data-driven surrogates struggle to achieve high predictive accuracy in such situations. Subject matter experts (SMEs) frequently possess valuable domain knowledge about functional relationships, yet few surrogate modeling techniques can systematically integrate this information with limited data. We address this challenge with RBF-Gen, a knowledge-guided surrogate modeling framework that combines scarce data with domain knowledge. This method constructs a radial basis function (RBF) space with more centers than training samples and leverages the null space via a generator network, inspired by the principle of maximum information preservation. The introduced latent variables provide a principled mechanism to encode structural relationships and distributional priors during training, thereby guiding the surrogate toward physically meaningful solutions. Numerical studies demonstrate that RBF-Gen significantly outperforms standard RBF surrogates on 1D and 2D structural optimization problems in data-scarce settings, and achieves superior predictive accuracy on a real-world semiconductor manufacturing dataset. These results highlight the potential of combining limited experimental data with domain expertise to enable accurate and practical surrogate modeling in mechanical and process design problems.
[1150] Mag-Mamba: Modeling Coupled spatiotemporal Asymmetry for POI Recommendation
Zhuoxuan Li, Tangwei Ye, Jieyuan Pei, Haina Liang, Zhongyuan Lai, Zihan Liu, Yiming Wu, Qi Zhang, Liang Hu
Main category: cs.LG
TL;DR: Mag-Mamba models spatiotemporal asymmetry in POI recommendation using complex domain rotational dynamics with time-conditioned magnetic phases and complex-valued Mamba modules.
Details
Motivation: Next POI recommendation faces fundamental challenge of coupled spatiotemporal asymmetry in urban mobility, where transition intents between locations exhibit high asymmetry and are dynamically conditioned on time. Existing methods struggle to unify modeling of time-varying global directionality.Method: Proposes Mag-Mamba framework modeling spatiotemporal asymmetry as phase-driven rotational dynamics in complex domain. Includes: 1) Time-conditioned Magnetic Phase Encoder constructing time-conditioned Magnetic Laplacian on geographic adjacency graph using edge phase differences; 2) Complex-valued Mamba module generalizing scalar state decay into joint decay-rotation dynamics modulated by time intervals and magnetic geographic priors.
Result: Extensive experiments on three real-world datasets demonstrate Mag-Mamba achieves significant performance improvements over state-of-the-art baselines.
Conclusion: Mag-Mamba effectively addresses spatiotemporal asymmetry in POI recommendation through complex domain rotational dynamics modeling, outperforming existing methods.
Abstract: Next Point-of-Interest (POI) recommendation is a critical task in location-based services, yet it faces the fundamental challenge of coupled spatiotemporal asymmetry inherent in urban mobility. Specifically, transition intents between locations exhibit high asymmetry and are dynamically conditioned on time. Existing methods, typically built on graph or sequence backbones, rely on symmetric operators or real-valued aggregations, struggling to unify the modeling of time-varying global directionality. To address this limitation, we propose Mag-Mamba, a framework whose core insight lies in modeling spatiotemporal asymmetry as phase-driven rotational dynamics in the complex domain. Based on this, we first devise a Time-conditioned Magnetic Phase Encoder that constructs a time-conditioned Magnetic Laplacian on the geographic adjacency graph, utilizing edge phase differences to characterize the globally evolving spatial directionality. Subsequently, we introduce a Complex-valued Mamba module that generalizes traditional scalar state decay into joint decay-rotation dynamics, explicitly modulated by both time intervals and magnetic geographic priors. Extensive experiments on three real-world datasets demonstrate that Mag-Mamba achieves significant performance improvements over state-of-the-art baselines.
[1151] Deep Learning-Based Meat Freshness Detection with Segmentation and OOD-Aware Classification
Hutama Arif Bramantyo, Mukarram Ali Faridi, Rui Chen, Clarissa Harris, Yin Sun
Main category: cs.LG
TL;DR: Meat freshness classification framework using RGB images with segmentation preprocessing and OOD-aware abstention mechanism for both packaged and unpackaged meat datasets.
Details
Motivation: To develop a practical meat freshness classification system that works for both packaged and unpackaged meat, handles background variability through segmentation, and includes OOD detection for low-confidence samples to improve real-world reliability.Method: Two-stage pipeline: 1) U-Net-based segmentation to isolate meat regions and reduce background, 2) Classification using five backbone architectures (ResNet-50, ViT-B/16, Swin-T, EfficientNet-B0, MobileNetV3-Small) with nested 5x3 cross-validation. Includes OOD-aware abstention mechanism to flag uncertain samples.
Result: Segmentation achieved 75% IoU and 82% Dice coefficient. EfficientNet-B0 achieved highest classification accuracy (98.10%), followed by ResNet-50 and MobileNetV3-Small (97.63%). OOD scoring and thresholding evaluated, with on-device latency measured using TFLite for deployment considerations.
Conclusion: The framework provides effective meat freshness classification with practical deployment considerations, showing EfficientNet-B0 as the best performer among tested architectures while maintaining segmentation preprocessing for consistent inputs and OOD handling for reliability.
Abstract: In this study, we present a meat freshness classification framework from Red-Green-Blue (RGB) images that supports both packaged and unpackaged meat datasets. The system classifies four in-distribution (ID) meat classes and uses an out-of-distribution (OOD)-aware abstention mechanism that flags low-confidence samples as No Result. The pipeline combines U-Net-based segmentation with deep feature classifiers. Segmentation is used as a preprocessing step to isolate the meat region and reduce background, producing more consistent inputs for classification. The segmentation module achieved an Intersection over Union (IoU) of 75% and a Dice coefficient of 82%, producing standardized inputs for the classification stage. For classification, we benchmark five backbones: Residual Network-50 (ResNet-50), Vision Transformer-Base/16 (ViT-B/16), Swin Transformer-Tiny (Swin-T), EfficientNet-B0, and MobileNetV3-Small. We use nested 5x3 cross-validation (CV) for model selection and hyperparameter tuning. On the held-out ID test set, EfficientNet-B0 achieves the highest accuracy (98.10%), followed by ResNet-50 and MobileNetV3-Small (both 97.63%) and Swin-T (97.51%), while ViT-B/16 is lower (94.42%). We additionally evaluate OOD scoring and thresholding using standard OOD metrics and sensitivity analysis over the abstention threshold. Finally, we report on-device latency using TensorFlow Lite (TFLite) on a smartphone, highlighting practical accuracy-latency trade-offs for future deployment.
[1152] Expert Divergence Learning for MoE-based Language Models
Jiaang Li, Haibin Chen, Langming Liu, Yujin Yuan, Yadao Wang, Yizhen Zhang, Chengting Yu, Xin Tong, Weidong Zhang, Shilei Liu, Wenbo Su, Bo Zheng
Main category: cs.LG
TL;DR: Expert Divergence Learning: A pre-training strategy for MoE models that uses domain labels and Jensen-Shannon Divergence to encourage functional specialization among experts, reducing homogenization and improving performance.
Details
Motivation: Mixture-of-Experts (MoE) architectures often suffer from expert homogenization where experts learn redundant functionalities, limiting the potential benefits of specialized routing and computation.Method: Introduces Expert Divergence Learning with a label-driven auxiliary loss that leverages domain labels in pre-training corpora to maximize Jensen-Shannon Divergence between expert routing distributions of different domains, encouraging domain-specific specialization.
Result: Models trained with Expert Divergence Learning achieve lower language modeling loss and show significant performance improvements across diverse downstream benchmarks, with effective mitigation of expert homogenization and greater functional specialization.
Conclusion: The proposed method successfully addresses expert homogenization in MoE models, leading to better specialization and performance with negligible computational overhead during training.
Abstract: The Mixture-of-Experts (MoE) architecture is a powerful technique for scaling language models, yet it often suffers from expert homogenization, where experts learn redundant functionalities, thereby limiting MoE’s full potential. To address this, we introduce Expert Divergence Learning, a novel pre-training strategy that explicitly encourages functional specialization among experts. Our method incorporates a label-driven auxiliary loss that leverages domain labels inherent in pre-training corpora to maximize the Jensen-Shannon Divergence between the expert routing distributions of different data domains. This optimization objective guides the model to develop diverged routing policies for varied domains and closer routing policies for the same domain, which leads to emergent and organized expert specialization. We validate our approach by pre-training MoE models of up to 15 billion parameters from scratch. Experimental results demonstrate that models trained with Expert Divergence Learning not only achieve a lower language modeling loss but also exhibit significant performance improvements across a diverse range of downstream benchmarks. Further analysis confirms that our method effectively mitigates expert homogenization and brings greater functional specialization, all with negligible computational overhead during training.
[1153] M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection
Chao Huang, Yanhui Li, Yunkang Cao, Wei Wang, Hongxi Huang, Jie Wen, Wenqi Ren, Xiaochun Cao
Main category: cs.LG
TL;DR: M3-AD is a reflection-aware multimodal framework for industrial anomaly detection that introduces two complementary datasets and a RA-Monitor module to enable self-corrective decision-making in complex industrial scenarios.
Details
Motivation: Current multimodal LLMs for industrial anomaly detection produce unreliable high-confidence decisions in fine-grained, complex scenarios and lack effective self-correction mechanisms, necessitating a more robust framework.Method: Proposes M3-AD framework with: 1) M3-AD-FT dataset for reflection-aligned fine-tuning, 2) M3-AD-Bench for systematic cross-category evaluation, and 3) RA-Monitor that models reflection as learnable decision revision to guide controlled self-correction.
Result: RA-Monitor outperforms multiple open-source and commercial MLLMs in zero-shot anomaly detection and anomaly analysis tasks on the M3-AD-Bench evaluation dataset.
Conclusion: M3-AD provides a unified reflection-aware framework that improves decision robustness in industrial anomaly detection through self-corrective mechanisms and systematic evaluation resources.
Abstract: Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero-shot paradigm, they still tend to produce high-confidence yet unreliable decisions in fine-grained and structurally complex industrial scenarios, and lack effective self-corrective mechanisms. To address this issue, we propose M3-AD, a unified reflection-aware multimodal framework for industrial anomaly detection. M3-AD comprises two complementary data resources: M3-AD-FT, designed for reflection-aligned fine-tuning, and M3-AD-Bench, designed for systematic cross-category evaluation, together providing a foundation for reflection-aware learning and reliability assessment. Building upon this foundation, we propose RA-Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self-correction when initial judgments are unreliable, thereby improving decision robustness. Extensive experiments conducted on M3-AD-Bench demonstrate that RA-Monitor outperforms multiple open-source and commercial MLLMs in zero-shot anomaly detection and anomaly analysis tasks. Code will be released at https://github.com/Yanhui-Lee/M3-AD.
[1154] A Representation-Consistent Gated Recurrent Framework for Robust Medical Time-Series Classification
Maitri Krishna Sai
Main category: cs.LG
TL;DR: Proposes RC-GRF, a representation-consistent gated recurrent framework with temporal consistency regularization for medical time-series data to improve stability and robustness against noise and missing values.
Details
Motivation: Medical time-series data has irregular sampling, noise, missing values, and inter-feature dependencies. Standard gated RNNs (LSTM/GRU) don't explicitly constrain latent representation evolution, leading to representation drift and instability under noisy/incomplete inputs.Method: RC-GRF introduces principled regularization to enforce temporal consistency in hidden-state representations. It’s model-agnostic and integrates into existing gated recurrent architectures without modifying internal gating mechanisms. Includes theoretical analysis showing how consistency constraint bounds hidden-state divergence.
Result: Extensive experiments on medical time-series classification benchmarks show improved robustness, reduced variance, and enhanced generalization performance, particularly in noisy and low-sample settings.
Conclusion: The proposed representation-consistent framework effectively addresses stability issues in medical time-series modeling by enforcing temporal consistency in latent representations, leading to more robust and generalizable models.
Abstract: Medical time-series data are characterized by irregular sampling, high noise levels, missing values, and strong inter-feature dependencies. Recurrent neural networks (RNNs), particularly gated architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are widely used for modeling such data due to their ability to capture temporal dependencies. However, standard gated recurrent models do not explicitly constrain the evolution of latent representations over time, leading to representation drift and instability under noisy or incomplete inputs. In this work, we propose a representation-consistent gated recurrent framework (RC-GRF) that introduces a principled regularization strategy to enforce temporal consistency in hidden-state representations. The proposed framework is model-agnostic and can be integrated into existing gated recurrent architectures without modifying their internal gating mechanisms. We provide a theoretical analysis demonstrating how the consistency constraint bounds hidden-state divergence and improves stability. Extensive experiments on medical time-series classification benchmarks show that the proposed approach improves robustness, reduces variance, and enhances generalization performance, particularly in noisy and low-sample settings.
[1155] Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems
Datorien L. Anderson
Main category: cs.LG
TL;DR: The paper introduces the Certainty-Validity Framework to evaluate ML models, revealing Confident-Incorrect behavior where models hallucinate on ambiguous data, and shows discrete commitment systems have an 83% ambiguity ceiling where they appropriately refuse to commit to ambiguous samples.
Details
Motivation: Standard ML evaluation metrics assume all errors are equivalent, but this is flawed for discrete commitment systems. The paper aims to reveal hidden failure modes where models make confident incorrect predictions on ambiguous data, and to understand when models appropriately refuse to commit versus when they hallucinate.Method: Introduces the Certainty-Validity (CVS) Framework that decomposes model performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions. Conducts ablation experiments on Fashion-MNIST, EMNIST, and IMDB datasets to analyze the “83% Ambiguity Ceiling” phenomenon in discrete commitment architectures.
Result: Reveals Confident-Incorrect behavior where models hallucinate structure in ambiguous data. Shows discrete commitment systems plateau at an 83% ambiguity ceiling, refusing to commit to ambiguous samples where structural evidence ends. Demonstrates that standard training on ambiguous data causes pathological migration from appropriate doubt to hallucination.
Conclusion: For reasoning systems, “good training” should be defined by maximizing the Certainty-Validity Score rather than accuracy, ensuring models know where to stop and avoid hallucinating on ambiguous data. The refusal to commit to ambiguous samples is a feature, not a failure, of discrete commitment architectures.
Abstract: Standard evaluation metrics for machine learning – accuracy, precision, recall, and AUROC – assume that all errors are equivalent: a confident incorrect prediction is penalized identically to an uncertain one. For discrete commitment systems (architectures that select committed states {-W, 0, +W}), this assumption is epistemologically flawed. We introduce the Certainty-Validity (CVS) Framework, a diagnostic method that decomposes model performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions. This framework reveals a critical failure mode hidden by standard accuracy: Confident-Incorrect (CI) behavior, where models hallucinate structure in ambiguous data. Through ablation experiments on Fashion-MNIST, EMNIST, and IMDB, we analyze the “83% Ambiguity Ceiling” – a stopping point where this specific discrete architecture consistently plateaus on noisy benchmarks. Unlike continuous models that can surpass this ceiling by memorizing texture or statistical noise, the discrete model refuses to commit to ambiguous samples. We show that this refusal is not a failure but a feature: the model stops where structural evidence ends. However, standard training on ambiguous data eventually forces Benign Overfitting, causing a pathological migration from Uncertain-Incorrect (appropriate doubt) to Confident-Incorrect (hallucination). We propose that “good training” for reasoning systems must be defined not by accuracy, but by maximizing the Certainty-Validity Score (CVS) – ensuring the model knows where to stop.
[1156] SEval-NAS: A Search-Agnostic Evaluation for Neural Architecture Search
Atah Nuh Mih, Jianzhou Wang, Truong Thanh Hung Nguyen, Hung Cao
Main category: cs.LG
TL;DR: SEval-NAS: A flexible metric-evaluation mechanism for neural architecture search that converts architectures to strings, embeds them as vectors, and predicts performance metrics, particularly useful for hardware-aware NAS.
Details
Motivation: Current NAS evaluation procedures are hardcoded and limited in introducing new metrics, especially problematic for hardware-aware NAS where objectives depend on target devices like edge hardware.Method: Proposes SEval-NAS that converts neural architectures to strings, embeds them as vectors, and uses these embeddings to predict various performance metrics including accuracy, latency, and memory.
Result: Evaluated on NATS-Bench and HW-NAS-Bench, showed stronger latency and memory predictions than accuracy (Kendall’s τ correlations), suitable as hardware cost predictor. Successfully integrated into FreeREA to evaluate new metrics.
Conclusion: SEval-NAS provides a flexible evaluation mechanism for NAS that can handle new metrics with minimal algorithmic changes, particularly effective for hardware cost prediction in hardware-aware NAS.
Abstract: Neural architecture search (NAS) automates the discovery of neural networks that meet specified criteria, yet its evaluation procedures are often hardcoded, limiting the ability to introduce new metrics. This issue is especially pronounced in hardware-aware NAS, where objectives depend on target devices such as edge hardware. To address this limitation, we propose SEval-NAS, a metric-evaluation mechanism that converts architectures to strings, embeds them as vectors, and predicts performance metrics. Using NATS-Bench and HW-NAS-Bench, we evaluated accuracy, latency, and memory. Kendall’s $τ$ correlations showed stronger latency and memory predictions than accuracy, indicating the suitability of SEval-NAS as a hardware cost predictor. We further integrated SEval-NAS into FreeREA to evaluate metrics not originally included. The method successfully ranked FreeREA-generated architectures, maintained search time, and required minimal algorithmic changes. Our implementation is available at: https://github.com/Analytics-Everywhere-Lab/neural-architecture-search
[1157] Wideband Power Amplifier Behavioral Modeling Using an Amplitude Conditioned LSTM
Abdelrahman Abdelsalam, You Fei
Main category: cs.LG
TL;DR: AC-LSTM network with amplitude-dependent gating improves wideband power amplifier behavioral modeling by capturing nonlinear memory effects.
Details
Motivation: Traditional behavioral modeling approaches struggle with complex nonlinear and memory effects in wideband power amplifiers, requiring more sophisticated neural network architectures.Method: Proposes an amplitude conditioned LSTM (AC-LSTM) network with Feature-wise Linear Modulation (FiLM) layer that conditions the LSTM’s forget gate on instantaneous input amplitude to capture amplitude-dependent memory effects.
Result: AC-LSTM achieves NMSE of -41.25 dB (1.15 dB improvement over standard LSTM, 7.45 dB over ARVTDNN) and closely matches spectral characteristics with ACPR of -28.58 dB.
Conclusion: Amplitude conditioning effectively improves both time-domain accuracy and spectral fidelity in wideband PA behavioral modeling.
Abstract: Wideband power amplifiers exhibit complex nonlinear and memory effects that challenge traditional behavioral modeling approaches. This paper proposes a novel amplitude conditioned long short-term memory (AC-LSTM) network that introduces explicit amplitude-dependent gating to enhance the modeling of wideband PA dynamics. The architecture incorporates a Feature-wise Linear Modulation (FiLM) layer that conditions the LSTM’s forget gate on the instantaneous input amplitude, providing a physics-aware inductive bias for capturing amplitude-dependent memory effects. Experimental validation using a 100 MHz 5G NR signal and a GaN PA demonstrates that the proposed AC-LSTM achieves a normalized mean square error (NMSE) of -41.25 dB, representing a 1.15 dB improvement over standard LSTM and 7.45 dB improvement over augmented real-valued time-delay neural network (ARVTDNN) baselines. The model also closely matches the measured PA’s spectral characteristics with an adjacent channel power ratio (ACPR) of -28.58 dB. These results shows the effectiveness of amplitude conditioning for improving both time-domain accuracy and spectral fidelity in wide-band PA behavioral modeling.
[1158] LIDS: LLM Summary Inference Under the Layered Lens
Dylan Park, Yingying Fan, Jinchi Lv
Main category: cs.LG
TL;DR: Proposes LIDS method using BERT-SVD direction metric and SOFARI to evaluate LLM-generated summaries with interpretable keywords and statistical uncertainty quantification.
Details
Motivation: Evaluating the quality of LLM-generated summaries is challenging due to language complexity. Current methods lack interpretability and statistical uncertainty quantification for summary assessment.Method: Uses BERT embeddings with SVD-based direction metric to measure similarity between summaries and original text, with repeated prompts for uncertainty quantification. SOFARI identifies interpretable keywords for latent themes with controlled false discovery rate.
Result: LIDS provides natural embedding for text reduction and interpretable keyword identification. Empirical studies show practical utility and robustness through human verification and comparisons to other similarity metrics across different LLMs.
Conclusion: LIDS offers a novel approach for evaluating LLM-generated summaries with interpretable keyword identification and statistical uncertainty quantification, demonstrating effectiveness through comprehensive empirical validation.
Abstract: Large language models (LLMs) have gained significant attention by many researchers and practitioners in natural language processing (NLP) since the introduction of ChatGPT in 2022. One notable feature of ChatGPT is its ability to generate summaries based on prompts. Yet evaluating the quality of these summaries remains challenging due to the complexity of language. To this end, in this paper we suggest a new method of LLM summary inference with BERT-SVD-based direction metric and SOFARI (LIDS) that assesses the summary accuracy equipped with interpretable key words for layered themes. The LIDS uses a latent SVD-based direction metric to measure the similarity between the summaries and original text, leveraging the BERT embeddings and repeated prompts to quantify the statistical uncertainty. As a result, LIDS gives a natural embedding of each summary for large text reduction. We further exploit SOFARI to uncover important key words associated with each latent theme in the summary with controlled false discovery rate (FDR). Comprehensive empirical studies demonstrate the practical utility and robustness of LIDS through human verification and comparisons to other similarity metrics, including a comparison of different LLMs.
[1159] MAML-KT: Addressing Cold Start Problem in Knowledge Tracing for New Students via Few-Shot Model-Agnostic Meta Learning
Indronil Bhattacharjee, Christabel Wayllace
Main category: cs.LG
TL;DR: MAML-KT applies meta-learning to knowledge tracing for better early performance prediction on new students in cold-start scenarios.
Details
Motivation: Standard knowledge tracing models perform poorly in cold-start scenarios where models must predict knowledge states of new students from few initial interactions. Current evaluation designs obscure this deployment reality.Method: Proposes MAML-KT, a model-agnostic meta-learning approach using MAML (Model-Agnostic Meta-Learning) to learn an initialization optimized for rapid adaptation to new students with 1-2 gradient updates. Evaluated on ASSIST datasets with controlled cold-start protocols.
Result: MAML-KT achieves higher early accuracy than prior KT models (DKT, DKVMN, SAKT) in nearly all cold-start conditions across datasets, with gains persisting as cohort size increases from 10 to 50 students.
Conclusion: Meta-learning optimization for rapid adaptation reduces early prediction error for new students and provides clearer interpretation of early accuracy fluctuations, distinguishing model limitations from genuine learning dynamics.
Abstract: Knowledge tracing (KT) models are commonly evaluated by training on early interactions from all students and testing on later responses. While effective for measuring average predictive performance, this evaluation design obscures a cold start scenario that arises in deployment, where models must infer the knowledge state of previously unseen students from only a few initial interactions. Prior studies have shown that under this setting, standard empirically risk-minimized KT models such as DKT, DKVMN and SAKT exhibit substantially lower early accuracy than previously reported. We frame new-student performance prediction as a few-shot learning problem and introduce MAML-KT, a model-agnostic meta learning approach that learns an initialization optimized for rapid adaptation to new students using one or two gradient updates. We evaluate MAML-KT on ASSIST2009, ASSIST2015 and ASSIST2017 using a controlled cold start protocol that trains on a subset of students and tests on held-out learners across early interaction windows (questions 3-10 and 11-15), scaling cohort sizes from 10 to 50 students. Across datasets, MAML-KT achieves higher early accuracy than prior KT models in nearly all cold start conditions, with gains persisting as cohort size increases. On ASSIST2017, we observe a transient drop in early performance that coincides with many students encountering previously unseen skills. Further analysis suggests that these drops coincide with skill novelty rather than model instability, consistent with prior work on skill-level cold start. Overall, optimizing KT models for rapid adaptation reduces early prediction error for new students and provides a clearer lens for interpreting early accuracy fluctuations, distinguishing model limitations from genuine learning and knowledge acquisition dynamics.
[1160] Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias
Joonwon Seo
Main category: cs.LG
TL;DR: A novel approach to polyphonic music generation using structural inductive bias to address the “Missing Middle” problem, with theoretical proofs and empirical validation on Beethoven’s piano sonatas.
Details
Motivation: To address the "Missing Middle" problem in polyphonic music generation by incorporating structural inductive bias, bridging gaps between theoretical foundations and practical applications in AI music generation.Method: Proposes Smart Embedding architecture with structural inductive bias, uses information theory (NMI=0.167 to verify pitch-hand independence), mathematical proofs via information theory (0.153 bits loss bound), Rademacher complexity (28.09% tighter bound), and category theory for stability. Validated on Beethoven’s piano sonatas with SVD analysis and expert listening study (N=53).
Result: Achieved 48.30% parameter reduction, 9.47% validation loss reduction, verified pitch-hand independence (NMI=0.167), mathematically proven stability and generalization improvements, with positive expert listening study results.
Conclusion: The dual theoretical-applied framework successfully bridges gaps in AI music generation, offering mathematically grounded deep learning approaches with verifiable insights for polyphonic music generation.
Abstract: This monograph introduces a novel approach to polyphonic music generation by addressing the “Missing Middle” problem through structural inductive bias. Focusing on Beethoven’s piano sonatas as a case study, we empirically verify the independence of pitch and hand attributes using normalized mutual information (NMI=0.167) and propose the Smart Embedding architecture, achieving a 48.30% reduction in parameters. We provide rigorous mathematical proofs using information theory (negligible loss bounded at 0.153 bits), Rademacher complexity (28.09% tighter generalization bound), and category theory to demonstrate improved stability and generalization. Empirical results show a 9.47% reduction in validation loss, confirmed by SVD analysis and an expert listening study (N=53). This dual theoretical and applied framework bridges gaps in AI music generation, offering verifiable insights for mathematically grounded deep learning.
[1161] Bridging Policy and Real-World Dynamics: LLM-Augmented Rebalancing for Shared Micromobility Systems
Heng Tan, Hua Yan, Yu Yang
Main category: cs.LG
TL;DR: AMPLIFY: LLM-augmented policy adaptation framework for shared micromobility rebalancing that combines baseline optimization with real-time LLM-based adaptation for emergent events.
Details
Motivation: Existing micromobility rebalancing methods either optimize for average demand patterns or handle predefined uncertainties, but overlook emergent events (demand surges, vehicle outages, regulatory interventions) or sacrifice normal condition performance.Method: Combines baseline rebalancing module with LLM-based adaptation module that ingests system context, demand predictions, and baseline strategies, then refines adjustments through self-reflection to adapt in real-time under emergent scenarios.
Result: Evaluations on real-world e-scooter data from Chicago show improved demand satisfaction and system revenue compared to baseline policies.
Conclusion: Demonstrates the potential of LLM-driven adaptation as a flexible solution for managing uncertainty in micromobility systems.
Abstract: Shared micromobility services such as e-scooters and bikes have become an integral part of urban transportation, yet their efficiency critically depends on effective vehicle rebalancing. Existing methods either optimize for average demand patterns or employ robust optimization and reinforcement learning to handle predefined uncertainties. However, these approaches overlook emergent events (e.g., demand surges, vehicle outages, regulatory interventions) or sacrifice performance in normal conditions. We introduce AMPLIFY, an LLM-augmented policy adaptation framework for shared micromobility rebalancing. The framework combines a baseline rebalancing module with an LLM-based adaptation module that adjusts strategies in real time under emergent scenarios. The adaptation module ingests system context, demand predictions, and baseline strategies, and refines adjustments through self-reflection. Evaluations on real-world e-scooter data from Chicago show that our approach improves demand satisfaction and system revenue compared to baseline policies, highlighting the potential of LLM-driven adaptation as a flexible solution for managing uncertainty in micromobility systems.
[1162] NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces
Jiwoo Kim, Swarajh Mehta, Hao-Lun Hsu, Hyunwoo Ryu, Yudong Liu, Miroslav Pajic
Main category: cs.LG
TL;DR: NNiTs generate neural network weights in width-agnostic manner using patch-based tokenization and diffusion transformers, enabling functional network generation across architectures with strong generalization to unseen topologies.
Details
Motivation: Generative modeling of neural network parameters faces challenges due to architecture dependence and permutation symmetries that create unaligned parameterizations for similar functions.Method: Introduces Neural Network Diffusion Transformers (NNiTs) that tokenize weight matrices into patches, model them as locally structured fields, and use Graph HyperNetworks with CNN decoder to align weight space for patch-based processing.
Result: NNiT generates fully functional MLPs across architectures and achieves >85% success on ManiSkill3 robotics tasks with unseen architecture topologies, outperforming baselines that fail to generalize.
Conclusion: Patch-based weight generation with diffusion transformers enables width-agnostic neural network synthesis with strong generalization to novel architectures, overcoming permutation symmetry challenges.
Abstract: Generative modeling of neural network parameters is often tied to architectures because standard parameter representations rely on known weight-matrix dimensions. Generation is further complicated by permutation symmetries that allow networks to model similar input-output functions while having widely different, unaligned parameterizations. In this work, we introduce Neural Network Diffusion Transformers (NNiTs), which generate weights in a width-agnostic manner by tokenizing weight matrices into patches and modeling them as locally structured fields. We establish that Graph HyperNetworks (GHNs) with a convolutional neural network (CNN) decoder structurally align the weight space, creating the local correlation necessary for patch-based processing. Focusing on MLPs, where permutation symmetry is especially apparent, NNiT generates fully functional networks across a range of architectures. Our approach jointly models discrete architecture tokens and continuous weight patches within a single sequence model. On ManiSkill3 robotics tasks, NNiT achieves >85% success on architecture topologies unseen during training, while baseline approaches fail to generalize.
[1163] Engineering FAIR Privacy-preserving Applications that Learn Histories of Disease
Ines N. Duarte, Praphulla M. S. Bhawsar, Lee K. Mason, Jeya Balaji Balasubramanian, Daniel E. Russ, Arlindo L. Oliveira, Jonas S. Almeida
Main category: cs.LG
TL;DR: A privacy-preserving generative AI application for healthcare that deploys transformer models in-browser using ONNX and custom JavaScript SDK, enabling client-side inference without data leaving the user’s device.
Details
Motivation: To address privacy concerns in healthcare AI applications by creating a system that allows generative transformer models to run entirely client-side, eliminating the need to share sensitive medical data with external servers.Method: Developed an in-browser model deployment system using ONNX runtime and custom JavaScript SDK, enabling client-side inference generation without downloads or installations, focusing on the “Reusability” aspect of FAIR data principles.
Result: Successfully deployed a generative transformer model for predicting individual morbidity risk in a privacy-preserving manner, establishing a secure architectural blueprint for private generative AI in medicine.
Conclusion: The approach demonstrates that privacy-sensitive domains like healthcare can benefit from generative AI through client-side deployment, balancing model utility with data protection requirements.
Abstract: A recent report on “Learning the natural history of human disease with generative transformers” created an opportunity to assess the engineering challenge of delivering user-facing Generative AI applications in privacy-sensitive domains. The application of these models, particularly for personalized healthcare tasks like predicting individual morbidity risk, is typically constrained by data privacy concerns. This project was accordingly designed as an in-browser model deployment exercise (an “App”) testing the architectural boundaries of client-side inference generation (no downloads or installations). We relied exclusively on the documentation provided in the reference report to develop the model, specifically testing the “R” component of the FAIR data principles: Findability, Accessibility, Interoperability, and Reusability. The successful model deployment, leveraging ONNX and a custom JavaScript SDK, establishes a secure, high-performance architectural blueprint for the future of private generative AI in medicine.
[1164] OSF: On Pre-training and Scaling of Sleep Foundation Models
Zitao Shuai, Zongzhe Xu, David Yang, Wei Wang, Yuzhe Yang
Main category: cs.LG
TL;DR: SleepBench benchmark for sleep foundation models reveals channel-invariant learning is crucial, scaling improves performance, and OSF models achieve SOTA across diverse sleep tasks.
Details
Motivation: Polysomnography (PSG) suffers from heterogeneity across devices and cohorts, and existing foundation models for sleep physiology lack understanding of pre-training processes and scaling patterns for better generalization.Method: Curated 166,500 hours of sleep recordings from nine public sources to create SleepBench benchmark, systematically evaluated four families of self-supervised pre-training objectives, and developed OSF models with enhanced pre-training and scaling recipes.
Result: Three key findings: (1) existing FMs fail with missing channels at inference, (2) channel-invariant feature learning is essential, (3) scaling sample size, model capacity, and multi-source data improves performance. OSF achieves state-of-the-art across nine datasets.
Conclusion: SleepBench provides comprehensive benchmark for sleep foundation models, revealing critical insights about channel-invariant learning and scaling. OSF models demonstrate superior performance and intriguing properties for sleep analysis.
Abstract: Polysomnography (PSG) provides the gold standard for sleep assessment but suffers from substantial heterogeneity across recording devices and cohorts. There have been growing efforts to build general-purpose foundation models (FMs) for sleep physiology, but lack an in-depth understanding of the pre-training process and scaling patterns that lead to more generalizable sleep FMs. To fill this gap, we curate a massive corpus of 166,500 hours of sleep recordings from nine public sources and establish SleepBench, a comprehensive, fully open-source benchmark. Leveraging SleepBench, we systematically evaluate four families of self-supervised pre-training objectives and uncover three critical findings: (1) existing FMs fail to generalize to missing channels at inference; (2) channel-invariant feature learning is essential for pre-training; and (3) scaling sample size, model capacity, and multi-source data mixture consistently improves downstream performance.With an enhanced pre-training and scaling recipe, we introduce OSF, a family of sleep FMs that achieves state-of-the-art performance across nine datasets on diverse sleep and disease prediction tasks. Further analysis of OSF also reveals intriguing properties in sample efficiency, hierarchical aggregation, and cross-dataset scaling.
[1165] Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning
Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao
Main category: cs.LG
TL;DR: LoDA: A continual learning method using Low-rank Decomposition and Adaptation to separate task-shared and task-specific directions in LoRA subspaces for better knowledge transfer and isolation.
Details
Motivation: Existing LoRA-based continual learning methods separate task update spaces using null spaces of past tasks, but they overlook task-shared directions (suppressing knowledge transfer) and fail to capture truly effective task-specific directions when tasks are correlated.Method: LoDA performs task-driven decomposition to build general (task-shared) and truly task-specific LoRA subspaces using energy-based objectives. It fixes LoRA down-projections on these subspaces and learns robust up-projections via Gradient-Aligned Optimization (GAO). After each task, it derives closed-form recalibration for general updates to approximate feature-level joint optimum.
Result: Experiments show LoDA outperforms existing continual learning methods.
Conclusion: LoDA effectively addresses limitations of previous LoRA-based CL methods by properly separating task-shared and task-specific directions, enabling better knowledge transfer while preventing forgetting.
Abstract: Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions since these ``null bases” of old tasks can remain nearly inactive for new task under correlated tasks. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods.
[1166] Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare
Elizabeth W. Miller, Jeffrey D. Blume
Main category: cs.LG
TL;DR: Paper proposes evaluation framework for individual-level prediction instability in healthcare ML models, focusing on variability from optimization randomness rather than aggregate metrics.
Details
Motivation: Current healthcare predictive models focus on aggregate performance metrics but ignore individual-level variability in risk estimates caused by optimization randomness, which can undermine clinical trust and alter treatment decisions.Method: Proposes two diagnostics: empirical prediction interval width (ePIW) for continuous risk estimate variability, and empirical decision flip rate (eDFR) for threshold-based decision instability. Applied to simulated data and GUSTO-I clinical dataset.
Result: Found that optimization randomness alone can cause individual-level variability comparable to resampling entire training dataset. Neural networks show greater instability than logistic regression. Instability near clinical thresholds can alter treatment recommendations.
Conclusion: Stability diagnostics should be incorporated into routine model validation for assessing clinical reliability, as standard aggregate metrics fail to capture individual-level prediction instability.
Abstract: In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and GUSTO-I clinical dataset. Across observed settings, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions compared to logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings that stability diagnostics should be incorporated into routine model validation for assessing clinical reliability.
[1167] A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients
Joakim Edin, Sedrah Butt Balaganeshan, Annike Kjølby Kristensen, Lars Maaløe, Ioannis Louloudis, Søren Brunak
Main category: cs.LG
TL;DR: A language model trained on 5.8M EHRs predicts ICD-10 codes with 71.8% F1 score, revealing systematic under-coding of secondary diagnoses in healthcare systems.
Details
Motivation: Manual medical coding is time-consuming and error-prone, and existing automation efforts use small datasets that don't represent real-world patient heterogeneity. There's a need for better automated coding systems that can handle diverse patient populations and identify systematic coding issues.Method: Trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006-2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients.
Result: Achieved micro F1 of 71.8% and top-10 recall of 95.5%. Performance varied by specialty (F1: 53-91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing as secondary diagnoses had lower F1 scores. For three under-coded conditions (suicide-related behaviors, weight disorders, hypertension), the model identified thousands of uncoded cases with 76-86% validation rate upon manual review.
Conclusion: The model reveals systematic under-coding of secondary diagnoses in Eastern Denmark, with implications for epidemiological research and public health surveillance. Similar healthcare systems likely share this issue. The model can automate coding for ~50% of cases and provide accurate suggestions for most others, offering a practical solution to capture missed secondary conditions.
Abstract: Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006–2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53–91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and hypertension), the model identified thousands of uncoded cases, of which 76-86% were confirmed valid upon manual review, suggesting systematic under-coding rather than model error. These findings suggest under-coding of secondary diagnoses in Eastern Denmark during this period, with potential implications for epidemiological research, public health surveillance, and understanding of multimorbidity. Similar time constraints and reimbursement structures in other healthcare systems suggest this may not be isolated to this dataset. The model can automate coding for approximately 50% of cases and provide accurate suggestions for most others, and may offer a practical solution to help capture missed secondary conditions.
[1168] CoPeP: Benchmarking Continual Pretraining for Protein Language Models
Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar
Main category: cs.LG
TL;DR: CoPeP benchmark introduces continual learning for protein language models using temporal protein sequence data to improve performance on protein understanding tasks.
Details
Motivation: Protein language models need continual learning to adapt to growing protein databases and leverage temporal meta-information for better performance on protein understanding tasks.Method: Created CoPeP benchmark with protein datasets spanning a decade from UniProt, evaluated continual learning methods (replay, unlearning, plasticity-based) on 31 protein tasks.
Result: Temporal meta-information improves perplexity by up to 7% compared to joint training, and several continual learning methods outperform naive continual pretraining at scale.
Conclusion: CoPeP benchmark enables studying continual learning methods at scale for impactful protein language model applications in drug discovery.
Abstract: Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever-growing data, but also as an opportunity to take advantage of the temporal meta-information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity-based methods, some of which have never been applied to models and data of this scale. Our findings reveal that incorporating temporal meta-information improves perplexity by up to 7% even when compared to training on data from all tasks jointly. Moreover, even at scale, several continual learning methods outperform naive continual pretraining. The CoPeP benchmark offers an exciting opportunity to study these methods at scale in an impactful real-world application.
[1169] Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction
Zain Ulabedeen Farhat, Debamita Ghosh, George K. Atia, Yue Wang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2508.02948: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02948&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1170] Scalable Gaussian process modeling of parametrized spatio-temporal fields
Srinath Dama, Prasanth B. Nair
Main category: cs.LG
TL;DR: Scalable Gaussian process framework with deep product kernels for learning spatio-temporal fields, enabling continuous predictions and efficient uncertainty quantification with near-linear computational complexity.
Details
Motivation: Need for data-driven surrogate modeling of parametrized spatio-temporal fields that can provide continuous predictions independent of training resolution, with efficient uncertainty quantification for downstream tasks.Method: Uses Gaussian processes with deep product kernels, leverages Kronecker matrix algebra for computational efficiency, and provides rigorous bounds for posterior variance computation on both Cartesian and unstructured grids.
Result: Achieves accuracy competitive with Fourier neural operators and deep operator networks, surpasses projection-based reduced-order models on 1D unsteady Burgers’ equation, with near-linear computational scaling.
Conclusion: The framework is an effective tool for data-driven surrogate modeling, particularly valuable when uncertainty estimates are needed for downstream applications.
Abstract: We introduce a scalable Gaussian process (GP) framework with deep product kernels for data-driven learning of parametrized spatio-temporal fields over fixed or parameter-dependent domains. The proposed framework learns a continuous representation, enabling predictions at arbitrary spatio-temporal coordinates, independent of the training data resolution. We leverage Kronecker matrix algebra to formulate a computationally efficient training procedure with complexity that scales nearly linearly with the total number of spatio-temporal grid points. A key feature of our approach is the efficient computation of the posterior variance at essentially the same computational cost as the posterior mean (exactly for Cartesian grids and via rigorous bounds for unstructured grids), thereby enabling scalable uncertainty quantification. Numerical studies on a range of benchmark problems demonstrate that the proposed method achieves accuracy competitive with operator learning methods such as Fourier neural operators and deep operator networks. On the one-dimensional unsteady Burgers’ equation, our method surpasses the accuracy of projection-based reduced-order models. These results establish the proposed framework as an effective tool for data-driven surrogate modeling, particularly when uncertainty estimates are required for downstream tasks.
[1171] Polynomial Surrogate Training for Differentiable Ternary Logic Gate Networks
Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras
Main category: cs.LG
TL;DR: Extends differentiable logic gate networks to ternary logic with UNKNOWN state for principled abstention under uncertainty, using polynomial surrogate training to handle exponential gate explosion.
Details
Motivation: Existing differentiable logic gate networks (DLGNs) are limited to binary logic with 16 two-input gates. Extending to ternary logic (including UNKNOWN state) would enable principled abstention under uncertainty, but the gate space explodes to 19,683 possibilities, making standard softmax-over-gates training intractable.Method: Introduces Polynomial Surrogate Training (PST) which represents each ternary neuron as a degree-(2,2) polynomial with 9 learnable coefficients (2,187× parameter reduction vs. softmax). Proves bounded gap between trained network and discretized logic circuit via data-independent commitment loss that vanishes at convergence.
Result: Ternary networks train 2-3× faster than binary DLGNs, discover functionally diverse true ternary gates, and UNKNOWN output acts as Bayes-optimal uncertainty proxy enabling selective prediction. Scaling experiments (48K to 512K neurons on CIFAR-10) show hardening gap contracts with overparameterization.
Conclusion: PST establishes general polynomial-surrogate methodology with quadratic parameterization cost growth with logic valence, opening door to many-valued differentiable logic. Ternary circuits surpass binary accuracy when low-confidence predictions are filtered via selective prediction.
Abstract: Differentiable logic gate networks (DLGNs) learn compact, interpretable Boolean circuits via gradient-based training, but all existing variants are restricted to the 16 two-input binary gates. Extending DLGNs to Ternary Kleene $K_3$ logic and training DTLGNs where the UNKNOWN state enables principled abstention under uncertainty is desirable. However, the support set of potential gates per neuron explodes to $19{,}683$, making the established softmax-over-gates training approach intractable. We introduce Polynomial Surrogate Training (PST), which represents each ternary neuron as a degree-$(2,2)$ polynomial with 9 learnable coefficients (a $2{,}187\times$ parameter reduction) and prove that the gap between the trained network and its discretized logic circuit is bounded by a data-independent commitment loss that vanishes at convergence. Scaling experiments from 48K to 512K neurons on CIFAR-10 demonstrate that this hardening gap contracts with overparameterization. Ternary networks train $2$-$3\times$ faster than binary DLGNs and discover true ternary gates that are functionally diverse. On synthetic and tabular tasks we find that the UNKNOWN output acts as a Bayes-optimal uncertainty proxy, enabling selective prediction in which ternary circuits surpass binary accuracy once low-confidence predictions are filtered. More broadly, PST establishes a general polynomial-surrogate methodology whose parameterization cost grows only quadratically with logic valence, opening the door to many-valued differentiable logic.
[1172] When does Chain-of-Thought Help: A Markovian Perspective
Zihan Wang, Yijun Dong, Qi Lei
Main category: cs.LG
TL;DR: Theoretical analysis of Chain-of-Thought prompting effectiveness using Markov chain modeling, identifying transition alignment across reasoning steps as key factor for CoT benefits.
Details
Motivation: Chain-of-Thought prompting shows uneven performance gains across different reasoning tasks, and there's a need to understand when and why CoT helps to guide its effective application.Method: Model step-wise reasoning trajectories as Markov chains, where each intermediate step is a state and dependencies between steps are captured by transition kernels. Analyze how transition alignment (whether instances share common step-wise transition kernels) affects CoT effectiveness.
Result: When transitions are identical across steps, CoT reduces inference-time sample complexity (fewer context sample trajectories needed). When transitions differ across steps, gains can vanish. Noise in intermediate steps modulates CoT’s benefits.
Conclusion: Transition alignment is the key determinant of CoT’s effectiveness. The theoretical framework helps predict when CoT will be beneficial and guides design of synthetic benchmarks to validate predictions.
Abstract: Chain-of-Thought (CoT) prompting is a widely used inference-time technique for improving reasoning, yet its gains are uneven across tasks. We analyze when and why CoT helps by modeling the step-wise reasoning trajectory as a Markov chain. Each intermediate step is a state and the dependence between steps is captured by a transition kernel. Our theory identifies transition alignment, whether instances share a common step-wise transition kernel, as the key determinant of CoT’s effectiveness. When transitions are identical across steps, CoT reduces inference-time sample complexity: fewer context sample trajectories suffice to recover the final decision. In contrast, when transitions differ across steps, these gains can vanish. We further quantify how noise in intermediate steps modulates CoT’s benefit. Beyond theory, we design synthetic benchmarks that isolate these factors to complement prior results on real-world tasks and to empirically validate our predictions.
[1173] Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning
Wenchang Duan, Yaoliang Yu, Jiwan He, Yi Shi
Main category: cs.LG
TL;DR: Failed to fetch summary for arXiv ID 2510.26389 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limitingMethod: Unable to determine method as paper content could not be retrieved due to API rate limiting
Result: Unable to determine results as paper content could not be retrieved due to API rate limiting
Conclusion: Unable to determine conclusion as paper content could not be retrieved due to API rate limiting
Abstract: Failed to fetch summary for 2510.26389: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26389&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1174] Vectorized Adaptive Histograms for Sparse Oblique Forests
Ariel Lubonja, Jungsang Yoon, Haoyin Xu, Yue Wan, Yilin Xu, Richard Stotz, Mathieu Guillame-Bert, Joshua T. Vogelstein, Randal Burns
Main category: cs.LG
TL;DR: Optimized sparse oblique random forests with dynamic histogram/sorting switching and vector intrinsics for faster training, plus GPU implementations.
Details
Motivation: Sparse oblique random forests provide good uncertainty guarantees but are computationally expensive due to deep trees and runtime linear combination processing.Method: Dynamic switching between histograms and sorting for split finding, histogram construction optimization using vector intrinsics, and GPU/hybrid CPU-GPU implementations.
Result: 1.7-2.5x speedup over existing oblique forests, 1.5-2x over standard random forests on large datasets.
Conclusion: Significant computational improvements for sparse oblique random forests while maintaining their statistical benefits.
Abstract: Classification using sparse oblique random forests provides guarantees on uncertainty and confidence while controlling for specific error types. However, they use more data and more compute than other tree ensembles because they create deep trees and need to sort or histogram linear combinations of data at runtime. We provide a method for dynamically switching between histograms and sorting to find the best split. We further optimize histogram construction using vector intrinsics. Evaluating this on large datasets, our optimizations speedup training by 1.7-2.5x compared to existing oblique forests and 1.5-2x compared to standard random forests. We also provide a GPU and hybrid CPU-GPU implementation.
[1175] Detecting Transportation Mode Using Dense Smartphone GPS Trajectories and Transformer Models
Yuandong Zhang, Othmane Echchabi, Tianshu Feng, Wenyi Zhang, Hsuai-Kai Liao, Charles Chang
Main category: cs.LG
TL;DR: SpeedTransformer: A Transformer-based model using only speed inputs from dense smartphone GPS trajectories for transportation mode detection, outperforming traditional deep learning models and showing strong transfer learning capabilities.
Details
Motivation: Transportation mode detection is important for GeoAI and transportation research, but existing methods may not fully leverage the potential of dense GPS trajectory data and modern transformer architectures.Method: Proposes SpeedTransformer, a novel Transformer-based model that uses only speed inputs from dense smartphone GPS trajectories to infer transportation modes, tested in benchmark experiments against traditional models like LSTM.
Result: Outperformed traditional deep learning models (LSTM), demonstrated strong transfer learning capabilities across geographical regions with small fine-tuning datasets, and consistently outperformed baselines in real-world experiments under complex environments.
Conclusion: Transformer architectures combined with dense GPS trajectories have substantial potential for advancing transportation mode detection and broader mobility-related research.
Abstract: Transportation mode detection is an important topic within GeoAI and transportation research. In this study, we introduce SpeedTransformer, a novel Transformer-based model that relies solely on speed inputs to infer transportation modes from dense smartphone GPS trajectories. In benchmark experiments, SpeedTransformer outperformed traditional deep learning models, such as the Long Short-Term Memory (LSTM) network. Moreover, the model demonstrated strong flexibility in transfer learning, achieving high accuracy across geographical regions after fine-tuning with small datasets. Finally, we deployed the model in a real-world experiment, where it consistently outperformed baseline models under complex built environments and high data uncertainty. These findings suggest that Transformer architectures, when combined with dense GPS trajectories, hold substantial potential for advancing transportation mode detection and broader mobility-related research.
[1176] Quantifying Catastrophic Forgetting in IoT Intrusion Detection Systems
Sourasekhar Banerjee, David Bergqvist, Salman Toor, Christian Rohner, Andreas Johnsson
Main category: cs.LG
TL;DR: A framework for continual learning in IoT intrusion detection systems that benchmarks five approaches to handle distribution shifts in attack patterns, with replay-based methods performing best overall and synaptic intelligence showing near-zero forgetting with high efficiency.
Details
Motivation: Distribution shifts in attack patterns within IoT networks threaten reliability and security. Traditional IDS trained on static datasets fail to generalize to unseen threats and suffer from catastrophic forgetting when updated with new attacks, necessitating continual adaptability for robust IoT network defense.Method: Formulates intrusion detection as a domain continual learning problem and proposes a method-agnostic IDS framework that integrates diverse continual learning strategies. Systematically benchmarks five representative approaches across multiple domain-ordering sequences using a comprehensive multi-attack dataset comprising 48 domains.
Result: Continual learning mitigates catastrophic forgetting while maintaining balance between plasticity, stability, and efficiency. Replay-based approaches achieve best overall performance, while Synaptic Intelligence (SI) delivers near-zero forgetting with high training efficiency.
Conclusion: The proposed continual learning framework demonstrates strong potential for stable and sustainable IDS deployment in dynamic IoT networks, with SI showing particular promise for resource-constrained environments due to its efficiency and low forgetting.
Abstract: Distribution shifts in attack patterns within RPL-based IoT networks pose a critical threat to the reliability and security of large-scale connected systems. Intrusion Detection Systems (IDS) trained on static datasets often fail to generalize to unseen threats and suffer from catastrophic forgetting when updated with new attacks. Ensuring continual adaptability of IDS is therefore essential for maintaining robust IoT network defence. In this focused study, we formulate intrusion detection as a domain continual learning problem and propose a method-agnostic IDS framework that can integrate diverse continual learning strategies. We systematically benchmark five representative approaches across multiple domain-ordering sequences using a comprehensive multi-attack dataset comprising 48 domains. Results show that continual learning mitigates catastrophic forgetting while maintaining a balance between plasticity, stability, and efficiency, a crucial aspect for resource-constrained IoT environments. Among the methods, Replay-based approaches achieve the best overall performance, while Synaptic Intelligence (SI) delivers near-zero forgetting with high training efficiency, demonstrating strong potential for stable and sustainable IDS deployment in dynamic IoT networks.
[1177] Improving Full Waveform Inversion in Large Model Era
Yinan Feng, Peng Jin, Yuzhe Guo, Yinpeng Chen, Youzuo Lin
Main category: cs.LG
TL;DR: Large-scale neural network trained on simple synthetic seismic data achieves state-of-the-art Full Waveform Inversion performance and generalizes well to complex geological structures through coordinated scaling of model capacity, data diversity, and training strategy.
Details
Motivation: Existing data-driven Full Waveform Inversion (FWI) methods suffer from limited datasets with small models, geological diversity, and spatial extent, causing overfitting and poor generalization to realistic geological structures despite performing well on synthetic data.Method: Develops a billion-parameter model for FWI using coordinated scaling across three axes: model capacity (large neural network), data diversity (simulated but relatively simple data), and training strategy. The approach trains entirely on simulated data and tests generalization on challenging benchmarks.
Result: Achieves state-of-the-art performance on OpenFWI benchmark, significantly narrowing the generalization gap in data-driven FWI. Shows substantial improvement across six challenging geophysical benchmarks (Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP, Sigsbee, SEAM Phase I) with SSIM improving from 0.5844 to 0.7669, inferring complex structures absent from training set.
Conclusion: With appropriate scaling strategy, large models trained on simple synthetic data can achieve substantial generalization to more complex and realistic geological structures, demonstrating the potential of scaling approaches for solving highly nonlinear and ill-posed inverse problems like FWI.
Abstract: Full Waveform Inversion (FWI) is a highly nonlinear and ill-posed problem that aims to recover subsurface velocity maps from surface-recorded seismic waveforms data. Existing data-driven FWI typically uses small models, as available datasets have limited volume, geological diversity, and spatial extent, leading to substantial concerns about overfitting. Although they perform well on synthetic datasets, current methods fail to generalize to more realistic geological structures. In this work, we show that a model trained entirely on simulated and relatively simple data can generalize remarkably well to challenging and unseen geological benchmarks. We provide a working recipe that tames a billion-parameter model for FWI through coordinated scaling across three axes: model capacity, data diversity, and training strategy. Our model achieves state-of-the-art performance on OpenFWI and significantly narrows the generalization gap in data-driven FWI. Across six challenging geophysical benchmarks, including Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP, Sigsbee, and SEAM Phase I, it infers complex structures absent from the training set and delivers significant performance improvements (SSIM from 0.5844 to 0.7669). Overall, our results demonstrate that with an appropriate scaling strategy, large models trained on simple synthetic data can achieve substantial generalization to more complex and realistic geological structures.
[1178] Hereditary Geometric Meta-RL: Nonlocal Generalization via Task Symmetries
Paul Nitschke, Shahriar Talebi
Main category: cs.LG
TL;DR: Meta-RL method that discovers system symmetries to enable generalization across entire task space rather than just local smoothness-based generalization.
Details
Motivation: Current Meta-RL methods rely on smoothness in task encoding for local generalization, requiring dense task space coverage and missing richer task space structure from system symmetries.Method: Develops geometric perspective with “hereditary geometry” induced by system symmetries. Uses Lie group transformations to reuse policies, converting Meta-RL into symmetry discovery. Introduces differential symmetry discovery method that collapses functional invariance constraints for better stability and efficiency.
Result: On 2D navigation task, method efficiently recovers ground-truth symmetry and generalizes across entire task space, while baseline only generalizes near training tasks.
Conclusion: Symmetry discovery approach enables more efficient and broader generalization in Meta-RL by leveraging inherent system symmetries rather than relying solely on smooth extrapolation.
Abstract: Meta-Reinforcement Learning (Meta-RL) commonly generalizes via smoothness in the task encoding. While this enables local generalization around each training task, it requires dense coverage of the task space and leaves richer task space structure untapped. In response, we develop a geometric perspective that endows the task space with a “hereditary geometry” induced by the inherent symmetries of the underlying system. Concretely, the agent reuses a policy learned at the train time by transforming states and actions through actions of a Lie group. This converts Meta-RL into symmetry discovery rather than smooth extrapolation, enabling the agent to generalize to wider regions of the task space. We show that when the task space is inherited from the symmetries of the underlying system, the task space embeds into a subgroup of those symmetries whose actions are linearizable, connected, and compact–properties that enable efficient learning and inference at the test time. To learn these structures, we develop a differential symmetry discovery method. This collapses functional invariance constraints and thereby improves numerical stability and sample efficiency over functional approaches. Empirically, on a two-dimensional navigation task, our method efficiently recovers the ground-truth symmetry and generalizes across the entire task space, while a common baseline generalizes only near training tasks.
[1179] TENG-BC: Unified Time-Evolving Natural Gradient for Neural PDE Solvers with General Boundary Conditions
Hongjie Jiang, Di Luo
Main category: cs.LG
TL;DR: TENG-BC is a neural PDE solver using Time-Evolving Natural Gradient with boundary-aware optimization for high-precision solutions under various boundary conditions.
Details
Motivation: Existing neural PDE solvers struggle with long-time error accumulation and difficulty enforcing general boundary conditions (Dirichlet, Neumann, Robin, mixed types).Method: TENG-BC uses Time-Evolving Natural Gradient with boundary-aware optimization that jointly enforces interior dynamics and boundary conditions at each time step within a unified framework.
Result: Achieves solver-level accuracy across diffusion, transport, and nonlinear PDE benchmarks with various boundary conditions, outperforming conventional solvers and PINN baselines.
Conclusion: TENG-BC provides a stable, high-precision neural PDE solver that handles general boundary constraints without delicate penalty tuning.
Abstract: Accurately solving time-dependent partial differential equations (PDEs) with neural networks remains challenging due to long-time error accumulation and the difficulty of enforcing general boundary conditions. We introduce TENG-BC, a high-precision neural PDE solver based on the Time-Evolving Natural Gradient, designed to perform under general boundary constraints. At each time step, TENG-BC performs a boundary-aware optimization that jointly enforces interior dynamics and boundary conditions, accommodating Dirichlet, Neumann, Robin, and mixed types within a unified framework. This formulation admits a natural-gradient interpretation, enabling stable time evolution without delicate penalty tuning. Across benchmarks over diffusion, transport, and nonlinear PDEs with various boundary conditions, TENG-BC achieves solver-level accuracy under comparable sampling budgets, outperforming conventional solvers and physics-informed neural network (PINN) baselines.
[1180] USE: Uncertainty Structure Estimation for Robust Semi-Supervised Learning
Tsao-Lun Chen, Chien-Liang Liu, Tzu-Ming Harry Hsu, Tai-Hsien Wu, Chi-Cheng Fu, Han-Yi E. Chou, Shun-Feng Su
Main category: cs.LG
TL;DR: A lightweight Uncertainty Structure Estimation (USE) method for semi-supervised learning that assesses unlabeled data quality by computing entropy scores and separating informative from uninformative samples before training.
Details
Motivation: Semi-supervised learning (SSL) has made impressive progress but its reliability in deployment is limited by unlabeled data quality. In practice, unlabeled data are often contaminated by out-of-distribution (OOD) samples that negatively affect performance. The bottleneck lies not in algorithmic design but in the absence of principled mechanisms to assess and curate unlabeled data quality.Method: USE trains a proxy model on the labeled set to compute entropy scores for unlabeled samples, then derives a threshold via statistical comparison against a reference distribution that separates informative (structured) from uninformative (structureless) samples. This enables assessment as a preprocessing step to remove harmful unlabeled data before SSL training begins.
Result: Extensive experiments on imaging (CIFAR-100) and NLP (Yelp Review) data show that USE consistently improves accuracy and robustness under varying levels of OOD contamination.
Conclusion: The proposed approach reframes unlabeled data quality control as a structural assessment problem and considers it a necessary component for reliable and efficient SSL in realistic mixed-distribution environments.
Abstract: In this study, a novel idea, Uncertainty Structure Estimation (USE), a lightweight, algorithm-agnostic procedure that emphasizes the often-overlooked role of unlabeled data quality is introduced for Semi-supervised learning (SSL). SSL has achieved impressive progress, but its reliability in deployment is limited by the quality of the unlabeled pool. In practice, unlabeled data are almost always contaminated by out-of-distribution (OOD) samples, where both near-OOD and far-OOD can negatively affect performance in different ways. We argue that the bottleneck does not lie in algorithmic design, but rather in the absence of principled mechanisms to assess and curate the quality of unlabeled data. The proposed USE trains a proxy model on the labeled set to compute entropy scores for unlabeled samples, and then derives a threshold, via statistical comparison against a reference distribution, that separates informative (structured) from uninformative (structureless) samples. This enables assessment as a preprocessing step, removing uninformative or harmful unlabeled data before SSL training begins. Through extensive experiments on imaging (CIFAR-100) and NLP (Yelp Review) data, it is evident that USE consistently improves accuracy and robustness under varying levels of OOD contamination. Thus, it can be concluded that the proposed approach reframes unlabeled data quality control as a structural assessment problem, and considers it as a necessary component for reliable and efficient SSL in realistic mixed-distribution environments.
[1181] Exact and Asymptotically Complete Robust Verifications of Neural Networks via Quantum Optimization
Wenxin Li, Wenchao Liu, Chuan Wang, Qi Gao, Yin Ma, Hai Wei, Kai Wen
Main category: cs.LG
TL;DR: Quantum-optimization-based models for robust verification of neural networks against adversarial perturbations, with exact formulations for piecewise-linear activations and scalable approximations for general activations.
Details
Motivation: Deep neural networks are vulnerable to adversarial perturbations, limiting their use in safety-critical applications. Current verification methods face combinatorial challenges, especially for networks with complex activation functions.Method: Two quantum-optimization-based models: 1) exact formulation for piecewise-linear activations (ReLU, hardtanh) using sound and complete verification; 2) scalable over-approximations for general activations (sigmoid, tanh) via piecewise-constant bounds with asymptotic completeness. Integration of Quantum Benders Decomposition with interval arithmetic for acceleration, certificate-transfer bounds for pruned networks, and layerwise partitioning for quantum-classical hybrid workflow.
Result: Experiments on robustness benchmarks show high certification accuracy, demonstrating that quantum optimization can serve as a principled primitive for robustness guarantees in neural networks with complex activations.
Conclusion: Quantum optimization provides effective methods for robust verification of neural networks, addressing combinatorial challenges and supporting networks with diverse activation functions through exact and approximate formulations.
Abstract: Deep neural networks (DNNs) enable high performance across domains but remain vulnerable to adversarial perturbations, limiting their use in safety-critical settings. Here, we introduce two quantum-optimization-based models for robust verification that reduce the combinatorial burden of certification under bounded input perturbations. For piecewise-linear activations (e.g., ReLU and hardtanh), our first model yields an exact formulation that is sound and complete, enabling precise identification of adversarial examples. For general activations (including sigmoid and tanh), our second model constructs scalable over-approximations via piecewise-constant bounds and is asymptotically complete, with approximation error vanishing as the segmentation is refined. We further integrate Quantum Benders Decomposition with interval arithmetic to accelerate solving, and propose certificate-transfer bounds that relate robustness guarantees of pruned networks to those of the original model. Finally, a layerwise partitioning strategy supports a quantum–classical hybrid workflow by coupling subproblems across depth. Experiments on robustness benchmarks show high certification accuracy, indicating that quantum optimization can serve as a principled primitive for robustness guarantees in neural networks with complex activations.
[1182] Physics-Aware Learnability: From Set-Theoretic Independence to Operational Constraints
Jeongho Bang, Kyoungho Cho
Main category: cs.LG
TL;DR: The paper introduces physics-aware learnability (PL), which defines learnability relative to explicit physical access models, addressing paradoxes in standard learnability definitions that assume non-operational resources like infinite precision and unphysical data access.
Details
Motivation: Standard learnability definitions in machine learning assume non-operational resources (infinite precision, unphysical data access, non-representable outputs), leading to logical paradoxes like EMX where even finite subsets of [0,1] can be learnable in some ZFC models but not others. These definitions quantify over arbitrary set-theoretic learners without considering physical constraints.Method: Introduces physics-aware learnability (PL) which defines learnability relative to an explicit access model - a family of admissible physical protocols. For continuum EMX, finite-precision coarse-graining reduces it to a countable problem via exact pushforward/pullback reduction preserving the EMX objective. For quantum data, admissible learners are POVMs on d copies, turning sample size into copy complexity with Helstrom-type lower bounds.
Result: The independence example becomes provably learnable with explicit (ε,δ) sample complexity. For finite no-signaling and quantum models, PL feasibility becomes linear or semidefinite programming problems and is therefore decidable.
Conclusion: Physics-aware learnability provides an operational framework that resolves paradoxes in standard learnability definitions by explicitly accounting for physical constraints and access models, making learnability problems decidable for finite quantum and no-signaling models.
Abstract: Beyond binary classification, learnability can become a logically fragile notion: in EMX, even the class of all finite subsets of $[0,1]$ is learnable in some models of ZFC and not in others. We argue the paradox is operational. The standard definitions quantify over arbitrary set-theoretic learners that implicitly assume non-operational resources (infinite precision, unphysical data access, and non-representable outputs). We introduce physics-aware learnability (PL), which defines the learnability relative to an explicit access model – a family of admissible physical protocols. Finite-precision coarse-graining reduces continuum EMX to a countable problem, via an exact pushforward/pullback reduction that preserves the EMX objective, making the independence example provably learnable with explicit $(ε,δ)$ sample complexity. For quantum data, admissible learners are exactly POVMs on $d$ copies, turning sample size into copy complexity and yielding Helstrom(-type) lower bounds. For finite no-signaling and quantum models, PL feasibility becomes linear or semidefinite and is therefore decidable.
[1183] Weight Updates as Activation Shifts: A Principled Framework for Steering
Dyah Adila, John Cooper, Alexander Yun, Avi Trost, Frederic Sala
Main category: cs.LG
TL;DR: Theoretical framework establishes equivalence between activation steering and weight updates, identifies optimal intervention sites, and proposes joint adaptation combining both approaches for highly parameter-efficient model adaptation.
Details
Motivation: Current activation steering methods rely on empirical heuristics rather than principled foundations, with unclear design choices for intervention location and parameterization. The paper aims to establish theoretical foundations for activation steering and improve its effectiveness.Method: Establishes first-order equivalence between activation-space interventions and weight-space updates, identifies post-block output as optimal intervention site, and proposes joint adaptation that trains simultaneously in both weight and activation spaces.
Result: Post-block steering achieves accuracy within 0.2%-0.9% of full-parameter tuning while training only 0.04% of parameters, outperforming prior methods like ReFT, LoRA, and other PEFT approaches. Joint adaptation often surpasses performance ceilings of individual approaches.
Conclusion: The paper provides theoretical foundations for activation steering, identifies optimal intervention strategies, and introduces joint adaptation as a new paradigm for efficient model adaptation that combines weight and activation updates.
Abstract: Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices – such as intervention location and parameterization – that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach – joint adaptation – that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%-0.9%$ of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.
[1184] Efficient Decoder Scaling Strategy for Neural Routing Solvers
Qing Luo, Fu Luo, Ke Li, Zhenkun Wang
Main category: cs.LG
TL;DR: Scaling study comparing depth vs width scaling in neural routing solvers, finding depth scaling more effective than width scaling for performance gains.
Details
Motivation: Most neural routing solvers limit decoder size to 1-3M parameters, leaving scaling effects unexplored. Need systematic study of scaling strategies (depth vs width) to understand parameter efficiency, data efficiency, and compute efficiency.Method: Constructed 12 model configurations spanning 1M to ~150M parameters, comparing two scaling strategies: scaling depth (layer count) vs scaling width (embedding dimension). Evaluated across three dimensions: parameter efficiency, data efficiency, and compute efficiency.
Result: Parameter count alone insufficient to predict model performance. Depth scaling yields superior performance gains compared to width scaling. Critical distinct roles of model depth and width identified.
Conclusion: Provides design principles for efficient parameter and compute resource allocation in neural routing solvers, with depth scaling being more effective than width scaling for performance enhancement.
Abstract: Construction-based neural routing solvers, typically composed of an encoder and a decoder, have emerged as a promising approach for solving vehicle routing problems. While recent studies suggest that shifting parameters from the encoder to the decoder enhances performance, most works restrict the decoder size to 1-3M parameters, leaving the effects of scaling largely unexplored. To address this gap, we conduct a systematic study comparing two distinct strategies: scaling depth versus scaling width. We synthesize these strategies to construct a suite of 12 model configurations, spanning a parameter range from 1M to ~150M, and extensively evaluate their scaling behaviors across three critical dimensions: parameter efficiency, data efficiency, and compute efficiency. Our empirical results reveal that parameter count is insufficient to accurately predict the model performance, highlighting the critical and distinct roles of model depth (layer count) and width (embedding dimension). Crucially, we demonstrate that scaling depth yields superior performance gains to scaling width. Based on these findings, we provide and experimentally validate a set of design principles for the efficient allocation of parameters and compute resources to enhance the model performance.
[1185] ROKA: Robust Knowledge Unlearning against Adversaries
Jinmyeong Shin, Joshua Tapia, Nicholas Ferreira, Gabriel Diaz, Moayed Daneshyari, Hyeran Jeon
Main category: cs.LG
TL;DR: ROKA is a robust machine unlearning method that prevents knowledge contamination by using neural healing to rebalance models after unlearning, preserving related knowledge while removing target data.
Details
Motivation: Existing machine unlearning methods cause knowledge contamination by damaging related knowledge, which degrades model performance and enables new inference/backdoor attacks. Current attacks require data manipulation, but indirect attacks exploit knowledge contamination without data poisoning.Method: Proposes ROKA (Robust Unlearning via Knowledge-Aware Healing) based on modeling neural networks as Neural Knowledge Systems. Instead of just destroying information, ROKA constructively rebalances the model by nullifying forgotten data influence while strengthening conceptual neighbors through neural healing.
Result: ROKA effectively unlearns targets while preserving or even enhancing accuracy of retained data across various large models (vision transformers, multi-modal models, LLMs). Provides theoretical guarantees for knowledge preservation during unlearning and mitigates indirect unlearning attacks.
Conclusion: ROKA offers a robust unlearning strategy that prevents knowledge contamination through neural healing, providing theoretical guarantees and practical effectiveness against indirect unlearning attacks while maintaining model performance.
Abstract: The need for machine unlearning is critical for data privacy, yet existing methods often cause Knowledge Contamination by unintentionally damaging related knowledge. Such a degraded model performance after unlearning has been recently leveraged for new inference and backdoor attacks. Most studies design adversarial unlearning requests that require poisoning or duplicating training data. In this study, we introduce a new unlearning-induced attack model, namely indirect unlearning attack, which does not require data manipulation but exploits the consequence of knowledge contamination to perturb the model accuracy on security-critical predictions. To mitigate this attack, we introduce a theoretical framework that models neural networks as Neural Knowledge Systems. Based on this, we propose ROKA, a robust unlearning strategy centered on Neural Healing. Unlike conventional unlearning methods that only destroy information, ROKA constructively rebalances the model by nullifying the influence of forgotten data while strengthening its conceptual neighbors. To the best of our knowledge, our work is the first to provide a theoretical guarantee for knowledge preservation during unlearning. Evaluations on various large models, including vision transformers, multi-modal models, and large language models, show that ROKA effectively unlearns targets while preserving, or even enhancing, the accuracy of retained data, thereby mitigating the indirect unlearning attacks.
[1186] Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training
Xi Wang, Wenbo Lu, Shengjie Wang
Main category: cs.LG
TL;DR: RapTB with SubM improves GFlowNets for LLM fine-tuning by addressing mode collapse through rooted trajectory balance and submodular replay refresh
Details
Motivation: GFlowNets for fine-tuning LLMs suffer from mode collapse (prefix collapse and length bias) due to weak credit assignment to early prefixes and biased replay causing non-representative training flow distributionMethod: Propose RapTB (Rooted absorbed prefix Trajectory Balance) that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups. Also introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity
Result: Empirically improves optimization performance and molecular diversity while preserving high validity on tasks like molecule generation with LLMs using SMILES strings
Conclusion: The proposed RapTB with SubM effectively addresses mode collapse in GFlowNet-based LLM fine-tuning through better credit assignment and diversity-aware replay
Abstract: Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance RapTB, an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLM using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.
[1187] Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols
Xu Luo, Ji Zhang, Lianli Gao, Heng Tao Shen, Jingkuan Song
Main category: cs.LG
TL;DR: FEWTRANS benchmark establishes rigorous evaluation for few-shot transfer learning, revealing that pre-trained model choice dominates performance over sophisticated transfer methods, with full fine-tuning surprisingly effective via distributed adjustments.
Details
Motivation: To address the lack of unified, rigorous evaluation protocols for few-shot transfer learning that are both challenging and realistic for real-world usage, creating a benchmark to properly assess transfer methods.Method: Created FEWTRANS benchmark with 10 diverse datasets and proposed Hyperparameter Ensemble (HPE) protocol to overcome “validation set illusion” in data-scarce regimes. Conducted empirical evaluation and mechanistic analysis of transfer methods.
Result: Pre-trained model choice is the dominant factor for performance; sophisticated transfer methods offer negligible advantages over simple full-parameter fine-tuning. Full fine-tuning succeeds via distributed micro-adjustments and flexible reshaping of semantic representations without overfitting.
Conclusion: FEWTRANS provides a rigorous benchmark for few-shot transfer learning research, revealing the surprising effectiveness of full fine-tuning and the critical importance of pre-trained model selection over complex transfer algorithms.
Abstract: Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms.However, there lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the “validation set illusion” in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic presentations without suffering from overfitting. Additionally, we quantify the performance collapse of multimodal models in specialized domains as a result of linguistic rarity using adjusted Zipf frequency scores. By releasing FEWTRANS, we aim to provide a rigorous “ruler” to streamline reproducible advances in few-shot transfer learning research. We make the FEWTRANS benchmark publicly available at https://github.com/Frankluox/FewTrans.
[1188] Analyzing Physical Adversarial Example Threats to Machine Learning in Election Systems
Khaleque Md Aashiq Kamal, Surya Eada, Aayushi Verma, Subek Acharya, Adrian Yemin, Benjamin Fuller, Kaleel Mahmood
Main category: cs.LG
TL;DR: Analysis of adversarial example attacks on machine learning ballot classifiers in US elections, evaluating physical vs digital effectiveness and probabilistic election flipping frameworks.
Details
Motivation: To understand how adversarial example attacks could compromise US elections by manipulating machine learning ballot classifiers, and to analyze which attack types are most effective when physically printed versus digitally.Method: Developed probabilistic framework for determining adversarial ballots needed to flip elections; analyzed six adversarial attack types; conducted physical experiments with 144,000 printed/scanned adversarial examples across four ML models.
Result: Found gap between digital and physical effectiveness: l2 and l_infinity attacks work best digitally, while l1 and l2 work best physically depending on model; established framework for quantifying election manipulation risks.
Conclusion: Adversarial examples pose real election risks with physical/digital effectiveness differences; probabilistic framework helps quantify manipulation scenarios beyond close race analyses.
Abstract: Developments in the machine learning voting domain have shown both promising results and risks. Trained models perform well on ballot classification tasks (> 99% accuracy) but are at risk from adversarial example attacks that cause misclassifications. In this paper, we analyze an attacker who seeks to deploy adversarial examples against machine learning ballot classifiers to compromise a U.S. election. We first derive a probabilistic framework for determining the number of adversarial example ballots that must be printed to flip an election, in terms of the probability of each candidate winning and the total number of ballots cast. Second, it is an open question as to which type of adversarial example is most effective when physically printed in the voting domain. We analyze six different types of adversarial example attacks: l_infinity-APGD, l2-APGD, l1-APGD, l0 PGD, l0 + l_infinity PGD, and l0 + sigma-map PGD. Our experiments include physical realizations of 144,000 adversarial examples through printing and scanning with four different machine learning models. We empirically demonstrate an analysis gap between the physical and digital domains, wherein attacks most effective in the digital domain (l2 and l_infinity) differ from those most effective in the physical domain (l1 and l2, depending on the model). By unifying a probabilistic election framework with digital and physical adversarial example evaluations, we move beyond prior close race analyses to explicitly quantify when and how adversarial ballot manipulation could alter outcomes.
[1189] Dynamic Spatio-Temporal Graph Neural Network for Early Detection of Pornography Addiction in Adolescents Based on Electroencephalogram Signals
Achmad Ardani Prasha, Clavino Ourizqi Rachmadi, Sabrina Laila Mutiara, Hilman Syachr Ramadhan, Chareyl Reinalyta Borneo, Saruni Dwiasnati
Main category: cs.LG
TL;DR: A Dynamic Spatio-Temporal Graph Neural Network (DST-GNN) using EEG data to detect adolescent pornography addiction with 71% F1-score, identifying frontal-central brain regions as biomarkers.
Details
Motivation: Adolescent pornography addiction needs objective neurobiological biomarkers for early detection since self-reports are biased due to social stigma. Current machine learning fails to model dynamic brain connectivity during addictive stimulus exposure.Method: Proposes DST-GNN integrating Phase Lag Index-based Graph Attention Network for spatial modeling and Bidirectional GRU for temporal dynamics. Uses 19-channel EEG from 14 adolescents (7 addicted, 7 healthy) across 9 experimental conditions with Leave-One-Subject-Out Cross Validation.
Result: Achieves F1-Score of 71.00%±12.10% and recall of 85.71%, a 104% improvement over baseline. Ablation shows 21% contribution from temporal modeling and 57% from PLI graph. Identifies frontal-central regions (Fz, Cz, C3, C4) as dominant biomarkers with Beta (58.9%) and Hjorth (31.2%) features, and Cz-T7 connectivity as trait-level biomarker.
Conclusion: DST-GNN effectively models dynamic brain connectivity for addiction detection, identifying objective neurobiological biomarkers that could enable early screening without subjective bias.
Abstract: Adolescent pornography addiction requires early detection based on objective neurobiological biomarkers because self-report is prone to subjective bias due to social stigma. Conventional machine learning has not been able to model dynamic functional connectivity of the brain that fluctuates temporally during addictive stimulus exposure. This study proposes a state-of-the-art Dynamic Spatio-Temporal Graph Neural Network (DST-GNN) that integrates Phase Lag Index (PLI)-based Graph Attention Network (GAT) for spatial modeling and Bidirectional Gated Recurrent Unit (BiGRU) for temporal dynamics. The dataset consists of 14 adolescents (7 addicted, 7 healthy) with 19-channel EEG across 9 experimental conditions. Leave-One-Subject-Out Cross Validation (LOSO-CV) evaluation shows F1-Score of 71.00%$\pm$12.10% and recall of 85.71%, a 104% improvement compared to baseline. Ablation study confirms temporal contribution of 21% and PLI graph construction of 57%. Frontal-central regions (Fz, Cz, C3, C4) are identified as dominant biomarkers with Beta contribution of 58.9% and Hjorth of 31.2%, while Cz-T7 connectivity is consistent as a trait-level biomarker for objective screening.
[1190] Heaviside Low-Rank Support Matrix Machine
Xianchao Xiu, Shenghao Sun, Xinrong Li, Jiyuan Tao
Main category: cs.LG
TL;DR: HL-SMM is a robust matrix classification method using Heaviside loss and low-rank constraints for noise-resistant classification of matrix-structured data.
Details
Motivation: Existing Support Matrix Machine (SMM) methods use convex/nonconvex surrogate losses that are sensitive to noise. Need robust classification for matrix-structured data while preserving spatial correlations.Method: Proposes Heaviside Low-rank SMM (HL-SMM) using Heaviside loss for robustness instead of hinge/ramp losses, with low-rank constraint to capture global structure. Uses proximal alternating minimization (PAM) with closed-form subproblem solutions.
Result: Extensive experiments show HL-SMM achieves superior classification accuracy and robustness compared to state-of-the-art methods on benchmark datasets.
Conclusion: HL-SMM provides a robust matrix classification framework with theoretical guarantees and efficient optimization, outperforming existing methods.
Abstract: Support matrix machine (SMM) is an emerging classification framework that directly handles matrix-structured observations, thereby avoiding the spatial correlations destroyed by vectorization. However, most existing SMM variants rely on convex or nonconvex surrogate loss functions, which may lead to high sensitivity to noise. To address this issue, we propose a novel Heaviside low-rank SMM model called HL-SMM, which leverages the Heaviside loss instead of the common hinge or ramp losses for robustness. Moreover, the low-rank constraint is adopted to accurately characterize the inherent global structure. In theory, we analyze the Karush-Kuhn-Tucker (KKT) points and rigorously prove the sufficient and necessary conditions. In algorithms, we develop an effective proximal alternating minimization (PAM) scheme, where all subproblems have closed-form solutions. Extensive experiments on benchmark datasets validate that the proposed HL-SMM achieves superior classification accuracy and robustness compared to state-of-the-art methods.
[1191] A Polynomial-Time Axiomatic Alternative to SHAP for Feature Attribution
Kazuhiro Hiraki, Shinichi Ishihara, Takumi Kongo, Junnosuke Shino
Main category: cs.LG
TL;DR: The paper proposes ESENSC_rev2, a computationally efficient alternative to SHAP for feature attribution in explainable AI, using cooperative game theory foundations.
Details
Motivation: SHAP (SHapley Additive exPlanations) is computationally expensive for high-dimensional data, requiring efficient alternatives that maintain theoretical soundness while improving scalability.Method: Formulates XAI-TU games in cooperative game theory framework, investigates equal-surplus and proportional-allocation attribution rules, and proposes ESENSC_rev2 combining polynomial-time closed-form rules with null-player property.
Result: ESENSC_rev2 closely approximates exact SHAP while substantially improving scalability with increasing features, achieving favorable trade-offs between computational cost and approximation accuracy.
Conclusion: Axiomatically justified and computationally efficient attribution rules like ESENSC_rev2 can serve as practical, theoretically principled substitutes for SHAP-based approximations in explainability pipelines.
Abstract: In this paper, we provide a theoretically grounded and computationally efficient alternative to SHAP. To this end, we study feature attribution through the lens of cooperative game theory by formulating a class of XAI–TU games. Building on this formulation, we investigate equal-surplus-type and proportional-allocation-type attribution rules and propose a low-cost attribution rule, ESENSC_rev2, constructed by combining two polynomial-time closed-form rules while ensuring the null-player property in the XAI–TU domain. Extensive experiments on tabular prediction tasks demonstrate that ESENSC_rev2 closely approximates exact SHAP while substantially improving scalability as the number of features increases. These empirical results indicate that equal-surplus-type attribution rules can achieve favorable trade-offs between computational cost and approximation accuracy in high-dimensional explainability settings. To provide theoretical foundations for these findings, we establish an axiomatic characterization showing that ESENSC_rev2 is uniquely determined by efficiency, the null-player axiom, a restricted differential marginality principle, an intermediate inessential-game property, and axioms that reduce computational requirements. Our results suggest that axiomatically justified and computationally efficient attribution rules can serve as practical and theoretically principled substitutes for SHAP-based approximations in modern explainability pipelines.
[1192] Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence
Quoc Minh Nguyen, Trung Le, Jing Wu, Anh Tuan Bui, Mehrtash Harandi
Main category: cs.LG
TL;DR: Antibody defense strategy protects LLMs from harmful fine-tuning attacks by using safety alignment regularization and weighted fine-tuning to suppress learning from poisoned samples while preserving performance on benign data.
Details
Motivation: Fine-tuning-as-a-service creates security vulnerabilities where malicious users can submit poisoned datasets to compromise model safety through harmful fine-tuning attacks, requiring defense mechanisms.Method: Two-stage approach: 1) Pre-fine-tuning alignment stage optimizes model to be in flat loss region for harmful samples, making safety alignment resilient. 2) Fine-tuning stage uses weighted learning algorithm that downweights harmful samples and upweights benign samples during training.
Result: Antibody successfully mitigates harmful fine-tuning attacks while maintaining or improving fine-tuning performance on legitimate user-submitted datasets.
Conclusion: The proposed Antibody defense provides effective protection against harmful fine-tuning attacks in fine-tuning-as-a-service scenarios through gradient regularization and safety-preserving learning algorithms.
Abstract: Fine-tuning-as-a-service introduces a threat to Large Language Models’ safety when service providers fine-tune their models on poisoned user-submitted datasets, a process known as harmful fine-tuning attacks. In this work, we show that by regularizing the gradient contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks. To this end, we introduce Antibody, a defense strategy that first ensures robust safety alignment for the model before fine-tuning, and then applies a safety-preservation learning algorithm during fine-tuning. Specifically, in the alignment stage before fine-tuning, we propose optimizing the model to be in a flat loss region with respect to harmful samples, which makes the safety alignment more resilient to subsequent harmful fine-tuning. Then, in the fine-tuning stage, we design a fine-tuning algorithm that applies a weighting scheme to all samples in each training batch to inhibit the model from learning from harmful samples while encouraging learning from benign samples. Experimental results demonstrate that Antibody successfully mitigates harmful fine-tuning attacks while boosting fine-tuning performance on the user-submitted dataset.
[1193] Trinity: A Scenario-Aware Recommendation Framework for Large-Scale Cold-Start Users
Wenhao Zheng, Wang Lu, Fangshuang Tang, Yiyang Lu, Jun Yang, Pengcheng Xiong, Yulan Yan
Main category: cs.LG
TL;DR: Trinity framework addresses cold-start challenges for new users in new scenarios through synergistic integration of feature engineering, model architecture, and stable model updating, demonstrated on a billion-user Microsoft product transition.
Details
Motivation: Early-stage users in new scenarios face intensified cold-start challenges with sparse behavioral signals, low engagement, and unstable model performance. Existing approaches often address only parts of the problem through model architecture alone.Method: Proposes Trinity framework that synergistically integrates feature engineering, model architecture, and stable model updating. Extracts valuable information from existing scenarios while ensuring predictive effectiveness and accuracy in new scenarios.
Result: Both offline and online experiments demonstrate substantial improvements in addressing the combined challenge of new users in new scenarios, applied to a billion-user Microsoft product transition.
Conclusion: Effective recommendations require synergistic integration of multiple components beyond just model architecture. Trinity provides a comprehensive framework for addressing cold-start challenges in new scenarios.
Abstract: Early-stage users in a new scenario intensify cold-start challenges, yet prior works often address only parts of the problem through model architecture. Launching a new user experience to replace an established product involves sparse behavioral signals, low-engagement cohorts, and unstable model performance. We argue that effective recommendations require the synergistic integration of feature engineering, model architecture, and stable model updating. We propose Trinity, a framework embodying this principle. Trinity extracts valuable information from existing scenarios while ensuring predictive effectiveness and accuracy in the new scenario. In this paper, we showcase Trinity applied to a billion-user Microsoft product transition. Both offline and online experiments demonstrate that our framework achieves substantial improvements in addressing the combined challenge of new users in new scenarios.
[1194] FastBUS: A Fast Bayesian Framework for Unified Weakly-Supervised Learning
Ziquan Wang, Haobo Wang, Ke Chen, Lei Feng, Gang Chen
Main category: cs.LG
TL;DR: A general framework for efficiently inferring latent true label distributions across various weakly supervised settings, achieving SOTA results with significant speed improvements.
Details
Motivation: Existing methods for handling diverse weakly supervised settings have limitations: they require complex manual pre-work, ignore relationships between associated labels, or cannot batch process efficiently due to computational design flaws, resulting in long running times.Method: Expresses label brute-force search as probabilistic transition of label variables, compressing diverse weakly supervised DFS tree structures into a shared Bayesian network. Uses generalized belief propagation for latent probability calculation, with two acceleration strategies: 1) low-rank assumption to approximate transition matrix, 2) end-to-end state evolution module for batch-scale transition matrices.
Result: Achieves state-of-the-art results under most weakly supervised settings, with up to hundreds of times faster acceleration in running time compared to other general methods.
Conclusion: Proposes an efficient general framework for weakly supervised learning that handles diverse label imprecisions while being computationally efficient and scalable.
Abstract: Machine Learning often involves various imprecise labels, leading to diverse weakly supervised settings. While recent methods aim for universal handling, they usually suffer from complex manual pre-work, ignore the relationships between associated labels, or are unable to batch process due to computational design flaws, resulting in long running times. To address these limitations, we propose a novel general framework that efficiently infers latent true label distributions across various weak supervisions. Our key idea is to express the label brute-force search process as a probabilistic transition of label variables, compressing diverse weakly supervised DFS tree structures into a shared Bayesian network. From this, we derived a latent probability calculation algorithm based on generalized belief propagation and proposed two joint acceleration strategies: 1) introducing a low-rank assumption to approximate the transition matrix, reducing time complexity; 2) designing an end-to-end state evolution module to learn batch-scale transition matrices, facilitating multi-category batch processing. In addition, the equivalence of our method with the EM algorithm in most scenarios is further demonstrated. Extensive experiments show that our method achieves SOTA results under most weakly supervised settings, and achieves up to hundreds of times faster acceleration in running time compared to other general methods.
[1195] Phys-Diff: A Physics-Inspired Latent Diffusion Model for Tropical Cyclone Forecasting
Lei Liu, Xiaoning Yu, Kang Chen, Jiahui Huang, Tengyuan Liu, Hongwei Zhao, Bin Li
Main category: cs.LG
TL;DR: Phys-Diff: A physics-inspired latent diffusion model for tropical cyclone forecasting that disentangles latent features into trajectory, pressure, and wind speed components with cross-task attention for physical consistency.
Details
Motivation: Current deep learning methods for TC forecasting often neglect physical relationships between cyclone attributes, resulting in predictions that lack physical consistency. There's a need for models that incorporate physics-inspired inductive biases while maintaining computational efficiency.Method: Proposes Phys-Diff, a physics-inspired latent diffusion model that: 1) Disentangles latent features into task-specific components (trajectory, pressure, wind speed), 2) Uses cross-task attention to introduce physics-inspired inductive biases, 3) Integrates multimodal data (historical attributes, ERA5 reanalysis, FengWu forecast fields) via Transformer encoder-decoder architecture.
Result: Demonstrates state-of-the-art performance on both global and regional tropical cyclone datasets, showing improved forecasting accuracy with physical consistency.
Conclusion: Phys-Diff successfully integrates physics-inspired constraints into deep learning for TC forecasting, achieving better performance while maintaining physical consistency among cyclone attributes through disentangled latent representations and cross-task attention mechanisms.
Abstract: Tropical cyclone (TC) forecasting is critical for disaster warning and emergency response. Deep learning methods address computational challenges but often neglect physical relationships between TC attributes, resulting in predictions lacking physical consistency. To address this, we propose Phys-Diff, a physics-inspired latent diffusion model that disentangles latent features into task-specific components (trajectory, pressure, wind speed) and employs cross-task attention to introduce prior physics-inspired inductive biases, thereby embedding physically consistent dependencies among TC attributes. Phys-Diff integrates multimodal data including historical cyclone attributes, ERA5 reanalysis data, and FengWu forecast fields via a Transformer encoder-decoder architecture, further enhancing forecasting performance. Experiments demonstrate state-of-the-art performance on global and regional datasets.
[1196] Bridge Matching Sampler: Scalable Sampling via Generalized Fixed-Point Diffusion Matching
Denis Blessing, Lorenz Richter, Julius Berner, Egor Malitskiy, Gerhard Neumann
Main category: cs.LG
TL;DR: Bridge Matching Sampler (BMS) is a new diffusion-based sampling method that learns stochastic transport maps between arbitrary distributions using a single stable objective, addressing limitations of previous approaches.
Details
Motivation: Existing diffusion-based sampling methods using least-squares matching objectives face trade-offs like restricting prior distributions or relying on unstable optimization schemes, limiting their scalability and practical application.Method: Generalizes previous methods as special forms of fixed-point iterations based on Nelson’s relation, enabling learning of stochastic transport maps between arbitrary prior and target distributions with a single scalable objective. Introduces a damped variant with regularization to mitigate mode collapse and stabilize training.
Result: Achieves state-of-the-art results on complex synthetic densities and high-dimensional molecular benchmarks, enabling sampling at unprecedented scales while preserving mode diversity.
Conclusion: BMS provides a more stable and scalable approach to diffusion-based sampling that overcomes limitations of previous methods, particularly for high-dimensional applications.
Abstract: Sampling from unnormalized densities using diffusion models has emerged as a powerful paradigm. However, while recent approaches that use least-squares `matching’ objectives have improved scalability, they often necessitate significant trade-offs, such as restricting prior distributions or relying on unstable optimization schemes. By generalizing these methods as special forms of fixed-point iterations rooted in Nelson’s relation, we develop a new method that addresses these limitations, called Bridge Matching Sampler (BMS). Our approach enables learning a stochastic transport map between arbitrary prior and target distributions with a single, scalable, and stable objective. Furthermore, we introduce a damped variant of this iteration that incorporates a regularization term to mitigate mode collapse and further stabilize training. Empirically, we demonstrate that our method enables sampling at unprecedented scales while preserving mode diversity, achieving state-of-the-art results on complex synthetic densities and high-dimensional molecular benchmarks.
[1197] Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions
Atsuki Sato, Martin Aumüller, Yusuke Matsui
Main category: cs.LG
TL;DR: Theoretical analysis of poisoning attacks on learned indexes, focusing on linear regression models for CDF approximation, with proofs of optimal single-point attacks and analysis of multi-point attack strategies.
Details
Motivation: Learned indexes use ML models to approximate CDFs for fast search, but are vulnerable to poisoning attacks that degrade performance. Previous work showed attacks are effective, but lacks theoretical understanding of attack strategies against linear regression models used in learned indexes.Method: Theoretical analysis of poisoning attacks on linear regression models for CDF approximation. Provides proofs for optimal single-point attacks, analyzes multi-point attack strategies, shows limitations of greedy approaches, and proposes method to compute upper bounds for multi-point attack impact.
Result: Proves existing single-point attack method is optimal, shows greedy multi-point approach is not always optimal, derives properties optimal attacks should satisfy, and demonstrates greedy approach often achieves near-upper-bound performance empirically.
Conclusion: Deepens theoretical understanding of attack strategies against linear regression models on CDFs, providing foundation for theoretical evaluation of attacks and defenses on learned indexes.
Abstract: Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack’s impact and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
[1198] Spectral Condition for $μ$P under Width-Depth Scaling
Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, Chongxuan Li
Main category: cs.LG
TL;DR: A unified spectral framework for μP (maximal update parameterization) under joint width-depth scaling that enables stable feature learning and hyperparameter transfer across model sizes.
Details
Motivation: Generative foundation models are scaling in both width and depth, creating challenges for stable feature learning and reliable hyperparameter transfer. While μP solves these problems for width scaling, existing extensions to joint width-depth scaling are fragmented, architecture-specific, and technically complex.Method: Develops a spectral framework for μP under joint width-depth scaling. Introduces a spectral μP condition that characterizes how weight norms and updates should scale with width and depth. Derives a general recipe for implementing μP across various optimizers by mapping spectral constraints to concrete hyperparameter parameterizations.
Result: Experiments on GPT-2 style language models show that the proposed spectral μP condition preserves stable feature learning and enables robust hyperparameter transfer under width-depth scaling. The framework unifies existing μP formulations and extends to a wider range of optimizers.
Conclusion: The spectral framework provides a simple, unified approach to μP for joint width-depth scaling, addressing key challenges in scaling generative foundation models while maintaining stability and transferability.
Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $μ$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $μ$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $μ$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $μ$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
[1199] Learning to Attack: A Bandit Approach to Adversarial Context Poisoning
Ray Telikani, Amir H. Gandomi
Main category: cs.LG
TL;DR: AdvBandit is a black-box adaptive attack framework for neural contextual bandits that learns to poison contexts without access to victim’s internal parameters, achieving higher victim regret than existing methods.
Details
Motivation: Neural contextual bandits are vulnerable to adversarial attacks through subtle perturbations to rewards, actions, or contexts, but existing attacks often require white-box access or lack adaptability to the victim's evolving policy.Method: Formulates context poisoning as a continuous-armed bandit problem. Uses maximum-entropy inverse reinforcement learning to build a surrogate model from observed context-action pairs, optimizes perturbations via projected gradient descent, and employs UCB-aware Gaussian process for arm selection with attack-budget control.
Result: Achieves higher cumulative victim regret than state-of-the-art baselines on three real-world datasets (Yelp, MovieLens, Disin) against various victim contextual bandits, with theoretical guarantees including sublinear attacker regret.
Conclusion: AdvBandit demonstrates effective black-box adaptive attacks on neural contextual bandits, highlighting security vulnerabilities in these systems and providing a framework for both attack and defense research.
Abstract: Neural contextual bandits are vulnerable to adversarial attacks, where subtle perturbations to rewards, actions, or contexts induce suboptimal decisions. We introduce AdvBandit, a black-box adaptive attack that formulates context poisoning as a continuous-armed bandit problem, enabling the attacker to jointly learn and exploit the victim’s evolving policy. The attacker requires no access to the victim’s internal parameters, reward function, or gradient information; instead, it constructs a surrogate model using a maximum-entropy inverse reinforcement learning module from observed context-action pairs and optimizes perturbations against this surrogate using projected gradient descent. An upper confidence bound-aware Gaussian process guides arm selection. An attack-budget control mechanism is also introduced to limit detection risk and overhead. We provide theoretical guarantees, including sublinear attacker regret and lower bounds on victim regret linear in the number of attacks. Experiments on three real-world datasets (Yelp, MovieLens, and Disin) against various victim contextual bandits demonstrate that our attack model achieves higher cumulative victim regret than state-of-the-art baselines.
[1200] Enhancing Molecular Property Predictions by Learning from Bond Modelling and Interactions
Yunqing Liu, Yi Zhou, Wenqi Fan
Main category: cs.LG
TL;DR: DeMol introduces a dual-graph framework for molecular representation learning that combines atom-centric and bond-centric perspectives to better capture complex chemical phenomena like resonance and stereoselectivity.
Details
Motivation: Conventional atom-centric models treat chemical bonds as simple pairwise interactions, overlooking complex bond-level phenomena like resonance and stereoselectivity, which limits predictive accuracy for nuanced chemical behaviors.Method: DeMol uses a dual-graph framework with parallel atom-centric and bond-centric channels, synergistically fused by multi-scale Double-Helix Blocks to learn intricate atom-atom, atom-bond, and bond-bond interactions. It includes geometric regularization based on covalent radii to enforce chemically plausible structures.
Result: Comprehensive evaluations on PCQM4Mv2, OC20 IS2RE, QM9, and MoleculeNet benchmarks show DeMol establishes new state-of-the-art performance, outperforming existing methods.
Conclusion: Explicitly modeling bond information and interactions leads to more robust and accurate molecular machine learning, confirming the superiority of bond-centric perspectives in molecular representation.
Abstract: Molecule representation learning is crucial for understanding and predicting molecular properties. However, conventional atom-centric models, which treat chemical bonds merely as pairwise interactions, often overlook complex bond-level phenomena like resonance and stereoselectivity. This oversight limits their predictive accuracy for nuanced chemical behaviors. To address this limitation, we introduce \textbf{DeMol}, a dual-graph framework whose architecture is motivated by a rigorous information-theoretic analysis demonstrating the information gain from a bond-centric perspective. DeMol explicitly models molecules through parallel atom-centric and bond-centric channels. These are synergistically fused by multi-scale Double-Helix Blocks designed to learn intricate atom-atom, atom-bond, and bond-bond interactions. The framework’s geometric consistency is further enhanced by a regularization term based on covalent radii to enforce chemically plausible structures. Comprehensive evaluations on diverse benchmarks, including PCQM4Mv2, OC20 IS2RE, QM9, and MoleculeNet, show that DeMol establishes a new state-of-the-art, outperforming existing methods. These results confirm the superiority of explicitly modelling bond information and interactions, paving the way for more robust and accurate molecular machine learning.
[1201] DeepAFL: Deep Analytic Federated Learning
Jianheng Tang, Yajiang Huang, Kejia Fan, Feijiang Han, Jiaxu Li, Jinfeng Xu, Run He, Anfeng Liu, Houbing Herbert Song, Huiping Zhuang, Yunhuai Liu
Main category: cs.LG
TL;DR: DeepAFL introduces a deep analytic federated learning approach with gradient-free residual blocks and layer-wise analytical solutions to address heterogeneity invariance and representation learning limitations in traditional FL.
Details
Motivation: Traditional FL suffers from heterogeneity, scalability, convergence, and overhead issues due to gradient-based updates. Existing analytic-learning approaches eliminate gradients but are limited to single-layer linear models with frozen backbones, lacking representation learning capabilities.Method: Proposes DeepAFL with gradient-free residual blocks inspired by ResNet, using analytical solutions. Introduces efficient layer-wise training protocol through least squares for deep analytic models in FL.
Result: Outperforms state-of-the-art baselines by 5.68%-8.42% across three benchmark datasets, achieving dual advantages in heterogeneity invariance and representation learning.
Conclusion: DeepAFL successfully enables representable analytic models while preserving ideal invariance to data heterogeneity in federated learning, validated by both theoretical analyses and empirical evaluations.
Abstract: Federated Learning (FL) is a popular distributed learning paradigm to break down data silo. Traditional FL approaches largely rely on gradient-based updates, facing significant issues about heterogeneity, scalability, convergence, and overhead, etc. Recently, some analytic-learning-based work has attempted to handle these issues by eliminating gradient-based updates via analytical (i.e., closed-form) solutions. Despite achieving superior invariance to data heterogeneity, these approaches are fundamentally limited by their single-layer linear model with a frozen pre-trained backbone. As a result, they can only achieve suboptimal performance due to their lack of representation learning capabilities. In this paper, to enable representable analytic models while preserving the ideal invariance to data heterogeneity for FL, we propose our Deep Analytic Federated Learning approach, named DeepAFL. Drawing inspiration from the great success of ResNet in gradient-based learning, we design gradient-free residual blocks in our DeepAFL with analytical solutions. We introduce an efficient layer-wise protocol for training our deep analytic models layer by layer in FL through least squares. Both theoretical analyses and empirical evaluations validate our DeepAFL’s superior performance with its dual advantages in heterogeneity invariance and representation learning, outperforming state-of-the-art baselines by up to 5.68%-8.42% across three benchmark datasets.
[1202] Unlearning Evaluation through Subset Statistical Independence
Chenhao Zhang, Muxing Li, Feng Liu, Weitong Chen, Miao Xu
Main category: cs.LG
TL;DR: Proposes a subset-level evaluation framework for machine unlearning using statistical independence tests, specifically the Hilbert-Schmidt Independence Criterion, to assess whether model outputs on a given subset exhibit statistical dependence without requiring retraining or auxiliary classifiers.
Details
Motivation: Existing machine unlearning evaluation methods are impractical as they require retraining reference models or performing membership inference attacks, both needing prior access to training configuration or supervision labels. Most unlearning algorithms remove small, random subsets of training data, creating a need for more practical evaluation approaches.Method: Uses a tailored application of the Hilbert-Schmidt Independence Criterion (HSIC) to assess statistical dependence between model outputs on a given subset. This provides a standalone evaluation procedure that doesn’t require model retraining or auxiliary classifiers, aligning with practical unlearning workflows.
Result: Extensive experiments show the approach reliably distinguishes in-training from out-of-training subsets and clearly differentiates unlearning effectiveness, even when existing evaluation methods fall short.
Conclusion: The proposed statistical independence-based framework provides a simple, practical evaluation method for machine unlearning that works without requiring retraining or auxiliary classifiers, addressing limitations of existing evaluation approaches.
Abstract: Evaluating machine unlearning remains challenging, as existing methods typically require retraining reference models or performing membership inference attacks, both of which rely on prior access to training configuration or supervision labels, making them impractical in realistic scenarios. Motivated by the fact that most unlearning algorithms remove a small, random subset of the training data, we propose a subset-level evaluation framework based on statistical independence. Specifically, we design a tailored use of the Hilbert-Schmidt Independence Criterion to assess whether the model outputs on a given subset exhibit statistical dependence, without requiring model retraining or auxiliary classifiers. Our method provides a simple, standalone evaluation procedure that aligns with unlearning workflows. Extensive experiments demonstrate that our approach reliably distinguishes in-training from out-of-training subsets and clearly differentiates unlearning effectiveness, even when existing evaluations fall short.
[1203] Energy-Efficient Information Representation in MNIST Classification Using Biologically Inspired Learning
Patrick Stricker, Florian Röhrbein, Andreas Knoblauch
Main category: cs.LG
TL;DR: Biologically-inspired learning rule prevents overparameterization in neural networks by optimizing synaptic usage, outperforming backpropagation in efficiency and storage capacity.
Details
Motivation: Current neural networks often become overparameterized (up to 13x), increasing redundancy and energy consumption, which is especially problematic for large language models and raises ethical/environmental concerns.Method: Analyze a biologically-inspired learning rule using information-theoretic concepts, evaluating it on MNIST classification. The rule emulates brain’s structural plasticity to optimize synaptic usage and retain only essential synapses.
Result: The proposed rule naturally prevents overparameterization, outperforms backpropagation in efficiency and storage capacity, eliminates need for pre-optimization of network architecture, enhances adaptability, and reflects brain’s ability to reserve space for new memories.
Conclusion: This approach advances scalable and energy-efficient AI and provides a promising framework for developing brain-inspired models that optimize resource allocation and adaptability.
Abstract: Efficient representation learning is essential for optimal information storage and classification. However, it is frequently overlooked in artificial neural networks (ANNs). This neglect results in networks that can become overparameterized by factors of up to 13, increasing redundancy and energy consumption. As the demand for large language models (LLMs) and their scale increase, these issues are further highlighted, raising significant ethical and environmental concerns. We analyze our previously developed biologically inspired learning rule using information-theoretic concepts, evaluating its efficiency on the MNIST classification task. The proposed rule, which emulates the brain’s structural plasticity, naturally prevents overparameterization by optimizing synaptic usage and retaining only the essential number of synapses. Furthermore, it outperforms backpropagation (BP) in terms of efficiency and storage capacity. It also eliminates the need for pre-optimization of network architecture, enhances adaptability, and reflects the brain’s ability to reserve ‘space’ for new memories. This approach advances scalable and energy-efficient AI and provides a promising framework for developing brain-inspired models that optimize resource allocation and adaptability.
[1204] Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection
Li Sun, Lanxu Yang, Jiayu Tian, Bowen Fang, Xiaoyan Yu, Junda Ye, Peng Tang, Hao Peng, Philip S. Yu
Main category: cs.LG
TL;DR: PGOS uses reinforcement learning to synthesize informative out-of-distribution graphs for improving graph neural network OOD detection, replacing static sampling heuristics with adaptive exploration.
Details
Motivation: Current unsupervised graph-level OOD detection methods suffer from incomplete feature space characterization and weak decision boundaries due to training only on in-distribution data. Existing outlier synthesis approaches use fixed, non-adaptive sampling heuristics that limit exploration of informative OOD regions.Method: Proposes Policy-Guided Outlier Synthesis (PGOS) framework that replaces static heuristics with a learned exploration strategy. Trains a reinforcement learning agent to navigate low-density regions in a structured latent space and sample representations that effectively refine OOD decision boundaries. These representations are then decoded into high-quality pseudo-OOD graphs.
Result: Extensive experiments demonstrate that PGOS achieves state-of-the-art performance on multiple graph OOD and anomaly detection benchmarks.
Conclusion: PGOS provides an effective framework for improving graph OOD detection through adaptive outlier synthesis using reinforcement learning, outperforming existing methods with fixed sampling heuristics.
Abstract: Detecting out-of-distribution (OOD) graphs is crucial for ensuring the safety and reliability of Graph Neural Networks. In unsupervised graph-level OOD detection, models are typically trained using only in-distribution (ID) data, resulting in incomplete feature space characterization and weak decision boundaries. Although synthesizing outliers offers a promising solution, existing approaches rely on fixed, non-adaptive sampling heuristics (e.g., distance- or density-based), limiting their ability to explore informative OOD regions. We propose a Policy-Guided Outlier Synthesis (PGOS) framework that replaces static heuristics with a learned exploration strategy. Specifically, PGOS trains a reinforcement learning agent to navigate low-density regions in a structured latent space and sample representations that most effectively refine the OOD decision boundary. These representations are then decoded into high-quality pseudo-OOD graphs to improve detector robustness. Extensive experiments demonstrate that PGOS achieves state-of-the-art performance on multiple graph OOD and anomaly detection benchmarks.
[1205] Multi-Domain Riemannian Graph Gluing for Building Graph Foundation Models
Li Sun, Zhenhao Huang, Silei Chen, Lanxu Yang, Junda Ye, Sen Su, Philip S. Yu
Main category: cs.LG
TL;DR: GraphGlue: A Riemannian geometry framework for multi-domain graph pre-training that unifies diverse graph datasets into a smooth manifold for systematic knowledge integration and transfer.
Details
Motivation: Existing multi-domain graph pre-training methods lack theoretical understanding of how knowledge is integrated or transferred across domains, motivating a need to rethink consistency and transferability between pre-training and domain adaptation.Method: Proposes neural manifold gluing theory: characterizes local geometry using adaptive orthogonal frames, then “glues” local pieces into a coherent Riemannian manifold. Implements GraphGlue framework with batched pre-training, EMA prototyping, and geometric consistency-based transferability measures.
Result: Superior performance across diverse graph domains, empirical validation of geometric scaling law showing that larger dataset quantities improve transferability by producing smoother manifolds.
Conclusion: The Riemannian geometry perspective provides systematic understanding of knowledge integration in graph foundation models, with GraphGlue demonstrating effective multi-domain pre-training and measurable transferability.
Abstract: Multi-domain graph pre-training integrates knowledge from diverse domains to enhance performance in the target domains, which is crucial for building graph foundation models. Despite initial success, existing solutions often fall short of answering a fundamental question: how is knowledge integrated or transferred across domains? This theoretical limitation motivates us to rethink the consistency and transferability between model pre-training and domain adaptation. In this paper, we propose a fresh Riemannian geometry perspective, whose core idea is to merge any graph dataset into a unified, smooth Riemannian manifold, enabling a systematic understanding of knowledge integration and transfer. To achieve this, our key contribution is the theoretical establishment of neural manifold gluing, which first characterizes local geometry using an adaptive orthogonal frame and then “glues” the local pieces together into a coherent whole. Building on this theory, we present the GraphGlue framework, which supports batched pre-training with EMA prototyping and provides a transferability measure based on geometric consistence. Extensive experiments demonstrate its superior performance across diverse graph domains. Moreover, we empirically validated GraphGlue’s geometric scaling law, showing that larger quantities of datasets improve model transferability by producing a smoother manifold. Codes are available at https://github.com/RiemannGraph/GraphGlue.
[1206] IDER: IDempotent Experience Replay for Reliable Continual Learning
Zhanwang Liu, Yuting Li, Haoyuan Gao, Yexin Li, Linghe Kong, Lichao Sun, Weiran Huang
Main category: cs.LG
TL;DR: IDER (Idempotent Experience Replay) addresses catastrophic forgetting in continual learning by making models idempotent through adapted training loss and distillation, improving reliability, accuracy, and reducing forgetting.
Details
Motivation: Continual learning models suffer from catastrophic forgetting and lack uncertainty awareness. Existing uncertainty-aware methods have high computational overhead and incompatibility with replay methods, limiting practical deployment in mission-critical settings.Method: Proposes IDER with two components: 1) adapted training loss to make models idempotent on current data streams, and 2) idempotence distillation loss that feeds current model output back into old checkpoint and minimizes distance between reprocessed and original outputs.
Result: Extensive experiments on CL benchmarks show IDER consistently improves prediction reliability while boosting accuracy and reducing forgetting. Can be seamlessly integrated with other CL approaches.
Conclusion: Idempotence is a promising principle for deploying efficient and trustworthy continual learning systems in real-world applications. IDER provides a simple and effective baseline for reliable continual learners.
Abstract: Catastrophic forgetting, the tendency of neural networks to forget previously learned knowledge when learning new tasks, has been a major challenge in continual learning (CL). To tackle this challenge, CL methods have been proposed and shown to reduce forgetting. Furthermore, CL models deployed in mission-critical settings can benefit from uncertainty awareness by calibrating their predictions to reliably assess their confidences. However, existing uncertainty-aware continual learning methods suffer from high computational overhead and incompatibility with mainstream replay methods. To address this, we propose idempotent experience replay (IDER), a novel approach based on the idempotent property where repeated function applications yield the same output. Specifically, we first adapt the training loss to make model idempotent on current data streams. In addition, we introduce an idempotence distillation loss. We feed the output of the current model back into the old checkpoint and then minimize the distance between this reprocessed output and the original output of the current model. This yields a simple and effective new baseline for building reliable continual learners, which can be seamlessly integrated with other CL approaches. Extensive experiments on different CL benchmarks demonstrate that IDER consistently improves prediction reliability while simultaneously boosting accuracy and reducing forgetting. Our results suggest the potential of idempotence as a promising principle for deploying efficient and trustworthy continual learning systems in real-world applications.Our code is available at https://github.com/YutingLi0606/Idempotent-Continual-Learning.
[1207] Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models
Yunzhong Qiu, Zhiyao Cen, Zhongyi Pei, Chen Wang, Jianmin Wang
Main category: cs.LG
TL;DR: TATO is a data-centric framework that optimizes transformation pipelines to help frozen pre-trained large time series models adapt to diverse domains without retraining.
Details
Motivation: Large time series models struggle with real-world data diversity and nonstationarity, creating unsatisfactory trade-offs between accuracy and generalization. Instead of fine-tuning new instances for each domain, the authors propose adapting a single frozen model through data transformations.Method: TATO constructs three transformation types (context slicing, scale normalization, outlier correction) and uses time series augmentations with a two-stage ranking mechanism to filter underperforming pipelines, optimizing how data is presented to frozen LTMs.
Result: Extensive experiments show TATO consistently improves domain-adaptive forecasting, achieving up to 65.4% MSE reduction (average 13.6%), with optimization typically completing in under 2 minutes.
Conclusion: TATO provides an efficient, data-centric approach to adapt frozen pre-trained time series models to diverse domains through optimized transformation pipelines, offering practical real-world deployment benefits.
Abstract: Large time series models (LTMs) have emerged as powerful tools for universal forecasting, yet they often struggle with the inherent diversity and nonstationarity of real-world time series data, leading to an unsatisfactory trade-off between forecasting accuracy and generalization. Rather than continually finetuning new LTM instances for each domain, we propose a data-centric framework, time-series adaptive transformation optimization (TATO), that enables a single frozen pre-trained LTM to adapt to diverse downstream domains through an optimally configured transformation pipeline. Specifically, TATO constructs three representative types of transformations, including context slicing, scale normalization, and outlier correction, to help LTMs better align with target domain characteristics. To ensure robustness, we incorporate carefully selected time series augmentations and a two-stage ranking mechanism that filters out pipelines underperforming on specific metrics. Extensive experiments on state-of-the-art LTMs and widely used datasets demonstrate that TATO consistently and significantly improves domain-adaptive forecasting performance, achieving a maximum reduction in MSE of 65.4% and an average reduction of 13.6%. Moreover, TATO is highly efficient, typically completing optimization in under 2 minutes, making it practical for real-world deployment. The source code is available at https://github.com/thulab/TATO.
[1208] Retrodictive Forecasting: A Proof-of-Concept for Exploiting Temporal Asymmetry in Time Series Prediction
Cedric Damour
Main category: cs.LG
TL;DR: Retrodictive forecasting approach that identifies future states that best explain present observations via inverse MAP optimization over a CVAE, using time-irreversibility as a diagnostic for applicability.
Details
Motivation: To develop an alternative forecasting paradigm that leverages statistical time-irreversibility by identifying futures that best explain present observations, rather than predicting future from past.Method: Uses inverse MAP optimization over Conditional Variational Autoencoder (CVAE) with RealNVP normalizing-flow prior, grounded in information-theoretic arrow-of-time measure (symmetrized KL divergence between forward and time-reversed trajectories).
Result: Method correctly classified all six test cases using irreversibility diagnostic, improved over Gaussian baseline on applicable cases, showed no advantage on reversible dynamics, and achieved competitive/superior RMSE on irreversible cases (17.7% reduction on ERA5 solar irradiance).
Conclusion: Retrodictive forecasting can be a viable alternative to forward prediction when statistical time-irreversibility is present and exploitable, with structured validation supporting its effectiveness.
Abstract: We propose a retrodictive forecasting paradigm for time series: instead of predicting the future from the past, we identify the future that best explains the observed present via inverse MAP optimization over a Conditional Variational Autoencoder (CVAE). This conditioning is a statistical modeling choice for Bayesian inversion; it does not assert that future events cause past observations. The approach is theoretically grounded in an information-theoretic arrow-of-time measure: the symmetrized Kullback-Leibler divergence between forward and time-reversed trajectory ensembles provides both the conceptual rationale and an operational GO/NO-GO diagnostic for applicability. We implement the paradigm as MAP inference over an inverse CVAE with a learned RealNVP normalizing-flow prior and evaluate it on six time series cases: four synthetic processes with controlled temporal asymmetry and two ERA5 reanalysis datasets (wind speed and solar irradiance). The work makes four contributions: (i) a formal retrodictive inference formulation; (ii) an inverse CVAE architecture; (iii) a model-free irreversibility diagnostic; and (iv) a falsifiable validation protocol with four pre-specified predictions. All pre-specified predictions are empirically supported: the diagnostic correctly classifies all six cases; the learned flow prior improves over an isotropic Gaussian baseline on GO cases; the inverse MAP yields no spurious advantage on time-reversible dynamics; and on irreversible GO cases, it achieves competitive or superior RMSE relative to forward baselines, with a statistically significant 17.7% reduction over a forward MLP on ERA5 solar irradiance. These results provide a structured proof-of-concept that retrodictive forecasting can constitute a viable alternative to conventional forward prediction when statistical time-irreversibility is present and exploitable.
[1209] Reward-Modulated Local Learning in Spiking Encoders: Controlled Benchmarks with STDP and Hybrid Rate Readouts
Debjyoti Chakraborty
Main category: cs.LG
TL;DR: Empirical study of biologically-inspired local learning for handwritten digit recognition using STDP-inspired competitive proxy and hybrid spike-based models, achieving 86-87% accuracy vs 98% classical baselines.
Details
Motivation: To empirically evaluate biologically motivated local learning approaches for handwritten digit recognition, comparing STDP-inspired competitive learning with practical hybrid benchmarks using spiking neural networks.Method: Uses STDP-inspired competitive proxy based on leaky integrate-and-fire E/I circuit models with three-factor delayed reward modulation, and a hybrid update that is local in pre x post rates but uses supervised labels without timing-based credit assignment. Evaluated on sklearn digits dataset with fixed-seed evaluation.
Result: Classical pixel baselines achieved 98.06-98.22% accuracy, while local spike-based models reached 86.39% (hybrid) and 87.17% (STDP-style). Best hybrid ablation achieved 95.52%. Ablations identified normalization and reward-shaping as strongest levers, with reward-shaping effects reversing sign across stabilization regimes.
Conclusion: Biologically inspired local learning approaches can achieve reasonable performance on handwritten digit recognition but lag behind classical methods. Reward-shaping conclusions should be reported jointly with normalization settings due to interaction effects.
Abstract: This paper presents a controlled empirical study of biologically motivated local learning for handwritten digit recognition. We evaluate an STDP-inspired competitive proxy and a practical hybrid benchmark built on the same spiking population encoder. The proxy is motivated by leaky integrate-and-fire E/I circuit models with three-factor delayed reward modulation. The hybrid update is local in pre x post rates but uses supervised labels and no timing-based credit assignment. On sklearn digits, fixed-seed evaluation shows classical pixel baselines from 98.06 to 98.22% accuracy, while local spike-based models reach 86.39 +/- 4.75% (hybrid default) and 87.17 +/- 3.74% (STDP-style competitive proxy). Ablations identify normalization and reward-shaping settings as the strongest observed levers, with a best hybrid ablation of 95.52 +/- 1.11%. A network-free synthetic temporal benchmark supports the same timing-versus-rate interpretation under matched local-update training. A descriptive 2x2 analysis further shows reward-shaping effects can reverse sign across stabilization regimes, so reward-shaping conclusions should be reported jointly with normalization settings.
[1210] Frozen Policy Iteration: Computationally Efficient RL under Linear $Q^π$ Realizability for Deterministic Dynamics
Yijing Ke, Zihan Zhang, Ruosong Wang
Main category: cs.LG
TL;DR: Frozen Policy Iteration: computationally efficient online RL algorithm for MDPs with linear Q-function realizability, stochastic initial states/rewards, and deterministic transitions, achieving optimal regret bounds.
Details
Motivation: Prior methods for linear Q-function realizability are either computationally intractable or require simulator access. Need efficient online RL algorithms that work with stochastic initial states without simulator access.Method: Proposes Frozen Policy Iteration algorithm that strategically uses only high-confidence trajectory data and freezes policies for well-explored states to ensure all data remains effectively on-policy during learning.
Result: Achieves regret bound of Õ(√(d²H⁶T)), optimal for linear bandits (H=1). Extends to Uniform-PAC setting and function classes with bounded eluder dimension.
Conclusion: First computationally efficient online RL algorithm for linear Q-function realizability with stochastic initial states, circumventing limitations of prior methods that require simulator access.
Abstract: We study computationally and statistically efficient reinforcement learning under the linear $Q^π$ realizability assumption, where any policy’s $Q$-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable, or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, under the linear $Q^π$ realizability setting that works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards and deterministic transitions. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{d^2H^6T})$, where $d$ is the dimensionality of the feature space, $H$ is the horizon length, and $T$ is the total number of episodes. Our regret bound is optimal for linear (contextual) bandits which is a special case of our setting with $H = 1$. Existing policy iteration algorithms under the same setting heavily rely on repeatedly sampling the same state by access to the simulator, which is not implementable in the online setting with stochastic initial states studied in this paper. In contrast, our new algorithm circumvents this limitation by strategically using only high-confidence part of the trajectory data and freezing the policy for well-explored states, which ensures that all data used by our algorithm remains effectively on-policy during the whole course of learning. We further demonstrate the versatility of our approach by extending it to the Uniform-PAC setting and to function classes with bounded eluder dimension.
[1211] MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search
Minkyoung Cho, Insu Jang, Shuowei Jin, Zesen Zhao, Adityan Jothi, Ethem F. Can, Min-Hung Chen, Z. Morley Mao
Main category: cs.LG
TL;DR: MARS: Automated rank search for LoRA fine-tuning of MLLMs using dual scaling laws to balance multimodal training dynamics and maximize performance.
Details
Motivation: Current parameter-efficient fine-tuning methods for Multimodal Large Language Models (MLLMs) suffer from imbalanced training dynamics across modalities, leading to suboptimal accuracy due to negative interference. Existing heuristic solutions like manually tuning separate learning rates are inefficient.Method: Proposes MARS (Multimodal Adaptive Rank Search) with dual scaling laws: 1) models module-specific convergence time to prune search space to candidates with aligned dynamics, 2) predicts final task performance to select optimal rank pair. Repurposes LoRA rank as controller for modality-specific convergence speed.
Result: MARS outperforms baseline methods and provides robust, automated strategy for optimizing MLLM fine-tuning.
Conclusion: MARS offers an effective automated approach to address imbalanced training dynamics in MLLM fine-tuning by discovering optimal rank pairs through dual scaling laws, improving performance over heuristic methods.
Abstract: Fine-tuning Multimodal Large Language Models (MLLMs) with parameter-efficient methods like Low-Rank Adaptation (LoRA) is crucial for task adaptation. However, imbalanced training dynamics across modalities often lead to suboptimal accuracy due to negative interference, a challenge typically addressed with inefficient heuristic methods such as manually tuning separate learning rates. To overcome this, we introduce MARS (Multimodal Adaptive Rank Search), an approach to discover optimal rank pairs that balance training dynamics while maximizing performance. Our key innovation, a proposed framework of dual scaling laws, enables this search: one law models module-specific convergence time to prune the search space to candidates with aligned dynamics, while the other predicts final task performance to select the optimal pair from the pruned set. By re-purposing the LoRA rank as a controller for modality-specific convergence speed, MARS outperforms baseline methods and provides a robust, automated strategy for optimizing MLLM fine-tuning.
[1212] To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters
Sara Dragutinović, Rajesh Ranganath
Main category: cs.LG
TL;DR: Muon optimizer’s speedup comes with biases that remove SGD’s natural simplicity bias, potentially causing models to fit spurious features and struggle with common underlying structure across tasks.
Details
Motivation: While Muon has gained popularity for superior training speed, this paper investigates the potential downsides and biases introduced by its optimization mechanism, contrasting it with SGD's well-studied properties.Method: Theoretical analysis of Muon’s optimization biases, examination of learning trajectories and solutions learned, and demonstration through examples where Muon-optimized models show disadvantages compared to SGD.
Result: Muon removes the simplicity bias naturally preserved by SGD, potentially leading to models that struggle to uncover common underlying structure across tasks and are more prone to fitting spurious features.
Conclusion: When developing new optimizers, it’s essential to consider the biases they introduce as these can fundamentally change model behavior; Muon’s speed benefits come with trade-offs in solution quality and generalization properties.
Abstract: For a long period of time, Adam has served as the ubiquitous default choice for training deep neural networks. Recently, many new optimizers have been introduced, out of which Muon has perhaps gained the highest popularity due to its superior training speed. While many papers set out to validate the benefits of Muon, our paper investigates the potential downsides stemming from the mechanism driving this speedup. We explore the biases induced when optimizing with Muon, providing theoretical analysis and its consequences to the learning trajectories and solutions learned. While the theory does provide justification for the benefits Muon brings, it also guides our intuition when coming up with a couple of examples where Muon-optimized models have disadvantages. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more thoroughly studied methods like Stochastic Gradient Descent (SGD). We take first steps toward understanding consequences this may have: Muon might struggle to uncover common underlying structure across tasks, and be more prone to fitting spurious features. More broadly, this paper should serve as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, as these biases can fundamentally change a model’s behavior – for better or for worse.
[1213] ResGene-T: A Tensor-Based Residual Network Approach for Genomic Prediction
Kuldeep Pathak, Kapil Ahuja, Eric de Sturler
Main category: cs.LG
TL;DR: A deep learning approach for genomic prediction using 2D image and 3D tensor representations of genotype data, with ResGene-T achieving significant performance improvements over existing methods.
Details
Motivation: To improve genomic prediction by representing genotype data as 2D images and 3D tensors rather than traditional 1D sequences, capturing biological interactions more effectively.Method: Proposes two models: ResGene-2D (2D image representation with ResNet-18) and ResGene-T (3D tensor representation with ResNet-18). Compares against seven baseline models including statistical, machine learning, and deep learning approaches on three crop species with ten phenotypic traits.
Result: ResGene-2D showed modest improvement (3%) over existing methods, while ResGene-T achieved substantial gains ranging from 14.51% to 41.51% improvement over all seven baseline models.
Conclusion: 3D tensor representation of genotype data significantly outperforms traditional approaches for genomic prediction, demonstrating the value of spatial representations in capturing complex biological interactions.
Abstract: In this work, we propose a new deep learning model for Genomic Prediction (GP), which involves correlating genotypic data with phenotypic. The genotypes are typically fed as a sequence of characters to the 1D-Convolution Neural Network layer of the underlying deep learning model. Inspired by earlier work that represented genotype as a 2D-image for genotype-phenotype classification, we extend this idea to GP, which is a regression task. We use a ResNet-18 as the underlying architecture, and term this model as ResGene-2D. Although the 2D-image representation captures biological interactions well, it requires all the layers of the model to do so. This limits training efficiency. Thus, as seen in the earlier work that proposed a 2D-image representation, our ResGene-2D performs almost the same as other models (3% improvement). To overcome this, we propose a novel idea of converting the 2D-image into a 3D/ tensor and feed this to the ResNet-18 architecture, and term this model as ResGene-T. We evaluate our proposed models on three crop species having ten phenotypic traits and compare it with seven most popular models (two statistical, two machine learning, and three deep learning). ResGene-T performs the best among all these seven methods (gains from 14.51% to 41.51%).
[1214] Bi-cLSTM: Residual-Corrected Bidirectional LSTM for Aero-Engine RUL Estimation
Rafi Hassan Chowdhury, Nabil Daiyan, Faria Ahmed, Md Redwan Iqbal, Morsalin Sheikh
Main category: cs.LG
TL;DR: A novel Bidirectional Residual Corrected LSTM (Bi-cLSTM) model for robust Remaining Useful Life prediction in aero-engines, combining bidirectional temporal modeling with adaptive residual correction and condition-aware preprocessing.
Details
Motivation: Existing LSTM-based RUL prediction models struggle to generalize across varying operating conditions and are sensitive to noise in multivariate sensor data, limiting their effectiveness in safety-critical systems like aero-engines.Method: Proposes Bi-cLSTM architecture with bidirectional temporal modeling and adaptive residual correction mechanism, plus condition-aware preprocessing pipeline with regime-based normalization, feature selection, and exponential smoothing.
Result: Extensive experiments on NASA C-MAPSS dataset show Bi-cLSTM consistently outperforms LSTM baselines and achieves competitive state-of-the-art performance, especially in challenging multi-condition scenarios.
Conclusion: Combining bidirectional temporal learning with residual correction is effective for reliable RUL prediction, demonstrating robustness in complex operating environments.
Abstract: Accurate Remaining Useful Life (RUL) prediction is a key requirement for effective Prognostics and Health Management (PHM) in safety-critical systems such as aero-engines. Existing deep learning approaches, particularly LSTM-based models, often struggle to generalize across varying operating conditions and are sensitive to noise in multivariate sensor data. To address these challenges, we propose a novel Bidirectional Residual Corrected LSTM (Bi-cLSTM) model for robust RUL estimation. The proposed architecture combines bidirectional temporal modeling with an adaptive residual correction mechanism to iteratively refine sequence representations. In addition, we introduce a condition-aware preprocessing pipeline incorporating regime-based normalization, feature selection, and exponential smoothing to improve robustness under complex operating environments. Extensive experiments on all four subsets of the NASA C-MAPSS dataset demonstrate that the proposed Bi-cLSTM consistently outperforms LSTM-based baselines and achieves competitive state-of-the-art performance, particularly in challenging multi-condition scenarios. These results highlight the effectiveness of combining bidirectional temporal learning with residual correction for reliable RUL prediction.
[1215] General Proximal Flow Networks
Alexander Strunk, Roland Assam
Main category: cs.LG
TL;DR: GPFNs generalize Bayesian Flow Networks by allowing arbitrary divergences (like Wasserstein) instead of just KL divergence for belief updates, creating a unified proximal-operator framework for iterative generative modeling.
Details
Motivation: Bayesian Flow Networks use fixed Bayesian posterior updates (equivalent to KL divergence proximal steps), limiting flexibility. The authors aim to create a more general framework that can adapt to different data geometries by allowing arbitrary divergences.Method: GPFNs replace the fixed Bayesian update with arbitrary divergence/distance functions (e.g., Wasserstein distance), establishing a unified proximal-operator framework. They derive corresponding training and sampling procedures that recover standard BFN as a special case.
Result: Empirical evaluations show that adapting the divergence to the underlying data geometry yields measurable improvements in generation quality, demonstrating practical benefits of the broader framework.
Conclusion: GPFNs provide a flexible generalization of Bayesian Flow Networks that can improve generative modeling by using divergences better suited to specific data geometries, with formal connections to proximal optimization.
Abstract: This paper introduces General Proximal Flow Networks (GPFNs), a generalization of Bayesian Flow Networks that broadens the class of admissible belief-update operators. In Bayesian Flow Networks, each update step is a Bayesian posterior update, which is equivalent to a proximal step with respect to the Kullback-Leibler divergence. GPFNs replace this fixed choice with an arbitrary divergence or distance function, such as the Wasserstein distance, yielding a unified proximal-operator framework for iterative generative modeling. The corresponding training and sampling procedures are derived, establishing a formal link to proximal optimization and recovering the standard BFN update as a special case. Empirical evaluations confirm that adapting the divergence to the underlying data geometry yields measurable improvements in generation quality, highlighting the practical benefits of this broader framework.
[1216] Identifying and Characterising Response in Clinical Trials: Development and Validation of a Machine Learning Approach in Colorectal Cancer
Adam Marcus, Paul Agapow
Main category: cs.LG
TL;DR: Combines partly conditional modeling with Virtual Twins method for treatment effect estimation, using survLIME for interpretability, applied to identify patient subgroups with different treatment responses in clinical trials with repeated measures.
Details
Motivation: Precision medicine requires identifying patient subgroups with different treatment responses, but current approaches neglect repeated measures in clinical trials and lack interpretability for survival data.Method: Combines partly conditional modeling with Virtual Twins method for treatment effect estimation, uses survLIME (extension of LIME to survival data) for interpretability, evaluates on synthetic data and applies to metastatic colorectal cancer trials.
Result: AUC of 0.77 for fixed responders in simulation; partly conditional modeling increased AUC for dynamic responders from 0.597 to 0.685; identified genetic mutations, metastasis sites, and ethnicity as important factors in colorectal cancer trials.
Conclusion: The approach accommodates dynamic treatment responses with better performance than existing methods for fixed responses, provides interpretable results consistent with literature, and enables precision medicine applications.
Abstract: Precision medicine promises to transform health care by offering individualised treatments that dramatically improve clinical outcomes. A necessary prerequisite is to identify subgroups of patients who respond differently to different therapies. Current approaches are limited to static measures of treatment success, neglecting the repeated measures found in most clinical trials. Our approach combines the concept of partly conditional modelling with treatment effect estimation based on the Virtual Twins method. The resulting time-specific responses to treatment are characterised using survLIME, an extension of Local Interpretable Model-agnostic Explanations (LIME) to survival data. Performance was evaluated using synthetic data and applied to clinical trials examining the effectiveness of panitumumab to treat metastatic colorectal cancer. An area under the receiver operating characteristic curve (AUC) of 0.77 for identifying fixed responders was achieved in a 1000 patient simulation. When considering dynamic responders, partly conditional modelling increased the AUC from 0.597 to 0.685. Applying the approach to colorectal cancer trials found genetic mutations, sites of metastasis, and ethnicity as important factors for response to treatment. Our approach can accommodate a dynamic response to treatment while potentially providing better performance than existing methods in instances of a fixed response to treatment. When applied to clinical data we attain results consistent with the literature.
[1217] Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning
Karanpartap Singh, Adam Turnbull, Mohammad Abbasi, Kilian Pohl, Feng Vankee Lin, Ehsan Adeli
Main category: cs.LG
TL;DR: BrainInterNet is a network-aware self-supervised framework using masked reconstruction with cross-attention to model inter-network dependencies in resting-state fMRI, enabling interpretable analysis of functional reorganization in Alzheimer’s disease.
Details
Motivation: Current self-supervised models for fMRI analysis lack interpretability, limiting mechanistic insights into how large-scale functional brain networks reorganize during cognitive decline in conditions like Alzheimer's disease.Method: Proposes BrainInterNet, a network-aware self-supervised framework based on masked reconstruction with cross-attention that selectively masks predefined functional networks and reconstructs them from remaining context to model inter-network dependencies in rs-fMRI.
Result: Trained on multi-cohort fMRI data (5,582 recordings), reveals systematic alterations in brain network interactions under AD, supports accurate Alzheimer’s-spectrum classification, and yields a compact summary marker tracking disease severity longitudinally.
Conclusion: Network-guided masked modeling with cross-attention provides an interpretable and effective framework for characterizing functional reorganization in neurodegeneration, particularly Alzheimer’s disease.
Abstract: Understanding how large-scale functional brain networks reorganize during cognitive decline remains a central challenge in neuroimaging. While recent self-supervised models have shown promise for learning representations from resting-state fMRI, their internal mechanisms are difficult to interpret, limiting mechanistic insight. We propose BrainInterNet, a network-aware self-supervised framework based on masked reconstruction with cross-attention that explicitly models inter-network dependencies in rs-fMRI. By selectively masking predefined functional networks and reconstructing them from remaining context, our approach enables direct quantification of network predictability and interpretable analysis of cross-network interactions. We train BrainInterNet on multi-cohort fMRI data (from the ABCD, HCP Development, HCP Young Adults, and HCP Aging datasets) and evaluate on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, in total comprising 5,582 recordings. Our method reveals systematic alterations in the brain’s network interactions under AD, including in the default mode, limbic, and attention networks. In parallel, the learned representations support accurate Alzheimer’s-spectrum classification and yield a compact summary marker that tracks disease severity longitudinally. Together, these results demonstrate that network-guided masked modeling with cross-attention provides an interpretable and effective framework for characterizing functional reorganization in neurodegeneration.
[1218] Identifying the Geographic Foci of US Local News
Gangani Ariyarathne, Isuru Ariyarathne, Greatness Emmanuel-King, Kate Lawal, Alexander C. Nwala
Main category: cs.LG
TL;DR: A geo-foci model for labeling US local news articles with geographic locations using LLMs for entity disambiguation and spatial-semantic features for classification.
Details
Motivation: Local journalism is vital for democratic societies but faces economic pressures, creating a need to identify salient geographical locations (geo-foci) covered in local news to study local media and assess narrative shifts.Method: 1) Manually labeled US local news articles with administrative division labels; 2) Used LLMs to extract and disambiguate geographic entities; 3) Engineered spatial-semantic features capturing prominence, frequency, and contextual positions; 4) Trained a classifier to detect geographic foci.
Result: LLMs outperformed all eight geographic entity disambiguation methods evaluated. The trained classifier achieved an F1 score of 0.86 for detecting geographic foci of US local news articles.
Conclusion: The proposed geo-foci model can be applied to assess shifts from local to national narratives and enable researchers to better study local media through automated geographic focus detection.
Abstract: Local journalism is vital in democratic societies where it informs people about local issues like, school board elections, small businesses, local health services, etc. But mounting economic pressures have made it increasingly difficult for local news stations to report these issues, underscoring the need to identify the salient geographical locations covered in local news (geo-foci). In response, we propose a novel geo-foci model for labeling US local news articles with the geographic locations (i.e., the names of counties, cities, states, countries) central to their subject matter. First, we manually labeled US local news articles from all 50 states with four administrative division labels (local, state, national, and international) corresponding to their geo-foci, and none for articles without a geographic focus. Second, we extracted and disambiguated geographic locations from them using Large Language Models (LLMs), since local news often contains ambiguous geographic entities (e.g., Paris, Texas vs. Paris, France). LLMs outperformed all eight geographic entity disambiguation methods we evaluated. Third, we engineered a rich set of spatial-semantic features capturing the prominence, frequency, and contextual positions of geographic entities. Using these features, we trained a classifier to accurately (F1: 0.86) detect the geographic foci of US local news articles. Our model could be applied to assess shifts from local to national narratives, and more broadly, enable researchers to better study local media.
[1219] Neural Latent Arbitrary Lagrangian-Eulerian Grids for Fluid-Solid Interaction
Shilong Tao, Zhe Feng, Shaohan Chen, Weichen Zhang, Zhanxing Zhu, Yunhuai Liu
Main category: cs.LG
TL;DR: Fisale is a data-driven framework for complex two-way fluid-solid interaction problems using multiscale latent ALE grids and partitioned coupling inspired by classical numerical methods.
Details
Motivation: Existing deep learning methods are limited to simplified one-way FSI scenarios or struggle with dynamic, heterogeneous interactions due to lack of cross-domain awareness in two-way setups.Method: Inspired by Arbitrary Lagrangian-Eulerian method and partitioned coupling algorithm. Models coupling interface as distinct component, uses multiscale latent ALE grids for unified geometry-aware embeddings, and employs partitioned coupling module to decompose problem into structured substeps.
Result: Fisale excels in three reality-related challenging FSI scenarios covering 2D, 3D and various tasks, demonstrating scalable learning of complex two-way FSI behaviors.
Conclusion: Fisale provides a flexible framework that iteratively handles complex dynamics of solid, fluid and their coupling interface on unified representation, enabling better modeling of nonlinear interdependencies in FSI problems.
Abstract: Fluid-solid interaction (FSI) problems are fundamental in many scientific and engineering applications, yet effectively capturing the highly nonlinear two-way interactions remains a significant challenge. Most existing deep learning methods are limited to simplified one-way FSI scenarios, often assuming rigid and static solid to reduce complexity. Even in two-way setups, prevailing approaches struggle to capture dynamic, heterogeneous interactions due to the lack of cross-domain awareness. In this paper, we introduce \textbf{Fisale}, a data-driven framework for handling complex two-way \textbf{FSI} problems. It is inspired by classical numerical methods, namely the Arbitrary Lagrangian-Eulerian (\textbf{ALE}) method and the partitioned coupling algorithm. Fisale explicitly models the coupling interface as a distinct component and leverages multiscale latent ALE grids to provide unified, geometry-aware embeddings across domains. A partitioned coupling module (PCM) further decomposes the problem into structured substeps, enabling progressive modeling of nonlinear interdependencies. Compared to existing models, Fisale introduces a more flexible framework that iteratively handles complex dynamics of solid, fluid and their coupling interface on a unified representation, and enables scalable learning of complex two-way FSI behaviors. Experimentally, Fisale excels in three reality-related challenging FSI scenarios, covering 2D, 3D and various tasks. The code is available at \href{https://github.com/therontau0054/Fisale}.
[1220] Lookahead identification in adversarial bandits: accuracy and memory bounds
Nataly Brukhim, Nicolò Cesa-Bianchi, Carlo Ciliberto
Main category: cs.LG
TL;DR: The paper studies lookahead identification in multi-armed bandits, where a learner aims to identify an arm that will perform best in a future time window, even in adversarial environments where past performance may not predict future outcomes.
Details
Motivation: The motivation is to understand whether meaningful identification of future-best arms is possible in adversarial multi-armed bandit settings, where past performance offers little information about future outcomes. This addresses a fundamental limitation in traditional bandit problems.Method: The authors introduce lookahead identification, where the learner selects a future prediction window and commits in advance to an arm whose average reward over that window is near-optimal. They analyze achievable accuracy bounds and memory requirements, providing algorithms with specific ε-optimality guarantees and proving matching lower bounds.
Result: The paper shows that identification is possible in adversarial settings despite information limitations, achieving ε = O(1/√log T) over Ω(√T) prediction windows. They prove a near-matching lower bound of ε = Ω(1/log T) and demonstrate that nontrivial accuracy requires Ω(K) bits of memory, though poly-logarithmic memory suffices under local sparsity conditions.
Conclusion: Lookahead identification is feasible in adversarial bandit environments with meaningful accuracy guarantees, though memory requirements depend on problem structure. The work establishes fundamental limits on what can be achieved in this identification problem.
Abstract: We study an identification problem in multi-armed bandits. In each round a learner selects one of $K$ arms and observes its reward, with the goal of eventually identifying an arm that will perform best at a {\it future} time. In adversarial environments, however, past performance may offer little information about the future, raising the question of whether meaningful identification is possible at all. In this work, we introduce \emph{lookahead identification}, a task in which the goal of the learner is to select a future prediction window and commit in advance to an arm whose average reward over that window is within $\varepsilon$ of optimal. Our analysis characterizes both the achievable accuracy of lookahead identification and the memory resources required to obtain it. From an accuracy standpoint, for any horizon $T$ we give an algorithm achieving $\varepsilon = O\bigl(1/\sqrt{\log T}\bigr)$ over $Ω(\sqrt{T})$ prediction windows. This demonstrates that, perhaps surprisingly, identification is possible in adversarial settings, despite significant lack of information. We also prove a near-matching lower bound showing that $\varepsilon = Ω\bigl(1/\log T\bigr)$ is unavoidable. We then turn to investigate the role of memory in our problem, first proving that any algorithm achieving nontrivial accuracy requires $Ω(K)$ bits of memory. Under a natural \emph{local sparsity} condition, we show that the same accuracy guarantees can be achieved using only poly-logarithmic memory.
[1221] Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning
Dariush Wahdany, Matthew Jagielski, Adam Dziedzic, Franziska Boenisch
Main category: cs.LG
TL;DR: Data curation pipelines for private ML can leak sensitive information through multiple attack vectors, requiring formal privacy guarantees beyond just training procedures.
Details
Motivation: Data curation is used to select valuable data for ML efficiency and accuracy, and has been explored for private ML by using private data only to guide selection of public data. However, there's an assumption that models trained solely on curated public data are privacy-preserving, which this paper challenges.Method: The paper introduces novel attacks against popular curation methods targeting three major steps: computation of curation scores, selection of curated subsets, and the final trained model. It demonstrates information leakage at each stage and proposes differentially private adaptations of curation methods to mitigate leakage.
Result: The attacks show that curation pipelines leak private information, with models trained exclusively on curated public data still leaking membership information about the private data that guided curation. Differentially private adaptations effectively mitigate this leakage.
Conclusion: Data curation pipelines have inherent privacy risks that must be addressed, requiring privacy assessment to extend beyond training procedures to include data selection processes. Formal privacy guarantees for curation are a promising direction for secure private ML.
Abstract: In machine learning, curation is used to select the most valuable data for improving both model accuracy and computational efficiency. Recently, curation has also been explored as a solution for private machine learning: rather than training directly on sensitive data, which is known to leak information through model predictions, the private data is used only to guide the selection of useful public data. The resulting model is then trained solely on curated public data. It is tempting to assume that such a model is privacy-preserving because it has never seen the private data. Yet, we show that without further protection, curation pipelines can still leak private information. Specifically, we introduce novel attacks against popular curation methods, targeting every major step: the computation of curation scores, the selection of the curated subset, and the final trained model. We demonstrate that each stage reveals information about the private dataset and that even models trained exclusively on curated public data leak membership information about the private data that guided curation. These findings highlight the previously overlooked inherent privacy risks of data curation and show that privacy assessment must extend beyond the training procedure to include the data selection process. Our differentially private adaptations of curation methods effectively mitigate leakage, indicating that formal privacy guarantees for curation are a promising direction.
[1222] Wave-Attractor-Tree: A Hierarchical Binary Tree Reduction Architecture for Efficient Sequence Modeling
Igor Berezkin
Main category: cs.LG
TL;DR: Hierarchical binary tree-based reduction replaces standard self-attention with recursive Gated Linear Unit merges, achieving linear complexity and better performance on long-range structural dependencies.
Details
Motivation: Standard Transformers have quadratic complexity in self-attention, limiting their ability to handle long sequences efficiently. The paper aims to develop a more efficient architecture that can better capture hierarchical structures and long-range dependencies.Method: Proposes a hierarchical binary tree-based reduction mechanism using recursive Gated Linear Unit merge operations. This replaces standard self-attention with a structure that has O(n) total merge operations, O(log n) parallel depth, O(n d^2) total work, and O(n) space complexity.
Result: The model significantly outperforms standard Transformers in both convergence speed and accuracy on tasks requiring long-range structural dependencies, particularly where hierarchical inductive bias is critical.
Conclusion: The hierarchical binary tree-based reduction provides an efficient alternative to standard self-attention, offering better performance on hierarchical structures while maintaining computational efficiency.
Abstract: Work introduces a hierarchical binary tree-based reduction that replaces standard self-attention. The core idea is to use a recursive Gated Linear Unit merge operation, achieving O(n) total merge operations O(log n) parallel depth O(n d^2) total work and O(n) space complexity. In these experiments, the model significantly outperforms standard Transformers in both convergence speed and accuracy on long-range structural dependencies, specifically where hierarchical inductive bias is critical.
[1223] A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations
Hossein Javidnia
Main category: cs.LG
TL;DR: A gauge-theoretic framework for analyzing superposition in LLMs using sheaf theory and information geometry to measure interpretability obstructions like jamming, shearing, and holonomy.
Details
Motivation: To develop a rigorous mathematical framework for understanding superposition phenomena in large language models, moving beyond the single-global-dictionary assumption to capture local semantic structures and measure interpretability limitations.Method: Uses discrete gauge theory and sheaf theory to create an atlas of local semantic charts with Fisher/Gauss-Newton metrics. Contexts are clustered into a stratified complex, with each chart having local feature spaces and information-geometric metrics to identify feature interactions.
Result: Four key results: (A) Holonomy computable and gauge-invariant after constructive gauge fixing; (B) Shearing lower-bounds transfer mismatch energy; (C) Non-vacuous certified jamming/interference bounds with high coverage; (D) Stable estimation of shearing and holonomy distances with improved concentration.
Conclusion: The framework provides measurable obstructions to global interpretability in LLMs (jamming, shearing, holonomy) and offers concrete methods to compute and bound these phenomena, advancing theoretical understanding of superposition and interpretability.
Abstract: We develop a discrete gauge-theoretic framework for superposition in large language models (LLMs) that replaces the single-global-dictionary premise with a sheaf-theoretic atlas of local semantic charts. Contexts are clustered into a stratified context complex; each chart carries a local feature space and a local information-geometric metric (Fisher/Gauss–Newton) identifying predictively consequential feature interactions. This yields a Fisher-weighted interference energy and three measurable obstructions to global interpretability: (O1) local jamming (active load exceeds Fisher bandwidth), (O2) proxy shearing (mismatch between geometric transport and a fixed correspondence proxy), and (O3) nontrivial holonomy (path-dependent transport around loops).
We prove and instantiate four results on a frozen open LLM (Llama3.23B Instruct) using WikiText-103, a C4-derived English web-text subset, and \texttt{the-stack-smol}. (A) After constructive gauge fixing on a spanning tree, each chord residual equals the holonomy of its fundamental cycle, making holonomy computable and gauge-invariant. (B) Shearing lower-bounds a data-dependent transfer mismatch energy, turning $D_{\mathrm{shear}}$ into an unavoidable failure bound. (C) We obtain non-vacuous certified jamming/interference bounds with high coverage and zero violations across seeds/hyperparameters. (D) Bootstrap and sample-size experiments show stable estimation of $D_{\mathrm{shear}}$ and $D_{\mathrm{hol}}$, with improved concentration on well-conditioned subsystems.
[1224] GeMi: A Graph-based, Multimodal Recommendation System for Narrative Scroll Paintings
Haimonti Dutta, Pruthvi Moluguri, Jin Dai, Saurabh Amarnath Mahindre
Main category: cs.LG
TL;DR: A GNN-based multimodal recommendation system for endangered narrative scroll paintings using text and image data to suggest similar artworks for viewing and purchase.
Details
Motivation: To create a recommendation system for endangered performing art forms (narrative scroll paintings) that serves multiple purposes: art conservation, data storage for endangered art objects, and personalized recommendations for interested users.Method: Uses Graph Neural Networks (GNNs) to handle relationships between items and users, incorporating multimodal content (text and image data) with state-of-the-art vision-language models for processing, along with embedding, message passing, and propagation algorithms.
Result: Presents the design of a GNN-based recommendation system on a novel dataset collected from field research, specifically for narrative scroll paintings - a first-of-its-kind system for this endangered art form.
Conclusion: The work serves as both a conservation tool for endangered art and a state-of-the-art recommendation system that leverages novel data characteristics and user preferences for narrative scroll paintings.
Abstract: Recommendation Systems are effective in managing the ever-increasing amount of multimodal data available today and help users discover interesting new items. These systems can handle various media types such as images, text, audio, and video data, and this has made it possible to handle content-based recommendation utilizing features extracted from items while also incorporating user preferences. Graph Neural Network (GNN)-based recommendation systems are a special class of recommendation systems that can handle relationships between items and users, making them particularly attractive for content-based recommendations. Their popularity also stems from the fact that they use advanced machine learning techniques, such as deep learning on graph-structured data, to exploit user-to-item interactions. The nodes in the graph can access higher-order neighbor information along with state-of-the-art vision-language models for processing multimodal content, and there are well-designed algorithms for embedding, message passing, and propagation. In this work, we present the design of a GNN-based recommendation system on a novel data set collected from field research. Designed for an endangered performing art form, the recommendation system uses multimodal content (text and image data) to suggest similar paintings for viewing and purchase. To the best of our knowledge, there is no recommendation system designed for narrative scroll paintings – our work therefore serves several purposes, including art conservation, a data storage system for endangered art objects, and a state-of-the-art recommendation system that leverages both the novel characteristics of the data and preferences of the user population interested in narrative scroll paintings.
[1225] Navigating Time’s Possibilities: Plausible Counterfactual Explanations for Multivariate Time-Series Forecast through Genetic Algorithms
Gianlucca Zuin, Adriano Veloso
Main category: cs.LG
TL;DR: Novel counterfactual learning method for multivariate time series analysis combining genetic algorithms with Granger causality tests to uncover hidden causal relationships and predict outcomes under hypothetical interventions.
Details
Motivation: To understand and model causality in complex dynamic systems, particularly for uncovering hidden causal relationships in multivariate time series and identifying potential interventions to achieve desired outcomes.Method: Integrates genetic algorithms with rigorous causality tests (Granger causality) to infer and validate counterfactual dependencies. Uses genetic algorithms with quantile regression to exploit causal relationships for projecting future scenarios under hypothetical interventions.
Result: Evaluated on real-world data, the algorithm demonstrates ability to handle complex causal relationships, reveal meaningful counterfactual insights, and predict outcomes under hypothetical interventions.
Conclusion: The synergy between genetic algorithms and causality tests enables thorough exploration of temporal dynamics, revealing hidden dependencies and enabling projection of outcomes under hypothetical interventions in multivariate time series.
Abstract: Counterfactual learning has become promising for understanding and modeling causality in complex and dynamic systems. This paper presents a novel method for counterfactual learning in the context of multivariate time series analysis and forecast. The primary objective is to uncover hidden causal relationships and identify potential interventions to achieve desired outcomes. The proposed methodology integrates genetic algorithms and rigorous causality tests to infer and validate counterfactual dependencies within temporal sequences. More specifically, we employ Granger causality to enhance the reliability of identified causal relationships, rigorously assessing their statistical significance. Then, genetic algorithms, in conjunction with quantile regression, are used to exploit these intricate causal relationships to project future scenarios. The synergy between genetic algorithms and causality tests ensures a thorough exploration of the temporal dynamics present in the data, revealing hidden dependencies and enabling the projection of outcomes under hypothetical interventions. We evaluate the performance of our algorithm on real-world data, showcasing its ability to handle complex causal relationships, revealing meaningful counterfactual insights, and allowing for the prediction of outcomes under hypothetical interventions.
[1226] MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules
Idelfonso B. R. Nogueira, Carine M. Rebelloa, Mumin Enis Leblebici, Erick Giovani Sperandio Nascimento
Main category: cs.LG
TL;DR: MultiPUFFIN is a multimodal foundation model for predicting nine thermophysical properties with thermodynamic consistency, using SMILES, graphs, and 3D geometries with domain-informed inductive biases.
Details
Motivation: Current molecular foundation models lack thermodynamic consistency, while domain-informed approaches are limited to single properties and small datasets. There's a need for models that can predict multiple thermophysical properties simultaneously while maintaining thermodynamic consistency.Method: MultiPUFFIN uses: (1) multimodal encoder fusing SMILES, graphs, and 3D geometries via gated cross-modal attention, plus experimental condition and descriptor encoders; (2) prediction heads embedding established thermodynamic correlations (Wagner, Andrade, van’t Hoff, Shomate equations) as inductive biases; (3) two-stage multi-task training strategy.
Result: Achieves mean R² = 0.716 on scaffold-split test set of 8,877 molecules. Outperforms ChemBERTa-2 across all nine properties despite using 2000x fewer training molecules. Particularly effective for temperature-dependent properties where baseline lacks thermodynamic condition capacity.
Conclusion: Multimodal encoding and domain-informed biases substantially reduce data and compute requirements compared to brute-force pre-training. The model handles missing modalities and recovers meaningful thermodynamic parameters without explicit supervision.
Abstract: Predicting physicochemical properties across chemical space is vital for chemical engineering, drug discovery, and materials science. Current molecular foundation models lack thermodynamic consistency, while domain-informed approaches are limited to single properties and small datasets. We introduce MultiPUFFIN, a domain-constrained multimodal foundation model addressing both limitations simultaneously. MultiPUFFIN features: (i) an encoder fusing SMILES, graphs, and 3D geometries via gated cross-modal attention, alongside experimental condition and descriptor encoders; (ii) prediction heads embedding established correlations (e.g., Wagner, Andrade, van’t Hoff, and Shomate equations) as inductive biases to ensure thermodynamic consistency; and (iii) a two-stage multi-task training strategy.Extending prior frameworks, MultiPUFFIN predicts nine thermophysical properties simultaneously. It is trained on a multi-source dataset of 37,968 unique molecules (40,904 rows). With roughly 35 million parameters, MultiPUFFIN achieves a mean $R^2 = 0.716$ on a challenging scaffold-split test set of 8,877 molecules. Compared to ChemBERTa-2 (pre-trained on 77 million molecules), MultiPUFFIN outperforms the fine-tuned baseline across all nine properties despite using 2000x fewer training molecules. Advantages are strikingly apparent for temperature-dependent properties, where ChemBERTa-2 lacks the architectural capacity to incorporate thermodynamic conditions.These results demonstrate that multimodal encoding and domain-informed biases substantially reduce data and compute requirements compared to brute-force pre-training. Furthermore, MultiPUFFIN handles missing modalities and recovers meaningful thermodynamic parameters without explicit supervision. Systematic ablation studies confirm the property-specific benefits of these domain-informed prediction heads.
[1227] Active Flow Matching
Yashvir S. Grewal, Daniel M. Steinberg, Thang D. Bui, Cheng Soon Ong, Edwin V. Bonilla
Main category: cs.LG
TL;DR: Active Flow Matching (AFM) enables variational optimization with flow matching models by operating on conditional endpoint distributions, allowing gradient-based steering toward high-fitness regions while maintaining theoretical rigor.
Details
Motivation: Current discrete diffusion and flow matching models can't be directly integrated with variational frameworks for online black-box optimization (like VSD and CbAS) due to their implicit generative nature, limiting their use in principled optimization tasks.Method: AFM reformulates variational objectives to operate on conditional endpoint distributions along the flow, enabling gradient-based steering of flow models. Derives forward and reverse KL variants using self-normalized importance sampling.
Result: Forward-KL AFM performs competitively compared to state-of-the-art baselines across protein and small molecule design tasks, demonstrating effective exploration-exploitation under tight experimental budgets.
Conclusion: AFM successfully bridges flow matching models with variational optimization frameworks, enabling principled gradient-based steering while preserving theoretical guarantees of VSD and CbAS.
Abstract: Discrete diffusion and flow matching models capture complex, non-additive and non-autoregressive structure in high-dimensional objective landscapes through parallel, iterative refinement. However, their implicit generative nature precludes direct integration with principled variational frameworks for online black-box optimisation, such as variational search distributions (VSD) and conditioning by adaptive sampling (CbAS). We introduce Active Flow Matching (AFM), which reformulates variational objectives to operate on conditional endpoint distributions along the flow, enabling gradient-based steering of flow models toward high-fitness regions while preserving the rigour of VSD and CbAS. We derive forward and reverse Kullback-Leibler (KL) variants using self-normalised importance sampling. Across a suite of online protein and small molecule design tasks, forward-KL AFM consistently performs competitively compared to state-of-the-art baselines, demonstrating effective exploration-exploitation under tight experimental budgets.
[1228] Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
Michael Hardy, Yunsung Kim
Main category: cs.LG
TL;DR: Foundation models show poor alignment with educational outcomes despite benchmark success, with shared biases across models negatively correlating with learning results.
Details
Motivation: To evaluate whether strong benchmark performance of foundation models translates to valid performance on real-world educational tasks, specifically teaching and learning applications for schoolchildren.Method: Evaluated leading foundation models on out-of-distribution educational tasks, compared model behaviors with expert human behaviors, measured alignment with teaching quality and learning outcomes, and analyzed shared biases across models using multi-model ensembles.
Result: Models show higher correlation with each other than with expert human behaviors, shared biases are negatively aligned with learning outcomes, multi-model ensembles exacerbate misalignment, and 50% of misalignment error is shared across foundation models.
Conclusion: Common pretraining creates shared biases that misalign models with educational outcomes, highlighting limitations of benchmark performance for real-world applications and the need for robust alignment measurement methods.
Abstract: LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models (FMs, i.e., generative pre-trained base LLMs) with out-of-distribution (OOD) tasks of the teaching and learning of schoolchildren. Across all FMs, inter-model behaviors on disparate tasks correlate higher than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often \textit{negatively aligned with learning outcomes}. Further, we find multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that 50% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into both educational applications of foundation models and to understanding limitations of models.
[1229] Probabilistic Learning and Generation in Deep Sequence Models
Wenlong Chen
Main category: cs.LG
TL;DR: This thesis bridges deep sequence models and probabilistic methods by leveraging DSM architectures to design better Bayesian priors and approximations, with applications to Transformers, Gaussian processes, and sequential generative models.
Details
Motivation: Deep sequence models lack uncertainty quantification while probabilistic models provide principled uncertainty awareness but face challenges in prior specification and approximation quality at scale. The research aims to bridge these approaches by using DSM architectures to inform probabilistic model design.Method: 1) Develop Bayesian inference for Transformers using attention-sparse Gaussian process similarity; 2) Use HiPPO’s memory preservation for Gaussian process interdomain inducing points in online learning; 3) Explore self-supervision for sequential latent states in generative models inspired by diffusion models.
Result: The work demonstrates how DSM inductive biases can improve probabilistic inference and structure design, creating mutually reinforcing improvements between deep sequence models and probabilistic approaches.
Conclusion: Leveraging architectural insights from deep sequence models can effectively address key bottlenecks in Bayesian methods, bridging the gap between deterministic deep learning and probabilistic modeling for better uncertainty quantification.
Abstract: Despite exceptional predictive performance of Deep sequence models (DSMs), the main concern of their deployment centers around the lack of uncertainty awareness. In contrast, probabilistic models quantify the uncertainty associated with unobserved variables with rules of probability. Notably, Bayesian methods leverage Bayes’ rule to express our belief of unobserved variables in a principled way. Since exact Bayesian inference is computationally infeasible at scale, approximate inference is required in practice. Two major bottlenecks of Bayesian methods, especially when applied in deep neural networks, are prior specification and approximation quality. In Chapter 3 & 4, we investigate how the architectures of DSMs themselves can be informative for the design of priors or approximations in probabilistic models. We first develop an approximate Bayesian inference method tailored to the Transformer based on the similarity between attention and sparse Gaussian process. Next, we exploit the long-range memory preservation capability of HiPPOs (High-order Polynomial Projection Operators) to construct an interdomain inducing point for Gaussian process, which successfully memorizes the history in online learning. In addition to the progress of DSMs in predictive tasks, sequential generative models consisting of a sequence of latent variables are popularized in the domain of deep generative models. Inspired by the explicit self-supervised signals for these latent variables in diffusion models, in Chapter 5, we explore the possibility of improving other generative models with self-supervision for their sequential latent states, and investigate desired probabilistic structures over them. Overall, this thesis leverages inductive biases in DSMs to design probabilistic inference or structure, which bridges the gap between DSMs and probabilistic models, leading to mutually reinforced improvement.
[1230] Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark
Zhiqi Yu, Xingping Liu, Haobin Mao, Mingshuo Liu, Long Chen, Jack Xin, Yifeng Yu
Main category: cs.LG
TL;DR: AI grading system using OCR-conditioned LLMs for handwritten calculus work achieves strong alignment with TA scores and produces acceptable feedback for most submissions.
Details
Motivation: Large undergraduate STEM courses struggle with providing meaningful feedback due to heavy grading workloads, creating a need for automated solutions that can handle handwritten mathematical work.Method: OCR-conditioned large language models with structured, rubric-guided prompting applied to thousands of handwritten calculus quiz submissions from nearly 800 students.
Result: Strong alignment with TA scoring, majority of AI-generated feedback rated as correct or acceptable, with analysis of failure modes and proposed rubric/prompt design principles.
Conclusion: AI grading for handwritten mathematics is feasible and effective, with proposed evaluation framework and benchmark for reproducible research in this domain.
Abstract: Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
[1231] Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning
Ke Sun, Hongming Zhang, Jun Jin, Chao Gao, Xi Chen, Wulong Liu, Linglong Kong
Main category: cs.LG
TL;DR: A dual-learner framework for continual RL inspired by human memory systems, with fast learner for knowledge transfer and meta learner for integration, using adaptive meta warm-up for rapid adaptation.
Details
Motivation: Address continual reinforcement learning problems by drawing inspiration from human learning and memory systems, particularly the hippocampus-cerebral cortex interplay, to overcome catastrophic forgetting and enable efficient knowledge transfer across tasks.Method: Proposes a dual-learner framework with fast learner (focuses on knowledge transfer) and meta learner (ensures knowledge integration). Uses adaptive meta warm-up mechanism for rapid adaptation in new environments, explicitly minimizing catastrophic forgetting rather than just maximizing average return.
Result: Superior performance in various pixel-based and continuous control benchmarks compared to baseline methods, demonstrating effective continual learning capabilities.
Conclusion: The dual-learner approach inspired by human memory systems effectively addresses continual RL problems by combining knowledge transfer and integration while minimizing catastrophic forgetting.
Abstract: Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual Reinforcement Learning~(RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi-task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm-up mechanism that selectively harnesses past knowledge. We conduct experiments in various pixel-based and continuous control benchmarks, revealing the superior performance of continual learning for our proposed dual-learner approach relative to baseline methods. The code is released in https://github.com/datake/FAME.
[1232] When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning
Joshua Steier
Main category: cs.LG
TL;DR: CFF training instability stems from saturating similarity clamping in contrastive loss; alternative gradient-neutral formulation reduces variance without affecting mean accuracy.
Details
Motivation: Contrastive Forward-Forward (CFF) learning for Vision Transformers shows sensitivity to random seeds, but the causes of this instability are not well understood. The paper investigates how implementation details of the contrastive loss affect training stability.Method: Analyzes two formulations of positive-pair margin in contrastive loss: saturating similarity clamping (min(s + m, 1)) vs. gradient-neutral margin subtraction after log-probability. Conducts experiments on CIFAR-10, CIFAR-100, SVHN, and Fashion-MNIST with factorial design and multiple seeds. Examines clamp activation rates, layerwise gradient norms, and reduced-margin probes to understand saturation-driven gradient truncation.
Result: Clamping produces 5.90× higher test-accuracy variance on CIFAR-10 with no mean accuracy difference. Variance effects depend on positive-pair density per batch and task difficulty. SVHN difficulty sweep shows variance ratio changes from 0.25× at high accuracy to 16.73× under aggressive augmentation. Gradient-neutral subtraction eliminates variance inflation without affecting mean accuracy.
Conclusion: Saturating similarity clamping causes training instability in CFF through gradient truncation at early layers. The gradient-neutral alternative formulation should be preferred as it removes variance inflation while maintaining performance. Layer-0 clamp activation rate serves as a diagnostic for this issue.
Abstract: Contrastive Forward-Forward (CFF) learning trains Vision Transformers layer by layer against supervised contrastive objectives. CFF training can be sensitive to random seed, but the sources of this instability are poorly understood. We focus on one implementation detail: the positive-pair margin in the contrastive loss is applied through saturating similarity clamping, $\min(s + m,, 1)$. We prove that an alternative formulation, subtracting the margin after the log-probability, is gradient-neutral under the mean-over-positives reduction. On CIFAR-10 ($2 \times 2$ factorial, $n{=}7$ seeds per cell), clamping produces $5.90\times$ higher pooled test-accuracy variance ($p{=}0.003$) with no difference in mean accuracy. Analyses of clamp activation rates, layerwise gradient norms, and a reduced-margin probe point to saturation-driven gradient truncation at early layers. The effect does not transfer cleanly to other datasets: on CIFAR-100, SVHN, and Fashion-MNIST, clamping produces equal or lower variance. Two factors account for the discrepancy. First, positive-pair density per batch controls how often saturation occurs. Second, task difficulty compresses seed-to-seed spread when accuracy is high. An SVHN difficulty sweep confirms the interaction on a single dataset, with the variance ratio moving from $0.25\times$ at high accuracy to $16.73\times$ under aggressive augmentation. In moderate-accuracy regimes with many same-class pairs per batch, switching to the gradient-neutral subtraction reference removes this variance inflation at no cost to mean accuracy. Measuring the layer-0 clamp activation rate serves as a simple check for whether the problem applies.
[1233] Stabilizing Policy Optimization via Logits Convexity
Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao
Main category: cs.LG
TL;DR: LCO is a new RL framework that improves training stability by emulating the convexity properties of supervised fine-tuning at the logits level.
Details
Motivation: RL optimization is notoriously unstable compared to supervised fine-tuning (SFT), especially for large language models. The authors investigate this stability gap and find that SFT's convexity with respect to model logits plays a key stabilizing role, which PPO lacks.Method: Proposes Logits Convex Optimization (LCO), a policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity found in SFT.
Result: Extensive experiments across multiple model families show LCO consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
Conclusion: The convexity of SFT loss at the logits level is crucial for stable training, and LCO successfully emulates this property to achieve more stable and effective RL optimization for language models.
Abstract: While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
[1234] Intent-Context Synergy Reinforcement Learning for Autonomous UAV Decision-Making in Air Combat
Jiahao Fu, Feng Yang
Main category: cs.LG
TL;DR: ICS-RL framework for UAV infiltration using intent prediction and context-aware hierarchical RL to balance mission efficiency and survivability in contested environments.
Details
Motivation: Autonomous UAV infiltration in dynamic contested environments is challenging due to partial observability and conflicting objectives of mission efficiency versus survivability. Traditional RL approaches suffer from myopic decision-making and struggle to balance these trade-offs in real-time.Method: Proposes Intent-Context Synergy Reinforcement Learning (ICS-RL) with: (1) LSTM-based Intent Prediction Module for forecasting hostile unit trajectories, enabling proactive planning via state augmentation; (2) Context-Analysis Synergy Mechanism that decomposes mission into hierarchical sub-tasks (safe cruise, stealth planning, hostile breakthrough); (3) Heterogeneous ensemble of Dueling DQN agents specialized in specific tactical contexts; (4) Dynamic switching controller based on Max-Advantage values for adaptive policy selection.
Result: ICS-RL significantly outperforms baselines (Standard DDQN) and traditional methods (PSO, Game Theory). Achieves mission success rate of 88% and reduces average exposure frequency to 0.24 per episode, demonstrating superiority in robust and stealthy penetration in high-dynamic scenarios.
Conclusion: The ICS-RL framework effectively addresses the challenges of UAV infiltration in contested environments by combining intent prediction with context-aware hierarchical reinforcement learning, enabling adaptive and robust decision-making that balances mission efficiency and survivability.
Abstract: Autonomous UAV infiltration in dynamic contested environments remains a significant challenge due to the partially observable nature of threats and the conflicting objectives of mission efficiency versus survivability. Traditional Reinforcement Learning (RL) approaches often suffer from myopic decision-making and struggle to balance these trade-offs in real-time. To address these limitations, this paper proposes an Intent-Context Synergy Reinforcement Learning (ICS-RL) framework. The framework introduces two core innovations: (1) An LSTM-based Intent Prediction Module that forecasts the future trajectories of hostile units, transforming the decision paradigm from reactive avoidance to proactive planning via state augmentation; (2) A Context-Analysis Synergy Mechanism that decomposes the mission into hierarchical sub-tasks (safe cruise, stealth planning, and hostile breakthrough). We design a heterogeneous ensemble of Dueling DQN agents, each specialized in a specific tactical context. A dynamic switching controller based on Max-Advantage values seamlessly integrates these agents, allowing the UAV to adaptively select the optimal policy without hard-coded rules. Extensive simulations demonstrate that ICS-RL significantly outperforms baselines (Standard DDQN) and traditional methods (PSO, Game Theory). The proposed method achieves a mission success rate of 88% and reduces the average exposure frequency to 0.24 per episode, validating its superiority in ensuring robust and stealthy penetration in high-dynamic scenarios.
[1235] Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models
Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal
Main category: cs.LG
TL;DR: SurgUn is a surgical unlearning method for text-to-image diffusion models that applies targeted weight-space updates to remove specific visual concepts while preserving unrelated capabilities, based on retroactive interference theory.
Details
Motivation: Current unlearning approaches in text-to-image diffusion models often lead to uneven concept removal and unintended forgetting of unrelated capabilities, complicating tasks like copyright compliance, protected data mitigation, artist opt-outs, and policy-driven content updates. As models grow larger and more diverse, achieving precise selective unlearning while preserving generative quality becomes increasingly challenging.Method: SurgUn applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models, motivated by retroactive interference theory. The approach induces retroactive concept interference, enabling focused destabilization of only the target concept while preserving unrelated capabilities through a novel training paradigm.
Result: SurgUn achieves high-precision unlearning across diverse settings: performs strongly on compact U-Net based models like Stable Diffusion v1.5, scales effectively to larger U-Net architecture SDXL, and extends to SANA (Diffusion Transformer based architecture).
Conclusion: SurgUn provides an effective surgical unlearning method for text-to-image diffusion models that enables precise concept removal while preserving model capabilities, addressing key challenges in copyright compliance, data mitigation, and content policy implementation.
Abstract: Unlearning in text-to-image diffusion models often leads to uneven concept removal and unintended forgetting of unrelated capabilities. This complicates tasks such as copyright compliance, protected data mitigation, artist opt-outs, and policy-driven content updates. As models grow larger and adopt more diverse architectures, achieving precise and selective unlearning while preserving generative quality becomes increasingly challenging. We introduce SurgUn (pronounced as Surgeon), a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models. Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones by competing for shared representational pathways. We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept while preserving unrelated capabilities through a novel training paradigm. SurgUn achieves high-precision unlearning across diverse settings. It performs strongly on compact U-Net based models such as Stable Diffusion v1.5, scales effectively to the larger U-Net architecture SDXL, and extends to SANA, representing an underexplored Diffusion Transformer based architecture for unlearning.
[1236] Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual Information
Xinwen Cheng, Jingyuan Zhang, Zhehao Huang, Yingwen Wu, Xiaolin Huang
Main category: cs.LG
TL;DR: MiM-MU: A compensation-free concept erasure method for diffusion models that minimizes mutual information to precisely remove undesired knowledge while preserving model utility for other concepts.
Details
Motivation: Existing concept erasure methods in diffusion models suffer from excessive removal that degrades innocent generations. Current compensation-based approaches (re-assimilating data or constraining divergence) are insufficient for preserving general utility of large-scale generative models, as generations beyond compensation scope still suffer.Method: Proposes MiM-MU (Mutual Information Minimization for Machine Unlearning), which unlearns concepts by minimizing mutual information between the model and undesired knowledge. Uses delicate design for computational effectiveness and maintains sampling distribution for other concepts without post-remedial compensation.
Result: Extensive evaluations show effective concept removal while maintaining high-quality generations for other concepts. Remarkably achieves this without relying on any post-remedial compensation for the first time.
Conclusion: Compensation-free concept erasure is possible and preferable for diffusion models. MiM-MU demonstrates that precise identification and elimination of undesired knowledge can be achieved with minimal impact on other generations, addressing privacy/safety concerns while preserving model utility.
Abstract: The powerful generative capabilities of diffusion models have raised growing privacy and safety concerns regarding generating sensitive or undesired content. In response, machine unlearning (MU) – commonly referred to as concept erasure (CE) in diffusion models – has been introduced to remove specific knowledge from model parameters meanwhile preserving innocent knowledge. Despite recent advancements, existing unlearning methods often suffer from excessive and indiscriminate removal, which leads to substantial degradation in the quality of innocent generations. To preserve model utility, prior works rely on compensation, i.e., re-assimilating a subset of the remaining data or explicitly constraining the divergence from the pre-trained model on remaining concepts. However, we reveal that generations beyond the compensation scope still suffer, suggesting such post-remedial compensations are inherently insufficient for preserving the general utility of large-scale generative models. Therefore, in this paper, we advocate for developing compensation-free concept erasure operations, which precisely identify and eliminate the undesired knowledge such that the impact on other generations is minimal. In technique, we propose to MiM-MU, which is to unlearn a concept by minimizing the mutual information with a delicate design for computational effectiveness and for maintaining sampling distribution for other concepts. Extensive evaluations demonstrate that our proposed method achieves effective concept removal meanwhile maintaining high-quality generations for other concepts, and remarkably, without relying on any post-remedial compensation for the first time.
[1237] DWAFM: Dynamic Weighted Graph Structure Embedding Integrated with Attention and Frequency-Domain MLPs for Traffic Forecasting
Sen Shi, Zhichao Zhang, Yangfan He
Main category: cs.LG
TL;DR: Proposes DWAFM model for traffic prediction using dynamic weighted graph structure embeddings combined with attention and frequency-domain MLPs to capture evolving spatial-temporal dependencies.
Details
Motivation: Existing traffic prediction methods struggle with modeling complex spatial-temporal dependencies. While embedding technology shows potential, current approaches ignore graph structure or rely on static graphs, failing to capture dynamic node associations that evolve over time.Method: Introduces Dynamic Weighted Graph Structure (DWGS) embedding method that reflects changing node association strengths over time. Combines DWGS embedding with spatial-temporal adaptive embedding, temporal embedding, and feature embedding, then integrates attention mechanisms and frequency-domain MLPs into the DWAFM model.
Result: Experiments on five real-world traffic datasets show DWAFM achieves better prediction performance than state-of-the-art methods.
Conclusion: The proposed DWGS embedding method effectively captures dynamic spatial-temporal dependencies in traffic data, and the DWAFM model demonstrates superior prediction performance through integration of attention and frequency-domain MLPs.
Abstract: Accurate traffic prediction is a key task for intelligent transportation systems. The core difficulty lies in accurately modeling the complex spatial-temporal dependencies in traffic data. In recent years, improvements in network architecture have failed to bring significant performance enhancements, while embedding technology has shown great potential. However, existing embedding methods often ignore graph structure information or rely solely on static graph structures, making it difficult to effectively capture the dynamic associations between nodes that evolve over time. To address this issue, this letter proposes a novel dynamic weighted graph structure (DWGS) embedding method, which relies on a graph structure that can truly reflect the changes in the strength of dynamic associations between nodes over time. By first combining the DWGS embedding with the spatial-temporal adaptive embedding, as well as the temporal embedding and feature embedding, and then integrating attention and frequency-domain multi-layer perceptrons (MLPs), we design a novel traffic prediction model, termed the DWGS embedding integrated with attention and frequency-domain MLPs (DWAFM). Experiments on five real-world traffic datasets show that the DWAFM achieves better prediction performance than some state-of-the-arts.
[1238] Feature-Weighted Maximum Representative Subsampling
Tony Hauptmann, Stefan Kramer
Main category: cs.LG
TL;DR: FW-MRS is a debiasing algorithm that uses feature weighting to minimize the impact of highly biased features when creating representative subsamples, preventing over-correction of already representative variables.
Details
Motivation: Traditional debiasing algorithms can introduce bias into already representative features when trying to correct highly biased ones, as they need to strongly alter sample distributions. This over-correction problem needs addressing.Method: Feature-weighted Maximum Representative Subsampling (FW-MRS) extends MRS by incorporating feature weights derived from a domain classifier’s feature importance. These weights decrease emphasis on highly biased features during subsample creation.
Result: FW-MRS retains more instances than standard MRS while maintaining debiasing effectiveness. On eight tabular datasets with artificial bias, it showed no statistically significant decline in downstream task generalization performance despite focusing less on biased features.
Conclusion: FW-MRS effectively addresses the over-correction problem in debiasing by using feature weighting, preserving more data for downstream tasks without compromising debiasing quality or generalization performance.
Abstract: In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest is already representative. Algorithms need to strongly alter the sample distribution to manage a few highly biased features, which can in turn introduce bias into already representative variables. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non-representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature-weighted MRS (FW-MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non-representative datasets. We validated FW-MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW-MRS on downstream tasks and found no statistically significant differences. Additionally, FW-MRS was applied to a real-world dataset from the social sciences. The source code is available at https://github.com/kramerlab/FeatureWeightDebiasing.
[1239] One-Token Verification for Reasoning Correctness Estimation
Zhan Zhuang, Xiequn Wang, Zebin Chen, Feiyang Ye, Ying Wei, Kede Ma, Yu Zhang
Main category: cs.LG
TL;DR: OTV is a method that enables LLMs to estimate reasoning correctness in a single forward pass using a learnable token and key-value cache probing, reducing token usage by 90% through early termination.
Details
Motivation: Current parallel thinking approaches for LLMs suffer from high inference latency due to multi-sample decoding and lack effective mechanisms for assessing reasoning trace correctness.Method: Introduces One-Token Verification (OTV) using a learnable token integrated via low-rank adaptation to probe internal reasoning signals through key-value cache, enabling token-level correctness estimation without disrupting primary reasoning.
Result: OTV consistently surpasses existing verifiers on mathematical reasoning benchmarks and reduces token usage by up to 90% through correctness-guided early termination.
Conclusion: OTV provides an efficient method for real-time reasoning verification that significantly reduces computational overhead while improving verification accuracy.
Abstract: Recent breakthroughs in large language models (LLMs) have led to notable successes in complex reasoning tasks, such as mathematical problem solving. A common strategy for improving performance is parallel thinking, in which multiple reasoning traces are generated and the final prediction is made using aggregation schemes like majority voting or best-of-$N$ decoding. However, two key challenges persist. First, multi-sample decoding incurs substantial inference latency, especially for long-form outputs. Second, effective mechanisms for reliably assessing the correctness of individual reasoning traces are still limited. To address these challenges, we introduce One-Token Verification (OTV), a computational method that estimates reasoning correctness in a single forward pass during generation. OTV is activated by a learnable token and integrated into the LLM via low-rank adaptation to probe internal reasoning signals through the key-value cache, supporting token-level correctness estimation at any stage of generation without disrupting primary reasoning. Experiments on mathematical reasoning benchmarks demonstrate that OTV consistently surpasses existing verifiers. Additionally, OTV reduces token usage by up to $90%$ through correctness-guided early termination, prioritizing shorter, more reliable solutions.
[1240] Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
Heewon Park, Mugon Joe, Miru Kim, Kyungjin Im, Minhae Kwon
Main category: cs.LG
TL;DR: Fed-ADE: Unsupervised federated adaptation framework using distribution shift estimation for adaptive learning rates in non-stationary data streams
Details
Motivation: Federated learning in post-deployment settings needs to adapt to non-stationary data streams across heterogeneous clients without ground-truth labels, requiring adaptive learning rates for time-varying distribution shiftsMethod: Proposes Fed-ADE with uncertainty dynamics estimation (captures predictive uncertainty changes) and representation dynamics estimation (detects covariate-level feature drift), combined into per-client, per-timestep adaptive learning rates
Result: Experiments on image and text benchmarks under diverse distribution shifts show consistent improvements over strong baselines
Conclusion: Distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity
Abstract: Federated learning (FL) in post-deployment settings must adapt to non-stationary data streams across heterogeneous clients without access to ground-truth labels. A major challenge is learning rate selection under client-specific, time-varying distribution shifts, where fixed learning rates often lead to underfitting or divergence. We propose Fed-ADE (Federated Adaptation with Distribution Shift Estimation), an unsupervised federated adaptation framework that leverages lightweight estimators of distribution dynamics. Specifically, Fed-ADE employs uncertainty dynamics estimation to capture changes in predictive uncertainty and representation dynamics estimation to detect covariate-level feature drift, combining them into a per-client, per-timestep adaptive learning rate. We provide theoretical analyses showing that our dynamics estimation approximates the underlying distribution shift and yields dynamic regret and convergence guarantees. Experiments on image and text benchmarks under diverse distribution shifts (label and covariate) demonstrate consistent improvements over strong baselines. These results highlight that distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity.
[1241] Evaluating GFlowNet from partial episodes for stable and flexible policy-based training
Puhua Niu, Shili Wu, Xiaoning Qian
Main category: cs.LG
TL;DR: GFlowNets paper introduces evaluation balance objective that bridges value-based and policy-based training by using flow balance as principled policy evaluator for measuring divergence, improving reliability and flexibility of policy-based training.
Details
Motivation: Current GFlowNet training has two perspectives: value-based (flow balance) and policy-based (policy divergence minimization). Policy-based methods struggle with reliable divergence estimation under directed acyclic graphs, while value-based methods implicitly encourage divergence minimization. The work aims to bridge these perspectives.Method: Shows that flow balance yields principled policy evaluator for measuring divergence. Proposes evaluation balance objective over partial episodes for learning the evaluator. This approach supports parameterized backward policies and enables integration of offline data-collection techniques.
Result: Demonstrated on synthetic and real-world tasks that evaluation balance strengthens reliability of policy-based training and broadens its flexibility by supporting parameterized backward policies and offline data integration.
Conclusion: Evaluation balance objective successfully bridges value-based and policy-based GFlowNet training, providing more reliable divergence estimation and greater flexibility in training methodologies.
Abstract: Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
[1242] No More Maybe-Arrows: Resolving Causal Uncertainty by Breaking Symmetries
Tingrui Huang, Devendra Singh Dhami
Main category: cs.LG
TL;DR: CausalSAGE: A framework for converting partial ancestral graphs (PAGs) to directed acyclic graphs (DAGs) using state-level representations and differentiable optimization
Details
Motivation: Current causal discovery methods only learn PAGs due to observational data limitations, leaving uncertainty in causal relations unresolved, which limits downstream applicationsMethod: Expands discrete variables into state-level representations, constrains search space with structural knowledge and soft priors, applies unified differentiable objective for joint optimization, aggregates optimized structures and enforces acyclicity
Result: Experimental evaluations show obtained DAGs preserve underlying causal relations while being efficient to obtain
Conclusion: CausalSAGE provides an effective framework for refining PAGs to DAGs, resolving causal uncertainty for better downstream task applicability
Abstract: The recent works on causal discovery have followed a similar trend of learning partial ancestral graphs (PAGs) since observational data constrain the true causal directed acyclic graph (DAG) only up to a Markov equivalence class. This limits their application in the majority of downstream tasks, as uncertainty in causal relations remains unresolved. We propose a new refinement framework, CausalSAGE, for converting PAGs to DAGs while respecting the underlying causal relations. The framework expands discrete variables into state-level representations, constrains the search space using structural knowledge and soft priors, and applies a unified differentiable objective for joint optimization. The final DAG is obtained by aggregating the optimized structures and enforcing acyclicity when necessary. Our experimental evaluations show that the obtained DAGs preserve the underlying causal relations while also being efficient to obtain.
[1243] A level-wise training scheme for learning neural multigrid smoothers with application to integral equations
Lingfeng Li, Yin King Chu, Raymond Chan, Justin Wan
Main category: cs.LG
TL;DR: A neural multigrid method using learned neural operators as smoothers for solving convolution-type integral equations, outperforming classical solvers with robust convergence across problem variations.
Details
Motivation: Classical multigrid methods fail for integral equations because conventional relaxation smoothers are ineffective at reducing high-frequency error components in these ill-conditioned linear systems.Method: Replace classical smoothers with neural operators trained offline using level-wise loss functions with spectral filtering to emulate multigrid frequency decomposition, ensuring each operator focuses on distinct high-frequency spectral bands.
Result: Superior efficiency over classical solvers and robust convergence across varying problem sizes and regularization weights, with neural smoothers generalizing to new right-hand-side vectors without retraining.
Conclusion: The neural multigrid framework effectively solves integral equations and is generalizable to other problems including PDEs, offering an efficient alternative to classical methods.
Abstract: Convolution-type integral equations commonly occur in signal processing and image processing. Discretizing these equations yields large and ill-conditioned linear systems. While the classic multigrid method is effective for solving linear systems derived from partial differential equations (PDE) problems, it fails to solve integral equations because its smoothers, which are implemented as conventional relaxation methods, are ineffective in reducing high-frequency components in the errors. We propose a novel neural multigrid scheme where learned neural operators replace classical smoothers. Unlike classical smoothers, these operators are trained offline. Once trained, the neural smoothers generalize to new right-hand-side vectors without retraining, making it an efficient solver. We design level-wise loss functions incorporating spectral filtering to emulate the multigrid frequency decomposition principle, ensuring each operator focuses on solving distinct high-frequency spectral bands. Although we focus on integral equations, the framework is generalizable to all kinds of problems, including PDE problems. Our experiments demonstrate superior efficiency over classical solvers and robust convergence across varying problem sizes and regularization weights.
[1244] Understanding LoRA as Knowledge Memory: An Empirical Analysis
Seungju Back, Dongwoo Lee, Naun Kang, Taehee Lee, S. K. Hong, Youngjune Gwon, Sungjin Ahn
Main category: cs.LG
TL;DR: Systematic empirical study of using LoRA as modular knowledge memory for continuous LLM updating, exploring capacity, composability, and practical boundaries.
Details
Motivation: Continuous knowledge updating for LLMs is necessary but challenging; current inference-time methods (ICL, RAG) have limitations in context budgets, costs, and retrieval fragmentation. Need to explore parametric approaches using LoRA as modular knowledge memory.Method: First systematic empirical study mapping design space of LoRA-based memory, including: characterizing storage capacity, optimizing internalization, scaling multi-module systems, and evaluating long-context reasoning. Provides practical guidance rather than single architecture.
Result: Findings position LoRA as complementary memory axis alongside RAG and ICL with distinct advantages. Explores operational boundaries of LoRA memory systems.
Conclusion: LoRA serves as modular knowledge memory for continuous LLM updating, offering parametric alternative to context-dependent methods, with systematic guidance on its capabilities and limitations.
Abstract: Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as the complementary axis of memory alongside RAG and ICL, offering distinct advantages.
[1245] A Deep Learning Framework for Heat Demand Forecasting using Time-Frequency Representations of Decomposed Features
Adithya Ramachandran, Satyaki Chatterjee, Thorkil Flensmark B. Neergaard, Maximilian Oberndoerfer, Andreas Maier, Siming Bayer
Main category: cs.LG
TL;DR: A deep learning framework using Continuous Wavelet Transform and CNNs for day-ahead heat demand forecasting in district heating systems, achieving 36-43% MAE reduction over baselines.
Details
Motivation: District heating systems need efficient management to align supply with fluctuating demand, minimize carbon emissions, and extend infrastructure lifespan through accurate multi-step forecasting, which is challenging due to complex non-linear usage patterns and external dependencies.Method: Proposes a novel deep learning framework that applies Continuous Wavelet Transform to decomposed demand and meteorological factors, enabling Convolutional Neural Networks to learn hierarchical temporal features from time-frequency representations.
Result: Achieves 36-43% reduction in Mean Absolute Error compared to strongest baselines, with up to 95% forecasting accuracy across annual test datasets from three Danish districts, a Danish city, and a German city. Reliably tracks volatile demand peaks where other methods fail.
Conclusion: The work contributes both a high-performance forecasting architecture and critical insights into optimal feature composition, offering a validated solution for modern energy applications in district heating systems.
Abstract: District Heating Systems are essential infrastructure for delivering heat to consumers across a geographic region sustainably, yet efficient management relies on optimizing diverse energy sources, such as wood, gas, electricity, and solar, in response to fluctuating demand. Aligning supply with demand is critical not only for ensuring reliable heat distribution but also for minimizing carbon emissions and extending infrastructure lifespan through lower operating temperatures. However, accurate multi-step forecasting to support these goals remains challenging due to complex, non-linear usage patterns and external dependencies. In this work, we propose a novel deep learning framework for day-ahead heat demand prediction that leverages time-frequency representations of historical data. By applying Continuous Wavelet Transform to decomposed demand and external meteorological factors, our approach enables Convolutional Neural Networks to learn hierarchical temporal features that are often inaccessible to standard time domain models. We systematically evaluate this method against statistical baselines, state-of-the-art Transformers, and emerging foundation models using multi-year data from three distinct Danish districts, a Danish city, and a German city. The results show a significant advancement, reducing the Mean Absolute Error by 36% to 43% compared to the strongest baselines, achieving forecasting accuracy of up to 95% across annual test datasets. Qualitative and statistical analyses further confirm the accuracy and robustness by reliably tracking volatile demand peaks where others fail. This work contributes both a high-performance forecasting architecture and critical insights into optimal feature composition, offering a validated solution for modern energy applications.
[1246] A Decomposition Framework for Certifiably Optimal Orthogonal Sparse PCA
Difei Cheng, Qiao Hu
Main category: cs.LG
TL;DR: A novel SPCA algorithm (GS-SPCA) with Gram-Schmidt orthogonalization that enforces sparsity, orthogonality, and optimality simultaneously, plus acceleration strategies using branch-and-bound and decomposition framework.
Details
Motivation: Existing Sparse PCA methods fail to simultaneously guarantee sparsity, orthogonality, and optimality of principal components, creating a need for improved algorithms.Method: GS-SPCA algorithm with Gram-Schmidt orthogonalization, accelerated by: 1) Branch-and-Bound for ε-optimal solutions, 2) Decomposition framework using block-diagonal approximation of covariance matrix to solve multiple principal components efficiently.
Result: The proposed methods achieve computational acceleration while maintaining sparsity, orthogonality, and optimality properties in sparse PCA.
Conclusion: GS-SPCA with acceleration strategies provides an effective solution for high-dimensional data analysis with improved computational efficiency and theoretical guarantees.
Abstract: Sparse Principal Component Analysis (SPCA) is an important technique for high-dimensional data analysis, improving interpretability by imposing sparsity on principal components. However, existing methods often fail to simultaneously guarantee sparsity, orthogonality, and optimality of the principal components. To address this challenge, this work introduces a novel Sparse Principal Component Analysis (SPCA) algorithm called \textsc{GS-SPCA} (SPCA with Gram-Schmidt Orthogonalization), which simultaneously enforces sparsity, orthogonality, and optimality. However, the original GS-SPCA algorithm is computationally expensive due to the inherent $\ell_0$-norm constraint. To address this issue, we propose two acceleration strategies: First, we combine \textbf{Branch-and-Bound} with the GS-SPCA algorithm. By incorporating this strategy, we are able to obtain $\varepsilon$-optimal solutions with a trade-off between precision and efficiency, significantly improving computational speed. Second, we propose a \textbf{decomposition framework} for efficiently solving \textbf{multiple} principal components. This framework approximates the covariance matrix using a block-diagonal matrix through a thresholding method, reducing the original SPCA problem to a set of block-wise subproblems on approximately block-diagonal matrices.
[1247] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Shijin Gong, Chengchun Shi
Main category: cs.LG
TL;DR: GRPO is a core method in reasoning LLMs like DeepSeekMath, and this paper provides theoretical analysis showing it’s asymptotically equivalent to an oracle policy gradient algorithm with optimal performance.
Details
Motivation: GRPO is widely used in reasoning LLMs but lacks theoretical understanding; this paper aims to provide rigorous theoretical foundations for GRPO through statistical analysis.Method: Analyzes GRPO through U-statistics framework, characterizes its policy gradient as a U-statistic, derives MSE bounds, asymptotic distributions, and establishes scaling laws for group size selection.
Result: GRPO is asymptotically equivalent to an oracle policy gradient algorithm, achieves optimal performance within a broad class, and has universal optimal group size scaling laws validated empirically.
Conclusion: GRPO has solid theoretical foundations, behaves like an oracle algorithm asymptotically, and provides principled guidance for hyperparameter selection through universal scaling laws.
Abstract: Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE), derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm – one with access to a value function that quantifies the goodness of its learning policy at each training iteration – and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offers principled guidance for selecting the optimal group size. Empirical experiments further validate our theoretical findings, demonstrating that the optimal group size is universal, and verify the oracle property of GRPO.
[1248] SphUnc: Hyperspherical Uncertainty Decomposition and Causal Identification via Information Geometry
Rong Fu, Chunlei Meng, Jinshuo Liu, Dianyu Zhao, Yongtai Liu, Yibo Meng, Xiaowen Ma, Wangyu Wu, Yangchen Zeng, Kangning Cui, Shuaishuai Cao, Simon Fong
Main category: cs.LG
TL;DR: SphUnc: A unified framework combining hyperspherical representation learning with structural causal modeling for calibrated predictions and interpretable uncertainty in multi-agent systems.
Details
Motivation: Reliable decision-making in complex multi-agent systems requires calibrated predictions and interpretable uncertainty quantification, especially for understanding higher-order interactions and causal relationships between agents.Method: Combines hyperspherical representation learning (mapping features to unit hypersphere latents using von Mises-Fisher distributions) with structural causal modeling. Uses information-geometric fusion to decompose uncertainty into epistemic and aleatoric components, and enables directed influence identification and interventional reasoning through sample-based simulation on spherical latents.
Result: Empirical evaluations on social and affective benchmarks demonstrate improved accuracy, better calibration, and interpretable causal signals compared to baseline methods.
Conclusion: Establishes a geometric-causal foundation for uncertainty-aware reasoning in multi-agent settings with higher-order interactions, providing both calibrated predictions and interpretable uncertainty decomposition.
Abstract: Reliable decision-making in complex multi-agent systems requires calibrated predictions and interpretable uncertainty. We introduce SphUnc, a unified framework combining hyperspherical representation learning with structural causal modeling. The model maps features to unit hypersphere latents using von Mises-Fisher distributions, decomposing uncertainty into epistemic and aleatoric components through information-geometric fusion. A structural causal model on spherical latents enables directed influence identification and interventional reasoning via sample-based simulation. Empirical evaluations on social and affective benchmarks demonstrate improved accuracy, better calibration, and interpretable causal signals, establishing a geometric-causal foundation for uncertainty-aware reasoning in multi-agent settings with higher-order interactions.
[1249] PARWiS: Winner determination under shoestring budgets using active pairwise comparisons
Shailendra Bhandari
Main category: cs.LG
TL;DR: PARWiS algorithm for active pairwise comparisons under limited budgets, with contextual and RL variants, evaluated on synthetic and real-world datasets showing superiority over baselines.
Details
Motivation: Active pairwise comparisons with limited budgets is challenging in preference-based learning; need efficient algorithms to identify best items under shoestring budgets.Method: PARWiS algorithm using spectral ranking and disruptive pair selection, extended with contextual variant (Contextual PARWiS) and reinforcement learning variant (RL PARWiS). Evaluated against Double Thompson Sampling and random selection on synthetic, Jester, and MovieLens datasets with budgets of 40, 60, 80 comparisons for 20 items.
Result: PARWiS and RL PARWiS outperform baselines across all datasets, especially in Jester with higher Δ₁,₂. Performance gaps narrow in more challenging MovieLens with smaller Δ₁,₂. Contextual PARWiS shows comparable performance to PARWiS, suggesting contextual features need further tuning.
Conclusion: PARWiS and its RL variant are effective for active pairwise comparison under limited budgets, with contextual features showing potential but requiring optimization.
Abstract: Determining a winner among a set of items using active pairwise comparisons under a limited budget is a challenging problem in preference-based learning. The goal of this study is to implement and evaluate the PARWiS algorithm, which shows spectral ranking and disruptive pair selection to identify the best item under shoestring budgets. This work have extended the PARWiS with a contextual variant (Contextual PARWiS) and a reinforcement learning-based variant (RL PARWiS), comparing them against baselines, including Double Thompson Sampling and a random selection strategy. This evaluation spans synthetic and real-world datasets (Jester and MovieLens), using budgets of 40, 60, and 80 comparisons for 20 items. The performance is measured through recovery fraction, true rank of reported winner, reported rank of true winner, and cumulative regret, alongside the separation metric (Δ_{1,2}). Results show that PARWiS and RL PARWiS outperform baselines across all datasets, particularly in the Jester dataset with a higher (Δ_{1,2}), while performance gaps narrow in the more challenging MovieLens dataset with a smaller (Δ_{1,2}). Contextual PARWiS shows comparable performance to PARWiS, indicating that contextual features may require further tuning to provide significant benefits.
[1250] Scaling of learning time for high dimensional inputs
Carlos Stein Brito
Main category: cs.LG
TL;DR: Theoretical analysis shows Hebbian learning for ICA has supralinear scaling of learning time with input dimensionality, revealing fundamental limitations for high-dimensional learning.
Details
Motivation: To understand the trade-off between model expressivity and learning time in neural networks, particularly how learning time depends on input dimensionality, which could explain connectivity and learning times in both artificial and biological networks.Method: Theoretical analysis of Hebbian learning model performing independent component analysis (ICA), using geometry of high-dimensional spaces to reduce learning dynamics to a unidimensional problem.
Result: Learning times show supralinear scaling with input dimensionality - higher dimensions lead to smaller initial learning gradients and larger learning times that become quickly prohibitive.
Conclusion: Reveals fundamental limitation for learning in high dimensions and provides insights into optimal neural network design based on data complexity, offering new framework for analyzing learning dynamics and model complexity.
Abstract: Representation learning from complex data typically involves models with a large number of parameters, which in turn require large amounts of data samples. In neural network models, model complexity grows with the number of inputs to each neuron, with a trade-off between model expressivity and learning time. A precise characterization of this trade-off would help explain the connectivity and learning times observed in artificial and biological networks. We present a theoretical analysis of how learning time depends on input dimensionality for a Hebbian learning model performing independent component analysis. Based on the geometry of high-dimensional spaces, we show that the learning dynamics reduce to a unidimensional problem, with learning times dependent only on initial conditions. For higher input dimensions, initial parameters have smaller learning gradients and larger learning times. We find that learning times have supralinear scaling, becoming quickly prohibitive for high input dimensions. These results reveal a fundamental limitation for learning in high dimensions and help elucidate how the optimal design of neural networks depends on data complexity. Our approach outlines a new framework for analyzing learning dynamics and model complexity in neural network models.
[1251] Operator Learning Using Weak Supervision from Walk-on-Spheres
Hrishikesh Viswanath, Hong Chul Nam, Xi Deng, Julius Berner, Anima Anandkumar, Aniket Bera
Main category: cs.LG
TL;DR: WoS-NO uses Monte Carlo walk-on-spheres method to generate weak supervision for training neural PDE solvers without expensive data generation or unstable PINN optimization.
Details
Motivation: Training neural PDE solvers is bottlenecked by expensive data generation or unstable physics-informed neural networks (PINNs) with challenging optimization landscapes due to higher-order derivatives.Method: Proposes Walk-on-Spheres Neural Operator (WoS-NO) that uses Monte Carlo walk-on-spheres method to generate cheap, noisy estimates of PDE solutions as weak supervision during training, creating a data-free physics-informed objective.
Result: Achieves up to 8.75× improvement in L2-error, 6.31× improvement in training speed, and 2.97× reduction in GPU memory consumption compared to standard physics-informed training schemes.
Conclusion: WoS-NO provides a mesh-free framework that avoids expensive pre-computed datasets, eliminates need for higher-order derivative computations, and demonstrates zero-shot generalization to novel PDE parameters and domains.
Abstract: Training neural PDE solvers is often bottlenecked by expensive data generation or unstable physics-informed neural network (PINN) that involves challenging optimization landscapes due to higher-order derivatives. To tackle this issue, we propose an alternative approach using Monte Carlo approaches to estimate the solution to the PDE as a stochastic process for weak supervision during training. Leveraging the walk-on-spheres method, we introduce a learning scheme called \emph{Walk-on-Spheres Neural Operator (WoS-NO)} which uses weak supervision from WoS to train any given neural operator. We propose to amortize the cost of Monte Carlo walks across the distribution of PDE instances using stochastic representations from the WoS algorithm to generate cheap, noisy, estimates of the PDE solution during training. This is formulated into a data-free physics-informed objective where a neural operator is trained to regress against these weak supervisions, allowing the operator to learn a generalized solution map for an entire family of PDEs. This strategy results in a mesh-free framework that operates without expensive pre-computed datasets, avoids the need for computing higher-order derivatives for loss functions that are memory-intensive and unstable, and demonstrates zero-shot generalization to novel PDE parameters and domains. Experiments show that for the same number of training steps, our method exhibits up to 8.75$\times$ improvement in $L_2$-error compared to standard physics-informed training schemes, up to 6.31$\times$ improvement in training speed, and reductions of up to 2.97$\times$ in GPU memory consumption. We present the code at https://github.com/neuraloperator/WoS-NO
[1252] Subliminal Signals in Preference Labels
Isotta Magistrali, Frédéric Berdoz, Sam Dauncey, Roger Wattenhofer
Main category: cs.LG
TL;DR: Preference labels in LLM-as-a-judge frameworks can serve as covert communication channels, allowing biased judges to transmit unintended behavioral traits to student models through iterative alignment.
Details
Motivation: The paper challenges the assumption that binary preference labels only provide semantic supervision about response quality in scalable oversight frameworks where LLMs evaluate each other's training. The authors investigate whether preference labels can function as covert communication channels for transmitting unintended behavioral traits.Method: The researchers demonstrate that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through preference assignments. They show this transmission strengthens across iterative alignment rounds in LLM-as-a-judge frameworks.
Result: The findings reveal that preference labels can indeed function as covert communication channels, allowing biased judges to transmit and strengthen unintended behavioral traits in student models through iterative alignment processes.
Conclusion: Robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives beyond their stated evaluation tasks.
Abstract: As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other’s training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through preference assignments, which even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.
[1253] Learn Hard Problems During RL with Reference Guided Fine-tuning
Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai
Main category: cs.LG
TL;DR: ReGFT uses human-written reference solutions to generate positive training trajectories for RL in mathematical reasoning, overcoming reward sparsity by creating in-distribution examples that bootstrap RL training.
Details
Motivation: RL for mathematical reasoning suffers from reward sparsity where LLMs fail to sample correct trajectories on hard problems, preventing meaningful positive feedback. While human-written reference solutions exist, direct fine-tuning on them is ineffective because models cannot imitate proofs outside their reasoning distribution.Method: Reference-Guided Fine-Tuning (ReGFT) uses partial human-written reference solutions to guide model generation, creating positive trajectories that remain within the model’s reasoning space. These reference-guided trajectories are used for fine-tuning before RL, increasing solvable problems and providing better initialization for RL training.
Result: ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL across three benchmarks (AIME24, AIME25, BeyondAIME). It effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
Conclusion: ReGFT is a simple and effective method that leverages human reference solutions to bootstrap RL training for mathematical reasoning, addressing the fundamental challenge of reward sparsity in complex reasoning tasks.
Abstract: Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model’s reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
[1254] MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers
Abdulhamid M. Mousa, Yu Fu, Rakhmonberdi Khajiev, Jalaledin M. Azzabi, Abdulkarim M. Mousa, Peng Yang, Yunusa Haruna, Ming Liu
Main category: cs.LG
TL;DR: MOSAIC is an open-source platform that enables heterogeneous agents (RL policies, LLMs, VLMs, and humans) to operate together in the same environments for reproducible cross-paradigm research.
Details
Motivation: Current infrastructure lacks the ability to deploy agents from different decision-making paradigms (RL, LLMs, VLMs) within the same environment, making it difficult to study them in hybrid multi-agent settings or compare their behavior under identical conditions.Method: Three key contributions: 1) IPC-based worker protocol for isolated subprocess workers, 2) Operator abstraction for unified agent interface, 3) Deterministic cross-paradigm evaluation framework with manual and script modes.
Result: MOSAIC provides an open, visual-first platform that bridges RL, LLM, and VLM communities by enabling heterogeneous agents to operate together in reproducible environments.
Conclusion: MOSAIC facilitates reproducible cross-paradigm research across RL, LLM, and human-in-the-loop communities by providing a unified platform for deploying and comparing diverse agents.
Abstract: Reinforcement learning (RL), large language models (LLMs), and vision-language models (VLMs) have been widely studied in isolation. However, existing infrastructure lacks the ability to deploy agents from different decision-making paradigms within the same environment, making it difficult to study them in hybrid multi-agent settings or to compare their behaviour fairly under identical conditions. We present MOSAIC, an open-source platform that bridges this gap by incorporating a diverse set of existing reinforcement learning environments and enabling heterogeneous agents (RL policies, LLMs, VLMs, and human players) to operate within them in ad-hoc team settings with reproducible results. MOSAIC introduces three contributions. (i) An IPC-based worker protocol that wraps both native and third-party frameworks as isolated subprocess workers, each executing its native training and inference logic unmodified, communicating through a versioned inter-process protocol. (ii) An operator abstraction that forms an agent-level interface by mapping workers to agents: each operator, regardless of whether it is backed by an RL policy, an LLM, or a human, conforms to a minimal unified interface. (iii) A deterministic cross-paradigm evaluation framework offering two complementary modes: a manual mode that advances up to N concurrent operators in lock-step under shared seeds for fine-grained visual inspection of behavioural differences, and a script mode that drives automated, long-running evaluation through declarative Python scripts, for reproducible experiments. We release MOSAIC as an open, visual-first platform to facilitate reproducible cross-paradigm research across the RL, LLM, and human-in-the-loop communities.
[1255] S2O: Enhancing Adversarial Training with Second-Order Statistics of Weights
Gaojie Jin, Xinping Yi, Wei Huang, Sven Schewe, Xiaowei Huang
Main category: cs.LG
TL;DR: Proposes S²O, a novel adversarial training approach that treats model weights as random variables and optimizes their second-order statistics to improve robustness and generalization.
Details
Motivation: Adversarial training is effective but typically uses min-max optimization over weights and perturbations. The paper aims to enhance adversarial training by treating weights as random variables and optimizing their second-order statistics, challenging the unrealistic assumption of weight independence in prior PAC-Bayesian frameworks.Method: Introduces Second-Order Statistics Optimization (S²O) over model weights by relaxing the independence assumption in PAC-Bayesian frameworks. Derives an improved PAC-Bayesian robust generalization bound and proposes optimizing second-order statistics of weights to tighten this bound.
Result: Extensive experiments show S²O enhances robustness and generalization of neural networks when used alone, and also augments other state-of-the-art adversarial training techniques.
Conclusion: Treating model weights as random variables and optimizing their second-order statistics provides a novel and effective approach to improve adversarial training, with both theoretical justification and empirical validation.
Abstract: Adversarial training has emerged as a highly effective way to improve the robustness of deep neural networks (DNNs). It is typically conceptualized as a min-max optimization problem over model weights and adversarial perturbations, where the weights are optimized using gradient descent methods, such as SGD. In this paper, we propose a novel approach by treating model weights as random variables, which paves the way for enhancing adversarial training through \textbf{S}econd-Order \textbf{S}tatistics \textbf{O}ptimization (S$^2$O) over model weights. We challenge and relax a prevalent, yet often unrealistic, assumption in prior PAC-Bayesian frameworks: the statistical independence of weights. From this relaxation, we derive an improved PAC-Bayesian robust generalization bound. Our theoretical developments suggest that optimizing the second-order statistics of weights can substantially tighten this bound. We complement this theoretical insight by conducting an extensive set of experiments that demonstrate that S$^2$O not only enhances the robustness and generalization of neural networks when used in isolation, but also seamlessly augments other state-of-the-art adversarial training techniques. The code is available at https://github.com/Alexkael/S2O.
[1256] GlassMol: Interpretable Molecular Property Prediction with Concept Bottleneck Models
Oscar Rivera, Ziqing Wang, Matthieu Dagommer, Abhishek Pandey, Kaize Ding
Main category: cs.LG
TL;DR: GlassMol introduces a Concept Bottleneck Model for molecular property prediction that addresses challenges in interpretability while maintaining performance, using automated concept curation and LLM-guided selection.
Details
Motivation: Current ML models for molecular property prediction are black boxes, which is problematic in drug discovery where safety is critical. Existing interpretability methods suffer from effectiveness-trustworthiness trade-offs, while Concept Bottleneck Models face three gaps when adapted to chemistry: Relevance Gap, Annotation Gap, and Capacity Gap.Method: GlassMol is a model-agnostic Concept Bottleneck Model that addresses the three gaps through automated concept curation and LLM-guided concept selection. It projects molecular inputs to human-interpretable concepts before making predictions, ensuring faithful explanations.
Result: Experiments across thirteen benchmarks demonstrate that GlassMol generally matches or exceeds black-box baselines, suggesting interpretability doesn’t sacrifice performance and challenging the commonly assumed trade-off.
Conclusion: GlassMol successfully addresses interpretability challenges in molecular property prediction while maintaining competitive performance, providing a promising approach for trustworthy AI in drug discovery.
Abstract: Machine learning accelerates molecular property prediction, yet state-of-the-art Large Language Models and Graph Neural Networks operate as black boxes. In drug discovery, where safety is critical, this opacity risks masking false correlations and excluding human expertise. Existing interpretability methods suffer from the effectiveness-trustworthiness trade-off: explanations may fail to reflect a model’s true reasoning, degrade performance, or lack domain grounding. Concept Bottleneck Models (CBMs) offer a solution by projecting inputs to human-interpretable concepts before readout, ensuring that explanations are inherently faithful to the decision process. However, adapting CBMs to chemistry faces three challenges: the Relevance Gap (selecting task-relevant concepts from a large descriptor space), the Annotation Gap (obtaining concept supervision for molecular data), and the Capacity Gap (degrading performance due to bottleneck constraints). We introduce GlassMol, a model-agnostic CBM that addresses these gaps through automated concept curation and LLM-guided concept selection. Experiments across thirteen benchmarks demonstrate that \method generally matches or exceeds black-box baselines, suggesting that interpretability does not sacrifice performance and challenging the commonly assumed trade-off. Code is available at https://github.com/walleio/GlassMol.
[1257] The Impact of Battery Cell Configuration on Electric Vehicle Performance: An XGBoost-Based Classification with SHAP Interpretability
Santanam Wishal, Louis Filiepe Tio Jansel, Matthew Abednego Inkiriwang, Jason Sebastian
Main category: cs.LG
TL;DR: Machine learning framework using XGBoost classifies EV acceleration performance based on battery configuration, achieving 87.5% accuracy with SHAP for interpretability.
Details
Motivation: Current literature overlooks the complex, non-linear relationship between battery configuration and electric vehicle performance, despite the EV market's focus on dynamic performance and rapid charging.Method: Proposed a machine learning framework categorizing EV acceleration into High (≤4.0s), Mid (4.0-7.0s), and Low (>7.0s) using XGBoost classifier on 276 EV samples, with SHAP for interpretability.
Result: Achieved 87.5% predictive accuracy, 0.968 ROC-AUC, and 0.812 MCC. Analysis revealed that increasing battery cell count initially boosts power but mass and complexity eventually diminish performance gains.
Conclusion: Battery configuration in EVs must balance system complexity and architectural configuration to achieve and maintain optimal vehicle performance, requiring careful engineering trade-offs.
Abstract: As the electric vehicle (EV) market continues to prioritize dynamic performance and rapid charging, battery configuration has rapidly evolved. Despite this, current literature has often overlooked the complex, non-linear relationship between battery configuration and electric vehicle performance. To address this gap, this study proposes a machine learning framework which categorizes the EV acceleration performance into High (<= 4.0 seconds), Mid (4.0 - 7.0 seconds), and Low (> 7.0 seconds). Utilizing a preprocessed dataset consisting of 276 EV samples, an Extreme Gradient Boosting (XGBoost) classifier was utilized, achieving 87.5% predictive accuracy, a 0.968 ROC-AUC, and a 0.812 MCC. In order to ensure engineering transparency SHapley Additive exPlanations (SHAP) were employed. Results of analysis shows that an increase in battery cell count initially boosts power delivery, but its mass and complexity diminished performance gains eventually. As such, these findings indicate that battery configuration in EVs must balance system complexity and architectural configuration in order to receive and retain optimal vehicle performance.
[1258] Attention Smoothing Is All You Need For Unlearning
Saleh Zare Zade, Xiangyu Zhou, Sijia Liu, Dongxiao Zhu
Main category: cs.LG
TL;DR: ASU is an attention smoothing unlearning method that uses self-distillation with temperature scaling to suppress lexical and semantic associations for removing memorized content from LLMs while maintaining coherence.
Details
Motivation: LLMs memorize sensitive, copyrighted, or hazardous content, creating privacy and legal concerns. Current unlearning methods have unstable trade-offs between forgetting and utility, often producing incoherent outputs and failing to generalize due to persistent associations in attention mechanisms.Method: Attention Smoothing Unlearning (ASU) frames unlearning as self-distillation from a forget-teacher derived from the model’s own attention. By increasing softmax temperature, ASU flattens attention distributions to directly suppress lexical-level and semantic-level associations responsible for reconstructing memorized knowledge.
Result: ASU outperforms baselines on TOFU, MUSE, and WMDP benchmarks, as well as real-world and continual unlearning scenarios across question answering and text completion tasks. It achieves robust unlearning with minimal loss of model utility.
Conclusion: ASU provides a principled framework for effective unlearning that erases factual information while maintaining response coherence, addressing key limitations of existing unlearning methods.
Abstract: Large Language Models are prone to memorizing sensitive, copyrighted, or hazardous content, posing significant privacy and legal concerns. Retraining from scratch is computationally infeasible, whereas current unlearning methods exhibit unstable trade-offs between forgetting and utility, frequently producing incoherent outputs on forget prompts and failing to generalize due to the persistence of lexical-level and semantic-level associations in attention. We propose Attention Smoothing Unlearning (ASU), a principled framework that casts unlearning as self-distillation from a forget-teacher derived from the model’s own attention. By increasing the softmax temperature, ASU flattens attention distributions and directly suppresses the lexical-level and semantic-level associations responsible for reconstructing memorized knowledge. This results in a bounded optimization objective that erases factual information yet maintains coherence in responses to forget prompts. Empirical evaluation on TOFU, MUSE, and WMDP, along with real-world and continual unlearning scenarios across question answering and text completion, demonstrates that ASU outperforms the baselines for most unlearning scenarios, delivering robust unlearning with minimal loss of model utility.
[1259] JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
Main category: cs.LG
TL;DR: JailNewsBench: First benchmark for evaluating LLM robustness against jailbreak-induced fake news generation across 34 regions and 22 languages, revealing significant safety imbalances.
Details
Motivation: Fake news undermines societal trust and decision-making, requiring multilingual and regional evaluation of LLM risks since malicious users can bypass safeguards through jailbreak attacks to generate fake news.Method: Proposed JailNewsBench benchmark spanning 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks with approximately 300k instances.
Result: Evaluation of 9 LLMs shows maximum attack success rate reached 86.3% and maximum harmfulness score was 3.5/5. English and U.S.-related topics had significantly lower defensive performance than other regions, revealing safety imbalances.
Conclusion: JailNewsBench reveals substantial imbalances in LLM safety across languages and regions, with existing safety datasets having limited coverage of fake news compared to categories like toxicity and social bias.
Abstract: Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at https://github.com/kanekomasahiro/jail_news_bench.
[1260] Integrating LTL Constraints into PPO for Safe Reinforcement Learning
Maifang Zhang, Hang Yu, Qian Zuo, Cheng Wang, Vaishak Belle, Fengxiang He
Main category: cs.LG
TL;DR: PPO-LTL integrates Linear Temporal Logic constraints into Proximal Policy Optimization for safe reinforcement learning, using automata to monitor safety violations and translate them into penalty signals via Lagrangian optimization.
Details
Motivation: The paper addresses the need for safe reinforcement learning in robotics where complex safety requirements exist. Current methods lack rigorous formal representations of safety constraints, and LTL provides a way to systematically encode and monitor complex safety regulations.Method: The framework integrates LTL constraints into PPO by using limit-deterministic Büchi automata to monitor safety violations. A logic-to-cost mechanism translates violations into penalty signals, which are then incorporated into policy optimization via the Lagrangian scheme.
Result: Extensive experiments on Zones and CARLA environments show that PPO-LTL consistently reduces safety violations while maintaining competitive performance compared to state-of-the-art methods.
Conclusion: PPO-LTL provides an effective framework for safe reinforcement learning by formally integrating LTL safety constraints, enabling systematic safety monitoring and violation reduction without compromising task performance.
Abstract: This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as regulations that broadly exist in robotics, enabling systematic monitoring of safety requirements. Violations against LTL constraints are monitored by limit-deterministic Büchi automata, and then translated by a logic-to-cost mechanism into penalty signals. The signals are further employed for guiding the policy optimization via the Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that our PPO-LTL can consistently reduce safety violations, while maintaining competitive performance, against the state-of-the-art methods. The code is at https://github.com/EVIEHub/PPO-LTL.
[1261] Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models
Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni
Main category: cs.LG
TL;DR: Theoretical analysis of transformers on in-context weight prediction reveals why pretraining needs large datasets, SFT works best with small challenging examples, and RL benefits from large-scale non-difficult data.
Details
Motivation: To understand why pretraining and reinforcement learning require large datasets while supervised fine-tuning excels on smaller ones, and to determine what constitutes high-quality SFT data.Method: Theoretical analysis of transformers trained on an in-context weight prediction task for linear regression, followed by experimental validation on large nonlinear transformer architectures.
Result: Balanced pretraining data induces latent capabilities activated during post-training; SFT learns best from small sets of examples challenging for the pretrained model; RL is most effective on large-scale data not overly difficult for the pretrained model.
Conclusion: Theoretical insights explain dataset size requirements across different training stages: pretraining needs scale for latent capabilities, SFT benefits from targeted challenging examples, and RL requires large non-difficult data.
Abstract: Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: $(i)$ balanced pretraining data can induce latent capabilities later activated during post-training, and $(ii)$ SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
[1262] I Can’t Believe It’s Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha
Main category: cs.LG
TL;DR: Safety classifiers on frozen embeddings fail with small model updates, causing silent high-confidence misclassifications despite minimal confidence drops.
Details
Motivation: To investigate the assumption that safety classifiers trained on frozen embeddings remain stable across model updates in instruction-tuned reasoning models, which is critical for AI safety in production systems.Method: Systematically study representation stability by applying normalized perturbations to embeddings and measuring classifier performance degradation, analyzing confidence distributions, and comparing class separability between base and instruction-tuned models.
Result: Small perturbations (σ=0.02, ≈1° angular drift) reduce classifier ROC-AUC from 85% to 50%, with only 14% mean confidence drop. 72% of misclassifications occur with high confidence, creating silent failures. Instruction-tuned models show 20% worse class separability than base models.
Conclusion: The assumption that safety mechanisms transfer across model versions is fundamentally flawed; aligned systems are paradoxically harder to safeguard due to reduced class separability, exposing fragility in production AI safety architectures.
Abstract: Instruction tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $σ=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85%$ to $50%$ ROC-AUC. Critically, mean confidence only drops $14%$, producing dangerous silent failures where $72%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20$%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
[1263] Nonconvex Latent Optimally Partitioned Block-Sparse Recovery via Log-Sum and Minimax Concave Penalties
Takanobu Furuhashi, Hiroki Kuroda, Masahiro Yukawa, Qibin Zhao, Hidekata Hontani, Tatsuya Yokota
Main category: cs.LG
TL;DR: Nonconvex regularization methods (LogLOP-l2/l1 and AdaLOP-l2/l1) for block-sparse signal recovery with unknown block partitions, addressing underestimation bias of convex approaches through novel variational formulations.
Details
Motivation: Existing convex regularization methods for block-sparse signal recovery suffer from underestimation bias, and current nonconvex approaches like GME and Bayesian methods are limited to squared-error data fidelity terms, restricting their applicability to broader problems.Method: Extends log-sum penalty and Minimax Concave Penalty (MCP) to block-sparse domain via novel variational formulations, developing efficient ADMM-based algorithms compatible with various data fidelity terms beyond squared-error.
Result: Numerical experiments on synthetic data, angular power spectrum estimation, and nanopore current denoising show superior estimation accuracy compared to state-of-the-art baselines, with stable empirical convergence.
Conclusion: Proposed nonconvex regularization methods effectively address underestimation bias in block-sparse signal recovery, offering broader applicability through compatibility with various data fidelity terms and demonstrating practical advantages in multiple applications.
Abstract: We propose two nonconvex regularization methods, LogLOP-l2/l1 and AdaLOP-l2/l1, for recovering block-sparse signals with unknown block partitions. These methods address the underestimation bias of existing convex approaches by extending log-sum penalty and the Minimax Concave Penalty (MCP) to the block-sparse domain via novel variational formulations. Unlike Generalized Moreau Enhancement (GME) and Bayesian methods dependent on the squared-error data fidelity term, our proposed methods are compatible with a broad range of data fidelity terms. We develop efficient Alternating Direction Method of Multipliers (ADMM)-based algorithms for these formulations that exhibit stable empirical convergence. Numerical experiments on synthetic data, angular power spectrum estimation, and denoising of nanopore currents demonstrate that our methods outperform state-of-the-art baselines in estimation accuracy.
[1264] PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure
Joshua Steier
Main category: cs.LG
TL;DR: Survey of PAC (Probably Approximately Correct) reinforcement learning theory from 2018-2025, introducing CSO framework to analyze sample complexity results through coverage, structure, and objective factors.
Details
Motivation: Average-case metrics are insufficient when data is scarce or mistakes are costly; practitioners need guarantees that with high probability, learned policies are close to optimal within a certain number of episodes. The paper aims to survey recent progress in PAC RL theory.Method: Introduces Coverage-Structure-Objective (CSO) framework as an interpretive template to decompose PAC sample complexity results into three factors: coverage (data acquisition), structure (MDP/function-class complexity), and objective (learner requirements). Surveys technical results across tabular RL, function approximation, linear/kernel models, reward-free exploration, and offline RL.
Result: Provides comprehensive survey of PAC RL theory developments from 2018-2025, organized through CSO framework. Includes practitioner tools like rate lookup tables, Bellman residual diagnostics, coverage estimation methods, and policy certificates.
Conclusion: The CSO framework offers a unified way to understand PAC RL sample complexity results, identifies bottlenecks, and enables cross-setting comparisons. The survey covers foundational theory and provides practical tools while highlighting open problems where coverage, structure, and computation remain entangled.
Abstract: When data is scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least $1-δ$, the learned policy is $\varepsilon$-close to optimal after $N$ episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.
[1265] Provable and Practical In-Context Policy Optimization for Self-Improvement
Tianrun Yu, Yuxiao Yang, Zhaoyang Wang, Kaixiang Zhao, Porter Jenkins, Xuchao Zhang, Chetan Bansal, Huaxiu Yao, Weitong Zhang
Main category: cs.LG
TL;DR: ICPO enables LLMs to optimize responses at inference time through multi-round self-reflection without parameter updates, using self-assessed rewards and minimum entropy selection for robust mathematical reasoning.
Details
Motivation: The paper addresses test-time scaling where models improve answers through self-reflection at inference. Current methods lack theoretical understanding of how LLMs can perform policy optimization in-context without parameter updates.Method: Introduces In-Context Policy Optimization (ICPO) where agents optimize responses using self-assessed or external rewards in context. Proposes Minimum-Entropy ICPO (ME-ICPO) that iteratively refines responses using self-assessed rewards with minimum entropy selection for robustness via majority voting.
Result: ME-ICPO achieves competitive, top-tier performance on standard mathematical reasoning tasks while keeping inference costs affordable compared to other inference-time algorithms.
Conclusion: ICPO provides principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling, particularly for mathematical reasoning tasks.
Abstract: We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling for mathematical reasoning.
[1266] Relatively Smart: A New Approach for Instance-Optimal Learning
Shaddin Dughmi, Alireza F. Pour
Main category: cs.LG
TL;DR: The paper introduces “relatively smart learning” as a framework where supervised learners compete only with certifiable semi-supervised guarantees, bypassing previous impossibility results in Smart PAC learning.
Details
Motivation: Prior work on Smart PAC learning showed that marginal-by-marginal guarantees were only possible for "most" marginals with respect to fixed measures, but not generally. The authors identify an "indistinguishability" phenomenon where some marginals cannot be distinguished from others requiring different learning approaches, making semi-supervised guarantees non-actionable.Method: Proposes “relatively smart learning” framework where supervised learners compete only with the best certifiable semi-supervised guarantee. Analyzes this framework in distribution-free settings (using OIG learner) and distribution-family settings, examining impossibility results and sample complexity trade-offs.
Result: Shows that relatively smart learning bypasses previous impossibility results. In distribution-free settings, the OIG learner is relatively smart up to squaring sample complexity, and no supervised learner can do better. For distribution-family settings, relatively smart learning can be impossible or require idiosyncratic approaches, with difficulty non-monotone in distribution family inclusion.
Conclusion: Relatively smart learning provides a viable framework that addresses the limitations of Smart PAC learning by focusing on certifiable guarantees, though it introduces new complexities in distribution-family settings.
Abstract: We revisit the framework of Smart PAC learning, which seeks supervised learners which compete with semi-supervised learners that are provided full knowledge of the marginal distribution on unlabeled data. Prior work has shown that such marginal-by-marginal guarantees are possible for “most” marginals, with respect to an arbitrary fixed and known measure, but not more generally. We discover that this failure can be attributed to an “indistinguishability” phenomenon: There are marginals which cannot be statistically distinguished from other marginals that require different learning approaches. In such settings, semi-supervised learning cannot certify its guarantees from unlabeled data, rendering them arguably non-actionable. We propose relatively smart learning, a new framework which demands that a supervised learner compete only with the best “certifiable” semi-supervised guarantee. We show that such modest relaxation suffices to bypass the impossibility results from prior work. In the distribution-free setting, we show that the OIG learner is relatively smart up to squaring the sample complexity, and show that no supervised learning algorithm can do better. For distribution-family settings, we show that relatively smart learning can be impossible or can require idiosyncratic learning approaches, and its difficulty can be non-monotone in the inclusion order on distribution families.
[1267] Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain
Yuma Okochi, Fabio Milentiansen Sim, Tomoyasu Okada
Main category: cs.LG
TL;DR: A method for generating synthetic instruction data with Chain-of-Thought reasoning for domain-specific LLM adaptation, demonstrated in finance with 9.5B token dataset and performance improvements.
Details
Motivation: Adapting LLMs to specific domains while maintaining both domain expertise and reasoning ability is challenging. Current approaches often lack high-quality instruction data with reasoning traces for specialized domains.Method: Proposes a general method to construct synthetic instruction data starting from domain-specific vocabulary. Applied to finance to create a large-scale dataset (~9.5B tokens) with Chain-of-Thought reasoning traces.
Result: Evaluation shows performance improvements over baseline models on financial benchmarks. Also reports findings on the impact of reasoning trace length on performance and its limitations.
Conclusion: The method effectively creates high-quality synthetic instruction data for domain adaptation, improving both domain expertise and reasoning. Models and datasets are open-sourced.
Abstract: In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets on https://huggingface.co/nri-ai .
[1268] UTICA: Multi-Objective Self-Distllation Foundation Model Pretraining for Time Series Classification
Yessin Moakher, Youssef Attia El Hili, Vasilii Feofanov
Main category: cs.LG
TL;DR: Utica adapts DINOv2-style self-distillation to pretrain time series foundation models using Mantis tokenizer/transformer, achieving SOTA classification on UCR/UEA benchmarks.
Details
Motivation: The potential of non-contrastive methods (successful in computer vision) remains underexplored for time series foundation models, despite self-supervised models achieving success across domains.Method: Adapts DINOv2-style self-distillation to pretrain time series foundation model using Mantis tokenizer and transformer encoder architecture. Uses student-teacher framework that learns representations capturing temporal invariance via augmented crops and fine-grained local structure via patch masking.
Result: Achieves state-of-the-art classification performance on both UCR and UEA time series benchmarks.
Conclusion: Non-contrastive methods are a promising and complementary pretraining strategy for time series foundation models.
Abstract: Self-supervised foundation models have achieved remarkable success across domains, including time series. However, the potential of non-contrastive methods, a paradigm that has driven significant advances in computer vision, remains underexplored for time series. In this work, we adapt DINOv2-style self-distillation to pretrain a time series foundation model, building on the Mantis tokenizer and transformer encoder architecture as our backbone. Through a student-teacher framework, our method Utica learns representations that capture both temporal invariance via augmented crops and fine-grained local structure via patch masking. Our approach achieves state-of-the-art classification performance on both UCR and UEA benchmarks. These results suggest that non-contrastive methods are a promising and complementary pretraining strategy for time series foundation models.
[1269] Fed-GAME: Personalized Federated Learning with Graph Attention Mixture-of-Experts For Time-Series Forecasting
Yi Li, Han Liu, Mingfeng Fan, Guo Chen, Chaojie Li, Biplab Sikdar
Main category: cs.LG
TL;DR: Fed-GAME: A federated learning framework for time-series forecasting using learnable dynamic implicit graphs and personalized aggregation via Graph Attention Mixture-of-Experts.
Details
Motivation: Existing federated learning methods on graphs rely on static topologies and struggle with client heterogeneity, limiting their effectiveness for distributed time-series forecasting tasks.Method: Proposes Fed-GAME framework with decoupled parameter difference-based update protocol. Clients transmit parameter differences between fine-tuned private models and shared global model. Server decomposes differences into two streams: averaged difference for global model updates and selective difference fed into Graph Attention Mixture-of-Experts (GAME) aggregator for fine-grained personalization.
Result: Experiments on two real-world electric vehicle charging datasets demonstrate that Fed-GAME outperforms state-of-the-art personalized FL baselines.
Conclusion: Fed-GAME effectively addresses client heterogeneity in federated graph learning for time-series forecasting through learnable dynamic implicit graphs and personalized aggregation mechanisms.
Abstract: Federated learning (FL) on graphs shows promise for distributed time-series forecasting. Yet, existing methods rely on static topologies and struggle with client heterogeneity. We propose Fed-GAME, a framework that models personalized aggregation as message passing over a learnable dynamic implicit graph. The core is a decoupled parameter difference-based update protocol, where clients transmit parameter differences between their fine-tuned private model and a shared global model. On the server, these differences are decomposed into two streams: (1) averaged difference used to updating the global model for consensus (2) the selective difference fed into a novel Graph Attention Mixture-of-Experts (GAME) aggregator for fine-grained personalization. In this aggregator, shared experts provide scoring signals while personalized gates adaptively weight selective updates to support personalized aggregation. Experiments on two real-world electric vehicle charging datasets demonstrate that Fed-GAME outperforms state-of-the-art personalized FL baselines.
[1270] Align and Filter: Improving Performance in Asynchronous On-Policy RL
Homayoun Honari, Roger Creus Castanyer, Michael Przystupa, Michael Noukhovitch, Pablo Samuel Castro, Glen Berseth
Main category: cs.LG
TL;DR: TVACO addresses policy lag in distributed RL by constraining policy updates to align with advantage estimates, improving robustness in classic tasks and LLM math reasoning.
Details
Motivation: Distributed training and high update frequency cause policy lag (mismatch between behavior and learning policies), which hinders scaling of on-policy learning algorithms to larger problems.Method: Proposes Total Variation-based Advantage aligned Constrained policy Optimization (TVACO), which constrains policy updates to mitigate policy lag by aligning with advantage estimates.
Result: Empirical validation shows TVACO offers better robustness to policy lag in classic RL tasks and modern RL for LLM math reasoning tasks.
Conclusion: TVACO effectively addresses policy lag issues in distributed RL settings, enabling better scaling of on-policy learning algorithms.
Abstract: Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: \textit{policy lag}, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose \textit{total Variation-based Advantage aligned Constrained policy Optimization (\methodacronym)} as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and a modern RL for LLM math reasoning task.
[1271] TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training
Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, Fei Wu, Kun Kuang
Main category: cs.LG
TL;DR: TopoCurate is an interaction-aware framework for training tool-use agents that projects multi-trial rollouts into semantic quotient topology to capture error recovery and strategic diversity, improving both supervised fine-tuning and reinforcement learning.
Details
Motivation: Current tool-use agent training relies on outcome-based filtering (SFT on successful trajectories, RL on pass-rate tasks), which ignores interaction dynamics. Successful trajectories may lack error recovery or be redundant, while pass rates don't distinguish structurally informative tasks from trivial ones.Method: Projects multi-trial rollouts from same task into unified semantic quotient topology by merging equivalent action-observation states. This transforms linear trajectories into structured manifold capturing how tool invocations and environmental responses drive divergence between effective strategies and failure modes. Uses dual-selection: for SFT prioritizes trajectories with reflective recovery, semantic efficiency, strategic diversity; for RL selects tasks with high error branch ratios and strategic heterogeneity.
Result: Achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines on BFCLv3 and Tau2 Bench benchmarks.
Conclusion: TopoCurate demonstrates that interaction-aware trajectory and task selection through topological analysis significantly improves tool-use agent training by addressing limitations of outcome-based filtering.
Abstract: Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose \textbf{TopoCurate}, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.
[1272] DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking
Gilad Turok, Chris De Sa, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: DUEL framework enables exact likelihood computation for masked diffusion models (MDMs), providing proper perplexity evaluation and revealing MDMs perform substantially better than previously thought, with potential to surpass autoregressive models.
Details
Motivation: Masked diffusion models lack proper perplexity evaluation - existing methods like ELBO are loose bounds or require biased external models, making it difficult to accurately assess MDM performance and compare sampling strategies.Method: Introduces DUEL framework that formalizes deterministic position selection in MDMs, enabling exact likelihood computation via a simple algorithm that evaluates under the same position selection used at test time.
Result: MDMs are substantially better than previously thought - the MDM-autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. Probability margin emerges as a strong default sampler, and oracle search shows MDMs can achieve 36.47 vs. 52.11 perplexity on AG News.
Conclusion: DUEL provides proper perplexity evaluation for MDMs for the first time, enabling principled comparison of samplers and revealing that MDMs have untapped potential to surpass autoregressive models.
Abstract: Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution, while generative perplexity requires a biased external model and ignores diversity. To address this, we introduce the \textsc{DUEL} framework, which formalizes \emph{deterministic} position selection, unifying leading MDM sampling strategies. We prove \textbf{\textsc{DUEL} admits \emph{exact} likelihood computation} via a simple algorithm, evaluated under the same position selection used at test time. This \textbf{gives MDMs proper perplexity for the first time} – the natural analogue of autoregressive perplexity. With proper perplexity in hand, we revisit key questions about MDMs. \textbf{MDMs are substantially better than previously thought}: the MDM-autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. \textsc{DUEL} enables the first principled comparison of fast, parallel samplers across compute budgets – an analysis impossible with the ELBO and unreliable with generative perplexity – identifying probability margin \citep{kim2025train} as a strong default. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models – achieving 36.47 vs.\ 52.11 perplexity on AG News – demonstrating the ceiling of MDM performance has not yet been reached.
[1273] Causal Neural Probabilistic Circuits
Weixin Chen, Han Zhao
Main category: cs.LG
TL;DR: CNPC combines neural attribute predictors with causal probabilistic circuits to enable concept interventions that respect causal dependencies among concepts, improving accuracy over standard concept bottleneck models.
Details
Motivation: Standard Concept Bottleneck Models (CBMs) allow interventions but overwrite only corrected concepts while ignoring causal dependencies among concepts, which can lead to suboptimal predictions.Method: Proposes Causal Neural Probabilistic Circuit (CNPC) that combines neural attribute predictors with causal probabilistic circuits compiled from causal graphs, using Product of Experts to fuse predictive distributions with interventional marginals.
Result: CNPC achieves higher task accuracy across different numbers of intervened attributes compared to five baseline models on five benchmark datasets in both in-distribution and out-of-distribution settings.
Conclusion: CNPC effectively models causal dependencies during concept interventions, providing more accurate predictions than standard CBMs while maintaining interpretability.
Abstract: Concept Bottleneck Models (CBMs) enhance the interpretability of end-to-end neural networks by introducing a layer of concepts and predicting the class label from the concept predictions. A key property of CBMs is that they support interventions, i.e., domain experts can correct mispredicted concept values at test time to improve the final accuracy. However, typical CBMs apply interventions by overwriting only the corrected concept while leaving other concept predictions unchanged, which ignores causal dependencies among concepts. To address this, we propose the Causal Neural Probabilistic Circuit (CNPC), which combines a neural attribute predictor with a causal probabilistic circuit compiled from a causal graph. This circuit supports exact, tractable causal inference that inherently respects causal dependencies. Under interventions, CNPC models the class distribution based on a Product of Experts (PoE) that fuses the attribute predictor’s predictive distribution with the interventional marginals computed by the circuit. We theoretically characterize the compositional interventional error of CNPC w.r.t. its modules and identify conditions under which CNPC closely matches the ground-truth interventional class distribution. Experiments on five benchmark datasets in both in-distribution and out-of-distribution settings show that, compared with five baseline models, CNPC achieves higher task accuracy across different numbers of intervened attributes.
[1274] Efficient RLVR Training via Weighted Mutual Information Data Selection
Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo
Main category: cs.LG
TL;DR: InSight: Information-guided data sampling method for RL training that improves over difficulty-only selection by incorporating both difficulty and evidence components for better training efficiency.
Details
Motivation: Existing RL data selection strategies rely on difficulty-based heuristics that equate difficulty with informativeness, neglecting epistemic uncertainty from limited evidence, leading to suboptimal training efficiency.Method: InSight uses a weighted mutual information objective with Bayesian latent success rates, decomposing uncertainty reduction into difficulty- and evidence-dependent components, and constructs stable acquisition scores based on mean belief rather than noisy outcomes.
Result: InSight achieves state-of-the-art performance with +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration with negligible computational overhead.
Conclusion: Information-guided sampling that accounts for both difficulty and evidence uncertainty significantly improves RL training efficiency and performance over difficulty-only approaches.
Abstract: Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints’ success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathmatics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
[1275] 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs
Mehdi Makni, Xiang Meng, Rahul Mazumder
Main category: cs.LG
TL;DR: 3BASiL-TM is an efficient one-shot post-training method for Sparse+Low-Rank decomposition of LLMs that reduces performance degradation through a novel 3-block ADMM approach and transformer-matching refinement.
Details
Motivation: Existing Sparse+Low-Rank decomposition methods for LLMs often suffer from substantial performance degradation compared to dense models, creating a need for more efficient and effective compression techniques.Method: Proposes 3BASiL-TM with two main components: 1) 3-Block ADMM method (3BASiL) to minimize layer-wise reconstruction error with convergence guarantees, and 2) Transformer-Matching refinement that jointly optimizes sparse and low-rank components across layers using a memory-efficient loss.
Result: Reduces WikiText2 perplexity gap relative to dense LLaMA-8B by over 30% under (2:4 Sparse + 64 LR) configuration, achieves over 2.5x faster compression runtime on A100 GPU compared to SOTA methods.
Conclusion: 3BASiL-TM provides an efficient and effective approach for LLM compression via Sparse+Low-Rank decomposition, with significant improvements in both performance preservation and compression speed.
Abstract: Sparse plus Low-Rank $(\mathbf{S} + \mathbf{LR})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $(\mathbf{W} \approx \mathbf{S} + \mathbf{LR})$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{LR})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design an efficient transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal as it can enhance any $(\mathbf{S} + \mathbf{LR})$ decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to SOTA $(\mathbf{S} + \mathbf{LR})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
[1276] Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie Dorr
Main category: cs.LG
TL;DR: VLMs for comic interpretation benchmark reveals hallucination issues, with guidance for future research to support blind/visually impaired users
Details
Motivation: To enable blind or visually impaired users to access comics/manga by developing systems that can interpret and describe comic content, addressing the current lack of such accessibility tools.Method: Created a preliminary benchmark of VLM performance on comic interpretation tasks, identified and categorized hallucinations that emerge during this process, and organized them into generalized object-hallucination taxonomies.
Result: The benchmark reveals significant hallucination issues in VLMs when interpreting comics, with identified patterns of errors that need to be addressed for reliable comic accessibility.
Conclusion: Future research should focus on hallucination mitigation and improved data curation for comic interpretation to develop effective accessibility systems for blind and visually impaired users.
Abstract: A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
[1277] Invariant-Stratified Propagation for Expressive Graph Neural Networks
Asela Hevapathige, Ahad N. Zehmakan, Asiri Wijesinghe, Saman Halgamuge
Main category: cs.LG
TL;DR: ISP framework introduces hierarchical stratification of nodes based on graph invariants to overcome GNN limitations in expressivity and structural heterogeneity, achieving better performance than 1-WL test and standard architectures.
Details
Motivation: GNNs have fundamental limitations in expressivity (constrained by 1-WL test) and capturing structural heterogeneity. Standard message-passing aggregates uniformly from neighbors, failing to capture how nodes occupy different structural positions within higher-order patterns. Existing methods for higher expressivity are computationally prohibitive and lack unified frameworks.Method: Introduces Invariant-Stratified Propagation (ISP) framework with ISP-WL (a novel WL variant) and ISPGNN (efficient neural implementation). Nodes are stratified according to graph invariants and processed in hierarchical strata that reveal structural distinctions invisible to 1-WL. Encodes hierarchical structural heterogeneity to quantify differences in nodes’ structural positions within higher-order patterns.
Result: Formal theoretical analysis establishes enhanced expressivity beyond 1-WL, convergence guarantees, and inherent resistance to oversmoothing. Extensive experiments across graph classification, node classification, and influence estimation demonstrate consistent improvements over standard architectures and state-of-the-art expressive baselines.
Conclusion: ISP framework successfully addresses fundamental GNN limitations by stratifying nodes based on invariants, enabling enhanced expressivity and better capture of structural heterogeneity while maintaining computational efficiency.
Abstract: Graph Neural Networks (GNNs) face fundamental limitations in expressivity and capturing structural heterogeneity. Standard message-passing architectures are constrained by the 1-dimensional Weisfeiler-Leman (1-WL) test, unable to distinguish graphs beyond degree sequences, and aggregate information uniformly from neighbors, failing to capture how nodes occupy different structural positions within higher-order patterns. While methods exist to achieve higher expressivity, they incur prohibitive computational costs and lack unified frameworks for flexibly encoding diverse structural properties. To address these limitations, we introduce Invariant-Stratified Propagation (ISP), a framework comprising both a novel WL variant (ISP-WL) and its efficient neural network implementation (ISPGNN). ISP stratifies nodes according to graph invariants, processing them in hierarchical strata that reveal structural distinctions invisible to 1-WL. Through hierarchical structural heterogeneity encoding, ISP quantifies differences in nodes’ structural positions within higher-order patterns, distinguishing interactions where participants occupy different roles from those with uniform participation. We provide formal theoretical analysis establishing enhanced expressivity beyond 1-WL, convergence guarantees, and inherent resistance to oversmoothing. Extensive experiments across graph classification, node classification, and influence estimation demonstrate consistent improvements over standard architectures and state-of-the-art expressive baselines.
[1278] One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers
Lennon J. Shikhman
Main category: cs.LG
TL;DR: Neural PDE solvers don’t learn true solution operators but boundary-condition-dependent mappings, limiting generalization to unseen boundary conditions despite apparent generalization in other aspects.
Details
Motivation: To challenge the common interpretation that neural PDE solvers learn boundary-agnostic solution operators, revealing that they actually learn boundary-condition-dependent mappings that fail to generalize beyond training boundary distributions.Method: Theoretical analysis framing operator learning as conditional risk minimization over boundary conditions, supported by controlled experiments on Poisson equation with various boundary-condition shifts and cross-distribution tests.
Result: Neural operators show sharp performance degradation under boundary-condition shifts, fail to generalize between distinct boundary ensembles, and converge to conditional expectations when boundary information is removed, demonstrating non-identifiability outside training boundary distribution support.
Conclusion: Current neural PDE solvers have fundamental limitations in boundary-condition generalization, requiring explicit boundary-aware modeling for true foundation models for PDEs.
Abstract: Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.
[1279] Learning from Synthetic Data Improves Multi-hop Reasoning
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger
Main category: cs.LG
TL;DR: RL fine-tuning on rule-generated synthetic data improves LLM reasoning capabilities without expensive human annotations or LLM-generated data
Details
Motivation: Current RL fine-tuning for LLMs requires expensive high-quality data from human annotations, frontier LLMs, or LLM-based verifiers, all of which have significant limitations in cost, quality, and scalabilityMethod: Use rule-generated synthetic data for RL fine-tuning of LLMs on multi-hop reasoning tasks, despite the synthetic data containing only fictional knowledge
Result: LLMs fine-tuned on synthetic data perform significantly better on real-world question-answering benchmarks, with synthetic data teaching LLMs to compose knowledge - a fundamental reasoning skill
Conclusion: Rule-generated synthetic reasoning data serves as a free and scalable resource to improve LLM reasoning capabilities, particularly for teaching knowledge composition skills
Abstract: Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge – a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
[1280] Tackling multiphysics problems via finite element-guided physics-informed operator learning
Yusuke Yamazaki, Reza Najian Asl, Markus Apel, Mayu Muramatsu, Shahed Rezaei
Main category: cs.LG
TL;DR: A physics-informed operator learning framework using finite element-guided loss for multiphysics PDEs on arbitrary domains, implemented in JAX and tested on thermo-mechanical problems.
Details
Motivation: To develop a unified, scalable approach for coupled multiphysics simulations that can handle arbitrary domains without relying on labeled simulation data, enabling discretization-independent predictions.Method: Finite element-guided physics-informed operator learning framework implemented with Folax (JAX-based). Uses weighted residual formulation based on finite element method. Tests Fourier neural operators (FNOs), deep operator networks (DeepONets), and a new implicit finite operator learning (iFOL) approach on nonlinear thermo-mechanical problems with varying microstructures and boundary conditions.
Result: FNOs yield highly accurate solution operators on regular domains, while iFOL offers efficient parametric operator learning for complex/irregular geometries. Monolithic training with single network is sufficient, and training sample quality strongly influences performance.
Conclusion: The framework demonstrates potential of physics-informed operator learning with finite element-based loss as a unified, scalable approach for coupled multiphysics simulations.
Abstract: This work presents a finite element-guided physics-informed operator learning framework for multiphysics problems with coupled partial differential equations (PDEs) on arbitrary domains. Implemented with Folax, a JAX-based operator-learning platform, the proposed framework learns a mapping from the input parameter space to the solution space with a weighted residual formulation based on the finite element method, enabling discretization-independent prediction beyond the training resolution without relying on labaled simulation data. The present framework for multiphysics problems is verified on nonlinear thermo-mechanical problems. Two- and three-dimensional representative volume elements with varying heterogeneous microstructures, and a close-to-reality industrial casting example under varying boundary conditions are investigated as the example problems. We investigate the potential of several neural operator backbones, including Fourier neural operators (FNOs), deep operator networks (DeepONets), and a newly proposed implicit finite operator learning (iFOL) approach based on conditional neural fields. The results demonstrate that FNOs yield highly accurate solution operators on regular domains, where the global topology can be efficiently learned in the spectral domain, and iFOL offers efficient parametric operator learning capabilities for complex and irregular geometries. Furthermore, studies on training strategies, network decomposition, and training sample quality reveal that a monolithic training strategy using a single network is sufficient for accurate predictions, while training sample quality strongly influences performance. Overall, the present approach highlights the potential of physics-informed operator learning with a finite element-based loss as a unified and scalable approach for coupled multiphysics simulations.
[1281] Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data
Thomas Rückstieß, Robin Vujanic
Main category: cs.LG
TL;DR: Origami: A transformer-based architecture for generating synthetic semi-structured data (like JSON) that handles nested objects, variable arrays, and sparsity without flattening.
Details
Motivation: Modern data systems increasingly use sparse, semi-structured formats like JSON, but existing synthetic data generation methods assume dense, fixed-schema tabular data. Applying tabular methods to semi-structured data requires flattening which scales poorly.Method: Autoregressive transformer architecture that tokenizes data records into sequences of key, value, and structural tokens, natively handling sparsity, mixed types, and hierarchical structure without flattening or imputation.
Result: Outperforms baselines (GAN, VAE, diffusion, autoregressive) on fidelity, utility, and detection metrics across nearly all settings while maintaining high privacy scores. Maintains high-fidelity synthesis on semi-structured datasets with up to 38% sparsity where baselines fail or degrade.
Conclusion: Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end, addressing a critical gap in synthetic data generation for modern data systems.
Abstract: Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems - from document databases, REST APIs to data lakes - which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening of nested data into wide, sparse tables which scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
[1282] Recursive Models for Long-Horizon Reasoning
Chenxiao Yang, Nathan Srebro, Zhiyuan Li
Main category: cs.LG
TL;DR: Recursive models that can invoke themselves to solve subtasks in isolated contexts overcome context limitations of standard autoregressive models for long-horizon reasoning tasks.
Details
Motivation: Modern language models are constrained by bounded context windows, creating a fundamental barrier to long-horizon reasoning tasks that require extensive combinatorial search or multi-step problem solving.Method: Proposes recursive models where the model can recursively invoke itself to solve subtasks in isolated contexts, with theoretical proof that any computable problem admits recursive decomposition requiring exponentially smaller active context than standard autoregressive models.
Result: A 3B parameter model trained to reason recursively significantly outperforms frontier LLMs on Boolean satisfiability, a task requiring long-horizon combinatorial search.
Conclusion: Recursive models provide a fundamental solution to context limitations in language models, enabling long-horizon reasoning through self-invocation and subtask decomposition, with theoretical optimality guarantees.
Abstract: Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts. We prove that any computable problem admits a recursive decomposition in which each subtask requires only exponentially smaller active context than standard autoregressive models; this strictly surpasses any context management approach confined to a single sequence, such as summarization. We further generalize our framework to modern agentic systems with arbitrary context processing and control flows, and prove that recursive models can achieve optimal power within this broader class. Experimentally, we train a 3B model to reason recursively and evaluate on Boolean satisfiability, a task requiring long-horizon combinatorial search, where it significantly outperforms frontier LLMs.
[1283] Randomized Kiring Believer for Parallel Bayesian Optimization with Regret Bounds
Shuhei Sugiura, Ichiro Takeuchi, Shion Takeno
Main category: cs.LG
TL;DR: Proposes randomized kriging believer method for parallel Bayesian optimization with theoretical regret guarantees and practical advantages.
Details
Motivation: Parallel Bayesian optimization methods for expensive black-box functions often have poor practical performance or lack theoretical guarantees, creating a need for methods that combine both practical efficiency and theoretical soundness.Method: Randomized kriging believer (KB) method based on well-known KB heuristic, featuring low computational complexity, simple implementation, versatility across BO methods, and applicability to asynchronous parallelization.
Result: The method achieves Bayesian expected regret guarantees and demonstrates effectiveness through experiments on synthetic functions, benchmark functions, and real-world data emulators.
Conclusion: Randomized KB provides a practical and theoretically sound approach to parallel Bayesian optimization that addresses limitations of existing methods.
Abstract: We consider an optimization problem of an expensive-to-evaluate black-box function, in which we can obtain noisy function values in parallel. For this problem, parallel Bayesian optimization (PBO) is a promising approach, which aims to optimize with fewer function evaluations by selecting a diverse input set for parallel evaluation. However, existing PBO methods suffer from poor practical performance or lack theoretical guarantees. In this study, we propose a PBO method, called randomized kriging believer (KB), based on a well-known KB heuristic and inheriting the advantages of the original KB: low computational complexity, a simple implementation, versatility across various BO methods, and applicability to asynchronous parallelization. Furthermore, we show that our randomized KB achieves Bayesian expected regret guarantees. We demonstrate the effectiveness of the proposed method through experiments on synthetic and benchmark functions and emulators of real-world data.
[1284] GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
Haofeng Xu, Junwei Su, Yukun Tian, Lansong Diao, Zhengping Qian, Chuan Wu
Main category: cs.LG
TL;DR: Asynchronous RL training can cause instability due to stale-aligned gradients; GAC stabilizes via gradient projection to control alignment with stale directions.
Details
Motivation: Asynchronous execution is crucial for scaling RL to large models (LLMs, AI agents), but naive async policy-gradient updates cause severe training instability with different dynamics than synchronized training.Method: Proposes GRADIENT ALIGNMENT CONTROL (GAC) - a dynamics-aware stabilization method that regulates async RL progress along stale-aligned directions via gradient projection, controlling the cosine similarity between current and stale gradients.
Result: GAC recovers stable, on-policy training dynamics and matches synchronized baselines even at high staleness; provides convergence guarantees under bounded staleness; addresses the stale-aligned gradient effect that amplifies correlated updates.
Conclusion: GAC effectively stabilizes asynchronous RL training by controlling gradient alignment, enabling scalable RL for large models while maintaining training stability comparable to synchronized methods.
Abstract: Asynchronous execution is essential for scaling reinforcement learning (RL) to modern large model workloads, including large language models and AI agents, but it can fundamentally alter RL optimization behavior. While prior work on asynchronous RL focuses on training throughput and distributional correction, we show that naively applying asynchrony to policy-gradient updates can induce qualitatively different training dynamics and lead to severe training instability. Through systematic empirical and theoretical analysis, we identify a key signature of this instability: asynchronous training exhibits persistently high cosine similarity between consecutive policy gradients, in contrast to the near-orthogonal updates observed under synchronized training. This stale-aligned gradient effect amplifies correlated updates and increases the risk of overshooting and divergence. Motivated by this observation, we propose GRADIENT ALIGNMENT CONTROL(GAC), a simple dynamics-aware stabilization method that regulates asynchronous RL progress along stale-aligned directions via gradient projection. We establish convergence guarantees under bounded staleness and demonstrate empirically that GAC recovers stable, on-policy training dynamics and matches synchronized baselines even at high staleness.
[1285] Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
Gautam Goel, Mahdi Soltanolkotabi, Peter Bartlett
Main category: cs.LG
TL;DR: Analysis of gradient descent training dynamics for softmax self-attention layers performing linear regression, showing geometric convergence to global optimum
Details
Motivation: To understand the optimization landscape and training dynamics of self-attention layers, which are fundamental components of transformer architectures, particularly for linear regression tasksMethod: Two-step analysis: 1) Show equivalence between self-attention regression and nonconvex matrix factorization in infinite-data limit, 2) Design “structure-aware” gradient descent with preconditioner, regularizer, and data-dependent spectral initialization
Result: The proposed optimization algorithm converges to globally optimal self-attention parameters at a geometric rate, avoiding spurious stationary points
Conclusion: Theoretical analysis reveals favorable optimization properties of self-attention layers and provides practical algorithmic improvements for training them
Abstract: We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel “structure-aware” variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the manifold of global minima with high probability.
[1286] Scalable Multi-Task Low-Rank Model Adaptation
Zichen Tian, Antoine Ledent, Qianru Sun
Main category: cs.LG
TL;DR: mtLoRA addresses catastrophic performance degradation in multi-task LoRA scaling by introducing spectral-aware regularization, block-level adaptation, and fine-grained routing to preserve shared knowledge while reducing conflicts.
Details
Motivation: Scaling multi-task LoRA to many tasks causes catastrophic performance drops due to parameter/representation misalignment. Existing solutions fail at scale due to a fundamental trade-off between regularization strength and feature discrimination needed for routing.Method: Three novel designs: 1) Spectral-aware regularization that selectively orthogonalizes low singular value components while preserving high-SV shared knowledge, 2) Block-level adaptation to reduce gradient conflicts, and 3) Fine-grained routing using dimension-specific weights for better expressive power.
Result: On four large-scale benchmarks (15-25 tasks) including vision (DOTA, iNat2018) and NLP (Dolly-15k, BBH), mtLoRA achieves 91.7%, 81.5%, 44.5% and 38.5% accuracy respectively, outperforming SOTA by 2.3% average while using 47% fewer parameters and 24% less training time.
Conclusion: mtLoRA successfully addresses the scalability limitations of multi-task LoRA by preserving shared knowledge while reducing conflicts, enabling efficient adaptation to many tasks without catastrophic performance degradation.
Abstract: Scaling multi-task low-rank adaptation (LoRA) to a large number of tasks induces catastrophic performance degradation, such as an accuracy drop from 88.2% to 2.0% on DOTA when scaling from 5 to 15 tasks. This failure is due to parameter and representation misalignment. We find that existing solutions, like regularization and dynamic routing, fail at scale because they are constrained by a fundamental trade-off: strengthening regularization to reduce inter-task conflict inadvertently suppresses the essential feature discrimination required for effective routing. In this work, we identify two root causes for this trade-off. First, uniform regularization disrupts inter-task knowledge sharing: shared underlying knowledge concentrates in high-SV components (89% alignment on Flanv2->BBH). Uniform regularization forces high-SV components to update in orthogonal directions, directly disrupting the shared knowledge. Second, Conflict Amplification: Applying LoRA at the component-level (e.g., W_q, W_v) amplifies gradient conflicts; we show block-level adaptation reduces this conflict by 76% with only 50% parameters. Based on these insights, we propose mtLoRA, a scalable solution with three novel designs: 1) Spectral-Aware Regularization to selectively orthogonalize low-SV components while preserving high-SV shared knowledge, 2) Block-Level Adaptation to mitigate conflict amplification and largely improve parameter efficiency, and 3) Fine-Grained Routing using dimension-specific weights for superior expressive power. On four large-scale (15-25 tasks) vision (DOTA and iNat2018) and NLP (Dolly-15k and BBH) benchmarks, mtLoRA achieves 91.7%, 81.5%, 44.5% and 38.5% accuracy on DOTA, iNat2018, Dolly-15k and BBH respectively, outperforming the state-of-the-art by 2.3% on average while using 47% fewer parameters and 24% less training time.
[1287] LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
Main category: cs.LG
TL;DR: LFPO is a likelihood-free policy optimization framework for diffusion LLMs that maps vector field flow matching to discrete token space, enabling precise gradient estimation without likelihood computation for improved alignment and faster inference.
Details
Motivation: Existing RLVR methods work well for autoregressive models but can't be directly applied to diffusion LLMs due to intractable exact likelihood computation, forcing reliance on high-variance approximations that hinder performance.Method: LFPO formulates alignment as geometric velocity rectification, directly optimizing denoising logits via contrastive updates to bypass likelihood approximation errors. It enforces consistency by predicting final solutions from intermediate steps to straighten probability flow.
Result: LFPO outperforms state-of-the-art baselines on code and reasoning benchmarks while accelerating inference by approximately 20% through reduced diffusion steps.
Conclusion: LFPO provides a native framework for diffusion LLM alignment that overcomes the likelihood computation barrier, enabling effective reinforcement learning with verifiable rewards while improving both quality and efficiency.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding the precise gradient estimation. Furthermore, LFPO enforce consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
[1288] Rate-Distortion Signatures of Generalization and Information Trade-offs
Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin
Main category: cs.LG
TL;DR: A rate-distortion framework analyzes accuracy-robustness trade-offs in vision systems using geometric signatures (slope β and curvature κ), comparing humans and deep networks under image perturbations.
Details
Motivation: Standard robustness metrics provide limited insight into how vision systems trade accuracy for robustness. The paper aims to develop a more interpretable framework to compare generalization behavior across biological and artificial vision systems.Method: Introduces a rate-distortion-theoretic framework that treats stimulus-response behavior as a communication channel, derives RD frontiers from confusion matrices, and characterizes systems with two geometric signatures (β and κ) capturing marginal cost and abruptness of accuracy-robustness trade-offs.
Result: Both humans and deep vision models follow lossy-compression principles but occupy different RD regions: humans show smoother, more flexible trade-offs, while modern deep networks operate in steeper, more brittle regimes even at matched accuracy. Robustness training induces systematic shifts in β/κ that don’t necessarily make networks more human-like.
Conclusion: RD geometry provides a compact, model-agnostic framework for comparing generalization behavior beyond standard accuracy metrics, revealing fundamental differences between biological and artificial vision systems.
Abstract: Generalization to novel visual conditions remains a central challenge for both human and machine vision, yet standard robustness metrics offer limited insight into how systems trade accuracy for robustness. We introduce a rate-distortion-theoretic framework that treats stimulus-response behavior as an effective communication channel, derives rate-distortion (RD) frontiers from confusion matrices, and summarizes each system with two interpretable geometric signatures - slope ($β$) and curvature ($κ$) - which capture the marginal cost and abruptness of accuracy-robustness trade-offs. Applying this framework to human psychophysics and 18 deep vision models under controlled image perturbations, we compare generalization geometry across model architectures and training regimes. We find that both biological and artificial systems follow a common lossy-compression principle but occupy systematically different regions of RD space. In particular, humans exhibit smoother, more flexible trade-offs, whereas modern deep networks operate in steeper and more brittle regimes even at matched accuracy. Across training regimes, robustness training induces systematic but dissociable shifts in beta/kappa, revealing cases where improved robustness or accuracy does not translate into more human-like generalization geometry. These results demonstrate that RD geometry provides a compact, model-agnostic lens for comparing generalization behavior across systems beyond standard accuracy-based metrics.
[1289] Jump Like A Squirrel: Optimized Execution Step Order for Anytime Random Forest Inference
Daniel Biebert, Christian Hakert, Kay Heider, Daniel Kuhse, Sebastian Buschjäger, Jian-Jia Chen
Main category: cs.LG
TL;DR: Anytime random forest algorithm that processes individual tree steps rather than whole trees, with optimized step ordering to maximize accuracy under time constraints.
Details
Motivation: Resource-constrained systems need efficient ML models that can provide predictions with increasing quality over time, even when execution is interrupted. Existing anytime random forest approaches work at tree granularity, but decision trees improve prediction quality with each step, suggesting finer granularity could be better.Method: Proposes anytime algorithms for decision trees and random forests at single-step granularity. Introduces three step ordering strategies: Optimal Order (exponential runtime, maximal mean accuracy), Forward Squirrel Order (greedily maximizes accuracy down trees), and Backward Squirrel Order (greedily maximizes accuracy up trees).
Result: Backward Squirrel Order performs ~94% as well as Optimal Order and ~99% as well as all other step orders, showing near-optimal performance with polynomial runtime.
Conclusion: Finer granularity anytime algorithms for random forests with optimized step ordering can provide high-quality predictions under time constraints, with Backward Squirrel Order offering near-optimal performance with practical runtime.
Abstract: Due to their efficiency and small size, decision trees and random forests are popular machine learning models used for classification on resource-constrained systems. In such systems, the available execution time for inference in a random forest might not be sufficient for a complete model execution. Ideally, the already gained prediction confidence should be retained. An anytime algorithm is designed to be able to be aborted anytime, while giving a result with an increasing quality over time. Previous approaches have realized random forests as anytime algorithms on the granularity of trees, stopping after some but not all trees of a forest have been executed. However, due to the way decision trees subdivide the sample space in every step, an increase in prediction quality is achieved with every additional step in one tree. In this paper, we realize decision trees and random forest as anytime algorithms on the granularity of single steps in trees. This approach opens a design space to define the step order in a forest, which has the potential to optimize the mean accuracy. We propose the Optimal Order, which finds a step order with a maximal mean accuracy in exponential runtime and the polynomial runtime heuristics Forward Squirrel Order and Backward Squirrel Order, which greedily maximize the accuracy for each additional step taken down and up the trees, respectively. Our evaluation shows, that the Backward Squirrel Order performs $\sim94%$ as well as the Optimal Order and $\sim99%$ as well as all other step orders.
[1290] SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
Xiangyang Zhu, Yuan Tian, Qi Jia, Kaiwei Zhang, Zicheng Zhang, Chunyi Li, Kaiyuan Ji, Dongrui Liu, Zijian Chen, Lu Sun, Renrui Zhang, Yan Teng, Jing Shao, Wei Sun, Xia Hu, Yu Qiao, Guangtao Zhai
Main category: cs.LG
TL;DR: SafeSci is a comprehensive framework for evaluating and enhancing safety of LLMs in scientific contexts, featuring a benchmark with 250K samples and training dataset with 1.5M samples.
Details
Motivation: Existing benchmarks for evaluating LLM safety in scientific domains have limited risk coverage and rely on subjective evaluation, creating a need for more comprehensive and objective safety assessment tools.Method: Introduces SafeSci framework with SafeSciBench (250K sample benchmark distinguishing safety knowledge vs. risk) and SafeSciTrain (1.5M sample dataset for safety enhancement), using objective metrics like deterministically answerable questions to reduce evaluation bias.
Result: Evaluation of 24 advanced LLMs revealed critical vulnerabilities and varying degrees of excessive refusal behaviors on safety-related issues; fine-tuning on SafeSciTrain significantly enhanced safety alignment of models.
Conclusion: Knowledge is a double-edged sword in scientific safety, and safety assessment should be context-dependent rather than binary; SafeSci provides both diagnostic tools and practical resources for building safer scientific AI systems.
Abstract: The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 24 advanced LLMs, revealing critical vulnerabilities in current models. We also observe that LLMs exhibit varying degrees of excessive refusal behaviors on safety-related issues. For safety enhancement, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Finally, we argue that knowledge is a double-edged sword, and determining the safety of a scientific question should depend on specific context, rather than universally categorizing it as safe or unsafe. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.
[1291] FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems
Minwoo Kim, Seunghyeok Shin, Hongki Lim
Main category: cs.LG
TL;DR: Training-free diffusion solver with hard measurement-space constraints and analytic step sizes for efficient inverse problems without inner loops
Details
Motivation: Existing training-free diffusion priors for inverse problems require many iterations and denoiser evaluations due to conservative step sizes and inner optimization loops, especially for nonlinear forward operatorsMethod: Proposes a solver with hard measurement-space feasibility constraints (closed-form projection) and analytic model-optimal step sizes, using ADMM-style splitting with projection and steepest-descent updates, requiring minimal Jacobian operations
Result: Achieves competitive PSNR/SSIM/LPIPS metrics with up to 19.5× speedup compared to existing methods, without needing hand-coded adjoints or inner MCMC loops
Conclusion: The method provides an efficient training-free solver for inverse problems with diffusion priors, significantly reducing computational cost while maintaining performance
Abstract: Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel$\rightarrow$latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5$\times$ speedup, without hand-coded adjoints or inner MCMC.
[1292] Boosting Entropy with Bell Box Quantization
Ningfeng Yang, Tor M. Aamodt
Main category: cs.LG
TL;DR: BBQ introduces a novel quantization-aware pre-training method that achieves information-theoretically optimal quantization while maintaining compute efficiency by mapping ITO data types to compute-efficient domains.
Details
Motivation: Existing quantization methods face a trade-off: compute-efficient data types (like integers) are not information-theoretically optimal, while ITO data types (like Quantile/NormalFloat) are not compute-efficient. There's a need for quantization that achieves both theoretical optimality and practical efficiency.Method: BBQ performs quantization in the input domain for information-theoretic optimality, then maps the output to a compute-efficient domain. The key insight is that since learning is domain-agnostic, the quantizer output doesn’t need to reside in the same domain as its input.
Result: BBQ outperforms prior state-of-the-art QAPT methods with perplexity reductions: up to 2 points for 4-bit models, 4 points for 3-bit, 5 points for 2-bit, and 18 points for 1-bit models, without sacrificing compute efficiency.
Conclusion: BBQ is the first quantization method that achieves both information-theoretic optimality and compute efficiency, enabling more efficient deployment of deep neural networks on edge devices while maintaining model quality.
Abstract: Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g. integers) that are not information theoretically optimal (ITO). On the other hand, existing ITO data types (e.g. Quantile/NormalFloat Quantization) are not compute-efficient. We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain, and returns its output in a compute-efficient domain where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods by a perplexity reduction of up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models. Code is available at https://github.com/1733116199/bbq.
[1293] Towards OOD Generalization in Dynamic Graphs via Causal Invariant Learning
Xinxun Zhang, Pengfei Jiao, Mengzhou Gao, Tianpeng Li, Xuan Guo
Main category: cs.LG
TL;DR: DyCIL: A dynamic graph neural network model that addresses out-of-distribution generalization in dynamic graphs through causal invariant learning of spatio-temporal patterns.
Details
Motivation: Most dynamic graph neural networks ignore out-of-distribution (OOD) shifts that commonly exist in dynamic graphs, making them vulnerable to distributional changes. The challenges include identifying invariant/variant patterns amid graph evolution, capturing intrinsic evolution rationale, and ensuring generalization across diverse OOD shifts with limited data observations.Method: Proposes Dynamic graph Causal Invariant Learning (DyCIL) with three components: 1) dynamic causal subgraph generator to identify causal dynamic subgraphs, 2) causal-aware spatio-temporal attention module to extract evolution rationale from invariant patterns, and 3) adaptive environment generator to capture underlying distributional shift dynamics.
Result: Extensive experiments on both real-world and synthetic dynamic graph datasets demonstrate superiority over state-of-the-art baselines in handling OOD shifts.
Conclusion: DyCIL effectively addresses OOD generalization challenges in dynamic graphs through causal invariant learning, outperforming existing methods by simultaneously tackling all three key challenges.
Abstract: Although dynamic graph neural networks (DyGNNs) have demonstrated promising capabilities, most existing methods ignore out-of-distribution (OOD) shifts that commonly exist in dynamic graphs. Dynamic graph OOD generalization is non-trivial due to the following challenges: 1) Identifying invariant and variant patterns amid complex graph evolution, 2) Capturing the intrinsic evolution rationale from these patterns, and 3) Ensuring model generalization across diverse OOD shifts despite limited data distribution observations. Although several attempts have been made to tackle these challenges, none has successfully addressed all three simultaneously, and they face various limitations in complex OOD scenarios. To solve these issues, we propose a Dynamic graph Causal Invariant Learning (DyCIL) model for OOD generalization via exploiting invariant spatio-temporal patterns from a causal view. Specifically, we first develop a dynamic causal subgraph generator to identify causal dynamic subgraphs explicitly. Next, we design a causal-aware spatio-temporal attention module to extract the intrinsic evolution rationale behind invariant patterns. Finally, we further introduce an adaptive environment generator to capture the underlying dynamics of distributional shifts. Extensive experiments on both real-world and synthetic dynamic graph datasets demonstrate the superiority of our model over state-of-the-art baselines in handling OOD shifts.
[1294] DeLo: Dual Decomposed Low-Rank Experts Collaboration for Continual Missing Modality Learning
Xiwei Liu, Yulong Li, Feilong Tang, Imran Razzak
Main category: cs.LG
TL;DR: DeLo introduces a dual-decomposed low-rank expert architecture for Continual Missing Modality Learning (CMML) that addresses modality interference through decomposed LoRA experts and prevents catastrophic forgetting with task partitioning.
Details
Motivation: Real-world multimodal scenarios face challenges of learning from sequential data streams while handling frequent modality incompleteness. Existing prompt tuning methods suffer from cross-task interference, and naive LoRA with modality-shared modules experiences modality interference from competing gradients.Method: Proposes DeLo framework with dual-decomposed low-rank expert architecture: uses decomposed LoRA experts that dynamically compose LoRA update matrices from disentangled modality-specific factor pools. Includes task-partitioned framework to prevent catastrophic forgetting, Cross-Modal Guided Routing for incomplete data, and Task-Key Memory for efficient inference.
Result: Extensive experiments on established CMML benchmarks demonstrate significant outperformance over state-of-the-art approaches.
Conclusion: Highlights the value of principled, architecturally-aware LoRA design for real-world multimodal challenges, particularly in handling modality incompleteness and continual learning scenarios.
Abstract: Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning (CMML). However, existing works on CMML have predominantly relied on prompt tuning, a technique that struggles with this task due to cross-task interference between its learnable prompts in their shared embedding space. A naive application of Low-Rank Adaptation (LoRA) with modality-shared module will also suffer modality interference from competing gradients. To this end, we propose DeLo, the first framework to leverage a novel dual-decomposed low-rank expert architecture for CMML. Specifically, this architecture resolves modality interference through decomposed LoRA expert, dynamically composing LoRA update matrix with rank-one factors from disentangled modality-specific factor pools. Embedded within a task-partitioned framework that structurally prevents catastrophic forgetting, this expert system is supported by two key mechanisms: a Cross-Modal Guided Routing strategy to handle incomplete data and a Task-Key Memory for efficient, task-agnostic inference. Extensive experiments on established CMML benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches. This highlights the value of a principled, architecturally-aware LoRA design for real-world multimodal challenges.
[1295] Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling
Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges
Main category: cs.LG
TL;DR: ML framework using Generative Flow Networks for intelligent path sampling in ray tracing, achieving 10-1000× speedup while maintaining accuracy.
Details
Motivation: Ray tracing has exponential computational complexity that limits large-scale/real-time applications, requiring heuristics that sacrifice accuracy. Need ML approach to intelligently sample paths instead of exhaustive search.Method: Generative Flow Networks with three key components: experience replay buffer for rare valid paths, uniform exploratory policy for generalization, and physics-based action masking to filter impossible paths.
Result: Achieves 10× faster on GPU and 1000× faster on CPU compared to exhaustive search while maintaining high coverage accuracy and discovering complex propagation paths.
Conclusion: Proposed ML framework overcomes computational bottlenecks in ray tracing through intelligent sampling, enabling efficient path discovery in complex environments.
Abstract: Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the power of the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics to reduce the number of path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a comprehensive machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying such generative models to this domain presents significant challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key architectural components. First, we implement an \emph{experience replay buffer} to capture and retain rare valid paths. Second, we adopt a uniform exploratory policy to improve generalization and prevent the model from overfitting to simple geometries. Third, we apply a physics-based action masking strategy that filters out physically impossible paths before the model even considers them. As demonstrated in our experimental validation, the proposed model achieves substantial speedups over exhaustive search – up to $10\times$ faster on GPU and $1000\times$ faster on CPU – while maintaining high coverage accuracy and successfully uncovering complex propagation paths. The complete source code, tests, and tutorial are available at https://github.com/jeertmans/sampling-paths.
[1296] FreeGNN: Continual Source-Free Graph Neural Network Adaptation for Renewable Energy Forecasting
Abderaouf Bahi, Amel Ourici, Ibtissem Gasmi, Aida Derrablia, Warda Deghmane, Mohamed Amine Ferrag
Main category: cs.LG
TL;DR: FreeGNN: A continual source-free graph domain adaptation framework for renewable energy forecasting that adapts to unseen sites without requiring source data or target labels.
Details
Motivation: Traditional supervised models for renewable energy forecasting require labeled data from target sites, which may be unavailable due to privacy, cost, or logistical constraints. There's a need for adaptive forecasting that can work on unseen sites without access to source data or target labels.Method: Integrates spatio-temporal Graph Neural Network (GNN) backbone with teacher-student strategy, memory replay mechanism to mitigate catastrophic forgetting, graph-based regularization to preserve spatial correlations, and drift-aware weighting scheme to dynamically adjust adaptation strength during streaming updates.
Result: Achieved MAE of 5.237 and RMSE of 7.123 on GEFCom dataset, MAE of 1.107 and RMSE of 1.512 on Solar PV dataset, and MAE of 0.382 and RMSE of 0.523 on Wind SCADA dataset. Each component contributed significantly to overall performance.
Conclusion: FreeGNN demonstrates accurate and robust forecasting in source-free, continual learning settings, highlighting its potential for real-world deployment in adaptive renewable energy systems.
Abstract: Accurate forecasting of renewable energy generation is essential for efficient grid management and sustainable power planning. However, traditional supervised models often require access to labeled data from the target site, which may be unavailable due to privacy, cost, or logistical constraints. In this work, we propose FreeGNN, a Continual Source-Free Graph Domain Adaptation framework that enables adaptive forecasting on unseen renewable energy sites without requiring source data or target labels. Our approach integrates a spatio-temporal Graph Neural Network (GNN) backbone with a teacher–student strategy, a memory replay mechanism to mitigate catastrophic forgetting, graph-based regularization to preserve spatial correlations, and a drift-aware weighting scheme to dynamically adjust adaptation strength during streaming updates. This combination allows the model to continuously adapt to non-stationary environmental conditions while maintaining robustness and stability. We conduct extensive experiments on three real-world datasets: GEFCom2012, Solar PV, and Wind SCADA, encompassing multiple sites, temporal resolutions, and meteorological features. The ablation study confirms that each component memory, graph regularization, drift-aware adaptation, and teacher–student strategy contributes significantly to overall performance. The experiments show that FreeGNN achieves an MAE of 5.237 and an RMSE of 7.123 on the GEFCom dataset, an MAE of 1.107 and an RMSE of 1.512 on the Solar PV dataset, and an MAE of 0.382 and an RMSE of 0.523 on the Wind SCADA dataset. These results demonstrate its ability to achieve accurate and robust forecasts in a source-free, continual learning setting, highlighting its potential for real-world deployment in adaptive renewable energy systems. For reproducibility, implementation details are available at: https://github.com/AraoufBh/FreeGNN.
[1297] A Practical Guide to Streaming Continual Learning
Andrea Cossu, Federico Giannini, Giacomo Ziffer, Alessio Bernardo, Alexander Gepperth, Emanuele Della Valle, Barbara Hammer, Davide Bacciu
Main category: cs.LG
TL;DR: Streaming Continual Learning (SCL) is proposed as a unified paradigm that combines the rapid adaptation of Streaming Machine Learning with the knowledge retention of Continual Learning to handle real-world non-stationary data streams.
Details
Motivation: The paper aims to bridge the gap between Continual Learning (CL) and Streaming Machine Learning (SML) communities, which address complementary challenges but work in isolation. CL focuses on retaining past knowledge when learning new tasks, while SML focuses on rapid adaptation to concept drifts. Real-world problems often require both capabilities simultaneously.Method: The paper proposes Streaming Continual Learning (SCL) as an emerging paradigm that unifies CL and SML approaches. It provides a conceptual framework and conducts experiments to demonstrate the limitations of using CL or SML alone, showing how SCL can address both rapid adaptation and knowledge retention.
Result: Experimental results show that CL and SML alone struggle to achieve both rapid adaptation and knowledge retention. The paper demonstrates the need for SCL by highlighting the complementary nature of these approaches and how their combination can better handle real-world streaming scenarios.
Conclusion: SCL provides a unifying solution that can connect CL and SML communities and foster the development of hybrid approaches capable of both rapid adaptation to new information (like SML) and retention of previous knowledge (like CL), making it essential for real-world applications with non-stationary data streams.
Abstract: Continual Learning (CL) and Streaming Machine Learning (SML) study the ability of agents to learn from a stream of non-stationary data. Despite sharing some similarities, they address different and complementary challenges. While SML focuses on rapid adaptation after changes (concept drifts), CL aims to retain past knowledge when learning new tasks. After a brief introduction to CL and SML, we discuss Streaming Continual Learning (SCL), an emerging paradigm providing a unifying solution to real-world problems, which may require both SML and CL abilities. We claim that SCL can i) connect the CL and SML communities, motivating their work towards the same goal, and ii) foster the design of hybrid approaches that can quickly adapt to new information (as in SML) without forgetting previous knowledge (as in CL). We conclude the paper with a motivating example and a set of experiments, highlighting the need for SCL by showing how CL and SML alone struggle in achieving rapid adaptation and knowledge retention.
[1298] Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Mingrui Xu, Weiqing Liu, Jiang Bian
Main category: cs.LG
TL;DR: Gome introduces gradient-based optimization for LLM agents in machine learning engineering, outperforming traditional tree search methods as LLM reasoning capabilities improve.
Details
Motivation: Current LLM-based agents for machine learning engineering rely on inefficient tree search methods, which become increasingly problematic as LLM reasoning capabilities improve. There's a need for more efficient optimization paradigms analogous to gradient-based methods in traditional optimization.Method: Gome operationalizes gradient-based optimization by mapping structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. It uses a closed-world protocol to isolate architectural effects from external knowledge.
Result: Gome achieves state-of-the-art 35.1% any-medal rate on MLE-Bench with restricted 12-hour budget on single V100 GPU. Scaling experiments show gradient-based optimization outperforms tree search as LLM reasoning capability strengthens, with gap widening at frontier-tier models.
Conclusion: Gradient-based optimization is becoming increasingly favorable over tree search for LLM-based MLE agents as reasoning-oriented LLMs advance rapidly, representing a paradigm shift in how LLM agents approach optimization problems.
Abstract: LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce \textsc{Gome}, an MLE agent that operationalizes gradient-based optimization. \textsc{Gome} maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, \textsc{Gome} achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces.
[1299] Streaming Continual Learning for Unified Adaptive Intelligence in Dynamic Environments
Federico Giannini, Giacomo Ziffer, Andrea Cossu, Vincenzo Lomonaco
Main category: cs.LG
TL;DR: Streaming Continual Learning (SCL) is proposed as a unified framework combining Continual Learning and Streaming Machine Learning to handle dynamic data streams while preventing catastrophic forgetting.
Details
Motivation: The paper addresses the challenge of building predictive models in dynamic environments with continuously changing data streams, where both adaptation to new data and retention of previous knowledge are crucial.Method: The authors propose a unified setting called Streaming Continual Learning (SCL) that integrates techniques from both Continual Learning (CL) and Streaming Machine Learning (SML), creating a framework that can quickly adapt to non-stationary data streams without forgetting previous knowledge.
Result: The paper presents a conceptual framework that unifies CL and SML approaches, highlighting key features of SCL and emphasizing the importance of bridging these two research communities for advancing intelligent systems.
Conclusion: Streaming Continual Learning extends both CL and SML fields by providing a comprehensive approach to handle dynamic data streams while preventing catastrophic forgetting, and bridging these communities is essential for advancing intelligent systems.
Abstract: Developing effective predictive models becomes challenging in dynamic environments that continuously produce data and constantly change. Continual Learning (CL) and Streaming Machine Learning (SML) are two research areas that tackle this arduous task. We put forward a unified setting that harnesses the benefits of both CL and SML: their ability to quickly adapt to non-stationary data streams without forgetting previous knowledge. We refer to this setting as Streaming Continual Learning (SCL). SCL does not replace either CL or SML. Instead, it extends the techniques and approaches considered by both fields. We start by briefly describing CL and SML and unifying the languages of the two frameworks. We then present the key features of SCL. We finally highlight the importance of bridging the two communities to advance the field of intelligent systems.
[1300] DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks
Gökdeniz Gülmez
Main category: cs.LG
TL;DR: DynaMoE introduces dynamic token-level expert activation and layer-wise adaptive capacity allocation for Mixture-of-Experts architectures, relaxing rigid design assumptions of fixed Top-K routing and uniform expert allocation.
Details
Motivation: Standard MoE architectures use fixed Top-K routing (activating exactly K experts per token) and uniform expert allocation across layers, which may not be optimal for different tasks and model scales. The paper aims to create a more flexible MoE framework that adapts to input complexity and optimizes capacity distribution across network depth.Method: DynaMoE introduces two key innovations: 1) Dynamic token-level expert activation where the number of active experts varies based on input complexity, and 2) Six distinct scheduling strategies for layer-wise expert capacity allocation (descending, ascending, pyramid, wave patterns, etc.). The framework includes theoretical analysis of expressivity gains and computational efficiency bounds.
Result: Experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification) and Recycling-the-Web (language modeling) show DynaMoE achieves superior parameter efficiency compared to static baselines. Optimal expert schedules are task- and scale-dependent: descending schedules work best for image classification, while language modeling optimal schedules vary by model size (descending for Tiny, ascending for Small, uniform for Medium). Dynamic routing also reduces gradient variance and improves convergence stability.
Conclusion: DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design by demonstrating that optimal expert allocation strategies depend on both task characteristics and model scale.
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism where the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (concentrating capacity in early layers) outperform uniform baselines on image classification. For language modeling, optimal schedules vary by model size, descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.
[1301] Decentralized Federated Learning by Partial Message Exchange
Shan Sha, Shenglong Zhou, Xin Wang, Lingchen Kong, Geoffrey Ye Li
Main category: cs.LG
TL;DR: PaME: A decentralized federated learning algorithm using partial message exchange with random sparse coordinate selection to reduce communication costs while preserving privacy and accuracy.
Details
Motivation: Address fundamental challenges in decentralized federated learning including data heterogeneity, restrictive theoretical assumptions, and degraded convergence when applying communication/privacy techniques.Method: Develops PaME algorithm where only randomly selected sparse coordinates are exchanged between neighbor nodes, enabling communication reduction while maintaining privacy and accuracy.
Result: Achieves linear convergence rate under mild assumptions (locally Lipschitz continuous gradient, doubly stochastic communication matrix), outperforms existing decentralized learning algorithms in experiments.
Conclusion: PaME effectively addresses data heterogeneity, reduces communication costs, preserves privacy without sacrificing accuracy, and operates under fewer restrictive assumptions than existing methods.
Abstract: Decentralized federated learning (DFL) has emerged as a transformative server-free paradigm that enables collaborative learning over large-scale heterogeneous networks. However, it continues to face fundamental challenges, including data heterogeneity, restrictive assumptions for theoretical analysis, and degraded convergence when standard communication- or privacyenhancing techniques are applied. To overcome these drawbacks, this paper develops a novel algorithm, PaME (DFL by Partial Message Exchange). The central principle is to allow only randomly selected sparse coordinates to be exchanged between two neighbor nodes. Consequently, PaME achieves substantial reductions in communication costs while still preserving a high level of privacy, without sacrificing accuracy. Moreover, grounded in rigorous analysis, the algorithm is shown to converge at a linear rate under the gradient to be locally Lipschitz continuous and the communication matrix to be doubly stochastic. These two mild assumptions not only dispense with many restrictive conditions commonly imposed by existing DFL methods but also enables PaME to effectively address data heterogeneity. Furthermore, comprehensive numerical experiments demonstrate its superior performance compared with several representative decentralized learning algorithms.
[1302] CA-AFP: Cluster-Aware Adaptive Federated Pruning
Om Govind Jha, Harsh Shukla, Haroon R. Lone
Main category: cs.LG
TL;DR: CA-AFP is a federated learning framework that combines clustering and adaptive pruning to address both statistical and system heterogeneity, improving accuracy, fairness, and communication efficiency.
Details
Motivation: Federated Learning faces challenges from statistical heterogeneity (non-IID data across clients) and system heterogeneity (resource-constrained devices). Existing approaches like clustering and pruning are typically studied separately, but real-world deployments require addressing both challenges simultaneously.Method: CA-AFP performs cluster-specific model pruning where clients are grouped into clusters, and separate models for each cluster are adaptively pruned during training. Key innovations: (1) cluster-aware importance scoring combining weight magnitude, intra-cluster coherence, and gradient consistency to identify parameters for pruning, and (2) iterative pruning schedule with weight regrowth for model self-healing.
Result: Evaluated on UCI HAR and WISDM human activity recognition benchmarks under natural user-based federated partitions. CA-AFP achieves better balance between predictive accuracy, inter-client fairness, and communication efficiency compared to pruning-based baselines. It improves accuracy, reduces performance disparity across clients, requires less communication than dense clustering methods, and shows robustness to different Non-IID data levels.
Conclusion: CA-AFP provides a unified framework addressing both statistical and system heterogeneity in FL through cluster-aware adaptive pruning, offering practical insights for designing efficient and adaptive FL systems.
Abstract: Federated Learning (FL) faces major challenges in real-world deployments due to statistical heterogeneity across clients and system heterogeneity arising from resource-constrained devices. While clustering-based approaches mitigate statistical heterogeneity and pruning techniques improve memory and communication efficiency, these strategies are typically studied in isolation. We propose CA-AFP, a unified framework that jointly addresses both challenges by performing cluster-specific model pruning. In CA-AFP, clients are first grouped into clusters, and a separate model for each cluster is adaptively pruned during training. The framework introduces two key innovations: (1) a cluster-aware importance scoring mechanism that combines weight magnitude, intra-cluster coherence, and gradient consistency to identify parameters for pruning, and (2) an iterative pruning schedule that progressively removes parameters while enabling model self-healing through weight regrowth. We evaluate CA-AFP on two widely used human activity recognition benchmarks, UCI HAR and WISDM, under natural user-based federated partitions. Experimental results demonstrate that CA-AFP achieves a favorable balance between predictive accuracy, inter-client fairness, and communication efficiency. Compared to pruning-based baselines, CA-AFP consistently improves accuracy and lower performance disparity across clients with limited fine-tuning, while requiring substantially less communication than dense clustering-based methods. It also shows robustness to different Non-IID levels of data. Finally, ablation studies analyze the impact of clustering, pruning schedules and scoring mechanism offering practical insights into the design of efficient and adaptive FL systems.
[1303] Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
Naoki Shitanda, Motoki Omura, Tatsuya Harada, Takayuki Osa
Main category: cs.LG
TL;DR: Coupled Policy Optimization (CPO) regulates diversity in ensemble policy gradient methods using KL constraints to balance exploration and training stability, achieving superior performance on challenging tasks including dexterous manipulation.
Details
Motivation: While ensemble-based policy gradient methods use multiple policies for diverse exploration, excessive exploration can reduce quality and compromise training stability. The paper aims to theoretically analyze and regulate inter-policy diversity to enhance learning efficiency.Method: Proposes Coupled Policy Optimization (CPO) which regulates diversity through KL constraints between policies in an ensemble. The method enables effective exploration while maintaining training stability by controlling the degree of policy divergence.
Result: Outperforms strong baselines (SAPG, PBT, PPO) across multiple tasks including challenging dexterous manipulation in terms of both sample efficiency and final performance. Analysis shows follower policies naturally distribute around the leader, demonstrating structured exploratory behavior.
Conclusion: Diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. CPO effectively balances exploration breadth with training stability through KL constraints.
Abstract: Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper-cpo/ .
[1304] Discrete World Models via Regularization
Davide Bizzaro, Luciano Serafini
Main category: cs.LG
TL;DR: DWMR is a novel unsupervised Boolean world-model learning method that uses regularization instead of reconstruction or contrastive learning to learn compact latent representations of environment dynamics.
Details
Motivation: Existing world models use decoder-based reconstruction or contrastive/reward signals to keep latents informative. Boolean state representations are valuable for search heuristics and symbolic reasoning/planning, but current methods have limitations in learning accurate unsupervised Boolean world models.Method: Introduces Discrete World Models via Regularization (DWMR) - a reconstruction-free and contrastive-free method using a novel world-modeling loss with specialized regularizers. Regularizers maximize entropy and independence of representation bits through variance, correlation, and coskewness penalties while enforcing locality prior for sparse action changes. Includes a novel training scheme improving robustness to discrete roll-outs.
Result: Experiments on two benchmarks with underlying combinatorial structure show DWMR learns more accurate representations and transitions than reconstruction-based alternatives. DWMR can also be paired with an auxiliary reconstruction decoder for additional gains.
Conclusion: DWMR provides an effective reconstruction-free approach for unsupervised Boolean world-model learning that outperforms existing methods and can be enhanced with auxiliary reconstruction when needed.
Abstract: World models aim to capture the states and dynamics of an environment in a compact latent space. Moreover, using Boolean state representations is particularly useful for search heuristics and symbolic reasoning and planning. Existing approaches keep latents informative via decoder-based reconstruction, or instead via contrastive or reward signals. In this work, we introduce Discrete World Models via Regularization (DWMR): a reconstruction-free and contrastive-free method for unsupervised Boolean world-model learning. In particular, we introduce a novel world-modeling loss that couples latent prediction with specialized regularizers. Such regularizers maximize the entropy and independence of the representation bits through variance, correlation, and coskewness penalties, while simultaneously enforcing a locality prior for sparse action changes. To enable effective optimization, we also introduce a novel training scheme improving robustness to discrete roll-outs. Experiments on two benchmarks with underlying combinatorial structure show that DWMR learns more accurate representations and transitions than reconstruction-based alternatives. Finally, DWMR can also be paired with an auxiliary reconstruction decoder, and this combination yields additional gains.
[1305] Practical Deep Heteroskedastic Regression
Mikkel Jordahn, Jonas Vestergaard Jensen, James Harrison, Michael Riis Andersen, Mikkel N. Schmidt
Main category: cs.LG
TL;DR: Post-hoc variance fitting across intermediate layers of pretrained networks improves uncertainty quantification in heteroskedastic regression without compromising mean prediction accuracy.
Details
Motivation: Deep heteroskedastic regression models face practical challenges including optimization difficulties, representation collapse, and variance overfitting when trying to balance uncertainty quantification with mean prediction accuracy.Method: Proposes a simple post-hoc procedure that fits a variance model across intermediate layers of a pretrained network using a hold-out dataset, addressing optimization and overfitting challenges.
Result: Achieves on-par or state-of-the-art uncertainty quantification on molecular graph datasets without compromising mean prediction accuracy, remaining computationally cheap at prediction time.
Conclusion: The post-hoc variance fitting approach effectively addresses practical challenges in deep heteroskedastic regression, providing improved uncertainty quantification while maintaining prediction accuracy.
Abstract: Uncertainty quantification (UQ) in deep learning regression is of wide interest, as it supports critical applications including sequential decision making and risk-sensitive tasks. In heteroskedastic regression, where the uncertainty of the target depends on the input, a common approach is to train a neural network that parameterizes the mean and the variance of the predictive distribution. Still, training deep heteroskedastic regression models poses practical challenges in the trade-off between uncertainty quantification and mean prediction, such as optimization difficulties, representation collapse, and variance overfitting. In this work we identify previously undiscussed fallacies and propose a simple and efficient procedure that addresses these challenges jointly by post-hoc fitting a variance model across the intermediate layers of a pretrained network on a hold-out dataset. We demonstrate that our method achieves on-par or state-of-the-art uncertainty quantification on several molecular graph datasets, without compromising mean prediction accuracy and remaining cheap to use at prediction time.
[1306] Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance, Biological Coherence, and Cross-Model Convergence
Ihor Kendiukhov
Main category: cs.LG
TL;DR: Causal circuit tracing method analyzes feature interactions in biological foundation models using sparse autoencoders, revealing conserved biological patterns across models.
Details
Motivation: To understand causal feature-to-feature interactions across network depth in biological foundation models, which remain unknown despite sparse autoencoders decomposing activations into interpretable features.Method: Introduced causal circuit tracing by ablating SAE features and measuring downstream responses, applied to Geneformer V2-316M and scGPT whole-human across four conditions (96,892 edges, 80,191 forward passes).
Result: Both models show ~53% biological coherence and 65-89% inhibitory dominance, invariant to architecture and cell type. scGPT produces stronger effects (mean absolute d = 1.40 vs. 1.05) with more balanced dynamics. Cross-model consensus yields 1,142 conserved domain pairs (10.6x enrichment, p < 0.001). Disease-associated domains are 3.59x more likely to be consensus. Gene-level CRISPRi validation shows 56.4% directional accuracy.
Conclusion: The method reveals conserved biological patterns across models, with disease-associated domains more likely to be consensus features, and validation confirms co-expression rather than causal encoding.
Abstract: Motivation: Sparse autoencoders (SAEs) decompose foundation model activations into interpretable features, but causal feature-to-feature interactions across network depth remain unknown for biological foundation models. Results: We introduce causal circuit tracing by ablating SAE features and measuring downstream responses, and apply it to Geneformer V2-316M and scGPT whole-human across four conditions (96,892 edges, 80,191 forward passes). Both models show approximately 53 percent biological coherence and 65 to 89 percent inhibitory dominance, invariant to architecture and cell type. scGPT produces stronger effects (mean absolute d = 1.40 vs. 1.05) with more balanced dynamics. Cross-model consensus yields 1,142 conserved domain pairs (10.6x enrichment, p < 0.001). Disease-associated domains are 3.59x more likely to be consensus. Gene-level CRISPRi validation shows 56.4 percent directional accuracy, confirming co-expression rather than causal encoding.
[1307] Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning
Zichen Tian, Yaoyao Liu, Qianru Sun
Main category: cs.LG
TL;DR: MetaPEFT introduces adaptive scalers for parameter-efficient fine-tuning that dynamically adjust module insertion, layer selection, and learning rates to improve performance on remote sensing images, particularly for tail classes.
Details
Motivation: Training large models from scratch for domain-specific applications like remote sensing is challenging due to data limitations and long-tailed distributions. Existing PEFT methods use fixed hyperparameters that hinder performance on RS images, which are highly sensitive to these settings.Method: MetaPEFT incorporates adaptive scalers that dynamically adjust three key PEFT factors: module insertion positions, layer selection depth, and module-wise learning rates. These scalers collectively control the influence of PEFT modules across the network during fine-tuning.
Result: Extensive experiments on three transfer-learning scenarios and five datasets in both RS and natural image domains show MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requires minimal trainable parameters, and significantly improves tail-class accuracy.
Conclusion: MetaPEFT effectively addresses the sensitivity of PEFT hyperparameters on RS images through adaptive scaling mechanisms, enabling better fine-tuning performance with parameter efficiency and improved handling of long-tailed distributions.
Abstract: Training large foundation models from scratch for domain-specific applications is almost impossible due to data limits and long-tailed distributions – taking remote sensing (RS) as an example. Fine-tuning natural image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters – such as intra-layer positions, layer depth, and scaling factors, can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets in both RS and natural image domains. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small amount of trainable parameters and improving tail-class accuracy significantly.
[1308] Modular Memory is the Key to Continual Learning Agents
Vaggelis Dorovatas, Malte Schwerin, Andrew D. Bagdanov, Lucas Caccia, Antonio Carta, Laurent Charlin, Barbara Hammer, Tyler L. Hayes, Timm Hess, Christopher Kanan, Dhireesha Kudithipudi, Xialei Liu, Vincenzo Lomonaco, Jorge Mendez-Mendez, Darshan Patil, Ameya Prabhu, Elisa Ricci, Tinne Tuytelaars, Gido M. van de Ven, Liyuan Wang, Joost van de Weijer, Jonghyun Choi, Martin Mundt, Rahaf Aljundi
Main category: cs.LG
TL;DR: Proposes combining in-weight learning (IWL) and in-context learning (ICL) through modular memory architectures for continual learning at scale, addressing limitations of current foundation models in continuous operation and adaptation.
Details
Motivation: Foundation models excel in many domains but lack continuous operation, experience accumulation, and personalization capabilities. Traditional continual learning approaches using in-weight learning suffer from catastrophic forgetting, while in-context learning offers new opportunities for rapid adaptation without parameter updates.Method: Proposes a conceptual framework for modular memory-centric architectures that combine ICL for rapid adaptation and knowledge accumulation with IWL for stable updates to model capabilities. The approach leverages the complementary strengths of both learning paradigms.
Result: Presents a theoretical framework and roadmap for building continually learning agents that can adapt at scale, though specific experimental results are not provided in the abstract.
Conclusion: The combination of IWL and ICL through modular memory architectures represents a promising direction for achieving continual adaptation at scale, addressing fundamental limitations of current foundation models in continuous learning and personalization.
Abstract: Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model’s parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.
[1309] DGNet: Discrete Green Networks for Data-Efficient Learning of Spatiotemporal PDEs
Yingjie Tan, Quanming Yao, Yaqing Wang
Main category: cs.LG
TL;DR: DGNet proposes a discrete Green network that incorporates Green’s function theory as structural inductive bias for data-efficient learning of spatiotemporal PDEs, achieving state-of-the-art accuracy with minimal training data and robust zero-shot generalization to unseen source terms.
Details
Motivation: Existing neural PDE solvers require large amounts of expensive high-fidelity training data and suffer from poor data efficiency, especially when generalizing to unseen source terms. This inefficiency stems from not explicitly encoding the strong structural inductive biases inherent in PDE dynamics, forcing models to learn fundamental physical structure from data alone.Method: DGNet transforms Green’s function theory into a graph-based discrete formulation and embeds the superposition principle into a hybrid physics-neural architecture. This approach explicitly encodes physical priors, reducing the need to learn fundamental structure from data and improving sample efficiency.
Result: DGNet consistently achieves state-of-the-art accuracy across diverse spatiotemporal PDE scenarios using only tens of training trajectories. It exhibits robust zero-shot generalization to unseen source terms, demonstrating superior data efficiency compared to existing approaches.
Conclusion: Incorporating Green’s function theory as structural inductive bias provides a principled approach to data-efficient PDE learning. DGNet’s hybrid physics-neural architecture successfully encodes fundamental physical priors, enabling accurate PDE solving with minimal training data and strong generalization capabilities.
Abstract: Spatiotemporal partial differential equations (PDEs) underpin a wide range of scientific and engineering applications. Neural PDE solvers offer a promising alternative to classical numerical methods. However, existing approaches typically require large numbers of training trajectories, while high-fidelity PDE data are expensive to generate. Under limited data, their performance degrades substantially, highlighting their low data efficiency. A key reason is that PDE dynamics embody strong structural inductive biases that are not explicitly encoded in neural architectures, forcing models to learn fundamental physical structure from data. A particularly salient manifestation of this inefficiency is poor generalization to unseen source terms. In this work, we revisit Green’s function theory-a cornerstone of PDE theory-as a principled source of structural inductive bias for PDE learning. Based on this insight, we propose DGNet, a discrete Green network for data-efficient learning of spatiotemporal PDEs. The key idea is to transform the Green’s function into a graph-based discrete formulation, and embed the superposition principle into the hybrid physics-neural architecture, which reduces the burden of learning physical priors from data, thereby improving sample efficiency. Across diverse spatiotemporal PDE scenarios, DGNet consistently achieves state-of-the-art accuracy using only tens of training trajectories. Moreover, it exhibits robust zero-shot generalization to unseen source terms, serving as a stress test that highlights its data-efficient structural design.
[1310] CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning
Pratik Jawahar, Maurizio Pierini
Main category: cs.LG
TL;DR: A novel physics-grounded computational learning primitive called Causal Hamiltonian Learning Unit (CHLU) that addresses the memory-stability trade-off in temporal dynamics by enforcing relativistic Hamiltonian structure and symplectic integration.
Details
Motivation: Current deep learning primitives for temporal dynamics suffer from a fundamental dichotomy: discrete methods like LSTMs are unstable (exploding/vanishing gradients), while continuous methods like Neural ODEs are dissipative (destroy information over time). There's a need for a solution to the memory-stability trade-off.Method: Proposes the Causal Hamiltonian Learning Unit (CHLU) which enforces a Relativistic Hamiltonian structure and utilizes symplectic integration. This approach strictly conserves phase-space volume, aiming to solve the memory-stability trade-off while enabling controllable noise filtering.
Result: The CHLU is designed for infinite-horizon stability and controllable noise filtering. The paper demonstrates CHLU’s generative ability using the MNIST dataset as a proof-of-principle.
Conclusion: CHLU represents a novel physics-grounded approach to temporal dynamics that addresses fundamental stability issues in existing methods while preserving information over time.
Abstract: Current deep learning primitives dealing with temporal dynamics suffer from a fundamental dichotomy: they are either discrete and unstable (LSTMs) \citep{pascanu_difficulty_2013}, leading to exploding or vanishing gradients; or they are continuous and dissipative (Neural ODEs) \citep{dupont_augmented_2019}, which destroy information over time to ensure stability. We propose the \textbf{Causal Hamiltonian Learning Unit} (pronounced: \textit{clue}), a novel Physics-grounded computational learning primitive. By enforcing a Relativistic Hamiltonian structure and utilizing symplectic integration, a CHLU strictly conserves phase-space volume, as an attempt to solve the memory-stability trade-off. We show that the CHLU is designed for infinite-horizon stability, as well as controllable noise filtering. We then demonstrate a CHLU’s generative ability using the MNIST dataset as a proof-of-principle.
[1311] Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Harry Amad, Mihaela van der Schaar
Main category: cs.LG
TL;DR: Hyperparameter Trajectory Inference (HTI) learns how neural network outputs change with hyperparameters to create surrogate models without retraining, using conditional Lagrangian optimal transport.
Details
Motivation: Neural networks have fixed hyperparameter settings at design time, but user preferences can change post-deployment, making initial settings undesirable. Retraining is expensive, so there's a need to learn how NN outputs change with hyperparameters without retraining.Method: Proposes Hyperparameter Trajectory Inference (HTI) using conditional Lagrangian optimal transport. Learns Lagrangian function governing hyperparameter-induced dynamics along with optimal transport maps and geodesics between observed marginals. Incorporates manifold hypothesis and least-action principles as inductive biases.
Result: Empirically demonstrates that the approach reconstructs NN outputs across various hyperparameter spectra better than other alternatives.
Conclusion: HTI provides a way to adapt neural network behavior to changing user preferences without expensive retraining by learning how outputs change with hyperparameters through optimal transport theory.
Abstract: Neural networks (NNs) often have critical behavioural trade-offs that are set at design time with hyperparameters-such as reward weights in reinforcement learning or quantile targets in regression. Post-deployment, however, user preferences can evolve, making initial settings undesirable, necessitating potentially expensive retraining. To circumvent this, we introduce the task of Hyperparameter Trajectory Inference (HTI): to learn, from observed data, how a NN’s conditional output distribution changes with its hyperparameters, and construct a surrogate model that approximates the NN at unobserved hyperparameter settings. HTI requires extending existing trajectory inference approaches to incorporate conditions, exacerbating the challenge of ensuring inferred paths are feasible. We propose an approach based on conditional Lagrangian optimal transport, jointly learning the Lagrangian function governing hyperparameter-induced dynamics along with the associated optimal transport maps and geodesics between observed marginals, which form the surrogate model. We incorporate inductive biases based on the manifold hypothesis and least-action principles into the learned Lagrangian, improving surrogate model feasibility. We empirically demonstrate that our approach reconstructs NN outputs across various hyperparameter spectra better than other alternatives.
[1312] D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation
Zhao Yang, Hengchang Liu, Chuan Cao, Bing Su
Main category: cs.LG
TL;DR: D3LM is a discrete DNA diffusion language model that unifies bidirectional understanding and generation through masked diffusion, outperforming autoregressive models on DNA generation tasks.
Details
Motivation: Existing DNA foundation models either focus on understanding (BERT-style) or generation (autoregressive), but lack unified bidirectional capabilities that match DNA's inherent regulatory relationships which are bidirectional.Method: D3LM adopts the Nucleotide Transformer v2 architecture but reformulates training as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation in a single model.
Result: D3LM achieves improved performance on understanding tasks compared to NT v2, and on regulatory element generation achieves SFID of 10.92 (close to real DNA’s 7.85), substantially outperforming previous best autoregressive result of 29.16.
Conclusion: Diffusion language models are a promising paradigm for unified DNA foundation models, and the paper provides the first systematic study of masked diffusion models in DNA domain with practical design insights.
Abstract: Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (\textbf{D}iscrete \textbf{D}NA \textbf{D}iffusion \textbf{L}anguage \textbf{M}odel), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at https://huggingface.co/collections/Hengchang-Liu/d3lm.
[1313] Learning Shortest Paths with Generative Flow Networks
Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov
Main category: cs.LG
TL;DR: GFlowNets framework for shortest path finding in graphs with theoretical guarantees and experimental validation on permutation environments and Rubik’s Cube.
Details
Motivation: To develop a learning framework for finding shortest paths in graphs using Generative Flow Networks (GFlowNets), with theoretical foundations for non-acyclic environments and practical applications to complex pathfinding problems.Method: Proposes using GFlowNets with flow regularization to solve pathfinding problems. First establishes theoretical properties showing that minimizing total flow ensures policies traverse only shortest paths. Then applies this framework to arbitrary graphs through non-acyclic GFlowNet training with flow regularization.
Result: Theoretical proof that minimized flow leads to exclusive traversal of shortest paths. Experimental results show competitive performance on permutation environments and Rubik’s Cube solving, achieving solution lengths comparable to state-of-the-art ML approaches while requiring smaller test-time search budgets.
Conclusion: GFlowNets with flow regularization provide an effective framework for shortest path finding with theoretical guarantees, demonstrating practical utility on complex combinatorial problems like Rubik’s Cube solving.
Abstract: In this paper, we present a novel learning framework for finding shortest paths in graphs utilizing Generative Flow Networks (GFlowNets). First, we examine theoretical properties of GFlowNets in non-acyclic environments in relation to shortest paths. We prove that, if the total flow is minimized, forward and backward policies traverse the environment graph exclusively along shortest paths between the initial and terminal states. Building on this result, we show that the pathfinding problem in an arbitrary graph can be solved by training a non-acyclic GFlowNet with flow regularization. We experimentally demonstrate the performance of our method in pathfinding in permutation environments and in solving Rubik’s Cubes. For the latter problem, our approach shows competitive results with state-of-the-art machine learning approaches designed specifically for this task in terms of the solution length, while requiring smaller search budget at test-time.
[1314] Phase-Type Variational Autoencoders for Heavy-Tailed Data
Abdelhakim Ziani, András Horváth, Paolo Ballarini
Main category: cs.LG
TL;DR: PH-VAE introduces Phase-Type distributions as decoder in VAEs to model heavy-tailed data, outperforming traditional parametric families in capturing extreme events and tail dependencies.
Details
Motivation: Standard VAEs use simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior common in real-world data, while existing heavy-tail extensions are limited to predefined parametric families with fixed tail behavior.Method: Proposes Phase-Type Variational Autoencoder (PH-VAE) with decoder distribution as latent-conditioned Phase-Type distribution, defined as absorption time of continuous-time Markov chain (CTMC). This composes multiple exponential time scales for flexible, analytically tractable decoder that adapts tail behavior from data.
Result: Experiments on synthetic and real-world benchmarks show PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. Captures realistic cross-dimensional tail dependence in multivariate settings.
Conclusion: PH-VAE successfully integrates Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning, providing flexible approach for heavy-tailed data modeling.
Abstract: Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
[1315] Uncertainty Quantification of Click and Conversion Estimates for the Autobidding
Ivan Zhigalskii, Andrey Pudovikov, Aleksandr Katrutsa, Egor Samosvat
Main category: cs.LG
TL;DR: DenoiseBid method uses Bayesian approach to correct noisy CTR/CVR estimates for more efficient autobidding in e-commerce auctions
Details
Motivation: Autobidding algorithms in e-commerce platforms rely on CTR/CVR estimates from ML models, but these predictions are uncertain and can significantly impact auction performanceMethod: Proposes DenoiseBid method using Bayesian approach to replace noisy CTR/CVR estimates with recovered distributions to generate more efficient bids
Result: Extensive experiments on synthetic, iPinYou, and BAT datasets show improved performance; robustness evaluated with synthetic noise and noise from pre-trained ML model predictions
Conclusion: DenoiseBid effectively addresses uncertainty in CTR/CVR predictions for autobidding algorithms, improving auction efficiency in e-commerce platforms
Abstract: Modern e-commerce platforms employ various auction mechanisms to allocate paid slots for a given item. To scale this approach to the millions of auctions, the platforms suggest promotion tools based on the autobidding algorithms. These algorithms typically depend on the Click-Through-Rate (CTR) and Conversion-Rate (CVR) estimates provided by a pre-trained machine learning model. However, the predictions of such models are uncertain and can significantly affect the performance of the autobidding algorithm. To address this issue, we propose the DenoiseBid method, which corrects the generated CTRs and CVRs to make the resulting bids more efficient in auctions. The underlying idea of our method is to employ a Bayesian approach and replace noisy CTR or CVR estimates with those from recovered distributions. To demonstrate the performance of the proposed approach, we perform extensive experiments on the synthetic, iPinYou, and BAT datasets. To evaluate the robustness of our approach to the noise scale, we use synthetic noise and noise estimated from the predictions of the pre-trained machine learning model.
[1316] Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes
Hongkun Dou, Zike Chen, Zeyu Li, Hongjue Li, Lijun Yang, Yue Deng
Main category: cs.LG
TL;DR: CPS is a gradient-free diffusion method for solving inverse problems without requiring knowledge of the forward observation process, using constrained particle seeking optimization.
Details
Motivation: Existing diffusion-based inverse problem methods require complete knowledge of the forward observation process for gradient computation, limiting applicability when such information is unavailable. There's a need for gradient-free approaches that can handle unknown forward processes.Method: Constrained Particle Seeking (CPS) reformulates inverse problems as constrained optimization tasks. It leverages all candidate particle information to actively search for optimal particles while incorporating constraints aligned with high-density regions of the unconditional prior, unlike passive selection methods.
Result: CPS effectively solves both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives.
Conclusion: CPS provides a flexible, gradient-free approach for solving inverse problems with diffusion models when forward process knowledge is unavailable, expanding the applicability of diffusion methods to more real-world scenarios.
Abstract: Diffusion models have gained prominence as powerful generative tools for solving inverse problems due to their ability to model complex data distributions. However, existing methods typically rely on complete knowledge of the forward observation process to compute gradients for guided sampling, limiting their applicability in scenarios where such information is unavailable. In this work, we introduce \textbf{\emph{Constrained Particle Seeking (CPS)}}, a novel gradient-free approach that leverages all candidate particle information to actively search for the optimal particle while incorporating constraints aligned with high-density regions of the unconditional prior. Unlike previous methods that passively select promising candidates, CPS reformulates the inverse problem as a constrained optimization task, enabling more flexible and efficient particle seeking. We demonstrate that CPS can effectively solve both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives. Code is available at https://github.com/deng-ai-lab/CPS.
[1317] Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies
Matthieu Latapy, Stephany Rajeh
Main category: cs.LG
TL;DR: Basic graph features and classical ML outperform complex methods for detecting injected link anomalies in streams, with better efficiency and interpretability
Details
Motivation: Current anomaly detection methods for link streams rely on complex approaches with computational/interpretability issues, and evaluation is limited to detecting randomly injected linksMethod: Use simple graph features (trivial graph metrics) combined with classical machine learning techniques instead of complex approaches
Result: Basic approach achieves excellent anomaly detection performance with very low computational cost, high interpretability, and many desirable properties demonstrated through extensive experiments
Conclusion: Simple methods are sufficient for current evaluation paradigms; future detection methods should target more complex kinds of anomalies beyond randomly injected links
Abstract: Detecting anomalies in link streams that represent various kinds of interactions is an important research topic with crucial applications. Because of the lack of ground truth data, proposed methods are mostly evaluated through their ability to detect randomly injected links. In contrast with most proposed methods, that rely on complex approaches raising computational and/or interpretability issues, we show here that trivial graph features and classical learning techniques are sufficient to detect such anomalies extremely well. This basic approach has very low computational costs and it leads to easily interpretable results. It also has many other desirable properties that we study through an extensive set of experiments. We conclude that detection methods should now target more complex kinds of anomalies.
[1318] Tide: A Customisable Dataset Generator for Anti-Money Laundering Research
Montijn van den Beukel, Jože Martin Rožanec, Ana-Lucia Varbanescu
Main category: cs.LG
TL;DR: Tide is an open-source synthetic dataset generator for Anti-Money Laundering research that creates graph-based financial networks with both structural and temporal laundering patterns, addressing privacy concerns with real data.
Details
Motivation: The lack of accessible transactional data due to privacy and legal concerns hinders AML machine learning research. Existing synthetic generators focus only on simplistic structural patterns and neglect temporal dynamics crucial for detecting sophisticated laundering schemes.Method: Tide generates synthetic graph-based financial networks incorporating money laundering patterns with both structural and temporal characteristics. It produces customizable datasets with varying illicit ratios and includes implementation of state-of-the-art detection models for benchmarking.
Result: Evaluation shows condition-dependent model rankings: LightGBM achieves highest PR-AUC (78.05) at low illicit ratio (0.10%), while XGBoost performs best (85.12) at higher fraud prevalence (0.19%). The datasets meaningfully differentiate model capabilities across operational conditions.
Conclusion: Tide provides a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods by enabling reproducible, customizable dataset generation for the research community.
Abstract: The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.
[1319] Diagnosing Generalization Failures from Representational Geometry Markers
Chi-Ning Chou, Artem Kirsanov, Yao-Yuan Yang, SueYeon Chung
Main category: cs.LG
TL;DR: A top-down approach using network biomarkers to predict out-of-distribution generalization failures by analyzing geometric properties of object manifolds in representation space.
Details
Motivation: Current bottom-up mechanistic approaches struggle to provide high-level predictive signals for anticipating model failures in real-world deployment. The paper proposes a medical biomarker-inspired approach to identify system-level measurements that indicate future performance.Method: Systematically design and test network markers to probe structure-function links, focusing on geometric properties of in-distribution object manifolds. Specifically measure effective manifold dimensionality and utility to predict out-of-distribution generalization.
Result: Reductions in two geometric measures (effective manifold dimensionality and utility) consistently predict weaker OOD performance across diverse architectures, optimizers, and datasets. These geometric patterns predict OOD transfer performance more reliably than in-distribution accuracy.
Conclusion: Representational geometry can expose hidden vulnerabilities and offer robust guidance for model selection and AI interpretability, demonstrating the value of top-down biomarker approaches for anticipating generalization failures.
Abstract: Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventional approaches often take a bottom-up'' mechanistic route by reverse-engineering interpretable features or circuits to build explanatory models. While insightful, these methods often struggle to provide the high-level, predictive signals for anticipating failure in real-world deployment. Here, we propose using a top-down’’ approach to studying generalization failures inspired by medical biomarkers: identifying system-level measurements that serve as robust indicators of a model’s future performance. Rather than mapping out detailed internal mechanisms, we systematically design and test network markers to probe structure, function links, identify prognostic indicators, and validate predictions in real-world settings. In image classification, we find that task-relevant geometric properties of in-distribution (ID) object manifolds consistently forecast poor out-of-distribution (OOD) generalization. In particular, reductions in two geometric measures, effective manifold dimensionality and utility, predict weaker OOD performance across diverse architectures, optimizers, and datasets. We apply this finding to transfer learning with ImageNet-pretrained models. We consistently find that the same geometric patterns predict OOD transfer performance more reliably than ID accuracy. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.
[1320] SEAR: Sample Efficient Action Chunking Reinforcement Learning
C. F. Maximilian Nagy, Onur Celik, Emiliyan Gospodinov, Florian Seligmann, Weiran Liao, Aryan Kaushik, Gerhard Neumann
Main category: cs.LG
TL;DR: SEAR is an online RL algorithm for action chunking that combines benefits of small and large chunk sizes using receding horizon approach, achieving SOTA performance on Metaworld with chunk sizes up to 20.
Details
Motivation: Action chunking improves exploration and value estimation in long-horizon RL but makes learning harder as critics must evaluate action sequences rather than single actions, increasing approximation and data efficiency challenges. Existing methods designed for offline settings haven't achieved strong performance in purely online RL.Method: SEAR is an off-policy online RL algorithm that exploits temporal structure of action chunks and operates with a receding horizon, effectively combining benefits of small and large chunk sizes.
Result: SEAR outperforms state-of-the-art online RL methods on Metaworld, training with chunk sizes up to 20.
Conclusion: SEAR demonstrates effective online RL with action chunking, addressing challenges of value estimation for action sequences while maintaining strong performance.
Abstract: Action chunking can improve exploration and value estimation in long horizon reinforcement learning, but makes learning substantially harder since the critic must evaluate action sequences rather than single actions, greatly increasing approximation and data efficiency challenges. As a result, existing action chunking methods, primarily designed for the offline and offline-to-online settings, have not achieved strong performance in purely online reinforcement learning. We introduce SEAR, an off policy online reinforcement learning algorithm for action chunking. It exploits the temporal structure of action chunks and operates with a receding horizon, effectively combining the benefits of small and large chunk sizes. SEAR outperforms state of the art online reinforcement learning methods on Metaworld, training with chunk sizes up to 20.
[1321] Dream2Learn: Structured Generative Dreaming for Continual Learning
Salvatore Calcagno, Matteo Pennisi, Federica Proietto Salanitri, Amelia Sorrenti, Simone Palazzo, Concetto Spampinato, Giovanni Bellitto
Main category: cs.LG
TL;DR: Dream2Learn (D2L) uses internal simulation via diffusion models to generate novel “dreamed” classes for continual learning, improving forward transfer without storing past data.
Details
Motivation: Address catastrophic forgetting in continual learning by mimicking human dreaming as a mechanism for internal simulation and knowledge restructuring, enabling proactive feature organization rather than just replaying past data.Method: Classifier autonomously generates novel dreamed classes via frozen diffusion model conditioned through soft prompt optimization. Generated synthetic data expands representation space for self-training, integrating dreamed classes into continual training to structure latent features for forward transfer.
Result: Outperforms strong rehearsal-based baselines on Mini-ImageNet, FG-ImageNet, and ImageNet-R datasets, achieving positive forward transfer and enhanced adaptability through internally generated training signals.
Conclusion: Dream2Learn demonstrates that internal simulation through generated dreamed classes can effectively enhance continual learning by proactively structuring knowledge for better generalization and forward transfer, mirroring sleep’s role in memory consolidation.
Abstract: Continual learning requires balancing plasticity and stability while mitigating catastrophic forgetting. Inspired by human dreaming as a mechanism for internal simulation and knowledge restructuring, we introduce Dream2Learn (D2L), a framework in which a model autonomously generates structured synthetic experiences from its own internal representations and uses them for self-improvement. Rather than reconstructing past data as in generative replay, D2L enables a classifier to create novel, semantically distinct dreamed classes that are coherent with its learned knowledge yet do not correspond to previously observed data. These dreamed samples are produced by conditioning a frozen diffusion model through soft prompt optimization driven by the classifier itself. The generated data are not used to replace memory, but to expand and reorganize the representation space, effectively allowing the network to self-train on internally synthesized concepts. By integrating dreamed classes into continual training, D2L proactively structures latent features to support forward knowledge transfer and adaptation to future tasks. This prospective self-training mechanism mirrors the role of sleep in consolidating and reorganizing memory, turning internal simulations into a tool for improved generalization. Experiments on Mini-ImageNet, FG-ImageNet, and ImageNet-R demonstrate that D2L consistently outperforms strong rehearsal-based baselines and achieves positive forward transfer, confirming its ability to enhance adaptability through internally generated training signals.
[1322] Explanation-Guided Adversarial Training for Robust and Interpretable Models
Chao Chen, Yanhui Chen, Shanshan Lin, Dongsheng Hong, Shu Wu, Xiangwen Liao, Chuanyi Liu
Main category: cs.LG
TL;DR: EGAT integrates adversarial training with explanation-guided learning to improve model robustness, accuracy, and interpretability simultaneously.
Details
Motivation: Current DNNs are opaque black boxes; explanation-guided learning improves interpretability but assumes benign inputs and is annotation-heavy. Adversarial training improves robustness but doesn't ensure semantically meaningful decisions. There's a need to combine both approaches.Method: Proposes Explanation-Guided Adversarial Training (EGAT) that generates adversarial examples while imposing explanation-based constraints. Jointly optimizes classification performance, adversarial robustness, and attributional stability through a unified framework.
Result: EGAT outperforms baselines by +37% in adversarial accuracy while maintaining clean accuracy, produces more semantically meaningful explanations, and requires only +16% additional training time. Also shows strong performance on OOD benchmarks.
Conclusion: EGAT successfully integrates adversarial training and explanation-guided learning to create models that are robust, accurate, and interpretable, with theoretical guarantees under the PAC learning framework.
Abstract: Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. In contrast, both predictions and saliency maps of DNNs could dramatically alter facing imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strength of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offer human-interpretable justifications for the decisions. We further formalize EGAT within the Probably Approximately Correct learning framework, demonstrating theoretically that it yields more stable predictions under unexpected situations compared to standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean accuracy and adversarial accuracy +37% while producing more semantically meaningful explanations, and requiring only a limited increase +16% in training time.
[1323] BAED: a New Paradigm for Few-shot Graph Learning with Explanation in the Loop
Chao Chen, Xujia Li, Dongsheng Hong, Shanshan Lin, Xiangwen Liao, Chuanyi Liu, Lei Chen
Main category: cs.LG
TL;DR: BAED is the first explanation-in-the-loop framework for Few-Shot Graph Learning that uses belief propagation for label augmentation and extracts explanatory subgraphs to improve prediction accuracy and interpretability.
Details
Motivation: Few-Shot Graph Learning faces challenges with limited labeled data requiring expert annotation, leading to overfitting and poor interpretability. Current approaches compromise robustness and interpretability while adapting to new label distributions.Method: BAED employs belief propagation algorithm for label augmentation on graphs, then uses an auxiliary graph neural network with gradient backpropagation to extract explanatory subgraphs around target nodes. Predictions are based on these informative subgraphs while reducing redundant neighbor information.
Result: Extensive experiments on seven benchmark datasets show superior prediction accuracy, training efficiency, and explanation quality compared to existing methods.
Conclusion: This pioneering work demonstrates the potential of explanation-based research paradigms in Few-Shot Graph Learning, offering improved interpretability alongside performance gains.
Abstract: The challenges of training and inference in few-shot environments persist in the area of graph representation learning. The quality and quantity of labels are often insufficient due to the extensive expert knowledge required to annotate graph data. In this context, Few-Shot Graph Learning (FSGL) approaches have been developed over the years. Through sophisticated neural architectures and customized training pipelines, these approaches enhance model adaptability to new label distributions. However, compromises in \textcolor{black}{the model’s} robustness and interpretability can result in overfitting to noise in labeled data and degraded performance. This paper introduces the first explanation-in-the-loop framework for the FSGL problem, called BAED. We novelly employ the belief propagation algorithm to facilitate label augmentation on graphs. Then, leveraging an auxiliary graph neural network and the gradient backpropagation method, our framework effectively extracts explanatory subgraphs surrounding target nodes. The final predictions are based on these informative subgraphs while mitigating the influence of redundant information from neighboring nodes. Extensive experiments on seven benchmark datasets demonstrate superior prediction accuracy, training efficiency, and explanation quality of BAED. As a pioneer, this work highlights the potential of the explanation-based research paradigm in FSGL.
[1324] Probabilistic Retrofitting of Learned Simulators
Cristiana Diaconu, Miles Cranmer, Richard E. Turner, Tanya Marwah, Payel Mukhopadhyay
Main category: cs.LG
TL;DR: Retrofitting pre-trained deterministic PDE models into probabilistic ones using CRPS scoring rule, achieving significant improvements with minimal training cost.
Details
Motivation: Many physical systems are chaotic and uncertain, but existing PDE models are deterministic. Retraining probabilistic models from scratch is expensive and wastes existing deterministic model investments.Method: Architecture-agnostic approach that retrofits pre-trained deterministic models into probabilistic ones using Continuous Ranked Probability Score (CRPS) as a proper scoring rule. Minimal code modifications needed across different model backbones.
Result: 20-54% reduction in rollout CRPS and up to 30% improvement in variance-normalized RMSE for single-system models. For PDE foundation models, up to 40% improvement in CRPS and 15% in VRMSE compared to deterministic fine-tuning.
Conclusion: Probabilistic PDE modeling can be efficiently unlocked from existing deterministic backbones without retraining from scratch, with modest additional training cost.
Abstract: Dominant approaches for modelling Partial Differential Equations (PDEs) rely on deterministic predictions, yet many physical systems of interest are inherently chaotic and uncertain. While training probabilistic models from scratch is possible, it is computationally expensive and fails to leverage the significant resources already invested in high-performing deterministic backbones. In this work, we adopt a training-efficient strategy to transform pre-trained deterministic models into probabilistic ones via retrofitting with a proper scoring rule: the Continuous Ranked Probability Score (CRPS). Crucially, this approach is architecture-agnostic: it applies the same adaptation mechanism across distinct model backbones with minimal code modifications. The method proves highly effective across different scales of pre-training: for models trained on single dynamical systems, we achieve 20-54% reductions in rollout CRPS and up to 30% improvements in variance-normalised RMSE (VRMSE) relative to compute-matched deterministic fine-tuning. We further validate our approach on a PDE foundation model, trained on multiple systems and retrofitted on the dataset of interest, to show that our probabilistic adaptation yields an improvement of up to 40% in CRPS and up to 15% in VRMSE compared to deterministic fine-tuning. Validated across diverse architectures and dynamics, our results show that probabilistic PDE modelling need not require retraining from scratch, but can be unlocked from existing deterministic backbones with modest additional training cost.
[1325] Accelerating Single-Pass SGD for Generalized Linear Prediction
Qian Chen, Shihong Ding, Cong Fang
Main category: cs.LG
TL;DR: Momentum acceleration for single-pass stochastic optimization of generalized linear models in streaming settings, achieving improved optimization error while maintaining minimax optimal statistical error.
Details
Motivation: Addressing whether momentum can accelerate single-pass non-quadratic stochastic optimization in streaming settings where only one fresh data point is used per iteration, resolving an open problem about momentum vs variance reduction effectiveness.Method: Proposes a novel algorithm incorporating momentum via data-dependent proximal method with dual-momentum acceleration, using fine-grained stationary analysis of inner updates and two-phase outer-loop analysis for statistical error localization.
Result: Achieves excess risk bound with three components: improved optimization error, minimax optimal statistical error, and higher-order model-misspecification error, resolving the open problem about momentum effectiveness over variance reduction.
Conclusion: Momentum acceleration is more effective than variance reduction for generalized linear prediction in streaming settings, successfully incorporating momentum via novel data-dependent proximal method.
Abstract: We study generalized linear prediction under a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax optimal statistical error, and a higher-order model-misspecification error. The proof handles mis-specification via a fine-grained stationary analysis of inner updates, while localizing statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.
[1326] The Expressive Limits of Diagonal SSMs for State-Tracking
Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh, Sarath Chandar
Main category: cs.LG
TL;DR: Theoretical analysis shows that multi-layer DCD SSMs can only express state-tracking of groups with subnormal series of length k and Abelian factors, identifying precise expressivity limits within solvable groups.
Details
Motivation: While State-Space Models (SSMs) show strong empirical performance on long-range sequence tasks, their theoretical expressive power remains poorly understood. The paper aims to formally characterize what state-tracking tasks DCD SSMs can and cannot express.Method: Theoretical analysis of input-Dependent Complex-valued Diagonal (DCD) SSMs’ expressivity on sequential state-tracking tasks. Proves formal theorems about what groups can be tracked by single-layer and multi-layer DCD SSMs, connecting expressivity to group theory concepts like subnormal series and Abelian factors.
Result: Single-layer DCD SSMs cannot express state-tracking of any non-Abelian group at finite precision. Multi-layer (k-layer) DCD SSMs can express state-tracking of a group if and only if that group has a subnormal series of length k with Abelian factors. Empirical validation shows multi-layer models often fail to learn state-tracking for non-Abelian groups.
Conclusion: The paper provides precise theoretical bounds on DCD SSM expressivity for state-tracking tasks, revealing limitations within solvable groups and highlighting a gap between theoretical expressivity and practical learnability.
Abstract: State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly-parallelizable. However, the theoretical understanding of their expressive power remains limited. In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) SSMs on sequential state-tracking tasks. We show that single-layer DCD SSMs cannot express state-tracking of any non-Abelian group at finite precision. More generally, we show that $k$-layer DCD SSMs can express state-tracking of a group if and only if that group has a subnormal series of length $k$, with Abelian factors. That is, we identify the precise expressivity range of $k$-layer DCD SSMs within the solvable groups. Empirically, we find that multi-layer models often fail to learn state-tracking for non-Abelian groups, highlighting a gap between expressivity and learnability.
[1327] TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
Taimur Khan
Main category: cs.LG
TL;DR: TiledAttention is a customizable, high-performance SDPA forward operator for NVIDIA GPUs that enables rapid kernel research through Python-level schedule editing while maintaining realistic behavior with online softmax and tiled KV streaming.
Details
Motivation: The paper addresses the need for more accessible and customizable attention kernel research tools. Current approaches often require low-level CUDA/CUTLASS template rewrites, which are time-consuming and hinder rapid experimentation. There's a gap between high-level PyTorch implementations and low-level CUDA kernels that makes attention research difficult.Method: TiledAttention implements scaled dot-product attention as a PyTorch-callable function using cuTile Python (TileIR). It features online softmax computation and tiled K,V streaming for realistic behavior. The key innovation is enabling direct schedule-level editing from Python (tile shapes, staging, shared-memory layout) while maintaining performance comparable to production kernels.
Result: Benchmarks on NVIDIA DGX GB10 show TiledAttention delivers large speedups over standard eager attention paths, though production fused baselines remain stronger overall. The system provides practical performance while enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites.
Conclusion: TiledAttention strikes a balance between performance and customizability, offering a practical tool for attention kernel research that’s directly usable within PyTorch workflows. It enables researchers to experiment with attention implementations more efficiently than traditional low-level approaches.
Abstract: TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
[1328] CoVAE: correlated multimodal generative modeling
Federico Caretti, Guido Sanguinetti
Main category: cs.LG
TL;DR: CoVAE: A multimodal VAE architecture that preserves inter-modal correlations for better generation and uncertainty quantification
Details
Motivation: Standard multimodal VAEs use fusion strategies that destroy the joint statistical structure of multimodal data, which negatively impacts generation quality and uncertainty quantification. There's a need for models that can capture correlations between modalities.Method: Introduces Correlated Variational Autoencoders (CoVAE), a new generative architecture that explicitly captures correlations between modalities in the latent space, preserving the joint statistical structure of multimodal data.
Result: Tested on real and synthetic datasets, CoVAE demonstrates accurate cross-modal reconstruction and effective quantification of associated uncertainties compared to standard multimodal VAEs.
Conclusion: CoVAE provides a better approach for multimodal representation learning by preserving inter-modal correlations, leading to improved generation capabilities and uncertainty quantification in multimodal settings.
Abstract: Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.
[1329] Intrinsic Task Symmetry Drives Generalization in Algorithmic Tasks
Hyeonbin Hwang, Yeachan Park
Main category: cs.LG
TL;DR: Grokking (sudden generalization after memorization) is driven by intrinsic task symmetries that shape representation geometry through a three-stage process: memorization, symmetry acquisition, and geometric organization.
Details
Motivation: The paper aims to understand the mechanism behind grokking - the puzzling phenomenon where neural networks suddenly transition from memorization to generalization after prolonged training. While previous work observed low-dimensional representations emerging during grokking, the fundamental drivers of this organization remained unclear.Method: The authors propose that intrinsic task symmetries drive grokking and analyze training dynamics across diverse algorithmic domains (algebraic, structural, and relational reasoning tasks). They identify a consistent three-stage process and develop a symmetry-based diagnostic to predict generalization onset.
Result: The study reveals that generalization emerges during the symmetry acquisition phase, after which representations reorganize into structured, task-aligned geometry. The symmetry-driven account is validated across multiple domains, and the proposed diagnostic successfully anticipates generalization onset.
Conclusion: Intrinsic symmetry is established as the key factor enabling neural networks to move beyond memorization and achieve robust algorithmic reasoning. The findings provide both theoretical understanding and practical strategies to accelerate generalization.
Abstract: Grokking, the sudden transition from memorization to generalization, is characterized by the emergence of low-dimensional representations, yet the mechanism underlying this organization remains elusive. We propose that intrinsic task symmetries primarily drive grokking and shape the geometry of the model’s representation space. We identify a consistent three-stage training dynamic underlying grokking: (i) memorization, (ii) symmetry acquisition, and (iii) geometric organization. We show that generalization emerges during the symmetry acquisition phase, after which representations reorganize into a structured, task-aligned geometry. We validate this symmetry-driven account across diverse algorithmic domains, including algebraic, structural, and relational reasoning tasks. Building on these findings, we introduce a symmetry-based diagnostic that anticipates the onset of generalization and propose strategies to accelerate it. Together, our results establish intrinsic symmetry as the key factor enabling neural networks to move beyond memorization and achieve robust algorithmic reasoning.
[1330] MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interaction Potentials
Yuanchang Zhou, Siyu Hu, Xiangyu Zhang, Hongyu Wang, Guangming Tan, Weile Jia
Main category: cs.LG
TL;DR: MatRIS is an invariant machine learning interatomic potential that uses attention-based modeling of three-body interactions with linear complexity, achieving accuracy comparable to equivariant models at lower computational cost.
Details
Motivation: While equivariant MLIPs achieve state-of-the-art accuracy, they are computationally expensive due to tensor products and high-degree representations. As quantum mechanical datasets expand, there's a need for more compact models that can thoroughly exploit high-dimensional atomic interactions while maintaining accuracy.Method: MatRIS introduces attention-based modeling of three-body interactions using a novel separable attention mechanism with linear complexity O(N). It’s an invariant MLIP that leverages attention mechanisms to capture atomic interactions efficiently without relying on computationally costly equivariant representations.
Result: MatRIS delivers accuracy comparable to leading equivariant models on benchmarks including Matbench-Discovery, MatPES, MDR phonon, and Molecular datasets. On Matbench-Discovery, it achieves an F1 score up to 0.847 and attains comparable accuracy at lower training cost.
Conclusion: Carefully designed invariant models can match or exceed the accuracy of equivariant models at a fraction of the cost, providing a path toward developing accurate and efficient MLIPs for materials science applications.
Abstract: Foundation MLIPs demonstrate broad applicability across diverse material systems and have emerged as a powerful and transformative paradigm in chemical and computational materials science. Equivariant MLIPs achieve state-of-the-art accuracy in a wide range of benchmarks by incorporating equivariant inductive bias. However, the reliance on tensor products and high-degree representations makes them computationally costly. This raises a fundamental question: as quantum mechanical-based datasets continue to expand, can we develop a more compact model to thoroughly exploit high-dimensional atomic interactions? In this work, we present MatRIS (\textbf{Mat}erials \textbf{R}epresentation and \textbf{I}nteraction \textbf{S}imulation), an invariant MLIP that introduces attention-based modeling of three-body interactions. MatRIS leverages a novel separable attention mechanism with linear complexity $O(N)$, enabling both scalability and expressiveness. MatRIS delivers accuracy comparable to that of leading equivariant models on a wide range of popular benchmarks (Matbench-Discovery, MatPES, MDR phonon, Molecular dataset, etc). Taking Matbench-Discovery as an example, MatRIS achieves an F1 score of up to 0.847 and attains comparable accuracy at a lower training cost. The work indicates that our carefully designed invariant models can match or exceed the accuracy of equivariant models at a fraction of the cost, shedding light on the development of accurate and efficient MLIPs.
[1331] Mitigating topology biases in Graph Diffusion via Counterfactual Intervention
Wendi Wang, Jiaxi Yang, Yongkang Du, Lu Lin
Main category: cs.LG
TL;DR: FairGDiff is a counterfactual-based graph diffusion model that mitigates topology biases from sensitive attributes while preserving structural utility in graph generation.
Details
Motivation: Graph diffusion models amplify topology biases from sensitive attributes (gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation methods are limited to specific applications with complete labels or require simultaneous updates for graph structure and node attributes.Method: Proposes FairGDiff, a counterfactual-based one-step solution that constructs a causal model to capture relationships between sensitive attributes, biased link formation, and generated graph structure. Answers counterfactual question “Would the graph structure change if the sensitive attribute were different?” to estimate unbiased treatment and incorporates it into both forward diffusion and backward denoising processes.
Result: Extensive experiments on real-world datasets demonstrate that FairGDiff achieves superior trade-off between fairness and utility, outperforming existing fair graph generation methods while maintaining scalability.
Conclusion: FairGDiff effectively mitigates topology biases in graph generation through counterfactual learning integrated into diffusion processes, ensuring generated graphs are independent of sensitive attributes while preserving structural integrity.
Abstract: Graph diffusion models have gained significant attention in graph generation tasks, but they often inherit and amplify topology biases from sensitive attributes (e.g. gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation using diffusion models is limited to specific graph-based applications with complete labels or requires simultaneous updates for graph structure and node attributes, making them unsuitable for general usage. To relax these limitations by applying the debiasing method directly on graph topology, we propose Fair Graph Diffusion Model (FairGDiff), a counterfactual-based one-step solution that mitigates topology biases while balancing fairness and utility. In detail, we construct a causal model to capture the relationship between sensitive attributes, biased link formation, and the generated graph structure. By answering the counterfactual question “Would the graph structure change if the sensitive attribute were different?”, we estimate an unbiased treatment and incorporate it into the diffusion process. FairGDiff integrates counterfactual learning into both forward diffusion and backward denoising, ensuring that the generated graphs are independent of sensitive attributes while preserving structural integrity. Extensive experiments on real-world datasets demonstrate that FairGDiff achieves a superior trade-off between fairness and utility, outperforming existing fair graph generation methods while maintaining scalability.
[1332] Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards
Faisal Mohamed, Catherine Ji, Benjamin Eysenbach, Glen Berseth
Main category: cs.LG
TL;DR: Temporal contrastive representation learning for exploration in RL, prioritizing unpredictable future states without explicit distance learning or episodic memory.
Details
Motivation: Effective exploration requires understanding how agents perceive the world and learning powerful representations that capture temporal information for solving diverse tasks without full state reconstruction.Method: Proposes exploration method using temporal contrastive representations to guide exploration by prioritizing states with unpredictable future outcomes, building directly on temporal similarities rather than explicit distance learning.
Result: Demonstrates learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities that traditionally require extrinsic rewards.
Conclusion: Temporal contrastive representations provide a simpler yet effective strategy for exploration compared to quasimetric-based methods or episodic memory mechanisms.
Abstract: Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory x in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
[1333] Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families
Amir Asiaee, Samhita Pal
Main category: cs.LG
TL;DR: DP sufficient statistics release for exponential families enables noise-calibrated inference and synthetic data generation with valid uncertainty quantification.
Details
Motivation: Existing DP data release systems either output synthetic data without proper inference calibration or provide point estimates without uncertainty quantification, leading to miscalibration issues.Method: Release DP sufficient statistics using Gaussian mechanism with clipping, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing.
Result: Developed asymptotic normality, explicit variance inflation, valid Wald-style confidence intervals, noise-aware likelihood correction, and matching minimax lower bounds showing privacy distortion rate is unavoidable.
Conclusion: The approach provides concrete design rules and practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on real data.
Abstract: Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.
[1334] CausalWrap: Model-Agnostic Causal Constraint Wrappers for Tabular Synthetic Data
Amir Asiaee, Zhuohui J. Liang, Chao Yan
Main category: cs.LG
TL;DR: CausalWrap (CW) is a model-agnostic wrapper that injects partial causal knowledge into pretrained tabular synthetic data generators to improve causal fidelity for downstream causal reasoning tasks.
Details
Motivation: Current tabular synthetic data generators focus on matching observational distributions but fail to preserve structural relations needed for causal analysis and out-of-distribution reasoning, which is crucial for tasks like treatment effect estimation and policy evaluation.Method: CW learns a lightweight, differentiable post-hoc correction map applied to samples from any base generator (GAN, VAE, diffusion model). It uses causal penalty terms (trusted edges, forbidden edges, monotonic constraints) under an augmented-Lagrangian schedule without requiring access to the base generator’s internals.
Result: CW improves causal fidelity across diverse base generators, reducing average treatment effect (ATE) error by up to 63% on ACIC benchmark and lifting ATE agreement from 0.00 to 0.38 on ICU cohort, while largely retaining conventional utility.
Conclusion: CausalWrap enables synthetic data generators to better support causal reasoning tasks by incorporating partial causal knowledge, bridging the gap between observational distribution matching and structural fidelity needed for causal analysis.
Abstract: Tabular synthetic data generators are typically trained to match observational distributions, which can yield high conventional utility (e.g., column correlations, predictive accuracy) yet poor preservation of structural relations relevant to causal analysis and out-of-distribution (OOD) reasoning. When the downstream use of synthetic data involves causal reasoning – estimating treatment effects, evaluating policies, or testing mediation pathways – merely matching the observational distribution is insufficient: structural fidelity and treatment-mechanism preservation become essential. We propose CausalWrap (CW), a model-agnostic wrapper that injects partial causal knowledge (PCK) – trusted edges, forbidden edges, and qualitative/monotonic constraints – into any pretrained base generator (GAN, VAE, or diffusion model), without requiring access to its internals. CW learns a lightweight, differentiable post-hoc correction map applied to samples from the base generator, optimized with causal penalty terms under an augmented-Lagrangian schedule. We provide theoretical results connecting penalty-based optimization to constraint satisfaction and relating approximate factorization to joint distributional control. We validate CW on simulated structural causal models (SCMs) with known ground-truth interventions, semi-synthetic causal benchmarks (IHDP and an ACIC-style suite), and a real-world ICU cohort (MIMIC-IV) with expert-elicited partial graphs. CW improves causal fidelity across diverse base generators – e.g., reducing average treatment effect (ATE) error by up to 63% on ACIC and lifting ATE agreement from 0.00 to 0.38 on the intensive care unit (ICU) cohort – while largely retaining conventional utility.
[1335] Revealing Combinatorial Reasoning of GNNs via Graph Concept Bottleneck Layer
Yue Niu, Zhaokai Sun, Jiayi Yang, Xiaofeng Cao, Rui Fan, Xin Sun, Hanli Wang, Wei Ye
Main category: cs.LG
TL;DR: GCBMs introduce graph concept bottleneck layers for interpretable GNNs that quantify concept contributions using soft logical rules and leverage language models for concept embeddings.
Details
Motivation: Current GNNs are black-box models with hidden combinatorial reasoning. Existing interpretability methods only uncover hard logical rules over graph concepts without quantifying concept contributions, and are post-hoc approximations that may not reflect true reasoning.Method: Develop graph concept bottleneck layers integrated into any GNN architecture to predict discriminative global graph concepts. Concept scores are projected to class labels via sparse linear layer, enforcing soft logical rules. Treat concepts as “graph words” and graphs as “graph sentences” using language models to learn graph concept embeddings.
Result: Extensive experiments on multiple datasets show GCBMs achieve state-of-the-art performance in both classification accuracy and interpretability.
Conclusion: GCBMs provide an effective framework for interpretable GNNs that quantifies concept contributions through soft logical rules and leverages language models for concept representation learning.
Abstract: Despite their success in various domains, the growing dependence on GNNs raises a critical concern about the nature of the combinatorial reasoning underlying their predictions, which is often hidden within their black-box architectures. Addressing this challenge requires understanding how GNNs translate topological patterns into logical rules. However, current works only uncover the hard logical rules over graph concepts, which cannot quantify the contribution of each concept to prediction. Moreover, they are post-hoc interpretable methods that generate explanations after model training and may not accurately reflect the true combinatorial reasoning of GNNs, since they approximate it with a surrogate. In this work, we develop a graph concept bottleneck layer that can be integrated into any GNN architectures to guide them to predict the selected discriminative global graph concepts. The predicted concept scores are further projected to class labels by a sparse linear layer. It enforces the combinatorial reasoning of GNNs’ predictions to fit the soft logical rule over graph concepts and thus can quantify the contribution of each concept. To further improve the quality of the concept bottleneck, we treat concepts as “graph words” and graphs as “graph sentences”, and leverage language models to learn graph concept embeddings. Extensive experiments on multiple datasets show that our method GCBMs achieve state-of-the-art performance both in classification and interpretability.
[1336] Latent attention on masked patches for flow reconstruction
Ben Eze, Luca Magri, Andrea Nóvoa
Main category: cs.LG
TL;DR: LAMP is an interpretable vision transformer variant for masked flow reconstruction in fluid dynamics, using patch-wise dimensionality reduction and linear regression training.
Details
Motivation: Vision transformers excel at image generation but have limited adoption in scientific domains like fluid dynamics. There's a need for interpretable models that can reconstruct full flow fields from sparse, noisy measurements.Method: Three-step approach: 1) Partition flow snapshots into patches, 2) Apply patch-wise proper orthogonal decomposition for dimensionality reduction, 3) Use single-layer transformer trained via closed-form linear regression to reconstruct full field from masked input.
Result: LAMP accurately reconstructs full flow fields from 90%-masked and noisy inputs (SNR 10-30 dB). Nonlinear measurement states reduce error by up to 10x. Learned attention provides interpretable sensor-placement maps.
Conclusion: LAMP provides an efficient, interpretable baseline for masked flow reconstruction with modular architecture enabling nonlinear compression and deep attention blocks for scientific applications.
Abstract: Vision transformers have demonstrated outstanding performance on image generation applications, but their adoption in scientific disciplines, like fluid dynamics, has been limited. We introduce the Latent Attention on Masked Patches (LAMP) model, an interpretable regression-based modified vision transformer designed for masked flow reconstruction. LAMP follows a three-fold strategy: (i) partition of each flow snapshot into patches, (ii) dimensionality reduction of each patch via patch-wise proper orthogonal decomposition, and (iii) reconstruction of the full field from a masked input using a single-layer transformer trained via closed-form linear regression. We test the method on two canonical 2D unsteady wakes: a wake past a bluff body, and a chaotic wake past a flat plate. We show that the LAMP accurately reconstructs the full flow field from a 90%-masked and noisy input, across signal-to-noise ratios between 10 and 30,dB. Incorporating nonlinear measurement states can reduce the prediction error by up to an order of magnitude. The learned attention matrix yields physically interpretable multi-fidelity optimal sensor-placement maps. The modularity of the framework enables nonlinear compression and deep attention blocks, thereby providing an efficient baseline for nonlinear and high-dimensional masked flow reconstruction.
[1337] Leave-One-Out Prediction for General Hypothesis Classes
Jian Qian, Jiachen Xu
Main category: cs.LG
TL;DR: MLSA aggregation method provides theoretical guarantees for leave-one-out prediction error with multiplicative oracle inequalities for various learning settings.
Details
Motivation: Leave-one-out prediction provides data-dependent generalization measures, but theoretical guarantees in transductive settings are limited. The paper aims to establish general bounds for LOO error across different learning scenarios.Method: Introduces Median of Level-Set Aggregation (MLSA), an aggregation procedure based on empirical-risk level sets around the empirical risk minimizer. Analyzes local level-set growth condition to control expansion of near-optimal empirical-risk minimizers as tolerance increases.
Result: Establishes multiplicative oracle inequality for LOO error with complexity terms scaling as O(d log n) for VC classes, O(log|H|) for finite hypothesis classes, and O(d log n) for logistic regression with bounded covariates.
Conclusion: MLSA provides general theoretical framework for LOO prediction guarantees across various learning settings, with complexity scaling matching or improving existing bounds.
Abstract: Leave-one-out (LOO) prediction provides a principled, data-dependent measure of generalization, yet guarantees in fully transductive settings remain poorly understood beyond specialized models. We introduce Median of Level-Set Aggregation (MLSA), a general aggregation procedure based on empirical-risk level sets around the ERM. For arbitrary fixed datasets and losses satisfying a mild monotonicity condition, we establish a multiplicative oracle inequality for the LOO error of the form [ LOO_S(\hat{h}) ;\le; C \cdot \frac{1}{n} \min_{h\in H} L_S(h) ;+; \frac{Comp(S,H,\ell)}{n}, \qquad C>1. ] The analysis is based on a local level-set growth condition controlling how the set of near-optimal empirical-risk minimizers expands as the tolerance increases. We verify this condition in several canonical settings. For classification with VC classes under the 0-1 loss, the resulting complexity scales as $O(d \log n)$, where $d$ is the VC dimension. For finite hypothesis and density classes under bounded or log loss, it scales as $O(\log |H|)$ and $O(\log |P|)$, respectively. For logistic regression with bounded covariates and parameters, a volumetric argument based on the empirical covariance matrix yields complexity scaling as $O(d \log n)$ up to problem-dependent factors.
[1338] Expanding LLM Agent Boundaries with Strategy-Guided Exploration
Andrew Szot, Michael Kirchhof, Omar Attia, Alexander Toshev
Main category: cs.LG
TL;DR: SGE uses LLMs to generate high-level natural language strategies for exploration in RL agents, shifting exploration from low-level actions to strategic planning to improve learning efficiency.
Details
Motivation: Exploration remains a major challenge in RL for LLM agents operating in language-action spaces with complex observations and sparse rewards. Current methods struggle with structured exploration in these high-dimensional spaces.Method: Strategy-Guided Exploration (SGE) first generates concise natural-language strategies describing how to make progress toward goals, then generates environment actions conditioned on those strategies. Uses mixed-temperature sampling for diverse strategy exploration and strategy reflection to ground strategy generation on previous outcomes.
Result: SGE consistently outperforms exploration-focused RL baselines across UI interaction, tool-calling, coding, and embodied agent environments, improving both learning efficiency and final performance. Enables agents to learn tasks too difficult for the base model.
Conclusion: Shifting exploration from low-level actions to higher-level language strategies enables more structured and diverse exploration, addressing a core challenge in RL for LLM agents across multiple domains.
Abstract: Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
[1339] Strategic Advice in the Age of Personal AI
Yueyang Liu, Wichinpong Park Sinchaisri
Main category: cs.LG
TL;DR: Personal AI assistants create strategic advisor-client interactions where advisors counteract predictable AI recommendations, with advisor performance worst at intermediate AI adoption levels.
Details
Motivation: The paper examines how personal AI assistants change strategic interactions between advisors and clients, creating a new environment where advisors must anticipate and counteract AI recommendations that clients may consult.Method: The study uses game theory and strategic analysis to model interactions between advisors, clients, and personal AI assistants, examining how AI consultation frequency and influence weight affect advisor behavior and performance.
Result: Advisors become more aggressive in counteracting AI recommendations as consultation frequency increases, but advisor performance is non-monotonic - worst at intermediate adoption levels and best when AI is never or always used. Trust affects performance through a relative influence index.
Conclusion: Personal AI adoption reshapes strategic advisor-client relationships, creating complex dynamics where intermediate adoption levels create the most vulnerability for advisors, and trust investment incentives are fundamentally altered.
Abstract: Personal AI assistants have changed how people use institutional and professional advice. We study this new strategic setting in which individuals may stochastically consult a personal AI whose recommendation is predictable to the focal advisor. Personal AI enters this strategic environment along two dimensions: how often it is consulted and how much weight it receives in the human’s decision when consulted. Anticipating this, the advisor responds by counteracting the personal AI recommendation. Counteraction becomes more aggressive as personal AI is consulted more often. Yet advisor performance is non-monotone: equilibrium loss is highest at intermediate levels of adoption and vanishes when personal AI is never used or always used. Trust affects performance through a single relative influence index, and greater relative influence of personal AI increases advisor vulnerability. Extending the framework to costly credibility building, we characterize how personal AI adoption reshapes incentives to invest in trust.
[1340] Never Saddle for Reparameterized Steepest Descent as Mirror Flow
Tom Jacobs, Chao Zhou, Rebekka Burkholz
Main category: cs.LG
TL;DR: Steepest mirror flows framework shows how optimization geometry affects learning dynamics, explaining why Adam/AdamW outperform SGD in fine-tuning through better saddle-point escape and feature learning.
Details
Motivation: To understand how optimization algorithm choice affects a model's ability to learn features, particularly why Adam and AdamW often outperform SGD in fine-tuning scenarios.Method: Introduces steepest mirror flows as a unifying theoretical framework, analyzes diagonal linear networks and deep diagonal linear reparameterizations (proxy for attention), and provides empirical validation.
Result: Steeper descent facilitates both saddle-point escape and feature learning, while gradient descent requires unrealistically large learning rates. Decoupled weight decay (AdamW) stabilizes feature learning through balance equations.
Conclusion: Optimization geometry governs learning dynamics and implicit bias, with steepest descent offering two mechanisms that aid modern optimization: better saddle-point escape and stabilized feature learning.
Abstract: How does the choice of optimization algorithm shape a model’s ability to learn features? To address this question for steepest descent methods –including sign descent, which is closely related to Adam –we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms how steepest descent can aid modern optimization.
[1341] Accelerating PDE Surrogates via RL-Guided Mesh Optimization
Yang Meng, Ruoxi Jiang, Zhuokai Zhao, Chong Liu, Rebecca Willett, Yuxin Chen
Main category: cs.LG
TL;DR: RLMesh uses reinforcement learning to adaptively allocate mesh points in PDE simulations, reducing computational cost for training deep surrogate models.
Details
Motivation: Training deep surrogate models for parametric PDEs requires thousands of expensive fine-grid simulations, making them computationally prohibitive for practical deployment.Method: Uses reinforcement learning to non-uniformly allocate mesh grid points within simulation domains, focusing resolution in critical regions. Includes a lightweight proxy model for efficient reward estimation during RL training.
Result: Achieves competitive accuracy to baselines with substantially fewer simulation queries, demonstrating dramatic efficiency improvements in surrogate training pipelines.
Conclusion: Solver-level spatial adaptivity via RL enables practical deployment of learning-based PDE surrogates across a wide range of problems by reducing computational requirements.
Abstract: Deep surrogate models for parametric partial differential equations (PDEs) can deliver high-fidelity approximations but remain prohibitively data-hungry: training often requires thousands of fine-grid simulations, each incurring substantial computational cost. To address this challenge, we introduce RLMesh, an end-to-end framework for efficient surrogate training under limited simulation budget. The key idea is to use reinforcement learning (RL) to adaptively allocate mesh grid points non-uniformly within each simulation domain, focusing numerical resolution in regions most critical for accurate PDE solutions. A lightweight proxy model further accelerates RL training by providing efficient reward estimates without full surrogate retraining. Experiments on PDE benchmarks demonstrate that RLMesh achieves competitive accuracy to baselines but with substantially fewer simulation queries. These results show that solver-level spatial adaptivity can dramatically improve the efficiency of surrogate training pipelines, enabling practical deployment of learning-based PDE surrogates across a wide range of problems.
[1342] Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?
Jihwan Kim, Dogyoon Song, Chulhee Yun
Main category: cs.LG
TL;DR: SignSGD scaling laws analysis under power-law random features model reveals unique drift-normalization and noise-reshaping effects compared to SGD, with compute-optimal scaling showing steeper slopes when noise dominates.
Details
Motivation: To understand the scaling properties of signSGD (a sign-based stochastic gradient descent variant) compared to standard SGD, particularly how it behaves under power-law random features models that capture feature and target decay characteristics.Method: Analyze population risk of linear models trained with one-pass signSGD on Gaussian-sketched features under a power-law random features model. Compare against SGD risk from prior work, identify unique effects, derive compute-optimal scaling laws with optimal learning rate selection.
Result: Identified two unique effects in signSGD: drift-normalization and noise-reshaping. Found that noise-reshaping can make compute-optimal slope of signSGD steeper than SGD when noise dominates. Warmup-stable-decay schedule further reduces noise and sharpens compute-optimal slope in specific regimes.
Conclusion: SignSGD exhibits distinct scaling behavior from SGD due to unique noise-handling properties, with implications for compute-optimal training strategies in certain feature-target decay regimes.
Abstract: We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
[1343] Adam Converges Without Any Modification On Update Rules
Yushun Zhang, Bingran Li, Congliang Chen, Zhi-Quan Luo, Ruoyu Sun
Main category: cs.LG
TL;DR: Theoretical analysis of Adam optimizer convergence conditions, identifying problem-dependent hyperparameter combinations that guarantee convergence vs. divergence, with practical tuning suggestions for LLM training.
Details
Motivation: Address concerns about Adam's potential divergence raised by previous work, by analyzing the relationship between hyperparameter choices and convergence guarantees in practical settings.Method: Mathematical analysis of Adam optimizer convergence properties, identifying phase transitions in the (β₁, β₂) parameter space and deriving problem-dependent critical boundaries.
Result: Proves Adam converges when β₂ is large and β₁ < √β₂; identifies divergence regions when β₂ is small; discovers phase transition phenomenon in 2D parameter space; shows critical boundary depends on problem characteristics including batch size.
Conclusion: Provides rigorous theoretical guarantees for Adam convergence with proper hyperparameter tuning, offering practical guidance: tune β₂ inversely with batch size to surpass critical threshold, then set β₁ < √β₂, which aligns with empirical improvements in LLM training.
Abstract: Adam is the default algorithm for training neural networks, including large language models (LLMs). However, \citet{reddi2019convergence} provided an example that Adam diverges, raising concerns for its deployment in AI model training. We identify a key mismatch between the divergence example and practice: \citet{reddi2019convergence} pick the problem after picking the hyperparameters of Adam, i.e., $(β_1,β_2)$; while practical applications often fix the problem first and then tune $(β_1,β_2)$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $β_2$ is large and $β_1 < \sqrt{β_2}$. Second, when $β_2$ is small, we point out a region of $(β_1,β_2)$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence when changing the $(β_1, β_2)$ combination. To our knowledge, this is the first phase transition in $(β_1,β_2)$ 2D-plane reported in the literature, providing rigorous theoretical guarantees for Adam optimizer. We further point out that the critical boundary $(β_1^, β_2^)$ is problem-dependent, and particularly, dependent on batch size. This provides suggestions on how to tune $β_1$ and $β_2$: when Adam does not work well, we suggest tuning up $β_2$ inversely with batch size to surpass the threshold $β_2^*$, and then trying $β_1< \sqrt{β_2}$. Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.
[1344] On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective
Guy Smorodinsky, Sveta Gimpleson, Itay Safran
Main category: cs.LG
TL;DR: GD in a 2-neuron ReLU network with 2 training instances converges to optimal robustness margin at extremely slow Θ(1/ln(t)) rate, establishing first explicit lower bound for non-linear models.
Details
Motivation: To understand the convergence dynamics of gradient descent in non-linear neural networks, specifically examining how slowly the robustness margin (distance between decision boundary and training points) converges even in the simplest possible setting.Method: Theoretical analysis of GD trajectories in a minimal binary classification setting: two-neuron ReLU network with two training instances. Rigorous analysis of GD trajectories across different activation patterns, developing tight control over system dynamics to bound decision boundary trajectory.
Result: Proves GD converges to optimal robustness margin but at prohibitively slow Θ(1/ln(t)) rate. This establishes first explicit lower bound on convergence rate of robustness margin in non-linear models. Empirical simulations show this failure mode is pervasive across multiple natural network initializations.
Conclusion: Even in simplest neural network settings, gradient descent exhibits extremely slow convergence for robustness margin optimization, suggesting fundamental limitations in training dynamics that may affect more complex models.
Abstract: We study the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting, consisting of a two-neuron ReLU network and two training instances. We prove that even under these strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, effectively maximizing the distance between the decision boundary and the training points, this convergence occurs at a prohibitively slow rate, scaling strictly as $Θ(1/\ln(t))$. To the best of our knowledge, this establishes the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Through empirical simulations, we further demonstrate that this inherent failure mode is pervasive, exhibiting the exact same tight convergence rate across multiple natural network initializations. Our theoretical guarantees are derived via a rigorous analysis of the GD trajectories across the distinct activation patterns of the model. Specifically, we develop tight control over the system’s dynamics to bound the trajectory of the decision boundary, overcoming the primary technical challenge introduced by the non-linear nature of the architecture.
[1345] Stochastic Multi-Armed Bandits with Limited Control Variates
Arun Verma, Manjesh Kumar Hanawal, Arun Rajkumar
Main category: cs.LG
TL;DR: UCB-LCV algorithm for multi-armed bandits with limited control variates, improving regret bounds when auxiliary information is intermittently available.
Details
Motivation: Addresses limitations in existing bandit algorithms that assume continuous availability of control variates (auxiliary information), which is unrealistic in real-world scenarios like wireless networks where interference or channel state estimates provide only partial insight.Method: Proposes UCB-LCV (Upper Confidence Bound with Limited Control Variates), which effectively combines estimators from rewards and control variates. Also introduces UCB-NORMAL for standard MAB with normally distributed rewards when no control variates are available.
Result: UCB-LCV outperforms existing bandit algorithms, and UCB-NORMAL shows better performance than existing algorithms for standard MAB with normally distributed rewards.
Conclusion: The proposed algorithms effectively handle limited availability of auxiliary information in bandit problems, with practical applications in wireless networks and other real-world scenarios.
Abstract: Motivated by wireless networks where interference or channel state estimates provide partial insight into throughput, we study a variant of the classical stochastic multi-armed bandit problem in which the learner has limited access to auxiliary information. Recent work has shown that such auxiliary information, when available as control variates, can be used to get tighter confidence bounds, leading to lower regret. However, existing works assume that control variates are available in every round, which may not be realistic in several real-life scenarios. To address this, we propose UCB-LCV, an upper confidence bound (UCB) based algorithm that effectively combines the estimators obtained from rewards and control variates. When there is no control variate, UCB-LCV leads to a novel algorithm that we call UCB-NORMAL, outperforming its existing algorithms for the standard MAB setting with normally distributed rewards. Finally, we discuss variants of the proposed UCB-LCV that apply to general distributions and experimentally demonstrate that UCB-LCV outperforms existing bandit algorithms.
[1346] Machine Learning (ML) library in Linux kernel
Viacheslav Dubeyko
Main category: cs.LG
TL;DR: Proposes ML infrastructure architecture for Linux kernel to enable machine learning models in kernel space despite FPU limitations and performance concerns.
Details
Motivation: Linux kernel's enormous complexity makes efficient configuration challenging. ML could enable self-evolving capabilities, but introducing ML in kernel space is difficult due to no direct FPU support and potential performance degradation.Method: Suggests ML infrastructure architecture with kernel ML library, implemented as Proof of Concept. Includes interface design for interaction between kernel-space ML model proxy and user-space ML model thread.
Result: Implemented PoC demonstrates feasibility of the approach and designs the interaction interface between kernel and user-space ML components.
Conclusion: The proposed ML infrastructure architecture can solve the challenges of introducing ML in Linux kernel space, enabling self-evolving capabilities while addressing FPU and performance concerns.
Abstract: Linux kernel is a huge code base with enormous number of subsystems and possible configuration options that results in unmanageable complexity of elaborating an efficient configuration. Machine Learning (ML) is approach/area of learning from data, finding patterns, and making predictions without implementing algorithms by developers that can introduce a self-evolving capability in Linux kernel. However, introduction of ML approaches in Linux kernel is not easy way because there is no direct use of floating-point operations (FPU) in kernel space and, potentially, ML models can be a reason of significant performance degradation in Linux kernel. Paper suggests the ML infrastructure architecture in Linux kernel that can solve the declared problem and introduce of employing ML models in kernel space. Suggested approach of kernel ML library has been implemented as Proof Of Concept (PoC) project with the goal to demonstrate feasibility of the suggestion and to design the interface of interaction the kernel-space ML model proxy and the ML model user-space thread.
[1347] Near-Optimal Regret for KL-Regularized Multi-Armed Bandits
Kaixuan Ji, Qingyue Zhao, Heyang Zhao, Qiwei Di, Quanquan Gu
Main category: cs.LG
TL;DR: Sharp analysis of KL-regularized multi-armed bandits showing first linear-in-K regret bounds with novel peeling argument and matching lower bounds.
Details
Motivation: Statistical efficiency of online learning with KL-regularized objectives remains poorly characterized, even for basic multi-armed bandit problems. Need to understand how regularization affects regret bounds across different regimes.Method: Sharp analysis of KL-UCB algorithm using novel peeling argument technique. Constructs hard instances and tailored decomposition of Bayes prior for lower bounds. Analyzes both high and low regularization regimes.
Result: First high-probability regret bound with linear dependence on K: Õ(ηK log²T). First non-constant lower bound: Ω(ηK log T). In low-regularization regime, regret is η-independent and scales as Õ(√KT).
Conclusion: Provides thorough understanding of KL-regularized MABs across all regularization regimes with nearly optimal bounds in terms of K, η, and T.
Abstract: Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(ηK\log^2T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $η^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $Ω(ηK \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $η$), we show that the KL-regularized regret for MABs is $η$-independent and scales as $\tildeΘ(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $η$ and yield nearly optimal bounds in terms of $K$, $η$, and $T$.
[1348] SageBwd: A Trainable Low-bit Attention
Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
Main category: cs.LG
TL;DR: SageBwd INT8 attention matches full-precision attention performance in pre-training through reduced tokens per step and QK-norm stabilization.
Details
Motivation: Low-bit attention like SageAttention accelerates inference but its applicability to training is poorly understood. SageBwd previously showed performance gaps compared to full-precision attention during pre-training, motivating investigation into why this occurs and how to close the gap.Method: Investigates SageBwd’s pre-training performance through experiments and theoretical analysis, identifying key factors: QK-norm for stability at large tokens per step, quantization errors from backward-pass score gradient dS, reducing tokens per step to match FPA performance, and analyzing K-smoothing vs Q-smoothing benefits.
Result: Demonstrates that SageBwd can match full-precision attention performance during pre-training by reducing tokens per step and using QK-norm for training stability. Identifies backward-pass score gradient dS as primary source of quantization errors.
Conclusion: SageBwd INT8 attention is viable for both training and inference, closing the pre-training performance gap with full-precision attention through proper configuration (reduced tokens per step, QK-norm, K-smoothing). Provides insights for efficient low-bit attention training.
Abstract: Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd matches full-precision attention during pretraining. Through experiments and theoretical analysis, we reach a few important insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
[1349] De-paradox Tree: Breaking Down Simpson’s Paradox via A Kernel-Based Partition Algorithm
Xian Teng, Yu-Ru Lin
Main category: cs.LG
TL;DR: De-paradox Tree is an interpretable algorithm that detects and explains Simpson’s paradox in observational data by uncovering hidden subgroup patterns through recursive partitioning with causal adjustment.
Details
Motivation: Observational datasets often contain misleading associations due to confounding and subgroup heterogeneity, exemplified by Simpson's paradox where aggregated and subgroup-level associations contradict. Existing methods provide limited support for detecting and interpreting such paradoxical associations, especially for practitioners without deep causal expertise.Method: The algorithm employs novel split criteria and balancing-based procedures to adjust for confounders and homogenize heterogeneous effects through recursive partitioning. It builds interpretable trees that select relevant covariates and identify nested opposite effects while ensuring robust estimation of causal effects when causally admissible variables are provided.
Result: Compared to state-of-the-art methods, De-paradox Tree builds simpler, more interpretable trees, selects relevant covariates, and identifies nested opposite effects while ensuring robust estimation of causal effects when causally admissible variables are provided.
Conclusion: The approach addresses limitations of traditional causal inference and machine learning methods by introducing an interpretable framework that supports non-expert practitioners while explicitly acknowledging causal assumptions and scope limitations, enabling more reliable decision-making in complex observational data environments.
Abstract: Real-world observational datasets and machine learning have revolutionized data-driven decision-making, yet many models rely on empirical associations that may be misleading due to confounding and subgroup heterogeneity. Simpson’s paradox exemplifies this challenge, where aggregated and subgroup-level associations contradict each other, leading to misleading conclusions. Existing methods provide limited support for detecting and interpreting such paradoxical associations, especially for practitioners without deep causal expertise. We introduce De-paradox Tree, an interpretable algorithm designed to uncover hidden subgroup patterns behind paradoxical associations under assumed causal structures involving confounders and effect heterogeneity. It employs novel split criteria and balancing-based procedures to adjust for confounders and homogenize heterogeneous effects through recursive partitioning. Compared to state-of-the-art methods, De-paradox Tree builds simpler, more interpretable trees, selects relevant covariates, and identifies nested opposite effects while ensuring robust estimation of causal effects when causally admissible variables are provided. Our approach addresses the limitations of traditional causal inference and machine learning methods by introducing an interpretable framework that supports non-expert practitioners while explicitly acknowledging causal assumptions and scope limitations, enabling more reliable and informed decision-making in complex observational data environments.
[1350] Reservoir Subspace Injection for Online ICA under Top-n Whitening
Wenjun Xiao, Yuda Bi, Vince D Calhoun
Main category: cs.LG
TL;DR: Reservoir expansion for online ICA faces bottleneck where injected features displace passthrough directions; RSI diagnostics identify failure mode and guarded controller recovers performance.
Details
Motivation: To improve online independent component analysis (ICA) under nonlinear mixing through reservoir expansion, while addressing the problem where injected features may displace important passthrough directions in the retained eigenspace.Method: Formalize reservoir subspace injection (RSI) bottleneck, develop RSI diagnostics (IER, SSO, ρ_x) to identify failure modes, and create a guarded RSI controller to preserve passthrough retention while allowing beneficial feature injection.
Result: Without RSI control, stronger injection degrades SI-SDR by up to 2.2dB; with guarded controller, mean performance recovers to within 0.1dB of baseline. RE-OICA improves over vanilla online ICA by +1.7dB under nonlinear mixing and achieves positive SI-SDRsc (+0.6dB) on super-Gaussian benchmark.
Conclusion: Preserving passthrough directions is crucial for reservoir expansion in online ICA; RSI diagnostics and guarded controller effectively manage the trade-off between feature injection and passthrough retention, enabling improved performance under nonlinear mixing.
Abstract: Reservoir expansion can improve online independent component analysis (ICA) under nonlinear mixing, yet top-$n$ whitening may discard injected features. We formalize this bottleneck as \emph{reservoir subspace injection} (RSI): injected features help only if they enter the retained eigenspace without displacing passthrough directions. RSI diagnostics (IER, SSO, $ρ_x$) identify a failure mode in our top-$n$ setting: stronger injection increases IER but crowds out passthrough energy ($ρ_x: 1.00!\rightarrow!0.77$), degrading SI-SDR by up to $2.2$,dB. A guarded RSI controller preserves passthrough retention and recovers mean performance to within $0.1$,dB of baseline $1/N$ scaling. With passthrough preserved, RE-OICA improves over vanilla online ICA by $+1.7$,dB under nonlinear mixing and achieves positive SI-SDR$_{\mathrm{sc}}$ on the tested super-Gaussian benchmark ($+0.6$,dB).
[1351] MAC: A Conversion Rate Prediction Benchmark Featuring Labels Under Multiple Attribution Mechanisms
Jinqi Wu, Sishuo Chen, Zhangming Chan, Yong Bai, Lei Zhang, Sheng Chen, Chenghuan Hou, Xiang-Rong Sheng, Han Zhu, Jian Xu, Bo Zheng, Chaoyou Fu
Main category: cs.LG
TL;DR: MAC benchmark provides first public CVR dataset with multiple attribution labels, enabling multi-attribution learning research; MoAE method outperforms SOTA by effectively learning and utilizing multi-attribution knowledge.
Details
Motivation: Multi-attribution learning (MAL) for CVR prediction is limited by lack of public datasets with labels from multiple attribution mechanisms, hindering development and evaluation of MAL approaches.Method: Created MAC benchmark dataset with multi-attribution labels; developed PyMAL library with baseline methods; proposed MoAE approach with multi-attribution knowledge learning and main task-centric knowledge utilization.
Result: MAL provides consistent performance gains, especially for long conversion paths; MoAE substantially surpasses existing SOTA MAL methods; three key insights identified about objective complexity and architectural design.
Conclusion: MAC benchmark enables reproducible MAL research; careful selection of auxiliary objectives and proper architectural design are crucial; MoAE effectively incorporates multi-attribution knowledge for improved CVR prediction.
Abstract: Multi-attribution learning (MAL), which enhances model performance by learning from conversion labels yielded by multiple attribution mechanisms, has emerged as a promising learning paradigm for conversion rate (CVR) prediction. However, the conversion labels in public CVR datasets are generated by a single attribution mechanism, hindering the development of MAL approaches. To address this data gap, we establish the Multi-Attribution Benchmark (MAC), the first public CVR dataset featuring labels from multiple attribution mechanisms. Besides, to promote reproducible research on MAL, we develop PyMAL, an open-source library covering a wide array of baseline methods. We conduct comprehensive experimental analyses on MAC and reveal three key insights: (1) MAL brings consistent performance gains across different attribution settings, especially for users featuring long conversion paths. (2) The performance growth scales up with objective complexity in most settings; however, when predicting first-click conversion targets, simply adding auxiliary objectives is counterproductive, underscoring the necessity of careful selection of auxiliary objectives. (3) Two architectural design principles are paramount: first, to fully learn the multi-attribution knowledge, and second, to fully leverage this knowledge to serve the main task. Motivated by these findings, we propose Mixture of Asymmetric Experts (MoAE), an effective MAL approach incorporating multi-attribution knowledge learning and main task-centric knowledge utilization. Experiments on MAC show that MoAE substantially surpasses the existing state-of-the-art MAL method. We believe that our benchmark and insights will foster future research in the MAL field. Our MAC benchmark and the PyMAL algorithm library are publicly available at https://github.com/alimama-tech/PyMAL.
[1352] Multi-Head Low-Rank Attention
Songtao Liu, Hongwu Peng, Zhiwei Zhang, Zhengyu Chen, Yue Guo
Main category: cs.LG
TL;DR: MLRA enables partitionable latent states for efficient 4-way tensor parallelism decoding, addressing KV cache loading bottlenecks in long-context LLM inference.
Details
Motivation: Long-context inference in LLMs suffers from KV cache loading bottlenecks during decoding, where KV cache must be repeatedly transferred from HBM to SRAM. While MLA reduces KV cache size, it has a sharding bottleneck in distributed decoding via tensor parallelism due to its single latent head that cannot be partitioned.Method: Proposes Multi-Head Low-Rank Attention (MLRA) which enables partitionable latent states for efficient 4-way tensor parallelism decoding, allowing distributed loading of KV cache across devices.
Result: MLRA achieves state-of-the-art perplexity and downstream task performance while delivering 2.8× decoding speedup over MLA through efficient distributed decoding.
Conclusion: MLRA effectively addresses the sharding bottleneck in distributed long-context inference, enabling efficient tensor parallelism while maintaining strong performance.
Abstract: Long-context inference in large language models is bottlenecked by Key–Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at each step. While Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP). Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, consuming excessive memory traffic and diminishing TP benefits like weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8$\times$ decoding speedup over MLA. Code is available at https://github.com/SongtaoLiu0823/MLRA. Pretrained weights, along with the training and evaluation data, are available at https://huggingface.co/Soughing/MLRA.
[1353] Symbol-Equivariant Recurrent Reasoning Models
Richard Freinschlag, Timo Bertram, Erich Kobler, Andreas Mayr, Günter Klambauer
Main category: cs.LG
TL;DR: SE-RRMs introduce symbol-equivariant layers to enforce permutation equivariance in reasoning models, improving performance on Sudoku and ARC-AGI tasks with better generalization and less data augmentation.
Details
Motivation: Existing Recurrent Reasoning Models (RRMs) handle symbol symmetries only implicitly through costly data augmentation, which is inefficient and doesn't guarantee identical solutions under symbol permutations. The authors aim to create more robust and scalable neural reasoning models by explicitly encoding symmetry at the architectural level.Method: Introduce Symbol-Equivariant Recurrent Reasoning Models (SE-RRMs) that enforce permutation equivariance through symbol-equivariant layers. These layers guarantee identical solutions under symbol or color permutations, eliminating the need for extensive data augmentation. The approach builds on the RRM family (HRM and TRM) but adds explicit symmetry encoding.
Result: SE-RRMs outperform prior RRMs on 9x9 Sudoku and generalize from training on 9x9 to smaller 4x4 and larger 16x16 and 25x25 instances, where existing RRMs cannot extrapolate. On ARC-AGI-1 and ARC-AGI-2, they achieve competitive performance with substantially less data augmentation and only 2 million parameters.
Conclusion: Explicitly encoding symmetry at the architectural level improves the robustness and scalability of neural reasoning models. SE-RRMs demonstrate that compact models with proper architectural constraints can achieve strong performance on structured reasoning tasks with minimal data augmentation.
Abstract: Reasoning problems such as Sudoku and ARC-AGI remain challenging for neural networks. The structured problem solving architecture family of Recurrent Reasoning Models (RRMs), including Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM), offer a compact alternative to large language models, but currently handle symbol symmetries only implicitly via costly data augmentation. We introduce Symbol-Equivariant Recurrent Reasoning Models (SE-RRMs), which enforce permutation equivariance at the architectural level through symbol-equivariant layers, guaranteeing identical solutions under symbol or color permutations. SE-RRMs outperform prior RRMs on 9x9 Sudoku and generalize from just training on 9x9 to smaller 4x4 and larger 16x16 and 25x25 instances, to which existing RRMs cannot extrapolate. On ARC-AGI-1 and ARC-AGI-2, SE-RRMs achieve competitive performance with substantially less data augmentation and only 2 million parameters, demonstrating that explicitly encoding symmetry improves the robustness and scalability of neural reasoning. Code is available at https://github.com/ml-jku/SE-RRM.
[1354] Frontier Models Can Take Actions at Low Probabilities
Alex Serrano, Wen Xing, David Lindner, Erik Jenner
Main category: cs.LG
TL;DR: Frontier LLMs can execute actions at extremely low calibrated probabilities (as low as 1 in 100,000) when given external entropy sources, posing risks for evasion of pre-deployment evaluations.
Details
Motivation: To investigate whether frontier language models can execute malicious actions at very low calibrated probabilities to evade detection during pre-deployment evaluations, where only limited samples are inspected.Method: Prompted GPT-5, Claude-4.5 and Qwen-3 families to take target actions at low probabilities (e.g., 0.01%), either given directly or requiring derivation, and evaluated their calibration by measuring if they perform actions at the specified rates when resampling.
Result: Frontier models show surprising capability: with in-context entropy sources (like UUIDs), they maintain high calibration at rates lower than 1 in 100,000; without entropy, some reach rates lower than 1 in 10,000. Larger models achieve better calibration at lower rates when rates are given, but all fail to self-derive optimal rates without entropy or hints. Successful low-rate strategies require explicit Chain-of-Thought reasoning.
Conclusion: Current models can be caught by CoT monitors when attempting low-rate evasion, but scaling trends suggest future evaluations cannot rely on models’ inability to calibrate target rates, especially if CoT reasoning becomes less legible.
Abstract: Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to “defect”: misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models’ lack of target rate calibration, especially if CoT is no longer legible.
[1355] Partial Causal Structure Learning for Valid Selective Conformal Inference under Interventions
Amir Asiaee, Kavey Aryan, James P. Long
Main category: cs.LG
TL;DR: Selective conformal prediction with learned causal structure for uncertainty quantification in perturbation experiments, maintaining coverage despite unknown invariance structure.
Details
Motivation: In perturbation experiments (e.g., genomics), exchangeability for conformal prediction often holds only within subsets where interventions leave target variables unaffected. However, this causal structure is typically unknown and must be learned from data, which can introduce contamination when misclassifying unaffected examples.Method: Three main contributions: (1) Contamination-robust conformal coverage theorem quantifying coverage degradation via function g(δ,n); (2) Task-driven partial causal learning focusing on binary descendant indicators rather than full causal graph; (3) Algorithms for descendant discovery via perturbation intersection patterns and approximate distance-to-intervention estimation via local invariant causal prediction.
Result: Theoretical bounds show controlled contamination maintains coverage: under δ=0.30 contamination, corrected procedure maintains ≥0.95 coverage while uncorrected selective CP degrades to 0.867. Synthetic linear SEM experiments validate the bound, and proof-of-concept on Replogle K562 CRISPRi perturbation data demonstrates real-world applicability.
Conclusion: The framework enables robust selective conformal prediction in perturbation experiments with unknown causal structure, providing theoretical guarantees and practical algorithms that maintain coverage despite learned structure imperfections.
Abstract: Selective conformal prediction can yield substantially tighter uncertainty sets when we can identify calibration examples that are exchangeable with the test example. In interventional settings, such as perturbation experiments in genomics, exchangeability often holds only within subsets of interventions that leave a target variable “unaffected” (e.g., non-descendants of an intervened node in a causal graph). We study the practical regime where this invariance structure is unknown and must be learned from data. Our contributions are: (i) a contamination-robust conformal coverage theorem that quantifies how misclassification of “unaffected” calibration examples degrades coverage via an explicit function $g(δ,n)$ of the contamination fraction and calibration set size, providing a finite-sample lower bound that holds for arbitrary contaminating distributions; (ii) a task-driven partial causal learning formulation that estimates only the binary descendant indicators $Z_{a,i}=\mathbf{1}{i\in\mathrm{desc}(a)}$ needed for selective calibration, rather than the full causal graph; and (iii) algorithms for descendant discovery via perturbation intersection patterns (differentially affected variable set intersections across interventions), and for approximate distance-to-intervention estimation via local invariant causal prediction. We provide recovery conditions under which contamination is controlled. Experiments on synthetic linear structural equation models (SEMs) validate the bound: under controlled contamination up to $δ=0.30$, the corrected procedure maintains $\ge 0.95$ coverage while uncorrected selective CP degrades to $0.867$. A proof-of-concept on Replogle K562 CRISPR interference (CRISPRi) perturbation data demonstrates applicability to real genomic screens.
[1356] Impossibility of Depth Reduction in Explainable Clustering
Chengyuan Deng, Surya Teja Gavva, Karthik C. S., Parth Patel, Adarsh Srinivasan
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to analyze paper content due to technical limitations
Abstract: Failed to fetch summary for 2305.02850: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2305.02850&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1357] FedHB: Hierarchical Bayesian Federated Learning
Minyoung Kim, Timothy Hospedales
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2305.04979: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2305.04979&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1358] NUBO: A Transparent Python Package for Bayesian Optimization
Mike Diessner, Kevin J. Wilson, Richard D. Whalley
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2305.06709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2305.06709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1359] Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes
Haokai Hong, Wanyu Lin, Ming Yang, Kay Chen Tan
Main category: cs.LG
TL;DR: Unable to analyze paper 2404.00962 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2404.00962: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.00962&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1360] Aggressive or Imperceptible, or Both: Network Pruning Assisted Hybrid Byzantines in Federated Learning
Emre Ozfatura, Kerem Ozfatura, Baturalp Buyukates, Mert Coskuner, Alptekin Kupcu, Deniz Gunduz
Main category: cs.LG
TL;DR: Unable to analyze paper 2404.06230 due to HTTP 429 error when fetching the abstract from arXiv API
Details
Motivation: Cannot determine motivation as the abstract could not be retrieved due to rate limiting issues with arXiv APIMethod: No method information available due to failed API request
Result: No results available due to failed data retrieval
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper information
Abstract: Failed to fetch summary for 2404.06230: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.06230&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1361] Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates
Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Wengong Jin, Yang Yang, Lei Li
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2405.08205: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.08205&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1362] The GeometricKernels Package: Heat and Matérn Kernels for Geometric Learning on Manifolds, Meshes, and Graphs
Peter Mostowsky, Vincent Dutordoir, Iskander Azangulov, Noémie Jaquier, Michael John Hutchinson, Aditya Ravuri, Leonel Rozo, Alexander Terenin, Viacheslav Borovitskiy
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2407.08086: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.08086&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1363] Leray-Schauder Mappings for Operator Learning
Emanuele Zappala
Main category: cs.LG
TL;DR: Unable to analyze paper 2410.01746 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2410.01746: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.01746&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1364] One protein is all you need
Anton Bushuiev, Roman Bushuiev, Olga Pimenova, Nikola Zadorozhny, Raman Samusevich, Elisabet Manaskova, Rachel Seongeun Kim, Hannes Stärk, Jiri Sedlar, Martin Steinegger, Tomáš Pluskal, Josef Sivic
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2411.02109: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.02109&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1365] Efficient Aircraft Design Optimization Using Multi-Fidelity Models and Multi-fidelity Physics Informed Neural Networks
Apurba Sarker
Main category: cs.LG
TL;DR: Unable to analyze paper 2412.18564 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failed due to rate limitingMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusions without access to the paper’s abstract
Abstract: Failed to fetch summary for 2412.18564: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.18564&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1366] On Demographic Group Fairness Guarantees in Deep Learning
Yan Luo, Congcong Wen, Min Shi, Hao Huang, Yi Fang, Mengyu Wang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when attempting to access arXiv API for paper ID 2412.20377
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limitingMethod: Unable to determine method as the paper content could not be retrieved
Result: Unable to determine results as the paper content could not be retrieved
Conclusion: Unable to draw conclusions about the paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2412.20377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.20377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1367] UFGraphFR: Graph Federation Recommendation System based on User Text description features
Xudong Wang, Qingbo Hao, Yingyuan Xiao
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2501.08044 suggests it’s from arXiv, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2501.08044: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.08044&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1368] Multiscale Training of Convolutional Neural Networks
Shadab Ahamed, Niloufar Zakariaei, Eldad Haber, Moshe Eliasof
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2501.12739: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.12739&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1369] Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making
Prince Zizhuang Wang, Shuyi Chen, Jinhao Liang, Ferdinando Fioretto, Shixiang Zhu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2502.05468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.05468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1370] Identity-Free Deferral For Unseen Experts
Joshua Strong, Pramit Saha, Yasin Ibrahim, Cheng Ouyang, Alison Noble
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2502.10533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.10533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1371] MoMa: A Modular Deep Learning Framework for Material Property Prediction
Botian Wang, Yawen Ouyang, Yaohui Li, Mianzhi Pan, Yuanhang Tang, Yiqun Wang, Haorui Cui, Jianbing Zhang, Xiaonan Wang, Wei-Ying Ma, Hao Zhou
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2502.15483: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.15483&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1372] Generalizable deep learning for photoplethysmography-based blood pressure estimation – A Benchmarking Study
Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2502.19167: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.19167&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1373] Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches
Mohammad Moulaeifard, Loic Coquelin, Mantas Rinkevičius, Andrius Sološenko, Oskar Pfeffer, Ciaran Bench, Nando Hegemann, Sara Vardanega, Manasi Nandi, Jordi Alastruey, Christian Heiss, Vaidotas Marozas, Andrew Thompson, Philip J. Aston, Peter H. Charlton, Nils Strodthoff
Main category: cs.LG
TL;DR: Unable to analyze paper 2502.19949 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions about paper relevance without access to the content
Abstract: Failed to fetch summary for 2502.19949: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.19949&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1374] Does Generation Require Memorization? Creative Diffusion Models using Ambient Diffusion
Kulin Shah, Alkis Kalavasis, Adam R. Klivans, Giannis Daras
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2502.21278: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.21278&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1375] Geometric Reasoning in the Embedding Space
Jan Hůla, David Mojžíšek, Jiří Janeček, David Herel, Mikoláš Janota
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2504.02018 suggests it’s from April 2024, but content is unavailable.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2504.02018: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.02018&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1376] Identifying and Evaluating Inactive Heads in Pretrained LLMs
Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2504.03889: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.03889&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1377] Standardization of Multi-Objective QUBOs
Loong Kuan Lee, Thore Gerlach, Nico Piatkowski
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2504.12419: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.12419&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1378] A Diagnostic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm
Kazuhisa Fujita
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusion due to lack of paper content
Abstract: Failed to fetch summary for 2504.14814: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14814&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1379] MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Jianbin Chang, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2504.14960: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14960&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1380] PAPN: Proximity Attention Encoder and Pointer Network Decoder for Parcel Pickup Route Prediction
Hansi Denis, Ali Anwar, Ngoc-Quang Luong, Siegfried Mercelis
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2505.03776: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.03776&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1381] Addition is almost all you need: Compressing large language models with double binary factorization
Vladimír Boža, Vladimír Macko
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2505.11076: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11076&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1382] Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models
Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2505.16122: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16122&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1383] ICYM2I: The illusion of multimodal informativeness under missingness
Young Sang Choi, Vincent Jeanselme, Pierre Elias, Shalmali Joshi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation: Cannot determine motivation due to failed paper fetch.Method: Cannot determine method due to failed paper fetch.
Result: Cannot determine results due to failed paper fetch.
Conclusion: Cannot draw conclusions due to failed paper fetch.
Abstract: Failed to fetch summary for 2505.16953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1384] Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding
Manuel Lecha, Andrea Cavallo, Francesca Dominici, Ran Levi, Alessio Del Bue, Elvin Isufi, Pietro Morerio, Claudio Battiloro
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2505.17939: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17939&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1385] SuperMAN: Interpretable and Expressive Networks over Temporally Sparse Heterogeneous Data
Maya Bechler-Speicher, Andrea Zerio, Maor Huri, Marie Vibeke Vestergaard, Ran Gilad-Bachrach, Tine Jess, Samir Bhatt, Aleksejs Sazonovs
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2505.19193 appears to be a recent arXiv submission from May 2025.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2505.19193: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19193&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1386] PLANETALIGN: A Comprehensive Python Library for Benchmarking Network Alignment
Qi Yu, Zhichen Zeng, Yuchen Yan, Zhining Liu, Baoyu Jing, Ruizhong Qiu, Ariful Azad, Hanghang Tong
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2505.21366: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.21366&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1387] Hyperbolic Aware Minimization: Implicit Bias for Sparsity
Tom Jacobs, Advait Gadhikar, Celia Rubio-Madrigal, Rebekka Burkholz
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper retrievalMethod: Cannot determine method due to failed paper retrieval
Result: Cannot determine results due to failed paper retrieval
Conclusion: Cannot determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2506.02630: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02630&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1388] QKV Projections Require a Fraction of Their Memory
Malik Khalaf, Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster
Main category: cs.LG
TL;DR: Unable to analyze paper 2506.02939 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2506.02939: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02939&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1389] Log-Linear Attention
Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim
Main category: cs.LG
TL;DR: Paper ID 2506.04761 could not be analyzed due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Unable to determine motivation due to failed summary fetchMethod: Unable to determine method due to failed summary fetch
Result: Unable to determine results due to failed summary fetch
Conclusion: Unable to draw conclusions due to failed summary fetch
Abstract: Failed to fetch summary for 2506.04761: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.04761&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1390] Rapid training of Hamiltonian graph networks using random features
Atamert Rahma, Chinmay Datar, Ana Cukarska, Felix Dietrich
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.06558: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06558&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1391] ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning
Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.07459: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.07459&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1392] Branched Schrödinger Bridge Matching
Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to draw conclusions due to technical error fetching paper content
Abstract: Failed to fetch summary for 2506.09007: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09007&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1393] Monotone Classification with Relative Approximations
Yufei Tao
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.10775: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.10775&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1394] Unlearning Isn’t Invisible: Detecting Unlearning Traces in LLMs from Model Outputs
Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try alternative access methods or wait before retrying.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2506.14003: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14003&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1395] SecP-Tuning: Efficient Privacy-Preserving Prompt Tuning for Large Language Models via MPC
Jinglong Luo, Zhuo Zhang, Yehong Zhang, Shiyu Liu, Ye Dong, Hui Wang, Yue Yu, Xun Zhou, Zenglin Xu
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2506.15307: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.15307&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1396] Hidden Breakthroughs in Language Model Training
Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.15872: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.15872&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1397] RocketStack: Level-aware Deep Recursive Ensemble Learning Architecture
Çağatay Demirel
Main category: cs.LG
TL;DR: Unable to analyze paper 2506.16965 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failed due to rate limiting (HTTP 429)Method: Method unknown - paper content not accessible
Result: No results available - failed to fetch paper information
Conclusion: Cannot draw conclusions about paper relevance without access to abstract
Abstract: Failed to fetch summary for 2506.16965: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.16965&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1398] Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba
Donghyun Lee, Yuhang Li, Ruokai Yin, Shiting Xiao, Priyadarshini Panda
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.18184: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18184&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1399] Dynamic Weight Grafting: Localizing Finetuned Factual Knowledge in Transformers
Todd Nief, David Reber, Sean Richardson, Ari Holtzman
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2506.20746: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.20746&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1400] TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence
Feng Jiang, Mangal Prakash, Hehuan Ma, Jianyuan Deng, Yuzhi Guo, Amina Mollaysa, Tommaso Mansi, Rui Liao, Junzhou Huang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2506.21028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.21028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1401] SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Qian Chen, Xianhao Chen, Kaibin Huang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2507.06567: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.06567&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1402] The Geometry of LLM Quantization: GPTQ as Babai’s Nearest Plane Algorithm
Jiale Chen, Yalda Shabanzadeh, Elvir Crnčević, Torsten Hoefler, Dan Alistarh
Main category: cs.LG
TL;DR: Paper ID 2507.18553 summary could not be fetched due to HTTP 429 rate limiting error from arXiv API
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2507.18553: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.18553&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1403] DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging
Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2508.01148: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.01148&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1404] Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning
Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze this specific paper
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2508.03926: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.03926&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1405] Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2508.04097: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04097&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1406] Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed data retrievalMethod: Unable to determine method due to failed data retrieval
Result: Unable to determine results due to failed data retrieval
Conclusion: Unable to draw conclusions due to failed data retrieval
Abstract: Failed to fetch summary for 2508.04865: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04865&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1407] Data Selection for LLM Alignment Using Fine-Grained Preferences
Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2508.07638: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.07638&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1408] Semantic-Enhanced Time-Series Forecasting via Large Language Models
Hao Liu, Xiaoxing Zhang, Chun Yang, Xiaobin Zhu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2508.07697: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.07697&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1409] Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks
Songyao Jin, Biwei Huang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2508.11727: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11727&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1410] MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination
Ziyan Wu, Ivan Korolija, Rui Tang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2508.13532: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13532&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1411] Federated Nonlinear System Identification
Omkar Tupe, Max Hartman, Lav R. Varshney, Saurav Prakash
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2508.15025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.15025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1412] Gaming and Cooperation in Federated Learning: What Can Happen and How to Monitor It
Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unknown - paper content not accessible due to HTTP 429 error from arXiv API
Result: No results available - technical issue with fetching paper information
Conclusion: Cannot analyze paper due to API rate limiting preventing access to abstract and content
Abstract: Failed to fetch summary for 2509.02391: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.02391&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1413] Select, then Balance: Exploring Exogenous Variable Modeling of Spatio-Temporal Forecasting
Wei Chen, Yuqian Wu, Yuanshao Zhu, Xixuan Hao, Shiyu Wang, Xiaofang Zhou, Yuxuan Liang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.05779: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.05779&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1414] VMDNet: Temporal Leakage-Free Variational Mode Decomposition for Electricity Demand Forecasting
Weibin Feng, Ran Tao, John Cartlidge, Jin Zheng
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2509.15394: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.15394&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1415] A Recovery Guarantee for Sparse Neural Networks
Sara Fridovich-Keil, Mert Pilanci
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2509.20323: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20323&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1416] A Genetic Algorithm for Navigating Synthesizable Molecular Spaces
Alston Lo, Connor W. Coley, Wojciech Matusik
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.20719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1417] FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.21029: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21029&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1418] RED-DiffEq: Regularization by denoising diffusion models for solving inverse PDE problems with application to full waveform inversion
Siming Shan, Min Zhu, Youzuo Lin, Lu Lu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.21659: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21659&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1419] On the $ε$-Free Inference Complexity of Absorbing Discrete Diffusion
Xunpeng Huang, Yingyu Lin, Nishant Jain, Kaibo Wang, Difan Zou, Yian Ma, Tong Zhang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.21835: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21835&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1420] Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas
Luke Guerdan, Justin Whitehouse, Kimberly Truong, Kenneth Holstein, Zhiwei Steven Wu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2509.22957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1421] Landing with the Score: Riemannian Optimization through Denoising
Andrey Kharitenko, Zebang Shen, Riccardo de Santi, Niao He, Florian Doerfler
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.23357: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23357&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1422] Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2509.23365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1423] Negative Pre-activations Differentiate Syntax
Linghao Kong, Angelina Ning, Micah Adler, Nir Shavit
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.24198: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24198&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1424] Massively Multimodal Foundation Models: A Framework for Capturing Interactions with Specialized Mixture-of-Experts
Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2509.25678: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25678&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1425] Data-to-Energy Stochastic Dynamics
Kirill Tamogashev, Nikolay Malkin
Main category: cs.LG
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API fetch failureMethod: Unable to determine method due to API fetch failure
Result: Unable to determine results due to API fetch failure
Conclusion: Unable to determine conclusion due to API fetch failure
Abstract: Failed to fetch summary for 2509.26364: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26364&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1426] Bayesian Influence Functions for Hessian-Free Data Attribution
Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.26544: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26544&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1427] Per-example gradients: a new frontier for understanding and improving optimizers
Vincent Roulet, Atish Agarwala
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.00236: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00236&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1428] LEAP: Local ECT-Based Learnable Positional Encodings for Graphs
Juan Amboage, Ernst Röell, Patrick Schnider, Bastian Rieck
Main category: cs.LG
TL;DR: Paper 2510.00757: Failed to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2510.00757: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00757&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1429] Multi-Marginal Flow Matching with Adversarially Learnt Interpolants
Oskar Kviman, Kirill Tamogashev, Nicola Branchini, Víctor Elvira, Jens Lagergren, Nikolay Malkin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2510.01159: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01159&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1430] Counterfactual Credit Guided Bayesian Optimization
Qiyu Wei, Haowei Wang, Richard Allmendinger, Mauricio A. Álvarez
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to technical error in accessing paper contentMethod: Unable to determine method due to technical error in accessing paper content
Result: Unable to determine results due to technical error in accessing paper content
Conclusion: Unable to determine conclusion due to technical error in accessing paper content
Abstract: Failed to fetch summary for 2510.04676: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04676&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1431] Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs
Emanuele Mule, Stefano Fiorini, Antonio Purificato, Federico Siciliano, Stefano Coniglio, Fabrizio Silvestri
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.04727: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04727&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1432] ResCP: Reservoir Conformal Prediction for Time Series Forecasting
Roberto Neglia, Andrea Cini, Michael M. Bronstein, Filippo Maria Bianchi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.05060: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05060&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1433] Boomerang Distillation Enables Zero-Shot Model Size Interpolation
Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis
Main category: cs.LG
TL;DR: Paper 2510.05064: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2510.05064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1434] t-SNE Exaggerates Clusters, Provably
Noah Bergam, Szymon Snoeck, Nakul Verma
Main category: cs.LG
TL;DR: Failed to fetch summary for paper 2510.07746 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2510.07746: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07746&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1435] DFCA: Decentralized Federated Clustering Algorithm
Jonas Kirch, Sebastian Becker, Tiago Koketsu Rodrigues, Stefan Harmeling
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.15300: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15300&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1436] Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications
Hao Qin, Thang Duong, Ming F. Li, Chicheng Zhang
Main category: cs.LG
TL;DR: Unable to analyze paper 2510.18299 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrieved due to API rate limitingMethod: No method information available due to failed API request
Result: No results available due to failed data retrieval
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2510.18299: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18299&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1437] Learning Boltzmann Generators via Constrained Mass Transport
Christopher von Klitzing, Denis Blessing, Henrik Schopmans, Pascal Friederich, Gerhard Neumann
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.18460: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18460&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1438] Does Feedback Alignment Work at Biological Timescales?
Marc Gong Bacvanski, Liu Ziyin, Tomaso Poggio
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.18808: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18808&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1439] Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
Mingyu Jo, Jaesik Yoon, Justin Deschenaux, Caglar Gulcehre, Sungjin Ahn
Main category: cs.LG
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to draw conclusions due to abstract fetch failure
Abstract: Failed to fetch summary for 2510.19304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1440] Optimistic Task Inference for Behavior Foundation Models
Thomas Rupf, Marco Bagatella, Marin Vlastelica, Andreas Krause
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.20264: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20264&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1441] Accelerating Data Generation for Nonlinear temporal PDEs via homologous perturbation in solution space
Lei Liu, Zhenxin Huang, Hong Wang, huanshuo dong, Haiyang Xin, Hongwei Zhao, Bin Li
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.21592: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21592&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1442] Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2510.21910: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21910&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1443] Clustering by Denoising: Latent plug-and-play diffusion for single-cell data
Dominik Meier, Shixing Yu, Sagnik Nandy, Promit Ghosal, Kyra Gan
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - paper ID 2510.22835
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.22835: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22835&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1444] No-Rank Tensor Decomposition Using Metric Learning
Maryam Bagherian
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.01816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1445] Measuring the Intrinsic Dimension of Earth Representations
Arjun Rao, Marc Rußwurm, Konstantin Klemmer, Esther Rolf
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.02101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.02101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1446] Leveraging Discrete Function Decomposability for Scientific Design
James C. Bowden, Sergey Levine, Jennifer Listgarten
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2511.03032 suggests it’s from November 2023, but no abstract or content is available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content. The arXiv ID format suggests it's a computer science paper from late 2023, but specific research motivations are unknown.Method: Method cannot be analyzed due to lack of access to the paper content. The HTTP 429 error indicates rate limiting on arXiv API requests.
Result: No results can be reported since the paper content is unavailable for analysis. The technical issue prevents accessing any experimental findings or outcomes.
Conclusion: Unable to draw any conclusions about the paper’s contributions due to access limitations. The arXiv API rate limiting prevents retrieval of the abstract and paper details.
Abstract: Failed to fetch summary for 2511.03032: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03032&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1447] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression
Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Main category: cs.LG
TL;DR: Paper 2511.20099 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to determine conclusion due to abstract fetch failure
Abstract: Failed to fetch summary for 2511.20099: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20099&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1448] Transmit Weights, Not Features: Orthogonal-Basis Aided Wireless Point-Cloud Transmission
Junlin Chang, Yubo Han, Hang Yue, John S Thompson, Rongke Liu
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2512.03819 exists but content cannot be retrieved.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot determine conclusion without access to paper content.
Abstract: Failed to fetch summary for 2512.03819: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03819&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1449] Learning to Orchestrate Agents in Natural Language with the Conductor
Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.04388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1450] TRINITY: An Evolved LLM Coordinator
Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2512.04695: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04695&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1451] Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning
Vittorio Giammarino, Ahmed H. Qureshi
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2512.12046: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12046&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1452] A Single Architecture for Representing Invariance Under Any Space Group
Cindy Y. Zhang, Elif Ertekin, Peter Orbanz, Ryan P. Adams
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2512.13989: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13989&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1453] EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
Jianfei Ma, Wee Sun Lee
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Cannot analyze method without access to paper content
Result: No results available due to technical limitations in accessing the paper
Conclusion: Unable to draw conclusions about paper relevance without access to the actual content
Abstract: Failed to fetch summary for 2512.15405: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15405&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1454] Dynamic Tool Dependency Retrieval for Efficient Function Calling
Bhrij Patel, Davide Belli, Amir Jalalirad, Maximilian Arnold, Aleksandr Ermolov, Bence Major
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to draw conclusions due to access limitations
Abstract: Failed to fetch summary for 2512.17052: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17052&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1455] Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning
Chi-Yao Huang, Khoa Vo, Aayush Atul Verma, Duo Lu, Yezhou Yang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.20069: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20069&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1456] Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2601.20088: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20088&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1457] Decoupled Diffusion Sampling for Inverse Problems on Function Spaces
Thomas Y.L. Lin, Jiachen Yao, Lufang Chiang, Julius Berner, Anima Anandkumar
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2601.23280: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.23280&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1458] Combinatorial Bandit Bayesian Optimization for Tensor Outputs
Jingru Huang, Haijie Xu, Jie Guo, Manrui Jiang, Chen Zhang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.00640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1459] PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting
Jiaming Ma, Qihe Huang, Haofeng Ma, Guanjun Wang, Sheng Huang, Zhengyang Zhou, Pengkun Wang, Binwu Wang, Yang Wang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to the paper content
Abstract: Failed to fetch summary for 2602.00654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1460] Multi-scale hypergraph meets LLMs: Aligning large language models for time series analysis
Zongjiang Shang, Dongliang Cui, Binqing Wu, Ling Chen
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2602.04369: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04369&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1461] Quantile-Physics Hybrid Framework for Safe-Speed Recommendation under Diverse Weather Conditions Leveraging Connected Vehicle and Road Weather Information Systems Data
Wen Zhang, Adel W. Sadek, Chunming Qiao
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.05053: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05053&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1462] Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Rongrong Ji, Shaohui Lin
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.08324: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08324&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1463] Improved state mixing in higher-order and block diagonal linear recurrent networks
Igor Dubinin, Antonio Orvieto, Felix Effenberger
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.12021: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12021&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1464] Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data
Duy Nguyen, Jiachen Yao, Jiayun Wang, Julius Berner, Animashree Anandkumar
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.12267: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12267&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1465] TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models
Nicolas Zumarraga, Thomas Kaar, Ning Wang, Maxwell A. Xu, Max Rosenblattl, Markus Kreft, Kevin O’Sullivan, Paul Schmiedmayer, Patrick Langer, Robert Jakob
Main category: cs.LG
TL;DR: Unable to analyze paper 2602.14200 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting errorMethod: Cannot determine method as abstract is unavailable due to rate limiting error
Result: Cannot determine results as abstract is unavailable due to rate limiting error
Conclusion: Cannot draw conclusions as abstract is unavailable due to rate limiting error
Abstract: Failed to fetch summary for 2602.14200: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14200&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1466] Directional Reasoning Trajectory Change (DRTC): Identifying Critical Trace Segments in Reasoning Models
Waldemar Chang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.15332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1467] The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers
Wei Tao, Yang Dai, Jincai Huang, Qing Tao
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.19096: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19096&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1468] LLMs Can Learn to Reason Via Off-Policy RL
Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, Wen Sun
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unknown - paper content unavailable due to API rate limiting
Result: No results available - technical error prevented access to paper information
Conclusion: Cannot analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2602.19362: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19362&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1469] UrbanFM: Scaling Urban Spatio-Temporal Foundation Models
Wei Chen, Yuqian Wu, Junle Chen, Xiaofang Zhou, Yuxuan Liang
Main category: cs.LG
TL;DR: UrbanFM: A spatio-temporal foundation model for urban systems that achieves zero-shot generalization across cities and tasks through data, computation, and architecture scaling.
Details
Motivation: Urban computing remains fragmented with scenario-specific models that lack generalizability, unlike foundation models in other scientific disciplines. The paper aims to bridge this gap by developing large-scale spatio-temporal foundation models for urban systems.Method: Three scaling approaches: 1) Data scaling via WorldST corpus (billion-scale standardized urban data from 100+ cities), 2) Computation scaling via MiniST units (discretizing continuous spatio-temporal fields), 3) Architecture scaling via UrbanFM (minimalist self-attention with limited inductive biases).
Result: UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, demonstrating strong performance on the EvalST benchmark (largest urban spatio-temporal benchmark).
Conclusion: The work represents a pivotal first step toward large-scale urban spatio-temporal foundation models, showing that scaling across data, computation, and architecture enables generalizable urban AI systems.
Abstract: Urban systems, as dynamic complex systems, continuously generate spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like genomics and meteorology, urban computing remains fragmented due to “scenario-specific” models, which are overfitted to specific regions or tasks, hindering their generalizability. To bridge this gap and advance spatio-temporal foundation models for urban systems, we adopt scaling as the central perspective and systematically investigate two key questions: what to scale and how to scale. Grounded in first-principles analysis, we identify three critical dimensions: heterogeneity, correlation, and dynamics, aligning these principles with the fundamental scientific properties of urban spatio-temporal data. Specifically, to address heterogeneity through data scaling, we construct WorldST. This billion-scale corpus standardizes diverse physical signals, such as traffic flow and speed, from over 100 global cities into a unified data format. To enable computation scaling for modeling correlations, we introduce the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations. Finally, addressing dynamics via architecture scaling, we propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to autonomously learn dynamic spatio-temporal dependencies from massive data. Furthermore, we establish EvalST, the largest-scale urban spatio-temporal benchmark to date. Extensive experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, marking a pivotal first step toward large-scale urban spatio-temporal foundation models.
[1470] PaReGTA: An LLM-based EHR Data Encoding Approach to Capture Temporal Information
Kihyuk Yoon, Lingchao Mao, Catherine Chong, Todd J. Schwedt, Chia-Chun Chiang, Jing Li
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.19661: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19661&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1471] Unlearning Noise in PINNs: A Selective Pruning Framework for PDE Inverse Problems
Yongsheng Chen, Yong Chen, Wei Guo, Xinghui Zhong
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.19967: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19967&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1472] Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG
Hanning Guo, Farah Abdellatif, Hanwen Bi, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers
Main category: cs.LG
TL;DR: Brain-OF is the first omnifunctional brain foundation model jointly pretrained on fMRI, EEG, and MEG data, using novel techniques to handle multimodal brain signals in a unified framework.
Details
Motivation: Existing brain foundation models are limited to single functional modalities, missing opportunities to exploit complementary spatiotemporal dynamics and collective data scale across different brain imaging techniques.Method: Proposes Brain-OF with: 1) Any-Resolution Neural Signal Sampler to project diverse brain signals into shared semantic space, 2) DINT attention with Sparse Mixture of Experts for modality-invariant and modality-specific representations, and 3) Masked Temporal-Frequency Modeling for dual-domain pretraining.
Result: Brain-OF pretrained on ~40 datasets shows superior performance across diverse downstream tasks, demonstrating benefits of joint multimodal integration and dual-domain pretraining.
Conclusion: Brain-OF successfully integrates multiple brain imaging modalities in a unified foundation model, achieving state-of-the-art performance and highlighting the value of multimodal brain signal analysis.
Abstract: Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across imaging techniques. To address this limitation, we propose Brain-OF, the first omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.
[1473] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Jingxuan Zhang, Yunta Hsieh, Zhongwei Wan, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
Main category: cs.LG
TL;DR: QuantVLA is a training-free post-training quantization framework for Vision-Language-Action models that reduces compute/memory demands while maintaining performance through selective quantization, attention temperature matching, and output head balancing.
Details
Motivation: VLA models face practical deployment challenges due to rapidly increasing compute and memory demands as models scale, especially for embodied agents with longer horizons and larger backbones. There's a need for efficient quantization methods that preserve performance while reducing resource requirements.Method: QuantVLA uses three scale-calibrated components: 1) selective quantization layout that integerizes all linear layers in language backbone and DiT while keeping attention projections in FP, 2) attention temperature matching for per-head scaling to stabilize attention logits, and 3) output head balancing for per-layer residual interface calibration to mitigate post-projection energy drift. The framework requires no training, uses minimal calibration data, and supports integer kernels.
Result: QuantVLA exceeds task success rates of full-precision baselines on LIBERO benchmark, achieves about 70% relative memory savings on quantized components, and provides practical low-bit deployment for embodied intelligence under strict compute/memory/power constraints.
Conclusion: QuantVLA offers the first effective post-training quantization framework for VLA systems, successfully quantizing diffusion transformer action heads while maintaining or improving performance, enabling scalable low-bit embodied intelligence deployment.
Abstract: Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
[1474] QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.20629: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20629&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1475] Sparse Bayesian Deep Functional Learning with Structured Region Selection
Xiaoxian Zhu, Yingmeng Li, Shuangge Ma, Mengyun Wu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.20651: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20651&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1476] Support Tokens, Stability Margins, and a New Foundation for Robust LLMs
Deepak Agarwal, Dhyey Dharmendrakumar Mavani, Suyash Gupta, Karthik Sethuraman, Tejas Dharamsi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.22271: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22271&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1477] BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning
Mingi Kim, Yongjun Kim, Jungwoo Kang, Hyungki Kim
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.22284: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22284&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1478] OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data
Yan Zhao, Zhengxue Cheng, Junxuan Zhang, Dajiang Zhou, Qunshan Gu, Qi Wang, Li Song
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.22286: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22286&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1479] Predicting Multi-Drug Resistance in Bacterial Isolates Through Performance Comparison and LIME-based Interpretation of Classification Models
Santanam Wishal, Riad Sahara
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.22400: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22400&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1480] GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks
Wenwu Tang, Dong Wang, Lothar Thiele, Olga Saukh
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2602.23795
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.23795: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23795&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1481] A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data
Lukas Burk, John Zobolas, Bernd Bischl, Andreas Bender, Marvin N. Wright, Raphael Sonabend
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access failureMethod: Unable to determine method due to API access failure
Result: Unable to determine results due to API access failure
Conclusion: Unable to determine conclusion due to API access failure
Abstract: Failed to fetch summary for 2406.04098: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.04098&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1482] VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data
Jackie Rao, Paul D. W. Kirk
Main category: cs.LG
TL;DR: Unable to fetch paper 2406.16227 due to HTTP 429 rate limiting error from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2406.16227: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.16227&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1483] Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement
Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, Niclas Goring, Ouns El Harzli, Abdurrahman Hadi Erturk, Soufiane Hayou, Ard A. Louis
Main category: cs.LG
TL;DR: Unable to analyze paper 2410.04264 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2410.04264: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.04264&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1484] LD-EnSF: Synergizing Latent Dynamics with Ensemble Score Filters for Fast Data Assimilation with Sparse Observations
Pengpeng Xiao, Phillip Si, Peng Chen
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2411.19305: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.19305&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1485] Mixing Times and Privacy Analysis for the Projected Langevin Algorithm under a Modulus of Continuity
Mario Bravo, Juan P. Flores-Mella, Cristóbal Guzmán
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2501.04134: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.04134&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1486] On weight and variance uncertainty in neural networks for regression tasks
Moein Monemi, Morteza Amini, S. Mahmoud Taheri, Mohammad Arashi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2501.04272: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.04272&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1487] Improving the adaptive and continuous learning capabilities of artificial neural networks: Lessons from multi-neuromodulatory dynamics
Jie Mei, Alejandro Rodriguez-Garcia, Daigo Takeuchi, Gabriel Wainstein, Nina Hubig, Yalda Mohsenzadeh, Srikanth Ramaswamy
Main category: cs.LG
TL;DR: Unable to analyze paper 2501.06762 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2501.06762: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.06762&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1488] Data-Driven Prediction and Control of Hammerstein-Wiener Systems with Implicit Gaussian Processes
Mingzhou Yin, Matthias A. Müller
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2501.15849: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.15849&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1489] CAIMAN: Causal Action Influence Detection for Sample-efficient Loco-manipulation
Yuanchen Yuan, Jin Cheng, Núria Armengol Urpí, Stelian Coros
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2502.00835 suggests it’s from February 2025, but no content available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content. The HTTP 429 error indicates the arXiv API rate limit has been exceeded.Method: No method information available due to failed content retrieval.
Result: No results available as the paper content could not be fetched.
Conclusion: Unable to analyze the paper due to technical limitations in accessing the content from arXiv.
Abstract: Failed to fetch summary for 2502.00835: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.00835&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1490] A Randomized Linearly Convergent Frank-Wolfe-type Method for Smooth Convex Minimization over the Spectrahedron
Dan Garber
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to determine conclusion due to retrieval failure
Abstract: Failed to fetch summary for 2503.01441: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.01441&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1491] A Benchmark Dataset for Machine Learning Surrogates of Pore-Scale CO2-Water Interaction
Alhasan Abdellatif, Hannah P. Menke, Julien Maes, Ahmed H. Elsheikh, Florian Doster
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2503.17592: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.17592&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1492] Optimistic Online Learning in Symmetric Cone Games
Anas Barakat, Wayne Lin, John Lazarsfeld, Antonios Varvitsiotis
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2504.03592: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.03592&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1493] Standardization of Weighted Ranking Correlation Coefficients
Pierangelo Lombardo
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2504.08428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.08428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1494] Sparsification Under Siege: Dual-Level Defense Against Poisoning in Communication-Efficient Federated Learning
Zhiyong Jin, Runhua Xu, Chao Li, Yizhong Liu, Jianxin Li, James Joshi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.01454: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.01454&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1495] ARCANE – Early Detection of Interplanetary Coronal Mass Ejections
H. T. Rüdisser, G. Nguyen, J. Le Louëdec, E. E. Davies, C. Möstl
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.09365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1496] Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD
Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, Christoph H. Lampert
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2505.12128: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12128&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1497] A Projection-Based ARIMA Framework for Nonlinear Dynamics in Macroeconomic and Financial Time Series: Closed-Form Estimation and Rolling-Window Inference
Haojie Liu, Zihan Lin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2507.07469: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.07469&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1498] Separating Ansatz Discovery from Deployment on Larger Problems: Reinforcement Learning for Modular Circuit Design
Gloria Turati, Simone Foderà, Riccardo Nembrini, Maurizio Ferrari Dacrema, Paolo Cremonesi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2507.16001: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.16001&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1499] Domain Generalization and Adaptation in Intensive Care with Anchor Regression
Malte Londschien, Manuel Burger, Gunnar Rätsch, Peter Bühlmann
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2507.21783: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21783&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1500] Modeling User Preferences as Distributions for Optimal Transport-Based Cross-Domain Recommendation under Non-Overlapping Settings
Ziyin Xiao, Toyotaro Suzumura
Main category: cs.LG
TL;DR: Unable to analyze paper 2508.16210 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation due to missing abstract dataMethod: Cannot determine method due to missing abstract data
Result: Cannot determine results due to missing abstract data
Conclusion: Cannot determine conclusion due to missing abstract data
Abstract: Failed to fetch summary for 2508.16210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1501] Universal Dynamics with Globally Controlled Analog Quantum Simulators
Hong-Ye Hu, Abigail McClain Gomez, Liyuan Chen, Aaron Trowbridge, Andy J. Goldschmidt, Zachary Manchester, Frederic T. Chong, Arthur Jaffe, Susanne F. Yelin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.19075: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.19075&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1502] Optimal information injection and transfer mechanisms for active matter reservoir computing
Mario U. Gaimann, Miriam Klopotek
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2509.01799: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.01799&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1503] Quantum parameter estimation with uncertainty quantification from continuous measurement data using neural network ensembles
Amanuel Anteneh
Main category: cs.LG
TL;DR: Unable to analyze paper 2509.10756 due to HTTP 429 error when fetching the abstract from arXiv API
Details
Motivation: Cannot determine motivation as the paper content could not be retrieved due to API rate limitingMethod: No method information available due to failed content retrieval
Result: No results available as the paper content could not be accessed
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2509.10756: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.10756&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1504] Learning Contact Dynamics through Touching: Action-conditional Graph Neural Networks for Robotic Peg Insertion
Zongyao Yi, Joachim Hertzberg, Martin Atzmueller
Main category: cs.LG
TL;DR: Paper 2509.12151 summary unavailable due to HTTP 429 error (rate limiting) when fetching from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limitingMethod: No method information available - arXiv API returned HTTP 429 error (Too Many Requests)
Result: No results available - paper content could not be fetched due to rate limiting
Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2509.12151: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12151&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1505] SamudrACE: Fast and Accurate Coupled Climate Modeling with 3D Ocean and Atmosphere Emulators
James P. C. Duncan, Elynn Wu, Surya Dheeshjith, Adam Subel, Troy Arcomano, Spencer K. Clark, Brian Henn, Anna Kwa, Jeremy McGibbon, W. Andre Perkins, William Gregory, Carlos Fernandez-Granda, Julius Busecke, Oliver Watt-Meyer, William J. Hurlin, Alistair Adcroft, Laure Zanna, Christopher Bretherton
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.12490: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12490&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1506] Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2509.22459: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22459&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1507] Estimating Dimensionality of Neural Representations from Finite Samples
Chanwoo Chun, Abdulkadir Canatar, SueYeon Chung, Daniel Lee
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.26560: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26560&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1508] A universal compression theory for lottery ticket hypothesis and neural scaling laws
Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2510.00504 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable due to API rate limiting.
Result: Cannot determine results as paper content is unavailable due to API rate limiting.
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting.
Abstract: Failed to fetch summary for 2510.00504: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00504&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1509] Learning to Play Multi-Follower Bayesian Stackelberg Games
Gerson Personnat, Tao Lin, Safwan Hossain, David C. Parkes
Main category: cs.LG
TL;DR: Unable to analyze paper 2510.01387 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.01387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1510] Fourier Analysis on the Boolean Hypercube via Hoeffding Functional Decomposition
Baptiste Ferrere, Nicolas Bousquet, Fabrice Gamboa, Jean-Michel Loubes, Joseph Muré
Main category: cs.LG
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - arXiv API rate limiting prevented access to paper summary
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2510.07088: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07088&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1511] Optimal Stopping in Latent Diffusion Models
Yu-Han Wu, Quentin Berthet, Gérard Biau, Claire Boyer, Romuald Elie, Pierre Marion
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed summary fetchMethod: Unable to determine method due to failed summary fetch
Result: Unable to determine results due to failed summary fetch
Conclusion: Unable to draw conclusions due to failed summary fetch
Abstract: Failed to fetch summary for 2510.08409: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08409&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1512] Emergence of Spatial Representation in an Actor-Critic Agent with Hippocampus-Inspired Sequence Generator
Xiao-Xiong Lin, Yuk-Hoi Yiu, Christian Leibold
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to wait before retrying or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error from arXiv API.Method: Cannot determine method as paper content is unavailable due to HTTP 429 error from arXiv API.
Result: Cannot determine results as paper content is unavailable due to HTTP 429 error from arXiv API.
Conclusion: Cannot draw conclusions as paper content is unavailable due to HTTP 429 error from arXiv API.
Abstract: Failed to fetch summary for 2510.09951: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09951&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1513] Geopolitics, Geoeconomics and Risk: A Machine Learning Approach
Alvaro Ortiz, Tomasa Rodrigo, Pablo Saborido
Main category: cs.LG
TL;DR: Failed to fetch summary for paper 2510.12416 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as the paper summary could not be retrieved due to rate limiting from arXiv APIMethod: Unknown - paper content not accessible due to HTTP 429 error when attempting to fetch from arXiv API
Result: No results available - the paper summary could not be retrieved due to API rate limiting
Conclusion: Cannot provide analysis as the paper content was not accessible due to technical limitations (HTTP 429 error from arXiv API)
Abstract: Failed to fetch summary for 2510.12416: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12416&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1514] Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware Pretraining
Sangyoon Bae, Mehdi Azabou, Blake Richards, Jiook Cha
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data retrieval failureMethod: Unable to determine method due to data retrieval failure
Result: Unable to determine results due to data retrieval failure
Conclusion: Unable to draw conclusions due to data retrieval failure
Abstract: Failed to fetch summary for 2510.18516: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18516&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1515] SLAP: Shortcut Learning for Abstract Planning
Y. Isabel Liu, Bowen Li, Benjamin Eysenbach, Tom Silver
Main category: cs.LG
TL;DR: Paper 2511.01107: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot draw conclusions due to missing abstract
Abstract: Failed to fetch summary for 2511.01107: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01107&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1516] DoFlow: Flow-based Generative Models for Interventional and Counterfactual Forecasting on Time Series
Dongze Wu, Feng Qiu, Yao Xie
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.02137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.02137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1517] Distributional Shrinkage I: Universal Denoiser Beyond Tweedie’s Formula
Tengyuan Liang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.09500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1518] Generative Models for Crystalline Materials
Houssam Metni, Laura Ruple, Lauren N. Walters, Luca Torresi, Jonas Teufel, Henrik Schopmans, Jona Östreicher, Yumeng Zhang, Marlen Neubert, Yuri Koide, Kevin Steiner, Paul Link, Lukas Bär, Mariana Petrova, Gerbrand Ceder, Pascal Friederich
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2511.22652: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22652&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1519] Distributional Shrinkage II: Higher-Order Scores Encode Brenier Map
Tengyuan Liang
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.09295: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09295&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1520] LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics
Anselm Strohmaier, Samira Bödefeld, Oliver Straser, Frank Reinhold
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.18685: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18685&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1521] Analysis of Shuffling Beyond Pure Local Differential Privacy
Shun Takagi, Seng Pei Liew
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.19154: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19154&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1522] Sparse clustering via the Deterministic Information Bottleneck algorithm
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.20628: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20628&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1523] Relaxed Triangle Inequality for Kullback-Leibler Divergence Between Multivariate Gaussian Distributions
Shiji Xiao, Yufeng Zhang, Chubo Liu, Yan Ding, Keqin Li, Kenli Li
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2602.02577: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02577&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1524] Incremental (k, z)-Clustering on Graphs
Emilio Cruciani, Sebastian Forster, Antonis Skarlatos
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.08542: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08542&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1525] GPU-Fuzz: Finding Memory Errors in Deep Learning Frameworks
Zihao Li, Hongyi Lu, Yanan Guo, Zhenkai Zhang, Shuai Wang, Fengwei Zhang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.10478: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10478&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1526] Random Forests as Statistical Procedures: Design, Variance, and Dependence
Nathaniel S. O’Connell
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.13104: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13104&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1527] Neural-POD: A Plug-and-Play Neural Operator Framework for Infinite-Dimensional Functional Nonlinear Proper Orthogonal Decomposition
Changhong Mou, Binghang Lu, Guang Lin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.15632: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15632&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1528] Nazrin: Atomic Tactics for Graph Neural Networks for Theorem Proving in Lean 4
Leni Aniva, Iori Oikawa, David Dill, Clark Barrett
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2602.18767: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18767&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1529] Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations
Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.19691: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19691&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1530] Selecting Optimal Variable Order in Autoregressive Ising Models
Shiba Biswal, Marc Vuffray, Andrey Y. Lokhov
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.20394: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20394&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1531] DRESS: A Continuous Framework for Structural Graph Refinement
Eduar Castrillo Velilla
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.20833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1532] Unsupervised Discovery of Intermediate Phase Order in the Frustrated $J_1$-$J_2$ Heisenberg Model via Prometheus Framework
Brandon Yee, Wilson Collins, Maximilian Rutkowski
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.21468 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable.
Result: Cannot determine results as paper content is unavailable.
Conclusion: Cannot draw conclusions about the paper due to unavailability of content.
Abstract: Failed to fetch summary for 2602.21468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1533] Learning Quantum Data Distribution via Chaotic Quantum Diffusion Model
Quoc Hoan Tran, Koki Chinzei, Yasuhiro Endo, Hirotaka Oshima
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2602.22061: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22061&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1534] HyperKKL: Enabling Non-Autonomous State Estimation through Dynamic Weight Conditioning
Yahia Salaheldin Shaaban, Salem Lahlou, Abdelrahman Sayed Sayed
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.22630: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22630&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1535] Multivariate Spatio-Temporal Neural Hawkes Processes
Christopher Chukwuemeka, Hojun You, Mikyoung Jun
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.23629: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23629&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[1536] Strength Change Explanations in Quantitative Argumentation
Timotheus Kampik, Xiang Yin, Nico Potyka, Francesca Toni
Main category: cs.MA
TL;DR: Introduces strength change explanations for quantitative argumentation graphs to make argumentation-based inference contestable by explaining what changes to argument strengths can achieve desired inference outcomes.
Details
Motivation: To make argumentation-based inference contestable by explaining what changes can achieve desired (instead of contested) inference results, addressing limitations of existing inverse and counterfactual approaches.Method: Introduces strength change explanations for quantitative (bipolar) argumentation graphs that describe changes to initial strengths of arguments to achieve desired ordering based on final strengths. Shows reduction of existing inverse/counterfactual problems to strength change explanations and proves soundness/completeness properties.
Result: Demonstrates existence/non-existence of strength change explanations in special cases, and shows heuristic search can successfully find explanations for layered graphs common in application scenarios, though limitations remain for settings without guarantees.
Conclusion: Strength change explanations provide a formal framework for contesting argumentation-based inference by identifying necessary changes to achieve desired outcomes, with practical applicability demonstrated for common graph structures.
Abstract: In order to make argumentation-based inference contestable, it is crucial to explain what changes can achieve a desired (instead of the contested) inference result. To this end, we introduce strength change explanations for quantitative (bipolar) argumentation graphs. Strength change explanations describe changes to the initial strengths of a subset of the arguments in a given graph that can achieve a desired ordering based on the final strengths of some (potentially different) subset of arguments. We show that the existing notions of inverse and counterfactual problems can be reduced to strength change explanations. We also prove basic soundness and completeness properties of our strength change explanations, and demonstrate their existence and non-existence in some special cases. By applying a heuristic search, we demonstrate that we can often successfully find strength change explanations for layered graphs that are common in typical application scenarios; still, limitations remain for settings where we do not provide guarantees for the presence (or absence) of explanations.
[1537] Position: AI Agents Are Not (Yet) a Panacea for Social Simulation
Yiming Li, Dacheng Tao
Main category: cs.MA
TL;DR: LLM-based agents are not yet ready for reliable social simulation due to mismatches between role-playing plausibility and behavioral validity, overemphasis on agent-agent messaging, and sensitivity to interaction protocols.
Details
Motivation: The paper addresses over-optimism about using LLM-integrated agents for social simulation, arguing that current approaches don't meet scientific simulation requirements due to systematic mismatches between what agent pipelines produce and what simulation-as-science needs.Method: The authors propose a unified formulation of AI agent-based social simulation as an environment-involved partially observable Markov game with explicit exposure and scheduling mechanisms to make assumptions auditable.
Result: The paper identifies three key problems: 1) role-playing plausibility doesn’t imply behavioral validity, 2) collective outcomes depend on agent-environment dynamics not just messaging, and 3) results are dominated by interaction protocols and initial conditions.
Conclusion: LLM-based agents are not a panacea for social simulation yet; researchers need more rigorous frameworks that account for environmental factors, scheduling, and explicit assumptions about agent behavior and interactions.
Abstract: Recent advances in large language models (LLMs) have spurred growing interest in using LLM-integrated agents for social simulation, often under the implicit assumption that realistic population dynamics will emerge once role-specified agents are placed in a networked multi-agent setting. This position paper argues that LLM-based agents are not (yet) a panacea for social simulation. We attribute this over-optimism to a systematic mismatch between what current agent pipelines are typically optimized and validated to produce and what simulation-as-science requires. Concretely, role-playing plausibility does not imply faithful human behavioral validity; collective outcomes are frequently mediated by agent-environment co-dynamics rather than agent-agent messaging alone; and results can be dominated by interaction protocols, scheduling, and initial information priors, especially in policy-oriented settings. To make these assumptions explicit and auditable, we propose a unified formulation of AI agent-based social simulation as an environment-involved partially observable Markov game with explicit exposure and scheduling mechanisms and call for further actions.
[1538] SIGMAS: Second-Order Interaction-based Grouping for Overlapping Multi-Agent Swarms
Minah Lee, Saibal Mukhopadhyay
Main category: cs.MA
TL;DR: SIGMAS is a self-supervised framework for predicting latent group structures in overlapping multi-agent swarms by modeling second-order interactions between agents from trajectory data without ground-truth supervision.
Details
Motivation: Swarming systems (drone fleets, robotic teams) have complex dynamics driven by individual behaviors and emergent group interactions. Unlike traditional multi-agent systems, swarms have few large groups with persistent memberships, making group identification essential for understanding fine-grained behavior. Current approaches lack methods to infer latent group structures directly from agent trajectories without supervision.Method: SIGMAS (Second-order Interaction-based Grouping for Multi-Agent Swarms) is a self-supervised framework that goes beyond direct pairwise interactions to model second-order interactions across agents. It captures how similarly agents interact with others, enabling robust group inference. The method includes a learnable gating mechanism that adaptively balances individual and collective dynamics for joint reasoning.
Result: Experiments across diverse synthetic swarm scenarios demonstrate that SIGMAS accurately recovers latent group structures and remains robust under simultaneously overlapping swarm dynamics. The framework establishes both a new benchmark task and a principled modeling approach for swarm understanding.
Conclusion: SIGMAS provides an effective self-supervised framework for group prediction in overlapping multi-agent swarms by modeling second-order interactions, offering a new benchmark task and principled modeling approach for understanding swarm dynamics without ground-truth supervision.
Abstract: Swarming systems, such as drone fleets and robotic teams, exhibit complex dynamics driven by both individual behaviors and emergent group-level interactions. Unlike traditional multi-agent domains such as pedestrian crowds or traffic systems, swarms typically consist of a few large groups with inherent and persistent memberships, making group identification essential for understanding fine-grained behavior. We introduce the novel task of group prediction in overlapping multi-agent swarms, where latent group structures must be inferred directly from agent trajectories without ground-truth supervision. To address this challenge, we propose SIGMAS (Second-order Interaction-based Grouping for Multi-Agent Swarms), a self-supervised framework that goes beyond direct pairwise interactions and model second-order interaction across agents. By capturing how similarly agents interact with others, SIGMAS enables robust group inference and adaptively balances individual and collective dynamics through a learnable gating mechanism for joint reasoning. Experiments across diverse synthetic swarm scenarios demonstrate that SIGMAS accurately recovers latent group structures and remains robust under simultaneously overlapping swarm dynamics, establishing both a new benchmark task and a principled modeling framework for swarm understanding.
[1539] Graph-theoretic Agreement Framework for Multi-agent LLM Systems
Muhammad Umar Javed
Main category: cs.MA
TL;DR: A graph-theoretic framework for analyzing consensus in LLM multi-agent networks, mapping Transformer log-odds to signed Laplacian to study adversarial critique patterns and stability.
Details
Motivation: Modern LLM multi-agent architectures (debate, constitutional oversight, helper-critic loops) rely on adversarial critique for error correction, but their distributed nature creates verification challenges due to imperfect observability of latent states and potential for logical frustration in critique cycles.Method: Establishes graph-theoretic framework mapping Transformer cross-entropy log-odds to signed Laplacian; uses structural balance theory to analyze agreement stability; restricts interaction topologies to chordal graphs and applies matrix decomposition with Gram-Schmidt orthogonalization; develops polynomial-time Perfect Elimination Ordering verification algorithms.
Result: Characterizes how unbalanced critique cycles produce logical frustration and persistent reasoning oscillations; proves hidden system prompts act as topological Trojan horses destabilizing consensus; shows rank-one spectral edge perturbations deterministically break expertise symmetry by shifting eigenvalues into stable left-half plane; validates with large-scale empirical tests on LLaMA-3, Mistral, and Gemma agent ensembles.
Conclusion: Provides rigorous mathematical framework for securing LLM multi-agent coordination, bridging graph theory and Transformer reasoning to address observability challenges and stabilize adversarial critique patterns in distributed AI systems.
Abstract: The shift from monolithic LLMs to distributed multi-agent architectures demands new frameworks for verifying and securing autonomous coordination. Unlike traditional multi-agent systems focused on cooperative state alignment, modern LLM patterns: multi-agent debate, constitutional oversight, helper-critic loops-rely on adversarial critique for error correction and reasoning refinement. Since LLMs are dynamical systems whose latent states are imperfectly observable from verbalized outputs, securing these networks requires understanding both macroscopic topology and microscopic agent observability. This paper establishes a rigorous graph-theoretic framework for analyzing consensus in signed, directed interaction networks, bridging graph theory and LLM reasoning by formally mapping Transformer cross-entropy log-odds to the signed Laplacian. We characterize agreement stability through structural balance theory, showing how unbalanced critique cycles produce logical frustration and persistent reasoning oscillations, and prove that unobservable latent states from hidden system prompts act as topological Trojan horses that destabilize cooperative consensus. To resolve unobservable deadlocks, we restrict interaction topologies to chordal graphs and apply matrix decomposition with Gram-Schmidt orthogonalization, proving that rank-one spectral edge perturbations deterministically break expertise symmetry by shifting eigenvalues into the stable left-half plane. Core contributions include consensus theorems, polynomial-time Perfect Elimination Ordering verification algorithms, and large-scale empirical validation on clustered ensembles of LLaMA-3, Mistral, and Gemma agents.
[1540] Safe Multi-Agent Deep Reinforcement Learning for Privacy-Aware Edge-Device Collaborative DNN Inference
Hong Wang, Xuwei Fan, Zhipeng Cheng, Yachao Yuan, Minghui Min, Minghui Liwang, Xiaoyu Xia
Main category: cs.MA
TL;DR: Privacy-aware collaborative DNN inference framework using hierarchical multi-agent reinforcement learning to optimize delay, energy, and privacy across edge devices and servers.
Details
Motivation: Address challenges in DNN inference on edge/mobile platforms: privacy protection, resource constraints, and dynamic model deployment needs.Method: Formulate as Constrained Markov Decision Process (CMDP) with hierarchical policy layers: 1) auto-regressive model deployment, 2) Lagrangian-enhanced user association & model partitioning, 3) attention-based resource allocation. Use HC-MAPPO-L algorithm with adaptive Lagrangian dual updates.
Result: HC-MAPPO-L consistently satisfies delay constraints while achieving superior balance of energy consumption and privacy cost, outperforming baselines across varying scales and configurations.
Conclusion: Proposed framework effectively addresses privacy, resource, and deployment challenges in edge DNN inference through hierarchical multi-agent reinforcement learning with constraint satisfaction.
Abstract: As Deep Neural Network (DNN) inference becomes increasingly prevalent on edge and mobile platforms, critical challenges emerge in privacy protection, resource constraints, and dynamic model deployment. This paper proposes a privacy-aware collaborative inference framework, in which adaptive model partitioning is performed across edge devices and servers. To jointly optimize inference delay, energy consumption, and privacy cost under dynamic service demands and resource constraints, we formulate the joint problem as a Constrained Markov Decision Process (CMDP) that integrates model deployment, user-server association, model partitioning, and resource allocation. We propose a Hierarchical Constrained Multi-Agent Proximal Policy Optimization with Lagrangian relaxation (HC-MAPPO-L) algorithm, a safe reinforcement learning-based framework that enhances Multi-Agent Proximal Policy Optimization (MAPPO) with adaptive Lagrangian dual updates to enforce long-term delay constraints. To ensure tractability while maintaining coordination, we decompose the CMDP into three hierarchically structured policy layers: an auto-regressive based model deployment policy, a Lagrangian-enhanced user association and model partitioning policy, and an attention-based resource allocation policy. Extensive experimental results demonstrate that HC-MAPPO-L consistently satisfies stringent delay constraints while achieving a superior balance among energy consumption and privacy cost, outperforming representative baseline algorithms across varying problem scales and resource configurations.
[1541] Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems
Jean-Philippe Garnier
Main category: cs.MA
TL;DR: The Agentic Hive framework introduces dynamic multi-agent systems with variable populations where agents undergo demographic processes (birth, death, specialization) and are organized into families, with formal economic theory governing their evolution.
Details
Motivation: Current multi-agent AI systems lack formal theory for runtime agent creation, destruction, or specialization, and don't adapt to changing resources or objectives. There's a need for frameworks that can dynamically restructure agent populations based on environmental changes.Method: Introduces the Agentic Hive framework with autonomous micro-agents in sandboxed execution environments with language model access. Uses multi-sector growth theory from dynamic general equilibrium economics, with agent families as production sectors, compute/memory as factors of production, and an orchestrator as Walrasian auctioneer and Global Workspace.
Result: Proves seven analytical results: existence of Hive Equilibrium, Pareto optimality, multiplicity of equilibria under strategic complementarities, Stolper-Samuelson and Rybczynski analogs for restructuring, Hopf bifurcation generating endogenous demographic cycles, and sufficient condition for local asymptotic stability. Creates regime diagram partitioning parameter space.
Conclusion: Provides formal governance toolkit for predicting and steering demographic evolution of self-organizing multi-agent systems, enabling operators to manage dynamic agent populations in response to preference and resource shocks.
Abstract: Current multi-agent AI systems operate with a fixed number of agents whose roles are specified at design time. No formal theory governs when agents should be created, destroyed, or re-specialized at runtime-let alone how the population structure responds to changes in resources or objectives. We introduce the Agentic Hive, a framework in which a variable population of autonomous micro-agents-each equipped with a sandboxed execution environment and access to a language model-undergoes demographic dynamics: birth, duplication, specialization, and death. Agent families play the role of production sectors, compute and memory play the role of factors of production, and an orchestrator plays the dual role of Walrasian auctioneer and Global Workspace. Drawing on the multi-sector growth theory developed for dynamic general equilibrium (Benhabib & Nishimura, 1985; Venditti, 2005; Garnier, Nishimura & Venditti, 2013), we prove seven analytical results: (i) existence of a Hive Equilibrium via Brouwer’s fixed-point theorem; (ii) Pareto optimality of the equilibrium allocation; (iii) multiplicity of equilibria under strategic complementarities between agent families; (iv)-(v) Stolper-Samuelson and Rybczynski analogs that predict how the Hive restructures in response to preference and resource shocks; (vi) Hopf bifurcation generating endogenous demographic cycles; and (vii) a sufficient condition for local asymptotic stability. The resulting regime diagram partitions the parameter space into regions of unique equilibrium, indeterminacy, endogenous cycles, and instability. Together with the comparative-statics matrices, it provides a formal governance toolkit that enables operators to predict and steer the demographic evolution of self-organizing multi-agent systems.
[1542] Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
Moritz Weckbecker, Jonas Müller, Ben Hagag, Michael Mulet
Main category: cs.MA
TL;DR: Subliminal prompting in multi-agent systems can spread bias across networks, degrading truthfulness and creating security vulnerabilities.
Details
Motivation: Prior work examined subliminal prompting in user-LLM interactions, but bias transfer in multi-agent systems and associated security implications remain unexplored.Method: Demonstrated that a single subliminally prompted agent can spread bias throughout networks; measured across 6 agents using two different topologies; assessed network performance on TruthfulQA.
Result: Transferred concept maintains elevated response rate throughout network; subliminal prompting of single agent degrades truthfulness of other agents; reveals new attack vector in multi-agent security.
Conclusion: Subliminal prompting introduces security vulnerability in multi-agent systems with implications for alignment; implementation available publicly.
Abstract: Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user-LLM interactions, potential bias transfer in multi-agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persisting bias throughout its entire network. We measure this phenomenon across 6 agents using two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To exemplify potential misalignment risks, we assess network performance on multiple-choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal prompting introduces a new attack vector in multi-agent security, with implications for the alignment of such systems. The implementation of all experiments is publicly available at https://github.com/Multi-Agent-Security-Initiative/thought_virus .
[1543] Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems
Adam Kostka, Jarosław A. Chudziak
Main category: cs.MA
TL;DR: A novel multi-agent architecture integrating Theory of Mind, BDI-style internal beliefs, and symbolic solvers for logical verification to enhance collaborative decision-making in LLM-based multi-agent systems.
Details
Motivation: LLM-based multi-agent systems show promise for collaborative problem-solving but face challenges in dynamic worlds with variable performance. Simply adding cognitive mechanisms like Theory of Mind and internal beliefs doesn't automatically improve coordination, and the interplay between these mechanisms with formal logic verification remains underexplored.Method: Introduces a novel multi-agent architecture that integrates Theory of Mind, BDI-style internal beliefs, and symbolic solvers for logical verification. Evaluates this architecture in a resource allocation problem using various LLMs to study the interaction between LLM capabilities, cognitive mechanisms, and performance.
Result: Finds intricate interactions between LLM capabilities, cognitive mechanisms, and system performance in collaborative decision-making. The integration of ToM, internal beliefs, and symbolic solvers influences system accuracy in resource allocation problems.
Conclusion: Proposes a novel multi-agent system with ToM, internal beliefs, and symbolic solvers to augment collaborative intelligence in multi-agent systems, contributing to AI research by evaluating performance under different LLM settings and revealing complex interactions between cognitive mechanisms and LLM capabilities.
Abstract: LLM-based MAS are gaining popularity due to their potential for collaborative problem-solving enhanced by advances in natural language comprehension, reasoning, and planning. Research in Theory of Mind (ToM) and Belief-Desire-Intention (BDI) models has the potential to further improve the agent’s interaction and decision-making in such systems. However, collaborative intelligence in dynamic worlds remains difficult to accomplish since LLM performance in multi-agent worlds is extremely variable. Simply adding cognitive mechanisms like ToM and internal beliefs does not automatically result in improved coordination. The interplay between these mechanisms, particularly in relation to formal logic verification, remains largely underexplored in different LLMs. This work investigates: How do internal belief mechanisms, including symbolic solvers and Theory of Mind, influence collaborative decision-making in LLM-based multi-agent systems, and how does the interplay of those components influence system accuracy? We introduce a novel multi-agent architecture integrating ToM, BDI-style internal beliefs, and symbolic solvers for logical verification. We evaluate this architecture in a resource allocation problem with various LLMs and find an intricate interaction between LLM capabilities, cognitive mechanisms, and performance. This work contributes to the area of AI by proposing a novel multi-agent system with ToM, internal beliefs, and symbolic solvers for augmenting collaborative intelligence in multi-agent systems and evaluating its performance under different LLM settings.
[1544] Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems
Yuzhe Zhang, Feiran Liu, Yi Shan, Xinyi Huang, Xin Yang, Yueqi Zhu, Xuxin Cheng, Cao Liu, Ke Zeng, Terry Jingchen Zhang, Wenyuan Jiang
Main category: cs.MA
TL;DR: Multi-agent LLM systems form effective communication topologies but fail at integrating distributed information for reasoning tasks, creating a fundamental Communication-Reasoning Gap.
Details
Motivation: While multi-agent LLM systems are deployed to overcome context limitations by distributing information across agents, it's unclear whether agents can reliably compute with distributed information rather than just exchange it. The paper aims to investigate this fundamental capability gap.Method: Introduces Silo-Bench, a role-agnostic benchmark with 30 algorithmic tasks across three communication complexity levels. Evaluates 54 configurations over 1,620 experiments to systematically study multi-agent reasoning capabilities.
Result: Reveals a fundamental Communication-Reasoning Gap: agents spontaneously form appropriate coordination topologies and actively exchange information, but systematically fail to synthesize distributed state into correct answers. The failure is localized to reasoning-integration stage, and coordination overhead compounds with scale, eliminating parallelization gains.
Conclusion: Naively scaling agent count cannot circumvent context limitations in LLMs. Silo-Bench provides a foundation for tracking progress toward genuinely collaborative multi-agent systems that can both communicate AND reason with distributed information.
Abstract: Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information – rather than merely exchange it – remains an open question. We introduce Silo-Bench, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication-Reasoning Gap: agents spontaneously form task-appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning-integration stage – agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually eliminating parallelization gains entirely. These findings demonstrate that naively scaling agent count cannot circumvent context limitations, and Silo-Bench provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.
[1545] MedCollab: Causal-Driven Multi-Agent Collaboration for Full-Cycle Clinical Diagnosis via IBIS-Structured Argumentation
Yuqi Zhan, Xinyue Wu, Tianyu Lin, Yutong Bao, Xiaoyu Wang, Weihao Cheng, Huangwei Chen, Feiwei Qin, Zhu Zhu
Main category: cs.MA
TL;DR: MedCollab is a multi-agent framework that emulates hospital consultation workflows for medical diagnosis, using specialist recruitment, structured argumentation, and hierarchical disease modeling to reduce LLM hallucinations in healthcare.
Details
Motivation: LLMs show promise in healthcare but face limitations in clinical practice due to diagnostic hallucinations and insufficient interpretable reasoning. There's a need for more reliable, transparent, and clinically compliant diagnostic systems.Method: Multi-agent framework emulating hospital consultation workflow with dynamic specialist recruitment, Issue-Based Information System (IBIS) argumentation protocol, Hierarchical Disease Causal Chain modeling, and multi-round consensus mechanism with logic auditing and weighted voting.
Result: Significantly outperforms pure LLMs and medical multi-agent systems in Accuracy and RaTEScore on real-world clinical datasets, demonstrating marked reduction in medical hallucinations.
Conclusion: MedCollab provides an extensible, transparent, and clinically compliant approach to medical decision-making that addresses key limitations of current LLM applications in healthcare.
Abstract: Large language models (LLMs) have shown promise in healthcare applications, however, their use in clinical practice is still limited by diagnostic hallucinations and insufficiently interpretable reasoning. We present MedCollab, a novel multi-agent framework that emulates the hierarchical consultation workflow of modern hospitals to autonomously navigate the full-cycle diagnostic process. The framework incorporates a dynamic specialist recruitment mechanism that adaptively assembles clinical and examination agents according to patient-specific symptoms and examination results. To ensure the rigor of clinical work, we adopt a structured Issue-Based Information System (IBIS) argumentation protocol that requires agents to provide ``Positions’’ backed by traceable evidence from medical knowledge and clinical data. Furthermore, the framework constructs a Hierarchical Disease Causal Chain that transforms flattened diagnostic predictions into a structured model of pathological progression through explicit logical operators. A multi-round Consensus Mechanism iteratively filters low-quality reasoning through logic auditing and weighted voting. Evaluated on real-world clinical datasets, MedCollab significantly outperforms pure LLMs and medical multi-agent systems in Accuracy and RaTEScore, demonstrating a marked reduction in medical hallucinations. These findings indicate that MedCollab provides an extensible, transparent, and clinically compliant approach to medical decision-making.
[1546] Can AI Agents Agree?
Frédéric Berdoz, Leonardo Rugli, Roger Wattenhofer
Main category: cs.MA
TL;DR: LLM-based agents struggle with Byzantine consensus even in no-stake settings, showing unreliable agreement that degrades with group size and Byzantine agents.
Details
Motivation: To systematically study LLM-based agents' behavior in adversarial consensus settings, particularly Byzantine consensus games, as these models are increasingly deployed as cooperating agents but their coordination robustness hasn't been thoroughly evaluated.Method: Used synchronous all-to-all simulation of Byzantine consensus game over scalar values in no-stake settings where agents have no preferences over final value. Tested across hundreds of simulations varying model sizes, group sizes, and Byzantine fractions.
Result: Valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing Byzantine agents further reduces success. Failures dominated by loss of liveness (timeouts, stalled convergence) rather than subtle value corruption.
Conclusion: Reliable agreement is not yet a dependable emergent capability of current LLM-agent groups even in no-stake settings, raising caution for deployments relying on robust coordination.
Abstract: Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM-based agents on a Byzantine consensus game over scalar values using a synchronous all-to-all simulation. We test consensus in a no-stake setting where agents have no preferences over the final value, so evaluation focuses on agreement rather than value optimality. Across hundreds of simulations spanning model sizes, group sizes, and Byzantine fractions, we find that valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing a small number of Byzantine agents further reduces success. Failures are dominated by loss of liveness, such as timeouts and stalled convergence, rather than subtle value corruption. Overall, the results suggest that reliable agreement is not yet a dependable emergent capability of current LLM-agent groups even in no-stake settings, raising caution for deployments that rely on robust coordination.
[1547] Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning
Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang
Main category: cs.MA
TL;DR: Proposes Bayesian uncertainty framework for Multi-Agent Debate, decomposing uncertainty into epistemic (reducible by debate) and aleatoric (internal noise), then designs uncertainty-guided MARL algorithm to optimize debate effectiveness.
Details
Motivation: Multi-Agent Debate shows promise for improving reasoning but exhibits paradoxical phenomena like accuracy improvement with increased token entropy and divergence between homogeneous/heterogeneous models. Need theoretical framework to understand how information exchange shapes reasoning ability.Method: 1) Bayesian uncertainty analysis framework decomposing total predictive uncertainty into epistemic (reducible by debate context) and aleatoric (internal model noise). 2) Uncertainty-guided multi-agent reinforcement learning algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization.
Result: Training significantly improves post-debate accuracy and stability, enhances individual reasoning beyond single-agent RL. Framework provides unified Bayesian uncertainty perspective for understanding and improving MAD.
Conclusion: Effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Uncertainty-guided MARL successfully optimizes this balance, improving both collective and individual reasoning capabilities.
Abstract: Multi-Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by substantial increase in token entropy, and remarkable divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty reducible by debate context and aleatoric uncertainty induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post-debate accuracy and stability, and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.
[1548] Exploration enhances cooperation in the multi-agent communication system
Zhao Song, Chen Shen, Zhen Wang, The Anh Han
Main category: cs.MA
TL;DR: The paper proposes a two-stage evolutionary game model integrating signaling with donation games, showing that moderate exploration (noise) in decision-making creates optimal cooperation levels in multi-agent systems.
Details
Motivation: Existing theoretical frameworks for multi-agent cooperation through cheap talk often exclude random exploration for analytical tractability, leaving its functional impact on system performance unexplored. The authors aim to bridge this gap by studying how exploration affects cooperation in communication-based systems.Method: The authors propose a two-stage evolutionary game-theoretical model that integrates signaling with a donation game, explicitly incorporating exploration into decision-making. They conduct agent-based simulations across various network topologies to analyze system dynamics.
Result: Simulations reveal a universal optimal exploration rate that maximizes system-wide cooperation. Moderate exploration undermines defection stability and catalyzes self-organized cooperative alliances, with cooperation peaks enabled by a balance between oscillation period and amplification.
Conclusion: Strategic exploration (engineered randomness) is essential for sustaining cooperation and achieving optimal performance in communication-based intelligent systems, rather than pursuing deterministic rigidity.
Abstract: Designing protocols enhancing cooperation for multi-agent systems remains a grand challenge. Cheap talk, defined as costless, non-binding communication before formal action, serves as a pivotal solution. However, existing theoretical frameworks often exclude random exploration, or noise, for analytical tractability, leaving its functional impact on system performance largely unexplored. To bridge this gap, we propose a two-stage evolutionary game-theoretical model, integrating signalling with a donation game, with exploration explicitly incorporated into the decision-making. Our agent-based simulations across topologies reveal a universal optimal exploration rate that maximises system-wide cooperation. Mechanistically, moderate exploration undermines the stability of defection and catalyses the self-organised cooperative alliances, facilitating their cyclic success. Moreover, the cooperation peak is enabled by the delicate balance between oscillation period and amplification. Our findings suggest that rather than pursuing deterministic rigidity, embracing strategic exploration, as a form of engineered randomness, is essential to sustain cooperation and realise optimal performance in communication-based intelligent systems.
[1549] Selection as Power: Constrained Reinforcement for Bounded Decision Authority
Jose Manuel de la Chica Rodriguez, Juan Manuel Vera Díaz
Main category: cs.MA
TL;DR: Dynamic extension of selection governance framework using constrained reinforcement learning to maintain bounded selection authority in high-stakes agentic systems.
Details
Motivation: The original Selection as Power framework was static, lacking adaptation over time. This work addresses the need for dynamic governance that can maintain bounded selection authority while allowing reinforcement learning in high-stakes agentic systems.Method: Introduces incentivized selection governance where reinforcement updates are applied to scoring and reducer parameters under externally enforced sovereignty constraints. Formalizes selection as a constrained reinforcement process with parameter updates projected onto governance-defined feasible sets to prevent concentration beyond prescribed bounds.
Result: Unconstrained reinforcement collapses into deterministic dominance under repeated feedback, especially at higher learning rates. Incentivized governance enables adaptive improvement while maintaining bounded selection concentration. Projection-based constraints transform reinforcement from irreversible lock-in into controlled adaptation.
Conclusion: Learning dynamics can coexist with structural diversity when sovereignty constraints are enforced at every update step, offering a principled approach to integrating reinforcement into high-stakes agentic systems without surrendering bounded selection authority.
Abstract: Selection as Power argued that upstream selection authority, rather than internal objective misalignment, constitutes a primary source of risk in high-stakes agentic systems. However, the original framework was static: governance constraints bounded selection power but did not adapt over time. In this work, we extend the framework to dynamic settings by introducing incentivized selection governance, where reinforcement updates are applied to scoring and reducer parameters under externally enforced sovereignty constraints. We formalize selection as a constrained reinforcement process in which parameter updates are projected onto governance-defined feasible sets, preventing concentration beyond prescribed bounds. Across multiple regulated financial scenarios, unconstrained reinforcement consistently collapses into deterministic dominance under repeated feedback, especially at higher learning rates. In contrast, incentivized governance enables adaptive improvement while maintaining bounded selection concentration. Projection-based constraints transform reinforcement from irreversible lock-in into controlled adaptation, with governance debt quantifying the tension between optimization pressure and authority bounds. These results demonstrate that learning dynamics can coexist with structural diversity when sovereignty constraints are enforced at every update step, offering a principled approach to integrating reinforcement into high-stakes agentic systems without surrendering bounded selection authority.
[1550] Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning
Nhat Nguyen, Duong Nguyen, Gianluca Rizzo, Hung Nguyen
Main category: cs.MA
TL;DR: CB-MCTS introduces coordinated Boltzmann exploration for multi-agent MCTS to improve performance in sparse/deceptive reward environments through stochastic policies and decaying entropy bonuses.
Details
Motivation: Decentralized MCTS struggles with sparse or skewed reward environments in cooperative multi-agent planning, requiring better exploration strategies to handle deceptive scenarios.Method: Replaces deterministic UCT with stochastic Boltzmann policy and decaying entropy bonus for sustained yet focused exploration, specifically designed for multi-agent coordination.
Result: Outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing robust multi-agent planning solutions.
Conclusion: CB-MCTS successfully addresses multi-agent exploration challenges with Boltzmann policies, offering improved performance in challenging reward environments.
Abstract: Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.
[1551] Graphon Mean-Field Control for Cooperative Multi-Agent Reinforcement Learning
Yuanquan Hu, Xiaoli Wei, Junji Yan, Hengxi Zhang
Main category: cs.MA
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) for arXiv ID 2209.04808
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limitingMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions about the paper due to technical retrieval issues
Abstract: Failed to fetch summary for 2209.04808: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2209.04808&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1552] Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding
Yulun Zhang, Varun Bhatt, Matthew C. Fontaine, Stefanos Nikolaidis, Jiaoyang Li
Main category: cs.MA
TL;DR: This paper introduces Mixed Guidance Graph Optimization (MGGO) for Lifelong Multi-Agent Path Finding, which optimizes both edge weights AND directions (not just weights) to provide stricter guidance for agent movement.
Details
Motivation: Current Guidance Graph Optimization (GGO) methods only optimize edge weights, which provide soft guidance (discouraging but not prohibiting edge use). The authors identify the need for stricter guidance through edge direction optimization to better control agent movement in lifelong MAPF scenarios.Method: Two MGGO methods: 1) Two-phase optimization separating edge direction and weight optimization, 2) Quality Diversity algorithms with neural networks to generate both edge directions and weights. Also incorporates traffic patterns into GGO for edge-direction-aware guidance graphs.
Result: The paper presents generalized Mixed Guidance Graph Optimization framework that extends beyond traditional GGO by enabling optimization of both edge directions and weights, providing stricter guidance for agent movement.
Conclusion: Incorporating edge direction optimization alongside weight optimization provides stricter guidance in lifelong MAPF, improving agent movement control through the proposed MGGO framework.
Abstract: Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents’ movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance. An edge with a high weight only discourages agents from using it, instead of prohibiting agents from traversing it. In this paper, we explore the need to incorporate edge directions optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two phases separately. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
[1553] UFO3: Weaving the Digital Agent Galaxy
Chaoyun Zhang, Liqun Li, He Huang, Chiming Ni, Bo Qiao, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Main category: cs.MA
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.11332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1554] Developing Fundamental Diagrams for Urban Air Mobility Traffic Based on Physical Experiments
Hang Zhou, Yuhui Zhai, Shiyu Shen, Yanfeng Ouyang, Xiaowei Shi, Xiaopeng Li
Main category: cs.MA
TL;DR: Paper 2512.21425: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to unavailability of paper contentMethod: Cannot determine method due to unavailability of paper content
Result: Cannot determine results due to unavailability of paper content
Conclusion: Cannot draw conclusions due to unavailability of paper content
Abstract: Failed to fetch summary for 2512.21425: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21425&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MM
[1555] CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction
Jiadong Wang, Ke Zhang, Xinyuan Qian, Ruijie Tao, Haizhou Li, Björn Schuller
Main category: cs.MM
TL;DR: Proposes a robust audio-visual speaker extraction method that handles degraded visual inputs without requiring degraded training data, using disentangled cues inspired by human perception.
Details
Motivation: Existing audio-visual speaker extraction methods perform well but are vulnerable to degraded visual inputs common in real-world scenarios. Current approaches require training with degraded data, which is impractical since visual degradation can occur in unpredictable ways.Method: Proposes an audio-visual learner that disentangles three distinct cues inspired by human perception: speaker information, acoustic synchronization, and semantic synchronization. Designs a dedicated interaction module to effectively integrate these cues for reliable speaker extraction guidance.
Result: Extensive experiments demonstrate strong robustness under various visual degradations and clear superiority over existing methods.
Conclusion: The proposed approach enhances robustness against impaired visual inputs without needing degraded videos during training, offering a more practical solution for real-world audio-visual speaker extraction.
Abstract: Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance, the issue of degraded visual inputs has received relatively little attention, despite being common in real-world scenarios. Previous attempts to address this problem have mainly involved training with degraded visual data. However, visual degradation can occur in many unpredictable ways, making it impractical to simulate all possible cases during training. In this paper, we aim to enhance the robustness of audio-visual speaker extraction against impaired visual inputs without relying on degraded videos during training. Inspired by observations from human perceptual mechanisms, we propose an audio-visual learner that disentangles speaker information, acoustic synchronisation, and semantic synchronisation as distinct cues. Furthermore, we design a dedicated interaction module that effectively integrates these cues to provide a reliable guidance signal for speaker extraction. Extensive experiments demonstrate the strong robustness of the proposed model under various visual degradations and its clear superiority over existing methods.
[1556] Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding
Zhiyuan Zhou, Yanrong Guo, Shijie Hao
Main category: cs.MM
TL;DR: ECMC is a novel task and model for generating natural language descriptions of emotional and cognitive states from multimodal data to create emotion-cognition profiles that improve mental health assessment accuracy and interpretability.
Details
Motivation: Existing multimodal mental health methods treat data as classification tasks, limiting interpretability for emotion and cognition. LLMs mainly rely on textual semantics and overlook fine-grained emotional/cognitive cues in multimodal inputs. Current approaches have implicit connections between emotional features and mental health conditions.Method: Encoder-decoder architecture with modality-specific encoders extracting features, fused by dual-stream BridgeNet based on Q-former. Contrastive learning enhances emotional/cognitive feature extraction. LLaMA decoder aligns features with annotated captions to produce detailed descriptions.
Result: ECMC outperforms existing multimodal LLMs and mental health models in generating emotion-cognition captions. Generated emotion-cognition profiles significantly improve assistive diagnosis and interpretability in mental health analysis.
Conclusion: ECMC successfully addresses limitations of existing methods by generating interpretable emotion-cognition descriptions from multimodal data, improving both accuracy and interpretability of mental health assessments.
Abstract: Emotional and cognitive factors are essential for understanding mental health disorders. However, existing methods often treat multi-modal data as classification tasks, limiting interpretability especially for emotion and cognition. Although large language models (LLMs) offer opportunities for mental health analysis, they mainly rely on textual semantics and overlook fine-grained emotional and cognitive cues in multi-modal inputs. While some studies incorporate emotional features via transfer learning, their connection to mental health conditions remains implicit. To address these issues, we propose ECMC, a novel task that aims at generating natural language descriptions of emotional and cognitive states from multi-modal data, and producing emotion-cognition profiles that improve both the accuracy and interpretability of mental health assessments. We adopt an encoder-decoder architecture, where modality-specific encoders extract features, which are fused by a dual-stream BridgeNet based on Q-former. Contrastive learning enhances the extraction of emotional and cognitive features. A LLaMA decoder then aligns these features with annotated captions to produce detailed descriptions. Extensive objective and subjective evaluations demonstrate that: 1) ECMC outperforms existing multi-modal LLMs and mental health models in generating emotion-cognition captions; 2) the generated emotion-cognition profiles significantly improve assistive diagnosis and interpretability in mental health analysis.
[1557] Nagare Media Engine: A System for Cloud- and Edge-Native Network-based Multimedia Workflows
Matthias Neugebauer
Main category: cs.MM
TL;DR: Open source implementation of ISO/IEC 23090-8 NBMP standard for cloud-native multimedia workflow management using Kubernetes
Details
Motivation: Modern multimedia workflows have become complex distributed systems requiring new approaches for describing and implementing tasks like ingest, transcode, packaging, and deliveryMethod: Developed nagare media engine as an open source research prototype implementing the NBMP standard, built on Kubernetes platform for cloud- and edge-native deployment
Result: Created a standards-based multimedia workflow system that provides a cloud- and edge-native solution meeting today’s requirements for multimedia workflow systems
Conclusion: The nagare media engine demonstrates a practical implementation of NBMP standard for managing complex distributed multimedia workflows in modern cloud environments
Abstract: Before media playback is possible, live and video-on-demand content alike usually undergoes various operations described as tasks within a multimedia workflow. Where previously ingest, transcode, packaging and delivery tasks might have run on a single machine, today’s workflows are significantly more complex distributed systems. Describing and implementing multimedia workflows is challenging and requires new approaches. A standards-based multimedia workflow system is described in ISO/IEC 23090-8 Network-Based Media Processing (NBMP) developed by MPEG. This technical report discusses details of nagare media engine, our open source research prototype implementation of NBMP. Built upon the Kubernetes platform, nagare media engine provides a cloud- and edge-native solution that meets today’s requirements for multimedia workflow systems.
eess.AS
[1558] Using Songs to Improve Kazakh Automatic Speech Recognition
Rustem Yeshpanov
Main category: eess.AS
TL;DR: Using Kazakh songs as training data improves Whisper ASR performance for low-resource language adaptation, though still below large speech corpus results.
Details
Motivation: Low-resource languages like Kazakh lack sufficient transcribed speech data for effective ASR development, requiring exploration of alternative data sources like songs.Method: Curated 3,013 audio-text pairs from 195 Kazakh songs, segmented at lyric-line level. Fine-tuned Whisper models under 7 scenarios combining Songs, Common Voice Corpus, and FLEURS, evaluated on CVC, FLEURS, and KSC2 benchmarks.
Result: Song-based fine-tuning improved performance over zero-shot baselines: Whisper Large-V3 Turbo trained on mixed data achieved 27.6% WER on CVC, 11.8% on FLEURS, and halved error on KSC2 (39.3% vs 81.2%). Gains were meaningful but below models trained on 1,100-hour KSC2 corpus.
Conclusion: Even modest song-speech mixtures can yield meaningful ASR adaptation improvements for low-resource languages, demonstrating songs as a viable unconventional data source when traditional speech corpora are scarce.
Abstract: Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, while halving the error on KSC2 (39.3% vs. 81.2%) relative to the zero-shot model. Although these gains remain below those of models trained on the 1,100-hour KSC2 corpus, they demonstrate that even modest song-speech mixtures can yield meaningful adaptation improvements in low-resource ASR. The dataset is released on Hugging Face for research purposes under a gated, non-commercial licence.
[1559] VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir, Yair Lifshitz, Eran Segal
Main category: eess.AS
TL;DR: VoxKnesset dataset enables longitudinal speech analysis with 2,300 hours of Hebrew parliamentary speech spanning 15 years, showing speaker verification degrades over time and age prediction requires longitudinal training.
Details
Motivation: Speech systems struggle with voice changes over time, but existing datasets lack longitudinal coverage needed to study aging effects on speech processing.Method: Created VoxKnesset dataset with 2,300 hours of Hebrew parliamentary speech (2009-2025) from 393 speakers, including aligned transcripts and demographic metadata. Benchmarked modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions.
Result: Speaker verification EER increased from 2.15% to 4.58% over 15 years for best model. Cross-sectional age regressors failed to capture within-speaker aging, while longitudinally trained models recovered meaningful temporal signal.
Conclusion: Longitudinal datasets are crucial for developing aging-robust speech systems, and VoxKnesset provides valuable resources for both Hebrew speech processing and aging research.
Abstract: Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
[1560] Inter-Speaker Relative Cues for Two-Stage Text-Guided Target Speech Extraction
Wang Dai, Archontis Politis, Tuomas Virtanen
Main category: eess.AS
TL;DR: Relative cues outperform independent cues for text-based target speech extraction, with a two-stage framework (separation + text-guided classification) beating single-stage methods and even matching audio-based systems for some cue types.
Details
Motivation: To improve text-based target speech extraction by using relative cues that preserve fine-grained distinctions often lost in absolute categorical representations, inspired by human perception and label quantization principles.Method: Two-stage framework: 1) speech separation model generates candidate sources, 2) text-guided classifier selects target speaker based on embedding similarity. Two classification models trained to compare relative vs independent cues.
Result: Relative cues achieve higher classification accuracy and better TSE performance than independent cues; two-stage framework outperforms single-stage methods; certain relative cues (language, gender, loudness, distance, temporal order, speaking duration, random cue, all cue) can match audio-based TSE system performance.
Conclusion: Relative cues are superior for text-based TSE, with a two-stage separation+classification approach being highly effective, and some text-based cues can compete with audio-based systems, revealing important differences in discriminative power across cue types.
Abstract: This paper investigates the use of relative cues for text-based target speech extraction (TSE). We first provide a theoretical justification for relative cues from the perspectives of human perception and label quantization, showing that relative cues preserve fine-grained distinctions often lost in absolute categorical representations. Building on this analysis, we propose a two-stage TSE framework, in which a speech separation model generates candidate sources, followed by a text-guided classifier that selects the target speaker based on embedding similarity. Using this framework, we train two separate classification models to evaluate the advantages of relative cues over independent cues in terms of both classification accuracy and TSE performance. Experimental results demonstrate that (i) relative cues achieve higher overall classification accuracy and improved TSE performance compared with independent cues, (ii) the two-stage framework substantially outperforms single-stage text-conditioned extraction methods on both signal-level and objective perceptual metrics, and (iii) certain relative cues (language, gender, loudness, distance, temporal order, speaking duration, random cue and all cue) can surpass the performance of an audio-based TSE system. Further analysis reveals notable differences in discriminative power across cue types, providing insights into the effectiveness of different relative cues for TSE.
[1561] The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge
Ya Jiang, Ruoyu Wang, Jingxuan Zhang, Jun Du, Yi Han, Zihao Quan, Hang Chen, Yeran Yang, Kongzhi Zheng, Zhuo Chen, Yanhui Tu, Shutong Niu, Changfeng Xi, Mengzhi Wang, Zhongbin Wu, Jieru Chen, Henghui Zhi, Weiyi Shi, Shuhang Wu, Genshun Wan, Jia Pan, Jianqing Gao
Main category: eess.AS
TL;DR: Multimodal cascaded system for recognizing and clustering multiple concurrent conversations in indoor social settings using audio-visual pretrained models with LLM enhancement
Details
Motivation: Address the challenge of recognizing and clustering multiple parallel dialogues in indoor social settings where speech overlap exceeds 90%, unlike conventional single-topic meetingsMethod: Multimodal cascaded system leveraging per-speaker visual streams from 360° video and single-channel audio, with enhanced audio-visual pretrained models for Active Speaker Detection, Audio-Visual Target Speech Extraction, and Audio-Visual Speech Recognition incorporating Whisper and LLM techniques
Result: Best single system achieves 32.44% Speaker WER; ROVER fusion reduces to 31.40%; LLM-based zero-shot conversational clustering achieves perfect F1 score of 1.0, yielding final JACER of 15.70%
Conclusion: The proposed multimodal approach effectively handles highly overlapping multi-conversation scenarios through audio-visual integration and LLM enhancement
Abstract: This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues–up to eight speakers across up to four simultaneous conversations–with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that leverages per-speaker visual streams extracted from synchronized 360 degree video together with single-channel audio. Our system improves three components of the pipeline by leveraging enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription accuracy. Our best single cascaded system achieves a Speaker Word Error Rate (WER) of 32.44% on the development set. By further applying ROVER to fuse outputs from diverse front-end and back-end variants, we reduce Speaker WER to 31.40%. Notably, our LLM-based zero-shot conversational clustering achieves a speaker clustering F1 score of 1.0, yielding a final Joint ASR-Clustering Error Rate (JACER) of 15.70%.
[1562] Conversational Speech Naturalness Predictor
Anfeng Xu, Yashesh Gaur, Naoyuki Kanda, Zhicheng Ouyang, Katerina Zmolikova, Desh Raj, Simone Merello, Anna Sun, Ozlem Kalinli
Main category: eess.AS
TL;DR: A framework for automatic naturalness prediction in two-speaker, multi-turn conversations, showing existing single-speaker predictors fail for conversational contexts and proposing a dual-channel estimator that achieves higher correlation with human judgments.
Details
Motivation: Existing speech naturalness predictors are designed for single-speaker utterances and fail to capture conversation-level naturalness qualities, creating a need for conversational naturalness evaluation for developing human-like speech agents.Method: Proposes a dual-channel naturalness estimator framework using multiple pre-trained encoders with data augmentation, specifically designed for two-speaker, multi-turn conversations.
Result: The proposed model achieves substantially higher correlation with human judgments compared to existing naturalness predictors for both in-domain and out-of-domain conditions.
Conclusion: The framework successfully addresses the gap in conversational naturalness evaluation, providing an effective automatic predictor for multi-speaker conversational contexts.
Abstract: Evaluation of conversational naturalness is essential for developing human-like speech agents. However, existing speech naturalness predictors are often designed to assess utterances from a single speaker, failing to capture conversation-level naturalness qualities. In this paper, we present a framework for an automatic naturalness predictor for two-speaker, multi-turn conversations. We first show that existing naturalness estimators have low, or sometimes even negative, correlations with conversational naturalness, based on conversational recordings annotated with human ratings. We then propose a dual-channel naturalness estimator, in which we investigate multiple pre-trained encoders with data augmentation. Our proposed model achieves substantially higher correlation with human judgments compared to existing naturalness predictors for both in-domain and out-of-domain conditions.
[1563] Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec
Yanzhou Ren, Noboru Harada, Daiki Takeuchi, Siyu Chen, Wei Liu, Xiao Zhang, Liyuan Zhang, Takehiro Moriya, Shoji Makino
Main category: eess.AS
TL;DR: EG-GRVQ: An entropy-guided group residual vector quantization method for ultra-low bitrate neural speech codecs that balances semantic modeling with high-fidelity reconstruction.
Details
Motivation: Current neural audio codecs struggle to maintain both accurate semantic modeling and high-fidelity reconstruction under ultra-low bitrate constraints, which is crucial for communication-oriented scenarios.Method: Proposes entropy-guided group residual vector quantization (EG-GRVQ) with a semantic branch for linguistic information and an acoustic branch using entropy-guided grouping. Channels are partitioned based on Gaussian statistics assumption where variance serves as proxy for information content, ensuring balanced information allocation across groups.
Result: Trained on LibriTTS and VCTK datasets, the model shows improvements in perceptual quality and intelligibility metrics under ultra-low bitrate conditions, with enhanced codec-level fidelity for communication scenarios.
Conclusion: EG-GRVQ effectively addresses the ultra-low bitrate neural speech codec challenge by improving codebook efficiency and reducing redundancy while maintaining both semantic accuracy and reconstruction quality.
Abstract: Neural audio codec (NAC) is essential for reconstructing high-quality speech signals and generating discrete representations for downstream speech language models. However, ensuring accurate semantic modeling while maintaining high-fidelity reconstruction under ultra-low bitrate constraints remains challenging. We propose an entropy-guided group residual vector quantization (EG-GRVQ) for an ultra-low bitrate neural speech codec, which retains a semantic branch for linguistic information and incorporates an entropy-guided grouping strategy in the acoustic branch. Assuming that channel activations follow approximately Gaussian statistics, the variance of each channel can serve as a principled proxy for its information content. Based on this assumption, we partition the encoder output such that each group carries an equal share of the total information. This balanced allocation improves codebook efficiency and reduces redundancy. Trained on LibriTTS and VCTK, our model shows improvements in perceptual quality and intelligibility metrics under ultra-low bitrate conditions, with a focus on codec-level fidelity for communication-oriented scenarios.
[1564] A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
Hashim Ali, Nithin Sai Adupa, Surya Subramani, Hafiz Malik
Main category: eess.AS
TL;DR: Spoof-SUPERB benchmark evaluates 20 SSL models for audio deepfake detection across multiple datasets, finding large-scale discriminative models (XLS-R, UniSpeech-SAT, WavLM Large) perform best and are more robust to acoustic degradations than generative approaches.
Details
Motivation: Audio deepfake detection is security-critical but has been excluded from SSL benchmarks like SUPERB. There's a need for systematic evaluation of SSL models for detecting synthetic/fake audio to secure speech systems.Method: Created Spoof-SUPERB benchmark evaluating 20 SSL models (generative, discriminative, spectrogram-based) on multiple in-domain and out-of-domain datasets. Analyzed robustness under acoustic degradations.
Result: Large-scale discriminative models (XLS-R, UniSpeech-SAT, WavLM Large) consistently outperform other models. Generative approaches degrade sharply under acoustic degradations while discriminative models remain resilient.
Conclusion: Spoof-SUPERB establishes reproducible baselines for audio deepfake detection using SSL representations, providing practical insights for securing speech systems against audio deepfakes.
Abstract: Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite it’s security-critical importance, Audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluated these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
[1565] Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
Yi Gu, Yanqing Liu, Chen Yang, Sheng Zhao
Main category: eess.AS
TL;DR: Reinforcement learning with GRPO and LLM-enhanced captions improves text-to-audio generation quality and text-audio alignment
Details
Motivation: Existing text-to-audio generation methods struggle with accurately rendering complex prompts involving intricate audio effects and achieving precise text-audio alignment, despite prior approaches using data augmentation, explicit timing conditioning, and reinforcement learning.Method: Uses LLM to generate high-fidelity audio captions for better text-audio semantic alignment, then applies Group Relative Policy Optimization (GRPO) reinforcement learning to fine-tune diffusion transformer-based T2A models with systematic experimentation of reward functions (CLAP, KL, FAD combinations).
Result: GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence, with systematic analysis identifying key drivers of effective RL in audio synthesis and how reward design impacts final audio quality.
Conclusion: Reinforcement learning with GRPO and LLM-enhanced captioning effectively improves text-to-audio generation quality and alignment, providing insights into reward function design for audio synthesis.
Abstract: Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yield substantial gains in synthesis fidelity and prompt adherence.
[1566] TCG CREST System Description for the DISPLACE-M Challenge
Nikhil Raghav, Md Sahidullah
Main category: eess.AS
TL;DR: TCG CREST system for speaker diarization in noisy medical conversations, comparing modular SpeechBrain pipeline with end-to-end Diarizen system using WavLM, achieving ~39% DER improvement.
Details
Motivation: Address speaker diarization challenges in naturalistic medical conversations in noisy rural healthcare scenarios, which is crucial for accurate medical documentation and analysis.Method: Two frameworks: 1) Modular pipeline using SpeechBrain with ECAPA-TDNN embeddings, 2) Hybrid end-to-end Diarizen system based on pre-trained WavLM. Explored various clustering techniques including AHC and novel spectral clustering variants (SC-adapt, SC-PNA, SC-MK).
Result: Diarizen system achieved ~39% relative improvement in DER compared to SpeechBrain baseline. Best system (Diarizen with AHC and median filtering) achieved DER of 10.37% on dev set and 9.21% on eval set. Team ranked 6th out of 11 participants.
Conclusion: End-to-end neural diarization systems like Diarizen with advanced clustering techniques significantly outperform traditional modular approaches for speaker diarization in challenging noisy medical conversation scenarios.
Abstract: This report presents the TCG CREST system description for Track 1 (Speaker Diarization) of the DISPLACE-M challenge, focusing on naturalistic medical conversations in noisy rural-healthcare scenarios. Our study evaluates the impact of various voice activity detection (VAD) methods and advanced clustering algorithms on overall speaker diarization (SD) performance. We compare and analyze two SD frameworks: a modular pipeline utilizing SpeechBrain with ECAPA-TDNN embeddings, and a state-of-the-art (SOTA) hybrid end-to-end neural diarization system, Diarizen, built on top of a pre-trained WavLM. With these frameworks, we explore diverse clustering techniques, including agglomerative hierarchical clustering (AHC), and multiple novel variants of spectral clustering, such as SC-adapt, SC-PNA, and SC-MK. Experimental results demonstrate that the Diarizen system provides an approximate $39%$ relative improvement in the diarization error rate (DER) on the post-evaluation analysis of PhaseI compared to the SpeechBrain baseline. Our best-performing submitted system employing the Diarizen baseline with AHC employing a median filtering with a larger context window of $29$ achieved a DER of 10.37% on the development and 9.21% on the evaluation sets, respectively. Our team ranked sixth out of the 11 participating teams after the PhaseI evaluation.
[1567] Score-Informed Transformer for Refining MIDI Velocity in Automatic Music Transcription
Zhanhong He, Roberto Togneri, David Huang
Main category: eess.AS
TL;DR: A lightweight Transformer module that refines MIDI velocity estimates by leveraging both audio and inaccurate score information, improving performance across multiple AMT systems with minimal parameter overhead.
Details
Motivation: MIDI velocity is essential for expressive music performance, but practical scenarios often involve inaccurate velocity scores alongside audio (e.g., music education, online archives). This creates a need for score-informed MIDI velocity estimation that can correct inaccurate scores using audio information.Method: Proposes a modular, lightweight score-informed Transformer correction module that refines velocity estimates from existing Automatic Music Transcription (AMT) systems. The module integrates with multiple AMT systems (HPT, HPPNet, DynEst) and is trained exclusively on MAESTRO dataset. Uses both audio and inaccurate score information to correct velocity estimates.
Result: Consistently reduces velocity estimation errors on MAESTRO dataset and improves cross-dataset generalization to SMD and MAPS datasets. When integrated with HPT (Score-HPT), establishes new state-of-the-art performance, outperforms existing score-informed methods and velocity-enabled AMT systems while adding only 1M parameters.
Conclusion: The proposed score-informed Transformer module effectively refines MIDI velocity estimates by leveraging both audio and score information, demonstrating strong performance and generalization with minimal computational overhead.
Abstract: MIDI velocity is crucial for capturing expressive dynamics in human performances. In practical scenarios, a music score with inaccurate velocities may be available alongside the performance audio (e.g., music education and free online archives), enabling the task of score-informed MIDI velocity estimation. In this work, we propose a modular, lightweight score-informed Transformer correction module that refines the velocity estimates of Automatic Music Transcription (AMT) systems. We integrate the proposed module into multiple AMT systems (HPT, HPPNet, and DynEst). Trained exclusively on the MAESTRO training split, our method consistently reduces velocity estimation errors on MAESTRO and improves cross-dataset generalization to SMD and MAPS datasets. Under this training protocol, integrating our score-informed module with HPT (named Score-HPT) establishes a new state-of-the-art performance, outperforms existing score-informed methods and velocity-enabled AMT systems while adding only 1 M parameters.
[1568] MAPSS: Manifold-based Assessment of Perceptual Source Separation
Amir Ivry, Samuele Cornell, Shinji Watanabe
Main category: eess.AS
TL;DR: PS and PM are new perceptual audio separation metrics that isolate leakage and distortion factors using diffusion maps on self-supervised embeddings, outperforming 18 existing measures in correlation with human subjective scores.
Details
Motivation: Current objective assessment of audio source-separation systems doesn't align well with human perception, especially when leakage (competing talkers) and distortion interact. There's a need for better metrics that isolate these factors.Method: Generate fundamental distortions from reference waveforms, encode all signals with pre-trained self-supervised model, use diffusion maps to create manifold where Euclidean distances align with perceptual dissimilarities. PS measures leakage by comparing output to all references/distortions, PM measures self-distortion by comparing to own reference/distortions.
Result: Experiments on English, Spanish, and music mixtures show PS and PM outperform 18 widely used measures, almost always ranking first or second in linear and rank correlations with subjective human mean-opinion scores.
Conclusion: PS and PM provide effective perceptual metrics for audio source separation that better match human judgment by isolating leakage and distortion factors, with high temporal resolution and statistical confidence intervals.
Abstract: Objective assessment of audio source-separation systems still mismatches subjective human perception, especially when interference from competing talkers and distortion of the target signal interact. We introduce Perceptual Separation (PS) and Perceptual Match (PM), a complementary pair of measures that, by design, isolate these leakage and distortion factors. Our intrusive approach generates a set of fundamental distortions, e.g., clipping, notch filter, and pitch shift from each reference waveform signal in the mixture. Distortions, references, and system outputs from all sources are independently encoded by a pre-trained self-supervised model, then aggregated and embedded with a manifold learning technique called diffusion maps, which aligns Euclidean distances on the manifold with dissimilarities of the encoded waveform representations. On this manifold, PM captures the self-distortion of a source by measuring distances from its output to its reference and associated distortions, while PS captures leakage by also accounting for distances from the output to non-attributed references and distortions. Both measures are differentiable and operate at a resolution as high as 75 frames per second, allowing granular optimization and analysis. We further derive, for both measures, frame-level deterministic error radius and non-asymptotic, high-probability confidence intervals. Experiments on English, Spanish, and music mixtures show that, against 18 widely used measures, the PS and PM are almost always placed first or second in linear and rank correlations with subjective human mean-opinion scores.
[1569] Adapting Speech Foundation Models for Unified Multimodal Speech Recognition with Large Language Models
Jing-Xuan Zhang, Genshun Wan, Jin Li, Jianqing Gao, Duo Zhao, Zhen-Hua Ling
Main category: eess.AS
TL;DR: UASR-LLM adapts frozen speech foundation models to unified visual, audio, and audio-visual speech recognition using LLMs as text decoders with visual injection modules and two-stage training.
Details
Motivation: Speech foundation models excel in audio tasks but lack multimodal adaptation. This work addresses the gap by creating a unified framework for visual speech recognition (VSR), automatic speech recognition (ASR), and audio-visual speech recognition (AVSR) using LLMs.Method: Uses frozen speech foundation models with visual injection modules at multiple layers for multimodal fusion. Connects to decoder-only LLMs via feed-forward adaptor. Two-stage training: 1) visual injection pretraining aligns audio/visual representations, 2) speech recognition finetuning integrates LLMs for unified optimization.
Result: Superior performance over state-of-the-art baselines across VSR, ASR, and AVSR under both clean and noisy conditions. Generalizes across various SFMs and LLMs, validating training strategy effectiveness.
Conclusion: UASR-LLM successfully adapts speech foundation models to multimodal speech recognition tasks using LLMs, demonstrating unified framework effectiveness and strong generalization across models and conditions.
Abstract: While speech foundation models (SFMs) have demonstrated remarkable performance in audio-only tasks, their adaptation to multimodal scenarios remains underexplored. This work presents UASR-LLM, a novel framework that adapts frozen SFMs to unified visual speech recognition (VSR), automatic speech recognition (ASR), and audio-visual speech recognition (AVSR) by leveraging large language models (LLMs) as text decoders. Visual representations are injected into multiple SFM layers via visual injection modules, enabling multimodal fusion and unified representation learning. The augmented SFMs are connected to decoder-only LLMs through a feed-forward adaptor, where concatenated representations and instruction prompts guide transcription. We propose a two-stage training strategy consisting of visual injection pretraining followed by speech recognition finetuning. The pretraining stage aligns audio, visual, and audio-visual representations within the frozen SFM backbone, while the finetuning stage integrates LLMs for unified optimization across speech recognition tasks. Experimental results demonstrate superior performance over state-of-the-art baselines across VSR, ASR, and AVSR under both clean and noisy conditions. Ablation studies further confirm generalization across various SFMs and LLMs, validating the effectiveness of the proposed training strategy.
[1570] Speech Emotion Recognition with ASR Integration
Yuanchao Li
Main category: eess.AS
TL;DR: This thesis explores integrating Automatic Speech Recognition (ASR) with Speech Emotion Recognition (SER) to improve robustness and scalability for real-world applications, addressing challenges in spontaneous and low-resource scenarios.
Details
Motivation: Speech Emotion Recognition is crucial for emotionally intelligent systems and AGI development, but current SER systems struggle with real-world deployment due to complex emotional expression and limitations in speech/language technologies, especially in spontaneous and low-resource settings.Method: The thesis investigates the integration of Automatic Speech Recognition into Speech Emotion Recognition systems, though specific technical approaches aren’t detailed in the abstract.
Result: Results not specified in the abstract, but the research aims to enhance robustness, scalability, and practical applicability of emotion recognition from spoken language.
Conclusion: ASR integration with SER could address deployment challenges in real-world scenarios, potentially advancing emotionally intelligent systems and AGI development.
Abstract: Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication, enabling emotionally intelligent systems, and serving as a fundamental component in the development of Artificial General Intelligence (AGI). However, deploying SER in real-world, spontaneous, and low-resource scenarios remains a significant challenge due to the complexity of emotional expression and the limitations of current speech and language technologies. This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER, with the goal of enhancing the robustness, scalability, and practical applicability of emotion recognition from spoken language.
[1571] WAXAL: A Large-Scale Multilingual African Language Speech Corpus
Abdoulaye Diack, Perry Nelson, Kwaku Agbesi, Angela Nakalembe, MohamedElfatih MohamedKhair, Vusumuzi Dube, Tavonga Siyavora, Subhashini Venugopalan, Jason Hickey, Uche Okonkwo, Abhishek Bapna, Isaac Wiafe, Raynard Dodzi Helegah, Elikem Doe Atsakpo, Charles Nutrokpor, Fiifi Baffoe Payin Winful, Kafui Kwashie Solaga, Jamal-Deen Abdulai, Akon Obu Ekpezu, Audace Niyonkuru, Samuel Rutunda, Boris Ishimwe, Michael Melese, Engineer Bainomugisha, Joyce Nakatumba-Nabende, Andrew Katumba, Claire Babirye, Jonathan Mukiibi, Vincent Kimani, Samuel Kibacia, James Maina, Fridah Emmah, Ahmed Ibrahim Shekarau, Ibrahim Shehu Adamu, Yusuf Abdullahi, Howard Lakougna, Bob MacDonald, Hadar Shemtov, Aisha Walcott-Bryant, Moustapha Cisse, Avinatan Hassidim, Jeff Dean, Yossi Matias
Main category: eess.AS
TL;DR: WAXAL is a large-scale open speech dataset for 24 Sub-Saharan African languages, containing ~1,250 hours of ASR data and ~235 hours of TTS data, collected through partnerships with African organizations to address the digital divide in speech technology.
Details
Motivation: Speech technology advancement has favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. There's a need for inclusive speech datasets to enable technology development and digital preservation for these underrepresented languages.Method: Collected speech data through partnerships with four African academic and community organizations. Created two main components: 1) ASR dataset with ~1,250 hours of transcribed natural speech from diverse speakers, and 2) TTS dataset with ~235 hours of high-quality single-speaker recordings reading phonetically balanced scripts. Implemented methodology for data collection, annotation, and quality control.
Result: Released WAXAL dataset covering 24 languages representing over 100 million speakers. Dataset includes substantial speech resources: ~1,250 hours for ASR and ~235 hours for TTS. Released under CC-BY-4.0 license on Hugging Face to enable research and technology development.
Conclusion: WAXAL addresses the speech technology gap for Sub-Saharan African languages by providing a large-scale, openly accessible dataset. It aims to catalyze research, enable inclusive technology development, and serve as a resource for digital language preservation.
Abstract: The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 24 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with around 235 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
[1572] Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement
Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen
Main category: eess.AS
TL;DR: URSA-GAN is a unified generative framework that addresses domain shifts in ASR and speech enhancement by synthesizing domain-aligned speech using noise and channel embeddings with dynamic stochastic perturbation for better generalization.
Details
Motivation: Pre-trained ASR and speech enhancement models perform well under matched conditions but degrade significantly with domain shifts from unseen noise and channel distortions. There's a need for robust solutions that can handle mismatched noise and channel conditions.Method: URSA-GAN uses a dual-embedding architecture with separate noise and channel encoders pre-trained on limited in-domain data. These embeddings condition a GAN-based speech generator. A novel dynamic stochastic perturbation technique introduces controlled variability into embeddings during generation to enhance robustness to unseen domains.
Result: URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. On compound test conditions with both channel and noise degradations, it achieves relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
Conclusion: URSA-GAN demonstrates strong generalization ability for handling domain shifts in speech processing tasks, particularly for unseen noise and channel conditions, making it a promising approach for robust ASR and speech enhancement systems.
Abstract: Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we in this paper present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
[1573] TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR
Qingshun She, Jing Peng, Yangui Fang, Yu Xi, Kai Yu
Main category: eess.AS
TL;DR: TC-BiMamba enables unified streaming and non-streaming ASR with dynamic chunk size training, achieving faster training, lower memory usage, and better performance than existing methods.
Details
Motivation: Existing bidirectional Mamba (BiMamba) based streaming ASR methods are limited to fixed chunk size decoding, which restricts flexibility. Dynamic chunk size training enables a single model for both offline and streaming decoding with various latency settings, but traditional approaches suffer from substantial training overhead.Method: Proposes Trans-Chunk BiMamba (TC-BiMamba) for dynamic chunk size training. The Trans-Chunk mechanism trains both bidirectional sequences in an offline style with dynamic chunk size, allowing the model to capture bidirectional context while maintaining efficiency.
Result: TC-BiMamba achieves 1.3x training speedup, reduces training memory by 50%, and improves model performance compared to traditional chunk-wise processing. It outperforms U2++ and matches LC-BiMamba performance with smaller model size.
Conclusion: TC-BiMamba provides an efficient solution for unified streaming and non-streaming ASR with dynamic chunk size training, offering better performance, faster training, and lower memory usage than existing approaches.
Abstract: This work investigates bidirectional Mamba (BiMamba) for unified streaming and non-streaming automatic speech recognition (ASR). Dynamic chunk size training enables a single model for offline decoding and streaming decoding with various latency settings. In contrast, existing BiMamba based streaming method is limited to fixed chunk size decoding. When dynamic chunk size training is applied, training overhead increases substantially. To tackle this issue, we propose the Trans-Chunk BiMamba (TC-BiMamba) for dynamic chunk size training. Trans-Chunk mechanism trains both bidirectional sequences in an offline style with dynamic chunk size. On the one hand, compared to traditional chunk-wise processing, TC-BiMamba simultaneously achieves 1.3 times training speedup, reduces training memory by 50%, and improves model performance since it can capture bidirectional context. On the other hand, experimental results show that TC-BiMamba outperforms U2++ and matches LC-BiMmaba with smaller model size.
[1574] DTT-BSR: GAN-based DTTNet with RoPE Transformer Enhancement for Music Source Restoration
Shihong Tan, Haoyu Wang, Youran Ni, Yingzhao Hou, Jiayue Luo, Zipei Hu, Han Dou, Zerui Han, Ningning Pan, Yuzhu Wang, Gongping Huang
Main category: eess.AS
TL;DR: DTT-BSR: A hybrid GAN combining RoPE transformer for temporal modeling and dual-path band-split RNN for spectral processing achieves state-of-the-art music source restoration performance with only 7.1M parameters.
Details
Motivation: Music source restoration aims to recover original stems from mixed/mastered recordings, which is challenging due to both source separation and reconstruction of signals degraded by production effects like compression and reverberation.Method: Proposes DTT-BSR, a hybrid generative adversarial network combining rotary positional embeddings (RoPE) transformer for long-term temporal modeling with dual-path band-split recurrent neural network for multi-resolution spectral processing.
Result: Achieved 3rd place on objective leaderboard and 4th place on subjective leaderboard in ICASSP 2026 MSR Challenge, demonstrating exceptional generation fidelity and semantic alignment with compact 7.1M parameters.
Conclusion: The proposed hybrid architecture effectively addresses both temporal and spectral aspects of music source restoration, achieving competitive performance with minimal parameters.
Abstract: Music source restoration (MSR) aims to recover unprocessed stems from mixed and mastered recordings. The challenge lies in both separating overlapping sources and reconstructing signals degraded by production effects such as compression and reverberation. We therefore propose DTT-BSR, a hybrid generative adversarial network (GAN) combining rotary positional embeddings (RoPE) transformer for long-term temporal modeling with dual-path band-split recurrent neural network (RNN) for multi-resolution spectral processing. Our model achieved 3rd place on the objective leaderboard and 4th place on the subjective leaderboard on the ICASSP 2026 MSR Challenge, demonstrating exceptional generation fidelity and semantic alignment with a compact size of 7.1M parameters.
[1575] Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi
Main category: eess.AS
TL;DR: Fine-tuning Whisper model to detect synthetic words in speech while transcribing, using cost-effective partially vocoded data for training
Details
Motivation: Deepfake speech detection is important for security, but dedicated detectors are expensive to develop. Need cost-effective method that can detect synthetic words while maintaining transcription accuracy.Method: Fine-tune pre-trained Whisper model to detect synthetic words during transcription via next-token prediction. Use partially vocoded utterances as training data to reduce collection costs.
Result: Fine-tuned Whisper achieves low synthetic-word detection and transcription error rates on in-domain data. On out-of-domain data with unseen generative models, performs comparably to dedicated ResNet detector but shows performance degradation needing improved generalization.
Conclusion: Whisper fine-tuning offers cost-effective synthetic word detection while transcribing, but needs better generalization strategies for unseen generative models.
Abstract: Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized with speech-generative models. While a dedicated synthetic word detector could be developed, we developed a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thus reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced with unseen speech-generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
eess.IV
[1576] GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans
Joy T Wu, Daniel Beckmann, Sarah Miller, Alexander Lee, Elizabeth Theng, Stephan Altmayer, Ken Chang, David Kersting, Tomoaki Otani, Brittany Z Dashevsky, Hye Lim Park, Matteo Novello, Kip Guja, Curtis Langlotz, Ismini Lourentzou, Daniel Gruhl, Benjamin Risse, Guido A Davidzon
Main category: eess.IV
TL;DR: GazeXPErT: A 4D eye-tracking dataset capturing expert gaze patterns during tumor detection on FDG-PET/CT scans, showing that incorporating expert gaze improves AI segmentation and localization performance.
Details
Motivation: Address clinical translation barriers of AI models for tumor segmentation by improving interpretability, explainability, and reliability through capturing and incorporating expert gaze patterns during diagnostic reading.Method: Created a 4D eye-tracking dataset from 346 FDG-PET/CT scans read by trainees and board-certified specialists using eye-tracking-enabled annotation platform. Extracted 9,030 gaze-to-lesion trajectories synchronized with PET/CT images. Used 3D nnUNet for segmentation and vision transformers for gaze prediction tasks.
Result: Incorporating expert gaze patterns improved tumor segmentation DICE score from 0.6008 to 0.6819. Vision transformers achieved 74.95% predicted gaze points closer to tumors and 67.53% accuracy for expert intention prediction.
Conclusion: Expert gaze patterns provide valuable information that can enhance AI models for medical imaging, improving segmentation, localization, and interpretability in oncologic FDG-PET/CT analysis.
Abstract: [18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (DICE score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% predicted gaze point closer to tumor) and expert intention prediction (Accuracy 67.53% and AUROC 0.747). GazeXPErT is a valuable resource designed to explore multiple machine learning problems beyond these baseline experiments, which include and are not limited to, visual grounding or causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction or understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
[1577] Optimisation of SOUP-GAN and CSR-GAN for High Resolution MR Images Reconstruction
Muneeba Rashid, Hina Shakir, Humaira Mehwish, Asarim Amir, Reema Qaiser Khan
Main category: eess.IV
TL;DR: Enhanced GAN models (SOUP-GAN and CSR-GAN) for MRI image quality improvement through architectural modifications, spectral normalization, and hyperparameter tuning.
Details
Motivation: MRI image quality is affected by motion artifacts and equipment limitations, which can hinder accurate disease diagnosis. There's a need for effective image enhancement techniques to improve diagnostic accuracy.Method: Two GAN models (SOUP-GAN and CSR-GAN) with architectural improvements: deepened generator/discriminator with additional convolutional layers, increased filter sizes, LeakyReLU activation for better gradient flow, spectral normalization for training stability, and hyperparameter tuning (reduced learning rate, optimal batch size).
Result: CSR-GAN achieved best performance in reconstructing high-frequency details and reducing noise (PSNR: 34.6, SSIM: 0.89). SOUP-GAN performed best for delivering less noisy images with good structures (PSNR: 34.4, SSIM: 0.83). Both models showed significant improvement over baseline methods.
Conclusion: The enhanced GAN models can effectively improve MRI image quality, potentially leading to better disease diagnostics through clearer medical images.
Abstract: Magnetic Resonance (MR) imaging is a diagnostic tool used in modern medicine; however, its output can be affected by motion artefacts and may be limited by equipment. This research focuses on MRI image quality enhancement using two efficient Generative Adversarial Networks (GANs) models: SOUP-GAN and CSR-GAN. In both models, meaningful architectural modifications were introduced. The generator and discriminator of each were further deepened by adding convolutional layers and were enhanced in filter sizes as well. The LeakyReLU activation function was used to improve gradient flow, and hyperparameter tuning strategies were applied, including a reduced learning rate and an optimal batch size. Moreover, spectral normalisation was proposed to address mode collapse and improve training stability. The experiment shows that CSR-GAN has better performance in reconstructing the image with higher frequency details and reducing noise compared to other methods, with an optimised PSNR of 34.6 and SSIM of 0.89. However, SOUP-GAN performed the best in terms of delivering less noisy images with good structures, achieving a PSNR of 34.4 and SSIM of 0.83. The obtained results indicate that the proposed enhanced GAN model can be a useful tool for MR image quality improvement for subsequent better disease diagnostics.
[1578] Efficient Flow Matching for Sparse-View CT Reconstruction
Jiayang Shi, Lincen Yang, Zhong Li, Tristan Van Leeuwen, Daniel M. Pelt, K. Joost Batenburg
Main category: eess.IV
TL;DR: FMCT/EFMCT: Flow Matching-based CT reconstruction framework using deterministic ODEs instead of stochastic diffusion models, with velocity field reuse for efficiency.
Details
Motivation: Diffusion models for CT reconstruction suffer from stochasticity interfering with data consistency operations and are computationally expensive. Flow Matching offers deterministic trajectories that are more compatible with repeated data consistency steps in time-critical clinical scenarios.Method: Proposes FMCT framework using Flow Matching with deterministic ODEs instead of stochastic SDEs. Introduces EFMCT variant that reuses previously predicted velocity fields across consecutive steps to reduce neural network function evaluations (NFEs) while maintaining reconstruction quality.
Result: FMCT/EFMCT achieve competitive reconstruction quality compared to diffusion-based methods while significantly improving computational efficiency. Theoretical analysis shows error from velocity reuse is bounded when combined with data consistency operations.
Conclusion: Flow Matching provides a more suitable deterministic framework for CT reconstruction than stochastic diffusion models, with velocity field reuse enabling substantial efficiency gains without sacrificing quality.
Abstract: Generative models, particularly Diffusion Models (DM), have shown strong potential for Computed Tomography (CT) reconstruction serving as expressive priors for solving ill-posed inverse problems. However, diffusion-based reconstruction relies on Stochastic Differential Equations (SDEs) for forward diffusion and reverse denoising, where such stochasticity can interfere with repeated data consistency corrections in CT reconstruction. Since CT reconstruction is often time-critical in clinical and interventional scenarios, improving reconstruction efficiency is essential. In contrast, Flow Matching (FM) models sampling as a deterministic Ordinary Differential Equation (ODE), yielding smooth trajectories without stochastic noise injection. This deterministic formulation is naturally compatible with repeated data consistency operations. Furthermore, we observe that FM-predicted velocity fields exhibit strong correlations across adjacent steps. Motivated by this, we propose an FM-based CT reconstruction framework (FMCT) and an efficient variant (EFMCT) that reuses previously predicted velocity fields over consecutive steps to substantially reduce the number of Neural network Function Evaluations (NFEs), thereby improving inference efficiency. We provide theoretical analysis showing that the error introduced by velocity reuse is bounded when combined with data consistency operations. Extensive experiments demonstrate that FMCT/EFMCT achieve competitive reconstruction quality while significantly improving computational efficiency compared with diffusion-based methods. The codebase is open-sourced at https://github.com/EFMCT/EFMCT.
[1579] GLIDE-Reg: Global-to-Local Deformable Registration Using Co-Optimized Foundation and Handcrafted Features
Yunzheng Zhu, Aichi Chien, Kimaya kulkarni, Luoting Zhuang, Stephen Park, Ricky Savjani, Daniel Low, William Hsu
Main category: eess.IV
TL;DR: GLIDE-Reg is a deformable registration method for medical imaging that combines global semantic embeddings with local descriptors to improve robustness across different spatial resolutions and anatomical coverages.
Details
Motivation: Current deformable registration methods in medical imaging lack robustness and generalizability across spatial resolution differences and variations in anatomical coverage, limiting their effectiveness in applications like lesion tracking and treatment evaluation.Method: Joint optimization of registration field and learnable dimensionality reduction module; compression of VFM embeddings to maintain registration relevance; fusion of global semantic cues with MIND local descriptors.
Result: Achieved average DSC of 0.859, 0.862, and 0.901 across three cohorts; outperformed state-of-the-art DEEDS with relative improvements of 3.0%, 0.5%, and 0.1%; achieved 1.58 mm TRE on Lung250M and 1.11 mm on NLST nodule centers.
Conclusion: GLIDE-Reg demonstrates improved performance and robustness for medical image registration, particularly for challenging downstream tasks like nodule tracking in early-stage lung cancer diagnosis.
Abstract: Deformable registration is crucial in medical imaging. Several existing applications include lesion tracking, probabilistic atlas generation, and treatment response evaluation. However, current methods often lack robustness and generalizability across two key factors: spatial resolution and differences in anatomical coverage. We jointly optimize a registration field and a learnable dimensionality reduction module so that compressed VFM embeddings remain registration-relevant, and fuse these global semantic cues with MIND local descriptors. GLIDE-Reg achieves average dice similarity coefficients (DSC) across 6 anatomical structures of 0.859, 0.862, and 0.901 in two public cohorts (Lung250M and NLST) and one institution cohort (UCLA5DCT), and outperforms the state-of-the-art DEEDS (0.834, 0.858, 0.900) with relative improvements of 3.0%, 0.5%, and 0.1%. For target registration errors, GLIDE-Reg achieves 1.58 mm on Lung250M landmarks (compared to 1.25 mm on corrField and 1.91 mm on DEEDS) and 1.11 mm on NLST nodule centers (compared to 1.11 mm on DEEDS). The substantiated performance on the nodule centers also demonstrates its robustness across challenging downstream tasks, such as nodule tracking, which is an essential prior step for early-stage lung cancer diagnosis.
[1580] Efficient Conformal Volumetry for Template-Based Segmentation
Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan
Main category: eess.IV
TL;DR: ConVOLT: A conformal prediction framework for efficient volumetric uncertainty quantification in template-based medical image segmentation by conditioning calibration on deformation field properties.
Details
Motivation: Existing uncertainty quantification methods for template-based segmentation either rely on learned model features (often unavailable in classic pipelines) or treat registration as a black box, resulting in overly conservative intervals when applied directly in output space.Method: ConVOLT introduces a conformal prediction framework that achieves efficient volumetric UQ by conditioning calibration on properties of the estimated deformation field from template-based segmentation. It calibrates a learned volumetric scaling factor from deformation space features.
Result: ConVOLT achieves target coverage while producing substantially tighter intervals than output-space conformal baselines across multiple datasets and registration methods for global, regional, and label volumetry tasks.
Conclusion: The work paves the way to exploit the registration process for efficient uncertainty quantification in medical imaging pipelines.
Abstract: Template-based segmentation, a widely used paradigm in medical imaging, propagates anatomical labels via deformable registration from a labeled atlas to a target image, and is often used to compute volumetric biomarkers for downstream decision-making. While conformal prediction (CP) provides finite-sample valid intervals for scalar metrics, existing segmentation-based uncertainty quantification (UQ) approaches either rely on learned model features, often unavailable in classic template-based pipelines, or treat the registration process as a black box, resulting in overly conservative intervals when applied directly in output space. We introduce ConVOLT, a CP framework that achieves efficient volumetric UQ by conditioning calibration on properties of the estimated deformation field from template-based segmentation. ConVOLT calibrates a learned volumetric scaling factor from deformation space features. We evaluate ConVOLT on template-based segmentation tasks involving global, regional, and label volumetry across multiple datasets and registration methods. ConVOLT achieves target coverage while producing substantially tighter intervals than output-space conformal baselines. Our work paves way to exploit the registration process for efficient UQ in medical imaging pipelines.
[1581] Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
Zhangxing Bian, Shuwen Wei, Samuel W. Remedios, Junyu Chen, Aaron Carass, Blake E. Dewey, Jerry L. Prince
Main category: eess.IV
TL;DR: A unified inverse framework for tagged MRI that simultaneously recovers anatomical images, synthesizes high-resolution cine images, and estimates 3D diffeomorphic motion by combining MR physics with generative priors.
Details
Motivation: Tagged MRI enables non-invasive motion tracking but faces challenges: tags and imaging blur hinder anatomical segmentation, tag fading disrupts motion tracking, and existing methods handle these problems in isolation and sub-optimally.Method: Blind nonlinear inverse framework that combines MR physics with generative priors to simultaneously estimate unknown forward imaging models, recover high-resolution underlying anatomy, and track 3D diffeomorphic Lagrangian motion over time.
Result: Experiments on tagged brain MRI show the approach yields high-resolution anatomy images, cine images, and more accurate motion than specialized methods.
Conclusion: The proposed unified framework successfully addresses multiple challenges in tagged MRI processing simultaneously, outperforming specialized methods that handle these tasks in isolation.
Abstract: Tagged MRI enables tracking internal tissue motion non-invasively. It encodes motion by modulating anatomy with periodic tags, which deform along with tissue. However, the entanglement between anatomy, tags and motion poses significant challenges for post-processing. The existence of tags and imaging blur hinders downstream tasks such as segmenting anatomy. Tag fading, due to T1-relaxation, disrupts the brightness constancy assumption for motion tracking. For decades, these challenges have been handled in isolation and sub-optimally. In contrast, we introduce a blind and nonlinear inverse framework for tagged MRI that, for the first time, unifies these tasks: anatomical image recovery, high-resolution cine image synthesis, and motion estimation. At its core, the synergy of MR physics and generative priors enables us to blindly estimate the unknown forward imaging models and high-resolution underlying anatomy, while simultaneously tracking 3D diffeomorphic Lagrangian motion over time. Experiments on tagged brain MRI demonstrate that our approach yields high-resolution anatomy images, cine images, and more accurate motion than specialized methods.
[1582] Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
Si-Sheng Young, Chia-Hsiang Lin
Main category: eess.IV
TL;DR: A novel deep unfolding framework with adversarial learning for spectral super-resolution of Sentinel-2 satellite data to achieve global hyperspectral coverage.
Details
Motivation: Sentinel-2 provides global multispectral coverage but has limited spectral resolution (12 bands) and non-unified spatial resolution, while high-resolution hyperspectral sensors like NASA's AVIRIS-NG only cover American regions. The paper aims to achieve global hyperspectral coverage by reconstructing Sentinel-2 data to NASA hyperspectral images.Method: Proposes UALNet, a deep unfolding framework regularized by a data-driven spectrum prior from PriorNet, integrated with adversarial learning (UAL) where the discriminator guides reconstruction in both training and testing phases. Achieves spectral super-resolution from 12-to-186 bands and unifies spatial resolution to 5m.
Result: UALNet outperforms the next-best Transformer in PSNR, SSIM, and SAM metrics while requiring only 15% MACs and 20 times fewer parameters.
Conclusion: The proposed UALNet framework effectively achieves spectral super-resolution for global hyperspectral coverage from Sentinel-2 data, offering superior performance with significantly reduced computational requirements.
Abstract: The European Space Agency’s Sentinel-2 satellite provides global multispectral coverage for remote sensing (RS) applications. However, limited spectral resolution (12 bands) and non-unified spatial resolution (60/20/10 m) restrict their practicality. In contrast, the high spectral-spatial resolution sensor (e.g., NASA’s AVIRIS-NG) covers only the American region due to practical considerations. This raises a fundamental question: ``Can a global hyperspectral coverage be achieved by reconstructing Sentinel-2 data to NASA hyperspectral images?’’ This study aims to achieve spectral super-resolution from 12-to-186 and unify the spatial resolution of Sentinel-2 data to 5 m. To enable a reliable and efficient reconstruction, we formulate a novel deep unfolding framework regularized by a data-driven spectrum prior from PriorNet, instead of relying on implicit deep priors as conventional deep unfolding does. Moreover, an adversarial term is integrated into the unfolded architecture, enabling the discriminator to guide the reconstruction in both the training and testing phases; we term this novel concept unfolding adversarial learning (UAL). Experiments show that our UALNet outperforms the next-best Transformer in PSNR, SSIM, and SAM, while requiring only 15% MACs and 20 times fewer parameters. The associated code will be publicly available at https://sites.google.com/view/chiahsianglin/software.
[1583] Revisiting Global Token Mixing in Task-Dependent MRI Restoration: Insights from Minimal Gated CNN Baselines
Xiangjian Hou, Chao Qin, Chang Ni, Xin Wang, Chun Yuan, Xiaodong Ma
Main category: eess.IV
TL;DR: Global token mixing in MRI restoration is task-dependent: beneficial for denoising with heteroscedastic noise but less so for reconstruction and super-resolution where physics constraints already provide global coupling.
Details
Motivation: To investigate whether global token mixing (via self-attention or state-space models) is actually beneficial for different MRI restoration tasks, given that these tasks vary in how degradations affect image/k-space domains and in the degree of global coupling already imposed by physics-driven data consistency.Method: Established controlled testbed comparing minimal local gated CNN and its large-field variant against state-of-the-art global models under aligned training/evaluation protocols. Tested across three representative MRI restoration tasks: accelerated MRI reconstruction with explicit data consistency, MRI super-resolution with k-space center cropping, and denoising of clinical carotid MRI with spatially heteroscedastic noise.
Result: 1) For accelerated MRI reconstruction: minimal unrolled gated-CNN baseline is highly competitive with token-mixing approaches, suggesting limited benefits when forward model and data-consistency provide strong global constraints. 2) For super-resolution: local gated models remain competitive, with lightweight large-field variant yielding only modest improvements. 3) For denoising with heteroscedastic noise: token-mixing models achieve strongest performance, consistent with need to estimate spatially varying reliability.
Conclusion: Utility of global token mixing in MRI restoration is task-dependent and should be tailored to the underlying imaging physics and degradation structure. Global mixing is most beneficial when dealing with spatially varying degradations like heteroscedastic noise.
Abstract: Global token mixing, implemented via self-attention or state-space sequence models, has become a popular model design choice for MRI restoration. However, MRI restoration tasks differ substantially in how their degradations vary over image and k-space domains, and in the degree to which global coupling is already imposed by physics-driven data consistency terms. In this work, we ask the question whether global token mixing is actually beneficial in each individual task across three representative settings: accelerated MRI reconstruction with explicit data consistency, MRI super-resolution with k-space center cropping, and denoising of clinical carotid MRI data with spatially heteroscedastic noise. To reduce confounding factors, we establish a controlled testbed comparing a minimal local gated CNN and its large-field variant, benchmarking them directly against state-of-the-art global models under aligned training and evaluation protocols. For accelerated MRI reconstruction, the minimal unrolled gated-CNN baseline is already highly competitive compared to recent token-mixing approaches in public reconstruction benchmarks, suggesting limited additional benefits when the forward model and data-consistency steps provide strong global constraints. For super-resolution, where low-frequency k-space data are largely preserved by the controlled low-pass degradation, local gated models remain competitive, and a lightweight large-field variant yields only modest improvements. In contrast, for denoising with pronounced spatially heteroscedastic noise, token-mixing models achieve the strongest overall performance, consistent with the need to estimate spatially varying reliability. In conclusion, our results demonstrate that the utility of global token mixing in MRI restoration is task-dependent, and it should be tailored to the underlying imaging physics and degradation structure.
[1584] MR-Compass: Inertial Navigation-Driven Motion Correction for Brain MRI
Musa Tunc Arslan, Fatih Calakli, Joshua Auger, Hongli Fan, Alan J Macy, Simon K Warfield
Main category: eess.IV
TL;DR: MR-Compass: A method using MRI system’s magnetic fields and inertial sensors for high-frequency motion tracking without integration drift, enabling improved MRI image quality through motion correction.
Details
Motivation: Traditional inertial sensors suffer from unbounded drift when integrating noisy signals, making them impractical for MRI motion correction at millimeter resolution during minute-long scans. There's a need for accurate, high-frequency motion tracking without integration drift.Method: MR-Compass exploits the MRI system’s static magnetic and gravitational fields to estimate 3-DOF orientation at 2 kHz directly without integration, eliminating random-walk drift. The remaining 3-DOF translation is recovered via phase correlation from the MRI data itself.
Result: Achieved mean accuracy of 0.6° and 0.4 pixels across experiments. Image quality improved in all volunteer scans for both retrospective and prospective correction cases. Effective in measuring head motion with high accuracy at unprecedented sample rates.
Conclusion: MR-Compass enables accurate motion tracking in MRI scanners without integration drift, improving image quality by aligning k-space data appropriately and reducing motion-related artifacts through both retrospective and prospective reconstruction.
Abstract: Inertial sensors can track object kinematics, however, unbounded drift from integrating noisy signals makes them impractical for MRI motion correction at millimeter resolution and minute-long scans. We introduce MR-Compass, which exploits the MRI system’s static magnetic and gravitational fields to estimate 3-DOF orientation at 2 kHz directly, without integration, eliminating random-walk. The remaining 3-DOF translation is recovered via phase correlation from the MRI data. We experimentally validate the efficacy of the method retrospectively using a 3D radial koosh-ball sequence and prospectively using 2D EPI fMRI during large volunteer motions. MR-Compass followed by phase-correlation achieved a mean accuracy of 0.6$^o$ and 0.4 pixels across all experiments. Image quality improved when motion correction was applied in all volunteer scans for both retrospective and prospective correction cases. MR-Compass was effective in measuring head motion in the MRI scanner with high accuracy at unprecedented sample rates, and enabled both retrospective and prospective reconstruction to improve image quality by aligning the k-space data appropriately and by reducing the motion related artifacts.
[1585] Near-Field Focusing Operators for Planar Multi-Static Microwave Imaging Using Back-Projection in the Spatial Domain
Matthias M. Saurer, Marius Brinkmann, Han Na, Quanfeng Wang, Thomas Eibert
Main category: eess.IV
TL;DR: Improved microwave imaging using analytic back-projection with magnitude correction for quasi-planar multi-static scattering scenarios.
Details
Motivation: Traditional back-projection methods only compensate for phase delay, which is insufficient for accurate microwave imaging when the scattering object is close to the aperture plane. There's a need for improved imaging procedures that account for magnitude corrections to reduce artifacts.Method: Derived an improved formalism using plane-wave expansion of observation data in quasi-planar multi-static scattering scenarios. Analytically derived integral expressions for different focusing operators that include magnitude correction factors, which act as low-pass filters on spatial images.
Result: The derived imaging procedure is superior to traditional phase-only back-projection. The modified multi-static back-projection algorithms effectively suppress imaging artifacts. Validation was performed using both simulated and measured data.
Conclusion: The proposed method provides improved microwave image reconstruction, especially when the distance from scattering object to aperture plane is small, by incorporating magnitude correction factors that reduce imaging artifacts.
Abstract: Based on a plane-wave expansion of the observation data in quasi-planar multi-static scattering scenarios, an improved formalism for image creation utilizing back-projection in the spatial domain is derived. The underlying integral expressions for different focusing operators are derived analytically leading to magnitude correction factors, which are mostly relevant for reconstructing microwave images when the distance from the scattering object to the aperture plane is small. It is shown that the derived imaging procedure is superior to the traditional back-projection only compensating the phase delay of the measurement signals and validate our findings based on simulated as well as measured data. Since the derived focusing operators correspond to a low-pass filtering of the spatial images, the resulting modified multi-static back-projection algorithms effectively suppress imaging artifacts as well.
[1586] Guaranteed Image Classification via Goal-oriented Joint Semantic Source and Channel Coding
Wenchao Wu, Min Qiu, Yansha Deng, Jinhong Yuan
Main category: eess.IV
TL;DR: Goal-oriented joint semantic source and channel coding framework that prioritizes compression and error protection based on semantic importance of image regions for classification tasks under bandwidth constraints.
Details
Motivation: Existing joint source-channel coding methods treat all image regions equally for classification, ignoring varying semantic importance. Need for efficient transmission under bandwidth constraints while maintaining classification accuracy.Method: G-JSSCC framework uses explainable AI (Shapley values) to rank image regions by semantic importance, then applies varying levels of source compression and channel protection accordingly. Defines new “coding efficiency” metric.
Result: Improves classification probability by 2.70×, reduces transmission cost by 38%, and enhances coding efficiency by 5.91× compared to uniform compression/protection benchmark.
Conclusion: Semantic-aware joint coding significantly improves classification performance under bandwidth constraints by prioritizing important image regions.
Abstract: To enable critical applications such as remote diagnostics, image classification must be guaranteed under bandwidth constraints and unreliable wireless channels through joint source and channel coding (JSCC) design. However, most existing JSCC methods focus on minimizing image distortion, implicitly assuming that all image regions contribute equally to classification performance, thereby overlooking their varying importance for the task. In this paper, we propose a goal-oriented joint semantic source and channel coding (G-JSSCC) framework that applies \emph{various} levels of source coding compression and channel coding protection across image regions based on their semantic importance. Specifically, we design a semantic information extraction method that identifies and ranks various image regions based on their contributions to classification, where the contribution is measured by the shapely value from explainable artificial intelligence (AI). Based on that, we design a semantic source coding and a semantic channel coding method, which allocates higher-quality compression and stronger error protection to image regions of great semantic importance. In addition, we define a new metric, termed coding efficiency, to evaluate the effectiveness of the source and channel coding in the classification task. Simulations show that our proposed G-JSSCC framework improves classification probability by 2.70 times, reduces transmission cost by 38%, and enhances coding efficiency by 5.91 times, compared to the benchmark scheme using uniform compression and an idealized channel code to uniformly protect the whole image.
[1587] NIC-RobustBench: A Comprehensive Open-Source Toolkit for Neural Image Compression and Robustness Analysis
Georgii Bychkov, Khaled Abud, Egor Kovalev, Alexander Gushchin, Sergey Lavrushkin, Dmitriy Vatolin, Anastasia Antsiferova
Main category: eess.IV
TL;DR: NIC-RobustBench: A comprehensive benchmark for evaluating adversarial robustness of neural image compression methods, including attacks, defenses, and downstream task impact assessment.
Details
Motivation: Current neural image compression (NIC) benchmarks focus only on rate-distortion performance in safe scenarios, ignoring vulnerabilities to adversarial attacks that can cause severe reconstruction artifacts or break downstream models. There's a lack of systematic evaluation of NIC robustness across different codecs and attacks.Method: Developed NIC-RobustBench as an open-source benchmark and evaluation framework that integrates 8 adversarial attacks, 9 defense strategies, standard RD metrics, extensible set of codecs, and tools for assessing compression model robustness and downstream task impact.
Result: Provided broad empirical study of modern NICs and defenses in adversarial scenarios, revealing failure modes, identifying least and most resilient architectures, and offering insights into NIC robustness characteristics.
Conclusion: NIC-RobustBench fills a critical gap in evaluating neural image compression robustness, enabling systematic assessment of adversarial vulnerabilities and providing valuable insights for developing more secure compression systems.
Abstract: Neural image compression (NIC) is increasingly used in computer vision pipelines, as learning-based models are able to surpass traditional algorithms in compression efficiency. However, learned codecs can be unstable and vulnerable to adversarial attacks: small perturbations may cause severe reconstruction artifacts or indirectly break downstream models. Despite these risks, most NIC benchmarks only emphasize rate-distortion (RD) performance, focusing on model efficiency in safe, non-adversarial scenarios, while NIC robustness studies cover only specific codecs and attacks. To fill this gap, we introduce \textbf{NIC-RobustBench}, an open-source benchmark and evaluation framework for adversarial robustness of NIC methods. The benchmark integrates 8 attacks, 9 defense strategies, standard RD metrics, a large and extensible set of codecs, and tools for assessing both the robustness of the compression model and impact on downstream tasks. Using NIC-RobustBench, we provide a broad empirical study of modern NICs and defenses in adversarial scenarios, highlighting failure modes, least and most resilient architectures, and other insights into NIC robustness. Our code is available online at https://github.com/msu-video-group/NIC-RobustBench.
[1588] Towards Precision Cardiovascular Analysis in Zebrafish: The ZACAF Paradigm
Amir Mohammad Naderi, Jennifer G. Casey, Mao-Hsiang Huang, Rachelle Victorio, David Y. Chiang, Calum MacRae, Hung Cao, Vandana A. Gupta
Main category: eess.IV
TL;DR: A framework called ZACAF uses transfer learning and test-time augmentation to improve cardiovascular assessment in zebrafish, addressing overfitting issues in supervised methods when applied to new imaging setups and mutant types.
Details
Motivation: Manual monitoring of zebrafish cardiovascular parameters is time-consuming and error-prone. Existing supervised deep learning methods overfit to specific training data and perform poorly when applied to new imaging setups or mutant types.Method: Enhanced the Zebrafish Automatic Cardiovascular Assessment Framework (ZACAF) with data augmentation, Transfer Learning (TL), and Test Time Augmentation (TTA) to improve performance and generalizability across different microscope setups and mutant types.
Result: The approach successfully refined models for novel microscope setups with diverse mutant types and video protocols, demonstrating improved generalizability and accuracy, particularly when applied to nrap mutant zebrafish as a cardiomyopathy model.
Conclusion: Transfer learning with data augmentation enables robust cardiovascular assessment across diverse zebrafish imaging conditions, providing a generalizable framework that can be integrated with existing methods to aid researchers.
Abstract: Quantifying cardiovascular parameters like ejection fraction in zebrafish as a host of biological investigations has been extensively studied. Since current manual monitoring techniques are time-consuming and fallible, several image processing frameworks have been proposed to automate the process. Most of these works rely on supervised deep-learning architectures. However, supervised methods tend to be overfitted on their training dataset. This means that applying the same framework to new data with different imaging setups and mutant types can severely decrease performance. We have developed a Zebrafish Automatic Cardiovascular Assessment Framework (ZACAF) to quantify the cardiac function in zebrafish. In this work, we further applied data augmentation, Transfer Learning (TL), and Test Time Augmentation (TTA) to ZACAF to improve the performance for the quantification of cardiovascular function quantification in zebrafish. This strategy can be integrated with the available frameworks to aid other researchers. We demonstrate that using TL, even with a constrained dataset, the model can be refined to accommodate a novel microscope setup, encompassing diverse mutant types and accommodating various video recording protocols. Additionally, as users engage in successive rounds of TL, the model is anticipated to undergo substantial enhancements in both generalizability and accuracy. Finally, we applied this approach to assess the cardiovascular function in nrap mutant zebrafish, a model of cardiomyopathy.
[1589] DAWN-FM: Data-Aware and Noise-Informed Flow Matching for Solving Inverse Problems
Shadab Ahamed, Eldad Haber
Main category: eess.IV
TL;DR: DAWN-FM is a generative framework for solving inverse problems using Flow Matching with data and noise embeddings, specifically trained for each problem to handle noisy/incomplete data.
Details
Motivation: Inverse problems in fields like medical imaging and signal processing are often ill-posed and require regularization. Existing methods like pretrained diffusion models struggle with highly ill-posed settings and varying noise levels.Method: Uses Flow Matching (FM) framework with data-aware and noise-informed embeddings. Learns time-dependent velocity field to map Gaussian reference distribution to target distribution, incorporating explicit data representations and noise accounting.
Result: Validated through extensive numerical experiments on image deblurring and tomography tasks. Shows effectiveness and robustness in handling noisy/incomplete data scenarios.
Conclusion: DAWN-FM provides accurate solutions with uncertainty quantification capabilities, outperforming pretrained diffusion models in ill-posed inverse problems by being specifically trained for each problem and adapting to noise levels.
Abstract: Inverse problems, which involve estimating parameters from incomplete or noisy observations, arise in various fields such as medical imaging, geophysics, and signal processing. These problems are often ill-posed, requiring regularization techniques to stabilize the solution. In this work, we employ Flow Matching (FM), a generative framework that integrates a deterministic processes to map a simple reference distribution, such as a Gaussian, to the target distribution. Our method DAWN-FM: Data-AWare and Noise-Informed Flow Matching incorporates data and noise embedding, allowing the model to access representations about the measured data explicitly and also account for noise in the observations, making it particularly robust in scenarios where data is noisy or incomplete. By learning a time-dependent velocity field, FM not only provides accurate solutions but also enables uncertainty quantification by generating multiple plausible outcomes. Unlike pretrained diffusion models, which may struggle in highly ill-posed settings, our approach is trained specifically for each inverse problem and adapts to varying noise levels. We validate the effectiveness and robustness of our method through extensive numerical experiments on tasks such as image deblurring and tomography. The code is available at: https://github.com/ahxmeds/DAWN-FM.git.
[1590] CryoLithe: Rapid Cryo-ET Reconstruction via Transform-Localized Deep Learning
Vinith Kishore, Valentin Debarnot, AmirEhsan Khorashadizadeh, Ricardo D. Righetto, Benjamin D. Engel, Ivan Dokmanić
Main category: eess.IV
TL;DR: CryoLithe is an end-to-end deep learning network for cryo-electron tomography reconstruction that directly estimates 3D volumes from aligned tilt series, achieving state-of-the-art denoising and missing wedge correction while being 100x faster than previous methods.
Details
Motivation: Current self-supervised deep learning approaches for cryo-ET reconstruction are slow (taking dozens of hours) and memory-intensive, despite improving reconstruction quality over traditional iterative algorithms. There's a need for faster, more efficient methods that maintain high reconstruction quality.Method: CryoLithe implements a local, memory-efficient reconstruction network that directly estimates volumes from aligned tilt series. It leverages transform-domain locality to make the network robust to distribution shifts, enabling effective supervised training without requiring retraining or fine-tuning on real data.
Result: CryoLithe achieves denoising and missing wedge correction comparable or better than state-of-the-art methods (Icecream, Cryo-CARE, IsoNet, DeepDeWedge) while being two orders of magnitude faster. It shows robustness to distribution shifts and works effectively on real data without retraining.
Conclusion: CryoLithe provides a fast, efficient solution for cryo-ET reconstruction that facilitates downstream analysis like segmentation and subtomogram averaging, making high-quality reconstruction more accessible to researchers.
Abstract: Cryo-electron tomography (cryo-ET) enables 3D visualization of cellular structures. Accurate reconstruction of high-resolution volumes is complicated by the very low signal-to-noise ratio and a restricted range of sample tilts. Recent self-supervised deep learning approaches, which post-process initial reconstructions by filtered backprojection (FBP), have significantly improved reconstruction quality with respect to signal processing iterative algorithms, but they are slow, taking dozens of hours for an expert to reconstruct a tomogram and demand large memory. We present CryoLithe, an end-to-end network that directly estimates the volume from an aligned tilt series. CryoLithe achieves denoising and missing wedge correction comparable or better than state-of-the-art self-supervised deep learning approaches such as Icecream, Cryo-CARE, IsoNet or DeepDeWedge, while being two orders of magnitude faster. To achieve this, we implement a local, memory-efficient reconstruction network. We demonstrate that leveraging transform-domain locality makes our network robust to distribution shifts, enabling effective supervised training and giving excellent results on real data$\unicode{x2013}$without retraining or fine-tuning. CryoLithe reconstructions facilitate downstream cryo-ET analysis, including segmentation and subtomogram averaging and is openly available: https://github.com/swing-research/CryoLithe.
[1591] Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction
Andrew Wang, Steven McDonagh, Mike Davies
Main category: eess.IV
TL;DR: SSIBench: A comprehensive benchmark framework for self-supervised MRI reconstruction methods without ground truth, evaluating 18 methods across 7 real MRI scenarios to enable systematic comparison and accelerate research.
Details
Motivation: Current self-supervised MRI reconstruction methods lack systematic comparison and standardized experimental setups, hindering methodological research and industry adoption despite approaching supervised performance levels.Method: Developed SSIBench, a modular comparison framework to unify and benchmark self-supervised imaging methods without ground truth. Evaluated 18 recent methods across seven realistic MRI scenarios on real data, and proposed a novel Multi-Operator Equivariant Imaging loss.
Result: Evaluation revealed a wide performance landscape where method rankings differ across scenarios and metrics, showing no single method dominates all scenarios. Performance variations expose the need for further SSI research.
Conclusion: SSIBench provides an extensible benchmark with open-source implementations to accelerate reproducible research, enable fair comparison, and unlock SSI applications to other domains like 4D MRI and emerging imaging modalities.
Abstract: Reconstructing MRI from highly undersampled measurements is crucial for accelerating medical imaging, but is challenging due to the ill-posedness of the inverse problem. While supervised deep learning (DL) approaches have shown remarkable success, they traditionally rely on fully-sampled ground truth (GT) images, which are expensive or impossible to obtain in real scenarios. This problem has created a recent surge in interest in self-supervised learning methods that do not require GT. Although recent methods are now fast approaching “oracle” supervised performance, the lack of systematic comparison and standard experimental setups are hindering targeted methodological research and precluding widespread trustworthy industry adoption. We present SSIBench, a modular and flexible comparison framework to unify and thoroughly benchmark Self-Supervised Imaging methods (SSI) without GT. We evaluate 18 recent methods across seven realistic MRI scenarios on real data, showing a wide performance landscape whose method ranking differs across scenarios and metrics, exposing the need for further SSI research. Our insights also show how complementary methods could be compounded for future improvements, exemplified by a novel loss we propose, Multi-Operator Equivariant Imaging. To accelerate reproducible research and lower the barrier to entry, we provide the extensible benchmark and open-source reimplementations of all methods at https://github.com/Andrewwango/ssibench, allowing researchers to rapidly and fairly contribute and evaluate new methods on the standardised setup for potential leaderboard ranking, or benchmark existing methods on custom datasets, forward operators, or models, unlocking the application of SSI to other valuable GT free domains such as 4D MRI and other nascent scientific imaging modalities.
[1592] Classification of Histopathology Slides with Persistent Homology Convolutions
Shrunal Pothagoni, Benjamin Schweinhart
Main category: eess.IV
TL;DR: A novel method called Persistent Homology Convolutions that incorporates local topological information into CNNs for histopathology image analysis, outperforming conventional models.
Details
Motivation: Standard CNNs lose topological information which is crucial in domains like histopathology where tissue shape characteristics are important for disease diagnosis. Existing methods use global topological summaries that lack locality information.Method: Developed Persistent Homology Convolutions - a modified convolution operator that generates local persistent homology-based data to capture locality and translation invariance of topological features.
Result: Models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters across various histopathology slide representations.
Conclusion: Persistent homology convolutions effectively extract meaningful geometric information from histopathology slides by incorporating local topological features into neural networks.
Abstract: Convolutional neural networks (CNNs) are a standard tool for computer vision tasks such as image classification. However, typical model architectures may result in the loss of topological information. In specific domains such as histopathology, topology is an important descriptor that can be used to distinguish between disease-indicating tissue by analyzing the shape characteristics of cells. Current literature suggests that reintroducing topological information using persistent homology can improve medical diagnostics; however, previous methods utilize global topological summaries which do not contain information about the locality of topological features. To address this gap, we present a novel method that generates local persistent homology-based data using a modified version of the convolution operator called Persistent Homology Convolutions. This method captures information about the locality and translation invariance of topological features. We perform a comparative study using various representations of histopathology slides and find that models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters. These results indicate that persistent homology convolutions extract meaningful geometric information from the histopathology slides.
[1593] Fifty Years of Object Detection and Recognition from Synthetic Aperture Radar Remote Sensing Imagery: The Road Forward
Jie Zhou, Yongxiang Liu, Li Liu, Weijie Li, Bowen Peng, Yafei Song, Gangyao Kuang, Xiang Li
Main category: eess.IV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.22159: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22159&sortBy=relevance&sortOrder=descending&start=0&max_results=100)